Skip to content
This repository has been archived by the owner on Sep 21, 2021. It is now read-only.

Question about luke-luke-4.10.4-field-reconstruction #175

Open
chris-bamford opened this issue Nov 15, 2019 · 17 comments
Open

Question about luke-luke-4.10.4-field-reconstruction #175

chris-bamford opened this issue Nov 15, 2019 · 17 comments

Comments

@chris-bamford
Copy link

chris-bamford commented Nov 15, 2019

Hi, I am opening an issue as I can't find another way to ask Dmitry a question - sorry if this is the wrong approach.

Before I realised that you had made this patch I wrote my own code to do the same thing, which I am still experimenting with. Then I found your patch and compared the code, specifically the part which reconstructs the data when there is no position information.

I note that (IIUC) your new code takes the first term and adds it to the GrowableStringArray without checking if it is for the current document. Is this correct? In the alternate case where position info is available, there is a loop which discards any terms which are not for the current document:

int num = dpe.advance(docNum); if (num != docNum) { // either greater than or NO_MORE_DOCS continue; // no data for this term in this doc }

Shouldn't both cases do the same thing?

In my code (as there is no DocsAndPositionsEnum) I use the DocsEnum instead to cross reference against the current document:

`
DocsEnum docsEnum = termsEnum.docs(null, null);

if (docsEnum != null) {
    int termDoc;
    while ((termDoc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        if (termDoc != docNum) {
            continue;  // this term is not for this document
        }

        GrowableStringArray gsa = (GrowableStringArray)
                res.getReconstructedFields().get(fld);
        if (gsa == null) {
            gsa = new GrowableStringArray();
            res.getReconstructedFields().put(fld, gsa);
        }
        String term = termsEnum.term().utf8ToString();
        System.out.print("(Cross-referenced term " + term + " via doc " + termDoc + ") ");
        gsa.append(0, "|", term);
    }
}

`

Do you see what I mean? Or have I missed something?

Thanks

  • Chris
@DmitryKey
Copy link
Owner

Hi Chris!
Thanks for reporting and paying attention to this old little patch. I think you are right in that we do need to check whether the term belongs to the target document. So in the notation of the relevant Lucene version we'do:

      // first use bytesRef to extract the indexed term value
      String docTerm = bytesRef.utf8ToString();

      DocsAndPositionsEnum newDpe = te.docsAndPositions(live, dpe, 0);
      DocsEnum docs = te.docs(live, dpe);
      int docId;
      boolean docFound = false;
      while( (docId = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      if (docId == docNum)
          docFound = true;
      }
      if (docFound) {
          // only then store the value in a GrowableStringArray
      }

and I think your code should work. Can you test it for fields with and without positions? You can find a test case in #20 .

@chris-bamford
Copy link
Author

Hi Dmitry,

Great, I’ll take a look.
Thanks

Chris

@chris-bamford
Copy link
Author

chris-bamford commented Nov 18, 2019

Hi Dmitry,
I have a question - in your original code when you wrote the term with no position info to the 0th slot in the growablestringarray, you then do this
// we are done. Move to the next field
break;
Why? Surely we can continue to loop through all of the terms in the field and also add them to the 0th slot? Why abandon at this stage?

Thanks,

  • Chris

@chris-bamford
Copy link
Author

chris-bamford commented Nov 19, 2019

Hi Dmitry,

I attach a patch + test for your perusal. You'll have to remove the ".txt" suffix as it won't accept .patch files.

Unstored_Field_With_Missing_Posns_Reconstruction.patch.txt

Kind regards,

  • Chris

@chris-bamford
Copy link
Author

chris-bamford commented Nov 21, 2019 via email

@DmitryKey
Copy link
Owner

Hi Dmitry,
I have a question - in your original code when you wrote the term with no position info to the 0th slot in the growablestringarray, you then do this
// we are done. Move to the next field
break;
Why? Surely we can continue to loop through all of the terms in the field and also add them to the 0th slot? Why abandon at this stage?

Thanks,

  • Chris

I wish I had left more detail in the comments as to why I did that way. May be it was optimizing for some special case. At this point, re-reading the code I would agree with you: iterate and append all. Only thing is, what if the field's length is beyond capacity of the UI to show it? May be some trimming is necessary.

@DmitryKey
Copy link
Owner

Hi Dmitry,

I attach a patch + test for your perusal. You'll have to remove the ".txt" suffix as it won't accept .patch files.

Unstored_Field_With_Missing_Posns_Reconstruction.patch.txt

Kind regards,

  • Chris

thanks, Chris! Actually, can you submit it as a pull request?

@DmitryKey
Copy link
Owner

Hi Dmitry, I have an unrelated Luke question, hope you don’t mind. Since upgrading my Mac to Mojave, the fonts in Luke are terrible and I don’t know how to fix it. Settings -> Display font ... reveals a huge list of equally horrible fonts. I’m pretty sure it didn’t used to be this bad. Any ideas? I have googled a lot but have not found anything. I have been using the 4.10.4 and 5.5.5 versions of Luke. Thanks Chris

Is there a way for you to use the Java FX or, better, Swing based Luke?

chris-bamford pushed a commit to chris-bamford/luke that referenced this issue Nov 25, 2019
@chris-bamford
Copy link
Author

Hi Dmitry

Please bear with me, I have not done this before (pull request).
I have created my own repo, created a local branch and applied the patch to it, committed and pushed. Now I am unsure what to do:

https://github.com/chris-bamford/luke

Can you take a look and advise please?

Thanks

  • Chris

@DmitryKey
Copy link
Owner

DmitryKey commented Nov 25, 2019

I see you forked from this repo -- this is a good start. Next, go to the github ui, switch to your branch, and click New pull request. What was the base branch for your branch?

@chris-bamford
Copy link
Author

I clicked new pull request and it says
"There isn’t anything to compare.
DmitryKey:master and chris-bamford:issue/175-improve-reconstruction-unstored-fields are entirely different commit histories."
It shows my diffs (unified or split), but that's it, no way to confirm or proceed.

In answer to your question, is this what you mean:
Screenshot 2019-11-26 at 10 11 01

Sorry for my inexperience.

@DmitryKey
Copy link
Owner

hi Chris,

no problem. I compared to different branches and may be you created your branch off of luke-thinlet branch?

luke-thinlet...chris-bamford:issue/175-improve-reconstruction-unstored-fields

@chris-bamford
Copy link
Author

I honestly cannot remember. I think it would be better if you sent me step-by-step instructions so I can learn!
Thanks

@DmitryKey
Copy link
Owner

hey @chris-bamford , from what it looks, you branched from the luke-thinlet. So after you click New pull request, instead of master, choose luke-thinlet and view the differences. At that point you should only see your own code differences and no conflicts.

@chris-bamford
Copy link
Author

chris-bamford commented Dec 3, 2019 via email

@chris-bamford
Copy link
Author

chris-bamford commented Dec 11, 2019 via email

@DmitryKey
Copy link
Owner

Hi Chris,

Thanks for your PR. Been travelling, could not respond earlier.

I left a comment in #177

chris-bamford pushed a commit to chris-bamford/luke that referenced this issue Dec 16, 2019
DmitryKey added a commit that referenced this issue Dec 16, 2019
…uction-2

Issue #175: Offered improvement to reconstruction of unstored fields …
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants