pisa index for Pyterrier TextScorer #30
Comments
This would indeed be useful and is something that we've wanted for a while (cc @Parry-Parry). Unfortunately, PISA's data structures make this a bit challenging to implement. A variation on this would be a scorer that takes the document IDs and scores them based on what's in the index. Similar challenges there too, though.
PISA has a forward index, right (I'm looking at http://data.terrier.org/indices/msmarco_passage/pisa_unstemmed/latest/)? So it's likely to be possible somehow? @JMMackenzie @amallia, do you have any code snippets? Forward lookups would also allow the use of PRF techniques (see #29).
PISA does have a forward index, but I believe it is mostly used during parsing/indexing (cc @elshize). In the past, I have hacked forward index representations into PISA for things like PRF using this tool (https://github.com/JMMackenzie/fwd). It is simple and not well tested, but it basically reads PISA's canonical file format and builds a container of (compressed) document vectors; those vectors can then be iterated over, and so on. It is possible to get something "proper" into PISA, but I don't think the current findx stuff in there is fit for what you want. I'll let Michal provide further info since he mostly worked on the forward index.
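To illustrate the idea of building document vectors from an inverted index, here is a minimal sketch. It does not read PISA's canonical files and does no compression; the `postings` layout (term ID mapped to parallel lists of document IDs and frequencies) and the function name `invert_postings` are assumptions made for this example only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical in-memory layout: termid -> (sorted docids, matching freqs).
Postings = Dict[int, Tuple[List[int], List[int]]]


def invert_postings(postings: Postings) -> Dict[int, List[Tuple[int, int]]]:
    """Turn an inverted index into per-document vectors of (termid, freq)."""
    vectors: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    # Visit terms in increasing ID order so each document vector ends up
    # sorted by term ID.
    for termid in sorted(postings):
        docids, freqs = postings[termid]
        for docid, freq in zip(docids, freqs):
            vectors[docid].append((termid, freq))
    return dict(vectors)


example = {7: ([0, 1], [1, 4]), 5: ([0], [3])}
doc_vectors = invert_postings(example)
```

The resulting vectors can be iterated per document, which is the access pattern that PRF needs.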
Just for further context, what does the parent/OP post here do exactly? I'm not super familiar with all of the magic bells and whistles of PyTerrier :-)
The short of it is that the forward index we currently have in PISA was not necessarily designed for lookups, though they should be possible with the sizes file. It also needs a lexicon to look up a term if you don't already know its ID. So it could be awkward to use, but doable. It's also not compressed, so that's a trade-off. That said, it can be memory mapped, so the entire index doesn't need to be in memory. I'm currently traveling so I won't be able to have a look until after Nov 18, but in the meantime it would be helpful if someone could explain what is needed from PISA; then I could help us get there. Showing some APIs or function signatures that are called on the index could be helpful.
Scores/reranks the text of documents using BM25, where the IDF values come from an existing index.
To be fair, I could probably mock this up to just about work if I could access the Pisa lexicon alone, i.e. an API in Python that returns the document frequency and global term frequency of a term in the lexicon, etc.:
index.getDocumentFrequency(term : str) -> int
index.getTermFrequency(term : str) -> int
index.getTotalNumberOfTokens() -> int
(For reference, num docs/terms is exposed in pyterrier_pisa at https://github.com/terrierteam/pyterrier_pisa/blob/main/src/pyterrier_pisa/__init__.py#L175-L181 via https://github.com/terrierteam/pyterrier_pisa/blob/main/src/pyterrier_pisa/_pisathon.cpp#L631-L652)
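A rough sketch of how such lexicon-only statistics could drive BM25 scoring of raw text follows. `ToyLexicon` is a made-up in-memory stand-in, not pyterrier_pisa code, and `getNumberOfDocuments` is a hypothetical addition to the signatures above. The IDF form and the defaults `k1=1.2`, `b=0.75` are one common BM25 variant and may differ from what PISA uses.

```python
import math
from collections import Counter
from typing import Dict, List


class ToyLexicon:
    """Hypothetical stand-in exposing the statistics proposed above."""

    def __init__(self, df: Dict[str, int], num_docs: int, num_tokens: int):
        self._df, self._num_docs, self._num_tokens = df, num_docs, num_tokens

    def getDocumentFrequency(self, term: str) -> int:
        return self._df.get(term, 0)

    def getNumberOfDocuments(self) -> int:  # assumed, not in the list above
        return self._num_docs

    def getTotalNumberOfTokens(self) -> int:
        return self._num_tokens


def bm25_text_score(index, query_terms: List[str], doc_terms: List[str],
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Score already-tokenised text; only the IDF side comes from the index."""
    n_docs = index.getNumberOfDocuments()
    avg_len = index.getTotalNumberOfTokens() / n_docs
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue  # term absent from the text contributes nothing
        df = index.getDocumentFrequency(term)
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        norm = f + k1 * (1.0 - b + b * doc_len / avg_len)
        score += idf * f * (k1 + 1.0) / norm
    return score


lexicon = ToyLexicon({"pisa": 2, "index": 5}, num_docs=10, num_tokens=100)
rare = bm25_text_score(lexicon, ["pisa"], ["pisa", "index"])
common = bm25_text_score(lexicon, ["index"], ["pisa", "index"])
```

The text would need to be tokenised and stemmed the same way as the index for the lexicon lookups to match.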
This is the variant that needs access to the forward index.
Acknowledged. For reference, for inspecting the contents of an arbitrary document, the Terrier API from Python looks like this:
The simplest variant in Python would be something like:
index.getDocumentContents(docid : int) -> Iterator[(int, int)] # termid, freq
index.getTermDocumentFrequency(term : int) -> int
index.getTermFrequency(term : int) -> int
index.getDocumentLength(docid : int)
(That would allow both a reranker to be implemented in Python, and PRF techniques.)
Sean's suggestion of "a scorer that takes the document IDs and scores them based on what's in the index" could also be implemented more directly by Pisa, if a higher-level scorer API were exposed that takes the document IDs and scores them:
index.score(query, docids : List[int], wmodel : str) -> List[float]
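As an illustration of the PRF use case, the sketch below picks expansion terms from a set of feedback documents using only the calls listed above. `ToyForwardIndex` is a hypothetical in-memory stand-in, and the weighting (length-normalised frequency times a smoothed IDF) is a simple heuristic chosen for the example, not RM3, Bo1, or anything PISA or Terrier ships.

```python
import math
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


class ToyForwardIndex:
    """Hypothetical stand-in for the proposed API (not PISA code)."""

    def __init__(self, docs: Dict[int, Dict[int, int]]):
        self._docs = docs
        self._df: Dict[int, int] = defaultdict(int)
        for vec in docs.values():
            for termid in vec:
                self._df[termid] += 1

    def getDocumentContents(self, docid: int) -> Iterator[Tuple[int, int]]:
        return iter(self._docs[docid].items())  # (termid, freq) pairs

    def getTermDocumentFrequency(self, term: int) -> int:
        return self._df.get(term, 0)

    def getDocumentLength(self, docid: int) -> int:
        return sum(self._docs[docid].values())


def expansion_terms(index, feedback_docids: List[int], num_docs: int,
                    k: int = 10) -> List[int]:
    """Return the k term IDs with the highest accumulated weight."""
    weights: Dict[int, float] = defaultdict(float)
    for docid in feedback_docids:
        doc_len = index.getDocumentLength(docid)
        if doc_len == 0:
            continue
        for termid, freq in index.getDocumentContents(docid):
            df = index.getTermDocumentFrequency(termid)
            idf = math.log((num_docs + 1) / (df + 0.5))
            weights[termid] += (freq / doc_len) * idf
    # Sort by descending weight, breaking ties by term ID.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [termid for termid, _ in ranked[:k]]


toy = ToyForwardIndex({0: {1: 2, 2: 1}, 1: {1: 1, 3: 3}, 2: {1: 1}})
top_terms = expansion_terms(toy, feedback_docids=[0, 1], num_docs=3, k=2)
```

Term 1 occurs in every toy document, so its low IDF keeps it out of the top two despite appearing in both feedback documents.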
As an addendum, I think we'd also need a function that maps a string ID to an integer one:
index.get_doc_id(docno: str) -> int
For your call there Sean, we can do that easily - You'd just need to load a "document mapping" that comes from PISA's I think the other calls can all be exposed directly from PISA's Perhaps the easiest way to see where things come from/what they are is to look at the scorers: https://github.com/pisa-engine/pisa/blob/master/include/pisa/scorer/bm25.hpp For example (in the BM25 scorer listed):
Hopefully I haven't missed the mark on anything here!
As @JMMackenzie mentioned, the following should be easy to extract from that wand_data object:
index.getTermDocumentFrequency(term : int) -> int
index.getTermFrequency(term : int) -> int
index.getDocumentLength(docid : int)
It will be slightly more complicated to get:
index.getDocumentContents(docid : int) -> Iterator[(int, int)] # termid, freq
This is because the forward index is a very simple structure holding only a list of term IDs. Thus, you will need to construct an iterator that maps each ID in the list to its frequency using inverted index lookups. So for each term ID in the content, you have to grab the posting list for that term and look up the term's frequency in that document.
Finally, @JMMackenzie does all that sound correct?
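The lookup described here might look roughly like the sketch below. The `forward` and `postings` containers are hypothetical in-memory stand-ins for the forward and inverted index, and posting lists are assumed to be sorted by document ID so that a binary search can find the entry.

```python
from bisect import bisect_left
from typing import Dict, Iterator, List, Tuple

# Hypothetical layouts, for illustration only:
#   forward:  docid  -> list of term IDs in that document
#   postings: termid -> (sorted docids, matching freqs)
Forward = Dict[int, List[int]]
Postings = Dict[int, Tuple[List[int], List[int]]]


def document_contents(docid: int, forward: Forward,
                      postings: Postings) -> Iterator[Tuple[int, int]]:
    """Yield (termid, freq) pairs for one document via posting-list lookups."""
    # Deduplicate in case the forward list repeats term IDs.
    for termid in sorted(set(forward[docid])):
        docids, freqs = postings[termid]
        pos = bisect_left(docids, docid)  # binary search in the posting list
        if pos < len(docids) and docids[pos] == docid:
            yield termid, freqs[pos]


forward = {0: [5, 7], 1: [7]}
postings = {5: ([0], [3]), 7: ([0, 1], [1, 4])}
doc0 = list(document_contents(0, forward, postings))
```

Each document costs one posting-list search per distinct term, which is why this is more expensive than reading a forward index that stores frequencies directly.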
Sorry about the late response, but I do agree with you @elshize.
Hey,
Is it possible to use a PISA index in pt.TextScorer? For example, in the following code:
thanks