pisa index for Pyterrier TextScorer #30

Mandeep-Rathee opened this issue Nov 7, 2024 · 10 comments

@Mandeep-Rathee commented Nov 7, 2024

Hey,

Is it possible to use a PISA index in pt.TextScorer? For example, in the following code:

import pandas as pd
import pyterrier as pt
from pyterrier_pisa import PisaIndex

df = pd.DataFrame(
    [
        ["q1", "chemical reactions", "d1", "professor protor poured the chemicals"],
        ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats"],
    ], columns=["qid", "query", "docno", "text"])
existing_index = PisaIndex.from_dataset('msmarco_passage')
textscorer = pt.TextScorer(takes="docs", body_attr="text", wmodel="BM25", background_index=existing_index)
rtr = textscorer.transform(df)

thanks

@seanmacavaney (Collaborator)

This would indeed be useful and is something that we've wanted for a while (cc @Parry-Parry). Unfortunately, PISA's data structures make this a bit challenging to implement.

A variation on this would be a scorer that takes the document IDs and scores them based on what's in the index. Similar challenges there too, though.

@cmacdonald (Contributor)

PISA has a forward index, right (I'm looking at http://data.terrier.org/indices/msmarco_passage/pisa_unstemmed/latest/)? So it's likely to be possible somehow?

@JMMackenzie @amallia do you have any code snippets?

Forward lookups would also allow the use of PRF techniques (see #29).

@JMMackenzie

PISA does have a forward index, but I believe it is mostly used during parsing/indexing (cc @elshize).

In the past, I have hacked forward index representations into PISA for things like PRF using this: https://github.com/JMMackenzie/fwd

It is simple and not well tested, but it basically reads PISA's canonical file format and builds a container of (compressed) document vectors; those vectors can then be iterated over, and so on.

It is possible to get something "proper" into PISA, but I don't think the current forward-index support in there is fit for what you want. I'll let Michal provide further info, since he did most of the work on the forward index.

@JMMackenzie

Just for further context, what does the parent/OP post here do exactly? I'm not super familiar with all of the magic bells and whistles of PyTerrier :-)

@elshize commented Nov 8, 2024

The short of it is that the forward index we currently have in PISA was not necessarily designed for lookups, though they should be possible with the sizes file. It also needs a lexicon to do a term lookup if you don't already know it. So it could be awkward to use, but doable. It's also not compressed, so that's a trade-off. That said, it can be memory-mapped, so the entire index doesn't need to be in memory.

I'm currently traveling, so I won't be able to have a look until after Nov 18. In the meantime, it would be helpful if someone could explain what is needed from PISA; then I could help us get there. Showing some APIs or function signatures that would be called on the index would be a good start.

@cmacdonald (Contributor) commented Nov 8, 2024

Just for further context, what does the parent/OP post here do exactly?

Scores/reranks the text of documents using BM25, where the IDF values come from an existing index. To be fair, I could probably mock this up to just about work if I could access the PISA lexicon alone, i.e. a Python API that returns the document frequency and global term frequency of a term in the lexicon, etc.:

index.getDocumentFrequency(term : str) -> int
index.getTermFrequency(term : str) -> int
index.getTotalNumberOfTokens() -> int

(For reference, num docs/terms is exposed in pyterrier_pisa at https://github.com/terrierteam/pyterrier_pisa/blob/main/src/pyterrier_pisa/__init__.py#L175-L181 via https://github.com/terrierteam/pyterrier_pisa/blob/main/src/pyterrier_pisa/_pisathon.cpp#L631-L652)
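
To make that concrete, here is a minimal sketch of how a lexicon-only API could back a text scorer. The index.get* calls are the hypothetical signatures above, num_docs is assumed to be obtainable from the existing accessors, and the regex tokeniser is only a stand-in for whatever term pipeline the index actually uses:

import math
import re

def bm25_text_score(query: str, doc_text: str, index, num_docs: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    # Average document length derived from the hypothetical lexicon-level statistic.
    avg_doclen = index.getTotalNumberOfTokens() / num_docs
    doc_terms = re.findall(r"\w+", doc_text.lower())
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(re.findall(r"\w+", query.lower())):
        df = index.getDocumentFrequency(term)  # hypothetical call
        if df == 0:
            continue  # term unseen in the background index
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        score += max(idf, 1e-6) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doclen))
    return score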

A variation on this would be a scorer that takes the document IDs and scores them based on what's in the index. Similar challenges there too, though.

This is the variant that needs access to the forward index.

It also needs a lexicon to do a term lookup if you don't already know it.

Acknowledged. For reference, for inspecting the contents of an arbitrary document, the Terrier API from Python looks like this:
https://pyterrier.readthedocs.io/en/latest/terrier-index-api.html#what-terms-occur-in-the-11th-document

The simplest variant in Python would be something like:

index.getDocumentContents(docid : int) -> Iterator[(int, int)] # termid, freq
index.getTermDocumentFrequency(term : int) -> int
index.getTermFrequency(term : int) -> int
index.getDocumentLength(docid : int) -> int

(That would allow both a reranker and PRF techniques to be implemented in Python.)
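
For illustration, a rough sketch of a docid-based BM25 reranker on top of those four calls (all hypothetical; it also assumes query terms have already been mapped to term IDs via some lexicon lookup, and that num_docs and the average document length are available as above):

import math
from typing import List

def score_docids(query_termids: List[int], docids: List[int], index,
                 num_docs: int, avg_doclen: float,
                 k1: float = 1.2, b: float = 0.75) -> List[float]:
    # All index.* calls below are the hypothetical signatures listed above.
    scores = []
    for docid in docids:
        doc_len = index.getDocumentLength(docid)
        contents = dict(index.getDocumentContents(docid))  # {termid: freq}
        score = 0.0
        for tid in query_termids:
            tf = contents.get(tid, 0)
            if tf == 0:
                continue
            df = index.getTermDocumentFrequency(tid)
            idf = math.log((num_docs - df + 0.5) / (df + 0.5))
            score += max(idf, 1e-6) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doclen))
        scores.append(score)
    return scores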

Sean's suggestion of "a scorer that takes the document IDs and scores them based on what's in the index" could also be implemented more directly by PISA, if a higher-level scorer API were exposed that takes the document IDs and scores them:

index.score(query, docids : List[int], wmodel : str) -> List[float]
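
Usage would then be a one-liner, e.g. (hypothetical API; the integer docids are invented for illustration):

scores = index.score("chemical reactions", docids=[1042, 77123], wmodel="BM25")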

@seanmacavaney (Collaborator)

The simplest variant in Python would be something like:

As an addendum, I think we'd also need a function that maps a string ID to an integer one:

index.get_doc_id(docno: str) -> int
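
Combined with the (equally hypothetical) higher-level scorer above, that would let everything be driven from external docnos:

# Both calls are hypothetical; docnos is some list of external string document IDs.
docids = [index.get_doc_id(docno) for docno in docnos]
scores = index.score("chemical reactions", docids=docids, wmodel="BM25")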

@JMMackenzie

For your call there, Sean, we can do that easily: you'd just need to load a "document mapping" that comes from PISA's lexicon tool (these are more or less just binary blobs built from text files).

I think the other calls can all be exposed directly from PISA's wand_data types: https://github.com/pisa-engine/pisa/blob/master/include/pisa/wand_data.hpp

Perhaps the easiest way to see where things come from/what they are is to look at the scorers: https://github.com/pisa-engine/pisa/blob/master/include/pisa/scorer/bm25.hpp

For example (from the BM25 scorer linked above):

    // IDF (inverse document frequency)
    float query_term_weight(uint64_t df, uint64_t num_docs) const {
        auto fdf = static_cast<float>(df);
        float idf = std::log((float(num_docs) - fdf + 0.5F) / (fdf + 0.5F));
        static const float epsilon_score = 1.0E-6;
        return std::max(epsilon_score, idf) * (1.0F + m_k1);
    }

    TermScorer term_scorer(uint64_t term_id) const override {
        auto term_len = this->m_wdata.term_posting_count(term_id);
        auto term_weight = query_term_weight(term_len, this->m_wdata.num_docs());
        auto s = [&, term_weight](uint32_t doc, uint32_t freq) {
            return term_weight * doc_term_weight(freq, this->m_wdata.norm_len(doc));
        };
        return s;
    }

Hopefully I haven't missed the mark on anything here!

@elshize commented Nov 21, 2024

As @JMMackenzie mentioned, the following should be easy to extract from that wand_data object:

index.getTermDocumentFrequency(term : int) -> int
index.getTermFrequency(term : int) -> int
index.getDocumentLength(docid : int) -> int

It will be slightly more complicated to get:

index.getDocumentContents(docid : int) -> Iterator[(int, int)] # termid, freq

This is because the forward index is a very simple structure containing only a list of term IDs per document. Thus, you will need to construct an iterator that maps that ID list to frequencies using inverted-index lookups: for each term ID in the document's contents, you grab the posting list for that term and look up the frequency recorded for that document.

Finally, get_doc_id is easy to get from a document lexicon, which is a structure analogous to the term lexicon but for documents. However, you'll have to load it into memory: unlike terms, documents are not sorted alphabetically, so you can't do a binary search over an mmapped file the way you can with terms; you'd have to scan the entire list, which makes little sense. So if you can afford to load it, that's the way to do it. The easiest way would be to create a payload vector from mmapped memory (https://github.com/pisa-engine/pisa/blob/master/tools/lexicon.cpp#L44), which supports iterators (https://github.com/pisa-engine/pisa/blob/master/include/pisa/payload_vector.hpp#L304). The values are strings and the position is the ID, so with that you can build a hash map or something similar.
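
A rough Python sketch of that docno-to-docid mapping, assuming access to the plain-text document-identifiers file that the binary lexicon is built from (one docno per line, in internal docid order) rather than the payload vector itself:

def load_docno_map(documents_path: str) -> dict:
    # Assumed layout: one external docno per line; the line number is the internal docid.
    docno_to_docid = {}
    with open(documents_path, "rt", encoding="utf-8") as f:
        for docid, line in enumerate(f):
            docno_to_docid[line.strip()] = docid
    return docno_to_docid

# get_doc_id then becomes a plain dictionary lookup: docno_to_docid[docno]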

@JMMackenzie does all that sound correct?

@JMMackenzie

Sorry about the late response, but I do agree with you @elshize.
