pisa index for Pyterrier TextScorer #30

Mandeep-Rathee opened this issue Nov 7, 2024 · 10 comments

@Mandeep-Rathee commented Nov 7, 2024

Hey,

Is it possible to use a PISA index in pt.TextScorer? For example, in the following code:

import pandas as pd
import pyterrier as pt
from pyterrier_pisa import PisaIndex

df = pd.DataFrame(
    [
        ["q1", "chemical reactions", "d1", "professor protor poured the chemicals"],
        ["q1", "chemical reactions", "d2", "chemical brothers turned up the beats"],
    ], columns=["qid", "query", "docno", "text"])
existing_index = PisaIndex.from_dataset('msmarco_passage')
textscorer = pt.TextScorer(takes="docs", body_attr="text", wmodel="BM25", background_index=existing_index)
rtr = textscorer.transform(df)

thanks

@seanmacavaney (Collaborator)

This would indeed be useful and is something that we've wanted for a while (cc @Parry-Parry). Unfortunately, PISA's data structures make this a bit challenging to implement.

A variation on this would be a scorer that takes the document IDs and scores them based on what's in the index. Similar challenges there too, though.

@cmacdonald (Contributor)

PISA has a forward index, right (I'm looking at http://data.terrier.org/indices/msmarco_passage/pisa_unstemmed/latest/)? So it's likely to be possible somehow?

@JMMackenzie @amallia do you have any code snippets?

Forward lookups would also allow the use of PRF techniques (see #29).

@JMMackenzie

PISA does have a forward index, but I believe it is mostly used during parsing/indexing (cc @elshize).

In the past, I have hacked forward index representations into PISA for things like PRF using this: https://github.com/JMMackenzie/fwd

It is simple and not well tested, but it basically reads PISA's canonical file format and builds a container of (compressed) document vectors; those vectors can then be iterated over, and so on.

It is possible to get something "proper" into PISA, but I don't think the current forward-index support in there is fit for what you want. I'll let Michal provide further info, since he did most of the work on the forward index.

@JMMackenzie

Just for further context, what does the parent/OP post here do exactly? I'm not super familiar with all of the magic bells and whistles of PyTerrier :-)

@elshize commented Nov 8, 2024

The short of it is that the forward index we currently have in PISA was not necessarily designed for lookups, though they should be possible with the sizes file. It also needs a lexicon to do a term lookup if you don't already know it. So it could be awkward to use, but doable. It's also not compressed, so that's a trade-off. That said, it can be memory-mapped, so the entire index doesn't need to be in memory.

I'm currently traveling, so I won't be able to have a look until after Nov 18. In the meantime, it would be helpful if someone could explain what is needed from PISA; then I could help us get there. Showing some APIs or function signatures that would be called on the index would be a good start.

@cmacdonald (Contributor) commented Nov 8, 2024

Just for further context, what does the parent/OP post here do exactly?

Scores/reranks the text of documents using BM25, where the IDF values come from an existing index. To be fair, I could probably mock this up to just about work if I could access the PISA lexicon alone, i.e. a Python API that returns the document frequency and global term frequency of a term in the lexicon, etc.:

index.getDocumentFrequency(term : str) -> int
index.getTermFrequency(term : str) -> int
index.getTotalNumberOfTokens() -> int

(For reference, num docs/terms is exposed in pyterrier_pisa at https://github.com/terrierteam/pyterrier_pisa/blob/main/src/pyterrier_pisa/__init__.py#L175-L181 via https://github.com/terrierteam/pyterrier_pisa/blob/main/src/pyterrier_pisa/_pisathon.cpp#L631-L652)
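
To make that concrete, here is a minimal sketch of how a lexicon-only API could back a text scorer. The index.get* calls are the hypothetical signatures above, num_docs is assumed to be obtainable from the existing accessors, and the regex tokeniser is only a stand-in for whatever term pipeline the index actually uses:

import math
import re

def bm25_text_score(query: str, doc_text: str, index, num_docs: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    # Average document length derived from the hypothetical lexicon-level statistic.
    avg_doclen = index.getTotalNumberOfTokens() / num_docs
    doc_terms = re.findall(r"\w+", doc_text.lower())
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(re.findall(r"\w+", query.lower())):
        df = index.getDocumentFrequency(term)  # hypothetical call
        if df == 0:
            continue  # term unseen in the background index
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        score += max(idf, 1e-6) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doclen))
    return score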

A variation on this would be a scorer that takes the document IDs and scores them based on what's in the index. Similar challenges there too, though.

This is the variant that needs access to the forward index.

It also needs a lexicon to do a term lookup if you don't already know it.

Acknowledged. For reference, for inspecting the contents of an arbitrary document, the Terrier API from Python looks like this:
https://pyterrier.readthedocs.io/en/latest/terrier-index-api.html#what-terms-occur-in-the-11th-document

The simplest variant in Python would be something like:

index.getDocumentContents(docid : int) -> Iterator[(int, int)] # termid, freq
index.getTermDocumentFrequency(term : int) -> int
index.getTermFrequency(term : int) -> int
index.getDocumentLength(docid : int) -> int

(That would allow both a reranker and PRF techniques to be implemented in Python.)
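
For illustration, a rough sketch of a docid-based BM25 reranker on top of those four calls (all hypothetical; it also assumes query terms have already been mapped to term IDs via some lexicon lookup, and that num_docs and the average document length are available as above):

import math
from typing import List

def score_docids(query_termids: List[int], docids: List[int], index,
                 num_docs: int, avg_doclen: float,
                 k1: float = 1.2, b: float = 0.75) -> List[float]:
    # All index.* calls below are the hypothetical signatures listed above.
    scores = []
    for docid in docids:
        doc_len = index.getDocumentLength(docid)
        contents = dict(index.getDocumentContents(docid))  # {termid: freq}
        score = 0.0
        for tid in query_termids:
            tf = contents.get(tid, 0)
            if tf == 0:
                continue
            df = index.getTermDocumentFrequency(tid)
            idf = math.log((num_docs - df + 0.5) / (df + 0.5))
            score += max(idf, 1e-6) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doclen))
        scores.append(score)
    return scores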

Sean's suggestion of "a scorer that takes the document IDs and scores them based on what's in the index" could also be implemented more directly by PISA, if a higher-level scorer API were exposed that takes the document IDs and scores them:

index.score(query, docids : List[int], wmodel : str) -> List[float]
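
Usage would then be a one-liner, e.g. (hypothetical API; the integer docids are invented for illustration):

scores = index.score("chemical reactions", docids=[1042, 77123], wmodel="BM25")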

@seanmacavaney (Collaborator)

The simplest variant in Python would be something like:

As an addendum, I think we'd also need a function that maps a string ID to an integer one:

index.get_doc_id(docno: str) -> int
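
Combined with the (equally hypothetical) higher-level scorer above, that would let everything be driven from external docnos:

# Both calls are hypothetical; docnos is some list of external string document IDs.
docids = [index.get_doc_id(docno) for docno in docnos]
scores = index.score("chemical reactions", docids=docids, wmodel="BM25")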

@JMMackenzie

For your call there, Sean, we can do that easily: you'd just need to load a "document mapping" that comes from PISA's lexicon tool (these are more or less just binary blobs built from text files).

I think the other calls can all be exposed directly from PISA's wand_data types: https://github.com/pisa-engine/pisa/blob/master/include/pisa/wand_data.hpp

Perhaps the easiest way to see where things come from/what they are is to look at the scorers: https://github.com/pisa-engine/pisa/blob/master/include/pisa/scorer/bm25.hpp

For example (from the BM25 scorer linked above):

    // IDF (inverse document frequency)
    float query_term_weight(uint64_t df, uint64_t num_docs) const {
        auto fdf = static_cast<float>(df);
        float idf = std::log((float(num_docs) - fdf + 0.5F) / (fdf + 0.5F));
        static const float epsilon_score = 1.0E-6;
        return std::max(epsilon_score, idf) * (1.0F + m_k1);
    }

    TermScorer term_scorer(uint64_t term_id) const override {
        auto term_len = this->m_wdata.term_posting_count(term_id);
        auto term_weight = query_term_weight(term_len, this->m_wdata.num_docs());
        auto s = [&, term_weight](uint32_t doc, uint32_t freq) {
            return term_weight * doc_term_weight(freq, this->m_wdata.norm_len(doc));
        };
        return s;
    }

Hopefully I haven't missed the mark on anything here!

@elshize commented Nov 21, 2024

As @JMMackenzie mentioned, the following should be easy to extract from that wand_data object:

index.getTermDocumentFrequency(term : int) -> int
index.getTermFrequency(term : int) -> int
index.getDocumentLength(docid : int) -> int

It will be slightly more complicated to get:

index.getDocumentContents(docid : int) -> Iterator[(int, int)] # termid, freq

This is because the forward index is a very simple structure containing only a list of term IDs per document. Thus, you will need to construct an iterator that maps that ID list to frequencies using inverted-index lookups: for each term ID in the document's contents, you grab the posting list for that term and look up the frequency recorded for that document.

Finally, get_doc_id is easy to get from a document lexicon, which is a structure analogous to the term lexicon but for documents. However, you'll have to load it into memory: unlike terms, documents are not sorted alphabetically, so you can't do a binary search over an mmapped file the way you can with terms; you'd have to scan the entire list, which makes little sense. So if you can afford to load it, that's the way to do it. The easiest way would be to create a payload vector from mmapped memory (https://github.com/pisa-engine/pisa/blob/master/tools/lexicon.cpp#L44), which supports iterators (https://github.com/pisa-engine/pisa/blob/master/include/pisa/payload_vector.hpp#L304). The values are strings and the position is the ID, so with that you can build a hash map or something similar.
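
A rough Python sketch of that docno-to-docid mapping, assuming access to the plain-text document-identifiers file that the binary lexicon is built from (one docno per line, in internal docid order) rather than the payload vector itself:

def load_docno_map(documents_path: str) -> dict:
    # Assumed layout: one external docno per line; the line number is the internal docid.
    docno_to_docid = {}
    with open(documents_path, "rt", encoding="utf-8") as f:
        for docid, line in enumerate(f):
            docno_to_docid[line.strip()] = docid
    return docno_to_docid

# get_doc_id then becomes a plain dictionary lookup: docno_to_docid[docno]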

@JMMackenzie does all that sound correct?

@JMMackenzie

Sorry about the late response, but I do agree with you @elshize.
