Number of partitions is fixed to 100 #66

afurkank · 2024-02-04T13:54:48Z

Hi, thanks for this great repo!

In indexing.py, the number of partitions is set to 100 here.

Since this condition will always be false, the index will always consist of 100 partitions.

Is this the intended behavior? Would that affect the retrieval effectiveness?


cmacdonald · 2024-02-04T20:04:45Z

Hi @afurkank

Thanks for the comments and for noting this. The partitions code is a bit of a mess.

I think we were trying to reproduce the logic at
https://github.com/cmacdonald/ColBERT/blob/v0.2/colbert/index_faiss.py#L31
which is however initialised to None at:
https://github.com/cmacdonald/ColBERT/blob/v0.2/colbert/utils/parser.py#L80

The problem is how the Factory class knows what the partitions setting should be:
https://github.com/terrierteam/pyterrier_colbert/blob/main/pyterrier_colbert/ranking.py#L498
(I wanted to avoid having index properties files, as upstream ColBERT doesn't have them)

The easiest hack would be to look for invfqp.*.faiss in https://github.com/terrierteam/pyterrier_colbert/blob/main/pyterrier_colbert/ranking.py#L612-L629
and open the first one, warning if more than one is found.
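A minimal sketch of that hack, using only the standard library. The function name `find_faiss_index` is hypothetical and not part of pyterrier_colbert. The default pattern `ivfpq.*.faiss` is an assumption about how the files are named (the comment above writes it as `invfqp.*.faiss`), so pass whichever pattern matches the files on disk. Reading the partition count back out of the file name is also an assumption about the naming scheme.

```python
import re
import warnings
from pathlib import Path
from typing import Optional, Tuple


def find_faiss_index(index_dir: str,
                     pattern: str = "ivfpq.*.faiss") -> Tuple[Path, Optional[int]]:
    """Pick the first Faiss index file in index_dir that matches pattern.

    Returns the chosen path and the partition count parsed from its name,
    or None if the name does not end in '.<digits>.faiss'.
    """
    matches = sorted(Path(index_dir).glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no file matching {pattern!r} in {index_dir}")
    if len(matches) > 1:
        # More than one candidate: warn and fall back to the first one.
        names = ", ".join(p.name for p in matches)
        warnings.warn(f"multiple Faiss indices found ({names}); using {matches[0].name}")
    chosen = matches[0]
    m = re.fullmatch(r".*\.(\d+)\.faiss", chosen.name)
    partitions = int(m.group(1)) if m else None
    return chosen, partitions
```

With this, the Factory would not need an index properties file: the partitions setting is recovered from whatever index file is present.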

Craig

afurkank · 2024-02-04T20:35:52Z

Thanks for the quick response.

> The problem is how the Factory class knows what the partitions setting should be:
> https://github.com/terrierteam/pyterrier_colbert/blob/main/pyterrier_colbert/ranking.py#L498
> (I wanted to avoid having index properties files, as upstream ColBERT doesn't have them)

So this shouldn't affect the scores if I understood it correctly?

afurkank · 2024-02-04T20:49:54Z

So I just ran a comparison on the Vaswani dataset between indexing with the number of partitions fixed to 100 and indexing with the number of partitions set to 1 << math.ceil(math.log2(8 * math.sqrt(num_embeddings))). It appears there is very little difference.

For example, when number of partitions is fixed to 100, nDCG@10 for the Vaswani dataset is 0.426272.
When number of partitions is 1 << math.ceil(math.log2(8 * math.sqrt(num_embeddings))), nDCG@10 for the same dataset is 0.425488.
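For reference, the expression above rounds 8 * sqrt(num_embeddings) up to the next power of two. Here is a small sketch that wraps it in a function. The name `default_num_partitions` is hypothetical, and `num_embeddings` is assumed to be the total number of embeddings stored in the index.

```python
import math


def default_num_partitions(num_embeddings: int) -> int:
    """Smallest power of two that is >= 8 * sqrt(num_embeddings)."""
    if num_embeddings <= 0:
        raise ValueError("num_embeddings must be positive")
    return 1 << math.ceil(math.log2(8 * math.sqrt(num_embeddings)))


# 8 * sqrt(1_000_000) = 8000, and the next power of two is 8192.
partitions_for_one_million = default_num_partitions(1_000_000)
```

So for any index with more than a few hundred embeddings, this value is well above the fixed 100.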

cmacdonald · 2024-02-05T14:32:33Z

The Faiss ANN stage is only for identifying candidates. The 2nd stage reranking process will hide much of the difference.

Vaswani is small enough that enough candidate documents are probably identified for each query either way. Even at ColBERT it might be enough. The number of partitions would also have an efficiency impact.
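A toy illustration of why the second stage hides the difference, assuming a generic "candidates, then exact rerank" pipeline rather than the actual pyterrier_colbert code. The document ids, scores, and the `rerank` helper are all made up for the example. As long as both candidate sets contain the documents that score highest under the exact scorer, the final top-k is identical.

```python
from typing import Callable, Iterable, List


def rerank(candidates: Iterable[int],
           exact_score: Callable[[int], float],
           k: int) -> List[int]:
    """Second stage: order candidate doc ids by exact score, keep the top k."""
    return sorted(set(candidates), key=exact_score, reverse=True)[:k]


# Hypothetical exact (second-stage) scores for six documents.
exact = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.2, 5: 0.1, 6: 0.05}

# Two first-stage candidate sets, e.g. from ANN indices built with
# different partition counts. They differ only in low-scoring documents.
candidates_a = [1, 2, 3, 4]
candidates_b = [3, 2, 1, 5, 6]

top_a = rerank(candidates_a, exact.__getitem__, k=3)
top_b = rerank(candidates_b, exact.__getitem__, k=3)
```

Here `top_a` and `top_b` are the same list, even though the candidate sets differ. On a larger collection the first stage is more likely to miss a relevant document, which is where the partitions setting could start to matter.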

Do you have a ColBERT index for MSMARCO? It would be reasonably straightforward to build Faiss indices with both 100 and the default value.

afurkank · 2024-02-05T16:51:15Z

Unfortunately, I do not have the index for MSMARCO. I don't have a PC with enough compute power to index a dataset that big.

If it would help, I could open a pull request to fix both issues: the number of partitions being fixed to 100, and Faiss index files whose names include the partition number (such as invfqp.*.faiss, as you mentioned earlier) not being accepted.

cmacdonald · 2024-02-05T17:23:02Z

A PR would help massively, thank you @afurkank! We'll just not merge it until we have checked the effectiveness numbers.
