Number of partitions is fixed to 100 #66
Comments
Hi @afurkank,
Thanks for the comments and for noting this. The partitions code is a bit of a mess. I think we were trying to reproduce the logic at
The problem is how the Factory class knows what the partitions setting should be:
The easiest hack would be to look for invfqp.*.faiss in https://github.com/terrierteam/pyterrier_colbert/blob/main/pyterrier_colbert/ranking.py#L612-L629
Craig
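The "look for invfqp.*.faiss" hack could be sketched roughly as follows. This is only an illustration, not code from pyterrier_colbert: the function name `infer_num_partitions`, its `prefix` parameter, and the assumption that the partition count appears as a plain integer between the prefix and `.faiss` are all hypothetical. The prefix is taken as written in this thread and should be checked against the files actually present in the index directory.

```python
import re
from pathlib import Path
from typing import Optional


def infer_num_partitions(index_dir: str, prefix: str = "invfqp") -> Optional[int]:
    """Return the partition count encoded in a '<prefix>.<N>.faiss' filename.

    Hypothetical helper: returns None when no such file exists, so the
    caller can fall back to whatever default it already uses.
    """
    # Assumed naming scheme: prefix, a dot, an integer, then '.faiss'.
    pattern = re.compile(rf"^{re.escape(prefix)}\.(\d+)\.faiss$")
    counts = []
    for path in Path(index_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            counts.append(int(match.group(1)))
    if not counts:
        return None
    if len(counts) > 1:
        # Several candidate files: refuse to guess which one was intended.
        raise ValueError(f"ambiguous FAISS files for partitions {sorted(counts)}")
    return counts[0]
```

A caller could use the returned value when it is not None and keep its current behaviour otherwise, which would leave existing indexes working.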
Thanks for the quick response.
So, if I understood correctly, this shouldn't affect the scores?
So I just did a comparison with the Vaswani dataset between indexing with the number of partitions fixed to 100 and when the number of partitions is
For example, when the number of partitions is fixed to 100, nDCG@10 for the Vaswani dataset is
The Faiss ANN stage is only for identifying candidates; the second-stage reranking process will hide much of the difference. Vaswani is small enough that enough candidate documents would probably be identified for each query. Even at ColBERT it might be enough. The number of partitions would also have an efficiency impact. Do you have a ColBERT index for MSMARCO? It would be reasonably straightforward to build Faiss indices with both 100 and the default value.
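A toy sketch of why the reranking stage can hide first-stage differences: if two candidate sets both contain the truly best documents, exact reranking returns the same top-k from either. The scores and candidate lists below are made-up numbers, and `rerank` is a hypothetical helper, not part of ColBERT or pyterrier_colbert.

```python
from typing import Callable, Iterable, List


def rerank(candidates: Iterable[int],
           exact_score: Callable[[int], float],
           k: int) -> List[int]:
    """Second stage: score every candidate exactly and keep the top k."""
    return sorted(set(candidates), key=exact_score, reverse=True)[:k]


# Made-up exact scores for six documents.
scores = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.2, 4: 0.1, 5: 0.05}

# Two hypothetical first-stage candidate sets, e.g. from two partition settings.
candidates_a = [0, 1, 2, 3]
candidates_b = [2, 5, 0, 1, 4]

top_a = rerank(candidates_a, scores.__getitem__, k=3)
top_b = rerank(candidates_b, scores.__getitem__, k=3)
print(top_a == top_b)  # both candidate sets contain documents 0, 1 and 2
```

The rankings only diverge when a first-stage setting fails to surface one of the best documents at all, which is more likely on a larger collection.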
I do not have the index for MSMARCO, unfortunately; I don't have a PC with enough compute power to index a dataset that big. If it would help, I could open a pull request that fixes the number of partitions being fixed to 100 and also makes the code accept Faiss indexes whose names include the partition number (such as invfqp.*.faiss, as you mentioned earlier).
A PR would help massively, thank you @afurkank! We'll just not merge it until we have checked the effectiveness numbers.
Hi, thanks for this great repo!
In indexing.py, the number of partitions is set to 100 here. Since this condition will always be false, the index will always consist of 100 partitions.
Is this the intended behavior? Would that affect the retrieval effectiveness?
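The pattern being described might look something like the sketch below. It is an assumption about the shape of the problem, not the actual code in indexing.py: the function names are hypothetical, and `default_partitions` is a placeholder heuristic standing in for whatever size-dependent default the indexer would otherwise compute.

```python
from typing import Optional


def default_partitions(num_embeddings: int) -> int:
    # Placeholder heuristic (an assumption, not the real default).
    return max(1, num_embeddings // 1000)


def choose_partitions_buggy(num_embeddings: int,
                            requested: Optional[int] = None) -> int:
    partitions = 100  # hard-coded before the check below
    if partitions is None:  # can never be true, so the default is never used
        partitions = default_partitions(num_embeddings)
    return partitions


def choose_partitions_fixed(num_embeddings: int,
                            requested: Optional[int] = None) -> int:
    # Only fall back to the computed default when nothing was requested.
    if requested is None:
        return default_partitions(num_embeddings)
    return requested
```

With the first version the result is 100 regardless of collection size; with the second, the value depends on the caller's setting or on the number of embeddings.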