Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Number of partitions is fixed to 100 #66

Open
afurkank opened this issue Feb 4, 2024 · 6 comments
Open

Number of partitions is fixed to 100 #66

afurkank opened this issue Feb 4, 2024 · 6 comments

Comments

@afurkank
Copy link

afurkank commented Feb 4, 2024

Hi, thanks for this great repo!

In indexing.py, the number of partitions is set to 100 here.

Since this condition will always be false, the index will always consist of 100 partitions.

Is this the intended behavior? Would that affect the retrieval effectiveness?

@cmacdonald
Copy link
Collaborator

Hi @afurkank

Thanks for the comments and noting this. Partitions code is a bit of a mess.

I think we were trying to reproduce the logic at
https://github.com/cmacdonald/ColBERT/blob/v0.2/colbert/index_faiss.py#L31
which is however initialised to None at:
https://github.com/cmacdonald/ColBERT/blob/v0.2/colbert/utils/parser.py#L80

The problem is how the Factory class knows what the partitions setting should be:
https://github.com/terrierteam/pyterrier_colbert/blob/main/pyterrier_colbert/ranking.py#L498
(I wanted to avoid having index properties files, as upstream ColBERT doesnt have them)

The easiest hack would be to look for invfqp.*.faiss in https://github.com/terrierteam/pyterrier_colbert/blob/main/pyterrier_colbert/ranking.py#L612-L629
and open the first one, warning if more than one is found.

Craig

@afurkank
Copy link
Author

afurkank commented Feb 4, 2024

Thanks for the quick response.

The problem is how the Factory class knows what the partitions setting should be:
https://github.com/terrierteam/pyterrier_colbert/blob/main/pyterrier_colbert/ranking.py#L498
(I wanted to avoid having index properties files, as upstream ColBERT doesnt have them)

So this shouldn't affect the scores if I understood it correctly?

@afurkank
Copy link
Author

afurkank commented Feb 4, 2024

So I just did a comparison with the Vaswani dataset between indexing with number of partitions fixed to 100 and when number of partitions is 1 << math.ceil(math.log2(8 * math.sqrt(num_embeddings))). It appears there is very little difference.

For example, when number of partitions is fixed to 100, nDCG@10 for the Vaswani dataset is 0.426272.
When number of partitions is 1 << math.ceil(math.log2(8 * math.sqrt(num_embeddings))), nDCG@10 for the same dataset is 0.425488.

@cmacdonald
Copy link
Collaborator

The Faiss ANN stage is only for identifying candidates. The 2nd stage reranking process will hide much of the difference.

Vaswani is small enough that probably enough candidate documents would be identified for each query. Even at ColBERT it might be enough. Num partitions would also have a efficiency impact.

Do you have colbert index for msmarco? it would be reasonably straightforward to built faiss indices with both 100 and the default value.

@afurkank
Copy link
Author

afurkank commented Feb 5, 2024

I do not have the index for msmarco unfortunately. I don't have a PC with enough compute power to index that big of a dataset.

I could open a pull request for fixing the issue of number of partitions being fixed to 100 and the issue of not accepting different faiss indexes with names including the partition number(such as invfqp.*.faiss, as you mentioned earlier) if it would help.

@cmacdonald
Copy link
Collaborator

cmacdonald commented Feb 5, 2024

PR would help massively, thank you @afurkank!. We'll just not merge it till we have checked the effectiveness numbers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants