Add LoTTE Benchmark to MTEB #2009
base: main
Conversation
Thanks for the PR. Added a few suggestions.
@@ -1434,3 +1435,18 @@ def load_results(
    url={https://arxiv.org/abs/2412.08329},
}""",
)

MTEB_LOTTE = Benchmark(
This will just appear as an empty leaderboard. We would probably want at least some models evaluated on it before adding it to the leaderboard (otherwise it will seem like a bug).
I tried evaluating some models on Colab, but the dataset is huge, so it takes a long time to load and evaluate. I tried chunking it down to verify that it's working, but I'm not able to run the entire benchmark.
Is this fine in the current version?
Hmm, should we maybe consider downsampling the dataset then? Not really worth adding a benchmark if people can't run it?
This is not a very large dataset: it has 5 splits of approximately 30 MB each, as available at https://huggingface.co/datasets/colbertv2/lotte. However, the data in the tar file may differ, since that file is approximately 2 GB.
The dataset itself is fine, but sentence-transformers takes too long to encode it, even on Colab. Can you run this on your end and let me know if it works?
When I'm trying to run your task, I receive:

>>> mteb.get_task("LoTTE").load_data()
{'queries': {'test': {}}, 'corpus': {'test': {}}, 'relevant_docs': {'test': {}}}

Can you please run a check to make sure everything is working correctly? The data is not loading at the moment.
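A quick recursive sanity check would catch these empty splits before a full evaluation run. This is a hypothetical helper sketch, not part of mteb; `assert_nonempty` and the sample dict are illustrative:

```python
def assert_nonempty(data, path="root"):
    """Raise ValueError if any nested dict is empty (illustrative helper)."""
    if isinstance(data, dict):
        if not data:
            raise ValueError(f"Empty split at {path}")
        for key, value in data.items():
            assert_nonempty(value, f"{path}.{key}")

# The load_data() output shown above would fail this check:
loaded = {"queries": {"test": {}}, "corpus": {"test": {}}, "relevant_docs": {"test": {}}}
try:
    assert_nonempty(loaded)
    data_ok = True
except ValueError:
    data_ok = False
```

Running something like this right after `load_data()` makes a silent empty-corpus bug fail loudly instead of producing degenerate scores.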
Also, when I try to run it, I receive an error:

TypeError: can only concatenate str (not "dict") to str

Can you start a run to validate that your data is loading correctly? If you have low RAM, you can do this on Kaggle/Colab.
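One plausible cause of that TypeError (an assumption on my part, not confirmed from the traceback): MTEB retrieval corpora commonly store each document as a dict with `title`/`text` fields, and string concatenation against such a dict reproduces the error exactly. A minimal sketch:

```python
# A corpus entry stored as a dict, as MTEB-style retrieval corpora often are:
doc = {"title": "", "text": "How do I fix my swing?"}

try:
    prompt = "passage: " + doc  # fails: str + dict
    error_message = ""
except TypeError as e:
    error_message = str(e)  # "can only concatenate str (not \"dict\") to str"

# Concatenating the text field instead works:
fixed_prompt = "passage: " + doc["text"]
```

If this is the cause, the loader should either emit plain strings or the task should unwrap the `text` field before encoding.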
merged_queries = {}
merged_corpus = {}
merged_relevant = {}
for domain in self.queries:
    if split in self.queries[domain]:
        merged_queries.update(self.queries[domain][split])
    for key, value in self.queries[domain].items():
        if key.startswith(split) and key != split:
            merged_queries.update(value)
for domain in self.corpus:
    if split in self.corpus[domain]:
        merged_corpus.update(self.corpus[domain][split])
for domain in self.relevant_docs:
    if split in self.relevant_docs[domain]:
        merged_relevant.update(self.relevant_docs[domain][split])
    for key, value in self.relevant_docs[domain].items():
        if key.startswith(split) and key != split:
            merged_relevant.update(value)
I don't think we need to merge all the queries, corpus, etc. I think you should keep the corpus and queries per domain.
Okay, fixed in the latest commit.
if corpus_file.exists():
    with open(corpus_file, encoding="utf-8") as f:
        self.corpus[domain][split] = dict(
            line.strip().split("\t", 1) for line in f if line.strip()
        )
elif metadata_file.exists():
    corpus = {}
    with open(metadata_file, encoding="utf-8") as f:
        for line in f:
            try:
                obj = json.loads(line)
                doc_id = obj.get("pid") or obj.get("id")
                text = obj.get("text") or obj.get("body")
                if doc_id and text:
                    corpus[doc_id] = text
            except Exception as e:
                logger.error(f"Error parsing {metadata_file}: {e}")
    self.corpus[domain][split] = corpus
else:
    logger.warning(f"No corpus file found for {domain} {split}.")
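The TSV branch above relies on `split("\t", 1)`, where the maxsplit of 1 matters: only the first tab separates the doc id from the text, so tabs inside a passage stay in the text. A small sketch of that behavior (the sample lines are made up):

```python
# "doc_id<TAB>text" lines, one containing an extra tab inside the passage:
lines = ["d1\tFirst passage", "d2\tSecond passage\twith a tab inside", ""]

# maxsplit=1 keeps later tabs as part of the text; blank lines are skipped:
corpus = dict(line.strip().split("\t", 1) for line in lines if line.strip())
```

One caveat worth guarding against: a non-blank line with no tab at all would make `dict()` raise a ValueError, so malformed corpus files fail the whole load rather than being skipped.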
I tried to run the task, but an error occurred.

task.queries["writing"].keys() # dict_keys(['test', 'test.forum'])
task.relevant_docs["writing"].keys() # dict_keys(['test', 'test.forum'])
task.corpus["writing"].keys() # dict_keys(['test'])

MTEB expects all data for a task run to be in one split, but right now the corpus uses a different naming scheme than the queries. I think we should change the domains to writing.search and writing.forum to align with the MTEB approach. What do you think?
That's why I had merged them earlier. We now load data per domain without merging the "search" and "forum" items into one key. Instead, for each domain we create separate sub-dictionaries for "search" and "forum" queries (and similarly for qrels):

{
  "corpus": { "writing": { ... }, "recreation": { ... }, ... },
  "queries": {
    "writing": { "search": { ... }, "forum": { ... } },
    "recreation": { "search": { ... }, "forum": { ... } },
    ...
  },
}
You should create them like this:
- corpus: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
- queries: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
- relevant_docs: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
because MTEB can't handle nested dicts.
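The suggested restructuring above amounts to flattening the nested `domain -> split -> data` layout into single `domain_split` keys. A sketch of that transformation, assuming the nested layout from the earlier comment (`flatten_splits` is an illustrative name, not an mteb API):

```python
def flatten_splits(nested):
    """Flatten {"writing": {"search": {...}, "forum": {...}}}
    into {"writing_search": {...}, "writing_forum": {...}}."""
    flat = {}
    for domain, splits in nested.items():
        for split_name, data in splits.items():
            flat[f"{domain}_{split_name}"] = data
    return flat

queries = {"writing": {"search": {"q1": "..."}, "forum": {"q2": "..."}}}
flat_queries = flatten_splits(queries)
# keys are now "writing_search" and "writing_forum" at the top level
```

Applying the same flattening to corpus, queries, and relevant_docs keeps all three aligned on the same top-level keys, which is what the per-split evaluation loop needs.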
Okay, updated that.
Description:
This PR integrates the LoTTE (Long-Tail Topic-stratified Evaluation for IR) benchmark into MTEB. LoTTE consists of domain-specific retrieval tasks derived from StackExchange and GooAQ, evaluating models on natural, information-seeking queries in long-tail topics.
Closes #1836
Changes:
Testing:
✅ Verified that LoTTERetrieval runs successfully with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 and intfloat/multilingual-e5-small.
✅ Ensured scores are neither trivial (near 100%) nor random (near 0%).
✅ Passed make test and make lint.
Notes:
Updates: