
Add LoTTE Benchmark to MTEB #2009

Open · wants to merge 13 commits into main
Conversation

@agu18dec commented Feb 7, 2025

Description:
This PR integrates the LoTTE (Long-Tail Topic-stratified Evaluation for IR) benchmark into MTEB. LoTTE consists of domain-specific retrieval tasks derived from StackExchange and GooAQ, evaluating models on natural, information-seeking queries in long-tail topics.

Closes #1836

Changes:

  • Added LoTTERetrieval task under mteb/tasks/Retrieval/eng/LoTTE_Retrieval.py.
  • Implemented dataset loading, transformation, and evaluation logic.
  • Registered LoTTE as a benchmark in benchmarks.py.
  • Updated metadata to ensure compliance with TaskMetadata.

Testing:
✅ Verified that LoTTERetrieval runs successfully with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 and intfloat/multilingual-e5-small.
✅ Ensured scores are neither trivial (near 100%) nor random (near 0%).
✅ Passed make test and make lint.

Notes:

  • Dataset is hosted on Hugging Face, using revision "main".
  • Benchmark supports "dev" and "test" splits, with "success@5" as the main metric.
  • Looking forward to feedback!
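
For reference, success@k is the fraction of queries that have at least one relevant document among the top-k retrieved results. A minimal sketch of the metric (illustrative only, not MTEB's internal implementation; all names are invented):

```python
def success_at_k(ranked, qrels, k=5):
    """Success@k: share of queries with >= 1 relevant doc in the top k.

    ranked: {qid: [doc_id, ...]} ordered best-first.
    qrels:  {qid: {doc_id: relevance}} with relevance > 0 meaning relevant.
    """
    hits = 0
    for qid, docs in ranked.items():
        relevant = {d for d, r in qrels.get(qid, {}).items() if r > 0}
        if any(d in relevant for d in docs[:k]):
            hits += 1
    return hits / max(len(ranked), 1)
```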

Updates:

  • Moved LoTTERetrieval.py to mteb/tasks/Retrieval/eng/
  • Used dataset_transform() instead of load_data()
  • Ensured eval_splits=["test"]
  • Fixed eval_langs while keeping domain-specific mappings

@KennethEnevoldsen (Contributor) left a comment

Thanks for the PR. Added a few suggestions.

mteb/tasks/Retrieval/lotte/LoTTERetrieval.py — four review comments (outdated, resolved)
@@ -1434,3 +1435,18 @@ def load_results(
url={https://arxiv.org/abs/2412.08329},
}""",
)


MTEB_LOTTE = Benchmark(
Contributor:
This will just appear as an empty leaderboard. We would probably want at least some models evaluated on it before adding it to the leaderboard (otherwise it will seem like a bug).

Author:
I tried evaluating some models on Colab, but the dataset is huge, so it takes a long time to load and evaluate. I tried chunking it down to ensure it's working, but I'm not able to run the entire benchmark.

Author:
Is this fine in the current version?

Contributor:
Hmm, should we maybe consider downsampling the dataset then? It's not really worth adding a benchmark if people can't run it.

@Samoed (Collaborator) commented Feb 10, 2025:
This is not a very large dataset. It has 5 splits of approximately 30 MB each, as available at https://huggingface.co/datasets/colbertv2/lotte. However, the data in the tar file may be different, because its size is approximately 2 GB.

Author:
The dataset is fine, but sentence-transformers takes too long to encode it even on Colab. Can you run this on your end and let me know if it works?

mteb/benchmarks/benchmarks.py — review comment (outdated, resolved)
@Samoed (Collaborator) commented Feb 7, 2025

When I try to run your task, I receive:

  File "/home/samoed/Desktop/mteb/orig_mteb/mteb/abstasks/AbsTaskRetrieval.py", line 130, in _load_corpus
    corpus_ds = load_dataset(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/load.py", line 2606, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/load.py", line 2314, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/builder.py", line 374, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/builder.py", line 601, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'corpus' not found. Available: ['lifestyle', 'pooled', 'recreation', 'science', 'technology', 'writing']
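
The traceback shows that "corpus" is not a config name on the Hub repo; each LoTTE domain is its own config, so the loader has to pass the domain name through. A hypothetical guard illustrating this (`AVAILABLE` mirrors the list in the error; the helper name is invented):

```python
# Configs exposed by https://huggingface.co/datasets/colbertv2/lotte,
# as listed in the ValueError above.
AVAILABLE = ["lifestyle", "pooled", "recreation", "science", "technology", "writing"]

def corpus_config_for(domain: str) -> str:
    """Map a LoTTE domain to the BuilderConfig name load_dataset accepts.

    Hypothetical helper: the real loader would call something like
    load_dataset("colbertv2/lotte", corpus_config_for(domain), ...)
    rather than a hard-coded "corpus" config.
    """
    if domain not in AVAILABLE:
        raise ValueError(f"Unknown LoTTE domain {domain!r}; available: {AVAILABLE}")
    return domain
```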

@Samoed (Collaborator) commented Feb 8, 2025

Can you please run a check to make sure everything is working correctly? The data is not loading at the moment:

>>> mteb.get_task("LoTTE").load_data()
{'queries': {'test': {}}, 'corpus': {'test': {}}, 'relevant_docs': {'test': {}}}
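
A small guard like the following (hypothetical, not part of the PR) would surface the silently-empty load shown above instead of letting it propagate to evaluation:

```python
def assert_non_empty(data):
    """Fail fast if any split of queries/corpus/relevant_docs is empty.

    data is assumed to have the shape returned by load_data() above:
    {"queries": {split: {...}}, "corpus": {...}, "relevant_docs": {...}}.
    """
    for part in ("queries", "corpus", "relevant_docs"):
        for split, items in data[part].items():
            if not items:
                raise ValueError(f"{part}[{split!r}] is empty after load_data()")
```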

@Samoed (Collaborator) left a comment

Also, when I try to run it, I receive an error:

TypeError: can only concatenate str (not "dict") to str

Can you try starting a run to validate that your data loads correctly? If you are low on RAM, you can do this on Kaggle/Colab.

mteb/tasks/Retrieval/eng/LoTTERetrieval.py — two review comments (outdated, resolved)
@KennethEnevoldsen (Contributor) commented:

I will unsubscribe from this; it seems to be in good hands with @Samoed. I will happily do a review once @Samoed is happy.

@KennethEnevoldsen KennethEnevoldsen removed their request for review February 10, 2025 14:00
Comment on lines 183 to 200
merged_queries = {}
merged_corpus = {}
merged_relevant = {}
for domain in self.queries:
    if split in self.queries[domain]:
        merged_queries.update(self.queries[domain][split])
    for key, value in self.queries[domain].items():
        if key.startswith(split) and key != split:
            merged_queries.update(value)
for domain in self.corpus:
    if split in self.corpus[domain]:
        merged_corpus.update(self.corpus[domain][split])
for domain in self.relevant_docs:
    if split in self.relevant_docs[domain]:
        merged_relevant.update(self.relevant_docs[domain][split])
    for key, value in self.relevant_docs[domain].items():
        if key.startswith(split) and key != split:
            merged_relevant.update(value)
Collaborator:
I don't think we need to merge all the queries, corpus, etc. I think you should keep the corpus and queries per domain.

Author:
Okay, fixed in the latest commit.

Comment on lines 145 to 164
if corpus_file.exists():
    with open(corpus_file, encoding="utf-8") as f:
        self.corpus[domain][split] = dict(
            line.strip().split("\t", 1) for line in f if line.strip()
        )
elif metadata_file.exists():
    corpus = {}
    with open(metadata_file, encoding="utf-8") as f:
        for line in f:
            try:
                obj = json.loads(line)
                doc_id = obj.get("pid") or obj.get("id")
                text = obj.get("text") or obj.get("body")
                if doc_id and text:
                    corpus[doc_id] = text
            except Exception as e:
                logger.error(f"Error parsing {metadata_file}: {e}")
    self.corpus[domain][split] = corpus
else:
    logger.warning(f"No corpus file found for {domain} {split}.")
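
For reference, the TSV branch above maps tab-separated `pid<TAB>text` lines into an id-to-text dict; a self-contained check of that expression with invented sample lines:

```python
# Invented sample lines in the collection.tsv format assumed above.
lines = ["doc1\tHow to sharpen a pencil", "doc2\tBest hiking trails", ""]

# Same dict-comprehension as the TSV branch: skip blank lines,
# split each line on the first tab into (doc_id, text).
corpus = dict(line.strip().split("\t", 1) for line in lines if line.strip())
```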
Collaborator:
I tried to run the task, but an error occurred:

task.queries["writing"].keys() # dict_keys(['test', 'test.forum'])
task.relevant_docs["writing"].keys() # dict_keys(['test', 'test.forum'])
task.corpus["writing"].keys() # dict_keys(['test'])

MTEB expects all the data for a task run to be in one split, but the corpus currently has a different naming scheme. I think we should change the domains to writing.search and writing.forum to align with the MTEB approach. What do you think?

Author:
That's why I had earlier merged them. We now load data per domain without merging the "search" and "forum" items into one key. Instead, for each domain we create separate sub-dictionaries for "search" and "forum" queries (and similarly for qrels):

{
    "corpus": { "writing": { ... }, "recreation": { ... }, ... },
    "queries": {
        "writing": { "search": { ... }, "forum": { ... } },
        "recreation": { "search": { ... }, "forum": { ... } },
        ...
    },
    ...
}

@Samoed (Collaborator) commented Feb 12, 2025:
You should create them like this:

  • corpus: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
  • queries: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
  • relevant_docs: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
    because mteb can't handle nested dicts
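
The flattening described above can be done generically; a sketch under the assumed nested shape `{domain: {section: items}}` (helper name invented):

```python
def flatten_domains(nested):
    """Flatten {"writing": {"search": {...}, "forum": {...}}} into
    {"writing_search": {...}, "writing_forum": {...}}, since MTEB expects
    a flat mapping rather than nested per-domain dicts."""
    flat = {}
    for domain, sections in nested.items():
        for section, items in sections.items():
            flat[f"{domain}_{section}"] = items
    return flat
```

Applying this to queries, corpus, and relevant_docs alike keeps the three structures keyed identically, which is what the evaluation loop needs.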

Author:
Okay, updated that.

Successfully merging this pull request may close these issues.

Add LOTTE
3 participants