Add LoTTE Benchmark to MTEB #2009
base: main
Conversation
Thanks for the PR. Added a few suggestions.
@@ -1434,3 +1435,18 @@ def load_results(
    url={https://arxiv.org/abs/2412.08329},
}""",
)

MTEB_LOTTE = Benchmark(
This will just appear as an empty leaderboard. We would probably want at least some models evaluated on it before adding it to the leaderboard (otherwise it will seem like a bug).
I tried evaluating some models on Colab, but the dataset is huge, so it takes a long time to load and evaluate. I tried chunking it down to verify that it's working, but I'm not able to run the entire benchmark.
Is this fine in the current version?
Hmm, should we maybe consider downsampling the dataset then? Not really worth adding a benchmark if people can't run it?
This is not a very large dataset: it has 5 splits of approximately 30 MB each, as available at https://huggingface.co/datasets/colbertv2/lotte. However, the data in the tar file may differ, since that file is approximately 2 GB.
The dataset itself is fine, but sentence-transformers takes too long to encode it, even on Colab. Can you run this on your end and let me know if it works?
When I'm trying to run your task, I receive:

>>> mteb.get_task("LoTTE").load_data()
{'queries': {'test': {}}, 'corpus': {'test': {}}, 'relevant_docs': {'test': {}}}

Can you please run a check to make sure everything is working correctly? The data is not loading at the moment.
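A quick recursive sanity check would catch these empty splits before a full evaluation run. This is a hypothetical helper sketch, not part of mteb; `assert_nonempty` and the sample dict are illustrative:

```python
def assert_nonempty(data, path="root"):
    """Raise ValueError if any nested dict is empty (illustrative helper)."""
    if isinstance(data, dict):
        if not data:
            raise ValueError(f"Empty split at {path}")
        for key, value in data.items():
            assert_nonempty(value, f"{path}.{key}")

# The load_data() output shown above would fail this check:
loaded = {"queries": {"test": {}}, "corpus": {"test": {}}, "relevant_docs": {"test": {}}}
try:
    assert_nonempty(loaded)
    data_ok = True
except ValueError:
    data_ok = False
```

Running something like this right after `load_data()` makes a silent empty-corpus bug fail loudly instead of producing degenerate scores.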
Also, when I try to run it, I receive an error:

TypeError: can only concatenate str (not "dict") to str

Can you start a run to validate that your data is loading correctly? If you have low RAM, you can do this on Kaggle/Colab.
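One plausible cause of that TypeError (an assumption on my part, not confirmed from the traceback): MTEB retrieval corpora commonly store each document as a dict with `title`/`text` fields, and string concatenation against such a dict reproduces the error exactly. A minimal sketch:

```python
# A corpus entry stored as a dict, as MTEB-style retrieval corpora often are:
doc = {"title": "", "text": "How do I fix my swing?"}

try:
    prompt = "passage: " + doc  # fails: str + dict
    error_message = ""
except TypeError as e:
    error_message = str(e)  # "can only concatenate str (not \"dict\") to str"

# Concatenating the text field instead works:
fixed_prompt = "passage: " + doc["text"]
```

If this is the cause, the loader should either emit plain strings or the task should unwrap the `text` field before encoding.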
merged_queries = {}
merged_corpus = {}
merged_relevant = {}
for domain in self.queries:
    if split in self.queries[domain]:
        merged_queries.update(self.queries[domain][split])
    for key, value in self.queries[domain].items():
        if key.startswith(split) and key != split:
            merged_queries.update(value)
for domain in self.corpus:
    if split in self.corpus[domain]:
        merged_corpus.update(self.corpus[domain][split])
for domain in self.relevant_docs:
    if split in self.relevant_docs[domain]:
        merged_relevant.update(self.relevant_docs[domain][split])
    for key, value in self.relevant_docs[domain].items():
        if key.startswith(split) and key != split:
            merged_relevant.update(value)
I don't think we need to merge all the queries, corpus, etc. I think you should keep the corpus and queries per domain.
Okay, fixed in the latest commit.
if corpus_file.exists():
    with open(corpus_file, encoding="utf-8") as f:
        self.corpus[domain][split] = dict(
            line.strip().split("\t", 1) for line in f if line.strip()
        )
elif metadata_file.exists():
    corpus = {}
    with open(metadata_file, encoding="utf-8") as f:
        for line in f:
            try:
                obj = json.loads(line)
                doc_id = obj.get("pid") or obj.get("id")
                text = obj.get("text") or obj.get("body")
                if doc_id and text:
                    corpus[doc_id] = text
            except Exception as e:
                logger.error(f"Error parsing {metadata_file}: {e}")
    self.corpus[domain][split] = corpus
else:
    logger.warning(f"No corpus file found for {domain} {split}.")
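The TSV branch above relies on `split("\t", 1)`, where the maxsplit of 1 matters: only the first tab separates the doc id from the text, so tabs inside a passage stay in the text. A small sketch of that behavior (the sample lines are made up):

```python
# "doc_id<TAB>text" lines, one containing an extra tab inside the passage:
lines = ["d1\tFirst passage", "d2\tSecond passage\twith a tab inside", ""]

# maxsplit=1 keeps later tabs as part of the text; blank lines are skipped:
corpus = dict(line.strip().split("\t", 1) for line in lines if line.strip())
```

One caveat worth guarding against: a non-blank line with no tab at all would make `dict()` raise a ValueError, so malformed corpus files fail the whole load rather than being skipped.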
I tried to run the task, but an error occurred.

task.queries["writing"].keys() # dict_keys(['test', 'test.forum'])
task.relevant_docs["writing"].keys() # dict_keys(['test', 'test.forum'])
task.corpus["writing"].keys() # dict_keys(['test'])

MTEB expects all data for a task run to be in one split, but right now the corpus uses a different naming scheme than the queries. I think we should change the domains to writing.search and writing.forum to align with the MTEB approach. What do you think?
That's why I had merged them earlier. We now load data per domain without merging the "search" and "forum" items into one key. Instead, for each domain we create separate sub-dictionaries for "search" and "forum" queries (and similarly for qrels):

{
  "corpus": { "writing": { ... }, "recreation": { ... }, ... },
  "queries": {
    "writing": { "search": { ... }, "forum": { ... } },
    "recreation": { "search": { ... }, "forum": { ... } },
    ...
  },
}
You should create them like this:
- corpus: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
- queries: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
- relevant_docs: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
because MTEB can't handle nested dicts.
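The suggested restructuring above amounts to flattening the nested `domain -> split -> data` layout into single `domain_split` keys. A sketch of that transformation, assuming the nested layout from the earlier comment (`flatten_splits` is an illustrative name, not an mteb API):

```python
def flatten_splits(nested):
    """Flatten {"writing": {"search": {...}, "forum": {...}}}
    into {"writing_search": {...}, "writing_forum": {...}}."""
    flat = {}
    for domain, splits in nested.items():
        for split_name, data in splits.items():
            flat[f"{domain}_{split_name}"] = data
    return flat

queries = {"writing": {"search": {"q1": "..."}, "forum": {"q2": "..."}}}
flat_queries = flatten_splits(queries)
# keys are now "writing_search" and "writing_forum" at the top level
```

Applying the same flattening to corpus, queries, and relevant_docs keeps all three aligned on the same top-level keys, which is what the per-split evaluation loop needs.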
Okay, updated that.
Description:
This PR integrates the LoTTE (Long-Tail Topic-stratified Evaluation for IR) benchmark into MTEB. LoTTE consists of domain-specific retrieval tasks derived from StackExchange and GooAQ, evaluating models on natural, information-seeking queries in long-tail topics.
Closes #1836
Changes:
Testing:
✅ Verified that LoTTERetrieval runs successfully with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 and intfloat/multilingual-e5-small.
✅ Ensured scores are neither trivial (near 100%) nor random (near 0%).
✅ Passed make test and make lint.
Notes:
Updates: