Add Gene-MTEB tasks #1959

shangshang-wang · 2025-02-04T19:11:26Z

Closing #1781

Code Quality

Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: To add benchmarks tailored to metagenomic analysis, with tasks covering classification, multi-label classification, and clustering based on publicly available metagenomic data samples.

I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
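
As a rough illustration of what stratified subsampling does, here is a standalone sketch in plain Python. It is not mteb's `stratified_subsampling` implementation; the function name `stratified_subsample` and its parameters are hypothetical and exist only for this example. The idea is that each label keeps roughly its original share of the reduced sample.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples, seed=42):
    """Return sorted indices of a subsample that roughly preserves label proportions.

    Hypothetical helper for illustration only; not part of mteb.
    Labels must be mutually comparable (e.g. all ints or all strings).
    """
    if n_samples >= len(labels):
        return list(range(len(labels)))

    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    fraction = n_samples / len(labels)
    chosen = []
    for label in sorted(by_label):
        indices = by_label[label]
        # Keep at least one example per label so no class disappears.
        # Because of rounding, the total may differ slightly from n_samples.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)
```

With 80 examples of one label and 20 of another, asking for 10 samples keeps 8 and 2 respectively, so the 80/20 balance survives the reduction.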

Samoed

Great additions!

Samoed · 2025-02-04T19:16:56Z

mteb/tasks/Classification/metagene/HumanMicrobiomeProjectDemonstrationClassification.py

+        dataset={
+            "path": "metagene-ai/HumanMicrobiomeProjectDemonstration",
+            "name": "disease",
+            "revision": "main",


You need to specify the exact revision of your dataset.
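
A pinned revision is normally a full commit hash from the dataset repository rather than a branch name such as "main", so the data cannot change underneath the task. A minimal sketch of a check for this, assuming the revision is a 40-character hexadecimal Git commit SHA (the helper name `is_pinned_revision` is hypothetical and not part of mteb):

```python
import re

# A full Git commit SHA-1 is 40 lowercase hexadecimal characters.
_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision: str) -> bool:
    """Return True if `revision` looks like a full commit hash, not a branch or tag."""
    return _COMMIT_SHA.fullmatch(revision) is not None


# A branch name is a moving target and is rejected.
print(is_pinned_revision("main"))
# A placeholder 40-character hex string (not a real commit) is accepted.
print(is_pinned_revision("0" * 40))
```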

Samoed · 2025-02-04T19:17:49Z

README.md

I think this is from your fork. Please update the README.

Samoed · 2025-02-04T19:20:18Z

mteb/tasks/Classification/metagene/HumanMicrobiomeProjectDemonstrationClassification.py

+        test_size = 1 - (desired_train_samples / M)
+        split_datasets = full_train_dataset.train_test_split(
+            test_size=test_size,
+            shuffle=True,
+            seed=42)
+        new_train_dataset = split_datasets['train']
+        new_test_dataset = split_datasets['test']
+        self.dataset = datasets.DatasetDict({
+            'train': new_train_dataset,
+            'test': new_test_dataset
+        })


Can you upload the dataset with these splits already created?

Or can you create a function in this file that would be used in all tasks?

I would definitely upload it directly (unless there is a very strong reason for doing otherwise).
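
If the splits stay in code rather than being uploaded, the shared-function option could look roughly like the sketch below. The name `split_train_test` is hypothetical. The argument only needs to support `len()` and a `train_test_split(test_size=..., shuffle=..., seed=...)` method, which is the call already used in the diff above.

```python
def split_train_test(full_train_dataset, desired_train_samples, seed=42):
    """Split one dataset into train/test so that train has `desired_train_samples` rows.

    Hypothetical shared helper; `full_train_dataset` must provide `len()` and
    `train_test_split(test_size=..., shuffle=..., seed=...)`.
    """
    total = len(full_train_dataset)
    if not 0 < desired_train_samples < total:
        raise ValueError(
            f"desired_train_samples must be between 1 and {total - 1}, "
            f"got {desired_train_samples}"
        )

    # Same arithmetic as in the PR: the test share is whatever train does not use.
    test_size = 1 - (desired_train_samples / total)
    splits = full_train_dataset.train_test_split(
        test_size=test_size,
        shuffle=True,
        seed=seed,
    )
    return {"train": splits["train"], "test": splits["test"]}
```

Each task could then wrap the result, for example `self.dataset = datasets.DatasetDict(split_train_test(full_train_dataset, desired_train_samples))`, instead of repeating the split logic per file.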

Samoed · 2025-02-04T19:39:45Z

mteb/tasks/Classification/metagene/HumanMicrobiomeProjectDemonstrationClassification.py

+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        self.method = "logReg"


Can your tasks only be run with logReg?

Samoed · 2025-02-04T19:40:04Z

mteb/tasks/Classification/metagene/HumanMicrobiomeProjectDemonstrationClassification.py

+        if self.data_loaded:
+            return
+
+        from transformers.trainer_utils import set_seed


Can you move the imports to the start of the file?

Samoed · 2025-02-04T19:40:46Z

mteb/tasks/Clustering/metagene/HumanMicrobiomeProjectReferenceClustering.py

+            'train': new_train_dataset,
+            'test': new_test_dataset
+        })
+        print(f"\nSplitting the data with test_size={test_size}")


Please remove the prints or change them to logging.
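
A minimal sketch of the logging variant, using only the standard library. The function name `log_split` is hypothetical and merely wraps the message from the diff above:

```python
import logging

# Module-level logger, created once at the top of the file.
logger = logging.getLogger(__name__)


def log_split(test_size: float) -> None:
    """Report the split ratio through logging instead of print()."""
    # %-style arguments are only formatted if the INFO level is enabled.
    logger.info("Splitting the data with test_size=%s", test_size)
```

Unlike `print`, the message can be silenced or redirected by whoever runs the benchmark, by configuring the logging level or handlers.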

KennethEnevoldsen · 2025-02-04T21:16:28Z

mteb/tasks/Classification/metagene/HumanMicrobiomeProjectDemonstrationClassification.py

+        main_score="accuracy",
+        date=("2009-10-09", "2012-11-22"),
+        domains=["Medical"],
+        task_subtypes=None,


Suggested change

-        task_subtypes=None,
+        task_subtypes=[],

KennethEnevoldsen · 2025-02-04T21:17:19Z

mteb/tasks/Classification/metagene/HumanMicrobiomeProjectDemonstrationClassification.py

+        date=("2009-10-09", "2012-11-22"),
+        domains=["Medical"],
+        task_subtypes=None,
+        license="not specified",


No license? This can really limit how usable the task is.

KennethEnevoldsen · 2025-02-04T21:18:16Z

mteb/tasks/Classification/metagene/HumanMicrobiomeProjectDemonstrationClassification.py

+
+class HumanMicrobiomeProjectDemonstrationClassificationDisease(AbsTaskClassification):
+    metadata = TaskMetadata(
+        name="HumanMicrobiomeProjectDemonstrationClassificationDisease",


This is a very long title. It will appear very odd in the leaderboard. Would it be worth shortening it?

KennethEnevoldsen · 2025-02-04T21:19:37Z

mteb/tasks/Classification/metagene/HumanMicrobiomeProjectDemonstrationClassification.py

+        test_size = 1 - (desired_train_samples / M)
+        split_datasets = full_train_dataset.train_test_split(
+            test_size=test_size,
+            shuffle=True,
+            seed=42)
+        new_train_dataset = split_datasets['train']
+        new_test_dataset = split_datasets['test']
+        self.dataset = datasets.DatasetDict({
+            'train': new_train_dataset,
+            'test': new_test_dataset
+        })


I would definitely upload it directly (unless there is a very strong reason for doing otherwise).

KennethEnevoldsen · 2025-02-05T07:42:48Z

mteb/tasks/Classification/metagene/HumanMicrobiomeProjectDemonstrationClassification.py

+        },
+        type="Classification",
+        category="s2s",
+        modalities=["text"],


A text modality seems somewhat misleading. The input is a string of base pairs, right?

Keeping "text" here might be fine, but it should be clear from the description. I would also make it clear in the task subtypes.

UpUp Ashton Wang and others added 9 commits December 20, 2024 05:14

first add classification and clustering tasks for METAGENE-1

cbc474b

remove mini test classification task

a539481

use huggingface data path

a5f8eda

Clean up README.md

816ede3

rename multi-label classification task

f9a8b36

Update README.md

fd36575

Update README.md

d465435

Add descriptions and references for metagene tasks

98a77cd

add task metadata

cdfacab

Samoed reviewed Feb 4, 2025


KennethEnevoldsen reviewed Feb 5, 2025
