Add Gene-MTEB tasks #1959

Open. Wants to merge 9 commits into base: main

Conversation


@shangshang-wang shangshang-wang commented Feb 4, 2025

Closing #1781

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: To add benchmarks tailored to metagenomic analysis, with classification, multi-label classification, and clustering tasks based on publicly available metagenomic data samples.

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

@Samoed (Collaborator) left a comment

Great additions!

dataset={
"path": "metagene-ai/HumanMicrobiomeProjectDemonstration",
"name": "disease",
"revision": "main",
Collaborator

You need to specify the exact revision of your dataset.
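One way to act on this request, as a minimal sketch: check that the "revision" field holds a full commit hash rather than a moving branch name such as "main". The helper name is_pinned_revision is hypothetical and not part of mteb. It assumes revisions are pinned as 40-character Git commit hashes.

```python
import re

# A full Git commit hash is 40 lowercase hexadecimal characters.
# Branch names such as "main" can move, so they do not pin a dataset version.
_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision: str) -> bool:
    """Return True if `revision` looks like a full commit hash (hypothetical helper)."""
    return _COMMIT_SHA.fullmatch(revision) is not None


# The commit hash itself can be looked up on the dataset's Hub page,
# or (assuming huggingface_hub is installed) with:
#   from huggingface_hub import HfApi
#   HfApi().dataset_info("metagene-ai/HumanMicrobiomeProjectDemonstration").sha
```

With this check, "main" is rejected and any full commit hash is accepted.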

Collaborator

I think this is from your fork. Please update the README.

Comment on lines +70 to +80
test_size = 1 - (desired_train_samples / M)
split_datasets = full_train_dataset.train_test_split(
    test_size=test_size,
    shuffle=True,
    seed=42)
new_train_dataset = split_datasets['train']
new_test_dataset = split_datasets['test']
self.dataset = datasets.DatasetDict({
    'train': new_train_dataset,
    'test': new_test_dataset
})
Collaborator

Can you upload the dataset with these splits already created?

Collaborator

Or can you create a function in this file that is used by all of the tasks?

Contributor

I would definitely upload it directly (unless there is a very strong reason for doing otherwise).
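If the splits are not uploaded directly, the shared-function alternative suggested above could look roughly like the sketch below. The name make_fixed_split is hypothetical. The sketch assumes the input behaves like a Hugging Face datasets.Dataset, and it passes train_size as an absolute count instead of the fractional test_size computed in the diff.

```python
from typing import Any, Dict


def make_fixed_split(
    full_train: Any, desired_train_samples: int, seed: int = 42
) -> Dict[str, Any]:
    """Split a dataset-like object into fixed-size train/test parts.

    Hypothetical helper: `full_train` is assumed to support len() and a
    `train_test_split(train_size=..., shuffle=..., seed=...)` method, as
    datasets.Dataset does.
    """
    total = len(full_train)
    if not 0 < desired_train_samples < total:
        raise ValueError(
            f"desired_train_samples must be between 1 and {total - 1}, "
            f"got {desired_train_samples}"
        )
    # A fixed seed keeps the split reproducible across runs and tasks.
    parts = full_train.train_test_split(
        train_size=desired_train_samples, shuffle=True, seed=seed
    )
    return {"train": parts["train"], "test": parts["test"]}
```

Inside a task, the returned mapping could then be wrapped as datasets.DatasetDict(make_fixed_split(full_train_dataset, desired_train_samples)), so every task shares one code path for the split.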

Comment on lines +40 to +42
def __init__(self, **kwargs):
    super().__init__(**kwargs)
    self.method = "logReg"
Collaborator

Can your tasks only be run with logReg?

if self.data_loaded:
    return

from transformers.trainer_utils import set_seed
Collaborator

Can you move the imports to the start of the file?

'train': new_train_dataset,
'test': new_test_dataset
})
print(f"\nSplitting the data with test_size={test_size}")
Collaborator

Please remove the prints or change them to logging.
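As a minimal sketch of the logging alternative, the print from the diff could be replaced with a module-level logger from Python's standard logging module. The function name report_split is hypothetical and only wraps the call for illustration.

```python
import logging

# Module-level logger: callers decide whether and where messages are shown.
logger = logging.getLogger(__name__)


def report_split(test_size: float) -> None:
    """Log the split ratio instead of printing it (hypothetical wrapper)."""
    # Lazy %-style formatting defers string building until the record is emitted.
    logger.info("Splitting the data with test_size=%s", test_size)
```

Unlike print, the message stays silent unless the application enables INFO-level logging, for example with logging.basicConfig(level=logging.INFO).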

main_score="accuracy",
date=("2009-10-09", "2012-11-22"),
domains=["Medical"],
task_subtypes=None,
Contributor

Suggested change
- task_subtypes=None,
+ task_subtypes=[],

date=("2009-10-09", "2012-11-22"),
domains=["Medical"],
task_subtypes=None,
license="not specified",
Contributor

No license? This can really limit how usable the task is.


class HumanMicrobiomeProjectDemonstrationClassificationDisease(AbsTaskClassification):
metadata = TaskMetadata(
name="HumanMicrobiomeProjectDemonstrationClassificationDisease",
Contributor

This is a very long title. It will appear very odd in the leaderboard. Would it be worth shortening it?

},
type="Classification",
category="s2s",
modalities=["text"],
Contributor

A text modality seems somewhat misleading. The input is a string of base pairs, right?

Contributor

Keeping "text" here might be fine, but it should be clear from the description. I would also make it clear in the task subtypes.
