v2 Dataset Overview Issue #194

Open
KennethEnevoldsen opened this issue Feb 14, 2025 · 2 comments
Labels
dataset new dataset to add v2

Comments

@KennethEnevoldsen
Owner

KennethEnevoldsen commented Feb 14, 2025

An overview of datasets to add in the new version

New datasets

Danish:

Norwegian:

Multilingual:

Remove

  • Da Political Comments: Quality is questionable and there is no clear paper attached to it. Similar to DKHate
  • Massive Intent: Translated dataset (MUNI is a strictly more realistic evaluation set)
  • Massive Scenario: Translated dataset
  • Potentially remove
    • LCC: Few samples, and labelled by only one annotator.
    • Twitterhjerne: Questionable quality, we can probably replace it with reasonable retrieval alternatives from MTEB

Other Updates

  • Replace datasets with their improved/faster variants in MTEB
  • Downsample datasets where needed (at least the largest ones)
  • See if there are relevant datasets within MTEB that can be added
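
The downsampling item above could be sketched roughly as follows. This is only an illustration, not how SEB or MTEB actually does it: the function name `downsample`, the cap of 2048 in the usage note, and the seed are all assumptions made for the example.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def downsample(samples: Sequence[T], max_samples: int, seed: int = 42) -> list[T]:
    """Return at most `max_samples` items, drawn without replacement.

    `downsample`, `max_samples`, and `seed` are hypothetical names for this
    sketch; they are not taken from SEB or MTEB.
    """
    items = list(samples)
    if len(items) <= max_samples:
        # Small datasets are returned unchanged (as a new list).
        return items
    # A fixed seed keeps the subset identical across benchmark runs.
    rng = random.Random(seed)
    return rng.sample(items, max_samples)
```

Calling `downsample(test_split, 2048)` on a large test split would then yield the same 2048 examples on every run, which keeps scores comparable between models.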
@michaeldinzinger

Hi, the FAQ dataset mentioned above (WebFAQ), which could be interesting for the SEB, includes:

| Language | # QA pairs | # Test |
|----------|------------|--------|
| swe      | 159k       | 10k    |
| dan      | 138k       | 10k    |
| nor      | 63.2k      | 6324   |
| isl      | 4778       | 478    |

As mentioned in the MTEB PR for this dataset, it is a Retrieval dataset. The QA pairs were extracted from FAQ pages on the web, and language identification was done with fastText using a confidence threshold.
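
To illustrate what filtering by a language-identification confidence threshold looks like, here is a minimal sketch. It is not the WebFAQ pipeline: the threshold of 0.9 and the names `filter_by_language` and `toy_predict` are assumptions, and the fastText model is replaced by a toy stand-in so that the example runs without any model file.

```python
from typing import Callable, Iterable

# A prediction is a (language label, confidence in [0, 1]) pair.
Prediction = tuple[str, float]


def filter_by_language(
    texts: Iterable[str],
    predict: Callable[[str], Prediction],
    language: str,
    min_confidence: float = 0.9,  # assumed value; the real threshold is not stated
) -> list[str]:
    """Keep texts predicted as `language` with at least `min_confidence`."""
    kept = []
    for text in texts:
        label, confidence = predict(text)
        if label == language and confidence >= min_confidence:
            kept.append(text)
    return kept


def toy_predict(text: str) -> Prediction:
    """Hypothetical stand-in for a real language-identification model."""
    lowered = text.lower()
    if "hvad" in lowered:
        return ("dan", 0.97)
    if "vad" in lowered:
        return ("swe", 0.95)
    # Short or ambiguous text: low confidence, so it gets filtered out.
    return ("dan", 0.40)
```

In practice `predict` would wrap a real model's output; low-confidence samples such as very short answers are dropped rather than assigned to a language.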

The FAQ dataset also includes FAQs aligned between languages, e.g., for dan-nor. Perhaps that is interesting for the benchmark as well. (The BitextMining task for this dataset is not yet on MTEB, but I hope it will be soon.)

@KennethEnevoldsen
Owner Author

KennethEnevoldsen commented Mar 7, 2025

This is great! Thanks for sharing @michaeldinzinger - the question bitext dataset is probably also a good addition to MTEB. From my examination, the questions at least look quite reasonable.

I really like the project btw. I could imagine a project that adapts this dataset into a multilingual instruction-tuning dataset (basically by removing any sample where the answer is not commonly known, which is probably a good portion).
