v2 Dataset Overview Issue #194

KennethEnevoldsen · 2025-02-14T13:56:57Z

An overview of datasets to add in the new version

New datasets

Danish:

Applied:
- Municipal chatbot: MUNI (already added)
Historical:
- historical clustering (already added). Potentially also Author style clustering? #144
- historical dataset which I discussed with Alie
  Potential bilingual English-Danish parallel corpus within the medical domain #186
Linguistic Acceptability:
- Potential datasets to add from danish-semantic-reasoning-benchmark #172
- Add DDisco #169 (already added in mteb)
Add 1000 talemaader
- Could we add it as a pair classification dataset where the correct def. is correct and the abstract def. and random def. are wrong? (We will probably have to check "konkrete fejlfortolkninger", i.e. the concrete misinterpretations.)
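
A minimal sketch of how the pair classification idea above could look, assuming each idiom entry carries one correct definition and a list of wrong ones. The key names (`idiom`, `correct`, `distractors`), the output field names, and the example entry are hypothetical and not taken from the actual 1000 talemaader data.

```python
def build_pair_classification(entries):
    """Turn idiom entries into labelled (idiom, definition) pairs.

    Each entry is a dict with hypothetical keys:
      "idiom": the idiom text
      "correct": the correct definition
      "distractors": list of wrong definitions (e.g. abstract or random ones)
    """
    pairs = []
    for entry in entries:
        # The correct definition forms the single positive pair.
        pairs.append(
            {"sentence1": entry["idiom"], "sentence2": entry["correct"], "label": 1}
        )
        # Every wrong definition forms a negative pair.
        for wrong in entry["distractors"]:
            pairs.append(
                {"sentence1": entry["idiom"], "sentence2": wrong, "label": 0}
            )
    return pairs


# Made-up example entry for illustration only.
example = [
    {
        "idiom": "at slå to fluer med ét smæk",
        "correct": "to achieve two things with a single action",
        "distractors": [
            "to swat insects in the kitchen",
            "to arrive late to a meeting",
        ],
    }
]
pairs = build_pair_classification(example)
```

With two wrong definitions per idiom the labels are imbalanced (one positive to two negatives), which may be worth keeping in mind when picking a metric.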

Norwegian:

Legal dataset: Mail communication with Hans
Check whether there are datasets to add from here: #142

Multilingual:

Add ScandiSent #151 (language ids might not be great)
Potentially FAQs from: fix: Add WebFAQ Retrieval dataset embeddings-benchmark/mteb#2236

Remove

Da Political Comments: Quality is questionable and there is no clear paper attached to it. Similar to DKHate
Massive Intent: Translated dataset (MUNI is a strictly more realistic evaluation set)
Massive Scenario: Translated dataset
Potentially remove
- LCC: few samples, and labelled by only one person.
- Twitterhjerne: Questionable quality, we can probably replace it with reasonable retrieval alternatives from MTEB

Other Updates

Replace datasets with their improved/faster variants in MTEB
Downsample datasets where needed (at least the largest ones)
See if there are relevant datasets within MTEB that can be added
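
For the downsampling point, a rough sketch of a reproducible approach using only the standard library. The function name, the cap of 2048 examples, and the seed are illustrative assumptions, not values agreed on in this thread.

```python
import random


def downsample(examples, max_size=2048, seed=42):
    """Return at most max_size examples, chosen reproducibly.

    max_size and seed are illustrative defaults, not agreed values.
    """
    if len(examples) <= max_size:
        # Small datasets are returned unchanged (as a new list).
        return list(examples)
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    # Sample indices without replacement, then sort to keep the original order.
    indices = sorted(rng.sample(range(len(examples)), max_size))
    return [examples[i] for i in indices]
```

Fixing the seed means every run evaluates on the same subset, so scores stay comparable across models.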

michaeldinzinger · 2025-03-07T12:10:21Z

Hi, the FAQ dataset mentioned above (WebFAQ), which could be interesting for the SEB, includes:

Language	# QA pairs	# Test
swe	159k	10k
dan	138k	10k
nor	63.2k	6324
isl	4778	478

As mentioned in the MTEB PR for this dataset, it is a Retrieval dataset. The QA pairs were extracted from FAQ pages on the web, and language identification was done with fastText using a confidence threshold.
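
To illustrate what filtering with a confidence threshold means here, a hedged sketch follows. It does not call fastText; `identify` is a hypothetical callable returning a `(language, confidence)` tuple, `toy_identify` is a stand-in for a real model, and the threshold of 0.9 is an assumed value, since the actual threshold is not stated in this thread.

```python
def filter_by_language(qa_pairs, identify, target_lang, threshold=0.9):
    """Keep QA pairs identified as target_lang with confidence >= threshold.

    identify is a hypothetical callable: text -> (language code, confidence).
    The default threshold is an assumption, not the value used for WebFAQ.
    """
    kept = []
    for question, answer in qa_pairs:
        lang, confidence = identify(question + " " + answer)
        if lang == target_lang and confidence >= threshold:
            kept.append((question, answer))
    return kept


def toy_identify(text):
    """Stand-in for a real language identifier, for illustration only."""
    if "hvad" in text.lower():
        return ("dan", 0.95)
    return ("eng", 0.50)


sample = [
    ("Hvad koster det?", "Det koster 100 kr."),
    ("What does it cost?", "It costs 100 kr."),
]
danish_only = filter_by_language(sample, toy_identify, "dan")
```

A higher threshold trades dataset size for cleaner language labels, which matters most for closely related languages such as Danish and Norwegian.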

The FAQ dataset also includes FAQs aligned between languages, e.g., for dan-nor. Perhaps that is interesting for the benchmark as well. (The BitextMining task for this dataset is not yet on MTEB, but soon, I hope.)

KennethEnevoldsen · 2025-03-07T15:27:46Z

This is great! Thanks for sharing @michaeldinzinger - the question bitext dataset is probably also a good addition to MTEB. From my examination, the questions at least look quite reasonable.

I really like the project btw. I could imagine that you could do a project adapting this dataset to a multilingual instruction-tuning dataset (basically by removing any sample where the answer is not commonly known, which is probably a good portion).

KennethEnevoldsen added the dataset new dataset to add label Feb 14, 2025

KennethEnevoldsen pinned this issue Feb 14, 2025

KennethEnevoldsen added the v2 label Feb 14, 2025

KennethEnevoldsen mentioned this issue Mar 4, 2025

fix: Add WebFAQ Retrieval dataset embeddings-benchmark/mteb#2236

Merged

22 tasks
