Could we add this as a pair classification dataset, where the correct definition is labeled correct and the abstract and random definitions are labeled wrong? (We will probably have to check the "konkrete fejlfortolkninger", i.e. the concrete misinterpretations.)
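To make the proposal concrete, here is a minimal sketch of how such pairs could be built. The field names (`expression`, `correct_def`, `abstract_def`, `random_def`) and the `sentence1`/`sentence2`/`label` output layout are assumptions for illustration, not the actual columns of the dataset.

```python
from typing import TypedDict


class Entry(TypedDict):
    # Hypothetical field names; the real dataset columns may differ.
    expression: str
    correct_def: str
    abstract_def: str
    random_def: str


def build_pairs(entries: list[Entry]) -> list[dict]:
    """Turn each entry into three (expression, definition, label) pairs."""
    pairs = []
    for entry in entries:
        candidates = [
            (entry["correct_def"], 1),   # positive: the correct definition
            (entry["abstract_def"], 0),  # negative: the abstract definition
            (entry["random_def"], 0),    # negative: a random definition
        ]
        for definition, label in candidates:
            pairs.append(
                {
                    "sentence1": entry["expression"],
                    "sentence2": definition,
                    "label": label,
                }
            )
    return pairs
```

This gives one positive and two negatives per entry, so the labels are imbalanced 1:2. Whether the "konkrete fejlfortolkninger" should be added as a further negative would depend on the check mentioned above.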
Hi, the FAQ dataset mentioned above (WebFAQ), which could be interesting for the SEB, includes:

| Language | # QA pairs | # Test |
|----------|------------|--------|
| swe      | 159k       | 10k    |
| dan      | 138k       | 10k    |
| nor      | 63.2k      | 6324   |
| isl      | 4778       | 478    |

As mentioned in the MTEB PR for this dataset, it is a retrieval dataset. The QA pairs were extracted from FAQ pages on the web, and language identification was done with fastText using a confidence threshold.
The FAQ dataset also includes FAQs aligned between languages, e.g., for dan-nor. Perhaps that is interesting for the benchmark as well. (The BitextMining task for this dataset is not yet on MTEB, but soon, I hope.)
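For readers unfamiliar with confidence-thresholded language identification, a rough sketch of the filtering step follows. It assumes predictions are already available as `(label, confidence)` tuples with fastText-style `__label__xx` labels. The threshold of 0.9 and the function name `keep_confident` are hypothetical; the threshold actually used for WebFAQ is not stated here.

```python
def keep_confident(samples, predictions, target_lang, threshold=0.9):
    """Keep samples whose predicted language matches with enough confidence.

    `predictions` holds one (label, confidence) tuple per sample.
    The 0.9 default is an illustrative assumption, not the WebFAQ value.
    """
    kept = []
    for sample, (label, confidence) in zip(samples, predictions):
        # Strip the fastText-style label prefix, e.g. "__label__da" -> "da".
        lang = label.replace("__label__", "")
        if lang == target_lang and confidence >= threshold:
            kept.append(sample)
    return kept
```

A stricter threshold trades recall for precision, which matters most for closely related languages such as Danish and Norwegian.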
This is great! Thanks for sharing @michaeldinzinger - the question bitext dataset is probably also a good addition to MTEB. From my examination, the questions at least look quite reasonable.
I really like the project btw. I could imagine a project adapting this dataset into a multilingual instruction-tuning dataset (basically by removing any sample where the answer is not commonly known, which is probably a good portion).
An overview of datasets to add in the new version
New datasets
Danish:
Potential bilingual English-Danish parallel corpus within the medical domain #186
Norwegian:
Multilingual:
Remove
Other Updates