
fix: Add BRIGHT (long) #2041

Merged

@KennethEnevoldsen commented Feb 12, 2025

fixed #1978

- Fixed bug in validate_and_filter()
  - Added tests for this
- Added BRIGHT (long)
- Fixed BRIGHT to only use the standard split

@x-tabdeveloping it seems like the filtering does not check for splits, which seems a bit worrying to me:

Results from BRIGHT (long):
[screenshot: BRIGHT (long) leaderboard results, 2025-02-12]

Results from BRIGHT:
[screenshot: BRIGHT leaderboard results, 2025-02-12]


I did a bit of debugging and have now fixed it in validate_and_filter().
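
Conceptually, the fix boils down to dropping any splits in a TaskResult that are not in the task's eval_splits before scoring. A minimal sketch of the idea (the helper name and score layout here are assumptions for illustration, not the actual mteb implementation):

# Hypothetical helper illustrating the idea behind the fix; not the mteb method itself.
# Assumes scores are stored as a mapping from split name to a list of score dicts.
def filter_scores_to_eval_splits(
    scores: dict[str, list[dict]], eval_splits: list[str]
) -> dict[str, list[dict]]:
    """Keep only the splits the task is actually evaluated on."""
    return {
        split: split_scores
        for split, split_scores in scores.items()
        if split in eval_splits
    }

# e.g. filter_scores_to_eval_splits({"standard": [...], "long": [...]}, ["long"])
# -> {"long": [...]}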

You can see the difference before and after the fix below:

import mteb

task = mteb.get_task("BrightRetrieval")
res = mteb.load_results(tasks=[task])
res_filtered = res.filter_models().join_revisions()
[m for m in res_filtered.model_results if m.model_name == "GritLM/GritLM-7B"][
    0
].task_results[0].get_score()
# 0.310709 # expected


bench: mteb.Benchmark = mteb.get_benchmark("BRIGHT")
bright_results = bench.load_results(res)

[m for m in bright_results.model_results if m.model_name == "GritLM/GritLM-7B"][
    0
].task_results[0].get_score()
# before: 0.310709 # both splits
# after: 0.20627749999999997 # only standard split

bench_long: mteb.Benchmark = mteb.get_benchmark("BRIGHT (long)")
bright_long_results = bench_long.load_results(res)

[m for m in bright_long_results.model_results if m.model_name == "GritLM/GritLM-7B"][
    0
].task_results[0].get_score()
# before fix: 0.310709 # problem here
# after fix: 0.46735625 # intended 


model_res = [m for m in res.model_results if m.model_name == "GritLM/GritLM-7B"][0]
bright_task_res = model_res.task_results[0]

bright_task_res.get_score()
# 0.310709 # expected

bright_task_res.get_score(splits = ["long"])
# 0.46735625 # match score on results not on legacy leaderboard
# using recall at 1 as leaderboard
bright_task_res.get_score(splits = ["long"], getter = lambda x: x["recall_at_1"])
# matches leaderboard

bright_task_res.get_score(splits = ["standard"])
# 0.20627749999999997 # match score on BRIGHT leaderboard


filtered_task = bright_task_res.validate_and_filter_scores(task=bench_long.tasks[0])

bench_long.tasks[0].eval_splits  # ['long']
filtered_task.scores.keys()
# before fix: dict_keys(['standard', 'long']) !?
# after fix: dict_keys(['long'])
filtered_task.get_score()
# after fix: 0.46735625
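
The added test covers roughly this behaviour; here is an illustrative pytest-style sketch (the test name and assertions are mine, not the test that landed in the PR):

import mteb


def test_validate_and_filter_scores_respects_eval_splits():
    # Filtering a TaskResult against a task should drop splits that are not
    # in the task's eval_splits (e.g. 'standard' for BRIGHT (long)).
    task = mteb.get_task("BrightRetrieval")
    res = mteb.load_results(tasks=[task]).filter_models().join_revisions()
    model_res = [m for m in res.model_results if m.model_name == "GritLM/GritLM-7B"][0]
    task_res = model_res.task_results[0]

    bench_long = mteb.get_benchmark("BRIGHT (long)")
    long_task = bench_long.tasks[0]

    filtered = task_res.validate_and_filter_scores(task=long_task)
    assert set(filtered.scores.keys()) == set(long_task.eval_splits)  # {'long'}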

@KennethEnevoldsen commented:

Both leaderboards in the new version:

[screenshots: BRIGHT and BRIGHT (long) leaderboards in the new version, 2025-02-13]

@KennethEnevoldsen commented:

Discussed this with @x-tabdeveloping in person and everything seems good.

The failing test is due to a connection error on HF's side, not something on our end, so I will merge this.

@KennethEnevoldsen merged commit 3537223 into main on Feb 13, 2025
7 of 8 checks passed
@KennethEnevoldsen deleted the KennethEnevoldsen/issue-Leaderboard-BRIGHT-Long-gone branch on February 13, 2025 at 14:45