
fix: Add BRIGHT (long) #2041

Merged

@KennethEnevoldsen commented Feb 12, 2025

fixed #1978

- Fixed bug in validate_and_filter()
  - Added tests for this
- Added BRIGHT (long)
- Fixed BRIGHT to only use the standard split

@x-tabdeveloping it seems like the filtering does not check for splits, which seems a bit worrying to me:

Results from BRIGHT (long):
[screenshot: BRIGHT (long) leaderboard results, 2025-02-12]

Results from BRIGHT:
[screenshot: BRIGHT leaderboard results, 2025-02-12]


I did a bit of debugging and have now fixed it in validate_and_filter().
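
Conceptually, the fix boils down to dropping any splits in a TaskResult that are not in the task's eval_splits before scoring. A minimal sketch of the idea (the helper name and score layout here are assumptions for illustration, not the actual mteb implementation):

# Hypothetical helper illustrating the idea behind the fix; not the mteb method itself.
# Assumes scores are stored as a mapping from split name to a list of score dicts.
def filter_scores_to_eval_splits(
    scores: dict[str, list[dict]], eval_splits: list[str]
) -> dict[str, list[dict]]:
    """Keep only the splits the task is actually evaluated on."""
    return {
        split: split_scores
        for split, split_scores in scores.items()
        if split in eval_splits
    }

# e.g. filter_scores_to_eval_splits({"standard": [...], "long": [...]}, ["long"])
# -> {"long": [...]}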

You can see the difference before and after the fix below:

import mteb

task = mteb.get_task("BrightRetrieval")
res = mteb.load_results(tasks=[task])
res_filtered = res.filter_models().join_revisions()
[m for m in res_filtered.model_results if m.model_name == "GritLM/GritLM-7B"][
    0
].task_results[0].get_score()
# 0.310709 # expected


bench: mteb.Benchmark = mteb.get_benchmark("BRIGHT")
bright_results = bench.load_results(res)

[m for m in bright_results.model_results if m.model_name == "GritLM/GritLM-7B"][
    0
].task_results[0].get_score()
# before: 0.310709 # both splits
# after: 0.20627749999999997 # only standard split

bench_long: mteb.Benchmark = mteb.get_benchmark("BRIGHT (long)")
bright_long_results = bench_long.load_results(res)

[m for m in bright_long_results.model_results if m.model_name == "GritLM/GritLM-7B"][
    0
].task_results[0].get_score()
# before fix: 0.310709 # problem here
# after fix: 0.46735625 # intended 


model_res = [m for m in res.model_results if m.model_name == "GritLM/GritLM-7B"][0]
bright_task_res = model_res.task_results[0]

bright_task_res.get_score()
# 0.310709 # expected

bright_task_res.get_score(splits = ["long"])
# 0.46735625 # match score on results not on legacy leaderboard
# using recall at 1 as leaderboard
bright_task_res.get_score(splits = ["long"], getter = lambda x: x["recall_at_1"])
# matches leaderboard

bright_task_res.get_score(splits = ["standard"])
# 0.20627749999999997 # match score on BRIGHT leaderboard


filtered_task = bright_task_res.validate_and_filter_scores(task=bench_long.tasks[0])

bench_long.tasks[0].eval_splits  # ['long']
filtered_task.scores.keys()
# before fix: dict_keys(['standard', 'long']) !?
# after fix: dict_keys(['long'])
filtered_task.get_score()
# after fix: 0.46735625
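
The added test covers roughly this behaviour; here is an illustrative pytest-style sketch (the test name and assertions are mine, not the test that landed in the PR):

import mteb


def test_validate_and_filter_scores_respects_eval_splits():
    # Filtering a TaskResult against a task should drop splits that are not
    # in the task's eval_splits (e.g. 'standard' for BRIGHT (long)).
    task = mteb.get_task("BrightRetrieval")
    res = mteb.load_results(tasks=[task]).filter_models().join_revisions()
    model_res = [m for m in res.model_results if m.model_name == "GritLM/GritLM-7B"][0]
    task_res = model_res.task_results[0]

    bench_long = mteb.get_benchmark("BRIGHT (long)")
    long_task = bench_long.tasks[0]

    filtered = task_res.validate_and_filter_scores(task=long_task)
    assert set(filtered.scores.keys()) == set(long_task.eval_splits)  # {'long'}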

@KennethEnevoldsen commented:

Both leaderboards in the new version:

[screenshots: BRIGHT and BRIGHT (long) leaderboards in the new version, 2025-02-13]

@KennethEnevoldsen commented:

Discussed this with @x-tabdeveloping in person and everything seems good.

The failing test is due to a connection error on HF's side, not something on our end, so I will merge this.

@KennethEnevoldsen merged commit 3537223 into main on Feb 13, 2025
7 of 8 checks passed
@KennethEnevoldsen deleted the KennethEnevoldsen/issue-Leaderboard-BRIGHT-Long-gone branch on February 13, 2025 at 14:45