
Add LongBench V2 benchmark #249


Open · wants to merge 11 commits into main

Conversation


@eshwarprasadS eshwarprasadS commented Apr 30, 2025

Adding LongBench to the eval options.

Install extras with:

pip install instructlab-eval[longbench]

Uses the vLLM backend to serve the model for generation.

Runs like so:

evaluator = LongBenchEvaluator(
    model_path="path/to/model",
    num_gpus=N,
    output_file="path/to/results.json",
    eval_config={"batch_size": "auto"},
    vllm_config={"max_model_len": max_len}
)

results = evaluator.run()  # Returns LongBenchResult
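
Under the hood, num_gpus and the vllm_config entries presumably map onto vLLM's offline engine arguments. A minimal sketch of that mapping, assuming the evaluator forwards them roughly one-to-one (the vLLM calls are real API; the forwarding itself is an assumption about LongBenchEvaluator's internals):

from vllm import LLM, SamplingParams

# Sketch only: the actual wiring lives inside LongBenchEvaluator.
llm = LLM(
    model="path/to/model",    # from model_path
    tensor_parallel_size=2,   # presumably from num_gpus (example value)
    max_model_len=32768,      # from vllm_config["max_model_len"] (example value)
)
outputs = llm.generate(["<long-context prompt>"], SamplingParams(max_tokens=256))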

Output JSON looks like so:

{
  "en_multidoc": 0.5424139838230786,
  "zh_multidoc": 0.24335639081098673,
  "en_singledoc": 0.4233139199560039,
  "zh_singledoc": 0.46157875457875464,
  "en_summ": 0.27244809337990245,
  "zh_summ": 0.1359562304911904,
  "en_fewshot": 0.45692449627485754,
  "zh_fewshot": 0.24416666666666667,
  "en_synthetic": 0.3799285714285714,
  "zh_synthetic": 0.4775,
  "code_avg": 0.30225,
  "overall_score": 0.3581670097645466
}
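
For reference, overall_score here is exactly the unweighted mean of the eleven category scores above, which you can verify directly:

# Verifies that overall_score is the mean of the category scores above.
category_scores = [
    0.5424139838230786, 0.24335639081098673, 0.4233139199560039,
    0.46157875457875464, 0.27244809337990245, 0.1359562304911904,
    0.45692449627485754, 0.24416666666666667, 0.3799285714285714,
    0.4775, 0.30225,
]
overall = sum(category_scores) / len(category_scores)
assert abs(overall - 0.3581670097645466) < 1e-9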


@RobotSail RobotSail left a comment


Thanks for the PR, @eshwarprasadS!

The PR has all of the right ideas; there are just a few minor changes you'll want to make, which I've outlined in this review. Once we've addressed those, this should be good to merge.

    ) / 2

    # Calculate overall score
    all_scores = [v for k, v in eval_results.items() if k != "overall_score"]

Why do we check if k != "overall_score"? We shouldn't have set this key yet.
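
(A sketch of the simplification the comment points at: compute the mean before overall_score is ever inserted into eval_results, and the filter becomes unnecessary. Names follow the excerpt above.)

# Compute the mean first; overall_score isn't in the dict yet, so no filter is needed.
all_scores = list(eval_results.values())
eval_results["overall_score"] = sum(all_scores) / len(all_scores)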

…y served openai-compatible model endpoints

Signed-off-by: eshwarprasadS <[email protected]>
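
This commit suggests the evaluator can also target an already-served OpenAI-compatible endpoint instead of spinning up vLLM itself. A hypothetical sketch of that usage; the endpoint_url and api_key parameter names are illustrative assumptions, not the confirmed signature:

# Hypothetical sketch: the parameter names below are assumptions, not the
# actual LongBenchEvaluator signature.
evaluator = LongBenchEvaluator(
    model_path="served-model-name",
    endpoint_url="http://localhost:8000/v1",  # OpenAI-compatible server
    api_key="EMPTY",
    output_file="path/to/results.json",
)
results = evaluator.run()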
Labels: ci-failure, dependencies, testing