
jaaack-wang/multi-problem-eval-llm


This repo accompanies our paper, Evaluating LLMs with Multiple Problems at once, accepted to the fourth iteration of the Generation, Evaluation & Metrics (GEM) Workshop. It provides the data and code used in the paper. For reproducibility, you can check out this link to download the raw and parsed LLM outputs.

ZeMPE

ZeMPE (Zero-Shot Multi-Problem Evaluation) contains 53,100 zero-shot multi-problem prompts. It is synthetically generated from 6 classification and 12 reasoning benchmarks that are already widely used. Concretely, there are 13,500 prompts for classification-related MPE tasks, 21,600 prompts for single-source reasoning-related MPE tasks, and 18,000 prompts for mixed-source reasoning-related MPE tasks. Correspondingly, ZeMPE.zip contains three files: ZeMPE_Clf.json, ZeMPE_MultiReasonSS.json, and ZeMPE_MultiReasonMS.json. Note that we also include the prompts for standard single-problem evaluation that were used as baselines in our paper.

ZeMPE.zip is password-protected to avoid data contamination. The password is ZeroShot and MPE joined by an underscore, 12 characters in total.
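As a convenience, here is a minimal sketch of one way to read the three JSON files straight from the archive in Python. It is not part of the repo's code: the helper name load_zempe is hypothetical, the internal layout of the JSON files is not assumed (they are simply parsed and returned), and the sketch assumes the archive uses the legacy zip encryption that Python's standard zipfile module can decrypt.

```python
import json
import zipfile

# File names as listed in this README; everything else in this sketch is an assumption.
ZEMPE_FILES = (
    "ZeMPE_Clf.json",
    "ZeMPE_MultiReasonSS.json",
    "ZeMPE_MultiReasonMS.json",
)


def load_zempe(zip_path, password, names=ZEMPE_FILES):
    """Hypothetical helper: return {file name: parsed JSON} for each requested file."""
    loaded = {}
    with zipfile.ZipFile(zip_path) as archive:
        members = archive.namelist()
        for name in names:
            # Match on the base name in case the files sit inside a folder in the archive.
            matches = [m for m in members if m.rsplit("/", 1)[-1] == name]
            if not matches:
                raise FileNotFoundError(f"{name} not found in {zip_path}")
            # zipfile expects the password as bytes.
            with archive.open(matches[0], pwd=password.encode("utf-8")) as handle:
                loaded[name] = json.load(handle)
    return loaded
```

Calling load_zempe("ZeMPE.zip", password) with the password described above returns a dictionary keyed by file name. If the archive turns out to use AES encryption, the standard zipfile module cannot decrypt it, and an external tool such as 7-Zip would be needed to extract the files first.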

Early Version

The paper has gone through three versions. The first version includes only the classification results, whereas the second and third versions contain both classification and reasoning results. You can check out the earlyVersion branch for the code and data used in the early version.

Citation

@misc{wang2025evaluatingllmsmultipleproblems,
      title={Evaluating LLMs with Multiple Problems at once}, 
      author={Zhengxiang Wang and Jordan Kodner and Owen Rambow},
      year={2025},
      eprint={2406.10786},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2406.10786}, 
}

About

Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
