This repository accompanies our paper Evaluating LLMs with Multiple Problems at once (accepted to the fourth iteration of the Generation, Evaluation & Metrics (GEM) Workshop) and provides the data and code used in the paper. For reproducibility, you can check out this link to download the raw and parsed LLM outputs.
ZeMPE (Zero-Shot Multi-Problem Evaluation) contains 53,100 zero-shot multi-problem prompts. ZeMPE is synthetically generated by leveraging 6 classification and 12 reasoning benchmarks that already exist and are widely used. More concretely, there are 13,500 prompts for classification-related MPE tasks, 21,600 prompts for single-source reasoning-related MPE tasks, and 18,000 prompts for mixed-source reasoning-related MPE tasks. Correspondingly, the zip file ZeMPE.zip contains three files: ZeMPE_Clf.json, ZeMPE_MultiReasonSS.json, and ZeMPE_MultiReasonMS.json. Note that we also include the prompts for standard single-problem evaluation that were used as baselines in our paper.
ZeMPE.zip is password protected to avoid data contamination. The password is ZeroShot and MPE with an underscore in between, 12 characters in total!
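The password hint above can be assembled and checked programmatically. Below is a minimal sketch using Python's standard zipfile module; the archive path and the assumption that the archive uses the legacy ZipCrypto scheme (the only one zipfile can decrypt) are our own, not stated in the repo, so fall back to a tool such as 7-Zip if extraction fails.

```python
import os
import zipfile

# Assemble the password from the README hint: the two words joined
# by an underscore, 12 characters in total.
password = "_".join(["ZeroShot", "MPE"])
assert len(password) == 12

# Extract the archive if it is present in the working directory
# (path is an assumption). Python's zipfile only supports the legacy
# ZipCrypto encryption; an AES-encrypted zip needs an external tool.
archive = "ZeMPE.zip"
if os.path.exists(archive):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(pwd=password.encode("utf-8"))
```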
The paper has gone through three versions. The first version includes only the classification part of the results, whereas the second and third versions contain both classification and reasoning results. You can check out the earlyVersion branch for the code and data used in the early version.
@misc{wang2025evaluatingllmsmultipleproblems,
title={Evaluating LLMs with Multiple Problems at once},
author={Zhengxiang Wang and Jordan Kodner and Owen Rambow},
year={2025},
eprint={2406.10786},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2406.10786},
}