This repository accompanies our paper Evaluating LLMs with Multiple Problems at once (accepted to the fourth iteration of the Generation, Evaluation & Metrics (GEM) Workshop) and provides the data and code used in the paper. For reproducibility, you can check out this link to download the raw and parsed LLM outputs.
ZeMPE (Zero-Shot Multi-Problem Evaluation) contains 53,100 zero-shot multi-problem prompts. ZeMPE is synthetically generated by leveraging 6 classification and 12 reasoning benchmarks that already exist and are widely used. More concretely, there are 13,500 prompts for classification-related MPE tasks, 21,600 prompts for single-source reasoning-related MPE tasks, and 18,000 prompts for mixed-source reasoning-related MPE tasks. Correspondingly, the zip file ZeMPE.zip contains three files: ZeMPE_Clf.json, ZeMPE_MultiReasonSS.json, and ZeMPE_MultiReasonMS.json. Note that we also include the prompts for standard single-problem evaluation that were used as baselines in our paper.
ZeMPE.zip is password protected to avoid data contamination. The password is ZeroShot and MPE with an underscore in between, 12 characters in total!
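The password hint above can be assembled and checked programmatically. Below is a minimal sketch using Python's standard zipfile module; the archive path and the assumption that the archive uses the legacy ZipCrypto scheme (the only one zipfile can decrypt) are our own, not stated in the repo, so fall back to a tool such as 7-Zip if extraction fails.

```python
import os
import zipfile

# Assemble the password from the README hint: the two words joined
# by an underscore, 12 characters in total.
password = "_".join(["ZeroShot", "MPE"])
assert len(password) == 12

# Extract the archive if it is present in the working directory
# (path is an assumption). Python's zipfile only supports the legacy
# ZipCrypto encryption; an AES-encrypted zip needs an external tool.
archive = "ZeMPE.zip"
if os.path.exists(archive):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(pwd=password.encode("utf-8"))
```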
The paper has gone through three versions. The first version includes only the classification part of the results, whereas the second and third versions contain both classification and reasoning results. You can check out the earlyVersion branch for the code and data used in the early version.
@misc{wang2025evaluatingllmsmultipleproblems,
title={Evaluating LLMs with Multiple Problems at once},
author={Zhengxiang Wang and Jordan Kodner and Owen Rambow},
year={2025},
eprint={2406.10786},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2406.10786},
}