Installation from source:
git clone https://github.com/sapienzanlp/MetaQAEval.git
cd MetaQAEval
conda create -n meta-qa-eval python=3.12
conda activate meta-qa-eval
pip install -e .
Under the `scripts` folder, we provide scripts to:
- Generate the outputs for a specific LLM over a given dataset.
- Run the evaluation of the outputs generated by an LLM over a given dataset using:
  - RegEx and xFinder (for free-text generation).
  - Logprobs and Perplexity (for first-token probabilities).
- Obtain the statistics for the MMLU categories and subcategories (MMLU domains).
- Run the adversarial experiments for LLM-based evaluation strategies (xFinder) using:
  - Our newly introduced resource, MMLU-Adversarial.
  - Prompts testing the ability of xFinder to solve the MCQA task itself.
In the following scripts you can specify three parameters: `DATASET`, `DATASET_NAME`, and `MODEL`. `DATASET` can be `mmlu`, `arc`, or `obqa`. `DATASET_NAME` is the `dataset_name` field of the dataset in its original Hugging Face repository; for `mmlu`, the name is `all` to evaluate all the categories. `MODEL` is the Hugging Face ID of the chosen model.
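For reference, `DATASET_NAME` corresponds to the configuration name of the dataset on the Hugging Face Hub. Here is a minimal sketch of the mapping using the `datasets` library (the exact repository IDs below are assumptions for illustration, not taken from this repository's code):

```python
# Minimal sketch: how DATASET / DATASET_NAME map to Hugging Face dataset configs.
# The repository IDs below are assumptions for illustration.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")                  # DATASET=mmlu, DATASET_NAME=all
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")   # DATASET=arc, DATASET_NAME=ARC-Challenge
obqa = load_dataset("allenai/openbookqa", "main", split="test")        # DATASET=obqa, DATASET_NAME=main

print(mmlu[0]["question"], mmlu[0]["choices"], mmlu[0]["answer"])
```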
bash scripts/generate_output.sh <DATASET> <DATASET_NAME> <MODEL>
For example:
bash scripts/generate_output.sh mmlu all meta-llama/Llama-3.1-8B-Instruct
bash scripts/eval.sh <DATASET> <MODEL> # will run both the RegEx and xFinder evaluations
bash scripts/logprobs.sh <DATASET> <MODEL>
bash scripts/perplexity.sh <DATASET> <MODEL>
For example:
bash scripts/eval.sh mmlu meta-llama/Llama-3.1-8B-Instruct
bash scripts/logprobs.sh mmlu meta-llama/Llama-3.1-8B-Instruct
bash scripts/perplexity.sh mmlu meta-llama/Llama-3.1-8B-Instruct
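For intuition, the first-token-probability strategies score each answer option by the probability the model assigns to the option letter as the next token after the prompt (the perplexity variant typically scores the full option text instead). Below is a minimal sketch with Hugging Face Transformers, not the repository's implementation; the prompt format and the choice of model are assumptions:

```python
# Minimal sketch (not this repo's code): first-token log-probability scoring.
# The prompt format below is an assumption for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

prompt = (
    "Question: What is the capital of France?\n"
    "A. Rome\nB. Paris\nC. Madrid\nD. Berlin\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]   # logits for the next token
log_probs = torch.log_softmax(next_token_logits, dim=-1)

# Score each option letter by its log-probability as the next token.
scores = {}
for option in "ABCD":
    token_id = tokenizer.encode(" " + option, add_special_tokens=False)[0]
    scores[option] = log_probs[token_id].item()

print(max(scores, key=scores.get), scores)
```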
bash scripts/mmlu_domains_analysis.sh <MODEL>
For example:
bash scripts/mmlu_domains_analysis.sh meta-llama/Llama-3.1-8B-Instruct
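For context, MMLU's 57 subjects are commonly grouped into broader domains such as STEM, humanities, social sciences, and other. The sketch below illustrates the kind of aggregation such an analysis performs; the per-subject accuracies and the partial mapping are hypothetical, and this is not the repository's script:

```python
# Illustrative sketch: aggregating per-subject accuracy into MMLU macro-categories.
from collections import defaultdict

# Hypothetical per-subject accuracies produced by an evaluation run.
subject_accuracy = {"abstract_algebra": 0.41, "philosophy": 0.72, "econometrics": 0.55}

# Partial, hypothetical mapping; MMLU defines 57 subjects in total.
subject_to_category = {
    "abstract_algebra": "STEM",
    "philosophy": "humanities",
    "econometrics": "social sciences",
}

totals, counts = defaultdict(float), defaultdict(int)
for subject, accuracy in subject_accuracy.items():
    category = subject_to_category[subject]
    totals[category] += accuracy
    counts[category] += 1

category_accuracy = {c: totals[c] / counts[c] for c in totals}
print(category_accuracy)
```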
bash scripts/xfinder_mmlu-adversarial.sh
bash scripts/xfinder_adversarial.sh <DATASET> <DATASET_NAME> <PROMPT_ID>
`PROMPT_ID` is the ID of the prompt in `prompts/xfinder_adversarial_prompts.json`.
For example:
bash scripts/xfinder_mmlu-adversarial.sh
bash scripts/xfinder_adversarial.sh arc ARC-Challenge 1
If you use any part of this work, please consider citing the paper as follows:
@inproceedings{molfese2025rightanswerwrongscore,
  title     = {Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering},
  author    = {Francesco Maria Molfese and Luca Moroni and Luca Gioffrè and Alessandro Scirè and Simone Conia and Roberto Navigli},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
  pages     = {},
  year      = {2025}
}
The data and software are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0.
We gratefully acknowledge the support of Future AI Research (PNRR MUR project PE0000013-FAIR).