Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

This repository contains the original code and outputs for the ACL 2025 paper "Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering" by Francesco Maria Molfese, Luca Moroni, Luca Gioffré, Alessandro Scirè, Simone Conia, and Roberto Navigli.

🛠️ Installation

Installation from source:

git clone https://github.com/sapienzanlp/MetaQAEval.git
cd MetaQAEval
conda create -n meta-qa-eval python=3.12
conda activate meta-qa-eval
pip install -e .
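
As a quick sanity check (the environment and folder names follow the commands above), you can confirm that the environment is active and that the reproduction scripts are in place:

conda env list | grep meta-qa-eval  # the active environment is marked with *
ls scripts/                         # the scripts described in the next section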

Reproducibility

Under the scripts folder, we provide scripts to:

  1. Generate the outputs for a specific LLM over a given dataset.
  2. Run the evaluation of the outputs generated with an LLM over a given dataset using:
    • RegEx and xFinder (for free-text generation).
    • Logprobs and Perplexity (for first-token probabilities).
  3. Obtain the statistics for the MMLU categories and subcategories (MMLU domains).
  4. Run the adversarial experiments for LLM-based evaluation strategies (xFinder) using:
    • Our newly introduced resource: MMLU-Adversarial.
    • Prompts testing the ability of xFinder to solve the MCQA task.

Parameter ranges

In the following scripts, you can specify three parameters: DATASET, DATASET_NAME, and MODEL.

  • DATASET can be one of: mmlu, arc, obqa.
  • DATASET_NAME is the dataset_name field of the dataset in the original Hugging Face repository. For mmlu, use all to evaluate all categories.
  • MODEL is the Hugging Face ID of the chosen model.
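
For instance, a run over ARC-Challenge could be configured as follows (a minimal sketch; the values are just one valid combination, mirroring the generation script below):

DATASET=arc                              # one of: mmlu, arc, obqa
DATASET_NAME=ARC-Challenge               # dataset_name field on the Hugging Face repository
MODEL=meta-llama/Llama-3.1-8B-Instruct   # Hugging Face model ID
bash scripts/generate_output.sh "$DATASET" "$DATASET_NAME" "$MODEL"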

Generate the outputs:

bash scripts/generate_output.sh <DATASET> <DATASET_NAME> <MODEL>

For example:

bash scripts/generate_output.sh mmlu all meta-llama/Llama-3.1-8B-Instruct

Evaluate the models:

bash scripts/eval.sh <DATASET> <MODEL> # runs both the RegEx and xFinder evaluations
bash scripts/logprobs.sh <DATASET> <MODEL>
bash scripts/perplexity.sh <DATASET> <MODEL>

For example:

bash scripts/eval.sh mmlu meta-llama/Llama-3.1-8B-Instruct
bash scripts/logprobs.sh mmlu meta-llama/Llama-3.1-8B-Instruct
bash scripts/perplexity.sh mmlu meta-llama/Llama-3.1-8B-Instruct
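
To run all three evaluation strategies for one or more models in a single pass, a simple loop works (a sketch; the model list is illustrative):

for MODEL in meta-llama/Llama-3.1-8B-Instruct; do
  bash scripts/eval.sh mmlu "$MODEL"        # RegEx + xFinder
  bash scripts/logprobs.sh mmlu "$MODEL"    # first-token log-probabilities
  bash scripts/perplexity.sh mmlu "$MODEL"  # first-token perplexity
done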

Statistics for the MMLU domains:

bash scripts/mmlu_domains_analysis.sh <MODEL>

For example:

bash scripts/mmlu_domains_analysis.sh meta-llama/Llama-3.1-8B-Instruct

Adversarial experiments:

bash scripts/xfinder_mmlu-adversarial.sh
bash scripts/xfinder_adversarial.sh <DATASET> <DATASET_NAME> <PROMPT_ID>

PROMPT_ID is the ID of the prompt in prompts/xfinder_adversarial_prompts.json.

For example:

bash scripts/xfinder_mmlu-adversarial.sh 
bash scripts/xfinder_adversarial.sh arc ARC-Challenge 1
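
To sweep over several prompts, you can iterate over the IDs defined in prompts/xfinder_adversarial_prompts.json (a sketch; the range 1 to 3 is illustrative and should match the IDs actually present in the file):

for PROMPT_ID in 1 2 3; do
  bash scripts/xfinder_adversarial.sh arc ARC-Challenge "$PROMPT_ID"
done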

Cite this work

If you use any part of this work, please consider citing the paper as follows:

@inproceedings{molfese2025rightanswerwrongscore,
  title={Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering}, 
  author={Francesco Maria Molfese and Luca Moroni and Luca Gioffrè and Alessandro Scirè and Simone Conia and Roberto Navigli},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  pages={},
  year={2025}
}

🪪 License

The data and software are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0).

Acknowledgements

We gratefully acknowledge the support of Future AI Research (PNRR MUR project PE0000013-FAIR).
