
Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning

Introduction

Large Language Models (LLMs) excel at reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has spurred growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors that influence CoT distillation, including reasoning granularity, CoT format, and the choice of teacher model.

Figure: Overview of CoT distillation. Different teacher models generate CoT supervision with varying levels of granularity and formats to fine-tune the student model.

Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has minimal effect on SLMs, likely due to their reliance on supervised fine-tuning rather than pretraining preferences; (3) Stronger teacher models do NOT always produce better student models, as diversity and complexity in CoT supervision can outweigh accuracy alone. These findings emphasize the need to tailor CoT strategies to the specific student model, offering actionable insights for optimizing CoT distillation in SLMs.

Todo

  • Release evaluation code for mathematical and commonsense reasoning
  • Release SFT datasets
  • Add instructions for SFT with LLaMA-Factory

Experiments Setup

We conducted extensive experiments on four mathematical reasoning datasets of varying difficulty and three commonsense reasoning datasets, using four teacher models to distill reasoning skills into seven student models.

Datasets

| Dataset | Samples (Training) | Samples (Testing) | Fields | Human Annotation |
|---|---|---|---|---|
| SVAMP | 700 | 300 | Arithmetic problems | Yes |
| GSM8K | 7.4k | 1.3k | Grade-school math | Yes |
| AQuA-RAT | 6.1k | 254 | Algebraic reasoning, multi-step | Yes |
| MATH | 1.3k | 500 | Pre-Algebra, Algebra, Counting & Probability, Number Theory | Yes |
| CommonsenseQA | 9.7k | 1.2k | Commonsense knowledge | Yes |
| OpenBookQA | 4.9k | 500 | Domain-specific knowledge | No |
| StrategyQA | 2k | 290 | Multi-step reasoning | Yes |

Models

Teacher models: GPT-4o, Gemini-1.5-Flash, LLaMA 3 70B

Student models: LLaMA 3.2 1B, LLaMA 3.2 3B, Gemma 2B, BLOOM 560M, 1.1B, 1.7B, 3B

Installation

Our experiments use the LLaMA-Factory pipeline to fine-tune the student models.

Important

LLaMA-Factory must be installed before running the training steps below.

conda create -n llama_factory python=3.10
conda activate llama_factory
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
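
If the install succeeded, the LLaMA-Factory CLI should be on your PATH. As a quick sanity check (assuming the release you cloned still provides the version subcommand), you can run:

llamafactory-cli version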

For the evaluation environment:

conda create -n evaluation python=3.10
conda activate evaluation
cd Evaluation
pip install -r requirements.txt

Training

The training data are provided in the data folder. Please refer to data/readme.md to see a list of our datasets. Here's how to set up training:

  1. After cloning the LLaMA-Factory repository, copy all contents of this repository's data folder into the data folder of your LLaMA-Factory directory.

  2. We provide a config-generation script, config/yamlgeneration.py. Modify dataset_name, gpu_devices, and models in the script, then run the following commands to generate the configs:

cd config
python yamlgeneration.py
  3. To fine-tune a student model, run the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train config/<your_dataset_name>/<models>_<your_dataset_name>.yaml
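
For example, with GSM8K as the dataset and LLaMA 3.2 1B as the student, the command might look like the following (the exact file name depends on how yamlgeneration.py names the generated configs; this path is illustrative):

CUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train config/gsm8k/llama3.2-1b_gsm8k.yaml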

Or you can run:

export CUDA_VISIBLE_DEVICES=0,1,2,3

for config_file in /code/LLaMA-Factory/config/<your_dataset_name>/*.yaml; do
    llamafactory-cli train "$config_file"
done

to train on multiple configurations sequentially.

We provide our training configs in config/examples for your reference.

Evaluation

The evaluation code is adapted from MAmmoTH.

Single Evaluation Run

To perform a single evaluation, use the following commands:

For Mathematical Reasoning:

CUDA_VISIBLE_DEVICES=0 python run_open.py \
    --model path_to_your_model \
    --shots 0 \
    --dataset your_dataset_name \
    --model_max_length 1024 \
    --dtype bfloat16 \
    --form your_model_form
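
For instance, a zero-shot GSM8K evaluation of a fine-tuned LLaMA-based student might look like the sketch below (the model path is illustrative; all flag values come from the argument notes and dataset options listed further down):

CUDA_VISIBLE_DEVICES=0 python run_open.py \
    --model saves/llama3.2-1b_gsm8k \
    --shots 0 \
    --dataset gsm8k \
    --model_max_length 1024 \
    --dtype bfloat16 \
    --form llama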

For Commonsense Reasoning:

CUDA_VISIBLE_DEVICES=0 python run_reasoning.py \
    --model path_to_your_model \
    --dataset your_dataset_name \
    --output test.json \
    --model_max_length 640 \
    --dtype bfloat16 \
    --form your_model_form
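
Similarly, an illustrative CommonsenseQA run for the same hypothetical student could be (model path and output file name are placeholders):

CUDA_VISIBLE_DEVICES=0 python run_reasoning.py \
    --model saves/llama3.2-1b_csqa \
    --dataset csqa_test.json \
    --output csqa_results.json \
    --model_max_length 640 \
    --dtype bfloat16 \
    --form llama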

Batch Evaluation

To run large-scale evaluations across multiple models:

Modify the following parameters in evaluate_models.py (mathematical reasoning) or autoevaluate.py (commonsense reasoning):

  • num_gpus: Number of GPUs to utilize.
  • output_file: Path to save the evaluation results.
  • model_dir: Directory containing the models to evaluate.

Run the respective evaluation scripts:

# For Mathematical Reasoning:
python evaluate_models.py
# For Commonsense Reasoning:
python autoevaluate.py

Arguments Explanation:

  • model: Path to your fine-tuned model.

  • shots: Number of few-shot examples (set to 0 for zero-shot evaluation).

  • dataset: Name of the dataset (see valid options below).

  • model_max_length: Maximum sequence length.

  • dtype: Data type for evaluation.

  • form: Model template (choose from gemma, llama, alpaca).

Valid dataset options:

Mathematical Reasoning Datasets: svamp, gsm8k, aqua, math

Commonsense Reasoning Datasets: csqa_test.json, openbookQA_test.json, strategyQA_test.json

Acknowledgments

The evaluation code is adapted from MAmmoTH.

Citation

@misc{chen2025unveilingkeyfactorsdistilling,
      title={Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning}, 
      author={Xinghao Chen and Zhijing Sun and Wenjin Guo and Miaoran Zhang and Yanjun Chen and Yirong Sun and Hui Su and Yijie Pan and Dietrich Klakow and Wenjie Li and Xiaoyu Shen},
      year={2025},
      eprint={2502.18001},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.18001}, 
}

Contact

If you have any questions, feel free to raise an issue or contact us at [email protected].
