Large Language Models (LLMs) excel at reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has spurred growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors that influence CoT distillation, including the granularity and format of CoT supervision and the choice of teacher model.
*Overview of CoT distillation: different teacher models generate CoT supervision with varying granularity and formats to fine-tune the student model.*
Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has minimal effect on SLMs, likely due to their reliance on supervised fine-tuning rather than pretraining preferences; (3) Stronger teacher models do NOT always produce better student models, as diversity and complexity in CoT supervision can outweigh accuracy alone. These findings emphasize the need to tailor CoT strategies to the specific student model, offering actionable insights for optimizing CoT distillation in SLMs.
- Release evaluation code on math and commonsense reasoning
- Release SFT datasets
- Add instructions for SFT on LLaMA-Factory
We conducted extensive experiments on four mathematical reasoning datasets of varying difficulty and three commonsense reasoning datasets, using four teacher models to distill reasoning skills into seven student models.
| Dataset | Samples (Training) | Samples (Testing) | Fields | Human Annotation |
|---|---|---|---|---|
| SVAMP | 700 | 300 | Arithmetic problems | Yes |
| GSM8K | 7.4k | 1.3k | Grade-school math | Yes |
| AQuA-RAT | 6.1k | 254 | Algebraic reasoning, multi-step | Yes |
| MATH | 1.3k | 500 | Pre-Algebra, Algebra, Counting & Probability, Number Theory | Yes |
| CommonsenseQA | 9.7k | 1.2k | Commonsense knowledge | Yes |
| OpenBookQA | 4.9k | 500 | Domain-specific knowledge | No |
| StrategyQA | 2k | 290 | Multi-step reasoning | Yes |
- **Teacher models:** GPT-4o, Gemini-1.5-Flash, LLaMA 3 70B
- **Student models:** LLaMA 3.2 1B, LLaMA 3.2 3B, Gemma 2B, and BLOOM 560M, 1.1B, 1.7B, and 3B
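For intuition, each CoT-distillation training sample pairs a question with a teacher-generated step-by-step rationale. The snippet below is purely illustrative: the question is a standard GSM8K-style example, but the exact schema and field names shipped in this repository are documented in `data/readme.md`. It prints one record in the alpaca-style `instruction`/`input`/`output` layout that LLaMA-Factory commonly consumes.

```bash
# Illustrative only: one alpaca-style SFT record with a teacher-written CoT rationale.
# The actual field names and files in data/ may differ; see data/readme.md.
cat <<'EOF'
[
  {
    "instruction": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "input": "",
    "output": "In May, Natalia sold 48 / 2 = 24 clips. Across April and May she sold 48 + 24 = 72 clips. The answer is 72."
  }
]
EOF
```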
Our experiments use the LLaMA-Factory pipeline to fine-tune the student models.
> [!IMPORTANT]
> Installation is mandatory.
```bash
conda create -n llama_factory python==3.10
conda activate llama_factory
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```
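Optionally, you can sanity-check the installation; this assumes the editable install above placed LLaMA-Factory's CLI entry point on your PATH.

```bash
# Should print the installed LLaMA-Factory version if the install succeeded.
llamafactory-cli version
```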
For the evaluation environment:
```bash
conda create -n evaluation python==3.10
conda activate evaluation
cd Evaluation
pip install -r requirements.txt
```
The training data are provided in the `data` folder. Please refer to `data/readme.md` to see a list of our datasets. Here's how to set up training:

- After cloning the LLaMA-Factory repository, copy all contents from this repository's `data` folder into the `data` folder of the LLaMA-Factory directory.
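For example (both paths below are placeholders for wherever you cloned the two repositories):

```bash
# Copy the provided SFT datasets into LLaMA-Factory's data folder.
cp -r /path/to/this_repo/data/* /path/to/LLaMA-Factory/data/
```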
- We provide the training-config generation script `config/yamlgeneration.py`. You can modify `dataset_name`, `gpu_devices`, and `models`, and then run the following command to generate configs:
```bash
cd config
python yamlgeneration.py
```
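If generation succeeds, you should find one YAML per model/dataset combination under `config/<your_dataset_name>/`, following the `<models>_<your_dataset_name>.yaml` naming used in the training command below; the exact model identifiers depend on what you set in `yamlgeneration.py`.

```bash
# List the generated configs for one dataset (directory name depends on your settings).
ls config/<your_dataset_name>/
```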
- To fine-tune the target student model, run the following command:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train config/<your_dataset_name>/<models>_<your_dataset_name>.yaml
```
Or, to train on multiple configurations continuously, you can run:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
for config_file in /code/LLaMA-Factory/config/<your_dataset_name>/*.yaml; do
    llamafactory-cli train "$config_file"
done
```
We provide our training configs in `config/examples` for your reference.
The evaluation code is built from MAmmoTH.
To perform a single evaluation, use the following commands:
For Mathematical Reasoning:

```bash
CUDA_VISIBLE_DEVICES=0 python run_open.py \
--model path_to_your_model \
--shots 0 \
--dataset your_dataset_name \
--model_max_length 1024 \
--dtype bfloat16 \
--form your_model_form
```
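For example, to evaluate a LLaMA-based student model on GSM8K in the zero-shot setting (the checkpoint path below is a placeholder for your own fine-tuned model):

```bash
# All flag values except the model path come directly from the options documented below.
CUDA_VISIBLE_DEVICES=0 python run_open.py \
--model ./checkpoints/llama3.2-1b-gsm8k-sft \
--shots 0 \
--dataset gsm8k \
--model_max_length 1024 \
--dtype bfloat16 \
--form llama
```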
For Commonsense Reasoning:

```bash
CUDA_VISIBLE_DEVICES=0 python run_reasoning.py \
--model path_to_your_model \
--dataset your_dataset_name \
--output test.json \
--model_max_length 640 \
--dtype bfloat16 \
--form your_model_form
```
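Similarly, for CommonsenseQA with a Gemma-based student (again, the model path and output file are placeholders):

```bash
CUDA_VISIBLE_DEVICES=0 python run_reasoning.py \
--model ./checkpoints/gemma-2b-csqa-sft \
--dataset csqa_test.json \
--output csqa_results.json \
--model_max_length 640 \
--dtype bfloat16 \
--form gemma
```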
To run large-scale evaluations across multiple models:

Modify the following parameters in `evaluate_models.py` or `autoevaluate.py`:

- `num_gpus`: Number of GPUs to utilize.
- `output_file`: Path to save the evaluation results.
- `model_dir`: Directory containing the models to evaluate.
Run the respective evaluation scripts:

```bash
# For Mathematical Reasoning:
python evaluate_models.py

# For Commonsense Reasoning:
python autoevaluate.py
```
Arguments for the single-evaluation commands above:

- `model`: Path to your fine-tuned model.
- `shots`: Number of few-shot examples (set to 0 for zero-shot evaluation).
- `dataset`: Name of the dataset (see valid options below).
- `model_max_length`: Maximum sequence length.
- `dtype`: Data type for evaluation.
- `form`: Model template (choose from `gemma`, `llama`, `alpaca`).
`dataset` options:

- Mathematical Reasoning Datasets: `svamp`, `gsm8k`, `aqua`, `math`
- Commonsense Reasoning Datasets: `csqa_test.json`, `openbookQA_test.json`, `strategyQA_test.json`
```bibtex
@misc{chen2025unveilingkeyfactorsdistilling,
  title={Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning},
  author={Xinghao Chen and Zhijing Sun and Wenjin Guo and Miaoran Zhang and Yanjun Chen and Yirong Sun and Hui Su and Yijie Pan and Dietrich Klakow and Wenjie Li and Xiaoyu Shen},
  year={2025},
  eprint={2502.18001},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.18001},
}
```
If you have any questions, feel free to raise an issue or contact us at [email protected].