Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate, Yubo Wang+, arXiv'25 #1832

AkihikoWatanabe · 2025-03-25T00:54:11Z

URL

https://arxiv.org/abs/2501.17703

Authors

Yubo Wang
Xiang Yue
Wenhu Chen

Abstract

Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of ([query; noisy response], critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks with different base models like Qwen2.5, Qwen2.5-Math and DeepSeek-Math. We further expand to MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our model Qwen2.5-Math-CFT only requires 1 hour training on 8xH100 over the 50K examples. It can match or outperform strong competitors like Qwen2.5-Math-Instruct on most benchmarks, which use over 2M samples. Moreover, it can match the performance of SimpleRL, which is a deepseek-r1 replication trained with 140x more compute. Ablation studies show that CFT is robust to the source of noisy response and teacher critique model. Through these findings, we argue that CFT offers a more effective alternative to advance the reasoning of language models.

Translation (by gpt-4o-mini)

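Supervised fine-tuning (SFT) is commonly used to train language models to imitate annotated responses to given instructions. This paper challenges that paradigm and proposes Critique Fine-Tuning (CFT), a strategy in which the model learns to critique noisy responses rather than simply imitating correct ones. Inspired by human learning processes that emphasize critical thinking, CFT promotes the deep analysis and nuanced understanding that standard SFT tends to overlook. To validate CFT, the authors build a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form ([query; noisy response], critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks across base models such as Qwen2.5, Qwen2.5-Math, and DeepSeek-Math. Extending to the MetaMath and NuminaMath datasets shows similar gains over SFT. Notably, the Qwen2.5-Math-CFT model needs only 1 hour of training on 8xH100 over the 50K examples, yet matches or outperforms strong competitors such as Qwen2.5-Math-Instruct, which use over 2M samples. It also matches the performance of SimpleRL, a deepseek-r1 replication trained with 140x more compute. Ablation studies show that CFT is robust to the source of the noisy responses and to the choice of teacher critique model. Based on these findings, the authors argue that CFT offers a more effective alternative for advancing the reasoning of language models.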

Summary (by gpt-4o-mini)

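Critique Fine-Tuning (CFT) is a new strategy in which a language model learns to critique noisy responses, challenging conventional supervised fine-tuning (SFT). Inspired by human learning processes, CFT promotes deeper analysis. Using a 50K-sample dataset built from WebInstruct, CFT improves on SFT by 4-10% across multiple base models. In particular, Qwen2.5-Math-CFT matches strong competitors with far less training, and CFT was also shown to be robust. The authors argue that CFT is an effective method for advancing the reasoning of language models.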

AkihikoWatanabe · 2025-03-25T00:54:26Z

Original post: https://x.com/WenhuChen/status/1885060597500567562

AkihikoWatanabe · 2025-03-25T01:09:28Z

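The paper proposes Critique Fine-Tuning (CFT). In CFT, given a query x and a noisy response y ¹, the model learns to produce a critique c of that response. Since c is not given, it is synthesized by a strong model such as GPT-4o.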
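As a rough illustration of the data format described above, the sketch below assembles one ([query; noisy response], critique) pair into token ids and labels, masking the prompt so that only the critique tokens contribute to the loss. The prompt template, the function names `build_cft_example` and `make_toy_tokenizer`, and the whitespace tokenizer are all assumptions made for illustration; these notes do not state the template or tokenizer the paper uses.

```python
# Minimal sketch of building one CFT training example (hypothetical helper names).
IGNORE_INDEX = -100  # label value ignored by default in PyTorch's CrossEntropyLoss


def make_toy_tokenizer():
    """Return a whitespace tokenizer that assigns ids on first sight (illustration only)."""
    vocab = {}

    def tokenize(text):
        return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

    return tokenize


def build_cft_example(query, noisy_response, critique, tokenize):
    """Return (input_ids, labels) for one ([query; noisy response], critique) pair.

    The prompt [x; y] is masked out of the labels, so a trainer would compute
    the loss on the critique c only.
    """
    # Assumed template; the actual formatting used in the paper is not given here.
    prompt = f"Question:\n{query}\n\nResponse:\n{noisy_response}\n\nCritique:\n"
    prompt_ids = tokenize(prompt)
    critique_ids = tokenize(critique)
    input_ids = prompt_ids + critique_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + critique_ids
    return input_ids, labels


if __name__ == "__main__":
    tokenize = make_toy_tokenizer()
    ids, labels = build_cft_example(
        "What is 2 + 2?",
        "2 + 2 = 5",
        "The response is incorrect: 2 + 2 = 4.",
        tokenize,
    )
    print(len(ids), labels)
```

With a real tokenizer, the same `input_ids` and `labels` lists could be fed to a standard causal-LM trainer. The only difference from ordinary SFT is what goes into the prompt and what serves as the target.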

The objective is simple: maximize the probability of generating c given [x; y].
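Written out, the objective described in this note would take roughly the following form. This is a reconstruction from the description above, not a copy of the paper's equation: the symbol $\theta$ for the model parameters and $D$ for the critique dataset are assumed notation, and the second line is just the standard autoregressive factorization over the critique tokens $c_t$.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(x,\,y,\,c) \in D} \log P\big(c \mid [x; y];\, \theta\big),
\qquad
\log P\big(c \mid [x; y];\, \theta\big) = \sum_{t=1}^{|c|} \log P\big(c_t \mid [x; y],\, c_{<t};\, \theta\big)
```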

Comparison with RL-based methods: CFT reaches comparable performance with roughly 1/10 of the data and roughly 1/100 of the GPU time.

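¹ In the data sampled from WebInstruct that the paper uses, roughly 50% of the responses y are correct and the rest are incorrect, so the training data is noisy to about that degree.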

AkihikoWatanabe added the Pocket label Mar 25, 2025

AkihikoWatanabe changed the title a Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate, Yubo Wang+, arXiv'25 Mar 25, 2025

AkihikoWatanabe added Finetuning (SFT) NLP LanguageModel Distillation SelfCorrection and removed Pocket Distillation SelfCorrection labels Mar 25, 2025
