Scaling Data-Constrained Language Models, Niklas Muennighoff+, arXiv'23 #1829

AkihikoWatanabe · 2025-03-23T23:24:18Z

URL

https://arxiv.org/abs/2305.16264

Authors

Niklas Muennighoff
Alexander M. Rush
Boaz Barak
Teven Le Scao
Aleksandra Piktus
Nouamane Tazi
Sampo Pyysalo
Thomas Wolf
Colin Raffel

Abstract

The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.
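The abstract's claim that a few epochs of repetition cost almost nothing while heavy repetition stops helping can be sketched numerically. The snippet below is only an illustration, assuming that the value of repeated tokens decays exponentially and saturates. The function name `effective_tokens`, the saturating form, and the constant `r_star = 15.0` are assumptions made here, not values taken from this page; see the paper for the fitted scaling law.

```python
import math


def effective_tokens(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Return an "effective" unique-token count for `epochs` passes over the data.

    Assumption (not from this page): each extra pass is worth a little less
    than the previous one, following an exponential decay governed by the
    hypothetical constant `r_star`.
    """
    repeats = max(epochs - 1.0, 0.0)  # passes beyond the first one
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


if __name__ == "__main__":
    unique = 100e9  # hypothetical budget of 100B unique tokens
    for epochs in (1, 2, 4, 8, 16, 64):
        seen = unique * epochs
        eff = effective_tokens(unique, epochs)
        # The ratio shows how much of the repeated data still "counts".
        print(f"epochs={epochs:>3}  seen={seen:.2e}  effective={eff:.2e}  ratio={eff / seen:.2f}")
```

Under these assumed settings, 4 epochs retain most of the value of the same number of unique tokens, while the effective count flattens out as the number of epochs grows, which mirrors the qualitative behaviour described in the abstract.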

Translation (by gpt-4o-mini)

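The current trend in scaling language models is to increase both the parameter count and the size of the training dataset. Extrapolating this trend suggests that training dataset size may come to be limited by the amount of text data available on the internet. Motivated by this limit, we investigate the scaling of language models in data-constrained settings. Specifically, we run large-scale experiments that vary the degree of data repetition and the compute budget, using up to 900 billion training tokens and models of up to 9 billion parameters. With constrained data under a fixed compute budget, training with up to 4 epochs of repeated data showed almost no change in loss compared with using unique data. As repetition increases further, however, the value of adding compute eventually falls to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the diminishing value of repeated tokens and excess parameters. Finally, we experiment with approaches to mitigating data scarcity, such as augmenting the training dataset with code data or removing commonly used filters. The models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.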

Summary (by gpt-4o-mini)

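Investigates language model scaling when training data is constrained. In experiments with up to 900 billion training tokens and models of up to 9 billion parameters, training on data repeated for up to 4 epochs changes the loss very little compared with unique data, but with more repetition the value of additional compute decays toward zero. The paper proposes a scaling law for compute optimality that accounts for this and also tests approaches to mitigating data scarcity. The resulting models and datasets are publicly released.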

AkihikoWatanabe added the Pocket label Mar 23, 2025

AkihikoWatanabe changed the title from "あ" to "Scaling Data-Constrained Language Models, Niklas Muennighoff+, arXiv'23" Mar 23, 2025

AkihikoWatanabe added the Scaling Laws, MachineLearning, LanguageModel, and NLP labels Mar 23, 2025
