Scaling Laws for Neural Language Models, Jared Kaplan+, arXiv'20 #1828

AkihikoWatanabe · 2025-03-23T23:23:49Z

URL

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

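Summary: The paper studies scaling laws for language model performance and shows that the loss scales as a power law with model size, dataset size, and training compute. Architectural details such as width and depth have little effect, and both overfitting and training speed are described by simple equations. These relationships make it possible to allocate a fixed compute budget optimally: larger models are more sample-efficient, so compute-efficient training uses a very large model on a relatively modest amount of data and stops well before convergence.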
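As a rough illustration of what "the loss scales as a power law" means, the sketch below assumes the single-variable form L(x) = (x_c / x)^alpha, where x stands for model size, dataset size, or compute. Such a relation is a straight line in log-log space, so the exponent can be recovered with an ordinary least-squares line fit. The function names and the constants `TRUE_X_C` and `TRUE_ALPHA` are hypothetical, chosen only to generate synthetic data; they are not the paper's fitted values.

```python
import math


def power_law_loss(x, x_c, alpha):
    """Loss under the assumed form L(x) = (x_c / x) ** alpha."""
    return (x_c / x) ** alpha


def fit_power_law(xs, losses):
    """Fit log L = alpha * log x_c - alpha * log x by least squares.

    Returns the pair (x_c, alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_l = [math.log(l) for l in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_l = sum(log_l) / n
    # Slope and intercept of the best-fit line in log-log space.
    slope = sum((a - mean_x) * (b - mean_l) for a, b in zip(log_x, log_l)) / sum(
        (a - mean_x) ** 2 for a in log_x
    )
    intercept = mean_l - slope * mean_x
    alpha = -slope
    x_c = math.exp(intercept / alpha)
    return x_c, alpha


# Hypothetical constants used only to generate noise-free synthetic data.
TRUE_X_C = 1e13
TRUE_ALPHA = 0.08

sizes = [1e6, 1e7, 1e8, 1e9]
losses = [power_law_loss(n, TRUE_X_C, TRUE_ALPHA) for n in sizes]

fitted_x_c, fitted_alpha = fit_power_law(sizes, losses)
print(f"alpha = {fitted_alpha:.4f}, x_c = {fitted_x_c:.3e}")
```

With real measurements the points would scatter around the line, and the fit would only be meaningful over the range where the power-law trend actually holds.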

AkihikoWatanabe added the Pocket label Mar 23, 2025

AkihikoWatanabe changed the title from "あ" to "Scaling Laws for Neural Language Models, Jared Kaplan+, arXiv'20" Mar 23, 2025

AkihikoWatanabe added Scaling Laws MachineLearning NLP LanguageModel labels Mar 24, 2025