Principles for tuning learning rate #33
I think most of us try to copy the hparams from someone we trust who did something similar, and it's especially important that they used the same framework. For example, before we launched the BLOOM-176B training I begged to see how Megatron-Turing NLG 530B had their lr set up, because they were the only ones who used Megatron-DeepSpeed and they had an even bigger model than ours. It proved to be a successful strategy.

Then, typically, problems happen when you have bad or badly shuffled data. Besides skipping bad data pockets, rolling back to an older good checkpoint and lowering the LR tends to help one make progress (a minimal sketch of that recovery step is below). But if the data is bad it's going to get you again and again, so stop, clean up your data, and shuffle it well.

Reading LLM training logbooks is a very enlightening experience: you get to vicariously live the big training battles and learn how to troubleshoot an ailing training. @StellaAthena's Common LLM Settings spreadsheet is another great reference point. I have also started compiling types of instabilities here, but it's just a first step; a lot more know-how needs to be gathered.

I'm sure others will have some great insights to add in this discussion.
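To make the rollback step concrete, here is a minimal sketch assuming a plain PyTorch training loop where the checkpoint is a dict with `model`, `optimizer` and `step` entries; the path and the 0.7 scale factor are illustrative placeholders, not anything from an actual run.

```python
import torch

def resume_with_lower_lr(model, optimizer, ckpt_path="checkpoints/step_90000.pt", lr_scale=0.7):
    """Roll back to a known-good checkpoint and shrink the learning rate.

    The checkpoint layout ("model"/"optimizer"/"step" keys), the path and the
    scale factor are illustrative assumptions, not a fixed recipe.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])

    # Lower the LR in place. If an LR scheduler is also checkpointed, its
    # base/max LR should be scaled the same way, otherwise later steps will
    # climb back to the old peak that caused the trouble.
    for group in optimizer.param_groups:
        group["lr"] *= lr_scale

    return ckpt.get("step", 0)  # resume the step counter if it was saved
```

How far back to roll and how much to lower the LR are judgment calls; the point is only to restart from a healthy state with a gentler step size.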
When training a large-scale model, what is the principle for tuning the learning rate? I know you can just follow the Bible for most LLMs in NLP, but you still have to figure it out in many other cases, such as a new LLM architecture, a new dataset, or a new problem. My personal approach is to use as large an lr as possible to make convergence faster (similar to batch size, where we want the largest bs that fits in the memory we have). However, a larger lr may cause training instabilities like loss spikes, or even convergence to a bad spot :( (a toy sketch of this trial-and-error follows at the end of this post).
Would like to hear your thoughts.
Continuing the discussion from issue #32 (comment).
shout out to @stas00 for organizing the community!
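To make the "largest stable lr" idea above concrete, here is a toy sketch of trialling a few peak LRs on short runs and keeping the largest one that doesn't spike. `train_steps` is a hypothetical hook into your trainer that runs a fixed number of steps and returns the loss history, and the candidate values and spike threshold are placeholders.

```python
import math

def is_stable(losses, spike_factor=2.0):
    """Treat a trial as unstable if the loss goes non-finite or jumps well above its best so far."""
    best = float("inf")
    for loss in losses:
        if not math.isfinite(loss):
            return False
        best = min(best, loss)
        if loss > spike_factor * best:
            return False
    return True

def pick_peak_lr(candidate_lrs, train_steps, trial_steps=500):
    """Try candidate peak LRs from largest to smallest on short runs; return the first stable one."""
    for lr in sorted(candidate_lrs, reverse=True):
        losses = train_steps(lr=lr, num_steps=trial_steps)  # hypothetical trainer hook
        if is_stable(losses):
            return lr
    raise RuntimeError("no candidate LR was stable; check data, init and optimizer settings")

# e.g.: peak_lr = pick_peak_lr([3e-4, 1e-4, 6e-5, 3e-5], train_steps)
```

Short trials won't catch late-training instabilities, so this only narrows the initial guess; the rollback-and-lower-LR recovery described above remains the fallback.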