Thank you for the kind words, @haidahaha
We spent about 8 months training smaller models and running a lot of benchmarks to decide which framework would work best. The HPC we were using had extremely low 135Gbps inter-node comms, so DeepSpeed ZeRO was unfortunately not an option - ZeRO shards states across all GPUs and re-gathers them every step, which hammers a slow inter-node fabric - and we chose Megatron-LM w/ 3D parallelism, later switching to Megatron-DeepSpeed to improve the performance even more.

The most intense learning came once we tried to train a 104B model: it was quite a disaster, since we couldn't overcome the training divergence no matter what we tried - and we tried a lot of different workarounds/solutions. The main culprit was us using fp16, and there was no other choice (other than fp32) with V100s. Just in time for the BLOOM training the A100s arrived, so we scrambled to write a BF16 optimizer, which saved the day in combination with much cleaner data.
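To make the trade-off concrete, here is a minimal sketch of how a 3D-parallel job is sized (the numbers are illustrative, not the exact BLOOM configuration): the bandwidth-hungry tensor parallelism stays inside a node on NVLink, pipeline parallelism ships only activations between stages, and data parallelism takes whatever GPUs remain.

```python
# Illustrative 3D-parallelism sizing - hypothetical numbers, not the actual BLOOM config.
world_size = 384                # e.g. 48 nodes x 8 GPUs each
tp = 4                          # tensor parallel: splits matmuls, very chatty -> keep it intra-node (NVLink)
pp = 12                         # pipeline parallel: ships only activations between stages -> tolerates slow links
dp = world_size // (tp * pp)    # data parallel replicas get the remaining factor -> 8
assert tp * pp * dp == world_size
print(f"TP={tp} x PP={pp} x DP={dp} on {world_size} GPUs")
```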
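And a tiny runnable illustration (mine, not from the BLOOM codebase) of why fp16 was the culprit: fp16 tops out at ~65504, so large activations or gradients overflow to inf, while bf16 shares fp32's 8-bit exponent range and merely gives up precision - which is exactly what a BF16 optimizer exploits.

```python
import torch

# fp16 has a narrow dynamic range (max ~65504), so a large value overflows to inf;
# bf16 keeps fp32's exponent range and only loses mantissa precision.
x = torch.tensor([70000.0])
print(x.to(torch.float16))   # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))  # tensor([70144.], dtype=torch.bfloat16)
```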
Fired 95% of the volunteers who contributed zilch. It was complete nonsense to have hundreds of ghosts who were there watching, but who, when asked for help, pretended to not be there. It was really a tiny handful of people who did all the work. We actually could have used a lot more help.
I personally only had the time and resources to learn reactively - when something either broke or needed a speed-up. We had a team that was focused on modeling.
Well, we didn't really do much innovation modeling-wise w/ BLOOM - we didn't have much time and it was also very difficult to mod Megatron-LM - so we used their GPT model and did some small tweaks to it, like adding an additional LayerNorm right after the embedding. We were also under extreme pressure to start training long before we were ready, because we were given the A100s for 3 months from the moment they arrived. But, of course, the software wasn't ready, because we needed time w/ the A100s to develop and debug the software - not to mention an insane number of hardware failures.
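For reference, here is a minimal sketch of that kind of tweak in plain PyTorch (my reconstruction, not the actual Megatron-LM code) - BLOOM's published architecture applies an extra LayerNorm right after the token embeddings, which helps stabilize training at scale:

```python
import torch
import torch.nn as nn

# Sketch of an embedding-LayerNorm tweak (reconstruction, not Megatron-LM code):
# normalize the token embeddings before they enter the transformer blocks.
class EmbeddingWithLayerNorm(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.norm(self.embed(input_ids))
```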
Thank you for the awesome repo full of practical content! I am embarking on a similar journey, and am finding the experience from your previous trainings very insightful.
One thing I saw you mention was how you went from "Zero" to "Hero":
I am curious about your "intense learning process". Some questions I have are: