Thank you for the kind words, @haidahaha
We spent about 8 months training smaller models and running a lot of benchmarks to decide which framework would work best. The HPC we were using had extremely low 135Gbps inter-node comms, so DeepSpeed ZeRO was unfortunately not an option - ZeRO shards states across all GPUs and re-gathers them every step, which hammers a slow inter-node fabric - and we chose Megatron-LM w/ 3D parallelism, later switching to Megatron-DeepSpeed to improve the performance even more.

The most intense learning came once we tried to train a 104B model: it was quite a disaster, since we couldn't overcome the training divergence no matter what we tried - and we tried a lot of different workarounds/solutions. The main culprit was us using fp16, and there was no other choice (other than fp32) with V100s. Just in time for the BLOOM training the A100s arrived, so we scrambled to write a BF16 optimizer, which saved the day in combination with much cleaner data.
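To make the trade-off concrete, here is a minimal sketch of how a 3D-parallel job is sized (the numbers are illustrative, not the exact BLOOM configuration): the bandwidth-hungry tensor parallelism stays inside a node on NVLink, pipeline parallelism ships only activations between stages, and data parallelism takes whatever GPUs remain.

```python
# Illustrative 3D-parallelism sizing - hypothetical numbers, not the actual BLOOM config.
world_size = 384                # e.g. 48 nodes x 8 GPUs each
tp = 4                          # tensor parallel: splits matmuls, very chatty -> keep it intra-node (NVLink)
pp = 12                         # pipeline parallel: ships only activations between stages -> tolerates slow links
dp = world_size // (tp * pp)    # data parallel replicas get the remaining factor -> 8
assert tp * pp * dp == world_size
print(f"TP={tp} x PP={pp} x DP={dp} on {world_size} GPUs")
```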
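And a tiny runnable illustration (mine, not from the BLOOM codebase) of why fp16 was the culprit: fp16 tops out at ~65504, so large activations or gradients overflow to inf, while bf16 shares fp32's 8-bit exponent range and merely gives up precision - which is exactly what a BF16 optimizer exploits.

```python
import torch

# fp16 has a narrow dynamic range (max ~65504), so a large value overflows to inf;
# bf16 keeps fp32's exponent range and only loses mantissa precision.
x = torch.tensor([70000.0])
print(x.to(torch.float16))   # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))  # tensor([70144.], dtype=torch.bfloat16)
```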
Fired 95% of the volunteers who contributed zilch. It was complete nonsense to have hundreds of ghosts who were there watching, but who, when asked for help, pretended to not be there. It was really a tiny handful of people who did all the work. We actually could have used a lot more help.
I personally only had the time and resources to learn reactively - when something either broke or needed a speed-up. We had a team that was focused on modeling.
Well, we didn't really do much innovation modeling-wise w/ BLOOM - we didn't have much time and it was also very difficult to mod Megatron-LM - so we used their GPT model and did some small tweaks to it, like adding an additional LayerNorm right after the embedding. We were also under extreme pressure to start training long before we were ready, because we were given the A100s for 3 months from the moment they arrived. But, of course, the software wasn't ready, because we needed time w/ the A100s to develop and debug the software - not to mention an insane number of hardware failures.
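For reference, here is a minimal sketch of that kind of tweak in plain PyTorch (my reconstruction, not the actual Megatron-LM code) - BLOOM's published architecture applies an extra LayerNorm right after the token embeddings, which helps stabilize training at scale:

```python
import torch
import torch.nn as nn

# Sketch of an embedding-LayerNorm tweak (reconstruction, not Megatron-LM code):
# normalize the token embeddings before they enter the transformer blocks.
class EmbeddingWithLayerNorm(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.norm(self.embed(input_ids))
```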
Thank you for the awesome repo full of practical content! I am embarking on a similar journey, and am finding the experience from your previous trainings very insightful.
One thing I saw you mention was how you went from "Zero" to "Hero":
I am curious about your "intense learning process". Some questions I have are: