ERROR when training IQL with checkpoint=True on multiple GPUs #47
-
When I train bert.phase1 on multiple GPUs with checkpointing turned on and a batch size of 6500 on five 4090s, everything was fine at first: it took 15 hours to train on 170 million samples and produced a well-behaved model. Then the training program crashed with the following output:

Without checkpointing, I can only set the batch size up to 600 on my five 4090s, and training is then so slow that it is almost impossible to complete.
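For context, gradient checkpointing trades compute for memory: activations are recomputed during the backward pass instead of being stored, which is why much larger batch sizes fit but each step costs extra forward work. A minimal sketch of how it is enabled in PyTorch with `torch.utils.checkpoint.checkpoint_sequential` (a toy model for illustration, not the bert.phase1 code from this thread):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy stand-in model; the real training code is not shown in this thread.
model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(8)])
x = torch.randn(64, 256, requires_grad=True)

# Split the sequential model into 4 segments. Only the segment boundaries'
# activations are kept; everything inside a segment is recomputed on backward,
# cutting activation memory at the cost of extra forward compute.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()  # recomputes each segment's forward during backward
```

This memory/compute trade-off matches the numbers above: checkpointing allows a ~10x larger batch, but a run without it can still be faster per sample when memory is not the bottleneck.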
Replies: 2 comments
-
Sorry, I tried training without checkpointing; it's faster than with it.
-
Closing this.