Following the training tutorial in the README without modifying any code or parameters, I trained the model on a machine with 8×H800 80GB GPUs. Each epoch takes 55 minutes, while total_epochs in the code defaults to 1000. This seems inconsistent with the 10-hour training time mentioned in the paper. I'm not sure what went wrong. Could you please provide some guidance?
Hi there. Total training time depends a lot on your infrastructure. We trained with a batch size of 16 on 4xV100 32GB GPUs, pulling data directly from a fast-access storage system. Many details can affect the training time.
For reproducibility, I believe it is more useful to match the total number of steps, even if it takes more time. All my checkpoints are at around 100k steps (102400 to be more specific). Consider training until you reach that number of steps.
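To translate the step target above into an epoch count for a different setup, a small helper like this can be used. It's a minimal sketch: the function name and the dataset size and batch size in the example are hypothetical placeholders, not values from this repo's config.

```python
import math

def epochs_for_target_steps(target_steps, dataset_size, batch_size):
    """Epochs needed to accumulate target_steps optimizer steps,
    assuming one step per batch and a full pass per epoch."""
    steps_per_epoch = dataset_size // batch_size
    return math.ceil(target_steps / steps_per_epoch)

# Hypothetical example: with 16384 samples and batch size 16,
# one epoch is 1024 steps, so 102400 steps take 100 epochs.
print(epochs_for_target_steps(102400, 16384, 16))  # → 100
```

With larger effective batch sizes (e.g. more GPUs), steps per epoch drop, so more epochs are needed to reach the same step count.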