
About the time of training #5

Open
SmudgedWings opened this issue Nov 20, 2024 · 1 comment

Comments

@SmudgedWings

Following the training tutorial in the README without modifying any code or parameters, I trained the model on an 8× H800 (80 GB) machine. Each epoch takes 55 minutes, while total_epochs defaults to 1000 in the code, which works out to roughly 900 hours in total. This seems inconsistent with the 10-hour training time mentioned in the paper. I'm not sure what went wrong. Could you please provide some guidance?

@felipecadar
Collaborator

felipecadar commented Nov 21, 2024

Hi there. Total training time can depend a lot on your infrastructure. We trained with a batch size of 16 on 4× V100 32 GB GPUs, pulling data directly from fast-access storage. Many details can impact the training time.
For reproducibility, I believe it is more useful to match the total number of steps, even if that takes more time. All my checkpoints are at around 100k steps (102,400 to be specific). Consider reaching that number of steps.
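For anyone translating that step target into an epoch count on their own setup: divide the target steps by the number of optimizer steps per epoch. A minimal sketch, assuming placeholder values for `dataset_size` and `batch_size` (not the repo's actual configuration):

```python
import math

# Hypothetical values -- substitute your own dataset size and batch size.
dataset_size = 100_000   # number of training samples (assumption)
batch_size = 16          # per-step batch size mentioned in this thread
target_steps = 102_400   # checkpoint step count mentioned above

steps_per_epoch = math.ceil(dataset_size / batch_size)
epochs_needed = math.ceil(target_steps / steps_per_epoch)

print(f"{steps_per_epoch} steps/epoch -> train for ~{epochs_needed} epochs")
```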
