Following the training tutorial in the README without modifying any code or parameters, I trained the model on a machine with 8×H800 80GB GPUs. Each epoch takes 55 minutes, while total_epochs in the code defaults to 1000. This seems inconsistent with the 10-hour training time mentioned in the paper. I'm not sure what went wrong. Could you please provide some guidance?
Hi there. Total training time depends a lot on your infrastructure. We trained with a batch size of 16 on 4xV100 32GB GPUs, pulling data directly from a fast-access storage system. Many details can affect the training time.
For reproducibility, I believe it is more useful to match the total number of steps, even if it takes more time. All my checkpoints are at around 100k steps (102400 to be more specific). Consider training until you reach that number of steps.
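To translate the step target above into an epoch count for a different setup, a small helper like this can be used. It's a minimal sketch: the function name and the dataset size and batch size in the example are hypothetical placeholders, not values from this repo's config.

```python
import math

def epochs_for_target_steps(target_steps, dataset_size, batch_size):
    """Epochs needed to accumulate target_steps optimizer steps,
    assuming one step per batch and a full pass per epoch."""
    steps_per_epoch = dataset_size // batch_size
    return math.ceil(target_steps / steps_per_epoch)

# Hypothetical example: with 16384 samples and batch size 16,
# one epoch is 1024 steps, so 102400 steps take 100 epochs.
print(epochs_for_target_steps(102400, 16384, 16))  # → 100
```

With larger effective batch sizes (e.g. more GPUs), steps per epoch drop, so more epochs are needed to reach the same step count.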