fix_calc_max_steps #295

Open

wants to merge 2 commits into main

Conversation

Ssukriti
Collaborator

The effective batch size is always per_device_batch_size * gradient_accumulation_steps.

From my testing with gradient accumulation = 4 and num_epochs = 20:

Previously, without the fix, training ran for 80 epochs because max_steps was computed as 80, which overrides num_epochs. It takes very long to train.

After the fix:
training finishes in 23 epochs, which is close to the 20 specified. Training time is reduced.

From this blog https://lightning.ai/blog/gradient-accumulation/ : if we want a batch size of 256 but can only fit a batch size of 64 into GPU memory, we can perform gradient accumulation over four batches of size 64. (After processing all four batches, we will have the accumulated gradients equivalent to a single batch of size 256.)

The same strategy is applied here: https://discuss.huggingface.co/t/how-do-you-calculate-max-steps/40177
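
For reference, a minimal sketch of the corrected calculation under these assumptions; the function and argument names (calc_max_steps, dataset_len, num_devices, etc.) are illustrative and not the repository's actual code:

```python
import math

# Illustrative sketch only, not the repository's implementation.
def calc_max_steps(dataset_len: int,
                   per_device_batch_size: int,
                   gradient_accumulation_steps: int,
                   num_devices: int,
                   num_epochs: int) -> int:
    # Effective batch size per optimizer update:
    # per-device batch size * accumulation steps * number of devices.
    effective_batch_size = (per_device_batch_size
                            * gradient_accumulation_steps
                            * num_devices)
    # Optimizer updates per epoch, scaled by the requested number of epochs.
    steps_per_epoch = math.ceil(dataset_len / effective_batch_size)
    return steps_per_epoch * num_epochs
```

Dropping gradient_accumulation_steps from the divisor inflates max_steps by exactly that factor, which matches the 80-vs-20 epoch behaviour described above (80 = 20 × 4); since a positive max_steps overrides num_epochs in the Trainer, training then runs far longer than requested.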

Signed-off-by: Sukriti-Sharma4 <[email protected]>
Ssukriti changed the title calc_max_steps → fix_calc_max_steps on Dec 11, 2023
Signed-off-by: Sukriti-Sharma4 <[email protected]>
@Ssukriti
Collaborator Author

I am going to do a quality test with this once we figure out some other issues and have a benchmark for the code in main. This is currently paused.
