The training can not converge and the value of grad_norm is nan. #36

Open
sjtuljw520 opened this issue Jul 12, 2024 · 1 comment

Comments

@sjtuljw520

Hi, thank you for sharing the code.
I tried to train my model with the config files "configs/tracking/petr/f1_q500_800x320.py" and "configs/tracking/petr/f3_q500_800x320.py", but neither the first stage (f1_q500_800x320.py) nor the second stage (f3_q500_800x320.py) converges. Specifically, grad_norm becomes nan during training.

You can see the training logs at the links below. Can you help me figure out what is happening here? Maybe there is a mistake in the config files?
https://github.com/sjtuljw520/papers_and_others/blob/main/traning_log_first_stage.log
https://github.com/sjtuljw520/papers_and_others/blob/main/traning_log_second_stage.log

@ziqipang
Contributor

@sjtuljw520 Interesting, I haven't encountered this problem before. As a sanity check, can you run inference correctly with my checkpoints?
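
A complementary check on the training side is to find out which parameters first receive a non-finite gradient. The sketch below is framework-agnostic and only illustrative: it assumes the logged grad_norm is the global L2 norm over all gradient entries (the thread does not confirm this), and the parameter names and values are made up.

```python
import math

def find_nonfinite_grads(named_grads):
    """Return the names of parameters whose gradient contains nan or inf."""
    return [name for name, grad in named_grads.items()
            if any(not math.isfinite(g) for g in grad)]

def total_grad_norm(named_grads):
    """Global L2 norm over all gradient entries (assumed meaning of grad_norm)."""
    return math.sqrt(sum(g * g for grad in named_grads.values() for g in grad))

# Hypothetical gradients; names and values are invented for illustration.
grads = {
    "backbone.weight": [0.1, -0.2, 0.05],
    "head.bias": [0.3, float("nan")],
}

bad = find_nonfinite_grads(grads)   # parameters to inspect first
norm = total_grad_norm(grads)       # a single nan entry makes the whole norm nan
print(bad, norm)
```

A single nan entry is enough to turn the global norm into nan, so listing the offending parameters at the first bad iteration narrows the search. In a PyTorch training loop the same idea would typically be applied after the backward pass by testing `torch.isfinite(p.grad).all()` for each entry of `model.named_parameters()`.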
