The training can not converge and the value of grad_norm is nan. #36

sjtuljw520 · 2024-07-12T06:16:04Z

Hi, thank you for sharing the code.
I am trying to train my model with the config files "configs/tracking/petr/f1_q500_800x320.py" and "configs/tracking/petr/f3_q500_800x320.py", but neither the first stage (with f1_q500_800x320.py) nor the second stage (with f3_q500_800x320.py) converges. Specifically, grad_norm becomes nan during training.

You can see the training logs at the links below. Can you help me figure out what is happening here? Maybe there are some mistakes in the config files?
https://github.com/sjtuljw520/papers_and_others/blob/main/traning_log_first_stage.log
https://github.com/sjtuljw520/papers_and_others/blob/main/traning_log_second_stage.log

ziqipang · 2024-07-25T05:37:52Z

@sjtuljw520 Interesting, I haven't encountered this problem before. As a sanity check, can you run inference correctly with my checkpoints?
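
To narrow down where training breaks, it may also help to find the first logged iteration at which grad_norm stops being finite. Below is a minimal sketch, not part of this repository. It assumes the log is a plain-text, mmcv-style log in which each line contains something like `Epoch [1][50/7033]` and `grad_norm: <value>`; the two regular expressions, the function name `first_non_finite_grad_norm`, and the sample lines are all hypothetical and would need adjusting to the real log format.

```python
import math
import re

# Assumed line shape (hypothetical): "Epoch [1][50/7033] ... grad_norm: 58.1"
ITER_RE = re.compile(r"Epoch \[(\d+)\]\[(\d+)/\d+\]")
GRAD_RE = re.compile(r"grad_norm:\s*([^\s,]+)")


def first_non_finite_grad_norm(lines):
    """Return (epoch, iteration, raw_value) for the first nan/inf grad_norm, or None."""
    for line in lines:
        grad = GRAD_RE.search(line)
        if grad is None:
            continue  # line does not report grad_norm
        try:
            value = float(grad.group(1))  # float() accepts "nan" and "inf"
        except ValueError:
            continue  # unparsable value, skip the line
        if not math.isfinite(value):
            pos = ITER_RE.search(line)
            if pos is None:
                return None, None, grad.group(1)
            return int(pos.group(1)), int(pos.group(2)), grad.group(1)
    return None


# Synthetic lines for illustration only; they are not taken from the linked logs.
sample = [
    "Epoch [1][50/7033]\tlr: 2.0e-04, loss: 31.2, grad_norm: 58.1",
    "Epoch [1][100/7033]\tlr: 2.0e-04, loss: nan, grad_norm: nan",
]
print(first_non_finite_grad_norm(sample))
```

For a real log, pass an open file instead of the list, for example `first_non_finite_grad_norm(open("traning_log_first_stage.log"))`. If the first non-finite value appears within the first few iterations, that would point more toward data, checkpoint, or config problems than toward a slow divergence.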
