Training suddenly fails mid-run with "NCCL watchdog thread terminated with exception" #1817
Comments
This is odd; how could it block for 30 minutes without ever getting the data?
I'm hitting the same problem.
py-spy on the training processes mainly shows two kinds of results:
Same problem here. With --freeze_vit false the training hangs; with --freeze_vit true it trains normally.
Pull the latest code, and update to transformers==4.45.0 and accelerate==0.34.2.
Please run pip list | grep swift and post the output.
That looks like swift 2.0; you probably need to move to 3.0 or later. As for the cause, it is most likely that some batch contains only plain-text samples, so no data flows into the vision encoder on that rank even though the encoder is being trained; that rank then falls out of step with the other ranks (which do have image data), and NCCL blocks.
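For what it's worth, here is a minimal sketch of the kind of guard that comment implies, assuming a map-style dataset whose samples expose an `images` list; the field name, placeholder size, and `ensure_image` helper are illustrative, not part of the swift API:

```python
# Illustrative workaround for the failure mode described above: if one rank's
# batch contains only text samples, its vision encoder receives no input while
# the other ranks still run (and sync gradients for) the vision tower, so the
# ranks drift apart and the next all_reduce blocks until the NCCL timeout.
# Giving text-only samples a tiny blank placeholder image keeps the vision
# tower active on every rank. The `images` field name is an assumption.
from PIL import Image

BLANK_IMAGE = Image.new("RGB", (32, 32), color=(127, 127, 127))

def ensure_image(sample: dict) -> dict:
    if not sample.get("images"):
        sample["images"] = [BLANK_IMAGE]
    return sample

# Apply before batching, e.g.: dataset = [ensure_image(s) for s in dataset]
```

The simpler alternative is what the earlier comment reports: keep --freeze_vit true, so the vision encoder has no trainable parameters and text-only batches can no longer desynchronize the ranks.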
Describe the bug

When fine-tuning the MiniCPM-V-2.6 model with the swift sft command, training suddenly fails partway through with:
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, NumelOut=20280320, Timeout(ms)=1800000) ran for 1800782 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
The error means that one rank keeps waiting for another GPU to finish its computation before the all_reduce, but that GPU is stuck (its computation never finishes), so the collective eventually times out. If the data itself were bad, the problematic samples could simply be skipped at loading time; how do you debug this kind of hang where a GPU just never finishes its computation?
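Not swift-specific, but two generic PyTorch-side aids for narrowing down which rank is stuck; this is a sketch that assumes you control the process-group setup yourself, and the 30 minutes in the log is the default NCCL collective timeout:

```python
import datetime
import faulthandler
import signal

import torch.distributed as dist

# Raise the collective timeout above the default 30 minutes so one slow batch
# does not immediately kill the job via the NCCL watchdog.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))

# Dump every thread's Python stack on SIGUSR1. When a rank stops progressing,
# `kill -USR1 <pid>` on that rank shows whether it is stuck in data loading,
# the forward pass, or a collective (similar to what a py-spy dump reports).
faulthandler.register(signal.SIGUSR1, all_threads=True)
```

If training is launched through the swift CLI rather than your own script, the init_process_group call is not under your control, but dumping the stuck rank's stacks (via the signal handler above or py-spy) still works on the running worker processes.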
My launch command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft \
    --model_type minicpm-v-v2_6-chat \
    --model_id_or_path ../checkpoint/openbmb/MiniCPM-V-2_6 \
    --sft_type lora \
    --dataset xxx.json \
    --save_steps 50 \
    --val_dataset xxx.json \
    --deepspeed default-zero2
torch version: 2.1.2+cu118

Mid-training: