Sudden mid-training error: NCCL watchdog thread terminated with exception #1817

Open
Wuyingwen opened this issue Aug 26, 2024 · 9 comments

@Wuyingwen

Describe the bug
When fine-tuning MiniCPM-V-2.6 with the swift sft command, training suddenly fails partway through with:
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1250, OpType=ALLREDUCE, NumelIn=20280320, NumelOut=20280320, Timeout(ms)=1800000) ran for 1800782 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
The error means a rank was waiting for some GPU to finish its computation so the all_reduce could run, but that GPU never produced its result, and the collective eventually timed out. If the data were the problem, bad samples should have been skipped at load time; how do I debug a hang where one GPU simply never finishes computing?
My launch command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 swift sft \
    --model_type minicpm-v-v2_6-chat \
    --model_id_or_path ../checkpoint/openbmb/MiniCPM-V-2_6 \
    --sft_type lora \
    --dataset xxx.json \
    --save_steps 50 \
    --val_dataset xxx.json \
    --deepspeed default-zero2

torch version: 2.1.2+cu118
[mid-training screenshot omitted]
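A stopgap worth trying while debugging is to raise the NCCL collective timeout above the 30-minute default, so a slow rank surfaces as a hang you can inspect instead of a watchdog abort. A minimal sketch with the standard PyTorch API (the two-hour value is arbitrary):

    # Minimal sketch: lengthen the NCCL collective timeout at process-group
    # init so slow ranks don't trip the watchdog before you can attach a debugger.
    from datetime import timedelta
    import torch.distributed as dist

    dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))

When launching through transformers' Trainer, the ddp_timeout training argument sets the same timeout.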

@tastelikefeet
Collaborator

This is odd; it shouldn't be possible to block for 30 minutes without getting data. Run
py-spy dump --pid xxx
on each process to see where it is blocked.
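For a multi-rank job it helps to dump every training process in one pass; a small wrapper sketch (py-spy's dump command is real, the PID list below is only illustrative):

    # Wrapper sketch: dump the stack of each rank's process with py-spy.
    # The PIDs are illustrative; collect the real ones from `ps` or the
    # launcher output.
    import subprocess

    RANK_PIDS = [175053, 177063]

    for pid in RANK_PIDS:
        print(f"=== process {pid} ===")
        subprocess.run(["py-spy", "dump", "--pid", str(pid)], check=False)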

@yunkchen

SIZE_FACTOR=28 MAX_PIXELS=100352 NFRAMES=24 \
swift sft \
    --model_type qwen2-vl-7b-instruct \
    --model_id_or_path Qwen2-VL-7B-Instruct \
    --sft_type full \
    --freeze_vit false \
    --max_length 2048 \
    --lazy_tokenize true \
    --gradient_accumulation_step 2 \
    --batch_size 1 \
    --num_train_epochs 1 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.05 \
    --save_steps 200 \
    --logging_steps 1 \
    --dataloader_num_workers 8 \
    --dataset qwen2-vl-val.jsonl \
    --dataset_test_ratio 0.005 \
    --output_dir qwen2-vl-7b-20240912 \
    --deepspeed default-zero2

Same problem here.

@yunkchen

[quoted: my swift sft command and report above]

py-spy shows two main kinds of results across the processes:

Process 175053: /usr/local/bin/python -u /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --model_type qwen2-vl-7b-instruct --model_id_or_path Qwen2-VL-7B-Instruct --sft_type full --freeze_vit false --max_length 2048 --lazy_tokenize true --gradient_accumulation_step 2 --batch_size 1 --num_train_epochs 1 --learning_rate 1e-5 --weight_decay 0.1 --lr_scheduler_type cosine --warmup_ratio 0.05 --save_steps 200 --logging_steps 1 --dataloader_num_workers 1 --dataset qwen2-vl-val.jsonl --dataset_test_ratio 0.005 --output_dir qwen2-vl-7b-20240912 --deepspeed default-zero2
Python v3.10.14 (/usr/local/bin/python3.10)

Thread 175053 (active): "MainThread"
    synchronize (torch/cuda/__init__.py:792)
    synchronize (deepspeed/accelerator/cuda_accelerator.py:78)
    independent_gradient_partition_epilogue (deepspeed/runtime/zero/stage_1_and_2.py:764)
    overlapping_partition_gradients_reduce_epilogue (deepspeed/runtime/zero/stage_1_and_2.py:863)
    allreduce_gradients (deepspeed/runtime/engine.py:1912)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (deepspeed/runtime/engine.py:1993)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:2151)
    training_step (transformers/trainer.py:3452)
    _inner_training_loop (transformers/trainer.py:2326)
    train (transformers/trainer.py:1991)
    train (swift/trainers/mixin.py:426)
    llm_sft (swift/llm/sft.py:413)
    x_main (swift/utils/run_utils.py:32)
    <module> (swift/cli/sft.py:5)
Thread 175383 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 175888 (idle): "Thread-2"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 176363 (idle): "Thread-3 (_pin_memory_loop)"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    do_one_step (torch/utils/data/_utils/pin_memory.py:31)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:54)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 176488 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Process 177063: /usr/local/bin/python -u /usr/local/lib/python3.10/site-packages/swift/cli/sft.py --model_type qwen2-vl-7b-instruct --model_id_or_path Qwen2-VL-7B-Instruct --sft_type full --freeze_vit false --max_length 2048 --lazy_tokenize true --gradient_accumulation_step 2 --batch_size 1 --num_train_epochs 1 --learning_rate 1e-5 --weight_decay 0.1 --lr_scheduler_type cosine --warmup_ratio 0.05 --save_steps 200 --logging_steps 1 --dataloader_num_workers 1 --dataset qwen2-vl-val.jsonl --dataset_test_ratio 0.005 --output_dir qwen2-vl-7b-20240912 --deepspeed default-zero2
Python v3.10.14 (/usr/local/bin/python3.10)

Thread 177063 (idle): "MainThread"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    _worker_loop (torch/utils/data/_utils/worker.py:275)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _launch (multiprocessing/popen_fork.py:71)
    __init__ (multiprocessing/popen_fork.py:19)
    _Popen (multiprocessing/context.py:281)
    _Popen (multiprocessing/context.py:224)
    start (multiprocessing/process.py:121)
    __init__ (torch/utils/data/dataloader.py:1040)
    _get_iterator (torch/utils/data/dataloader.py:387)
    __iter__ (torch/utils/data/dataloader.py:439)
    __iter__ (accelerate/data_loader.py:451)
    _inner_training_loop (transformers/trainer.py:2284)
    train (transformers/trainer.py:1991)
    train (swift/trainers/mixin.py:426)
    llm_sft (swift/llm/sft.py:413)
    x_main (swift/utils/run_utils.py:32)
    <module> (swift/cli/sft.py:5)
Thread 177190 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 177191 (idle): "Thread-3 (_serve)"
    accept (socket.py:293)
    accept (multiprocessing/connection.py:609)
    accept (multiprocessing/connection.py:463)
    _serve (multiprocessing/resource_sharer.py:138)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)

@Nioolek

Nioolek commented Sep 18, 2024

[quoted: yunkchen's swift sft command and report above]

Same issue here. With --freeze_vit false training hangs; with --freeze_vit true it trains normally.
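That symptom suggests the vision tower receives gradients on some ranks but not others. A quick diagnostic sketch under plain DDP (model is a generic handle for the network; with DeepSpeed the engine manages gradients, so inspect before engine.step):

    # Diagnostic sketch: after loss.backward() on a suspect batch, list the
    # trainable parameters that received no gradient -- these are the ones
    # whose all-reduce the other ranks end up waiting on.
    def report_gradless_params(model):
        for name, param in model.named_parameters():
            if param.requires_grad and param.grad is None:
                print("no grad:", name)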

@Jintao-Huang
Collaborator

[quoted: Nioolek's comment above]

#2114

@yunkchen

[quoted: Jintao-Huang's reply above, pointing to #2114]

After pulling the latest code and upgrading transformers to 4.45.0 and accelerate to 0.34.2, training still hangs:

Train:   0%|          | 0/40340 [00:00<?, ?it/s][WARNING:swift] Current length of row(2130) is larger than the max_length(2048), deleted.
[WARNING:swift] Current length of row(3365) is larger than the max_length(2048), deleted.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.
[ERROR:swift] Error occurs in lazy tokenize: File not found: /mnt_wg/zhoumo.xjq/TDS1M/video/335337510318.mp4
[INFO:swift] Using environment variable `NFRAMES`, Setting nframes: 24.
[INFO:swift] Setting fps: None. You can adjust this hyperparameter through the environment variable: `FPS`.
[INFO:swift] Setting min_pixels: 100352. You can adjust this hyperparameter through the environment variable: `MIN_PIXELS`.
[INFO:swift] Setting total_pixels: 19267584. You can adjust this hyperparameter through the environment variable: `TOTAL_PIXELS`.

@Jintao-Huang
Collaborator

Run pip list | grep swift and show the output.

@yunkchen

Run pip list | grep swift and show the output.

root@dlcprsc93a7i8zci-master-0:~# pip show ms-swift
Name: ms-swift
Version: 2.5.0.dev0
Summary: Swift: Scalable lightWeight Infrastructure for Fine-Tuning
Home-page: https://github.com/modelscope/swift
Author: DAMO ModelScope teams
Author-email: [email protected]
License: Apache License 2.0
Location: /root/swift
Editable project location: /root/swift
Requires: accelerate, addict, aiohttp, attrdict, binpacking, dacite, datasets, einops, importlib_metadata, jieba, matplotlib, modelscope, nltk, numpy, oss2, pandas, peft, requests, rouge, safetensors, tensorboard, tqdm, transformers, transformers_stream_generator, trl
Required-by:

@zsxm1998
Contributor

zsxm1998 commented Mar 7, 2025

[quoted: the pip show ms-swift output above]

That looks like a 2.x swift; you probably need to move to 3.0 or later. As for the cause: most likely some batch is pure text, so no data flows through the vision encoder on that rank even though the encoder is being trained. That rank then falls out of step with the others (which do have image data), and NCCL blocks.
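A common workaround for that failure mode (a sketch only, not swift's actual fix; model.visual is assumed to be the vision tower, as in Qwen2-VL) is to add a zero-weighted term that touches the vision parameters on text-only batches, so every rank contributes vision-encoder gradients and the collectives stay aligned:

    # Workaround sketch, not swift's actual fix: on text-only batches, add a
    # zero-valued term over every trainable vision parameter so all ranks
    # produce vision-encoder gradients and the all-reduce stays in sync.
    # `model.visual` is assumed to be the vision tower, as in Qwen2-VL.
    def loss_with_vision_anchor(loss, model, batch_has_images):
        if not batch_has_images:
            dummy = sum(p.sum() for p in model.visual.parameters()
                        if p.requires_grad)
            loss = loss + dummy * 0.0
        return loss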
