[Question] Multi-GPU PPO training raises a bug #113
Labels: algorithms, bug, question
Questions
When running Qwen-2-VL multi-GPU PPO training, the code raises an error like the following:
***** Running training *****
Training 1/1.0 epoch: 0%| | 0/104.0 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
[rank6]: Traceback (most recent call last):
[rank6]: File "", line 198, in _run_module_as_main
[rank6]: File "", line 88, in _run_code
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 508, in
[rank6]: sys.exit(main())
[rank6]: ^^^^^^
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 503, in main
[rank6]: trainer.train()
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_to_text/ppo.py", line 468, in train
[rank6]: inference_batches, training_batches = self.rollout(prompt_only_batch)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank6]: return func(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 268, in rollout
[rank6]: actor_batch, response_lens = self.actor_step(mini_batch)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 187, in actor_step
[rank6]: sequences = self.actor_model.module.generate(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank6]: return func(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/transformers/generation/utils.py", line 2255, in generate
[rank6]: result = self._sample(
[rank6]: ^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/transformers/generation/utils.py", line 3247, in _sample
[rank6]: model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1770, in prepare_inputs_for_generation
[rank6]: if cache_position[0] != 0:
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^
[rank6]: RuntimeError: CUDA error: device-side assert triggered
[rank6]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank6]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank6]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank6]:[E111 21:23:17.223268078 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f378b96c446 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f378b9166e4 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f378bd82a18 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3741814726 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f37418193f0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f3741820b5a in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f374182261d in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f378bdfd5c0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: + 0x89144 (0x7f3794334144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x1097dc (0x7f37943b47dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f378b96c446 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f378b9166e4 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f378bd82a18 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3741814726 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f37418193f0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f3741820b5a in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f374182261d in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f378bdfd5c0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: + 0x89144 (0x7f3794334144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x1097dc (0x7f37943b47dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f378b96c446 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7f374148f71b in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7f378bdfd5c0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: + 0x89144 (0x7f3794334144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x1097dc (0x7f37943b47dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
I also tried switching to DeepSpeed ZeRO-2, and the same error is raised.
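For reference, by "switching to ZeRO-2" I mean only changing the ZeRO stage in the DeepSpeed config used for the actor model; the sketch below follows the standard DeepSpeed schema, and the batch-size / precision fields are placeholders rather than the exact align-anything config values:

```python
# Rough sketch of the ZeRO-2 setting I tried (standard DeepSpeed schema;
# batch-size / precision fields are placeholders, not the real config values).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # switched from stage 3 to stage 2; the same assert is raised either way
    },
}
```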
It seems the error is raised inside `self.actor_model.module.generate`, which is called from `actor_step`:

```python
def actor_step(
    self, mini_prompt_only_batch: PromptOnlyBatch
) -> list[dict[str, Any], list[int]]:
```
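Following the hint in the log, one way to localize the assert might be to rerun with `CUDA_LAUNCH_BLOCKING=1` and, just before the failing `generate` call, verify that every token id fits inside the actor's embedding table, since the `IndexKernel.cu` "index out of bounds" assertion usually comes from an embedding or other index lookup. This is only a rough sketch; `check_token_ids` and where it is called are hypothetical and not part of align-anything:

```python
import os

# Must be set before CUDA is initialized so the assert surfaces at the real call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def check_token_ids(model, input_ids) -> None:
    """Hypothetical helper: flag token ids that fall outside the model's embedding table."""
    vocab_size = model.get_input_embeddings().num_embeddings
    bad = (input_ids < 0) | (input_ids >= vocab_size)
    if bad.any():
        raise ValueError(
            f"out-of-range token ids {input_ids[bad].unique().tolist()} "
            f"(embedding table size: {vocab_size})"
        )


# e.g. inside actor_step, just before self.actor_model.module.generate(...):
# check_token_ids(self.actor_model.module, mini_prompt_only_batch["input_ids"])
```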