[Question] Multi-GPU PPO training raises a bug #113
Labels: algorithms, bug, question
Questions
When running Qwen-2-VL multi-GPU PPO training, the code raises an error like the following:
***** Running training *****
Training 1/1.0 epoch: 0%| | 0/104.0 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
[rank6]: Traceback (most recent call last):
[rank6]: File "", line 198, in _run_module_as_main
[rank6]: File "", line 88, in _run_code
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 508, in
[rank6]: sys.exit(main())
[rank6]: ^^^^^^
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 503, in main
[rank6]: trainer.train()
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_to_text/ppo.py", line 468, in train
[rank6]: inference_batches, training_batches = self.rollout(prompt_only_batch)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank6]: return func(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 268, in rollout
[rank6]: actor_batch, response_lens = self.actor_step(mini_batch)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 187, in actor_step
[rank6]: sequences = self.actor_model.module.generate(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank6]: return func(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/transformers/generation/utils.py", line 2255, in generate
[rank6]: result = self._sample(
[rank6]: ^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/transformers/generation/utils.py", line 3247, in _sample
[rank6]: model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1770, in prepare_inputs_for_generation
[rank6]: if cache_position[0] != 0:
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^
[rank6]: RuntimeError: CUDA error: device-side assert triggered
[rank6]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank6]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank6]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank6]:[E111 21:23:17.223268078 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f378b96c446 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f378b9166e4 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f378bd82a18 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3741814726 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f37418193f0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f3741820b5a in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f374182261d in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f378bdfd5c0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: + 0x89144 (0x7f3794334144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x1097dc (0x7f37943b47dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f378b96c446 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f378b9166e4 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f378bd82a18 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3741814726 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f37418193f0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f3741820b5a in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f374182261d in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f378bdfd5c0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: + 0x89144 (0x7f3794334144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x1097dc (0x7f37943b47dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f378b96c446 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7f374148f71b in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7f378bdfd5c0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: + 0x89144 (0x7f3794334144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x1097dc (0x7f37943b47dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
I also tried switching to DeepSpeed ZeRO-2, and the same error is raised.
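For reference, by "switching to ZeRO-2" I mean only changing the ZeRO stage in the DeepSpeed config used for the actor model; the sketch below follows the standard DeepSpeed schema, and the batch-size / precision fields are placeholders rather than the exact align-anything config values:

```python
# Rough sketch of the ZeRO-2 setting I tried (standard DeepSpeed schema;
# batch-size / precision fields are placeholders, not the real config values).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # switched from stage 3 to stage 2; the same assert is raised either way
    },
}
```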
It seems the error is raised inside `self.actor_model.module.generate`, which is called from `actor_step`:

```python
def actor_step(
    self, mini_prompt_only_batch: PromptOnlyBatch
) -> list[dict[str, Any], list[int]]:
```
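Following the hint in the log, one way to localize the assert might be to rerun with `CUDA_LAUNCH_BLOCKING=1` and, just before the failing `generate` call, verify that every token id fits inside the actor's embedding table, since the `IndexKernel.cu` "index out of bounds" assertion usually comes from an embedding or other index lookup. This is only a rough sketch; `check_token_ids` and where it is called are hypothetical and not part of align-anything:

```python
import os

# Must be set before CUDA is initialized so the assert surfaces at the real call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def check_token_ids(model, input_ids) -> None:
    """Hypothetical helper: flag token ids that fall outside the model's embedding table."""
    vocab_size = model.get_input_embeddings().num_embeddings
    bad = (input_ids < 0) | (input_ids >= vocab_size)
    if bad.any():
        raise ValueError(
            f"out-of-range token ids {input_ids[bad].unique().tolist()} "
            f"(embedding table size: {vocab_size})"
        )


# e.g. inside actor_step, just before self.actor_model.module.generate(...):
# check_token_ids(self.actor_model.module, mini_prompt_only_batch["input_ids"])
```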