[Question] multi-gpu PPO training raise bug #113

Open · JoeYing1019 opened this issue Jan 11, 2025 · 2 comments
Labels: algorithms (Proposal of new algorithms), bug (Something isn't working), question (Further information is requested)

JoeYing1019 commented Jan 11, 2025

Required prerequisites

Questions

When conducting Qwen2-VL multi-GPU PPO training, the code raises an error like:

```
***** Running training *****
Training 1/1.0 epoch: 0%| | 0/104.0 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [0,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
[rank6]: Traceback (most recent call last):
[rank6]: File "", line 198, in _run_module_as_main
[rank6]: File "", line 88, in _run_code
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 508, in
[rank6]: sys.exit(main())
[rank6]: ^^^^^^
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 503, in main
[rank6]: trainer.train()
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_to_text/ppo.py", line 468, in train
[rank6]: inference_batches, training_batches = self.rollout(prompt_only_batch)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank6]: return func(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 268, in rollout
[rank6]: actor_batch, response_lens = self.actor_step(mini_batch)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/opt/tiger/align-anything-dev-video/align_anything/trainers/text_image_to_text/ppo.py", line 187, in actor_step
[rank6]: sequences = self.actor_model.module.generate(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank6]: return func(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/transformers/generation/utils.py", line 2255, in generate
[rank6]: result = self._sample(
[rank6]: ^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/transformers/generation/utils.py", line 3247, in _sample
[rank6]: model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/home/tiger/.local/lib/python3.11/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1770, in prepare_inputs_for_generation
[rank6]: if cache_position[0] != 0:
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^
[rank6]: RuntimeError: CUDA error: device-side assert triggered
[rank6]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank6]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank6]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank6]:[E111 21:23:17.223268078 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f378b96c446 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f378b9166e4 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f378bd82a18 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3741814726 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f37418193f0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f3741820b5a in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f374182261d in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f378bdfd5c0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: + 0x89144 (0x7f3794334144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x1097dc (0x7f37943b47dc in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 6] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f378b96c446 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f378b9166e4 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f378bd82a18 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f3741814726 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f37418193f0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f3741820b5a in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f374182261d in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7f378bdfd5c0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #8: + 0x89144 (0x7f3794334144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x1097dc (0x7f37943b47dc in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f378b96c446 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7f374148f71b in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7f378bdfd5c0 in /home/tiger/.local/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: + 0x89144 (0x7f3794334144 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x1097dc (0x7f37943b47dc in /usr/lib/x86_64-linux-gnu/libc.so.6)
```
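As the log itself notes, the device-side assert is reported asynchronously, so the frame shown (`cache_position[0] != 0`) may not be where the bad index actually originates. A minimal sketch of forcing synchronous reporting before re-running (just an illustration; set the variable in whatever launcher you normally use):

```python
# Sketch: make CUDA kernel launches synchronous so the failing kernel is
# reported at its real call site instead of a later, unrelated API call.
# CUDA_LAUNCH_BLOCKING must be set before torch initializes its CUDA context.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  -- imported only after the environment is configured

assert torch.cuda.is_available()
# ... launch the PPO trainer as usual from here ...
```

(The `TORCH_USE_CUDA_DSA` hint in the log is a compile-time option and needs a PyTorch build with device-side assertions enabled, so it is not shown here.)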

I also tried switching to DeepSpeed ZeRO-2, and the same error was raised.
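For context, switching ZeRO stages only touches the `zero_optimization` block of the DeepSpeed config; a minimal sketch of that change (standard DeepSpeed keys, placeholder values, not the exact config used here):

```python
# Sketch: the only part of the DeepSpeed config that differs between the runs.
# Everything else (optimizer, precision, batch sizes) is left untouched.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder
    "zero_optimization": {
        "stage": 2,  # the same assert fires with the original stage as well
    },
    "bf16": {"enabled": True},  # placeholder precision setting
}
```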

It seems the error is raised inside `self.actor_model.module.generate`:

```python
def actor_step(
    self, mini_prompt_only_batch: PromptOnlyBatch
) -> list[dict[str, Any], list[int]]:
    infer_batch = self.infer_batch(mini_prompt_only_batch)
    actor_batch = copy.deepcopy(infer_batch)
    sequences = self.actor_model.module.generate(
        **infer_batch,
        generation_config=self.generation_config,
        synced_gpus=True,
        do_sample=True,
    )
```
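Since the underlying assert comes from an index-out-of-bounds in `IndexKernel.cu`, one way to narrow it down is a CPU-side sanity check on the batch right before this `generate` call. The following is only a debugging sketch; `input_ids`, `image_grid_thw`, and `config.image_token_id` are assumed key/attribute names for Qwen2-VL-style inputs:

```python
# Debugging sketch: validate the batch on CPU before generate() so an
# out-of-range index is caught with a readable message instead of a
# device-side assert.
input_ids = infer_batch["input_ids"].detach().cpu()

# 1) Every token id must fall inside the embedding table.
vocab_size = self.actor_model.module.config.vocab_size
assert input_ids.min().item() >= 0 and input_ids.max().item() < vocab_size, (
    f"token id out of range: min={int(input_ids.min())}, "
    f"max={int(input_ids.max())}, vocab_size={vocab_size}"
)

# 2) For Qwen2-VL, count the image placeholder tokens; a mismatch with the
#    vision features derived from image_grid_thw is a common source of
#    indexing errors during multimodal generation.
image_token_id = getattr(self.actor_model.module.config, "image_token_id", None)
if image_token_id is not None and "image_grid_thw" in infer_batch:
    n_image_tokens = int((input_ids == image_token_id).sum())
    print(f"image placeholder tokens in this mini-batch: {n_image_tokens}")
```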
Gaiejj (Member) commented Jan 12, 2025

Thank you very much for informing us of this bug; we will look into the issue and fix it as soon as possible! 😊

Gaiejj (Member) commented Jan 18, 2025

Hey! I think we've finally pinpointed the issue. Please refer to what's mentioned in QwenLM/Qwen2.5-VL#596.
We will provide a solution in our framework soon. Thanks for raising this!
