
deepseek-R1 AssertionError occurred in the batch request of the client #3477

Closed
Roysky opened this issue Feb 11, 2025 · 8 comments

Roysky commented Feb 11, 2025

While using deepseek-R1 for inference on 2 nodes * 8 GPUs (H800), an AssertionError occurred during the client batch request.

The specific error is as follows:

[2025-02-11 01:42:04] INFO: 10.81.10.40:51432 - "GET /v1/batches/batch_2c036fce-9c71-4d76-9fdb-4701d9f59861 HTTP/1.1" 200 OK
[2025-02-11 01:42:04] DetokenizerManager hit an exception: Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/detokenizer_manager.py", line 240, in run_detokenizer_process
manager.event_loop()
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/detokenizer_manager.py", line 143, in event_loop
self.trim_matched_stop(
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/detokenizer_manager.py", line 105, in trim_matched_stop
assert len(output) > 0
AssertionError
[2025-02-11 01:42:04] Received sigquit from a child proces. It usually means the child failed.

The environment configuration is as follows:

  • sglang version: 0.4.2.post3
  • environment: 2 nodes × 8 H800 GPUs

Startup command:

node1

python -m sglang.launch_server --model-path DeepSeek-R1 --tp 16 --nccl-init-addr 10.1.10.42:5000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0

node2

python -m sglang.launch_server --model-path DeepSeek-R1 --tp 16 --nccl-init-addr 10.1.10.42:5000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0
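For context, the failing client side follows the OpenAI-style batch flow: build a JSONL input file with one request per line, upload it, and create a batch. A minimal sketch of building that input file is below; the model name, file name, and prompt contents are assumptions, not taken from the reporter's actual client code.

```python
import json

def build_batch_input(prompts, model="DeepSeek-R1"):
    """Build the JSONL lines for an OpenAI-style /v1/batches input file.

    One line per request; custom_id must be unique within the file.
    """
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 128,
            },
        }))
    return "\n".join(lines)

if __name__ == "__main__":
    # Write the input file; it is then uploaded (purpose="batch") and
    # submitted via POST /v1/batches on the OpenAI-compatible server.
    jsonl = build_batch_input(["What is 2 + 2?", "Name a prime number."])
    with open("batch_input.jsonl", "w") as f:
        f.write(jsonl)
```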

@zhaochenyang20 (Collaborator)

Oh. For now, please do not use the batch API with DeepSeek models; we are aware of this problem. Batch requests can easily be replaced with individual chat completion requests.
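As a workaround, the suggestion above can be sketched as sending one chat completion request per prompt instead of a batch. This is a minimal stdlib-only sketch, not sglang's client code; the server address (sglang's default port 30000) and model name are assumptions:

```python
import json
import urllib.request

def build_payload(prompt, model="DeepSeek-R1"):
    """Build a single /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat_completion(prompt, base_url="http://127.0.0.1:30000"):
    """Send one chat completion request instead of using the batch API."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (assuming a running server at http://127.0.0.1:30000):
#   out = chat_completion("Hello!")
#   print(out["choices"][0]["message"]["content"])
```

Looping over the prompts and calling `chat_completion` for each one sidesteps the batch code path where the AssertionError occurs.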

@tanconghui (Contributor)

Do you have a plan to fix this issue? We need the batch API in our scenario.

@zhaochenyang20 (Collaborator)

Yeah. As mentioned, @FrankLeeeee is working on batch support for DeepSeek models. Stay tuned, thanks!

@FrankLeeeee (Collaborator)

@tanconghui @Roysky Do you still encounter this issue with the latest release? I cannot reproduce the error. If you can provide a script that reproduces it, that would help as well.

FrankLeeeee mentioned this issue Feb 21, 2025
@FrankLeeeee (Collaborator)

@tanconghui @Roysky You can take a look at #3754; I no longer encounter the error with this fix.

@tanconghui (Contributor)

Thanks, FrankLeeeee. I also noticed this issue. But maybe it would be better to use a UUID instead of the custom_id as the request id? For example, if two batches are processed at the same time and samples with the same custom_id exist in both batches, the current solution in #3754 still seems problematic.
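The collision concern above can be illustrated with a small sketch. `make_request_id` is a hypothetical helper, not sglang's actual code: the idea is that `custom_id` only has to be unique within one batch file, so an internal request id should mix in the batch id and a fresh UUID.

```python
import uuid

def make_request_id(batch_id, custom_id):
    """Derive a collision-free internal request id.

    custom_id is only unique within a single batch, so two batches running
    concurrently may reuse the same custom_id; appending the batch id and a
    random UUID keeps internal ids distinct (hypothetical helper).
    """
    return f"{batch_id}-{custom_id}-{uuid.uuid4().hex}"

# Two concurrent batches reusing the same custom_id no longer collide:
rid_a = make_request_id("batch_a", "request-0")
rid_b = make_request_id("batch_b", "request-0")
assert rid_a != rid_b
```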


@zhaochenyang20 (Collaborator)

@tanconghui We just merged this into main. Thanks!
