[Bug] Server gets stuck when running python -m sglang.launch_server #3488

Tian14267 opened this issue Feb 11, 2025 · 7 comments

@Tian14267

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When I run python -m sglang.launch_server, it gets stuck here:

INFO 02-11 10:25:22 __init__.py:190] Automatically detected platform cuda.
[2025-02-11 10:25:26] server_args=ServerArgs(model_path='/data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B', tokenizer_path='/data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='/data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=8001, mem_fraction_static=0.85, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=4, stream_interval=1, stream_output=False, random_seed=261414591, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=80, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False)
INFO 02-11 10:25:29 __init__.py:190] Automatically detected platform cuda.
INFO 02-11 10:25:29 __init__.py:190] Automatically detected platform cuda.
INFO 02-11 10:25:29 __init__.py:190] Automatically detected platform cuda.
INFO 02-11 10:25:29 __init__.py:190] Automatically detected platform cuda.
INFO 02-11 10:25:29 __init__.py:190] Automatically detected platform cuda.
[2025-02-11 10:25:32 TP3] Init torch distributed begin.
[2025-02-11 10:25:32 TP0] Init torch distributed begin.
[2025-02-11 10:25:32 TP1] Init torch distributed begin.
[2025-02-11 10:25:32 TP2] Init torch distributed begin.
[2025-02-11 10:25:33 TP1] sglang is using nccl==2.21.5
[2025-02-11 10:25:33 TP0] sglang is using nccl==2.21.5
[2025-02-11 10:25:33 TP2] sglang is using nccl==2.21.5
[2025-02-11 10:25:33 TP3] sglang is using nccl==2.21.5
[2025-02-11 10:25:33 TP1] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-02-11 10:25:33 TP2] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-02-11 10:25:33 TP0] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-02-11 10:25:33 TP3] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[2025-02-11 10:25:33 TP3] Load weight begin. avail mem=23.07 GB
[2025-02-11 10:25:33 TP1] Load weight begin. avail mem=23.07 GB
[2025-02-11 10:25:33 TP2] Load weight begin. avail mem=23.07 GB
[2025-02-11 10:25:33 TP0] Load weight begin. avail mem=23.07 GB
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  12% Completed | 1/8 [00:01<00:08,  1.19s/it]
Loading safetensors checkpoint shards:  25% Completed | 2/8 [00:02<00:07,  1.25s/it]
Loading safetensors checkpoint shards:  38% Completed | 3/8 [00:03<00:06,  1.30s/it]
Loading safetensors checkpoint shards:  50% Completed | 4/8 [00:05<00:05,  1.32s/it]
Loading safetensors checkpoint shards:  62% Completed | 5/8 [00:06<00:03,  1.30s/it]
Loading safetensors checkpoint shards:  75% Completed | 6/8 [00:07<00:02,  1.28s/it]
Loading safetensors checkpoint shards:  88% Completed | 7/8 [00:08<00:01,  1.24s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:09<00:00,  1.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:09<00:00,  1.18s/it]

[2025-02-11 10:25:43 TP2] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=7.53 GB
[2025-02-11 10:25:43 TP1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=7.53 GB
[2025-02-11 10:25:43 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=7.53 GB
[2025-02-11 10:25:44 TP3] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=7.53 GB
[2025-02-11 10:25:44 TP3] KV Cache is allocated. K size: 2.03 GB, V size: 2.03 GB.
[2025-02-11 10:25:44 TP1] KV Cache is allocated. K size: 2.03 GB, V size: 2.03 GB.
[2025-02-11 10:25:44 TP3] Memory pool end. avail mem=2.28 GB
[2025-02-11 10:25:44 TP1] Memory pool end. avail mem=2.28 GB
[2025-02-11 10:25:44 TP0] KV Cache is allocated. K size: 2.03 GB, V size: 2.03 GB.
[2025-02-11 10:25:44 TP0] Memory pool end. avail mem=2.28 GB
[2025-02-11 10:25:44 TP2] KV Cache is allocated. K size: 2.03 GB, V size: 2.03 GB.
[2025-02-11 10:25:44 TP2] Memory pool end. avail mem=2.28 GB
[2025-02-11 10:25:44 TP3] Capture cuda graph begin. This can take up to several minutes.
[2025-02-11 10:25:44 TP2] Capture cuda graph begin. This can take up to several minutes.
[2025-02-11 10:25:44 TP0] Capture cuda graph begin. This can take up to several minutes.
[2025-02-11 10:25:44 TP1] Capture cuda graph begin. This can take up to several minutes.
  0%|                                                                                                                                                                                                             | 0/13 [00:00<?, ?it/s]2025-02-11 10:25:45,462 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-02-11 10:25:45,467 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-02-11 10:25:45,468 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
2025-02-11 10:25:45,492 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
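
Since the last lines before the hang show flashinfer loading JIT ops during CUDA graph capture, one sanity check (my assumption, not something confirmed in the log above) is that the JIT compilation itself is stalling or silently failing. Verifying the build tooling inside the same environment rules that out:

      # run inside the same fffan_sglang environment used to launch the server
      which ninja        # empty output means the ninja build tool is missing
      ninja --version    # flashinfer's JIT path needs ninja to compile its kernels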

Reproduction

The command I use is:

CUDA_VISIBLE_DEVICES=1,2,3,4 /data/miniconda3/envs/fffan_sglang/bin/python -m sglang.launch_server \
      --model /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
      --trust-remote-code \
      --tp 4 \
      --host 0.0.0.0 --port 8001
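
To isolate whether the hang is specific to CUDA graph capture, the same command can be retried with graph capture disabled; --disable-cuda-graph is the flag suggested in the error output further down this thread, and the rest of the command is unchanged (a diagnostic sketch, not a verified fix):

CUDA_VISIBLE_DEVICES=1,2,3,4 /data/miniconda3/envs/fffan_sglang/bin/python -m sglang.launch_server \
      --model /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
      --trust-remote-code \
      --tp 4 \
      --disable-cuda-graph \
      --host 0.0.0.0 --port 8001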

Environment

INFO 02-11 10:59:21 __init__.py:190] Automatically detected platform cuda.
Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 4090
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.140
CUDA Driver Version: 550.142
PyTorch: 2.5.1+cu124
sglang: 0.4.2.post4
sgl_kernel: 0.0.3.post3
flashinfer: 0.2.0.post2
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.61.1
tiktoken: 0.8.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology: 
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PIX	PIX	PIX	SYS	SYS	SYS	SYS	0-31,64-95	0		N/A
GPU1	PIX	 X 	PIX	PIX	SYS	SYS	SYS	SYS	0-31,64-95	0		N/A
GPU2	PIX	PIX	 X 	PIX	SYS	SYS	SYS	SYS	0-31,64-95	0		N/A
GPU3	PIX	PIX	PIX	 X 	SYS	SYS	SYS	SYS	0-31,64-95	0		N/A
GPU4	SYS	SYS	SYS	SYS	 X 	PIX	PIX	PIX	32-63,96-127	1		N/A
GPU5	SYS	SYS	SYS	SYS	PIX	 X 	PIX	PIX	32-63,96-127	1		N/A
GPU6	SYS	SYS	SYS	SYS	PIX	PIX	 X 	PIX	32-63,96-127	1		N/A
GPU7	SYS	SYS	SYS	SYS	PIX	PIX	PIX	 X 	32-63,96-127	1		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024
@shuaills
Collaborator

Can you launch it using docker? docker pull lmsysorg/sglang:dev

@shuaills shuaills self-assigned this Feb 11, 2025
@Tian14267
Author

Can you launch it using docker? docker pull lmsysorg/sglang:dev

I have another identical machine, and that one runs fine, so I am confused.
My ultimate goal is multi-machine, multi-GPU deployment, so I would rather not use Docker.
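
For reference, a rough two-node launch sketch, assuming the multi-node fields in the ServerArgs dump above (dist_init_addr, nnodes, node_rank) map to the flags --dist-init-addr, --nnodes, --node-rank; 192.168.0.10:50000 is a placeholder for the first node's address, and the total --tp degree is split across the two nodes:

# node 0
python -m sglang.launch_server \
      --model /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
      --trust-remote-code \
      --tp 8 --nnodes 2 --node-rank 0 --dist-init-addr 192.168.0.10:50000 \
      --host 0.0.0.0 --port 8001

# node 1
python -m sglang.launch_server \
      --model /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
      --trust-remote-code \
      --tp 8 --nnodes 2 --node-rank 1 --dist-init-addr 192.168.0.10:50000 \
      --host 0.0.0.0 --port 8001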

@Tian14267
Author

Tian14267 commented Feb 11, 2025

docker pull lmsysorg/sglang:dev

If I use sglang to deploy a model across 2 nodes with Docker, can you tell me how to do it?
@shuaills

@shuaills
Collaborator

@Tian14267
Author

@Tian14267 Tian14267 reopened this Feb 12, 2025
@Tian14267
Author

Tian14267 commented Feb 12, 2025

@shuaills

Hello, I can't find or pull that Docker image. Can you help me find why python -m sglang.launch_server gets stuck? If you need any information, please tell me. Thank you very much.

Also, my other machine (machine-2) gets this error:

[2025-02-12 01:08:38 TP1] Capture cuda graph begin. This can take up to several minutes.
[2025-02-12 01:08:38 TP3] Capture cuda graph begin. This can take up to several minutes.
[2025-02-12 01:08:38 TP2] Capture cuda graph begin. This can take up to several minutes.
[2025-02-12 01:08:38 TP0] Capture cuda graph begin. This can take up to several minutes.
  0%|                                                                                                                                                                                                             | 0/13 [00:00<?, ?it/s]2025-02-12 01:08:39,383 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[2025-02-12 01:08:39 TP2] Scheduler hit an exception: Traceback (most recent call last):
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 232, in __init__
    self.capture()
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 299, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 357, in capture_one_batch_size
    self.model_runner.attn_backend.init_forward_metadata_capture_cuda_graph(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/layers/attention/flashinfer_backend.py", line 291, in init_forward_metadata_capture_cuda_graph
    self.indices_updater_decode.update(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/layers/attention/flashinfer_backend.py", line 536, in update_single_wrapper
    self.call_begin_forward(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/layers/attention/flashinfer_backend.py", line 640, in call_begin_forward
    wrapper.begin_forward(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/flashinfer/decode.py", line 862, in plan
    self._cached_module = get_batch_prefill_module("fa2")(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/flashinfer/prefill.py", line 196, in backend_module
    module = gen_batch_prefill_module(backend, *args)
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/flashinfer/jit/attention.py", line 464, in gen_batch_prefill_module
    return gen_customize_batch_prefill_module(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/flashinfer/jit/attention.py", line 899, in gen_customize_batch_prefill_module
    return load_cuda_ops(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/flashinfer/jit/core.py", line 120, in load_cuda_ops
    module = torch_cpp_ext.load(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1314, in load
    return _jit_compile(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1721, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1803, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1852, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1787, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 240, in __init__
    self.tp_worker = TpWorkerClass(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 215, in __init__
    self.init_cuda_graphs()
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 730, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 234, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Ninja is required to load C++ extensions
Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
4. set --cuda-graph-max-bs to a smaller value (e.g., 32)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 


[2025-02-12 01:08:39] Received sigquit from a child proces. It usually means the child failed.
2025-02-12 01:08:39,431 - INFO - flashinfer.jit: Loading JIT ops: batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False
/data/miniconda3/envs/fffan_sglang/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Killed

Before machine-1 got stuck, it also hit the same error as machine-2.
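
Given the traceback ("Ninja is required to load C++ extensions") and the TORCH_CUDA_ARCH_LIST warning, a plausible fix for both machines (my reading of the logs, not yet confirmed here) is to install ninja in the same environment and pin the arch list to the compute capability reported in the environment section:

/data/miniconda3/envs/fffan_sglang/bin/pip install ninja
export TORCH_CUDA_ARCH_LIST="8.9"   # matches the RTX 4090 Compute Capability 8.9 listed above
# then relaunch the server so flashinfer can JIT-compile its attention kernels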

@Tian14267
Author

@shuaills
I used the Docker image lmsysorg/sglang:dev and ran into some problems; can you help take a look? Thank you very much!
#3510
