19157681683 changed the title on Feb 12, 2025:
I use 3 A800 nodes to deploy DeepSeek R1, but one node has only one IB adapter. How do I adjust the tp value in the deploy command?
You can try the following adjustments to your current deploy command:
For the node with one IB adapter:
Use a lower tensor parallelism value. In your case, set --tp 12 and point NCCL at the one available IB adapter (e.g., export NCCL_IB_HCA=mlx5_0).
For the nodes with two IB adapters:
Use a higher tensor parallelism value, since these nodes have more communication bandwidth. In your case, set --tp 24 and select the appropriate IB adapter (e.g., export NCCL_IB_HCA=mlx5_1).
If this doesn't work, or if you prefer a uniform configuration, force all nodes to use the same IB adapter (for example, set NCCL_IB_HCA=mlx5_0 on every node) and pass the same --tp value (e.g., --tp 12) on all nodes. This simplifies deployment, but it may leave the extra IB capacity unused on the nodes that have two adapters.
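As a rough sketch of the uniform option, the script below prints one launch command per node, so the three commands can be compared before running them. It assumes that 10.160.199.103:30172 is the reachable address of the rank 0 node and that every node passes that same --dist-init-addr. The helper name launch_cmd is hypothetical, and --tp 12 is simply the example value from this reply, not a verified setting for this model.

```shell
#!/usr/bin/env bash
# Sketch only: prints the per-node launch commands for the uniform configuration.
# Nothing is launched here; copy each printed line to the matching node.

MODEL_PATH=/x32001214/model/bf16/DeepSeek-R1-BF16
DIST_INIT_ADDR=10.160.199.103:30172   # assumption: address of the rank 0 node
NNODES=3
TP=12                                 # example value from the reply, not verified
IB_HCA=mlx5_0                         # the adapter that exists on every node

# launch_cmd is a hypothetical helper: it prints the command for one node rank.
launch_cmd() {
  local node_rank="$1"
  local cmd="NCCL_IB_HCA=$IB_HCA python3 -m sglang.launch_server"
  cmd="$cmd --model-path $MODEL_PATH --tp $TP"
  cmd="$cmd --dist-init-addr $DIST_INIT_ADDR --nnodes $NNODES"
  cmd="$cmd --node-rank $node_rank --trust-remote-code"
  # Only rank 0 exposes the HTTP endpoint, as in the original node 1 command.
  if [ "$node_rank" -eq 0 ]; then
    cmd="$cmd --host 0.0.0.0 --port 8888"
  fi
  echo "$cmd"
}

for rank in 0 1 2; do
  launch_cmd "$rank"
done
```

The only things that differ between the printed lines are --node-rank and the --host/--port pair on rank 0; everything else is identical across nodes.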
node 1
export NCCL_IB_HCA=mlx5_0
python3 -m sglang.launch_server --model-path /x32001214/model/bf16/DeepSeek-R1-BF16 --tp 12 --dist-init-addr 0.0.0.0:9997 --nnodes 3 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 8888
node 2
export NCCL_IB_HCA=mlx5_1
python3 -m sglang.launch_server --model-path /x32001214/model/bf16/DeepSeek-R1-BF16 --tp 24 --dist-init-addr 10.160.199.103:30172 --nnodes 3 --node-rank 1 --trust-remote-code
node 3
export NCCL_IB_HCA=mlx5_1
python3 -m sglang.launch_server --model-path /x32001214/model/bf16/DeepSeek-R1-BF16 --tp 24 --dist-init-addr 10.160.199.103:30172 --nnodes 3 --node-rank 2 --trust-remote-code