Docs: add mtu size configuration and RDMA QoS #4646

Open
wants to merge 1 commit into
base: main

Conversation

cyclinder
Collaborator

Thanks for contributing!

Notice:

What issue(s) does this PR fix:

Fixes #

Special notes for your reviewer:

@cyclinder cyclinder added the release/none no release note label Feb 12, 2025
@cyclinder cyclinder force-pushed the docs/mtu branch 2 times, most recently from 0ce904d to 9f20778 Compare February 12, 2025 08:10
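| Configuration complexity | Relatively simple to configure | More complex to configure; requires hardware support and configuration |
| Compatibility | Good compatibility; suitable for most environments | Depends on hardware support; poorer compatibility |
| Applicable scenarios | Suitable for most scenarios, including bare metal and virtual machines | Bare metal only; not suitable for virtual machines |
| Cost | Lower cost, because no additional hardware support is needed | Higher cost; requires hardware that supports SR-IOV |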
Collaborator

Cover the cases where RoCE and IB apply.

@@ -366,6 +382,33 @@ Spiderpool uses [sriov-network-operator](https://github.com/k8snetworkplumb
EOF
```

4. (Optional) Customize the MTU of SR-IOV VFs
Collaborator

This should match real deployments directly: in practice the environment is set to jumbo frames with an MTU of 8000.

1. In the NIC preparation step before installing Spiderpool:
(1) Parent interface: MTU 8000, persisted across reboots
(2) QoS

2. Change the value here to 8000.

Collaborator Author

done

@cyclinder cyclinder force-pushed the docs/mtu branch 5 times, most recently from ef347e5 to f6b8f14 Compare February 19, 2025 06:46
@cyclinder cyclinder changed the title Docs: add mtu size configuration for SpiderMultusConfig Docs: add mtu size configuration and RDMA QoS Feb 19, 2025
1. Configure the script

```shell
# List of network cards, multiple targets separated by ",", such as eth0,eth1
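Collaborator

@weizhoublue weizhoublue Feb 19, 2025

This script is not production-ready. Use an AI coding tool to quickly fill in the missing validation.

1. Every variable should be settable through an environment variable, so that parameters are passed on the command line instead of editing the script each time:
ECN_NICS=${ECN_NICS:-""}
2. If a required variable is empty, print a warning and exit.
3. Check up front whether each NIC named in the variables exists, whether it is an RDMA NIC, and whether the file paths that will be written later exist.
4. Validate that ECN_PRIORITY is in the range 0-7.

5. On storage servers, GPU_NIC_LIST may be empty. On GPU servers, STORAGE_NIC_LIST is optional.
At a minimum, GPU_NIC_LIST and STORAGE_NIC_LIST must not both be empty.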
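
The checks requested above could be sketched roughly as follows. This is an illustration, not the script from the PR: the helper names (`is_valid_priority`, `has_any_nic_list`, `is_rdma_nic`, `validate_inputs`) are hypothetical, the default priority of 5 is taken from the snippet under review, and the sysfs path used to detect an RDMA NIC is an assumption that should be confirmed on the target hardware.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not taken from the PR.

# Accept settings from the environment instead of editing the script.
GPU_NIC_LIST=${GPU_NIC_LIST:-""}
STORAGE_NIC_LIST=${STORAGE_NIC_LIST:-""}
ECN_PRIORITY=${ECN_PRIORITY:-5}

# Return 0 when $1 is a single digit in the range 0-7.
is_valid_priority() {
    case "$1" in
        [0-7]) return 0 ;;
        *) return 1 ;;
    esac
}

# Return 0 when at least one of the two NIC lists is non-empty.
has_any_nic_list() {
    [ -n "$1" ] || [ -n "$2" ]
}

# Return 0 when the interface exists and looks like an RDMA NIC.
# Assumption: RDMA NICs expose /sys/class/net/<nic>/device/infiniband.
is_rdma_nic() {
    [ -d "/sys/class/net/$1" ] && [ -d "/sys/class/net/$1/device/infiniband" ]
}

# Run all checks; print an error and return 1 on the first failure.
validate_inputs() {
    has_any_nic_list "$GPU_NIC_LIST" "$STORAGE_NIC_LIST" || {
        echo "error: GPU_NIC_LIST and STORAGE_NIC_LIST are both empty" >&2
        return 1
    }
    is_valid_priority "$ECN_PRIORITY" || {
        echo "error: ECN_PRIORITY must be 0-7, got '$ECN_PRIORITY'" >&2
        return 1
    }
    # Check every NIC in both comma-separated lists.
    for nic in $(echo "$GPU_NIC_LIST,$STORAGE_NIC_LIST" | tr ',' ' '); do
        is_rdma_nic "$nic" || {
            echo "error: $nic does not exist or is not an RDMA NIC" >&2
            return 1
        }
    done
}
```

A script built this way would call `validate_inputs || exit 1` before touching any NIC, so that a typo in a NIC name fails early rather than halfway through the configuration.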

Collaborator Author

all done.


```shell
# List of network cards, multiple targets separated by ",", such as eth0,eth1
export ECN_NICS=""
Collaborator

@weizhoublue weizhoublue Feb 19, 2025

Strictly speaking, this is GPU_NIC_LIST.

Both the storage network and the GPU network need ECN configured, so the name ECN_NICS does not fit.

Collaborator Author

Then do the storage and GPU networks use the same lossless queue?

# List of network cards, multiple targets separated by ",", such as eth0,eth1
export ECN_NICS=""
# Priority of ECN packets, range 0-7
export ECN_PRIORITY=5
Collaborator

GPU_NIC__PRIORITY

# Priority of ECN packets, range 0-7
export ECN_PRIORITY=5
# DSCP of ECN packets, range 0-63
export ECN_QOS=48
Collaborator

@weizhoublue weizhoublue Feb 19, 2025

GPU_NIC_QOS
When it is passed in empty, the script should compute a default automatically from GPU_NIC__PRIORITY.

Collaborator Author

That is supported: if it is empty, GPU_NIC_QOS = GPU_NIC__PRIORITY * 8.
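
As a rough illustration of that defaulting rule (explicit value wins, otherwise priority multiplied by 8), a helper could look like the sketch below. The function name `default_qos` is hypothetical and not part of the PR.

```shell
# Sketch only: default_qos is a hypothetical helper.
# Print the DSCP value to use: the explicit value when one is given,
# otherwise the priority multiplied by 8.
default_qos() (
    priority="$1"
    qos="$2"
    if [ -z "$qos" ]; then
        qos=$((priority * 8))
    fi
    echo "$qos"
)
```

For example, `default_qos 5 ""` prints 40, while `default_qos 5 48` keeps the explicit 48.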


mkdir -p /etc/systemd/system/rdma-qos.d
cat <<F_EOF >/etc/systemd/system/rdma-qos.d/10-qos.conf
GROUP_COMPUTING_CONFIG="group=computing;nic=$ECN_NICS;ecn_priority=$ECN_PRIORITY;ecn_qos=$ECN_QOS"
Collaborator

GROUP_COMPUTING_CONFIG=""
[ -n "$GPU_NIC_LIST" ] && GROUP_COMPUTING_CONFIG=....

export STORAGE_NICS=""
export STORAGE_ECN_PRIORITY=""
export STORAGE_ECN_QOS=""
cat <<"EOF" >rdma_qos.sh
Collaborator

Isn't writing to rdma_qos.sh here redundant? Why not just run the code that follows directly?

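Collaborator Author

The flow here is to run the export and cat commands in a terminal, which generates the script, and then execute it. That script in turn generates the systemd script and unit file, and only then does everything run.

Otherwise users would have to copy the content into a file and run it, or copy and run several commands one by one, which makes copy-paste mistakes likely.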

pfc_queue=$(echo "${qos_queues[*]}" | sed 's? ?,?g' | tr -d ' ')
mlnx_qos -i "$nic_item" --trust=dscp --pfc ${pfc_queue}

echo "echo 1 >/sys/class/net/$nic_item/ecn/roce_np/enable/$ecn_priority"
Collaborator

There is no check or error reporting for whether these paths exist.
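
One way to address this is to guard each write with an existence check. The sketch below is an illustration only; `write_if_present` is a hypothetical helper name, not something from the PR.

```shell
# Sketch only: write_if_present is a hypothetical helper.
# Write a value to a file only if the target already exists and is writable;
# otherwise report the path and fail.
write_if_present() (
    value="$1"
    path="$2"
    if [ ! -w "$path" ]; then
        echo "error: $path does not exist or is not writable" >&2
        exit 1
    fi
    echo "$value" >"$path"
)
```

With such a helper, the line under review could become `write_if_present 1 "/sys/class/net/$nic_item/ecn/roce_np/enable/$ecn_priority"`, which fails loudly when the driver does not expose that path.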

Collaborator Author

done

2. Add script permissions and execute

```shell
chmod +x rdma-qos.sh && bash rdma-qos.sh
Collaborator

In the simplest case, the following command should be enough to complete the configuration:

GPU_NIC_LIST=""   GPU_NIC_PRIORITY=""  \
STORAGE_NIC_LIST=""  STORAGE_NIC_PRIORITY=""  \
  rdma-qos.sh

Collaborator Author

Simplest case:

GPU_NIC_LIST=xxx rdma-qos.sh

@cyclinder cyclinder force-pushed the docs/mtu branch 8 times, most recently from 9cac1d7 to d521ef5 Compare February 19, 2025 12:19
@cyclinder cyclinder force-pushed the docs/mtu branch 2 times, most recently from a166f07 to 90d0486 Compare February 19, 2025 12:26
# set -x
# set -o xtrace
set -o errexit

Collaborator

@weizhoublue weizhoublue Feb 19, 2025

Why not put those 8 variables directly into the script? Then there is no need for a config file or for format parsing; after all, this script is not particularly complex.

This also avoids problems with special characters when parsing the config.

GPU_NIC_LIST="${GPU_NIC_LIST}"
.......
....

Collaborator Author

They can't be put in directly. The generated script cannot see the variable values, because the heredoc is written as cat <<"EOF". Note the quotes around the delimiter.

Collaborator

@weizhoublue weizhoublue Feb 19, 2025

It can be done; this is not a blocker for the implementation.

a=10
cat <<EOF > /tmp/a
a=${a}
echo \${a}
EOF

That said, it is a fairly large change, so it's your call.
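
The same technique applied to one of the script's variables might look like the sketch below. With an unquoted heredoc delimiter, `${nic_list}` is expanded when the file is generated, while `\${GPU_NIC_LIST}` is written out literally and expanded only when the generated script runs. The function name `generate_script` and its arguments are hypothetical.

```shell
# Sketch only: generate_script is a hypothetical helper.
# $1 is the NIC list to bake into the generated script, $2 is the output file.
generate_script() {
    nic_list="$1"
    out="$2"
    # Unquoted EOF: ${nic_list} expands now, \${...} is deferred to run time.
    cat <<EOF >"$out"
GPU_NIC_LIST="${nic_list}"
echo "configured NICs: \${GPU_NIC_LIST}"
EOF
}
```

The trade-off is that every `$` meant for the generated script must be escaped, which is presumably why this counts as a larger change for a long script.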

[ -n "$GPU_NIC_LIST" ] && validate_nic "$GPU_NIC_LIST" "$GPU_RDMA_PRIORITY"
[ -n "$STORAGE_NIC_LIST" ] && validate_nic "$STORAGE_NIC_LIST" "$STORAGE_RDMA_PRIORITY"

echo "debug, GPU_NIC_LIST=$GPU_NIC_LIST, GPU_RDMA_PRIORITY=$GPU_RDMA_PRIORITY, GPU_CNP_PRIORITY=$GPU_CNP_PRIORITY, GPU_RDMA_QOS=$GPU_RDMA_QOS, GPU_CNP_QOS=$GPU_CNP_QOS"
Collaborator

This log line should be moved later, to after GPU_RDMA_QOS has been computed.
That way it shows the final, effective configuration in full.

@cyclinder cyclinder force-pushed the docs/mtu branch 3 times, most recently from ada1b6b to bc51eb5 Compare February 19, 2025 13:26
[Install]
WantedBy=timers.target
T_EOF

Collaborator

I suggest running the script once directly before the systemctl calls. For one thing, the settings take effect immediately; for another, it shows whether the systemd service is going to succeed later.

Collaborator Author

@cyclinder cyclinder Feb 20, 2025

If the script is meant to loop internally once a minute, but here we want it to run only once, should a single variable select between the two behaviors of the same script?
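
A single switch for both behaviors could be sketched as follows. `RUN_ONCE` appears in a later revision of the diff; `INTERVAL_SECONDS`, `apply_qos`, and `run` are hypothetical names used only for this illustration, and `apply_qos` is a stand-in for the real configuration steps.

```shell
# Sketch only: INTERVAL_SECONDS, apply_qos and run are hypothetical names.
RUN_ONCE=${RUN_ONCE:-"false"}
INTERVAL_SECONDS=${INTERVAL_SECONDS:-60}

# Placeholder for the real QoS configuration steps.
apply_qos() {
    echo "applying RDMA QoS settings"
}

# Apply once; keep re-applying on an interval unless RUN_ONCE=true.
run() {
    apply_qos || return 1
    if [ "$RUN_ONCE" = "true" ]; then
        return 0
    fi
    while true; do
        sleep "$INTERVAL_SECONDS"
        apply_qos || return 1
    done
}

# A real script would end by calling: run
```

The installer would then invoke the script with `RUN_ONCE=true` for the pre-run, and the service would start it without that variable so it keeps looping.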

systemctl daemon-reload
systemctl enable rdma-qos.service
systemctl enable rdma-qos.timer
systemctl start rdma-qos.service
Collaborator

start -> restart, to support the reinstall scenario.

systemctl enable rdma-qos.service
systemctl enable rdma-qos.timer
systemctl start rdma-qos.service
systemctl start rdma-qos.timer
Collaborator

This needs two units,
which looks rather convoluted. Why not start a single long-running service that applies the settings once every minute?

systemctl restart rdma-qos.service
echo -e "\e[31m Done \e[0m"

systemctl status rdma-qos.service
Collaborator

This command will hang the terminal, so it is not suitable for automation.
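
A non-interactive alternative could be sketched as below, assuming the hang comes from `systemctl status` handing its output to a pager. `systemctl is-active --quiet` only sets an exit code, and `--no-pager` prints the status without waiting for input. The helper name `service_ok` is hypothetical.

```shell
# Sketch only: service_ok is a hypothetical helper.
# Report whether a unit is active without blocking on a pager.
service_ok() {
    unit="$1"
    if systemctl is-active --quiet "$unit"; then
        echo "$unit is active"
    else
        echo "$unit is not active" >&2
        # --no-pager keeps status from waiting on an interactive pager.
        systemctl status --no-pager "$unit" >&2 || true
        return 1
    fi
}
```

An installer could finish with `service_ok rdma-qos.service || exit 1`, which returns immediately either way.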


qos_queues=(0 0 0 0 0 0 0 0)
qos_queues[$rdma_priority]=1
qos_queues[$cnp_priority]=1
Collaborator

CNP does not need PFC enabled.
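
Following that comment, the PFC mask would enable only the RDMA priority. A minimal sketch of building such a mask is shown below; `build_pfc_mask` is a hypothetical helper, and the output format (eight comma-separated 0/1 flags, one per priority) mirrors the `pfc_queue` value built in the diff.

```shell
# Sketch only: build_pfc_mask is a hypothetical helper.
# Print eight comma-separated flags with only the RDMA priority set to 1.
build_pfc_mask() (
    rdma_priority="$1"
    mask=""
    i=0
    while [ "$i" -le 7 ]; do
        if [ "$i" -eq "$rdma_priority" ]; then
            bit=1
        else
            bit=0
        fi
        # Append with a comma separator after the first element.
        mask="${mask:+$mask,}$bit"
        i=$((i + 1))
    done
    echo "$mask"
)
```

The result would be passed on as in the existing line, for example `mlnx_qos -i "$nic_item" --trust=dscp --pfc "$(build_pfc_mask "$rdma_priority")"`.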

[Install]
WantedBy=multi-user.target
SYS_EOF

Collaborator

@weizhoublue weizhoublue Feb 20, 2025

Before starting the service, run the script once directly. If it fails, exit and do not install the service.
The script should also support a DEBUG=true variable: enable debug for the run before the service starts, and once it is really running use debug=false to reduce logging.

@cyclinder cyclinder force-pushed the docs/mtu branch 2 times, most recently from 6b995d3 to 0b954fb Compare February 20, 2025 07:28
echo -e "\e[31m Pre-run rdma_qos.sh once \e[0m"
/usr/local/bin/rdma_qos.sh

sed -i 's?RUN_ONCE=true?RUN_ONCE=false?' /usr/local/bin/rdma_qos.sh
Collaborator

In the script:
DEBUG=${DEBUG:-""}

When invoking it: DEBUG=true /usr/local/bin/rdma_qos.sh

There is no need to modify the script afterwards.
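
Inside the script, that pattern might be used as in the sketch below. `debug_log` is a hypothetical helper name; only the `DEBUG=${DEBUG:-""}` default comes from the comment above.

```shell
# Sketch only: debug_log is a hypothetical helper.
# Debug output is off unless the caller sets DEBUG=true in the environment.
DEBUG=${DEBUG:-""}

# Print a message to stderr only when DEBUG=true.
debug_log() {
    if [ "$DEBUG" = "true" ]; then
        echo "debug: $*" >&2
    fi
}
```

The pre-run would then be `DEBUG=true /usr/local/bin/rdma_qos.sh`, while the service starts the same file without the variable and stays quiet.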

Collaborator Author

done


chmod +x /usr/local/bin/rdma_qos.sh
echo -e "\e[31m Pre-run rdma_qos.sh once \e[0m"
/usr/local/bin/rdma_qos.sh
Collaborator

It also needs to log the failure:

/usr/local/bin/rdma_qos.sh || { echo "error, failed to set qos" ; exit 1 ; }

Collaborator Author

done

After=network.target

[Service]
Type=simple
Collaborator

The script uses set -o errexit, so on any problem it exits immediately, and without an error log.

This needs to be enabled here:
Restart=always

Collaborator Author

done


2 participants