api: supply id for minor meaningless device #2250

Merged: 1 commit into koordinator-sh:main, Nov 28, 2024

ferris-cx
Contributor

Ⅰ. Describe what this PR does

Mounts the RDMA device inside the container according to the scheduler's allocation result.

Ⅱ. Does this pull request fix one issue?

The scheduling framework already supports joint allocation of GPUs and RDMA devices, but several semantics, including preference and samePCIE, still need to be tested in practice. For RDMA devices to be accessible inside the container, this PR fixes the missing BDF (bus:device.function) addresses in the allocation results that the scheduling algorithm produces for RDMA devices.

Ⅲ. Describe how to verify it

  1. In the k8s cluster, prepare one or more servers with RDMA-capable network adapters as cluster nodes. Install the new version of koordlet, koord-manager, and the revamped multus-cni on each node, then check the node status: the reported resource count should equal the actual number of RDMA NICs on the node.

  2. Write a pod YAML that requests RDMA NIC resources and apply it with kubectl apply -f. In the pod annotation (device-allocated), check whether the RDMA device's BDF address (the busID field) is included in the RDMA allocation result, and verify that it is correct.

  3. When the pod is in the Running state, enter the container and run ifconfig to check the number of network cards. If everything is correct, multiple network interfaces are listed, such as net1, net2, net3, and so on.
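The check in step 2 can be scripted. Below is a minimal Go sketch, assuming the device-allocated annotation holds a JSON object keyed by device type, with each entry carrying a minor number and a busID string. That layout, the `allocation` type, and the `missingBusIDs` helper are illustrative assumptions rather than the project's actual API; only the busID field name comes from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// allocation is a hypothetical, simplified view of one allocation entry.
// The real annotation may carry more fields; unknown fields are ignored.
type allocation struct {
	Minor int    `json:"minor"`
	BusID string `json:"busID"`
}

// bdfPattern matches a PCI address of the form domain:bus:device.function,
// for example 0000:1f:00.0 (device is 5 bits, function is 3 bits).
var bdfPattern = regexp.MustCompile(`^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[01][0-9a-fA-F]\.[0-7]$`)

// missingBusIDs returns the minors of all "rdma" entries whose busID is
// absent or is not a well-formed BDF address.
func missingBusIDs(annotation string) ([]int, error) {
	var result map[string][]allocation
	if err := json.Unmarshal([]byte(annotation), &result); err != nil {
		return nil, err
	}
	var bad []int
	for _, a := range result["rdma"] {
		if !bdfPattern.MatchString(a.BusID) {
			bad = append(bad, a.Minor)
		}
	}
	return bad, nil
}

func main() {
	// Sample value made up for illustration; paste the real annotation here.
	sample := `{"rdma":[{"minor":0,"busID":"0000:1f:00.0"},{"minor":1}]}`
	bad, err := missingBusIDs(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("entries without a valid BDF address:", bad)
}
```

An empty result means every RDMA entry in the annotation carries a syntactically valid address; whether the address matches the physical NIC still has to be checked on the node.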

Ⅳ. Special notes for reviews

  1. Deploy components that support RDMA, including koordlet, koord-manager, koord-scheduler, and multus-cni.

  2. Complete end-to-end passthrough also requires a multi-CNI plug-in that complies with the CNI specification and supports multi-NIC allocation. Assigning the RDMA NIC PF/VF to the Pod as described here requires the device ID to be injected into that component; otherwise it will not work. That change will be maintained separately in the multi-cni project or in another PR.
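To illustrate what injecting the device ID could look like on the CNI side, here is a minimal Go sketch. It assumes the plug-in reads the PCI address from a "deviceID" key in its network configuration JSON; that key name and the `injectDeviceID` helper are assumptions for illustration, not the change that will land in the CNI project.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// injectDeviceID returns a copy of a CNI network configuration with the
// "deviceID" key set to the allocated PCI address. The key name is an
// assumption; the actual plug-in may expect a different field.
func injectDeviceID(conf []byte, busID string) ([]byte, error) {
	if busID == "" {
		return nil, fmt.Errorf("empty bus ID: the allocation result carries no BDF address")
	}
	var fields map[string]interface{}
	if err := json.Unmarshal(conf, &fields); err != nil {
		return nil, err
	}
	if fields == nil {
		// A JSON "null" document unmarshals to a nil map.
		fields = map[string]interface{}{}
	}
	fields["deviceID"] = busID
	return json.Marshal(fields)
}

func main() {
	// Hypothetical configuration, made up for illustration.
	conf := []byte(`{"cniVersion":"0.4.0","name":"rdma-net","type":"example-cni"}`)
	out, err := injectDeviceID(conf, "0000:1f:00.0")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Rejecting an empty bus ID up front mirrors the problem this PR addresses: without a BDF address in the allocation result there is nothing to hand to the CNI plug-in.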

Ⅴ. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

codecov bot commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 60.00000% with 10 lines in your changes missing coverage. Please review.

Project coverage is 66.05%. Comparing base (0ce6314) to head (96cf7a6).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/scheduler/plugins/deviceshare/plugin.go | 60.00% | 7 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2250      +/-   ##
==========================================
- Coverage   66.05%   66.05%   -0.01%     
==========================================
  Files         454      454              
  Lines       53406    53438      +32     
==========================================
+ Hits        35280    35298      +18     
- Misses      15588    15597       +9     
- Partials     2538     2543       +5     
| Flag | Coverage Δ |
|---|---|
| unittests | 66.05% <60.00%> (-0.01%) ⬇️ |

@ferris-cx ferris-cx force-pushed the koord-scheduler-rdma branch from 71e856a to f1589bf Compare November 19, 2024 06:35
@ZiMengSheng ZiMengSheng force-pushed the koord-scheduler-rdma branch 5 times, most recently from 75e692e to 3ad4389 Compare November 19, 2024 13:22
@ZiMengSheng
Contributor

/hold

@ZiMengSheng
Contributor

/hold cancel

@ZiMengSheng ZiMengSheng changed the title from "add rdma device busId at scheduler" to "api: supply id for minor meaningless device" Nov 20, 2024
@ZiMengSheng ZiMengSheng force-pushed the koord-scheduler-rdma branch 2 times, most recently from c1555a8 to 186918a Compare November 20, 2024 09:51
@songtao98
Contributor

/lgtm

@songtao98 songtao98 added the lgtm label Nov 28, 2024
@ZiMengSheng
Contributor

/approve

@koordinator-bot koordinator-bot bot merged commit 13a7107 into koordinator-sh:main Nov 28, 2024
22 checks passed
3 participants