support scheduler plugins #3612


Open
wants to merge 4 commits into master from feat/support-scheduler-plugins
Conversation

@KunWuLuan (Contributor) commented May 16, 2025

Support PodGroup of scheduler-plugins

#3611
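
For context, a minimal sketch of how a user would opt in, based only on the option and label names discussed in the review below; the sample shipped with the PR may differ:

# Helm values: select the scheduler-plugins integration by name.
batchScheduler:
  name: kube-scheduler
---
# RayCluster: opt into gang scheduling with the existing label.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: test-podgroup-0
  labels:
    ray.io/gang-scheduling-enabled: "true"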

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@KunWuLuan changed the title from "support scheduler plugins" to "(WIP)support scheduler plugins" May 16, 2025
@KunWuLuan force-pushed the feat/support-scheduler-plugins branch 6 times, most recently from 55d376d to a054ab9 on May 21, 2025 04:03
@KunWuLuan changed the title from "(WIP)support scheduler plugins" to "support scheduler plugins" May 27, 2025
@KunWuLuan (Contributor, author)

@kevin85421 Hi, please have a look when you have time. Thanks!

@kevin85421 (Member)

@KunWuLuan can you resolve the conflict?

@MortalHappiness (Member) left a comment

Please make sure your PR can run successfully. I tried to run your PR, but it failed to create the PodGroup.

You also need to update the following places:

  • # Enable customized Kubernetes scheduler integration. If enabled, Ray workloads will be scheduled
    # by the customized scheduler.
    #  * "enabled" is the legacy option and will be deprecated soon.
    #  * "name" is the standard option, expecting a scheduler name, supported values are
    #    "default", "volcano", and "yunikorn".
    #
    # Note: "enabled" and "name" should not be set at the same time. If both are set, an error will be thrown.
    #
    # Examples:
    #  1. Use volcano (deprecated)
    #       batchScheduler:
    #         enabled: true
    #
    #  2. Use volcano
    #       batchScheduler:
    #         name: volcano
    #
    #  3. Use yunikorn
    #       batchScheduler:
    #         name: yunikorn
    #
    batchScheduler:
      # Deprecated. This option will be removed in the future.
      # Note, for backwards compatibility. When it sets to true, it enables volcano scheduler integration.
      enabled: false
      # Set the customized scheduler name, supported values are "volcano" or "yunikorn", do not set
      # "batchScheduler.enabled=true" at the same time as it will override this option.
      name: ""
  • if config.BatchScheduler == volcano.GetPluginName() || config.BatchScheduler == yunikorn.GetPluginName() {
  • Add a sample config named ray-cluster.kube-scheduler.yaml to the ray-operator/config/samples/ folder.
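
For the second bullet, a minimal sketch of how the existing check could be extended. The scheduler-plugins package and its GetPluginName() helper are assumed to mirror the existing volcano and yunikorn packages and are not confirmed by this PR:

import (
	// Import paths follow the existing batchscheduler package layout; the
	// third package is hypothetical and its path/name may differ in the PR.
	"github.com/ray-project/kuberay/ray-operator/controllers/ray/batchscheduler/volcano"
	"github.com/ray-project/kuberay/ray-operator/controllers/ray/batchscheduler/yunikorn"
	schedulerplugins "github.com/ray-project/kuberay/ray-operator/controllers/ray/batchscheduler/schedulerplugins"
)

// isSupportedBatchScheduler sketches the extended branch inside
// ValidateBatchSchedulerConfig: accept the new plugin name alongside
// the existing volcano and yunikorn names.
func isSupportedBatchScheduler(name string) bool {
	return name == volcano.GetPluginName() ||
		name == yunikorn.GetPluginName() ||
		name == schedulerplugins.GetPluginName()
}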

Follow-up:

@KunWuLuan (Contributor, author)

Hi, I have updated the helm chart and made sure it works in my local environment. Please have a look when you have time. Thanks!
@MortalHappiness

@KunWuLuan force-pushed the feat/support-scheduler-plugins branch from 34d3f44 to a8048b0 on May 29, 2025 15:15
metadata:
  name: test-podgroup-0
  labels:
    ray.io/gang-scheduling-enabled: "true"

Member:

I don't think we should be using ray.io/* labels for any of the scheduler integrations. The label prefix should be specific to the integration (see other examples with Volcano, YuniKorn, and Kueue).

Contributor (author):

Hi, ray.io/gang-scheduling-enabled is used in the YuniKorn sample, and ray.io/scheduler-name: volcano is used when Volcano is selected.
And these labels are defined here:

RaySchedulerName = "ray.io/scheduler-name"
RayPriorityClassName = "ray.io/priority-class-name"
RayClusterGangSchedulingEnabled = "ray.io/gang-scheduling-enabled"

We didn't create a new label for the new scheduler.

if !k.isGangSchedulingEnabled(rc) {
	return nil
}
replica := int32(1)

Member:

Why start at 1? Is it for the head pod? If so, can you add a comment?

Contributor (author):

Yes, it is for the head pod.
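
For context, a minimal sketch of how the PodGroup's minimum member count can be derived. calculateMinMember is a hypothetical helper name, not taken from this PR; only the "start at 1 for the head pod" part is confirmed by the quoted code, and adding the worker replicas on top is an assumption:

package kubescheduler

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// calculateMinMember shows why the count starts at 1: the head pod is always
// part of the gang, and each worker group then contributes its desired
// replica count.
func calculateMinMember(cluster *rayv1.RayCluster) int32 {
	minMember := int32(1) // the head pod
	for _, workerGroup := range cluster.Spec.WorkerGroupSpecs {
		if workerGroup.Replicas != nil {
			minMember += *workerGroup.Replicas
		}
	}
	return minMember
}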

// AddMetadataToPod adds essential labels and annotations to the Ray pods
// the scheduler needs these labels and annotations in order to do the scheduling properly
func (k *KubeScheduler) AddMetadataToPod(_ context.Context, app *rayv1.RayCluster, groupName string, pod *corev1.Pod) {
	// when gang scheduling is enabled, extra annotations need to be added to all pods

Member:

annotations or labels?

Contributor (author):

I made a mistake. Thanks.


// AddMetadataToPod adds essential labels and annotations to the Ray pods
// the scheduler needs these labels and annotations in order to do the scheduling properly
func (k *KubeScheduler) AddMetadataToPod(_ context.Context, app *rayv1.RayCluster, groupName string, pod *corev1.Pod) {

Member:

s/app/cluster

Contributor (author):

Could you please explain the comment? Thank you!

// when gang scheduling is enabled, extra annotations need to be added to all pods
if k.isGangSchedulingEnabled(app) {
	// the group name for the head and each of the worker group should be different
	pod.Labels[KubeSchedulerPodGroupLabelKey] = app.Name

Member:

Should PodGroups be scheduled at the worker group level or the RayCluster level? I feel like the worker group level could make more sense in some cases.

Contributor (author):

If the head is not available, the workers will be blocked, which is unacceptable for my customers. So I think the head should be counted together with the workers.
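
For illustration, this is roughly what a cluster-level gang looks like under the upstream scheduler-plugins coscheduling conventions: one PodGroup per RayCluster, with minMember covering the head plus all workers, and every pod labeled into that group. The exact names this PR generates may differ:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: test-podgroup-0      # typically the RayCluster name
spec:
  minMember: 3               # example: 1 head pod + 2 worker replicas
---
# Each head and worker pod carries the coscheduling group label:
metadata:
  labels:
    scheduling.x-k8s.io/pod-group: test-podgroup-0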

@MortalHappiness (Member) commented May 30, 2025

Please also resolve conflicts. We now use a single go.mod in the repo root. Thanks.

@MortalHappiness (Member) left a comment

You also need to add permission for creating PodGroups here.
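
For reference, a sketch of the kind of ClusterRole rule that would grant this; where exactly it lands in the chart and generated RBAC, and the exact verb list, are assumptions:

- apiGroups: ["scheduling.x-k8s.io"]
  resources: ["podgroups"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]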

KunWuLuan added 3 commits May 30, 2025 15:43
Signed-off-by: kunwuluan <[email protected]>
update ValidateBatchSchedulerConfig()

update helm chart

Rename the function.

Signed-off-by: kunwuluan <[email protected]>
@KunWuLuan force-pushed the feat/support-scheduler-plugins branch from a8048b0 to 21f4f88 on May 30, 2025 08:59
Comment on lines +70 to +73
#  4. Use PodGroup
#       batchScheduler:
#         name: kube-scheduler
#

Member:

You also need to update L53 to:

#    "default", "volcano", "yunikorn", and "kube-scheduler".

Comment on lines +79 to +94
func TestCalculateDesiredResources(t *testing.T) {
	a := assert.New(t)

	cluster := createTestRayCluster(1)

	totalResource := utils.CalculateDesiredResources(&cluster)

	// 256m * 3 (requests, not limits)
	a.Equal("768m", totalResource.Cpu().String())

	// 256Mi * 3 (requests, not limits)
	a.Equal("768Mi", totalResource.Memory().String())

	// 2 GPUs total
	a.Equal("2", totalResource.Name("nvidia.com/gpu", resource.BinarySI).String())
}

Member:

What are you testing here? It seems that you didn't test anything specific to the kube scheduler plugin.
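
Purely as an illustration of what a more targeted test could cover, the sketch below reuses createTestRayCluster and KubeSchedulerPodGroupLabelKey from the quoted hunks and assumes gang scheduling is enabled through the ray.io/gang-scheduling-enabled label on the RayCluster; helper names and setup details may not match the PR exactly:

import (
	"context"
	"testing"

	"github.com/stretchr/testify/assert"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestAddMetadataToPodSetsPodGroupLabel(t *testing.T) {
	a := assert.New(t)

	cluster := createTestRayCluster(1)
	// Assumed opt-in: the same label already used by the YuniKorn integration.
	cluster.Labels = map[string]string{"ray.io/gang-scheduling-enabled": "true"}

	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{}}}

	scheduler := &KubeScheduler{}
	scheduler.AddMetadataToPod(context.Background(), &cluster, "small-group", pod)

	// The pod should be labeled into the cluster-level PodGroup.
	a.Equal(cluster.Name, pod.Labels[KubeSchedulerPodGroupLabelKey])
}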

@MortalHappiness (Member)

Please also install pre-commit locally and then fix the CI lint error. Thanks.

@kevin85421 (Member)

I discussed this with @MortalHappiness offline. Since this PR still requires some work before it can be merged, I've removed the release-blocker label. We can consider cherry-picking this PR after the branch cut.

@KunWuLuan force-pushed the feat/support-scheduler-plugins branch from 325c140 to 54f786f on May 31, 2025 04:56
@KunWuLuan force-pushed the feat/support-scheduler-plugins branch from 54f786f to 7c572f7 on May 31, 2025 04:57
Signed-off-by: KunWuLuan <[email protected]>
@KunWuLuan force-pushed the feat/support-scheduler-plugins branch from 7c572f7 to 1e40ceb on May 31, 2025 06:31