
OCPCLOUD-1252: Add validation webhook for guestAccelerators on GCP #927

Merged

Conversation


@SamuelStuchly SamuelStuchly commented Oct 6, 2021

Revendored openshift/cluster-api-provider-gcp to update GCPMachineProviderSpec struct.
Added validation for guest accelerators and appropriate associated default values.

@SamuelStuchly (Contributor Author)

/test unit

@SamuelStuchly SamuelStuchly changed the title Add validation webhook for guestAccelerators on GCP OCPCLOUD-1252: Add validation webhook for guestAccelerators on GCP Oct 6, 2021
@SamuelStuchly (Contributor Author)

/retest

@elmiko (Contributor) left a comment

Just a couple of comments; the logic looks good to me.

@@ -902,6 +903,16 @@ func defaultGCP(m *Machine, config *admissionConfig) (bool, []string, utilerrors

```go
providerSpec.Disks = defaultGCPDisks(providerSpec.Disks, config.clusterID)

if len(providerSpec.GuestAccelerators) != 0 {
	if providerSpec.GuestAccelerators[0].AcceleratorCount == 0 {
		providerSpec.GuestAccelerators[0].AcceleratorCount = defaultGCPAcceleratorCount
```
Contributor:

I'm a little confused by the logic here: we check to see if the count is 0, and then set it. I'm not necessarily doubting what needs to be done here, but I think a comment would help explain why it needs to be done.

@SamuelStuchly (Contributor Author), Oct 7, 2021:

Well, the idea is that 0 represents an unset AcceleratorCount value, which we want to default to 1, since we do not expect anybody to purposely define a GPU type with a count of zero; there would be no point. This covers the case where a user inputs the type but no count, because they might assume it defaults to 1.

A comment can be added for sure.

Contributor:

Your description makes sense. I think a comment would be good here.
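The defaulting discussed in this thread can be sketched as a small, self-contained Go snippet. Note this is a simplified stand-in, not the actual webhook code: the `GuestAccelerator` struct and `defaultAccelerators` helper are illustrative names, with only `defaultGCPAcceleratorCount` mirroring the constant shown in the PR's diff.

```go
package main

import "fmt"

// GuestAccelerator is a simplified stand-in for the GCP provider spec type.
type GuestAccelerator struct {
	AcceleratorType  string
	AcceleratorCount int64
}

const defaultGCPAcceleratorCount int64 = 1

// defaultAccelerators fills in a count of 1 when the user specified a GPU
// type but left the count unset (0), since a count of 0 is never useful.
// An explicitly set count is left untouched.
func defaultAccelerators(accels []GuestAccelerator) []GuestAccelerator {
	for i := range accels {
		if accels[i].AcceleratorCount == 0 {
			accels[i].AcceleratorCount = defaultGCPAcceleratorCount
		}
	}
	return accels
}

func main() {
	accels := []GuestAccelerator{{AcceleratorType: "nvidia-tesla-a100"}}
	fmt.Println(defaultAccelerators(accels)[0].AcceleratorCount)
}
```

This is exactly the "0 means unset" convention debated above: since the field is a plain integer, the webhook cannot distinguish "user typed 0" from "user omitted the field", which is why the defaulting comment was requested.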

```go
modifyDefault: func(p *gcp.GCPMachineProviderSpec) {
	p.MachineType = "a2-highgpu-1g"
	p.OnHostMaintenance = "TERMINATE"
}, expectedOk: true,
```
Contributor:

Minor nit, just to make it consistent:

Suggested change:

```diff
-	}, expectedOk: true,
+	},
+	expectedOk: true,
```
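The nit above is about formatting an entry in a table-driven test. For context, that pattern looks roughly like the following self-contained sketch; the trimmed-down spec struct and the test-case fields here are illustrative stand-ins, not the PR's actual test file.

```go
package main

import "fmt"

// GCPMachineProviderSpec is a toy stand-in for the provider spec under test.
type GCPMachineProviderSpec struct {
	MachineType       string
	OnHostMaintenance string
}

func main() {
	// Table-driven cases: each entry mutates a fresh spec and records
	// whether the (hypothetical) webhook should accept the result.
	testCases := []struct {
		name          string
		modifyDefault func(*GCPMachineProviderSpec)
		expectedOk    bool
	}{
		{
			name: "a2 machine type with maintenance policy set",
			modifyDefault: func(p *GCPMachineProviderSpec) {
				p.MachineType = "a2-highgpu-1g"
				p.OnHostMaintenance = "TERMINATE"
			},
			expectedOk: true,
		},
	}

	for _, tc := range testCases {
		spec := &GCPMachineProviderSpec{}
		tc.modifyDefault(spec)
		fmt.Printf("%s: machineType=%s ok=%v\n", tc.name, spec.MachineType, tc.expectedOk)
	}
}
```

Putting `expectedOk` on its own line, as the suggestion asks, keeps every field of the struct literal vertically aligned with the other cases in the table.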

@SamuelStuchly (Contributor Author)

/retest

Comment on lines 913 to 915:

```go
if len(providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(providerSpec.MachineType, "a2-") {
	providerSpec.OnHostMaintenance = "TERMINATE"
}
```
Contributor:

Does this make sense? Are we sure we want to override this on behalf of a user? I'd prefer we set an error/warning and then users can update it themselves, right?

If this isn't set, does the instance definitely fail to create?

Contributor Author:

Based on the documentation, I believe an instance with GPUs will not be created unless OnHostMaintenance is set to TERMINATE.
We can set an error/warning, but I didn't see the reason for that, since the user does not really have a choice here if they want to create a GPU instance.

Contributor Author:

https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus#create-new-gpu-vm-a100 -> "VMs with GPUs cannot live migrate, make sure you set the onHostMaintenance parameter to TERMINATE."
It is also mentioned to set this in the section further down, "Creating VMs with attached GPUs (other GPU types)".

Contributor:

IMO, we should set an error when they have a GPU and OnHostMaintenance is not TERMINATE, and instruct them to update it themselves. This logic right now is magic and not obvious to a customer; I'd prefer to make it super obvious what has happened/gone wrong.

Contributor Author:

For customer clarity, it makes sense to me. Will update it.
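The resolution agreed here — validate and reject rather than silently rewrite the user's spec — might look roughly like the following sketch. This is an assumption-laden simplification: `validateGPUMaintenance`, the trimmed-down spec struct, and the error message are illustrative, not the PR's actual implementation (which reports field errors through the webhook's utilerrors machinery).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Trimmed-down stand-in for the GCP provider spec.
type GCPMachineProviderSpec struct {
	MachineType       string
	OnHostMaintenance string
	GuestAccelerators []struct{ AcceleratorType string }
}

// validateGPUMaintenance rejects the spec instead of rewriting it: GCP VMs
// with attached GPUs cannot live-migrate, so OnHostMaintenance must be
// TERMINATE whenever GPUs are requested, either explicitly or implicitly
// via the GPU-bundled a2-* machine types.
func validateGPUMaintenance(spec *GCPMachineProviderSpec) error {
	hasGPU := len(spec.GuestAccelerators) != 0 || strings.HasPrefix(spec.MachineType, "a2-")
	if hasGPU && spec.OnHostMaintenance != "TERMINATE" {
		return errors.New("onHostMaintenance must be set to TERMINATE when GPUs are attached")
	}
	return nil
}

func main() {
	bad := &GCPMachineProviderSpec{MachineType: "a2-highgpu-1g", OnHostMaintenance: "MIGRATE"}
	fmt.Println(validateGPUMaintenance(bad)) // rejected: the user must fix the spec themselves
}
```

Compared with the earlier defaulting approach, the failure mode is explicit: the Machine is rejected at admission time with an actionable message, rather than having its maintenance policy changed behind the user's back.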

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 27, 2021
@SamuelStuchly (Contributor Author)

Should also implement changes from openshift/api#1044.

@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 2, 2021
@JoelSpeed (Contributor)

/approve


openshift-ci bot commented Dec 2, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 2, 2021
@JoelSpeed (Contributor)

/retest

@@ -922,6 +923,13 @@ func defaultGCP(m *machinev1.Machine, config *admissionConfig) (bool, []string,

```go
providerSpec.Disks = defaultGCPDisks(providerSpec.Disks, config.clusterID)

if len(providerSpec.GPUs) != 0 {
	// In case Count was not set it should default to 1, since there is no valid reason for it to be purposely set to 0.
```
@lobziik (Contributor), Dec 3, 2021:

Will we have a GPU attached to every newly created VM?

Contributor:

This will only kick in when a GPU instance is specified by a customer, so no, not every instance

@SamuelStuchly (Contributor Author)

/retest

5 similar comments


openshift-ci bot commented Dec 8, 2021

@SamuelStuchly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-disruptive · Commit: a9cf7c0 · Required: false · Rerun command: /test e2e-aws-disruptive

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


lobziik commented Dec 8, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 8, 2021
@openshift-merge-robot openshift-merge-robot merged commit e3e3e71 into openshift:master Dec 8, 2021
Labels: approved, lgtm
5 participants