
OCPCLOUD-1252: Add validation webhook for guestAccelerators on GCP #927

Merged

Conversation


@SamuelStuchly SamuelStuchly commented Oct 6, 2021

Revendored openshift/cluster-api-provider-gcp to update GCPMachineProviderSpec struct.
Added validation for guest accelerators and appropriate associated default values.

@SamuelStuchly (Contributor Author)

/test unit

@SamuelStuchly SamuelStuchly changed the title Add validation webhook for guestAccelerators on GCP OCPCLOUD-1252: Add validation webhook for guestAccelerators on GCP Oct 6, 2021
@SamuelStuchly (Contributor Author)

/retest

@elmiko (Contributor) left a comment

Just a couple of comments; the logic looks good to me.

@@ -902,6 +903,16 @@ func defaultGCP(m *Machine, config *admissionConfig) (bool, []string, utilerrors

```go
providerSpec.Disks = defaultGCPDisks(providerSpec.Disks, config.clusterID)

if len(providerSpec.GuestAccelerators) != 0 {
	if providerSpec.GuestAccelerators[0].AcceleratorCount == 0 {
		providerSpec.GuestAccelerators[0].AcceleratorCount = defaultGCPAcceleratorCount
```
Contributor:

I'm a little confused by the logic here: we check to see if the count is 0, and then set it. I'm not necessarily doubting what needs to be done here, but I think a comment would help explain why it needs to be done.

@SamuelStuchly (Contributor Author), Oct 7, 2021:

Well, the idea is that 0 represents an unset AcceleratorCount value, which we want to default to 1, since we do not expect anybody to purposely define a GPU type with a count of zero; there would be no point. This covers the case where a user inputs the type but no count, because they might assume it defaults to 1.

A comment can be added for sure.

Contributor:

Your description makes sense. I think a comment would be good here.
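The defaulting discussed in this thread can be sketched as a small, self-contained Go snippet. Note this is a simplified stand-in, not the actual webhook code: the `GuestAccelerator` struct and `defaultAccelerators` helper are illustrative names, with only `defaultGCPAcceleratorCount` mirroring the constant shown in the PR's diff.

```go
package main

import "fmt"

// GuestAccelerator is a simplified stand-in for the GCP provider spec type.
type GuestAccelerator struct {
	AcceleratorType  string
	AcceleratorCount int64
}

const defaultGCPAcceleratorCount int64 = 1

// defaultAccelerators fills in a count of 1 when the user specified a GPU
// type but left the count unset (0), since a count of 0 is never useful.
// An explicitly set count is left untouched.
func defaultAccelerators(accels []GuestAccelerator) []GuestAccelerator {
	for i := range accels {
		if accels[i].AcceleratorCount == 0 {
			accels[i].AcceleratorCount = defaultGCPAcceleratorCount
		}
	}
	return accels
}

func main() {
	accels := []GuestAccelerator{{AcceleratorType: "nvidia-tesla-a100"}}
	fmt.Println(defaultAccelerators(accels)[0].AcceleratorCount)
}
```

This is exactly the "0 means unset" convention debated above: since the field is a plain integer, the webhook cannot distinguish "user typed 0" from "user omitted the field", which is why the defaulting comment was requested.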

```go
modifyDefault: func(p *gcp.GCPMachineProviderSpec) {
	p.MachineType = "a2-highgpu-1g"
	p.OnHostMaintenance = "TERMINATE"
}, expectedOk: true,
```
Contributor:

Minor nit, just to make it consistent:

Suggested change:

```diff
-	}, expectedOk: true,
+	},
+	expectedOk: true,
```
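The nit above is about formatting an entry in a table-driven test. For context, that pattern looks roughly like the following self-contained sketch; the trimmed-down spec struct and the test-case fields here are illustrative stand-ins, not the PR's actual test file.

```go
package main

import "fmt"

// GCPMachineProviderSpec is a toy stand-in for the provider spec under test.
type GCPMachineProviderSpec struct {
	MachineType       string
	OnHostMaintenance string
}

func main() {
	// Table-driven cases: each entry mutates a fresh spec and records
	// whether the (hypothetical) webhook should accept the result.
	testCases := []struct {
		name          string
		modifyDefault func(*GCPMachineProviderSpec)
		expectedOk    bool
	}{
		{
			name: "a2 machine type with maintenance policy set",
			modifyDefault: func(p *GCPMachineProviderSpec) {
				p.MachineType = "a2-highgpu-1g"
				p.OnHostMaintenance = "TERMINATE"
			},
			expectedOk: true,
		},
	}

	for _, tc := range testCases {
		spec := &GCPMachineProviderSpec{}
		tc.modifyDefault(spec)
		fmt.Printf("%s: machineType=%s ok=%v\n", tc.name, spec.MachineType, tc.expectedOk)
	}
}
```

Putting `expectedOk` on its own line, as the suggestion asks, keeps every field of the struct literal vertically aligned with the other cases in the table.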

@SamuelStuchly (Contributor Author)

/retest

Comment on lines 913 to 915:

```go
if len(providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(providerSpec.MachineType, "a2-") {
	providerSpec.OnHostMaintenance = "TERMINATE"
}
```
Contributor:

Does this make sense? Are we sure we want to override this on behalf of a user? I'd prefer we set an error/warning and then users can update it themselves, right?

If this isn't set, does the instance definitely fail to create?

Contributor Author:

Based on the documentation, I believe an instance with GPUs will not be created unless OnHostMaintenance is set to TERMINATE.
We can set an error/warning, but I didn't see the reason for that, since the user does not really have a choice here if they want to create a GPU instance.

Contributor Author:

https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus#create-new-gpu-vm-a100 -> "VMs with GPUs cannot live migrate, make sure you set the onHostMaintenance parameter to TERMINATE."
It is also mentioned to set this in the section further down, "Creating VMs with attached GPUs (other GPU types)".

Contributor:

IMO, we should set an error when they have a GPU and OnHostMaintenance is not TERMINATE, and instruct them to update it themselves. This logic right now is magic and not obvious to a customer; I'd prefer to make it super obvious what has happened/gone wrong.

Contributor Author:

For customer clarity, it makes sense to me. Will update it.
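The resolution agreed here — validate and reject rather than silently rewrite the user's spec — might look roughly like the following sketch. This is an assumption-laden simplification: `validateGPUMaintenance`, the trimmed-down spec struct, and the error message are illustrative, not the PR's actual implementation (which reports field errors through the webhook's utilerrors machinery).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Trimmed-down stand-in for the GCP provider spec.
type GCPMachineProviderSpec struct {
	MachineType       string
	OnHostMaintenance string
	GuestAccelerators []struct{ AcceleratorType string }
}

// validateGPUMaintenance rejects the spec instead of rewriting it: GCP VMs
// with attached GPUs cannot live-migrate, so OnHostMaintenance must be
// TERMINATE whenever GPUs are requested, either explicitly or implicitly
// via the GPU-bundled a2-* machine types.
func validateGPUMaintenance(spec *GCPMachineProviderSpec) error {
	hasGPU := len(spec.GuestAccelerators) != 0 || strings.HasPrefix(spec.MachineType, "a2-")
	if hasGPU && spec.OnHostMaintenance != "TERMINATE" {
		return errors.New("onHostMaintenance must be set to TERMINATE when GPUs are attached")
	}
	return nil
}

func main() {
	bad := &GCPMachineProviderSpec{MachineType: "a2-highgpu-1g", OnHostMaintenance: "MIGRATE"}
	fmt.Println(validateGPUMaintenance(bad)) // rejected: the user must fix the spec themselves
}
```

Compared with the earlier defaulting approach, the failure mode is explicit: the Machine is rejected at admission time with an actionable message, rather than having its maintenance policy changed behind the user's back.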

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 27, 2021
@SamuelStuchly (Contributor Author)

Should also implement changes from openshift/api#1044.

@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 2, 2021
@JoelSpeed (Contributor)

/approve


openshift-ci bot commented Dec 2, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 2, 2021
@JoelSpeed (Contributor)

/retest

@@ -922,6 +923,13 @@ func defaultGCP(m *machinev1.Machine, config *admissionConfig) (bool, []string,

```go
providerSpec.Disks = defaultGCPDisks(providerSpec.Disks, config.clusterID)

if len(providerSpec.GPUs) != 0 {
	// In case Count was not set it should default to 1, since there is no valid reason for it to be purposely set to 0.
```
@lobziik (Contributor), Dec 3, 2021:

Will we have a GPU attached to every newly created VM?

Contributor:

This will only kick in when a GPU instance is specified by a customer, so no, not every instance

@SamuelStuchly (Contributor Author)

/retest

5 similar comments


openshift-ci bot commented Dec 8, 2021

@SamuelStuchly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-disruptive · Commit: a9cf7c0 · Required: false · Rerun command: /test e2e-aws-disruptive

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


lobziik commented Dec 8, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 8, 2021
@openshift-merge-robot openshift-merge-robot merged commit e3e3e71 into openshift:master Dec 8, 2021
Labels: approved, lgtm
5 participants