OCPCLOUD-1252: Add validation webhook for guestAccelerators on GCP #927
Conversation
Force-pushed from 89c0f6c to d0ff20c.
/test unit
/retest
Just a couple of comments; the logic looks good to me.
```go
@@ -902,6 +903,16 @@ func defaultGCP(m *Machine, config *admissionConfig) (bool, []string, utilerrors
	providerSpec.Disks = defaultGCPDisks(providerSpec.Disks, config.clusterID)

	if len(providerSpec.GuestAccelerators) != 0 {
		if providerSpec.GuestAccelerators[0].AcceleratorCount == 0 {
			providerSpec.GuestAccelerators[0].AcceleratorCount = defaultGCPAcceleratorCount
```
I'm a little confused by the logic here: we check to see if the count is 0, and then set it. I'm not necessarily doubting what needs to be done here, but I think a comment would help explain why it needs to be done.
Well, the idea is that 0 represents an unset AcceleratorCount value, which we want to default to 1, since we do not expect anybody to purposely define a GPU type with its count set to zero; there would be no point. This is for the case when a user inputs the type but no count, because they might assume it defaults to 1.
A comment can be added for sure.
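For illustration, here is a minimal, self-contained sketch of that defaulting with the requested comment added. The helper name, the `accelerator` struct, and the constant are hypothetical stand-ins for the actual provider-spec types, not the repo's code:

```go
package main

import "fmt"

// accelerator is a hypothetical stand-in for the provider-spec GPU entry.
type accelerator struct {
	AcceleratorType  string
	AcceleratorCount int64
}

const defaultAcceleratorCount int64 = 1

// defaultAccelerators defaults an unset AcceleratorCount to 1. Go's zero
// value for an integer field is 0, so an omitted count is indistinguishable
// from an explicit 0, and a deliberate count of 0 would be pointless anyway.
func defaultAccelerators(accs []accelerator) []accelerator {
	for i := range accs {
		if accs[i].AcceleratorCount == 0 {
			accs[i].AcceleratorCount = defaultAcceleratorCount
		}
	}
	return accs
}

func main() {
	accs := defaultAccelerators([]accelerator{{AcceleratorType: "nvidia-tesla-t4"}})
	fmt.Printf("%+v\n", accs[0]) // {AcceleratorType:nvidia-tesla-t4 AcceleratorCount:1}
}
```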
Your description makes sense. I think a comment would be good here.
```go
modifyDefault: func(p *gcp.GCPMachineProviderSpec) {
	p.MachineType = "a2-highgpu-1g"
	p.OnHostMaintenance = "TERMINATE"
}, expectedOk: true,
```
Minor nit, just to make it consistent:

```diff
-}, expectedOk: true,
+},
+expectedOk: true,
```
Force-pushed from d0ff20c to 2dfd330.
/retest
```go
// GPU instances cannot live-migrate, so force OnHostMaintenance to TERMINATE
// for machines with accelerators attached (a2 machine types always have GPUs).
if len(providerSpec.GuestAccelerators) != 0 || strings.HasPrefix(providerSpec.MachineType, "a2-") {
	providerSpec.OnHostMaintenance = "TERMINATE"
}
```
Does this make sense? Are we sure we want to override this on behalf of a user? I'd prefer we set an error/warning so that users can update it themselves, right?
If this isn't set, does the instance definitely fail to create?
Based on the documentation, I believe an instance with GPUs will not be created unless OnHostMaintenance is set to TERMINATE.
We can set an error/warning, but I didn't see the reason for that, since the user does not really have a choice here if they want to create a GPU instance.
https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus#create-new-gpu-vm-a100 -> "VMs with GPUs cannot live migrate, make sure you set the onHostMaintenance parameter to TERMINATE."
Setting this is also mentioned further down, in the section "Creating VMs with attached GPUs (other GPU types)".
IMO, we should set an error when they have a GPU without OnHostMaintenance set to TERMINATE, and instruct them to update it. The logic right now is magic and not obvious to a customer; I'd prefer to make it super obvious what has happened/gone wrong.
For customer clarity, it makes sense to me. Will update it.
Should also implement changes from openshift/api#1044.
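A rough, self-contained sketch of what the error-based check could look like in place of the silent override. The struct and function names here are hypothetical, not the webhook's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// gcpSpec is a hypothetical stand-in for the relevant provider-spec fields.
type gcpSpec struct {
	MachineType       string
	OnHostMaintenance string
	GuestAccelerators []struct{ AcceleratorType string }
}

// validateOnHostMaintenance rejects GPU machines whose spec does not set
// onHostMaintenance to TERMINATE, rather than silently rewriting the value:
// GCP will not create a GPU instance that allows live migration, and an
// explicit error tells the customer exactly what to change.
func validateOnHostMaintenance(spec gcpSpec) error {
	gpuAttached := len(spec.GuestAccelerators) != 0 || strings.HasPrefix(spec.MachineType, "a2-")
	if gpuAttached && spec.OnHostMaintenance != "TERMINATE" {
		return fmt.Errorf("onHostMaintenance must be set to TERMINATE when GPUs are attached, got %q", spec.OnHostMaintenance)
	}
	return nil
}

func main() {
	err := validateOnHostMaintenance(gcpSpec{MachineType: "a2-highgpu-1g"})
	fmt.Println(err) // onHostMaintenance must be set to TERMINATE when GPUs are attached, got ""
}
```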
Force-pushed from 2dfd330 to 3e936c5.
Force-pushed from 753e8cc to a9cf7c0.
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest
```go
@@ -922,6 +923,13 @@ func defaultGCP(m *machinev1.Machine, config *admissionConfig) (bool, []string,
	providerSpec.Disks = defaultGCPDisks(providerSpec.Disks, config.clusterID)

	if len(providerSpec.GPUs) != 0 {
		// In case Count was not set it should default to 1, since there is no valid reason for it to be purposely set to 0.
```
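The hunk is cut off after the comment; judging from the earlier GuestAccelerators revision, the body presumably continues along these lines. This is a hypothetical reconstruction: the `Count` field and `defaultGCPGPUCount` constant are assumptions based on openshift/api#1044, not verbatim from the PR.

```go
		// Hypothetical continuation, not verbatim from the PR: default an
		// unset Count (Go zero value 0) to 1.
		if providerSpec.GPUs[0].Count == 0 {
			providerSpec.GPUs[0].Count = defaultGCPGPUCount
		}
	}
```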
Will we have a GPU attached to every newly created VM?
This will only kick in when a GPU instance is specified by a customer, so no, not every instance.
/retest

5 similar comments

/retest
/retest
/retest
/retest
/retest
@SamuelStuchly: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/lgtm
Revendored openshift/cluster-api-provider-gcp to update the GCPMachineProviderSpec struct. Added validation for guest accelerators and appropriate associated default values.