Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Webhook with a timeoutSeconds greater than 29 seconds will block upgrades? #167

Open
kyrofa opened this issue Apr 15, 2024 · 8 comments
Open

Comments

@kyrofa
Copy link

kyrofa commented Apr 15, 2024

I'm trying to upgrade from dok8s v1.28 to v1.29, but clusterlint is telling me that cert-manager's webhook, which uses a timeoutSeconds of 30, will block the upgrade. Would someone mind filling me in on why that's the case suddenly? I've been using cert-manager for years, across several k8s versions, without issue. I see in the upstream docs where it says:

The timeout value must be between 1 and 30 seconds.

But it's unclear whether that's inclusive or not. Clusterlint has seemingly decided that it's 1-inclusive and 30-exclusive. Is that actually based in reality? Is there another document I've missed? 29 seems so random, haha!

For now I will customize my deployment and set this to 29 to make clusterlint happy, but I'd like to get to the bottom of this. I've logged an issue against cert-manager as well. If clusterlint is correct, cert-manager should probably make the default value 29. On the other hand, if 30 is indeed a valid value, clusterlint probably shouldn't complain about it.

@kyrofa
Copy link
Author

kyrofa commented Apr 15, 2024

I just tried changing timeoutSeconds to 31 and got this:

Error: cannot patch "cert-manager-webhook" with kind MutatingWebhookConfiguration: MutatingWebhookConfiguration.admissionregistration.k8s.io "cert-manager-webhook" is invalid: webhooks[0].timeoutSeconds: Invalid value: 31: the timeout value must be between 1 and 30 seconds && cannot patch "cert-manager-webhook" with kind ValidatingWebhookConfiguration: ValidatingWebhookConfiguration.admissionregistration.k8s.io "cert-manager-webhook" is invalid: webhooks[0].timeoutSeconds: Invalid value: 31: the timeout value must be between 1 and 30 seconds

This is leading me to believe 30 should really be a valid value. Happy to make a PR if folks agree.

@kyrofa
Copy link
Author

kyrofa commented Apr 18, 2024

@maelvls provided some additional details on the issue I logged against cert-manager:

There is this comment in kubelint's code:

Webhooks with TimeoutSeconds set: less than 1 or greater than or equal to 30 is bad.

and

TimeoutSeconds value should be set to a non-nil value (greater than or equal to 1 and less than 30). If the TimeoutSeconds value is set to nil and the cluster version is 1.13.*, users are unable to configure the TimeoutSeconds value and this value will stay at nil, breaking upgrades. It's only for versions >= 1.14 that the value will default to 30 seconds.

That combined with my experimentation above makes it pretty clear that 30 seconds is actually a valid value, and clusterlint should allow it.

@timoreimann
Copy link

I vaguely remember there was a "conflict" between 30 seconds used by a webhook and the default 30 seconds timeout of kubectl: if a webhook used a 30 seconds timeout, the default timeout from kubectl could "override" the webhook one, rendering failure policies of Ignore ineffective.

@varshavaradarajan does that make sense, or am I completely off here?

@elcfd
Copy link

elcfd commented Jul 22, 2024

Hi - could you please let us know the reason why the upper limit of this value has been set to 29?

Seeing the above mentioned issue on both our Digital Ocean k8s clusters which run the cert manager helm chart.

Thanks,

Charlie

@kyrofa
Copy link
Author

kyrofa commented Jul 22, 2024

It sounds like the reason is almost lost to the dust of time. If I may, I recommend following shellcheck's model: not only point out problems, but write quick blurbs describing WHY it's a problem, and ways to fix it. That will be helpful not only to clusterlint users, but clusterlint maintainers over time.

@timoreimann
Copy link

timoreimann commented Jul 22, 2024

There should be documentation on every check, but as usual it can probably be improved.

I am reasonably sure that what I wrote at #167 (comment) about the upper limit is true. Should be fairly easy to verify and afterwards improve / adjust the documentation one way or another.

@elcfd
Copy link

elcfd commented Jul 22, 2024

@kyrofa I contacted Digital Ocean support as I am a customer and this is their response/reasoning:


I completely understand your concern here. I will be happy to assist you. We recommend ensuring the timeout is less than 30s. This is because 30s is the default timeout for Kubernetes objects and it is possible to create a scenario that lines up perfectly with other timeouts which would allow the timeout to fail indefinitely. For more details refer below link:

https://docs.digitalocean.com/support/clusterlint-error-fixes/#admission-controller-webhook-timeout

For example, if a node hosting the webhook goes down, then webhook will not respond to requests. This is problematic when nodes try and register to the master API and the master API cannot reach the webhook. The node registration fails (because the failurePolicy is set to 'Fail') while waiting on the webhook to become available. But the webhook often can't reschedule till a new node has been created. Thus creating a bit of a circular dependency scenario.


@elcfd
Copy link

elcfd commented Jul 22, 2024

There should be documentation on every check, but as usual it can probably be improved.

I am reasonably sure that what I wrote at #167 (comment) about the upper limit is true. Should be fairly easy to verify and afterwards improve / adjust the documentation one way or another.

This matches what Digital Ocean support have told me 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants