Webhook with a timeoutSeconds greater than 29 seconds will block upgrades? #167

kyrofa · 2024-04-15T21:01:25Z

I'm trying to upgrade from dok8s v1.28 to v1.29, but clusterlint is telling me that cert-manager's webhook, which uses a timeoutSeconds of 30, will block the upgrade. Would someone mind filling me in on why that's the case suddenly? I've been using cert-manager for years, across several k8s versions, without issue. I see in the upstream docs where it says:

The timeout value must be between 1 and 30 seconds.

But it's unclear whether that's inclusive or not. Clusterlint has seemingly decided that it's 1-inclusive and 30-exclusive. Is that actually based in reality? Is there another document I've missed? 29 seems so random, haha!

For now I will customize my deployment and set this to 29 to make clusterlint happy, but I'd like to get to the bottom of this. I've logged an issue against cert-manager as well. If clusterlint is correct, cert-manager should probably make the default value 29. On the other hand, if 30 is indeed a valid value, clusterlint probably shouldn't complain about it.

The text was updated successfully, but these errors were encountered:

kyrofa · 2024-04-15T21:35:17Z

I just tried changing timeoutSeconds to 31 and got this:

Error: cannot patch "cert-manager-webhook" with kind MutatingWebhookConfiguration: MutatingWebhookConfiguration.admissionregistration.k8s.io "cert-manager-webhook" is invalid: webhooks[0].timeoutSeconds: Invalid value: 31: the timeout value must be between 1 and 30 seconds && cannot patch "cert-manager-webhook" with kind ValidatingWebhookConfiguration: ValidatingWebhookConfiguration.admissionregistration.k8s.io "cert-manager-webhook" is invalid: webhooks[0].timeoutSeconds: Invalid value: 31: the timeout value must be between 1 and 30 seconds

This is leading me to believe 30 should really be a valid value. Happy to make a PR if folks agree.

kyrofa · 2024-04-18T18:04:24Z

@maelvls provided some additional details on the issue I logged against cert-manager:

There is this comment in kubelint's code:

Webhooks with TimeoutSeconds set: less than 1 or greater than or equal to 30 is bad.

and

TimeoutSeconds value should be set to a non-nil value (greater than or equal to 1 and less than 30). If the TimeoutSeconds value is set to nil and the cluster version is 1.13.*, users are unable to configure the TimeoutSeconds value and this value will stay at nil, breaking upgrades. It's only for versions >= 1.14 that the value will default to 30 seconds.

That combined with my experimentation above makes it pretty clear that 30 seconds is actually a valid value, and clusterlint should allow it.

timoreimann · 2024-04-19T06:03:57Z

I vaguely remember there was a "conflict" between 30 seconds used by a webhook and the default 30 seconds timeout of kubectl: if a webhook used a 30 seconds timeout, the default timeout from kubectl could "override" the webhook one, rendering failure policies of Ignore ineffective.

@varshavaradarajan does that make sense, or am I completely off here?

elcfd · 2024-07-22T09:13:57Z

Hi - could you please let us know the reason why the upper limit of this value has been set to 29?

Seeing the above mentioned issue on both our Digital Ocean k8s clusters which run the cert manager helm chart.

Thanks,

Charlie

kyrofa · 2024-07-22T16:11:03Z

It sounds like the reason is almost lost to the dust of time. If I may, I recommend following shellcheck's model: not only point out problems, but write quick blurbs describing WHY it's a problem, and ways to fix it. That will be helpful not only to clusterlint users, but clusterlint maintainers over time.

timoreimann · 2024-07-22T20:06:41Z

There should be documentation on every check, but as usual it can probably be improved.

I am reasonably sure that what I wrote at #167 (comment) about the upper limit is true. Should be fairly easy to verify and afterwards improve / adjust the documentation one way or another.

elcfd · 2024-07-22T20:07:40Z

@kyrofa I contacted Digital Ocean support as I am a customer and this is their response/reasoning:

I completely understand your concern here. I will be happy to assist you. We recommend ensuring the timeout is less than 30s. This is because 30s is the default timeout for Kubernetes objects and it is possible to create a scenario that lines up perfectly with other timeouts which would allow the timeout to fail indefinitely. For more details refer below link:

https://docs.digitalocean.com/support/clusterlint-error-fixes/#admission-controller-webhook-timeout

For example, if a node hosting the webhook goes down, then webhook will not respond to requests. This is problematic when nodes try and register to the master API and the master API cannot reach the webhook. The node registration fails (because the failurePolicy is set to 'Fail') while waiting on the webhook to become available. But the webhook often can't reschedule till a new node has been created. Thus creating a bit of a circular dependency scenario.

elcfd · 2024-07-22T20:08:24Z

There should be documentation on every check, but as usual it can probably be improved.

I am reasonably sure that what I wrote at #167 (comment) about the upper limit is true. Should be fairly easy to verify and afterwards improve / adjust the documentation one way or another.

This matches what Digital Ocean support have told me 👍

kyrofa mentioned this issue Apr 15, 2024

clusterlint claims that webhook timeoutSeconds of 30 is too high cert-manager/cert-manager#6907

Closed

timoreimann closed this as completed Jul 22, 2024

timoreimann reopened this Jul 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Webhook with a timeoutSeconds greater than 29 seconds will block upgrades? #167

Webhook with a timeoutSeconds greater than 29 seconds will block upgrades? #167

kyrofa commented Apr 15, 2024 •

edited

Loading

kyrofa commented Apr 15, 2024 •

edited

Loading

kyrofa commented Apr 18, 2024

timoreimann commented Apr 19, 2024

elcfd commented Jul 22, 2024

kyrofa commented Jul 22, 2024

timoreimann commented Jul 22, 2024 •

edited

Loading

elcfd commented Jul 22, 2024

elcfd commented Jul 22, 2024

Webhook with a timeoutSeconds greater than 29 seconds will block upgrades? #167

Webhook with a timeoutSeconds greater than 29 seconds will block upgrades? #167

Comments

kyrofa commented Apr 15, 2024 • edited Loading

kyrofa commented Apr 15, 2024 • edited Loading

kyrofa commented Apr 18, 2024

timoreimann commented Apr 19, 2024

elcfd commented Jul 22, 2024

kyrofa commented Jul 22, 2024

timoreimann commented Jul 22, 2024 • edited Loading

elcfd commented Jul 22, 2024

elcfd commented Jul 22, 2024

kyrofa commented Apr 15, 2024 •

edited

Loading

kyrofa commented Apr 15, 2024 •

edited

Loading

timoreimann commented Jul 22, 2024 •

edited

Loading