KEP 4960: Container Stop Signals #5122
Conversation
sreeram-venkitesh commented on Jan 31, 2025
- One-line PR description: Add the initial KEP for KEP 4960: Container Stop Signals
- Issue link: Container Stop Signals #4960
- Other comments:
I will review this for API but I will likely sit on my comments until one of the API shadows chimes in - @mrunalp is on the scene!
type ContainerStatus struct {
    // ...
    // +optional
    StopSignal string
Not sure if this is something we want in Status.
Why not?
We wouldn't have the stop signal in ContainerStatus once it's moved to lifecycle as discussed.
See other comments - we absolutely DO want it in status, I think.
If we return it in status, is the desire to show what the container image config had, what the Kubernetes CRI request was, or any other possible signals that may be chained depending on the results of the first signal during exit processing?
I think I meant the "effective" stopsignal -- if the Pod specified one, then that's the one. If the Pod did not specify, but the image did, then that's the one. If neither Pod nor image specified, then it's the runtime's default (presumably SIGTERM, but it's not clear if that is part of spec or just convention).
Does that make sense or am I missing something?
Yes, sounds good 👍
Makes sense.. yes.
The rest:
SIG<that will be used on stop request>
// specified order: pod > image > runtime (NRI?/config/default)
timeout (default 2 seconds)
SIG<that will be used on above signal timeout>
Given the above, should we add the timeout and the secondary signal to the status, or maybe just to the verbose status info map?
"We already control the timeout with terminationGracePeriodSeconds
, right?
Is there any configurability for the "final" signal? It has to be something that results in process termination, and there's a very short list of those uncatchable / unblockable signals. SIGKILL is pretty much the only option.
I was more talking from the CRI implementation side than the kubelet node lifecycle/hook side.
Container Runtimes receive a stop "Timeout" on the StopContainer request but not for StopPodSandbox.. so when we call stopContainer to flush the containers in the pod we are stuck with using zero for the timeout of each container which results in immediate KILL signals.
The KILL is not configurable at least in containerd and yes is pretty much the only last ditch option..
You bring up a great point .. mapping terminationGracePeriodSeconds as a total to StopPodSandbox/StopContainer
type Container struct {
    // ...
    // +optional
    StopSignal string
Could probably do:
type Signal string

type Container struct {
    // ...
    // +optional
    StopSignal Signal
}
Ohh, yes, that's a good call.
type Signal string // a string mappable/parseable into a golang syscall.Signal

Going to need some detail I think.. for a typical set of options see
https://github.com/containerd/containerd/blob/main/vendor/github.com/moby/sys/signal/signal.go#L38
https://github.com/containerd/containerd/blob/main/vendor/github.com/moby/sys/signal/signal_linux.go
https://github.com/containerd/containerd/blob/main/vendor/github.com/moby/sys/signal/signal_windows.go
And yes, we are delving into platform realm... or we can consider other mappings/filters.
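For illustration, a minimal sketch of how a string-typed stop signal could be validated against those per-platform tables using moby/sys/signal's ParseSignal (the helper name and placement are an assumption, not part of the KEP):

```go
package main

import (
	"fmt"

	"github.com/moby/sys/signal" // the per-platform signal tables linked above
)

// validateStopSignal is a hypothetical helper: it accepts "" (meaning "not
// set, defer to the image/runtime default") and otherwise requires the string
// to parse to a real signal on this platform (e.g. "SIGTERM", "TERM", "9").
func validateStopSignal(raw string) error {
	if raw == "" {
		return nil
	}
	if _, err := signal.ParseSignal(raw); err != nil {
		return fmt.Errorf("invalid stop signal %q: %w", raw, err)
	}
	return nil
}

func main() {
	fmt.Println(validateStopSignal("SIGTERM")) // <nil>
	fmt.Println(validateStopSignal("SIGFOO"))  // error
}
```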
"" or unset should use the container runtime default or "SIGTERM"
I think I prefer to leave it as "the runtime chooses the default if no other input is provided", rather than assuming SIGTERM in our spec, but I could be wrong?
use the container runtime default or "SIGTERM"
"the runtime chooses the default if no other input is provided", rather than assuming SIGTERM
When I say the stop signal will default to SIGTERM if no signal is specified in the spec or the image, I had this line from containerd in mind where stopSignal is hardcoded to SIGTERM.
// internal/cri/server/container_stop.go
if timeout > 0 {
    stopSignal := "SIGTERM"
    if container.StopSignal != "" {
        stopSignal = container.StopSignal
    } else {
        // ...
container.StopSignal is reading the stop signal from the image, and we can add another condition before that to use the stop signal defined in the spec if there is one. If nothing is defined in the spec ("" or unset), containerd behaves like it does today. I can see similar logic in CRI-O here.
Before this we can also validate that r.GetStopSignal() (where r is *runtime.StopContainerRequest) is a valid signal with the ParseSignal function that Mike linked above.
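Put differently, the resolution order being discussed is spec > image > SIGTERM. A tiny self-contained sketch of that precedence (a hypothetical helper, not the actual containerd patch):

```go
package main

import "fmt"

// effectiveStopSignal resolves the stop signal using the precedence discussed
// here: the signal from the container spec (CRI request), else the image's
// STOPSIGNAL, else containerd's hardcoded SIGTERM default.
// An empty string means "not set".
func effectiveStopSignal(specSignal, imageSignal string) string {
	if specSignal != "" {
		return specSignal
	}
	if imageSignal != "" {
		return imageSignal
	}
	return "SIGTERM"
}

func main() {
	fmt.Println(effectiveStopSignal("SIGUSR1", "SIGQUIT")) // SIGUSR1
	fmt.Println(effectiveStopSignal("", "SIGQUIT"))        // SIGQUIT
	fmt.Println(effectiveStopSignal("", ""))               // SIGTERM
}
```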
That's probably fine. What I would like to know (@mikebrow ?) is whether "send a warning signal" is defined by OCI or is just a convention that containerd inherited from Docker, and that CRI-O follows to be compatible? And where is it defined that the signal be SIGTERM?
I am not trying to say we should change it, just know where we define it explicitly and where we defer to lower-layers.
containerd inherited the TERM convention from.. docker inherited the convention from.. linux inherited the convention from.. posix inherited the convention from.. (Unix/IBM...) TERM is like INT (though INT is more reserved to console ctrl-c ...)
OCI left Stop out completely from the spec and signal processing purposefully vague, we had more interesting versions but ended with this:
https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#kill
I could see INT being used today for soft breaking into console apps.. lots of opinions on signaling... it depends.
I am not trying to say we should change it, just know where we define it explicitly and where we defer to lower-layers.
use cases would help..
#### Alpha

- Feature implemented behind a feature flag
Any Windows considerations?
Excellent question! What does Docker on Windows do for Stop?
We (containerd) send the term/kill/signal along to their custom shim.. and Windows processes it as we do on good ole Linux.. albeit some signals may not be mapped; TERM and KILL seem to be almost universally supported. Would have to check if they are the same values, but conceptually we are ok. Code over here: https://github.com/search?q=repo%3Amicrosoft%2Fhcsshim%20SIGTERM&type=code
So what should the semantics here be? Is it still called "SIGTERM" / "SIGKILL" / "SIGUSR1" etc, or do we need a different set of values for Windows? Or is it just the end-user's job to know whether the pod is Windows or Linux and choose an appropriate value?
IIRC we have an optional "OS" field which we can cross-validate on, right?
It only supports these signals according to the docs here https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/signal?view=msvc-170 cc: @marosset
- SIGABRT
- SIGFPE
- SIGILL
- SIGINT
- SIGSEGV
- SIGTERM
In hcsshim only SIGTERM and SIGKILL are supported for Windows containers (and a few Windows-specific CTRL events).
IIRC we have an optional "OS" field which we can cross-validate on, right?
Yes, there is the spec.Os.Name field which, when set to Windows, changes some behaviors like pod security policy enforcement.
Let me read the rest of the proposal before commenting further here
we (containerd) send the term/kill/signal along to their custom shim.. and Windows processes it as we do on good ole Linux.. albeit some signals may not be mapped, TERM and KILL seem to be almost universally supported.
In hcsshim only SIGTERM and SIGKILL are supported for Windows containers (and a few Windows-specific CTRL events)
@marosset: Following up here, can we implement support for SIGTERM and SIGKILL signals on Windows as best effort for alpha/beta? Also, I'd like to learn what we mean when we say Windows support. I'm not sure if hcsshim is the only way to run containers on Windows? I'm still reading about this. Either way, if we're to support hcsshim completely, this would mean that we can only support the signals you've linked above, right?
I'm not sure if hcsshim is the only way to run container on Windows?
hcsshim is the only way to run containers on Windows (as far as I know).
Moby/docker and containerd both use hcsshim to interact with the Windows Host Compute System, which manages the container lifecycle.
Either way if we're to support hcsshim completely this would mean that we can only support the signals you've linked above right?
Yes, for Windows I think we can only support SIGTERM and SIGKILL
Following up here, can we implement support for SIGTERM and SIGKILL signals in Windows as best effort for alpha/beta?
I think best effort for alpha is OK but I'd like to see support for Windows for beta.
The behavior in the kubelet shouldn't be much different for Windows / Linux. In both cases everything will be communicated over CRI calls, correct?
For validation we typically do Windows-specific validation of pod specs at admission time if spec.Os.Name == "windows".
If pods do not set spec.Os.Name, it is assumed that the pod will not run on a Windows node (even if tolerations / node selectors would place the pod on one).
There are a few cases where the kubelet does extra validation to ensure that Linux-specific fields are not set when the kubelet is validating / building the PodSandboxConfig / ContainerConfig objects (example: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/security_context_windows.go#L42-L73) but this isn't consistent.
I think there should be an admission check that validates the stop signal is only set to either SIGTERM or SIGKILL if spec.Os.Name == windows.
I think we can optionally add checks in the kubelet to cover things like static pods too; that might be good but is optional in my opinion.
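A rough sketch of what such an admission-time check could look like (hypothetical helper and parameter names; the real check would live alongside the existing Windows pod validation):

```go
package main

import "fmt"

// validateWindowsStopSignal is a hypothetical admission-time check: if the pod
// declares spec.os.name == "windows", only SIGTERM and SIGKILL are accepted as
// stop signals, since those are the only signals hcsshim supports.
func validateWindowsStopSignal(osName, stopSignal string) error {
	if stopSignal == "" || osName != "windows" {
		return nil // nothing to validate
	}
	switch stopSignal {
	case "SIGTERM", "SIGKILL":
		return nil
	default:
		return fmt.Errorf("stop signal %s is not supported on Windows; only SIGTERM and SIGKILL are allowed", stopSignal)
	}
}

func main() {
	fmt.Println(validateWindowsStopSignal("windows", "SIGUSR1")) // error
	fmt.Println(validateWindowsStopSignal("windows", "SIGTERM")) // <nil>
	fmt.Println(validateWindowsStopSignal("linux", "SIGUSR1"))   // <nil>
}
```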
In both cases everything will be communicated over CRI calls, correct?
Yes that is correct. I will update the KEP to include how we're only supporting SIGTERM and SIGKILL and how we're going to handle the validation for Windows Pods.
## Proposal

A new StopSignal field will be added to the Container spec. This new field can take string values, and can be used to define a stop signal when creating Pods.
Let me propose an alternate.
We have Pod.spec.containers[].lifecycle.{postStart,preStop}. Why not add this in there?
Taking that a step further (stay with me, here), is it possible to tell runtimes NOT to send a signal? What I mean is: We have 2 lifecycle "moments" (postStart and preStop) which both are type LifecycleHandler. "Stop" seems like just another such moment -- can we make it a LifecycleHandler too?
Specifically:
- Add Signal as an option in type LifecycleHandler
- Add Stop *LifecycleHandler in type Lifecycle
Now we can send a signal for any lifecycle event AND we can do something non-signal for Stop events.
Defaulting is hard. We need it to default to whatever today's behavior is, which means the default would have to be something like:
lifecycle:
  stop:
    signal: {}
Meaning "send the default signal"
Oh, but default values on pods are still moratorium'ed I think. Shoot, that ruins my idea.
Hmm, just think about it? Worst case, we can still put StopSignal in Lifecycle - is that a bad idea?
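For concreteness, a sketch of the API shape proposed above (illustrative only; existing handler fields are elided and this is not the final API):

```go
package corev1sketch

// Signal names a stop signal, e.g. "SIGTERM".
type Signal string

// LifecycleHandler would gain Signal as another option alongside the
// existing handlers.
type LifecycleHandler struct {
	// ... existing handlers (Exec, HTTPGet, ...) elided ...

	// Signal, if set, sends the named signal for this lifecycle moment.
	// +optional
	Signal *Signal
}

// Lifecycle would gain a third "moment" next to PostStart and PreStop.
type Lifecycle struct {
	// +optional
	PostStart *LifecycleHandler
	// +optional
	PreStop *LifecycleHandler
	// Stop configures what happens when the container is stopped.
	// +optional
	Stop *LifecycleHandler
}
```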
we can still put StopSignal in Lifecycle - is that a bad idea?
No I agree, it makes sense for the stop signal to be under LifecycleHandler. I can see how this would be more extensible rather than adding a StopSignal field unrelated to Lifecycle.
Now we can send a signal for any lifecycle event AND we can do something non-signal for Stop events.
I might be wrong, but I don't see a reason why we'd need to send a signal for postStart and preStop but I like the idea of it being configurable with the rest of the lifecycle events in case we need it. All in one place under lifecycle. The second half of doing non-signal stuff for Stop events I can definitely see happening.
In case we don't specify a stop signal for a container from the pod spec, if we populate Pod.spec.containers[].lifecycle.stop.signal with the stop signal from the image or with the default value of SIGTERM (which based on Mrunal's comment #5122 (comment) seems possible), this would mean that all pods would have lifecycle.stop.signal in their spec whether or not the user defines it. This should be fine, right?
My idea was trying to make "stop" a normal lifecycle thing, but I can't find a way to disable sending a literal signal upon stop. We can change WHICH signal but not the fact that you are going to get a signal.
As such, it seems to me that adding a single StopSignal field is the way to configure this. Maybe we ALSO want to do an HTTP or exec, but let's not add that speculatively if it is not also simplifying the surface.
In case we don't specify a stop signal for a container from the pod spec, if we populate Pod.spec.containers[].lifecycle.stop.signal with the stop signal from the image or with the default value of SIGTERM (which based on Mrunal's comment #5122 (comment) seems possible), this would mean that all pods would have lifecycle.stop.signal in its spec whether or not the user defines it. This should be fine right?
We can't know what the image has set STOPSIGNAL to until we have the image metadata. We don't want Kubelet updating pod spec, so it has to go in status. I think this is correct and aligns with other work like reporting which UID and GID are actually in use. See https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3619-supplemental-groups-policy
type ContainerStatus struct {
    // ...
    // +optional
    StopSignal string
We should be clear that this gets populated whether or not the user has customized it. Whatever the CRI uses as a default should also find its way here.
Ahh, okay makes sense. We can surface what was extracted by the runtime from the image config 👍
For the default value, in my POC I had made the field optional so that if StopSignal is not set in the Container spec, the runtime would use the signal defined in the image, and if that too is not present, default to SIGTERM.
But now that we have a stop signal field in the Container spec (in Pod.spec.containers[].lifecycle.stop as mentioned above), we'd want it to be populated with one of these, based on whichever comes first, right?
- the stop signal defined by the user in the spec, or
- the signal defined in the image (if there is one), or
- SIGTERM (if neither the image nor the spec specifies any signal)
We wouldn't want Pod.spec.containers[].lifecycle.stop.signal to be nil in the Pod spec while the runtime uses a certain signal defined in the image. Feels like misleading the user. This way Pod.spec.containers[].lifecycle.stop.signal would never be nil and would pick up the value read by the runtime.
We can surface what was extracted by the runtime from the image config
I need to look into how this is done, but is it possible to do this at the validation stage so that we can set the default value for the stop signal for each container from its image if a custom value is not given? If yes, the container runtime logic can stay the same: if a stop signal is received from the Container spec, use that signal, else fall back to the signal defined in the image. If we don't specify a signal in the spec, both of these would be the same.
For the default value, in my POC I had made the field optional so that if StopSignal is not set at the Container Spec, the runtime would use the signal defined in the image, and if that too is not present, default to SIGTERM.
I agree this is correct
We wouldn't want Pod.spec.containers[].lifecycle.stop.signal be nil in the Pod spec while the runtime uses a certain signal defined in the image.
See #5122 (comment) - we can't update the spec. We can update the status. Spec is what you asked for - if you didn't ask for a StopSignal in the pod, you will get the runtime's default. The status tells you the final result, and is populated by the only place that really knows the answer - kubelet.
IOW, if I specified something, use it. If I didn't specify, let the default happen and get that from the runtime and tell me what it was.
// might be nil going in
result = createContainer(..., requestedSignal, ...)
// is never nil coming out
status.StopSignal = result.StopSignal
Do you get what I mean?
Gotcha, I was able to get this working after referring to how it is done in #3619, thanks!
I added StopSignal to ContainerConfig and ContainerStatus in CRI, passed the stop signal in when creating the ContainerConfig, and, from containerd's toCRIContainerStatus function in container_status.go, return the actual stop signal value in use in the status with this logic:
// internal/cri/server/container_status.go
+func toCRIContainerStopSignal(container containerstore.Container) (string, error) {
+ stopSignal := container.Config.GetStopSignal() // stop signal defined in spec
+ if stopSignal == "" { // if stop signal not defined in spec
+ stopSignal = container.StopSignal // Stop signal defined in the image || SIGTERM
+ }
+ return stopSignal, nil
+}
func toCRIContainerStatus(ctx context.Context, container containerstore.Container, spec *runtime.ImageSpec, imageRef string) (*runtime.ContainerStatus, error) {
//...
+ stopSignal, _ := toCRIContainerStopSignal(container)
return &runtime.ContainerStatus{
//... rest of the Status
+ StopSignal: stopSignal,
}, nil
}
This way the stopSignal field would always be present in the containerStatus with this priority: StopSignal field defined in Container spec > stop signal from image > SIGTERM. If there is no StopSignal set in the spec, only the Status would have the StopSignal field. Thus, it can be nil going in, but never nil coming out. I tested and confirmed this behavior in all three cases. Here's the full code: k/k diff, containerd diff.
As such, it seems to me that adding a single StopSignal field is the way to configure this. Maybe we ALSO want to do an HTTP or exec, but let's not add that speculatively if it is not also simplifying the surface.
Based on this comment from #5122 (comment), are we good to go with a new StopSignal field instead of having this as a new LifecycleHandler? I understand that if we make this decision now it'll be kinda set in stone and will be hard to change later. But I also think that trying to show the stop signal in the Status under Lifecycle might also be tricky since we don't show anything lifecycle related in the Status currently. Based on our decision here I'll update the KEP.
Does the path under containerstatuses[] have to mirror the path under containers[]? That may be a rule I just don't know about. I don't object to it, if so. If not, I think it can be an independent decision.
Hmm I don't mind either option. Tagging @mrunalp for confirming: can we map pod.spec.containers[].lifecycle.stop.signal to pod.status.containerstatuses[].stopsignal? That would be simpler imo.
I agree that adding this to Pod.status.containerstatuses[].lifecycle.stop makes sense and is cleaner. I think we can make an independent decision about where we put it in status.
I think nesting under lifecycle for status may be good from grouping perspective.
Folks I've updated the KEP with the new Stop Lifecycle hook in f5406b6. Please take a look!
#### Story 1

Kubernetes by default sends a SIGTERM to all containers while killing them. When running nginx on Kubernetes, this can result in nginx dropping requests as reported [here](https://github.com/Kong/kubernetes-ingress-controller/pull/283). The current solution for this issue would be to build custom images with a SIGQUIT stop signal or to write a PreStop lifecycle hook that manually kills the process gracefully, which is what is done in the PR. If we had stop signal support at the Container spec level, this would've been easier and more straightforward to implement. Users wouldn't have to patch the applications running on Kubernetes to handle different termination behavior. This is also similar to [this issue](https://github.com/github/resque/pull/21).
We should enumerate the signal names that we will expose as constants in core/v1. I suppose they get Go-style names like Sigterm? Or do we bend the rules and say SIGTERM -- I think I could let that one slide...
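If the all-caps names are kept, the core/v1 constants could look something like this sketch (which signals end up in the allowed set is still open, per the per-platform references in this thread):

```go
package corev1sketch

// Signal names a container stop signal. Sketch only; the final list of
// supported values is still being discussed.
type Signal string

const (
	SIGTERM Signal = "SIGTERM" // conventional default termination signal
	SIGKILL Signal = "SIGKILL" // uncatchable last resort
	SIGINT  Signal = "SIGINT"
	SIGQUIT Signal = "SIGQUIT" // e.g. nginx's graceful shutdown signal
	SIGUSR1 Signal = "SIGUSR1"
	// ... remaining platform signals as needed ...
)
```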
see above moby code and per platform references
we might timeout while doing the SIGQUIT core dump.. hmm
#### Beta

- CRI API implementation for CRI-O
@haircommander I think this should be simple enough to get alpha support in CRI-O as well.
@mrunalp I can work on CRI-O in alpha since the change should be small-ish but what do you think about adding Windows support in beta? Do we usually ship Windows support in alpha itself? Especially for CRI related KEPs.
First we need to define what Windows support means. Does Windows have a SIGTERM equivalent when a container is stopped?
AFAIK Windows is usually best effort. In many examples I've seen, Windows support is added as a separate KEP if the Windows community is interested (CPU Manager, graceful node shutdown). I'm not sure if this is good or bad though.
I think Windows support in beta could be good but I don't know if it needs to block the current KEP. That would be more of a question for sig-windows.
We should target CRI-O for alpha with the new container runtime policy. We could bring this to beta gated off with just containerd, but we would need CRI-O to turn it on for beta. So I think adding this in alpha should be fine.
My point with Windows is we should not just say "we'll maybe handle windows later" without even knowing what Windows does.
When a docker container on Windows is stopped, what is the Windows equivalent of STOPSIGNAL? moby/moby#25982
If we add a field which only works on Linux, are we making a problem for future us?
we'll test it on windows as well
After the discussion in the SIG Windows meeting yesterday and based on @marosset's #5122 (comment), I've added a section in the KEP detailing what we're planning to do for Windows. Please take a look at 9101efb!
Pushed another commit adding more context about hcsshim and how it supports SIGTERM and SIGKILL as valid signals: d29143f
The CRI API would be updated to have StopSignal as the third field in `StopContainerRequest`, along with `container_id` and `timeout`.
I would like to see what we are actually adding to the CRI API here.
So my experience with adding new CRI API fields is that it is difficult to add both the CRI API change and the container runtime implementation in the same release. What I have done in the past is define the CRI API and roll that out as a first alpha. And then you can implement that functionality in the container runtime once containerd/CRI-O have been updated to consume the first alpha. Otherwise you hit a nasty dependency chain where you need the CRI published before you can use those fields in the container runtime.
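For reference, the addition described in the KEP text above would look roughly like this on the generated Go side of the CRI (a sketch only; just the relevant fields are shown and proto field numbering is omitted):

```go
package crisketch

// StopContainerRequest sketches the CRI message with the proposed new field.
// ContainerId and Timeout already exist today; StopSignal is the optional
// third field being discussed.
type StopContainerRequest struct {
	// ID of the container to stop.
	ContainerId string
	// Timeout in seconds to wait for the container to stop gracefully before
	// it is forcibly terminated.
	Timeout int64
	// StopSignal, if non-empty, overrides the runtime/image default signal.
	// Empty keeps today's behavior, so older runtimes can simply ignore it.
	StopSignal string
}
```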
adding it as a new optional field is convenient because the default will either be unset or nil.. first pass would require the container runtimes add a validator and if not implementing the additional field yet.. to fail the request... which can be tested in critest integration I suppose
for older versions of containerd the status check should suffice, with absence on a newly created container being evidence of non-support..
must remember here even if supported in a newer version .. long running containers for the same pod..
Added more details about the changes to the CRI API in f5406b6.
Please see comments
### Risks and Mitigations

I don't see any issues with adding a new field for the stop signal because this is an optional feature that users.
Suggested change:
- I don't see any issues with adding a new field for the stop signal because this is an optional feature that users.
+ I don't see any issues with adding a new field for the stop signal because this is an optional feature.
A risk is adding the complexity of signal handling to pod/container spec users: unhandled signals, hung pods, erratic results.. I don't see kubectl options in the KEP, was that considered?
Understood the risks. I don't get what you meant with kubectl options. Are you talking about adding stop signal as an option to kubectl run maybe? Kevin mentioned adding stop signal as an option for crictl stop, which I've added to the KEP.
API is nice and simple - I think I prefer the second option I suggest, but open to arguments.
If only the kube-apiserver enables this feature, validation will pass, but kubelet won't understand the new lifecycle hook and will not add the stop signal when creating the ContainerConfig.

If only the kubelet has enabled this feature, you won't be able to create a Pod which has a StopSignal lifecycle hook via the apiserver and hence the feature won't be usable even if the kubelet supports it. `containerSpec.Lifecycle.StopSignal` can be an empty value and kubelet functions as if no custom stop signal has been set for any container.
What if the user creates a static pod with this feature?
If the feature is disabled in the kubelet and if we create a static pod, the StopSignal lifecycle hook would be dropped by the kubelet. In the case where the feature gate is disabled in the apiserver but enabled on the kubelet, the kubelet would pass the stop signal to the container runtime and it would work, but the kube-apiserver wouldn't show the pod as having a custom stop signal. This is the same case that'd happen when we create a regular pod with a stop signal and later disable the feature gate in kube-apiserver. The feature would work as long as the feature gate is enabled in the kubelet.
What do you think?
Got it.
/lgtm
### Risks and Mitigations

- We'll be adding the complexity of signal handling to the pod/container spec. If users define a signal that is not handled by their containers, this can lead to pods hanging.
By default, terminationGracePeriodSeconds is 30 seconds, so with the default setting a pod kill will hang for just 30 seconds.
https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#lifecycle
Do I understand correctly?
Yes this is correct, I have updated this in the KEP so that it is clearer.
###### What happens if we reenable the feature if it was previously rolled back?

If you reenable the feature, you'll be able to create Pods with StopSignal lifecycle hooks for their containers. Without the feature gate enabled, this would make your workloads invalid.
Can you explain more about how workloads can become invalid, and what can be done to ensure this does not happen? That is, if a cluster operator decides to reenable this feature and reads this section for guidance, what do they need to know?
Ah my bad, saying that workloads are invalid is incorrect. What I meant to say was that once the gate is disabled, if you try to create new Pods with StopSignal, those would be invalid. Existing workloads should still continue to function as I mentioned in the reply to Paco in #5122 (comment).
If the feature gate is disabled in kube-apiserver alone, users would not be able to create new Pods with containers having StopSignal, but existing Pods with StopSignal should work the same, given that the container runtime versions supports StopSignal in the first place.
If the feature gate is turned off in the kubelet alone, users would be able to create Pods with StopSignal, but the field will be dropped in the kubelet. When they turn this back on, the kubelet would start sending the StopSignal to the container runtime again, and the container runtime starts using the custom stop signal to kill the containers again.
If a cluster operator disables or reenables the feature gate, no change should happen to existing workloads.
When they turn this back on, the kubelet would start sending the StopSignal to the container runtime again
@mrunalp @haircommander @kannon92 I have a question about this since we're sending the StopSignal in the ContainerConfig. Imagine the feature gate is turned on when we have a container running for which we have configured a StopSignal. Now we turn the feature gate off and restart the kubelet. Would the container runtime still have access to the same ContainerConfig we had sent earlier? And does the Pod have to be updated for the ContainerConfig to be updated for the runtime? (Even if we update it, the validation should let existing resources with StopSignal pass.) Existing resources can still have the field, but the field can be dropped silently in the kubelet based on the feature gate. When the feature gate is turned back on, the custom stop signal would be used again and the kubelet starts sending it to the runtime. Please correct me if I'm wrong.
This makes sense to me theoretically, but I'm not entirely sure of how the kubelet-CRI-runtime connection works when we restart the kubelet after changing the feature gates etc. Please let me know if what I described would work. Thanks!
my understanding is when the kubelet restarts it polls the CRI impl for existing containers and then compares them to the ones that the API is requesting. Something I'm not totally clear on is what happens if the features change from under the kubelet, whether the kubelet can see the expected state having changed in that case. If it can't, we may need to recommend a drain in between enabling/disabling
Once we have agreement please update the readme.md with the text so it doesn't get lost here in this thread
I've made some clarifications in 7eadd9f. Please take a look!
From my understanding, the Kubelet’s role is to ensure that the actual state of containers on a node matches the desired state defined in the API server (which is why PLEG exists). During container runtime, the Kubelet checks whether the container’s state aligns with expectations, but it does not validate the ContainerConfig. This means that if a container is already running, restarting the Kubelet won’t disrupt its operation due to extra fields in the ContainerConfig.
While draining a node is the safest approach from a security perspective, I wonder if evicting all workloads on the node might be too costly in this scenario. Could we perhaps consider including a script in this KEP to help users identify and monitor containers leveraging new features?
One question on PRR then LGTM. PRR is a soft deadline.
/approve
I've added the KEP to the SIG Windows meeting agenda today to clarify what we're planning with Windows support.
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature gate can be turned off to disable the feature once it has been enabled.
Describe what is involved. Restarting the kubelet?
@@ -0,0 +1,3 @@
kep-number: 4960
alpha:
  approver: "@jpbetz"
I'm going to pick up for Joe. Can you make this @deads2k?
some PRR updates please.
###### What specific metrics should inform a rollback?

Pods/Containers not getting terminated properly might indicate that something is wrong, although we will aim to handle all such cases gracefully and show proper error messages if something is missing.
what specific metric could I look at?
- Condition name:
- Other field:
- [x] Other (treat as last resort)
  - Details: Check if the containers with custom stop signals are being killed with the stop signal provided. For example your container might want to take SIGUSR1 to be exited. You can achieve this by defining it in the Container spec and have to bake it into your container image.
This would likely be per-node, right? So checking once won't be enough. Until all kubelets unconditionally support this, is there a standard check the user could add in an extra container?
Since we're showing the effective stop signal in the container status, irrespective of whether we're setting a custom signal or not, we can check whether Pods scheduled to every node have a StopSignal field in their statuses to confirm whether the feature is working.
###### Are there any missing metrics that would be useful to have to improve observability of this feature?

No.
Seems like metrics in this area may be missing.
## Drawbacks

There aren't any drawbacks to why this KEP shouldn't be implemented since it does not change the default behaviour.
there can be drawbacks that aren't related to default behavior. I don't necessarily see one in this KEP, but in general "it doesn't change the default behavior" is not sufficient to justify a change.
###### What are other known failure modes?

N/A
Seems like pods not terminating gracefully due to pod owner misconfiguration would be a new failure mode that wasn't possible before.
Thanks for the updates. PRR lgtm
/approve

[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, jpbetz, mrunalp, sreeram-venkitesh, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
LGTM - both for the Windows support and overall
/lgtm