KEP 4960: Container Stop Signals #5122
Conversation
sreeram-venkitesh commented on Jan 31, 2025
- One-line PR description: Add the initial KEP for KEP 4960: Container Stop Signals
- Issue link: Container Stop Signals #4960
- Other comments:
I will review this for API but I will likely sit on my comments until one of the API shadows chimes in - @mrunalp is on the scene!
type ContainerStatus struct {
    // ...
    // +optional
    StopSignal string
Not sure if this is something we want in Status.
Why not?
We wouldn't have the stop signal in ContainerStatus once it's moved to lifecycle as discussed.
See other comments - we absolutely DO want it in status, I think.
If we return it in status, is the desire to show what the container image config had, what the Kubernetes CRI request was, or any other possible signals that may be chained depending on the results of the first signal during exit processing?
I think I meant the "effective" stopsignal -- if the Pod specified one, then that's the one. If the Pod did not specify, but the image did, then that's the one. If neither Pod nor image specified, then it's the runtime's default (presumably SIGTERM, but it's not clear if that is part of spec or just convention).
Does that make sense or am I missing something?
Yes, sounds good 👍
Makes sense.. yes.
The rest:
SIG<that will be used on stop request>
// specified order: pod > image > runtime (NRI?/config/default)
timeout (default 2 seconds)
SIG<that will be used on above signal timeout>
Given the above, should we add the timeout and the secondary signal to the status, or maybe just to the verbose status info map?
"We already control the timeout with terminationGracePeriodSeconds
, right?
Is there any configurability for the "final" signal? It has to be something that results in process termination, and there's a very short list of those uncatchable / unblockable signals. SIGKILL is pretty much the only option.
I was more talking from the CRI implementation side than the kubelet node lifecycle/hook side.
Container Runtimes receive a stop "Timeout" on the StopContainer request but not for StopPodSandbox.. so when we call stopContainer to flush the containers in the pod we are stuck with using zero for the timeout of each container which results in immediate KILL signals.
The KILL is not configurable at least in containerd and yes is pretty much the only last ditch option..
You bring up a great point .. mapping terminationGracePeriodSeconds as a total to StopPodSandbox/StopContainer
type Container struct {
    // ...
    // +optional
    StopSignal string
Could probably do:
type Signal string

type Container struct {
    // ...
    // +optional
    StopSignal Signal
}
Ohh, yes, that's a good call.
type Signal string // a string mappable/parseable into a golang syscall.Signal

Going to need some detail I think.. for a typical set of options see
https://github.com/containerd/containerd/blob/main/vendor/github.com/moby/sys/signal/signal.go#L38
https://github.com/containerd/containerd/blob/main/vendor/github.com/moby/sys/signal/signal_linux.go
https://github.com/containerd/containerd/blob/main/vendor/github.com/moby/sys/signal/signal_windows.go
And yes, we are delving into platform realm... or we can consider other mappings/filters.
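For illustration, a minimal sketch of how a string-typed stop signal could be validated against those per-platform tables using moby/sys/signal's ParseSignal (the helper name and placement are an assumption, not part of the KEP):

```go
package main

import (
	"fmt"

	"github.com/moby/sys/signal" // the per-platform signal tables linked above
)

// validateStopSignal is a hypothetical helper: it accepts "" (meaning "not
// set, defer to the image/runtime default") and otherwise requires the string
// to parse to a real signal on this platform (e.g. "SIGTERM", "TERM", "9").
func validateStopSignal(raw string) error {
	if raw == "" {
		return nil
	}
	if _, err := signal.ParseSignal(raw); err != nil {
		return fmt.Errorf("invalid stop signal %q: %w", raw, err)
	}
	return nil
}

func main() {
	fmt.Println(validateStopSignal("SIGTERM")) // <nil>
	fmt.Println(validateStopSignal("SIGFOO"))  // error
}
```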
"" or unset should use the container runtime default or "SIGTERM"
I think I prefer to leave it as "the runtime chooses the default if no other input is provided", rather than assuming SIGTERM in our spec, but I could be wrong?
use the container runtime default or "SIGTERM"
"the runtime chooses the default if no other input is provided", rather than assuming SIGTERM
When I say the stop signal will default to SIGTERM if no signal is specified in the spec or the image, I had this line from containerd in mind where stopSignal is hardcoded to SIGTERM.
// internal/cri/server/container_stop.go
if timeout > 0 {
    stopSignal := "SIGTERM"
    if container.StopSignal != "" {
        stopSignal = container.StopSignal
    } else {
        // ...
container.StopSignal is reading the stop signal from the image, and we can add another condition before that to use the stop signal defined in the spec if there is one. If nothing is defined in the spec ("" or unset), containerd behaves like it does today. I can see similar logic in CRI-O here.
Before this we can also validate that r.GetStopSignal() (where r is *runtime.StopContainerRequest) is a valid signal with the ParseSignal function that Mike linked above.
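Put differently, the resolution order being discussed is spec > image > SIGTERM. A tiny self-contained sketch of that precedence (a hypothetical helper, not the actual containerd patch):

```go
package main

import "fmt"

// effectiveStopSignal resolves the stop signal using the precedence discussed
// here: the signal from the container spec (CRI request), else the image's
// STOPSIGNAL, else containerd's hardcoded SIGTERM default.
// An empty string means "not set".
func effectiveStopSignal(specSignal, imageSignal string) string {
	if specSignal != "" {
		return specSignal
	}
	if imageSignal != "" {
		return imageSignal
	}
	return "SIGTERM"
}

func main() {
	fmt.Println(effectiveStopSignal("SIGUSR1", "SIGQUIT")) // SIGUSR1
	fmt.Println(effectiveStopSignal("", "SIGQUIT"))        // SIGQUIT
	fmt.Println(effectiveStopSignal("", ""))               // SIGTERM
}
```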
That's probably fine. What I would like to know (@mikebrow ?) is whether "send a warning signal" is defined by OCI or is just a convention that containerd inherited from Docker, and that CRI-O follows to be compatible? And where is it defined that the signal be SIGTERM?
I am not trying to say we should change it, just know where we define it explicitly and where we defer to lower-layers.
containerd inherited the TERM convention from.. docker inherited the convention from.. linux inherited the convention from.. posix inherited the convention from.. (Unix/IBM...) TERM is like INT (though INT is more reserved to console ctrl-c ...)
OCI left Stop out completely from the spec and signal processing purposefully vague, we had more interesting versions but ended with this:
https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#kill
I could see INT being used today for soft breaking into console apps.. lots of opinions on signaling... it depends.
I am not trying to say we should change it, just know where we define it explicitly and where we defer to lower-layers.
use cases would help..
#### Alpha

- Feature implemented behind a feature flag
Any Windows considerations?
Excellent question! What does Docker on Windows do for Stop?
We (containerd) send the term/kill/signal along to their custom shim.. and Windows processes it as we do on good ole Linux.. albeit some signals may not be mapped; TERM and KILL seem to be almost universally supported. Would have to check if they are the same values, but conceptually we are ok. Code over here: https://github.com/search?q=repo%3Amicrosoft%2Fhcsshim%20SIGTERM&type=code
So what should the semantics here be? Is it still called "SIGTERM" / "SIGKILL" / "SIGUSR1" etc, or do we need a different set of values for Windows? Or is it just the end-user's job to know whether the pod is Windows or Linux and choose an appropriate value?
IIRC we have an optional "OS" field which we can cross-validate on, right?
It only supports these signals according to the docs here https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/signal?view=msvc-170 cc: @marosset
- SIGABRT
- SIGFPE
- SIGILL
- SIGINT
- SIGSEGV
- SIGTERM
In hcsshim only SIGTERM and SIGKILL are supported for Windows containers (and a few Windows-specific CTRL events).
IIRC we have an optional "OS" field which we can cross-validate on, right?
Yes, there is the spec.Os.Name field which, when set to Windows, changes some behaviors like pod security policy enforcement.
Let me read the rest of the proposal before commenting further here
we (containerd) send the term/kill/signal along to their custom shim.. and Windows processes it as we do on good ole Linux.. albeit some signals may not be mapped, TERM and KILL seem to be almost universally supported.
In hcsshim only SIGTERM and SIGKILL are supported for Windows containers (and a few Windows-specific CTRL events)
@marosset: Following up here, can we implement support for SIGTERM and SIGKILL signals on Windows as best effort for alpha/beta? Also, I'd like to learn what we mean when we say Windows support. I'm not sure if hcsshim is the only way to run containers on Windows? I'm still reading about this. Either way, if we're to support hcsshim completely, this would mean that we can only support the signals you've linked above, right?
I'm not sure if hcsshim is the only way to run container on Windows?
hcsshim is the only way to run containers on Windows (as far as I know).
Moby/docker and containerd both use hcsshim to interact with the Windows Host Compute System, which manages the container lifecycle.
Either way if we're to support hcsshim completely this would mean that we can only support the signals you've linked above right?
Yes, for Windows I think we can only support SIGTERM and SIGKILL
Following up here, can we implement support for SIGTERM and SIGKILL signals in Windows as best effort for alpha/beta?
I think best effort for alpha is OK but I'd like to see support for Windows for beta.
The behavior in the kubelet shouldn't be much different for Windows / Linux. In both cases everything will be communicated over CRI calls, correct?
For validation we typically do Windows-specific validation of pod specs at admission time if spec.Os.Name == "windows".
If pods do not set spec.Os.Name, it is assumed that the pod will not run on a Windows node (even if tolerations / node selectors would place the pod on one).
There are a few cases where the kubelet does extra validation to ensure that Linux-specific fields are not set when the kubelet is validating / building the PodSandboxConfig / ContainerConfig objects (example: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/security_context_windows.go#L42-L73) but this isn't consistent.
I think there should be an admission check that validates the stop signal is only set to either SIGTERM or SIGKILL if spec.Os.Name == windows.
I think we can optionally add checks in the kubelet to cover things like static pods too; that might be good but is optional in my opinion.
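A rough sketch of what such an admission-time check could look like (hypothetical helper and parameter names; the real check would live alongside the existing Windows pod validation):

```go
package main

import "fmt"

// validateWindowsStopSignal is a hypothetical admission-time check: if the pod
// declares spec.os.name == "windows", only SIGTERM and SIGKILL are accepted as
// stop signals, since those are the only signals hcsshim supports.
func validateWindowsStopSignal(osName, stopSignal string) error {
	if stopSignal == "" || osName != "windows" {
		return nil // nothing to validate
	}
	switch stopSignal {
	case "SIGTERM", "SIGKILL":
		return nil
	default:
		return fmt.Errorf("stop signal %s is not supported on Windows; only SIGTERM and SIGKILL are allowed", stopSignal)
	}
}

func main() {
	fmt.Println(validateWindowsStopSignal("windows", "SIGUSR1")) // error
	fmt.Println(validateWindowsStopSignal("windows", "SIGTERM")) // <nil>
	fmt.Println(validateWindowsStopSignal("linux", "SIGUSR1"))   // <nil>
}
```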
In both cases everything will be communicated over CRI calls, correct?
Yes that is correct. I will update the KEP to include how we're only supporting SIGTERM and SIGKILL and how we're going to handle the validation for Windows Pods.
## Proposal

A new StopSignal field will be added to the Container spec. This new field can take string values, and can be used to define a stop signal when creating Pods.
Let me propose an alternate.
We have Pod.spec.containers[].lifecycle.{postStart,preStop}. Why not add this in there?
Taking that a step further (stay with me, here), is it possible to tell runtimes NOT to send a signal? What I mean is: We have 2 lifecycle "moments" (postStart and preStop) which both are type LifecycleHandler. "Stop" seems like just another such moment -- can we make it a LifecycleHandler too?
Specifically:
- Add Signal as an option in type LifecycleHandler
- Add Stop *LifecycleHandler in type Lifecycle
Now we can send a signal for any lifecycle event AND we can do something non-signal for Stop events.
Defaulting is hard. We need it to default to whatever today's behavior is, which means the default would have to be something like:
lifecycle:
  stop:
    signal: {}
Meaning "send the default signal"
Oh, but default values on pods are still moratorium'ed I think. Shoot, that ruins my idea.
Hmm, just think about it? Worst case, we can still put StopSignal in Lifecycle - is that a bad idea?
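For concreteness, a sketch of the API shape proposed above (illustrative only; existing handler fields are elided and this is not the final API):

```go
package corev1sketch

// Signal names a stop signal, e.g. "SIGTERM".
type Signal string

// LifecycleHandler would gain Signal as another option alongside the
// existing handlers.
type LifecycleHandler struct {
	// ... existing handlers (Exec, HTTPGet, ...) elided ...

	// Signal, if set, sends the named signal for this lifecycle moment.
	// +optional
	Signal *Signal
}

// Lifecycle would gain a third "moment" next to PostStart and PreStop.
type Lifecycle struct {
	// +optional
	PostStart *LifecycleHandler
	// +optional
	PreStop *LifecycleHandler
	// Stop configures what happens when the container is stopped.
	// +optional
	Stop *LifecycleHandler
}
```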
we can still put StopSignal in Lifecycle - is that a bad idea?
No I agree, it makes sense for the stop signal to be under LifecycleHandler. I can see how this would be more extensible rather than adding a StopSignal field unrelated to Lifecycle.
Now we can send a signal for any lifecycle event AND we can do something non-signal for Stop events.
I might be wrong, but I don't see a reason why we'd need to send a signal for postStart and preStop but I like the idea of it being configurable with the rest of the lifecycle events in case we need it. All in one place under lifecycle. The second half of doing non-signal stuff for Stop events I can definitely see happening.
In case we don't specify a stop signal for a container from the pod spec, if we populate Pod.spec.containers[].lifecycle.stop.signal with the stop signal from the image or with the default value of SIGTERM (which based on Mrunal's comment #5122 (comment) seems possible), this would mean that all pods would have lifecycle.stop.signal in their spec whether or not the user defines it. This should be fine, right?
My idea was trying to make "stop" a normal lifecycle thing, but I can't find a way to disable sending a literal signal upon stop. We can change WHICH signal but not the fact that you are going to get a signal.
As such, it seems to me that adding a single StopSignal field is the way to configure this. Maybe we ALSO want to do an HTTP or exec, but let's not add that speculatively if it is not also simplifying the surface.
In case we don't specify a stop signal for a container from the pod spec, if we populate Pod.spec.containers[].lifecycle.stop.signal with the stop signal from the image or with the default value of SIGTERM (which based on Mrunal's comment #5122 (comment) seems possible), this would mean that all pods would have lifecycle.stop.signal in its spec whether or not the user defines it. This should be fine right?
We can't know what the image has set STOPSIGNAL to until we have the image metadata. We don't want Kubelet updating pod spec, so it has to go in status. I think this is correct and aligns with other work like reporting which UID and GID are actually in use. See https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3619-supplemental-groups-policy
type ContainerStatus struct {
    // ...
    // +optional
    StopSignal string
We should be clear that this gets populated whether or not the user has customized it. Whatever the CRI uses as a default should also find its way here.
Ahh, okay makes sense. We can surface what was extracted by the runtime from the image config 👍
For the default value, in my POC I had made the field optional so that if StopSignal is not set in the Container spec, the runtime would use the signal defined in the image, and if that too is not present, default to SIGTERM.
But now that we have a stop signal field in the Container spec (in Pod.spec.containers[].lifecycle.stop as mentioned above), we'd want it to be populated with one of these, based on whichever comes first, right?
- the stop signal defined by the user in the spec, or
- the signal defined in the image (if there is one), or
- SIGTERM (if neither the image nor the spec specifies any signal)
We wouldn't want Pod.spec.containers[].lifecycle.stop.signal to be nil in the Pod spec while the runtime uses a certain signal defined in the image. Feels like misleading the user. This way Pod.spec.containers[].lifecycle.stop.signal would never be nil and would pick up the value read by the runtime.
We can surface what was extracted by the runtime from the image config
I need to look into how this is done, but is it possible to do this at the validation stage so that we can set the default value for the stop signal for each container from its image if a custom value is not given? If yes, the container runtime logic can stay the same: if a stop signal is received from the Container spec, use that signal, else fall back to the signal defined in the image. If we don't specify a signal in the spec, both of these would be the same.
For the default value, in my POC I had made the field optional so that if StopSignal is not set at the Container Spec, the runtime would use the signal defined in the image, and if that too is not present, default to SIGTERM.
I agree this is correct
We wouldn't want Pod.spec.containers[].lifecycle.stop.signal be nil in the Pod spec while the runtime uses a certain signal defined in the image.
See #5122 (comment) - we can't update the spec. We can update the status. Spec is what you asked for - if you didn't ask for a StopSignal in the pod, you will get the runtime's default. The status tells you the final result, and is populated by the only place that really knows the answer - kubelet.
IOW, if I specified something, use it. If I didn't specify, let the default happen and get that from the runtime and tell me what it was.
// might be nil going in
result = createContainer(..., requestedSignal, ...)
// is never nil coming out
status.StopSignal = result.StopSignal
Do you get what I mean?
Gotcha, I was able to get this working after referring to how it is done in #3619, thanks!
I added StopSignal to ContainerConfig and ContainerStatus in CRI, passed the stop signal in when creating the ContainerConfig, and, from containerd's toCRIContainerStatus function in container_status.go, return the actual stop signal value in use in the status with this logic:
// internal/cri/server/container_status.go
+func toCRIContainerStopSignal(container containerstore.Container) (string, error) {
+ stopSignal := container.Config.GetStopSignal() // stop signal defined in spec
+ if stopSignal == "" { // if stop signal not defined in spec
+ stopSignal = container.StopSignal // Stop signal defined in the image || SIGTERM
+ }
+ return stopSignal, nil
+}
func toCRIContainerStatus(ctx context.Context, container containerstore.Container, spec *runtime.ImageSpec, imageRef string) (*runtime.ContainerStatus, error) {
//...
+ stopSignal, _ := toCRIContainerStopSignal(container)
return &runtime.ContainerStatus{
//... rest of the Status
+ StopSignal: stopSignal,
}, nil
}
This way the stopSignal field would always be present in the containerStatus with this priority: StopSignal field defined in Container spec > stop signal from image > SIGTERM. If there is no StopSignal set in the spec, only the Status would have the StopSignal field. Thus, it can be nil going in, but never nil coming out. I tested and confirmed this behavior in all three cases. Here's the full code: k/k diff, containerd diff.
As such, it seems to me that adding a single StopSignal field is the way to configure this. Maybe we ALSO want to do an HTTP or exec, but let's not add that speculatively if it is not also simplifying the surface.
Based on this comment from #5122 (comment), are we good to go with a new StopSignal field instead of having this as a new LifecycleHandler? I understand that if we make this decision now it'll be kinda set in stone and will be hard to change later. But I also think that trying to show the stop signal in the Status under Lifecycle might also be tricky since we don't show anything lifecycle related in the Status currently. Based on our decision here I'll update the KEP.
Does the path under containerstatuses[] have to mirror the path under containers[]? That may be a rule I just don't know about. I don't object to it, if so. If not, I think it can be an independent decision.
Hmm I don't mind either option. Tagging @mrunalp for confirming: can we map pod.spec.containers[].lifecycle.stop.signal to pod.status.containerstatuses[].stopsignal? That would be simpler imo.
I agree that adding this to Pod.status.containerstatuses[].lifecycle.stop makes sense and is cleaner. I think we can make an independent decision about where we put it in status.
I think nesting under lifecycle for status may be good from grouping perspective.
Folks I've updated the KEP with the new Stop Lifecycle hook in f5406b6. Please take a look!
#### Story 1

Kubernetes by default sends a SIGTERM to all containers while killing them. When running nginx on Kubernetes, this can result in nginx dropping requests as reported [here](https://github.com/Kong/kubernetes-ingress-controller/pull/283). The current solution for this issue would be to build custom images with a SIGQUIT stop signal or to write a PreStop lifecycle hook that manually kills the process gracefully, which is what is done in the PR. If we had stop signal support at the Container spec level, this would've been easier and more straightforward to implement. Users wouldn't have to patch the applications running on Kubernetes to handle different termination behavior. This is also similar to [this issue](https://github.com/github/resque/pull/21).
We should enumerate the signal names that we will expose as constants in core/v1. I suppose they get Go-style names like Sigterm? Or do we bend the rules and say SIGTERM -- I think I could let that one slide...
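If the all-caps names are kept, the core/v1 constants could look something like this sketch (which signals end up in the allowed set is still open, per the per-platform references in this thread):

```go
package corev1sketch

// Signal names a container stop signal. Sketch only; the final list of
// supported values is still being discussed.
type Signal string

const (
	SIGTERM Signal = "SIGTERM" // conventional default termination signal
	SIGKILL Signal = "SIGKILL" // uncatchable last resort
	SIGINT  Signal = "SIGINT"
	SIGQUIT Signal = "SIGQUIT" // e.g. nginx's graceful shutdown signal
	SIGUSR1 Signal = "SIGUSR1"
	// ... remaining platform signals as needed ...
)
```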
see above moby code and per platform references
we might timeout while doing the SIGQUIT core dump.. hmm
#### Beta

- CRI API implementation for CRI-O
@haircommander I think this should be simple enough to get alpha support in CRI-O as well.
@mrunalp I can work on CRI-O in alpha since the change should be small-ish but what do you think about adding Windows support in beta? Do we usually ship Windows support in alpha itself? Especially for CRI related KEPs.
First we need to define what Windows support means. Does Windows have a SIGTERM equivalent when a container is stopped?
AFAIK Windows is usually best effort. In many examples I've seen, Windows support is added as a separate KEP if the Windows community is interested (CPU Manager, graceful node shutdown). I'm not sure if this is good or bad though.
I think Windows support in beta could be good but I don't know if it needs to block the current KEP. That would be more of a question for sig-windows.
We should target CRI-O for alpha with the new container runtime policy. We could bring this to beta gated off with just containerd, but we would need CRI-O to turn it on for beta. So I think adding this in alpha should be fine.
My point with Windows is we should not just say "we'll maybe handle windows later" without even knowing what Windows does.
When a docker container on Windows is stopped, what is the Windows equivalent of STOPSIGNAL? moby/moby#25982
If we add a field which only works on Linux, are we making a problem for future us?
we'll test it on windows as well
After the discussion in the SIG Windows meeting yesterday and based on @marosset's #5122 (comment), I've added a section in the KEP detailing what we're planning to do for Windows. Please take a look at 9101efb!
Pushed another commit adding more context about hcsshim and how it supports SIGTERM and SIGKILL as valid signals: d29143f
The CRI API would be updated to have StopSignal as the third field in `StopContainerRequest`, along with `container_id` and `timeout`.
I would like to see what we are actually adding to the CRI API here.
So my experience with adding new CRI API fields is that it is difficult to add both the CRI API change and the container runtime implementation in the same release. What I have done in the past is define the CRI API and roll that out as a first alpha. And then you can implement that functionality in the container runtime once containerd/CRI-O have been updated to consume the first alpha. Otherwise you hit a nasty dependency chain where you need the CRI published before you can use those fields in the container runtime.
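For reference, the addition described in the KEP text above would look roughly like this on the generated Go side of the CRI (a sketch only; just the relevant fields are shown and proto field numbering is omitted):

```go
package crisketch

// StopContainerRequest sketches the CRI message with the proposed new field.
// ContainerId and Timeout already exist today; StopSignal is the optional
// third field being discussed.
type StopContainerRequest struct {
	// ID of the container to stop.
	ContainerId string
	// Timeout in seconds to wait for the container to stop gracefully before
	// it is forcibly terminated.
	Timeout int64
	// StopSignal, if non-empty, overrides the runtime/image default signal.
	// Empty keeps today's behavior, so older runtimes can simply ignore it.
	StopSignal string
}
```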
adding it as a new optional field is convenient because the default will either be unset or nil.. first pass would require the container runtimes add a validator and if not implementing the additional field yet.. to fail the request... which can be tested in critest integration I suppose
for older versions of containerd the status check should suffice, with absence on a newly created container being evidence of non-support..
must remember here even if supported in a newer version .. long running containers for the same pod..
Added more details about the changes to the CRI API in f5406b6.
Please see comments
### Risks and Mitigations

I don't see any issues with adding a new field for the stop signal because this is an optional feature that users.
Suggested change:
- I don't see any issues with adding a new field for the stop signal because this is an optional feature that users.
+ I don't see any issues with adding a new field for the stop signal because this is an optional feature.
A risk is adding the complexity of signal handling to pod/container spec users: unhandled signals, hung pods, erratic results.. I don't see kubectl options in the KEP, was that considered?
Understood the risks. I don't get what you meant with kubectl options. Are you talking about adding stop signal as an option to kubectl run maybe? Kevin mentioned adding stop signal as an option for crictl stop, which I've added to the KEP.
API is nice and simple - I think I prefer the second option I suggest, but open to arguments.
If only the kube-apiserver enables this feature, validation will pass, but kubelet won't understand the new lifecycle hook and will not add the stop signal when creating the ContainerConfig.

If only the kubelet has enabled this feature, you won't be able to create a Pod which has a StopSignal lifecycle hook via the apiserver and hence the feature won't be usable even if the kubelet supports it. `containerSpec.Lifecycle.StopSignal` can be an empty value and kubelet functions as if no custom stop signal has been set for any container.
What if the user creates a static pod with this feature?
If the feature is disabled in the kubelet and if we create a static pod, the StopSignal lifecycle hook would be dropped by the kubelet. In the case where the feature gate is disabled in the apiserver but enabled on the kubelet, the kubelet would pass the stop signal to the container runtime and it would work, but the kube-apiserver wouldn't show the pod as having a custom stop signal. This is the same case that'd happen when we create a regular pod with a stop signal and later disable the feature gate in kube-apiserver. The feature would work as long as the feature gate is enabled in the kubelet.
What do you think?
Got it.
/lgtm
### Risks and Mitigations

- We'll be adding the complexity of signal handling to the pod/container spec. If users define a signal that is not handled by their containers, this can lead to pods hanging.
By default, terminationGracePeriodSeconds is 30 seconds, so with the default setting a pod kill will hang for just 30 seconds.
https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#lifecycle
Do I understand correctly?
Yes this is correct, I have updated this in the KEP so that it is clearer.
###### What happens if we reenable the feature if it was previously rolled back?

If you reenable the feature, you'll be able to create Pods with StopSignal lifecycle hooks for their containers. Without the feature gate enabled, this would make your workloads invalid.
Can you explain more about how workloads can become invalid, and what can be done to ensure this does not happen? That is, if a cluster operator decides to reenable this feature and reads this section for guidance, what do they need to know?
Ah my bad, saying that workloads are invalid is incorrect. What I meant to say was that once the gate is disabled, if you try to create new Pods with StopSignal, those would be invalid. Existing workloads should still continue to function as I mentioned in the reply to Paco in #5122 (comment).
If the feature gate is disabled in kube-apiserver alone, users would not be able to create new Pods with containers having StopSignal, but existing Pods with StopSignal should work the same, given that the container runtime versions supports StopSignal in the first place.
If the feature gate is turned off in the kubelet alone, users would be able to create Pods with StopSignal, but the field will be dropped in the kubelet. When they turn this back on, the kubelet would start sending the StopSignal to the container runtime again, and the container runtime starts using the custom stop signal to kill the containers again.
If a cluster operator disables or reenables the feature gate, no change should happen to existing workloads.
When they turn this back on, the kubelet would start sending the StopSignal to the container runtime again
@mrunalp @haircommander @kannon92 I have a question about this since we're sending the StopSignal in the ContainerConfig. Imagine the feature gate is turned on when we have a container running for which we have configured a StopSignal. Now we turn the feature gate off and restart the kubelet. Would the container runtime still have access to the same ContainerConfig we had sent earlier? And does the Pod have to be updated for the ContainerConfig to be updated for the runtime? (Even if we update it, the validation should let existing resources with StopSignal pass.) Existing resources can still have the field, but the field can be dropped silently in the kubelet based on the feature gate. When the feature gate is turned back on, the custom stop signal would be used again and the kubelet starts sending it to the runtime. Please correct me if I'm wrong.
This makes sense to me theoretically, but I'm not entirely sure of how the kubelet-CRI-runtime connection works when we restart the kubelet after changing the feature gates etc. Please let me know if what I described would work. Thanks!
my understanding is when the kubelet restarts it polls the CRI impl for existing containers and then compares them to the ones that the API is requesting. Something I'm not totally clear on is what happens if the features change from under the kubelet, whether the kubelet can see the expected state having changed in that case. If it can't, we may need to recommend a drain in between enabling/disabling
Once we have agreement please update the readme.md with the text so it doesn't get lost here in this thread
I've made some clarifications in 7eadd9f. Please take a look!
From my understanding, the Kubelet’s role is to ensure that the actual state of containers on a node matches the desired state defined in the API server (which is why PLEG exists). During container runtime, the Kubelet checks whether the container’s state aligns with expectations, but it does not validate the ContainerConfig. This means that if a container is already running, restarting the Kubelet won’t disrupt its operation due to extra fields in the ContainerConfig.
While draining a node is the safest approach from a security perspective, I wonder if evicting all workloads on the node might be too costly in this scenario. Could we perhaps consider including a script in this KEP to help users identify and monitor containers leveraging new features?
One question on PRR then LGTM. PRR is a soft deadline.
/approve
I've added the KEP to the SIG Windows meeting agenda today to clarify what we're planning with Windows support.
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature gate can be turned off to disable the feature once it has been enabled.
Describe what is involved. Restarting the kubelet?
@@ -0,0 +1,3 @@
kep-number: 4960
alpha:
  approver: "@jpbetz"
I'm going to pick up for Joe. Can you make this @deads2k?
some PRR updates please.
###### What specific metrics should inform a rollback?

Pods/Containers not getting terminated properly might indicate that something is wrong, although we will aim to handle all such cases gracefully and show proper error messages if something is missing.
what specific metric could I look at?
- Condition name:
- Other field:
- [x] Other (treat as last resort)
  - Details: Check if the containers with custom stop signals are being killed with the stop signal provided. For example your container might want to take SIGUSR1 to be exited. You can achieve this by defining it in the Container spec and have to bake it into your container image.
This would likely be per-node, right? So checking once won't be enough. Until all kubelets unconditionally support this, is there a standard check the user could add in an extra container?
Since we're showing the effective stop signal in the container status, irrespective of whether we're setting a custom signal or not, we can check whether Pods scheduled to every node have a StopSignal field in their statuses to confirm whether the feature is working.
###### Are there any missing metrics that would be useful to have to improve observability of this feature?

No.
Seems like metrics in this area may be missing.
## Drawbacks

There aren't any drawbacks to why this KEP shouldn't be implemented since it does not change the default behaviour.
there can be drawbacks that aren't related to default behavior. I don't necessarily see one in this KEP, but in general "it doesn't change the default behavior" is not sufficient to justify a change.
###### What are other known failure modes?

N/A
Seems like pods not terminating gracefully due to pod owner misconfiguration would be a new failure mode that wasn't possible before.
Thanks for the updates. PRR lgtm
/approve

[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, jpbetz, mrunalp, sreeram-venkitesh, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
LGTM - both for the Windows support and overall
/lgtm