Preventing containers from being unable to be deleted #4757


Open

HirazawaUi wants to merge 2 commits into main from fix-unable-delete

Conversation

@HirazawaUi HirazawaUi commented May 5, 2025

This is a follow-up to PR #4645; I am taking over from @jianghao65536 to continue addressing this issue.

If the runc create process is terminated by a SIGKILL signal, the runc init process may be leaked (for example, left frozen in its cgroup), and it cannot be cleaned up by runc delete/stop because the container lacks a state.json file. This typically occurs when a higher-level container runtime kills the runc create process due to context cancellation or a timeout.

In PR #4645, the Creating state was added to clean up processes in the STAGE_PARENT/STAGE_CHILD stage within the cgroup. This PR no longer adds the Creating state for the following reasons:

  1. Although the runc init STAGE_PARENT/STAGE_CHILD processes may exist when runc create receives a SIGKILL signal, they will also terminate once runc create does:

    • STAGE_PARENT: Relies directly on pipenum to communicate with runc create. When runc create terminates, pipenum is closed, so reads/writes on pipenum fail, triggering bail and termination.
    • STAGE_CHILD: Relies on syncfd to synchronize with STAGE_PARENT. When STAGE_PARENT terminates, syncfd is closed, so reads/writes on syncfd fail, triggering bail and termination.
  2. If the runc create process is terminated during execution, the container may be in one of the following states:

    • paused: If runc create receives a SIGKILL signal while setting up the cgroup, the container will be in the paused state. At this point the runc init process is left frozen in its cgroup and cannot be killed; however, pausedState.destroy will thaw the cgroup and terminate the runc init process.
    • stopped: If runc create receives a SIGKILL signal during the STAGE_PARENT -> STAGE_CHILD phase, the container will be in the stopped state. As described above, STAGE_PARENT/STAGE_CHILD terminate together with runc create, so no processes are left behind; we only need to clean up the remaining cgroup files, and stoppedState.destroy handles this cleanup.

Therefore, based on the above reasons, the existing paused and stopped states are sufficient to handle the abnormal termination of runc create due to a SIGKILL signal.
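
The bail behavior described in the two bullets above can be illustrated outside of runc. Below is a minimal Go sketch (not runc code; the fd number and names are illustrative assumptions): a child process blocks reading a sync pipe inherited from its parent, and as soon as the parent's end of the pipe is closed, as happens when the parent is killed, the read fails and the child exits instead of lingering.

// Minimal sketch (not runc code) of the bail-on-closed-pipe behavior:
// the child blocks on an inherited sync pipe, and once the parent's end
// goes away its read fails, so the child exits instead of leaking.
package main

import (
        "fmt"
        "os"
        "os/exec"
        "time"
)

func main() {
        if len(os.Args) > 1 && os.Args[1] == "child" {
                // Child: block on the inherited pipe (fd 3), analogous to runc init
                // waiting on its sync fd.
                buf := make([]byte, 1)
                _, err := os.NewFile(3, "syncpipe").Read(buf)
                // When the parent's end of the pipe is closed, Read fails (io.EOF).
                fmt.Fprintf(os.Stderr, "child: sync pipe closed, bailing: %v\n", err)
                os.Exit(1)
        }

        r, w, err := os.Pipe()
        if err != nil {
                panic(err)
        }
        cmd := exec.Command(os.Args[0], "child")
        cmd.ExtraFiles = []*os.File{r} // the read end becomes fd 3 in the child
        cmd.Stderr = os.Stderr
        if err := cmd.Start(); err != nil {
                panic(err)
        }
        r.Close() // the parent keeps only the write end

        time.Sleep(time.Second)
        // Simulate the parent dying (e.g. SIGKILL): the kernel closes its fds.
        w.Close()
        _ = cmd.Wait() // the child exits promptly instead of being left behind
}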

@HirazawaUi HirazawaUi force-pushed the fix-unable-delete branch 2 times, most recently from 11c5aba to 60ae641 Compare May 5, 2025 13:30
@HirazawaUi (Author) commented May 5, 2025

I was unable to add integration tests for this PR without resorting to some hacky methods, but I tested whether this issue was resolved in the kubernetes-sigs/kind repository.

In brief, I discovered this issue while working in the kubernetes/kubernetes repo to propagate kubelet's context to the container runtime. The issue manifested as the test job being unable to tear down after the k/k repo's e2e tests completed, because the leaked runc init process and its corresponding systemd scope prevented systemd from shutting down.

Therefore, I opened a PR in the kubernetes-sigs/kind repo to debug this issue by manually replacing the containerd/runc binaries in the CI environment. After building the code from this PR and replacing the binaries in the CI environment, the test job no longer failed to tear down due to systemd being unable to shut down, as the leaked processes were resolved.

Ref: kubernetes-sigs/kind#3903 (Some job failures occurred due to the instability of the k/k repo e2e tests, but they are unrelated to this issue.)

I also conducted some manual tests targeting the scenarios where the leftover container is in the paused and stopped states.

Paused:

Inject a sleep so we can control where the code is interrupted.

diff --git a/vendor/github.com/opencontainers/cgroups/systemd/v1.go b/vendor/github.com/opencontainers/cgroups/systemd/v1.go
index 8453e9b4..bbe3524c 100644
--- a/vendor/github.com/opencontainers/cgroups/systemd/v1.go
+++ b/vendor/github.com/opencontainers/cgroups/systemd/v1.go
@@ -6,6 +6,7 @@ import (
        "path/filepath"
        "strings"
        "sync"
+       "time"

        systemdDbus "github.com/coreos/go-systemd/v22/dbus"
        "github.com/sirupsen/logrus"
@@ -361,6 +362,7 @@ func (m *LegacyManager) Set(r *cgroups.Resources) error {
                }
        }
        setErr := setUnitProperties(m.dbus, unitName, properties...)
+       time.Sleep(time.Second * 30)
        if needsThaw {
                if err := m.doFreeze(cgroups.Thawed); err != nil {
                        logrus.Infof("thaw container after SetUnitProperties failed: %v", err)
1. Create a container:
./runc --systemd-cgroup create mycontainer

2. Check container processes:
ps -ef | grep runc
root        2944     694  0 15:36 pts/2    00:00:00 ./runc --systemd-cgroup create mycontainer
root        2956    2944  0 15:36 ?        00:00:00 ./runc init
root        2963     688  0 15:36 pts/1    00:00:00 grep runc

3. Kill the runc create process:
kill -9 2944

4. Check if the runc init process is left behind:
ps -ef | grep runc
root        2956       1  0 15:36 ?        00:00:00 ./runc init
root        2965     688  0 15:37 pts/1    00:00:00 grep runc

5. Check the current container state:
./runc list
ID            PID         STATUS      BUNDLE              CREATED                OWNER
mycontainer   2953        paused      /root/mycontainer   0001-01-01T00:00:00Z   root

6. Delete the container:
./runc delete -f mycontainer
writing sync procError: write sync: broken pipe
EOF

7. Verify if the runc init process has been cleaned up:
ps -ef | grep runc
root        3067     688  0 15:39 pts/1    00:00:00 grep runc
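
Step 6 works because destroying a paused leftover container must first thaw the frozen cgroup; a SIGKILL aimed at a frozen task only takes effect once the cgroup is thawed. The following is a rough Go sketch of that sequence, assuming cgroup v1 with the systemd driver; it is not runc's actual implementation, and the freezer path, scope name, and helper name are illustrative assumptions.

package main

import (
        "os"
        "path/filepath"
        "syscall"
)

// destroyPausedLeftover is an illustrative sketch (not runc's implementation) of
// the cleanup sequence for a paused leftover container: thaw the frozen cgroup so
// the stuck init process can act on signals, kill it, then remove the now-empty
// cgroup. The paths assume cgroup v1 with the systemd cgroup driver.
func destroyPausedLeftover(scope string, initPid int) error {
        base := filepath.Join("/sys/fs/cgroup/freezer/system.slice", scope)

        // Thaw the freezer first; a frozen task cannot act on SIGKILL until thawed.
        state := filepath.Join(base, "freezer.state")
        if err := os.WriteFile(state, []byte("THAWED"), 0o644); err != nil {
                return err
        }

        // Kill the leaked runc init process (ESRCH means it is already gone).
        if err := syscall.Kill(initPid, syscall.SIGKILL); err != nil && err != syscall.ESRCH {
                return err
        }

        // Remove the empty freezer cgroup directory; real code would also clean up
        // the other controllers and the systemd scope unit.
        return os.Remove(base)
}

func main() {
        // The scope name and PID are purely illustrative (see the example above).
        _ = destroyPausedLeftover("runc-mycontainer.scope", 2956)
}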

Stopped:

Inject a sleep so we can control where the code is interrupted.

diff --git a/libcontainer/process_linux.go b/libcontainer/process_linux.go
index 96e3ca5f..350e3660 100644
--- a/libcontainer/process_linux.go
+++ b/libcontainer/process_linux.go
@@ -613,6 +613,7 @@ func (p *initProcess) start() (retErr error) {
                        return fmt.Errorf("unable to apply cgroup configuration: %w", err)
                }
        }
+       time.Sleep(time.Second * 30)
        if p.intelRdtManager != nil {
                if err := p.intelRdtManager.Apply(p.pid()); err != nil {
                        return fmt.Errorf("unable to apply Intel RDT configuration: %w", err)
1. Create a container:
./runc --systemd-cgroup create mycontainer

2. Check container processes:
ps -ef | grep runc
root        3124     694  0 15:45 pts/2    00:00:00 ./runc --systemd-cgroup create mycontainer
root        3132    3124  0 15:45 pts/2    00:00:00 ./runc init
root        3140     688  0 15:45 pts/1    00:00:00 grep runc

3. Kill the runc create process:
kill -9 3124

4. Check if the runc init process is left behind (There will be no runc init process left behind):
ps -ef | grep runc
root        3142     688  0 15:45 pts/1    00:00:00 grep runc

5. Check the current container state:
./runc list
ID            PID         STATUS      BUNDLE              CREATED                OWNER
mycontainer   0           stopped     /root/mycontainer   0001-01-01T00:00:00Z   root

6. Delete the container:
./runc delete -f mycontainer

@HirazawaUi (Author) commented:

/cc @kolyshkin @AkihiroSuda @rata

@kolyshkin (Contributor) commented:

See also: #2575

// because the container lacks a state.json file.
// This typically occurs when higher-level
// container runtimes terminate the runc create process due to context cancellation or timeout.
_, err = p.container.updateState(nil)

A Member commented on the code above:

Is this fixing the problem, or is there still a race if the signal is sent before we do this?

@HirazawaUi (Author) replied:

Although the runc init STAGE_PARENT/STAGE_CHILD processes may exist when runc create receives a SIGKILL signal, they will also terminate once runc create does:

  • STAGE_PARENT: Relies directly on pipenum to communicate with runc create. When runc create terminates, pipenum is closed, so reads/writes on pipenum fail, triggering bail and termination.
  • STAGE_CHILD: Relies on syncfd to synchronize with STAGE_PARENT. When STAGE_PARENT terminates, syncfd is closed, so reads/writes on syncfd fail, triggering bail and termination.

As stated here, runc init (STAGE_PARENT/STAGE_CHILD) will terminate after runc create terminates, and at this point the cgroup has not yet been created. I believe this does not lead to a race condition, nor does it leave processes or other resources uncleaned.

@rata (Member) commented May 7, 2025

@HirazawaUi thanks! So my comment was spot-on, but you didn't need to remove the assignment?

For testing, I'd like to have something simple and reasonably reliable. Here are some ideas, but we don't need a test if we can't find a reasonable and simple way to test this:

  • I wonder if creating a PID namespace with a low limit can emulate it (only if it's simple; I guess it is?). We can then increase the limit and call runc delete to see that it is deleted correctly.
  • Or maybe we can use fanotify to block some operation and send a SIGKILL at that point?
  • Or maybe in unit tests we can override the start() function and create a process with the API that will be blocked there, before the state file is created?

@HirazawaUi (Author) replied:

but you didn't need to remove the assignment?

I believe that removing this assignment and delaying it until after updateState is pointless. Regardless of whether it is removed here, the container will enter the stopped state if the creation process is interrupted before the cgroup is frozen, and stoppedState.destroy() can properly clean up the residual files in this scenario.
ref:

if !c.hasInit() {
        return c.state.transition(&stoppedState{c: c})
}

func (c *Container) hasInit() bool {
        if c.initProcess == nil {
                return false
        }
        pid := c.initProcess.pid()
        stat, err := system.Stat(pid)
        if err != nil {
                return false
        }

@HirazawaUi (Author) replied:

  • I wonder if creating a PID namespace with a low limit can emulate it (only if it's simple; I guess it is?). We can then increase the limit and call runc delete to see that it is deleted correctly.
  • Or maybe we can use fanotify to block some operation and send a SIGKILL at that point?
  • Or maybe in unit tests we can override the start() function and create a process with the API that will be blocked there, before the state file is created?

I will try testing it in the direction of Suggestion 2 (it seems the most effective). If it cannot be implemented, I will promptly provide feedback here :)

@HirazawaUi HirazawaUi force-pushed the fix-unable-delete branch 9 times, most recently from 39d801e to a6ebd29 Compare May 9, 2025 14:16
@HirazawaUi (Author) commented:

A test case has been added.

While attempting to use fanotify to monitor open events on state.json and terminate the runc create process upon detecting one, I suddenly realized a blind spot I had never considered: why not simply run runc create and then send it a SIGKILL signal within a very short time frame?

Compared to event monitoring, this approach better matches the scenario we encountered and is completely asynchronous. The only downside seems to be its fragility, but I added numerous device rules to slow down cgroup creation and restricted the test to cgroup v1 only, which reduces the likelihood of errors (in my latest push, all test runs passed with no errors).
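
For reference, the shape of this approach (independent of the actual bats test added in the PR) is roughly the following Go sketch; the binary name, container id, and delay are illustrative assumptions rather than values taken from the test.

// Rough sketch (for illustration only; the real test in this PR is a bats
// integration test) of the race-the-SIGKILL approach: start "runc create",
// kill it after a very short delay, then verify that "runc delete -f"
// cleans everything up.
package main

import (
        "fmt"
        "os/exec"
        "syscall"
        "time"
)

func main() {
        create := exec.Command("runc", "--systemd-cgroup", "create", "mycontainer")
        if err := create.Start(); err != nil {
                panic(err)
        }

        // Give runc create just enough time to reach the cgroup setup phase; this
        // delay is the fragile part that the extra device rules try to widen.
        time.Sleep(50 * time.Millisecond)
        _ = create.Process.Signal(syscall.SIGKILL)
        _ = create.Wait()

        // The container should now be reported as paused or stopped, and
        // "runc delete -f" must remove the leftover cgroup and init process.
        out, err := exec.Command("runc", "delete", "-f", "mycontainer").CombinedOutput()
        fmt.Printf("delete output: %s (err: %v)\n", out, err)
}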

@rata Do you think this test case sufficiently covers the scenarios for this PR?


@HirazawaUi (Author) commented:

ping @kolyshkin @AkihiroSuda @rata Could you take another look at this PR? Any feedback would be greatly appreciated.

@kolyshkin (Contributor) left a comment:

(Sorry, had some pending review comments which I forgot to submit)

Also, you need a proper name/description for the second commit. Currently it just says "add integration test" which is enough in the context of this PR, but definitely not enough when looking at git history.

@HirazawaUi HirazawaUi force-pushed the fix-unable-delete branch from a6ebd29 to 1606d12 Compare May 15, 2025 04:24
@HirazawaUi HirazawaUi force-pushed the fix-unable-delete branch 2 times, most recently from 4b1b1e0 to 72e3fac Compare May 15, 2025 13:01
@HirazawaUi (Author) commented:

@kolyshkin Thank you so much for your review! All the issues have been addressed. Could you take another look at this PR?

I believe this PR is harmless overall: it doesn't break any existing behavior and simply adds handling for one more failure scenario. :)

@kolyshkin (Contributor) left a comment:

LGTM, thanks!

@kolyshkin kolyshkin requested review from rata and lifubang May 27, 2025 19:25
@kolyshkin (Contributor) commented:

@rata @lifubang ptal

# This test verifies that a container can be properly deleted
# even if the runc create process was killed with SIGKILL

requires root cgroups_v1

A Member commented on the test above:

Why requires cgroups_v1? I think it also works for cgroup v2.

@HirazawaUi (Author) replied:

Using cgroup v2, the cgroup configuration phase during runc create may complete too quickly, so the container reaches the created state directly. This could result in test failures.

@HirazawaUi (Author) added:

While it might succeed, this approach could make the test unstable, so I'm restricting it to cgroup v1 for this implementation.

The Member replied:

This could result in test failures.

Failure on which line? Maybe here?

[[ "$output" == *"stopped"* || "$output" == *"paused"* ]]

@HirazawaUi (Author) commented May 28, 2025:

Yes, the changes in this PR aim to ensure proper cleanup of the container if the runc create process terminates unexpectedly before the container enters the "created" state.

Therefore, our test case should specifically verify that the container remains in either the "stopped" or "paused" state under such scenarios. Once the container transitions to the "created" state, it falls outside the scope this test case is intended to cover.

@lifubang (Member) commented May 28, 2025:

Using cgroup v2, the cgroup configuration phase during runc create may complete too quickly, so the container reaches the created state directly.

So, there is still a chance to reach the stopped or paused state for cgroup v2 here?

The changes in this PR aim to ensure proper cleanup of the container if the runc create process terminates unexpectedly before the container enters the "created" state.

Therefore, our test case should specifically verify that the container remains in either the "stopped" or "paused" state under such scenarios. Once the container transitions to the "created" state, it falls outside the scope this test case is intended to cover.

But how can we ensure your changes work in cgroup v2? If we introduce a regression in cgroup v2 for this scenario, there is no test to catch it.
I think requires cgroups_v1 is meant for testing functions that are only supported by cgroup v1 and not by cgroup v2. This PR affects both cgroup versions, so I think we should cover both cgroup v1 and cgroup v2 here.

Maybe you can do the cgroup version check in L302?

@HirazawaUi (Author) commented May 28, 2025:

So, there is still a chance to reach the stopped or paused state for cgroup v2 here?

yes.

But how can we ensure your changes work in cgroup v2? If we introduce a regression in cgroup v2 for this scenario, there is no test to catch it. I think requires cgroups_v1 is meant for testing functions that are only supported by cgroup v1 and not by cgroup v2. This PR affects both cgroup versions, so I think we should cover both cgroup v1 and cgroup v2 here.

Maybe you can do the cgroup version check in L302?

I fully understand your concerns, but as I mentioned earlier, there's no reliable way to reproduce this scenario 100% of the time in cgroup v2 environments without artificially injecting sleeps into the code.

We could modify the test at L302 to relax the state validation for cgroup v2 (allowing containers in the "created" state to pass), but that would render the test meaningless in the context of this PR. The core purpose of this PR is to ensure cleanup when runc create terminates before the container reaches the "created" state.

Once a container enters the "created" state, its cleanup is already guaranteed by existing runc logic; that's outside the scope of what these changes aim to address.

This is merely a stopgap measure I've resorted to since I've been unable to delay the cgroupv2 set process. I'd be extremely grateful if you have any suggestions for reliably postponing cgroupv2 configuration.

The Member replied:

but as I mentioned earlier, there's no reliable and effective way to 100% reproduce this test case in cgroupv2 environments without artificially injecting sleeps into the code.

In fact, I have run this test 1000 times for cgroup v2 and didn't hit any errors. Another question: can you guarantee a 100% correct result for this test case on cgroup v1?

BTW, please remove the last commit Merge branch main..., and use git rebase to catch up with the branch main.

@HirazawaUi (Author) commented May 29, 2025:

In fact, I have run this test 1000 times for cgroup v2 and didn't hit any errors. Another question: can you guarantee a 100% correct result for this test case on cgroup v1?

I can't guarantee that cgroupv1 will definitely succeed, but based on my debugging experience, configuring cgroupv1 generally takes more time than cgroupv2 :)

In earlier automated runs of this PR's tests there were failures, so I modified this test case to be limited to cgroup v1.

I just tried to find the historical test execution records for this PR, but GitHub Actions doesn't seem to provide an accessible entry point for this.

BTW, please remove the last commit Merge branch main..., and use git rebase to catch up with the branch main.

Ah, this is GitHub's default behavior after clicking the 'update branch' button. I hadn't noticed the additional commit; I'll rebase it.

@HirazawaUi HirazawaUi force-pushed the fix-unable-delete branch from 0a35a54 to 4720e5b Compare May 29, 2025 00:31
@HirazawaUi (Author) commented:

The failed test job is unrelated to this PR. Waiting for a re-run...
