
Try to use pidfd and epoll to wait init process exit #4517


Open

lifubang wants to merge 2 commits into main from kill-via-pidfd

Conversation

lifubang
Member

@lifubang lifubang commented Nov 9, 2024

This PR makes some optimizations for runc delete -f.

  1. Nowadays, the Go runtime already tries to use unix.PidfdSendSignal when sending a signal to a process,
    which helps reduce the risk of pid-reuse attacks. So we should replace unix.Kill with
    os.Process.Signal in runc where possible.
  2. However, os.Process.Wait can only wait for a child process. To wait for an unrelated process, we
    introduce pidfd & epoll, which reduces the sleep time when we want to detect whether the init
    process has exited (a rough sketch of this approach is shown below).
  3. For kernels that don't support the pidfd & epoll solution, we fall back to the traditional
    unix.Kill solution, but for stopped containers or containers running on a lightly loaded machine,
    we don't need to wait 100ms before the next check.
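
To make the mechanism concrete, here is a minimal, self-contained sketch of the pidfd & epoll approach described in item 2. This is not the PR's actual code; the function name waitPidfdExit and the structure are made up for illustration.

// Sketch only: kill a process via its pidfd and wait for it to exit via epoll.
package example

import (
    "fmt"
    "time"

    "golang.org/x/sys/unix"
)

// waitPidfdExit sends SIGKILL through a pidfd (immune to pid reuse) and then
// waits for the pidfd to become readable, which happens when the process exits.
func waitPidfdExit(pid int, timeout time.Duration) error {
    pidfd, err := unix.PidfdOpen(pid, 0)
    if err != nil {
        return err // e.g. ENOSYS on old kernels; caller falls back to unix.Kill polling
    }
    defer unix.Close(pidfd)

    if err := unix.PidfdSendSignal(pidfd, unix.SIGKILL, nil, 0); err != nil {
        return err
    }

    epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
    if err != nil {
        return err
    }
    defer unix.Close(epfd)

    ev := unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(pidfd)}
    if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, pidfd, &ev); err != nil {
        return err
    }

    events := make([]unix.EpollEvent, 1)
    msec := int(timeout.Milliseconds())
    for {
        n, err := unix.EpollWait(epfd, events, msec)
        if err == unix.EINTR {
            continue
        }
        if err != nil {
            return err
        }
        if n == 0 {
            return fmt.Errorf("process %d still running after %v", pid, timeout)
        }
        return nil // the pidfd is readable: the process has exited
    }
}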

Close: #4512

@lifubang
Member Author

lifubang commented Nov 9, 2024

@abel-von PTAL

@kolyshkin
Contributor

It seems that the first commit can be merged now and is definitely an improvement.

For the rest of it, give me a few days to review.

@kolyshkin
Contributor

Reviewing this reminded me of the next step needed for pidfd support in Go, so I wrote this proposal: golang/go#70352

@lifubang
Member Author

Reviewing this reminded me of the next step needed for pidfd support in Go, so I wrote this proposal: golang/go#70352

Wonderful proposal. I used to think Go wouldn't support such interfaces, but it would be very useful; I'm looking forward to it.

@abel-von

LGTM

@kolyshkin
Contributor

@lifubang are you going to keep working on it? This looks very good overall

@lifubang
Member Author

lifubang commented Dec 7, 2024

@lifubang are you going to keep working on it? This looks very good overall

Thanks, I'll work on it later.

@lifubang lifubang force-pushed the kill-via-pidfd branch 2 times, most recently from cc64599 to 16ae7fc on December 8, 2024 14:34
Contributor

@kolyshkin kolyshkin left a comment

Frankly, I'm having a hard time reviewing this because it's hard to wrap my head around all the kill/destroy logic that we already have in place:

  • container.Signal;
  • terminate method of parentProcess;
  • destroy(c *Container);
  • etc.

All this seems like a bunch of code full of special cases and kludges, and (in my eyes) it cries out to be refactored into something more straightforward and clean. Maybe I'm wrong, but this stands in the way of my reviewing this.

As I can't finish this, here are my WIP review bits and pieces:

  1. The logic added might also be useful from func destroy(c *Container) error.

  2. Epoll on a pidfd could also be used when there are multiple pids (i.e. from signalAllProcesses, in the current code).

  3. This adds a new public method (container.Kill) when we already have container.Signal. I understand why, but maybe we should call it container.KillSync (or container.EnsureKilled) instead (so it's clear it doesn't just send SIGKILL but also waits for the container to be killed). A note referencing the new method should be added to container.Signal. It should also be described in libcontainer's README.

@kolyshkin
Contributor

This would be very nice to have in 1.3 but we might not have enough time to have it ready by 1.3-rc1.

@lifubang
Member Author

2. Epoll on a pidfd could also be used when there are multiple pids (i.e. from signalAllProcesses, in the current code).

I think it's not worth doing:

  1. We have never had a mechanism to check that a container sharing its pid ns was actually killed;
  2. I think cgroup v1 will be deprecated in the near future, and we have already switched to cgroup.kill on cgroup v2.
    WDYT @kolyshkin

If we don't want to introduce a wait mechanism in signalAllProcesses, I think I can complete this PR soon.

@lifubang lifubang force-pushed the kill-via-pidfd branch 2 times, most recently from edc621d to 5f102d8 on February 25, 2025 03:46
@kolyshkin
Contributor

I think it's not worth doing:

  1. We have never had a mechanism to check that a container sharing its pid ns was actually killed;
  2. I think cgroup v1 will be deprecated in the near future, and we have already switched to cgroup.kill on cgroup v2.
    WDYT @kolyshkin

If we don't want to introduce a wait mechanism in signalAllProcesses, I think I can complete this PR soon.

I agree, let's drop the `signalAllProcesses` idea (frankly, I don't remember why I thought it would be useful there).

@kolyshkin
Contributor

Will try to review the rest of it tomorrow.

Comment on lines 459 to 460
// We don't need unix.PidfdSendSignal because the Go runtime will use it if possible.
_ = c.Signal(unix.SIGKILL)
Contributor

Maybe use c.signal here, as we have already made sure we have a private pidns.

Maybe don't ignore the error, because otherwise you might be waiting 10 seconds for something that will never happen.

Or, you can be explicit and use PidfdSendSignal here to be 100% sure that it succeeded.

Contributor

@kolyshkin kolyshkin left a comment

LGTM, thanks!

@rata PTAL

@lifubang lifubang requested review from rata and AkihiroSuda March 26, 2025 10:44
@rata
Member

rata commented Mar 26, 2025

I'm on PTO, will take a look next week if it's not already merged ;)

Member

@rata rata left a comment

I left several comments. I feel like I'm missing something regarding the relation between pidfd and pidns. Sorry if that is the case :-/

Comment on lines 466 to 468
n, err := unix.EpollWait(epollfd, events, 10000)
if err != nil {
if err == unix.EINTR {
Member

Can you add a wrapper in libcontainer/internal/linux for unix.EpollWait, using the retryOnEINTR helper, so we don't have to do it here?
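
For context, a rough sketch of what such a wrapper might look like, with the EINTR retry written out inline (the real helper in libcontainer/internal/linux and its retryOnEINTR signature may differ; the package name here is illustrative):

// Sketch of an EpollWait wrapper that retries when interrupted by a signal.
package linuxutil

import (
    "os"

    "golang.org/x/sys/unix"
)

// EpollWait calls unix.EpollWait and retries on EINTR, so callers don't need
// their own retry loop. Returns -1 on error, mirroring epoll_wait(2).
func EpollWait(epfd int, events []unix.EpollEvent, msec int) (int, error) {
    for {
        n, err := unix.EpollWait(epfd, events, msec)
        if err == unix.EINTR {
            continue
        }
        if err != nil {
            return -1, os.NewSyscallError("epoll_wait", err)
        }
        return n, nil
    }
}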

Member Author

I'll rebase to add this based on #4697.

Comment on lines 503 to 507
err := c.killViaPidfd()
if err == nil {
Member

we need a linter to automatically suggest using the single-line if here: if err := ...; err == nil {...

Member Author

We can't use a single-line if to check the err, because we have to log the err and continue to the next step.

Member

But you can check for no error, IOW err == nil. It's exactly the same.

What am I missing?

Member Author

We need to print this error on the next line, but anyway, I have declared err as a block-scoped var now.

Member

That is okay. I think using an else is even simpler. Or something like:

if err := c.Kill...(); err != nil {
  logrus.Debugf("pidfd....")
} else {
  return nil
}

defer unix.Close(epollfd)

event := unix.EpollEvent{
Events: unix.EPOLLIN,
Member

I think doing epoll with only one fd is kind of overkill, I wonder if there are simpler solutions. But if, as I mentioned in another comment, we can kill all the processes in the cgroup, then the epoll might be worth it?
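
One simpler option for a single fd (an assumption about what "simpler solutions" could mean here, not something taken from the PR) would be poll(2) on the pidfd directly; a minimal sketch:

// Sketch: wait for a single pidfd with poll(2) instead of an epoll instance.
// A pidfd reports POLLIN once the process has terminated.
package example

import (
    "fmt"

    "golang.org/x/sys/unix"
)

func waitPidfd(pidfd int, timeoutMsec int) error {
    fds := []unix.PollFd{{Fd: int32(pidfd), Events: unix.POLLIN}}
    for {
        n, err := unix.Poll(fds, timeoutMsec)
        if err == unix.EINTR {
            continue
        }
        if err != nil {
            return err
        }
        if n == 0 {
            return fmt.Errorf("timed out waiting for pidfd %d", pidfd)
        }
        return nil
    }
}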

return errors.New("container init still running")
}

if n > 0 {
Member

Is there a case where n < 0? I mean, do we need this if here?

Member Author

I'll refactor to cover this condition.

@@ -431,6 +435,81 @@ func (c *Container) signal(s os.Signal) error {
return nil
}

func (c *Container) killViaPidfd() error {
pidfd, err := unix.PidfdOpen(c.initProcess.pid(), 0)
Member

I know the old code was killing only the init process. But would it make sense to kill all the processes in the cgroup instead?

We can do it as another PR, if that makes sense.

Member Author

As you mentioned above, we can consider this order:

  1. Use cgroup.kill if the kernel supports it
  2. Use pidfd if the kernel supports it
  3. Just send a signal otherwise

But I think we still need to consider whether the container has a private pid ns or not.
What I think is that it's reasonable to kill only the init process for a container with a private pid ns.
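
For illustration, here is one reading of the ordering being discussed in this thread and the follow-up comments, as a rough sketch; the inputs and the returned strategy names are placeholders, not runc's real API.

package example

// killStrategy sketches the decision order discussed above; it is an
// illustration, not the PR's implementation.
func killStrategy(privatePidNS, cgroupKillSupported, pidfdSupported bool) string {
    switch {
    case privatePidNS && pidfdSupported:
        // Private pid ns: killing init via its pidfd is enough, since the
        // kernel tears down the rest of the namespace when init dies.
        return "pidfd-kill init"
    case !privatePidNS && cgroupKillSupported:
        // Shared pid ns: processes may outlive init, so kill the whole cgroup.
        return "cgroup.kill"
    default:
        // Fallback for older kernels: plain kill(2) on the known pids.
        return "signal by pid"
    }
}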

Member

Taking into account this: #4517 (comment).

It seems simpler to kill pid1 if it has its own pidns. Otherwise, use cgroup.kill or, if not supported, send a signal.

Does it make sense?

Member Author

I don't think we need to use cgroup.kill to kill all processes in the cgroup if the container has its own private pidns. We only need to kill the pid1 process itself.

Member

Exactly what I said in the last comment :)


// EnsureKilled kills the container and waits for the kernel to finish killing it.
func (c *Container) EnsureKilled() error {
// When a container doesn't have a private pidns, we have to kill all processes
Member

s/doesn't/does/?

Member Author

no

Member

Opening again, sorry, but the if checks that it does have a private PID namespace. The comment says it doesn't. Something seems odd. Am I missing something?

@@ -431,6 +435,81 @@ func (c *Container) signal(s os.Signal) error {
return nil
}

func (c *Container) killViaPidfd() error {
pidfd, err := unix.PidfdOpen(c.initProcess.pid(), 0)
Member

Taking into account this: #4517 (comment).

It seems simpler to kill pid1 if it has its own pidns. Otherwise, use cgroup.kill or, if not supported, send a signal.

Does it make sense?

delete.go Outdated
Comment on lines 61 to 68
// When --force is given, we kill all container processes and
// then destroy the container. This is done even for a stopped
// container, because (in case it does not have its own PID
// namespace) there may be some leftover processes in the
// container's cgroup.
if force {
return killAndDestroy(container)
}
s, err := container.Status()
if err != nil {
return err
}
switch s {
case libcontainer.Stopped:
return container.Destroy()
// For a stopped container, because (in case it does not have
// its own PID namespace) there may be some leftover processes
// in the container's cgroup.
if !container.Config().Namespaces.IsPrivate(configs.NEWPID) {
if err := container.EnsureKilled(); err != nil {
return err
}
}
Member

I can't wrap my head around this. Before this PR, on force, we were using container.Signal(). Now, if a container is stopped and doesn't have a private pidns, we don't do anything here, and at the end we do container.Destroy().

The commit msg doesn't mention anything about this. Am I missing something?

Member Author

Yes, it's a mistake, removed now. Thanks.

@lifubang lifubang marked this pull request as draft April 8, 2025 10:27
lifubang and others added 2 commits April 11, 2025 02:59
When using unix.Kill to kill the container, we need a for loop to manually
detect whether the init process has exited; currently we sleep 100ms on each
iteration, but for stopped containers or containers running on a lightly
loaded machine we don't need to wait that long. This change reduces the
delete delay in some situations, especially for pods with many containers.

Co-authored-by: Abel Feng <[email protected]>
Signed-off-by: lifubang <[email protected]>
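
One plausible shape for such a fallback loop (an assumption for illustration; the PR's actual intervals and helpers may differ): check right away, then back off up to the old 100ms interval.

// Sketch of a kill-then-poll fallback with a growing sleep interval.
package example

import (
    "errors"
    "time"

    "golang.org/x/sys/unix"
)

func killAndWait(pid int, timeout time.Duration) error {
    if err := unix.Kill(pid, unix.SIGKILL); err != nil && err != unix.ESRCH {
        return err
    }
    deadline := time.Now().Add(timeout)
    delay := time.Millisecond // start small so already-dead processes return quickly
    for time.Now().Before(deadline) {
        // Signal 0 only probes for existence; ESRCH means the process is gone.
        // (Note: a zombie still "exists", so real code also checks process state.)
        if err := unix.Kill(pid, 0); err == unix.ESRCH {
            return nil
        }
        time.Sleep(delay)
        if delay *= 2; delay > 100*time.Millisecond {
            delay = 100 * time.Millisecond
        }
    }
    return errors.New("container init still running")
}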
@lifubang lifubang marked this pull request as ready for review April 11, 2025 07:38
return unix.EpollWait(epfd, events, msec)
})
if err != nil {
return 0, os.NewSyscallError("epollwait", err)
Member

epoll_wait returns -1 on error. Let's return -1 here too.


// EnsureKilled kills the container and waits for the kernel to finish killing it.
func (c *Container) EnsureKilled() error {
// When a container doesn't have a private pidns, we have to kill all processes
Member

Opening again, sorry, but the if checks that it does have a private PID namespace. The comment says it doesn't. Something seems odd. Am I missing something?

Comment on lines +515 to +521
if err = c.killViaPidfd(); err == nil {
return nil
}

logrus.Debugf("pidfd & epoll failed, falling back to unix.Signal: %v", err)
}
return c.kill()
Member

I can't wrap my head around this either. Why don't we use a pidfd in c.Signal() (if it's available; if not, we fall back to sending a signal to the pid number) and create a new function to wait for the process to die? Again, if a pidfd is available, we wait on that; if not, we fall back to waiting as we do now.

Having a (public/exported) Kill() and KillViaPidfd() seems like something that should be abstracted.

What we are doing now seems kind of complex:

  • If it has a private pidns, then try to kill pid1 with a pidfd
  • If it doesn't have a private pidns, then try to kill the process by sending a signal to the pid number (even if pidfd is supported, why?). The function we call here also handles the case of having a private pidns, which makes this more tricky.

If what I propose doesn't seem okay, I'm open to other ways to simplify this :)
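
As one way to picture the split being proposed (hypothetical names and a simplified wait; not code from the PR): signal delivery and waiting each pick pidfd when available and fall back on their own otherwise.

// Sketch: decouple "send the signal" from "wait for the process to die".
package example

import (
    "time"

    "golang.org/x/sys/unix"
)

// initHandle bundles a pid and, when available, a pidfd (-1 means none).
type initHandle struct {
    pid   int
    pidfd int
}

// signal uses the pidfd when we have one (immune to pid reuse), otherwise
// falls back to classic kill(2) on the pid number.
func (h initHandle) signal(sig unix.Signal) error {
    if h.pidfd >= 0 {
        return unix.PidfdSendSignal(h.pidfd, sig, nil, 0)
    }
    return unix.Kill(h.pid, sig)
}

// waitExit blocks until the process exits or the timeout elapses. With a
// pidfd we can poll the fd; without one we probe the pid periodically.
// (EINTR handling omitted for brevity.)
func (h initHandle) waitExit(timeout time.Duration) error {
    if h.pidfd >= 0 {
        fds := []unix.PollFd{{Fd: int32(h.pidfd), Events: unix.POLLIN}}
        n, err := unix.Poll(fds, int(timeout.Milliseconds()))
        if err != nil {
            return err
        }
        if n == 0 {
            return unix.ETIMEDOUT
        }
        return nil
    }
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        if err := unix.Kill(h.pid, 0); err == unix.ESRCH {
            return nil
        }
        time.Sleep(10 * time.Millisecond)
    }
    return unix.ETIMEDOUT
}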

@rata
Member

rata commented Apr 22, 2025

@lifubang if you address the comments and want another review, please comment or request a review via GitHub; that greatly helps me know when it is ready for another round. Thanks!
