Skip to content

Add creating status to ensure state.json exists when runc kill. #4645

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jianghao65536
Copy link

@jianghao65536 jianghao65536 commented Feb 26, 2025

(This is a continuation of #4535)

Fix 4534

If runc gets killed before it sets the cgroup properties during the container creation process, the cgroup ends up in a frozen state. As a result, subsequent attempts to delete the container using 'runc delete' will fail because the state.json file hasn't been created yet.

I've introduced a 'creatingstate'. It's only used by 'runc delete'. Currently, for the containers that are created, 'Stopped' is still the initial set status. I originally wanted to replace it with 'Creating', but due to the mutual reference between 'parentProcess' and 'Container', it is not easy to implement.

Test cases will be supplemented later。

@kolyshkin
Copy link
Contributor

I have yet to take a closer look but it is much better than #4535 already!

@kolyshkin
Copy link
Contributor

Please add a description into commit message.

}

func (i *creatingState) destroy() error {
_ = signalAllProcesses(i.c.cgroupManager, unix.SIGKILL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong.

  1. There are no processes in the cgroup yet, why are you killing them?
  2. We should only use signalAllProcesses for cases when the container doesn't have its own PID namespace.

Copy link
Author

@jianghao65536 jianghao65536 Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{6B191454-9517-436D-843E-F302279AF7BF}

Between the creation of cgroup and waitForChildExit, the pid of runc init[STAGE_PARENT], runc init[STAGE_CHILD], and runc init[STAGE_INIT] will exist in the cgroup.

runc delete does not know which process to send SIGKILL to. Considering the possible existence of processes like runc init PARENT, CHILD, etc., it is necessary to kill the processes in cgroup.procs one by one.

When I was contemplating about writing a function to terminate all processes in cgroup.procs, I discovered that signalAllProcesses already does exactly that, so I directly called it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right.

Maybe add a comment, something like

// Use signalAllProcesses here because various stages of `runc init` might be in the cgroup.

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for the missing comment.

@cyphar PTAL

@kolyshkin kolyshkin force-pushed the creating-state-add branch from 5b791ec to 9544dba Compare March 8, 2025 00:54
@kolyshkin
Copy link
Contributor

@rata @lifubang PTAL 🙏🏻

Copy link
Member

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Left a comment, it will be great if we can have tests too :)

@@ -242,3 +242,20 @@ func (n *loadedState) destroy() error {
}
return n.c.state.destroy()
}

type creatingState struct {
Copy link
Member

@rata rata Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This state is awfully similar to the created state. I haven't had a deep look into this, but any reason to not use the created state?

I think if we remove the c.Init assignment as this PR does, and we run the p.container.UpdateState() as this PR also does, maybe it will also fix the problem?

It seems in that case, the refreshstate will return the created state, the only difference is that the destroy function is _ = i.c.initProcess.signal(unix.SIGKILL). If that is a problem, maybe the signalAllProcesses can work for the other things in the created state too?

Am I missing something?

@rata
Copy link
Member

rata commented Apr 2, 2025

@jianghao65536 ping?

@rata
Copy link
Member

rata commented Apr 8, 2025

Test cases will be supplemented later。

Just a reminder for maintainers, once the code here looks good, we should remember to add tests before we merge this (if possible).

@HirazawaUi
Copy link

@jianghao65536 friendly ping, may I ask if you’re still working on this? This issue is currently blocking some of my work. If you don’t have the bandwidth to move this forward, I’d be happy to open a new PR to address it and include tests for it. Thank you!

@jianghao65536
Copy link
Author

Sorry for the delay, I don't have a good way to add test cases here, I really appreciate you taking over this fix. @HirazawaUi cc @kolyshkin @rata

@rata
Copy link
Member

rata commented Apr 28, 2025

@HirazawaUi sure, please go ahead

@HirazawaUi
Copy link

HirazawaUi commented Apr 28, 2025

@rata @jianghao65536 Thanks, but with China's public holidays approaching, I might implement this fix after the holidays (next week).

@kolyshkin
Copy link
Contributor

closing in favor of #4757

@kolyshkin kolyshkin closed this May 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The 'runc delete --force' command can't delete container if runc receives a SIGKILL before it can generate the state.json file.
4 participants