ChildActorExited - Manually killing child PID does not result in ChildActorExited Message to Parent #85
Comments
Thank you for the clear report with samples to check this. Unfortunately, I cannot reproduce the problem. Python versions tested: 3.7.12, 3.8.12, 3.9.6. The main difference is that I'm running a Linux system (I don't presently have access to either a Windows or a MacOS system).
Can you please try two things:
subproc.py:

from multiprocessing import *
import signal
import time
import os

def sleeper(t):
    print('Process %s sleeping for %d seconds' % (os.getpid(), t))
    time.sleep(t)
    print('Process %s awakened and exiting' % os.getpid())

def gotsig(signum, frame):
    print('Parent got signal %s' % signum)

if __name__ == "__main__":
    wait_time = 10
    signal.signal(signal.SIGCHLD, gotsig)
    p = Process(target=sleeper, args=(wait_time,))
    p.start()
    time.sleep(wait_time + 2)
    print('Exiting')

When I do this, the process never gets killed and the actor stays alive. Both OSs exhibit the same behavior where I am prevented from killing ActorSystem child actor processes without forcefully killing them. Note: In MacOS the signal number for
Results: ~/test$ python subproc.py
Process 42158 sleeping for 10 seconds
Process 42158 awakened and exiting
Parent got signal 20
Exiting
~/test$
Results: ~/test$ python subproc.py
Process 53352 sleeping for 10 seconds
~/test$ kill 53352
Parent got signal 20
Exiting
~/test$
I received the exit signal immediately after killing the process.
Results: ~/test$ python subproc.py
Process 56348 sleeping for 10 seconds
~/test$ kill -9 56348
Parent got signal 20
Exiting
~/test$
I received the exit signal immediately after killing the process. I was unable to run the I should have a Linux system up and running in the next couple days so I can get back to you with my test results on that, however it will be on an ARM Cortex processor so hopefully that doesn't throw a wrench into the mix. At least for now it seems like it should work and that this is more of a Mac and Windows specific issue.
This is certainly an odd issue... It almost feels like there may be a permissions issue or something. Are there any environment variables or permissions that I should verify on my end? Additional environment info: |
Interesting... and a bit unexpected:
There seems to be some sort of shenanigans going on with signals that is causing the problem. Normally, Thespian will watch for a child process exit by registering for a number of signals (see https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L678-L694 and https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L29-L40). Your results indicate that signals are being delivered promptly, but not necessarily the signal that was expected. It may be possible to resolve this by simply adding the right additional signal to https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L36, but I'd like to try to understand what's happening a little better first. This situation should not be affected by the hardware architecture or the Python virtual environment. |
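For readers unfamiliar with the mechanism being discussed, here is a minimal sketch of how a parent process on a POSIX system can notice a child's exit through SIGCHLD and then reap it. This is not Thespian's implementation; the names `exited`, `on_sigchld`, and `spawn_kill_and_wait` are hypothetical, and the sketch assumes `os.fork` is available and that it runs in the main thread.

```python
import os
import signal
import time

exited = {}  # pid -> raw wait status, filled in by the SIGCHLD handler


def on_sigchld(signum, frame):
    """Reap every child that has already exited, without blocking."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no child processes remain
        if pid == 0:
            return  # children exist, but none has exited yet
        exited[pid] = status


def spawn_kill_and_wait(sig=signal.SIGTERM, timeout=5.0):
    """Fork a sleeping child, kill it with `sig`, and wait for the handler to record it."""
    signal.signal(signal.SIGCHLD, on_sigchld)
    pid = os.fork()
    if pid == 0:  # child: sleep, then exit without running parent cleanup
        time.sleep(60)
        os._exit(0)
    os.kill(pid, sig)
    deadline = time.monotonic() + timeout
    while pid not in exited and time.monotonic() < deadline:
        time.sleep(0.05)
    return pid, exited.get(pid)


if __name__ == "__main__":
    pid, status = spawn_kill_and_wait()
    if status is not None and os.WIFSIGNALED(status):
        print('child %d was terminated by signal %d' % (pid, os.WTERMSIG(status)))
```

The handler loops on `os.waitpid(-1, os.WNOHANG)` because several children exiting close together may produce only one SIGCHLD delivery.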
This is showing SIGCHLD on signal number 20. Here is the output of the requested command from your first point:
~/test$ python -c 'import signal; print(repr(signal.SIGCHLD))'
<Signals.SIGCHLD: 20>
I'm looking through the code you linked and can't see why the exit signal isn't being caught, since you're referencing it by name and not number. I modified your test code in an attempt to view what signals are being received by the killed process.

from multiprocessing import *
import signal
import time
import os

def sleeper(t):
    print('Process %s sleeping for %d seconds' % (os.getpid(), t))
    time.sleep(t)
    print('Process %s awakened and exiting' % os.getpid())

def gotsig(signum, frame):
    print('Parent got signal %s' % signum)

if __name__ == '__main__':
    for i in ['SIGTERM', 'SIGKILL', 'SIGQUIT', 'SIGABRT', 'SIGCHLD']:
        try:
            signum = getattr(signal, i)
            signal.signal(signum, gotsig)
            print(f'{i}: {signum}')
        except (OSError, RuntimeError) as m:  # os error
            print(f'Skipping {i}: {m}')
    wait_time = 10
    p = Process(target=sleeper, args=(wait_time,))
    p.start()
    time.sleep(wait_time+2)
    print('Exiting')

By running this code, I get the following output.
Regardless of which signal I am killing the process with (SIGKILL: 9, SIGTERM: 15, SIGQUIT: 3, etc.) I get the same result signal in the output
This explains the invalid argument error for SIGKILL in my outputs. Please let me know if there is any other info that may help. |
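A note on why the parent reports the same signal every time: the handlers above run in the parent, which is notified of a child's death by SIGCHLD regardless of what killed the child. The signal that actually terminated the child is carried in its exit status instead. The sketch below illustrates this with the standard `subprocess` module rather than `multiprocessing` (a substitution made only to keep the example short); `kill_child_and_get_returncode` is a hypothetical helper, and the behavior shown assumes a POSIX system, where `Popen.returncode` is `-N` for a child terminated by signal `N`.

```python
import signal
import subprocess
import sys


def kill_child_and_get_returncode(sig):
    """Start a sleeping child interpreter, send it `sig`, and return its exit code."""
    child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(30)'])
    child.send_signal(sig)
    child.wait(timeout=10)
    # On POSIX, a negative return code -N means "terminated by signal N".
    return child.returncode


if __name__ == '__main__':
    for sig in (signal.SIGTERM, signal.SIGKILL):
        rc = kill_child_and_get_returncode(sig)
        print('%s (%d) -> returncode %d' % (sig.name, sig, rc))
```

With `multiprocessing`, the equivalent information is available from `Process.exitcode` after `join()`.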
Looks like only some of the signal values are defined by POSIX and the others can be remapped by different OS vendors (see https://en.wikipedia.org/wiki/Signal_(IPC) and search for "Portable number"), which is what accounts for the different numbers. Thanks for trying the explicit
I appreciate your patience in working through this with me. When you have a chance, the next set of tests is:
That will give me some internal Thespian log information that should hopefully reveal what's different about running in your environment. |
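Regarding the OS-dependent signal numbers mentioned above, a small sketch for comparing them across machines follows. It looks signals up by name in the standard `signal` module and reports the local number; `signal_numbers` is a hypothetical helper, and names that the platform does not define are reported as `None`.

```python
import signal


def signal_numbers(names):
    """Map each signal name to its number on this platform, or None if undefined."""
    result = {}
    for name in names:
        sig = getattr(signal, name, None)
        result[name] = int(sig) if sig is not None else None
    return result


if __name__ == '__main__':
    names = ['SIGTERM', 'SIGKILL', 'SIGQUIT', 'SIGABRT', 'SIGCHLD']
    for name, number in signal_numbers(names).items():
        print(name, number)
```

Running this on each system makes it easy to see which numbers agree (the POSIX portable ones such as SIGKILL and SIGTERM) and which differ, such as SIGCHLD.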
I've attached a few different logs here. All were generated by running the test code originally linked at the beginning of this issue. kill_minus_15_SIGTERM.log kill_minus_9_SIGKILL.log I'm noticing in the
whereas no such line exists in the I also tested using kill_minus_17_SIGSTOP.log
I think this is due to |
Quick update. On my Debian system I receive the ChildActorExited message as anticipated. |
Thanks for the info and update. The log files are not showing some events I'm expecting to see. I'm looking into getting access to a MacOS environment to see if I can do some local testing/reproduction. |
If there's something in particular you're looking for, I could do a bit of digging to get some more detailed info. Thanks again for your help. :) |
Code
Below is example code which has reproduced this issue for me.
Project structure
start.py
stop.py
parent.py
child.py
Procedure
start.py
~/Documents/thespian/test$ python stop.py
Problem
From the procedure above, upon killing the child process in step 3, it is my understanding that the parent actor should immediately receive a ChildActorExited message and my example program should print out ChildActorExited to the terminal; this, however, does not happen. Instead, the remaining parent actor stays alive and reports nothing.
I have tested this same functionality on MacOS and Windows with the same results. I also tried using multiprocUDPBase but again got the same results.
Another thing to note is that after killing the child process and running stop.py, the actor system takes a bit longer than usual to shut down; however, it does not print any additional information.
Environment
thespian
If you need any additional information, please let me know.