nomad kills random processes on restart while restoring old processes #17960

Closed
mmpataki opened this issue Jul 17, 2023 · 6 comments

@mmpataki

Nomad version

Output from nomad version

# /home/informatica/ics/nomad/nomad version
Nomad v0.12.4 (8efaee4ba5e9727ab323aaba2ac91c2d7b572d84)

Operating system and Environment details

# uname -a
Linux ink04366106 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Issue

  • A user starts a job in nomad that runs as a single OS process. Call its PID AAA.
  • The nomad process goes down for some reason and is restarted.
  • The job completes before nomad comes back up.
  • nomad starts and tries to kill the process with PID AAA.
  • If PID AAA has since been reassigned to another process, that unrelated process gets killed.

Reproduction steps

This is not an exact reproduction, but the issue can be observed as follows:

  • Start killsnoop, filtered to the nomad process name:
./killsnoop -n nomad
  • Run a nomad job
  • Kill the nomad process first and then the job process. The script below does both without a delay (in my case the job process was called GroupExecutor). It also prints the PID of the job process; let's say it is AAA.
kill -9 `ps -ef|grep 'nomad agen[t]'|awk '{print $2}'` && ps -ef|grep 'GroupE[x]' && kill -9 `ps -ef|grep 'GroupE[x]'|awk '{print $2}'`
  • Note the killsnoop output: nomad (PID 16276) issues a SIGKILL (signal 9) to AAA, the job's PID (16286).
nomad            16276  16286    9          18446744073709551616
  • If another process had been allocated PID 16286 in the meantime, it would have received the signal.

Expected Result

  • nomad should not send the signal to a process that it did not start.
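
One common way to tell "the process I started" apart from "whatever now owns that PID" is to record the process start time alongside the PID and compare both before signalling. The sketch below is only an illustration of that idea on Linux, not how nomad works internally; the function names `start_time_ticks` and `same_process` are made up for this example. It reads field 22 (`starttime`) of `/proc/<pid>/stat`.

```python
import os


def start_time_ticks(pid):
    """Return the start time (clock ticks since boot) of a Linux process.

    Reads field 22 of /proc/<pid>/stat. The command name in field 2 may
    contain spaces or parentheses, so split only after the last ')'.
    """
    with open(f"/proc/{pid}/stat", "rb") as f:
        data = f.read()
    # After ") " the first token is field 3 (state), so field 22 is index 19.
    rest = data[data.rindex(b")") + 2:].split()
    return int(rest[19])


def same_process(pid, recorded_start):
    """True only if `pid` exists and still has the recorded start time."""
    try:
        return start_time_ticks(pid) == recorded_start
    except (FileNotFoundError, ProcessLookupError):
        # The PID is gone entirely: nothing to signal.
        return False


if __name__ == "__main__":
    me = os.getpid()
    recorded = start_time_ticks(me)  # would be persisted next to the PID
    print(same_process(me, recorded))       # identity still matches
    print(same_process(me, recorded + 1))   # simulated PID reuse
```

A supervisor that persisted `(pid, start time)` when launching the task could skip the kill whenever `same_process` returns false after a restart. There is still a small window between the check and the signal, so this narrows the race rather than removing it.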

Actual Result

  • nomad sends the SIGKILL anyway.

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@mmpataki
Author

A simple solution would be to add an environment variable to the child processes, which nomad could check before killing a process.
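
A rough sketch of that check on Linux, to make the suggestion concrete. Everything named here is hypothetical: the variable `NOMAD_TASK_MARKER`, the helper names, and the idea of using an allocation-specific value are assumptions for illustration, not anything nomad currently does.

```python
import os
import signal

# Hypothetical marker variable; not an actual nomad environment variable.
MARKER = "NOMAD_TASK_MARKER"


def read_environ(pid):
    """Parse /proc/<pid>/environ (NUL-separated KEY=VALUE entries)."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        raw = f.read()
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def kill_if_ours(pid, expected, sig=signal.SIGKILL):
    """Send `sig` to `pid` only if its environment carries our marker value.

    Returns True if the signal was sent, False if the process is gone,
    unreadable, or does not carry the expected marker.
    """
    try:
        env = read_environ(pid)
    except (FileNotFoundError, ProcessLookupError, PermissionError):
        return False
    if env.get(MARKER) != expected:
        return False
    os.kill(pid, sig)
    return True
```

For example, a task started with `NOMAD_TASK_MARKER=alloc-1` would be killed by `kill_if_ours(pid, "alloc-1")`, while an unrelated process that inherited the PID would be left alone. Note that `/proc/<pid>/environ` shows the environment the process was started with, and that a gap remains between the check and the `os.kill` call, so this reduces the risk rather than eliminating it.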

@shoenig
Member

shoenig commented Jul 17, 2023

Hi @mmpataki, note that Nomad v0.12.4 is a very old release from September 2020. If you can reproduce with a supported version (e.g. v1.6-RC, 1.5.x, or 1.4.x), that would be worth investigating.

@schmichael
Member

I tested with v1.6.0 and could not reproduce. I have vague memories of this bug being fixed but a quick skim of the changelog didn't jog my memory.

Please reopen if you can reproduce! This definitely shouldn't happen (the executor should stick around across client restarts and allow the client to reattach and learn about the completed process).

@yigit-erkoca

Looks like this was reproduced here too: #23969

@yigit-erkoca

> I tested with v1.6.0 and could not reproduce. I have vague memories of this bug being fixed but a quick skim of the changelog didn't jog my memory.
>
> Please reopen if you can reproduce! This definitely shouldn't happen (the executor should stick around across client restarts and allow the client to reattach and learn about the completed process).

Could this be reopened? I reproduced the issue here: #23969

@jrasell
Member

jrasell commented Sep 17, 2024

Hi @yigit-erkoca; we will keep the newer issue which is linked to this one. There is no reason to have two issues open for the same bug.
