Windows -> Raw exec = Broken with 1.8.2 #23668
Hi @guifran001! I can't really figure out why your postgres process exited, but it looks like the shutdown of the executor is the issue here. When the

This error is coming from:

```go
func findProcess(pid int) (p *Process, err error) {
	const da = syscall.STANDARD_RIGHTS_READ |
		syscall.PROCESS_QUERY_INFORMATION | syscall.SYNCHRONIZE
	h, e := syscall.OpenProcess(da, false, uint32(pid))
	if e != nil {
		return nil, NewSyscallError("OpenProcess", e)
	}
	return newProcess(pid, uintptr(h)), nil
}
```

Is there any possibility that your task daemonizes? That is, does it double fork and then close the first forking process to divorce the child from the grandparent (the Nomad executor in this case)? If not, then I'm not sure what's up... you're running Nomad as a privileged user, right?
That's an error trying to access Linux cgroups, so highly unlikely to be related.
@tgross: thank you for the reply. Our test is not starting Nomad as a privileged user but only as the We haven't taken the time to properly investigate yet, but we'll give it a try with a privileged user. From

So is it more likely than the
I've found that Windows returns I don't have an answer yet, but that gives me a place to start looking at least.
I have been experiencing a very similar issue to @guifran001, so I will see if my situation can help shed any light on this issue.

Nomad version

Operating system and Environment details
Windows

Issue
I run long-lived Nomad jobs that can live for days at a time, as they are service jobs. These jobs are designed to only have one instance running at once. I updated my Nomad clients from v1.6.7 to v1.8.2. The upgrade involved draining the clients, upgrading Nomad, and then setting the client back to eligible, one client at a time. My Nomad servers are running v1.6.7 and haven't been upgraded for a while.

For the first 3 hours after the upgrade everything was running smoothly with no issues. Approximately 3 hours after upgrading my Nomad clients to v1.8.2, I noticed that all of our allocations on one of the Nomad clients had been restarted, which prompted me to investigate further. During my investigation I realised that the original allocations were still running and the restart had essentially duplicated all the allocations on that client. When viewing running processes in Task Manager it was obvious that each allocation now had two instances running. I checked the Nomad UI and only the newest instance of each allocation was being displayed.

Here is what I think happened based on the logs:
Here is how I fixed it:
This fixed the problem for now, but if it recurs it will cause me serious issues.

Nomad Client logs
Logs from the time of the upgrade:
Logs from the time that the problems started:
Hi @PayStreamArmitsteadJ. What you're seeing isn't identical, but it sure smells similar. I'm not sure where you're getting this idea from:
The executor isn't going to be making any HTTP requests or outbound connections of any kind. The executor is a go-plugin server: the Nomad client agent spawns the executor process and then connects to it via a unix domain socket (or, on Windows, a local TCP connection). When the client restarts (e.g. after an upgrade), it reads its on-disk state to find tasks that should still be running, and then re-connects to the TCP server it expects to find there. If it can't reconnect, you'll see an error like the one you saw here:
At that point, the client says "ok, this task must've failed while I was offline." That's just something Nomad has to expect could happen, so it restarts the task (or not) according to the policy in the job's `restart` block. But if the executor is missing, there's no way for the client to know whether the task process is still running or not. We're supposed to always kill the child processes when the executor dies, for any reason. And then that's compounded by what you're seeing here:
In this case, it looks like the executor's TCP server is unavailable. I suspect the executor process is gone already, so we're just trying to send a message over a connection where nobody's listening anymore. Which is more evidence that the executor processes are getting killed unexpectedly. A question for both @guifran001 and @PayStreamArmitsteadJ: when these task processes are left running, are their executor processes still running as well, or are the executor processes also gone?
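To make the reattach mechanism described above a bit more concrete, here is a rough sketch of how a host process reattaches to an already-running go-plugin server over local TCP. This is illustrative only, not Nomad's actual code; the handshake values, address, and PID are made-up placeholders.

```go
package main

import (
	"log"
	"net"

	"github.com/hashicorp/go-plugin"
)

func main() {
	// The address and PID would normally come from state the host wrote to
	// disk when it first launched the plugin; these values are placeholders.
	addr := &net.TCPAddr{IP: net.ParseIP("127.0.0.1"), Port: 14000}

	client := plugin.NewClient(&plugin.ClientConfig{
		HandshakeConfig: plugin.HandshakeConfig{
			ProtocolVersion:  1,
			MagicCookieKey:   "EXAMPLE_PLUGIN", // placeholder
			MagicCookieValue: "example",        // placeholder
		},
		// A real host registers its plugin implementations here.
		Plugins:          map[string]plugin.Plugin{},
		AllowedProtocols: []plugin.Protocol{plugin.ProtocolGRPC},
		// Reattach tells go-plugin to connect to an existing server instead
		// of spawning a new child process.
		Reattach: &plugin.ReattachConfig{
			Protocol: plugin.ProtocolGRPC,
			Addr:     addr,
			Pid:      6000, // PID recorded when the plugin was launched
		},
	})
	defer client.Kill()

	// If nothing is listening at addr anymore (the server process died),
	// this connection attempt fails and the host can only assume the
	// plugin, and whatever it was supervising, is gone.
	if _, err := client.Client(); err != nil {
		log.Printf("could not reattach to plugin: %v", err)
	}
}
```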
I have only experienced this issue once and it was affecting our production cluster, so I had to remedy the situation right away. Therefore, I do not know if the executor processes were still running. It seems like this was a one-off issue caused by the upgrade.
Ok, that's information in itself. Thanks.
Hello everybody,
There was an old instance already running. After stopping the process, it was not possible to start it again in a normal way; only restarting the whole machine was a temporary workaround. Nomad's log files for today are broken, so at this point I cannot provide them; I will monitor the situation and deliver logs afterwards. @tgross I have this issue multiple times per week; old instances are running on the servers with version 1.8.2. How can I check whether the executor process is still running?
I've encountered the same issue on Windows 10 and 11.
I've updated this issue to make sure it gets some attention on our internal tracker. I also have some suspicion that this may be related to #23969.
I was able to reproduce this issue on Nomad 1.8.1 and 1.8.4 on Windows 11 with a minimal batch job.
Failure logs:
Once I downgraded to 1.7.7 the issue went away. Success on 1.7.7:
@CullenShane I was hoping your reproduction would be the clue I needed, but I think it's possibly related but not quite the same. It looks like we're simply returning an error spuriously. I ran that job and used Process Explorer to look at the process tree. The task was PID 6000, and the

```
{"@level":"debug","@message":"plugin address","@timestamp":"2024-10-14T15:59:21.207440Z","address":"127.0.0.1:14000","network":"tcp"}
{"@level":"debug","@message":"shutdown requested","@module":"executor","@timestamp":"2024-10-14T16:00:21.488858Z","grace_period_ms":0,"signal":""}
{"@level":"warn","@message":"failed to shutdown due to inability to find process","@module":"executor","@timestamp":"2024-10-14T16:00:21.488858Z","error":"executor failed to find process: OpenProcess: The parameter is incorrect.","pid":6000}
```

The client logs report this as:

At the point we're calling

In the reports above the executor fails (the parent of the task itself), and then the task isn't stopping as it should. For folks who have seen this, it would be helpful if you could get the

(If this were the bug, it would be the exact opposite of bug #11958.)
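If it helps anyone gather that information, here is a small, hypothetical probe (not a Nomad tool) that reports what Windows says about a given PID, such as the executor PID from the executor logs above. `OpenProcess` typically fails with "The parameter is incorrect" (ERROR_INVALID_PARAMETER) once the PID no longer exists.

```go
//go:build windows

package main

import (
	"fmt"
	"os"
	"strconv"

	"golang.org/x/sys/windows"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: checkpid <pid>")
		os.Exit(1)
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "usage: checkpid <pid>")
		os.Exit(1)
	}

	// Ask for the minimal access right needed to query the process.
	h, err := windows.OpenProcess(windows.PROCESS_QUERY_LIMITED_INFORMATION, false, uint32(pid))
	if err != nil {
		fmt.Printf("OpenProcess(%d) failed: %v\n", pid, err)
		return
	}
	defer windows.CloseHandle(h)
	fmt.Printf("PID %d exists and is accessible\n", pid)
}
```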
Ok, I tried some more to reproduce this, and I want to make sure I've documented here a specific bug where the proximate cause is the executor process failing, so that folks can report whether this is what they're seeing. I'm running Nomad as a Windows service and deploying a web application:

```hcl
job "example" {
  task "task" {
    driver = "raw_exec"
    config {
      command = "C:/Users/Administrator/echo/echo.exe"
    }
  }
}
```

Killing the

Instead, let's look at what happens if we kill the executor. Before doing so, using Process Explorer I can see my process tree looks like this:
Our client logs look like the following. Note this tells us where to find the
Now kill the
At this point, we can see in Process Explorer that the
The {"@level":"debug","@message":"plugin address","@timestamp":"2024-10-14T18:33:59.452893Z","address":"127.0.0.1:14000","network":"tcp"}
{"@level":"debug","@message":"plugin address","@timestamp":"2024-10-14T18:42:15.932447Z","address":"127.0.0.1:14000","network":"tcp"}
{"@level":"debug","@message":"shutdown requested","@module":"executor","@timestamp":"2024-10-14T18:42:16.012855Z","grace_period_ms":0,"signal":""}
{"@level":"warn","@message":"failed to shutdown due to inability to find process","@module":"executor","@timestamp":"2024-10-14T18:42:16.012940Z","error":"executor failed to find process: OpenProcess: The parameter is incorrect.","pid":5580} Note that there are no logs in the
So my hypothesis is:
This makes for 3 bugs in a trenchcoat:
If folks have any information from their environments that contradicts that hypothesis, I'd appreciate you sharing it, because that would be helpful to know.
I've got a fix up here for the executor orphaning the child processes: #24214. I'll open a new issue for improving the UX of the error messages.
On Windows, if the `raw_exec` driver's executor exits, the child processes are not also killed. Create a Windows "job object" (not to be confused with a Nomad job) and add the executor to it. Child processes of the executor will inherit the job automatically. When the handle to the job object is freed (on executor exit), the job itself is destroyed and this causes all processes in that job to exit.

Fixes: #23668
Ref: https://learn.microsoft.com/en-us/windows/win32/procthread/job-objects
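For illustration, here is a rough sketch of the job-object mechanism described above using `golang.org/x/sys/windows`. It is not the actual patch, just the general shape: create a kill-on-close job object and add the current process (the executor) to it.

```go
//go:build windows

package main

import (
	"log"
	"unsafe"

	"golang.org/x/sys/windows"
)

func joinKillOnCloseJob() (windows.Handle, error) {
	// Create an anonymous job object.
	job, err := windows.CreateJobObject(nil, nil)
	if err != nil {
		return 0, err
	}

	// Ask Windows to terminate every process in the job when the last
	// handle to the job object is closed.
	info := windows.JOBOBJECT_EXTENDED_LIMIT_INFORMATION{
		BasicLimitInformation: windows.JOBOBJECT_BASIC_LIMIT_INFORMATION{
			LimitFlags: windows.JOB_OBJECT_LIMIT_KILL_ON_CLOSE,
		},
	}
	if _, err := windows.SetInformationJobObject(
		job,
		windows.JobObjectExtendedLimitInformation,
		uintptr(unsafe.Pointer(&info)),
		uint32(unsafe.Sizeof(info)),
	); err != nil {
		windows.CloseHandle(job)
		return 0, err
	}

	// Add the current process; children spawned later inherit the job.
	if err := windows.AssignProcessToJobObject(job, windows.CurrentProcess()); err != nil {
		windows.CloseHandle(job)
		return 0, err
	}
	return job, nil
}

func main() {
	job, err := joinKillOnCloseJob()
	if err != nil {
		log.Fatalf("setting up job object: %v", err)
	}
	// Keep the handle for the life of the process; closing it (or exiting)
	// tears down every process that joined the job.
	defer windows.CloseHandle(job)

	// ... spawn and supervise task processes here ...
}
```

Because child processes inherit job membership by default, any task the executor spawns dies with it once the job handle goes away, which is exactly the behavior the fix relies on.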
The fix for the child process being orphaned will ship in Nomad 1.9.1 (with Enterprise backports). The executor failure itself is likely fixed by #24182, and that's why we're seeing it suddenly in 1.8.x even though the underlying bug here is quite old. (And, as it turns out, it exists on unixish systems too.)
Nomad version
Nomad v1.8.2
BuildDate 2024-07-16T08:50:09Z
Revision 7f0822c
Operating system and Environment details
Windows
driver: raw_exec
Issue
Jobs are stopping unexpectedly in Nomad, but the processes are still alive in Windows, so when Nomad tries to restart them, it fails because the application's port is already bound or a lock file is already held.
It might not be 100% reproducible: we have an automated test that deploys a Nomad instance and starts some jobs; it never failed on 1.7.3, but it has been failing since our update to 1.8.2. The test has succeeded once with 1.8.2.
The automated test deploys Consul 1.17.2 as both server and client, then Nomad 1.8.2 as both server and client, then starts a Postgres job and a Keycloak job.
From the logs, we can see that Postgres and Keycloak start, then Postgres stops and Keycloak stops with exit code 2. Then they are stuck in a restart loop, as the original Postgres and Keycloak processes are still executing.
From the log, we can see the exit code 2 that is usually reserved for "file not found" errors. Could it be related to #23595?
Reproduction steps
About 80% of the time. The test uses Pulumi to run the jobs; each job is run sequentially.
Expected Result
Postgres and Keycloak run fine.
Actual Result
Postgres and Keycloak are stuck in a restart loop: a detached Postgres process and a detached Keycloak process originally started by Nomad are still running while Nomad tries to start new ones.
Job file (if appropriate)
Will be provided if needed.
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)