Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows -> Raw exec = Broken with 1.8.2 #23668

Closed
guifran001 opened this issue Jul 22, 2024 · 16 comments · Fixed by #24214
Closed

Windows -> Raw exec = Broken with 1.8.2 #23668

guifran001 opened this issue Jul 22, 2024 · 16 comments · Fixed by #24214

Comments

@guifran001
Copy link
Contributor

guifran001 commented Jul 22, 2024

Nomad version

Nomad v1.8.2
BuildDate 2024-07-16T08:50:09Z
Revision 7f0822c

Operating system and Environment details

Windows
driver: raw_exec

Issue

Jobs are stopping unexpectedly in nomad, but the processes are still alive in Windows, so when nomad tries to restart them, it fails because the port of application is already bound or a lock file is already lock.

It might not be 100% reproductible as we have an automated test that deploys a nomad instance and starts some jobs that never failed in 1.7.3, but now is failing since our update to 1.8.2. The test has been successful once with 1.8.2.

The automated test deploy consul 1.17.2 as both server and client, then nomad 1.8.2 as both server and client, then starts a postgres job and a keycloak job.

From the logs, we can see that postgres and keycloak start, then postgres stops and keycloak stops with Exit Code 2. Then they are stuck in a restart loop as the orignial postgres and keycloak processes are still executing.

From the log, we can see the Exit Code 2 that is usually reserved to file not found error. Could it be related to #23595 ?

Reproduction steps

80%. The test is using pulumi to run the jobs. Each job is run sequentially

Expected Result

postgres and keycloak run fine

Actual Result

postgres and keycloak are stuck in a restart loop while a detached process of postgres and one from keycloak started by nomad are still running while nomad try to start new ones.

Job file (if appropriate)

Will be provided if needed.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

POSTGRES Jobs is scheduled: 
2024-07-19T17:47:48.093Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres type=Received msg="Task received by client" failed=false
2024-07-19T17:47:48.110Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres type="Task Setup" msg="Building Task Directory" failed=false
2024-07-19T17:47:48.353Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres path=c:\data\nomad-1.8.2\alloc\79de797f-3e61-54a6-ce48-2affc507d636\postgres\secrets\api.sock error="listen unix c:\data\nomad-1.8.2\alloc\79de797f-3e61-54a6-ce48-2affc507d636\postgres\secrets\api.sock: bind: invalid argument"
POSTGRES task starts
2024-07-19T17:47:48.385Z [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:C:/install/00Postgres-v16.1.13-update-manifest-versions.1/pgsql/bin/postgres.exe Args:[-p 5432] OverrideCgroupV2: OverrideCgroupV1:map[] OOMScoreAdj:0}"
2024-07-19T17:47:48.597Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres type=Started msg="Task started by client" failed=false

KEYCLOAK Jobs is scheduled
2024-07-19T17:49:04.930Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak type=Received msg="Task received by client" failed=false
2024-07-19T17:49:04.931Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak type=Received msg="Task received by client" failed=false
2024-07-19T17:49:04.948Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak type="Task Setup" msg="Building Task Directory" failed=false
2024-07-19T17:49:05.158Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak path=c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\compile-keycloak\secrets\api.sock error="listen unix c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\compile-keycloak\secrets\api.sock: bind: invalid argument"
KEYCLOAK-COMPILE task starts (lifecycle {hook = "prestart"sidecar = false}
2024-07-19T17:49:05.177Z [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:C:/install/openjdk-v21.0.2/bin/java.exe Args:[-Dkc.config.build-and-exit=true -Dprogram.name=keycloak -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.err.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Dstderr.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:FlightRecorderOptions=stackdepth=512 --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.security=ALL-UNNAMED -Dkc.home.dir=C:/install/keycloak-v23.0.6 -Djboss.server.config.dir=c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\compile-keycloak\local/conf -Dkeycloak.theme.dir=C:/install/keycloak-v23.0.6/themes -Djava.util.logging.manager=org.jboss.logmanager.LogManager -Dquarkus-log-max-startup-records=10000 -Dquarkus.datasource.jdbc.acquisition-timeout=2000 -cp C:/install/keycloak-v23.0.6/lib/quarkus-run.jar;C:/install/keycloak-v23.0.6/lib/**/* io.quarkus.bootstrap.runner.QuarkusEntryPoint build --db=postgres] OverrideCgroupV2: OverrideCgroupV1:map[] OOMScoreAdj:0}"
2024-07-19T17:49:05.404Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak type=Started msg="Task started by client" failed=false
KEYCLOAK-COMPILE task ends
2024-07-19T17:49:23.430Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak type=Terminated msg="Exit Code: 0" failed=false
2024-07-19T17:49:23.438Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unknown desc = executor failed to find process: OpenProcess: The parameter is incorrect."
2024-07-19T17:49:23.464Z [INFO]  client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 driver=raw_exec task_name=compile-keycloak plugin=C:\install\nomad-1.8.2\bin\nomad.exe id=4688
2024-07-19T17:49:23.465Z [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak reason="Restart unnecessary as task terminated successfully"
2024-07-19T17:49:23.479Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak type="Task Setup" msg="Building Task Directory" failed=false
2024-07-19T17:49:23.626Z [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak path=c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\keycloak\secrets\api.sock error="listen unix c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\keycloak\secrets\api.sock: bind: invalid argument"
KEYCLOAK task starts
2024-07-19T17:49:23.642Z [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:C:/install/openjdk-v21.0.2/bin/java.exe Args:[-Dkc.config.built=true -Dprogram.name=keycloak -Xms64m -Xmx512m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.err.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Dstderr.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:FlightRecorderOptions=stackdepth=512 --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.security=ALL-UNNAMED -Dkc.home.dir=C:/install/keycloak-v23.0.6 -Djboss.server.config.dir=c:\data\nomad-1.8.2\alloc\876ad764-d20b-600a-adcb-8f22186401f7\keycloak\local/conf -Dkeycloak.theme.dir=C:/install/keycloak-v23.0.6/themes -Djava.util.logging.manager=org.jboss.logmanager.LogManager -Dquarkus-log-max-startup-records=10000 -Dquarkus.datasource.jdbc.acquisition-timeout=2000 -cp C:/install/keycloak-v23.0.6/lib/quarkus-run.jar;C:/install/keycloak-v23.0.6/lib/**/* io.quarkus.bootstrap.runner.QuarkusEntryPoint start --optimized --db=postgres --db-url jdbc:postgresql://127.0.0.1:5432/keycloak?user=iac&password=xxx OverrideCgroupV2: OverrideCgroupV1:map[] OOMScoreAdj:0}"
2024-07-19T17:49:23.761Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak type=Started msg="Task started by client" failed=false

!!! POSTGRES stops with Exit code 0 ??? 
2024-07-19T17:50:51.551Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres type=Terminated msg="Exit Code: 0, Exit Message: \"executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:50071->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host.\"" failed=false
2024-07-19T17:50:51.569Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14000: connectex: No connection could be made because the target machine actively refused it.\""

!!! KEYCLOAK stops with Exit code 0 ???
2024-07-19T17:50:51.589Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak type=Terminated msg="Exit Code: 0, Exit Message: \"executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:50272->127.0.0.1:14001: wsarecv: An existing connection was forcibly closed by the remote host.\"" failed=false
2024-07-19T17:50:51.619Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14001: connectex: No connection could be made because the target machine actively refused it.\""

!!! Executor de POSTGRES fails with Exit code 2 ??? (does 2 mean "The system cannot find the file specified"?)
2024-07-19T17:50:51.714Z [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 driver=raw_exec task_name=postgres plugin=C:\install\nomad-1.8.2\bin\nomad.exe id=2184 error="exit status 2"
2024-07-19T17:50:51.714Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres reason="Restart within policy" delay=18.647683748s
2024-07-19T17:50:51.715Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=79de797f-3e61-54a6-ce48-2affc507d636 task=postgres type=Restarting msg="Task restarting in 18.647683748s" failed=false

Executor de KEYCLOAK fails with Exit code 2 ??? (does 2 mean "The system cannot find the file specified"?)
2024-07-19T17:50:51.783Z [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 driver=raw_exec task_name=keycloak plugin=C:\install\nomad-1.8.2\bin\nomad.exe id=1568 error="exit status 2"
2024-07-19T17:50:51.783Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=keycloak reason="Restart within policy" delay=34.485843068s
@tgross
Copy link
Member

tgross commented Aug 2, 2024

Hi @guifran001! I can't really figure out why your postgres process exited, but it looks like the shutdown of the executor is the issue here. When the compile-keycloak pre-start task finishes, we see the following logs:

2024-07-19T17:49:23.430Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 task=compile-keycloak type=Terminated msg="Exit Code: 0" failed=false
2024-07-19T17:49:23.438Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unknown desc = executor failed to find process: OpenProcess: The parameter is incorrect."
2024-07-19T17:49:23.464Z [INFO] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=876ad764-d20b-600a-adcb-8f22186401f7 driver=raw_exec task_name=compile-keycloak plugin=C:\install\nomad-1.8.2\bin\nomad.exe id=4688

This error is coming from executor.go#L612. To hit this code at all, the executor still has the process PID it's kept around in memory, but when we call os.FindProcess that hits this code in the Go stdlib, which is just a Windows syscall:

func findProcess(pid int) (p *Process, err error) {
	const da = syscall.STANDARD_RIGHTS_READ |
		syscall.PROCESS_QUERY_INFORMATION | syscall.SYNCHRONIZE
	h, e := syscall.OpenProcess(da, false, uint32(pid))
	if e != nil {
		return nil, NewSyscallError("OpenProcess", e)
	}
	return newProcess(pid, uintptr(h)), nil
}

Is there any possibility that your task daemonizes? That is, does it double fork and then closes the first forking process to divorce the child from the grandparent (the Nomad executor in this case)? If not, then I'm not sure what's up... you're running Nomad as a privileged user, right?

Could it be related to #23595 ?

That's an error trying to access Linux cgroups, so highly unlikely to be related.

@guifran001
Copy link
Contributor Author

@tgross : thank you for the reply.

Our test is not starting Nomad as a privileged user but only as the Network Service User.

We haven't taken the time to properly investigate yet. But we'll give it a try with a privileged user.

From OpenProcess function (processthreadsapi.h) doc:

If the specified process is the System Idle Process (0x00000000), the function fails and the last error code is ERROR_INVALID_PARAMETER. If the specified process is the System process or one of the Client Server Run-Time Subsystem (CSRSS) processes, this function fails and the last error code is ERROR_ACCESS_DENIED because their access restrictions prevent user-level code from opening them.

So it it more likely than the pid is 0. From the code, I cannot figure out how this would happen and what's different from version 1.7.x about that.

@tgross
Copy link
Member

tgross commented Aug 2, 2024

I've found that Windows returns ERROR_INVALID_PARAMETER at the drop of a hat, but that's a good catch. The added bit of context is the sometimes-unfortunate decision that Go made to have "zero values" for uninitialized fields, even if they're value-fields. And guess what the zero-value of an PID (integer) is? So it could be that we're somehow no longer capturing the PID from the child process and so when we try to shutdown the executor we're passing it that uninitialized field.

I don't have an answer yet, but that gives me a place to start looking at least.

@PayStreamArmitsteadJ
Copy link

@tgross

I have been experiencing a very similar issue to @guifran001 so I will see if my situation can help shed any light on this issue.

Nomad version
Nomad v1.8.2
BuildDate 2024-07-16T08:50:09Z
Revision 7f0822c

Operating system and Environment details

Windows
driver: raw_exec

Issue

I run long lived Nomad jobs that can live for days at a time as they are service jobs. These jobs are designed to only have one instance running at once.

I updated my Nomad clients from v1.6.7 to v1.8.2. The upgrade involved draining the clients, upgrading nomad and then setting the client to eligible. This was done one client at a time. My Nomad servers are running v1.6.7 and haven't been upgraded for a while.

For the first 3 hours after the upgrade everything was running smoothly with no issues. Approximately 3 hours after upgrading my Nomad clients to v1.8.2 I noticed that all of our allocations on one of the Nomad clients had been restarted and this prompted me to investigate further. During my investigation I realised that the original allocations were still running and the restart had essentially duplicated all the allocations on that client. When viewing running processes in task manager it was obvious that each allocation now had two instances running. I checked the Nomad UI and only the newest instance of each allocation was being displayed. Here is what I think happened based on the logs:

  1. The executor failed due to not being able to make a HTTP call to a local HTTP server.
  2. The executor was then supposed to be destroyed but the destroy failed.
  3. This caused the allocations already running on the client to become "forgotten" and Nomad no longer knows about them.
  4. Since Nomad thinks no allocations are running it attempts to start (restart) all the allocations.
  5. The old allocations are still running as well as the new ones hence why there are duplicate allocations

Here is how I fixed it:

  1. I drained the problematic Nomad client via the Nomad UI to move all the new allocations to another Nomad client.
  2. Once the drain was complete we manually killed all the "forgotten" allocations directly on the Nomad client via Task Manager.
  3. We then set the client to be eligible again.

This fixed the problem for now but if the problem reoccurs it will cause me serious issues.

Nomad Client logs

Logs from the time of the upgrade:

2024-08-05T15:08:47.679+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5391c4b9-73e7-6943-8459-620ce812c13f task=task type=Killed msg="Task successfully killed" failed=false
2024-08-05T15:08:47.692+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=5391c4b9-73e7-6943-8459-620ce812c13f task=task plugin=E:\nomad\nomad.exe id=17564
2024-08-05T15:08:47.695+0100 [INFO]  client.gc: marking allocation for GC: alloc_id=5391c4b9-73e7-6943-8459-620ce812c13f
2024-08-05T15:11:59.263+0100 [INFO]  agent: requesting shutdown
2024-08-05T15:11:59.263+0100 [INFO]  client: shutting down
2024-08-05T15:11:59.328+0100 [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
2024-08-05T15:11:59.328+0100 [INFO]  client.plugin: plugin manager finished: plugin-type=device
2024-08-05T15:11:59.328+0100 [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
2024-08-05T15:11:59.348+0100 [INFO]  client.plugin: plugin manager finished: plugin-type=driver
2024-08-05T15:11:59.348+0100 [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
2024-08-05T15:11:59.348+0100 [INFO]  client.plugin: plugin manager finished: plugin-type=csi
2024-08-05T15:11:59.416+0100 [INFO]  agent: shutdown complete
2024-08-05T15:15:05.542+0100 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=E:\nomad\data\plugins
2024-08-05T15:15:05.546+0100 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2024-08-05T15:15:05.546+0100 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
2024-08-05T15:15:05.546+0100 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2024-08-05T15:15:05.546+0100 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2024-08-05T15:15:05.546+0100 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2024-08-05T15:15:05.546+0100 [INFO]  client: using state directory: state_dir=E:\nomad\data\client
2024-08-05T15:15:05.552+0100 [INFO]  client: using alloc directory: alloc_dir=E:\nomad\data\alloc
2024-08-05T15:15:05.552+0100 [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=14000-14512
2024-08-05T15:15:05.649+0100 [INFO]  client.fingerprint_mgr.consul: consul agent is available: cluster=default
2024-08-05T15:15:05.714+0100 [WARN]  client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="landlock not supported on this platform"
2024-08-05T15:15:13.963+0100 [WARN]  client.fingerprint_mgr.network: failed to detect link speed: device="Loopback Pseudo-Interface 1" path=powershell.exe command="Get-NetAdapter -Name 'Loopback Pseudo-Interface 1' -ErrorAction Ignore | Select-Object -ExpandProperty 'Speed'" error="exit status 1"
2024-08-05T15:15:14.595+0100 [WARN]  client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: The system cannot find the path specified."
2024-08-05T15:15:14.626+0100 [INFO]  client.fingerprint_mgr.vault: Vault is available: cluster=default
2024-08-05T15:15:24.659+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
2024-08-05T15:15:24.659+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
2024-08-05T15:15:24.659+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=device
2024-08-05T15:15:24.660+0100 [ERROR] client.driver_mgr.docker: failed to list pause containers for recovery: driver=docker error="Get \"http://unix.sock/containers/json?filters=%7B%22label%22%3A%5B%22com.hashicorp.nomad.alloc_id%22%5D%7D\": open //./pipe/docker_engine: The system cannot find the file specified."
2024-08-05T15:15:24.673+0100 [WARN]  client.server_mgr: no servers available
2024-08-05T15:15:24.674+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5391c4b9-73e7-6943-8459-620ce812c13f task=task type=Received msg="Task received by client" failed=false
2024-08-05T15:15:24.674+0100 [ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=5391c4b9-73e7-6943-8459-620ce812c13f/task/cae7b09d
2024-08-05T15:15:24.678+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a9d42f37-f5f8-590e-3df2-e6bb962b6b3b task=task type=Received msg="Task received by client" failed=false
2024-08-05T15:15:24.678+0100 [ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=a9d42f37-f5f8-590e-3df2-e6bb962b6b3b/task/68e876f1
2024-08-05T15:15:24.680+0100 [INFO]  client: started client: node_id=6bbe1a22-1440-c71b-7d86-0d445a719a7d
2024-08-05T15:15:24.680+0100 [WARN]  client.server_mgr: no servers available
2024-08-05T15:15:24.681+0100 [INFO]  client.gc: marking allocation for GC: alloc_id=a9d42f37-f5f8-590e-3df2-e6bb962b6b3b
2024-08-05T15:15:24.682+0100 [INFO]  client.gc: marking allocation for GC: alloc_id=5391c4b9-73e7-6943-8459-620ce812c13f
2024-08-05T15:15:24.688+0100 [INFO]  client.consul: discovered following servers: servers=[172.20.34.40:4647, 172.20.34.41:4647, 172.20.34.42:4647]
2024-08-05T15:15:24.827+0100 [INFO]  client: node registration complete
2024-08-05T15:15:31.083+0100 [INFO]  client: node registration complete
2024-08-05T15:18:09.714+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d29f972f-0875-3084-31e8-f86f1d8342dd task=task type=Received msg="Task received by client" failed=false
2024-08-05T15:18:09.740+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=8c94e150-617b-2203-f95f-16d3061993d6 task=task type=Received msg="Task received by client" failed=false
2024-08-05T15:18:09.788+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=6873d78b-2c91-e5b8-64e6-acfcc8e64dbc task=task type=Received msg="Task received by client" failed=false

Logs from the time that the problems started:

Normal operation
2024-08-05T16:45:12.892+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=325bb45a-2676-d696-0192-b415768a1a90 task=task type=Killing msg="Sent interrupt. Waiting 30s before force killing" failed=false
2024-08-05T16:50:12.973+0100 [INFO]  client.gc: garbage collecting allocation: alloc_id=58b1cf38-c1d0-2984-1fc6-5e40b90db20f reason="forced collection"
2024-08-05T16:50:12.975+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=58b1cf38-c1d0-2984-1fc6-5e40b90db20f task=task type=Killing msg="Sent interrupt. Waiting 30s before force killing" failed=false

The start of the problems
2024-08-05T18:24:05.038+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1065576c-4089-21b8-8214-05fe62f59e81 task=task type=Terminated msg="Exit Code: 0, Exit Message: \"executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:60955->127.0.0.1:14005: wsarecv: An existing connection was forcibly closed by the remote host.\"" failed=false
2024-08-05T18:24:05.041+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=7a3ba3b6-6513-184b-b8de-e6ae45d6def8 task=task type=Terminated msg="Exit Code: 0, Exit Message: \"executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:61608->127.0.0.1:14019: wsarecv: An existing connection was forcibly closed by the remote host.\"" failed=false
2024-08-05T18:24:05.593+0100 [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14005: connectex: No connection could be made because the target machine actively refused it.\""
2024-08-05T18:24:05.593+0100 [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14019: connectex: No connection could be made because the target machine actively refused it.\""
2024-08-05T18:24:05.736+0100 [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=7a3ba3b6-6513-184b-b8de-e6ae45d6def8 driver=raw_exec task_name=task plugin=E:\nomad\nomad.exe id=34452 error="exit status 2"
2024-08-05T18:24:05.754+0100 [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=1065576c-4089-21b8-8214-05fe62f59e81 driver=raw_exec task_name=task plugin=E:\nomad\nomad.exe id=16152 error="exit status 2"
2024-08-05T18:24:05.781+0100 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=1065576c-4089-21b8-8214-05fe62f59e81 task=task reason="Restart within policy" delay=1m4.76052843s
2024-08-05T18:24:05.781+0100 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=7a3ba3b6-6513-184b-b8de-e6ae45d6def8 task=task reason="Restart within policy" delay=1m0.536694183s
2024-08-05T18:24:05.801+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=7a3ba3b6-6513-184b-b8de-e6ae45d6def8 task=task type=Restarting msg="Task restarting in 1m0.536694183s" failed=false
2024-08-05T18:24:05.801+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1065576c-4089-21b8-8214-05fe62f59e81 task=task type=Restarting msg="Task restarting in 1m4.76052843s" failed=false
2024-08-05T18:24:09.828+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a0f5d27f-d704-978b-8897-457936a481a0 task=task type=Terminated msg="Exit Code: 0, Exit Message: \"executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:61741->127.0.0.1:14024: wsarecv: An existing connection was forcibly closed by the remote host.\"" failed=false
2024-08-05T18:24:09.903+0100 [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14024: connectex: No connection could be made because the target machine actively refused it.\""
2024-08-05T18:24:10.607+0100 [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=a0f5d27f-d704-978b-8897-457936a481a0 driver=raw_exec task_name=task plugin=E:\nomad\nomad.exe id=4772 error="exit status 2"
2024-08-05T18:24:10.613+0100 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=a0f5d27f-d704-978b-8897-457936a481a0 task=task reason="Restart within policy" delay=1m0.536694183s

@tgross
Copy link
Member

tgross commented Aug 8, 2024

Hi @PayStreamArmitsteadJ. What's you're seeing isn't identical, but it sure smells similar. I'm not sure where you're getting this idea from:

The executor failed due to not being able to make a HTTP call to a local HTTP server.

The executor isn't going to be making any HTTP requests or outbound connections of any kind. The executor is a go-plugin server. So the Nomad client agent spawns the executor process and then connects to that process via either a unix domain socket or a local TCP connection on Windows.

When the client restarts (ex. after an upgrade), it reads its on-disk state to find tasks that should still be running, and then re-connects to the TCP server it expects to find there. If it can't reconnect, you'll see an error like the one you saw here:

2024-08-05T15:15:24.674+0100 [ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=5391c4b9-73e7-6943-8459-620ce812c13f/task/cae7b09d

At that point, the client says "ok this task must've failed while I was offline". That's just something Nomad has to expect could happen, so it restarts the task (or not) according to the policy in the restart block.

But if the executor task is missing, there's no way for the client to have any idea that the task process is running or not. We're supposed to always kill the child processes when the executor dies, for any reason.

And then that's compounded by what you're seeing here:

2024-08-05T18:24:05.593+0100 [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp 127.0.0.1:14005: connectex: No connection could be made because the target machine actively refused it.""

In this case, it looks like the executor's TCP server is unavailable. I suspect in this case the executor process is gone already so we're just trying to send a message over a connection where nobody's listening anymore. Which is more evidence that the executor processes are getting killed unexpectedly.


A question for both @guifran001 and @PayStreamArmitsteadJ: when these task processes are left running, are their executor processes still running as well, or are the executor processes also gone?

@PayStreamArmitsteadJ
Copy link

@tgross

I have only experienced this issue once and it was affecting our production cluster, so I had to remedy the situation right away. Therefore, I do not know if the executor processes were still running. It seems like this was a one-off issue caused by the upgrade.

@tgross
Copy link
Member

tgross commented Aug 9, 2024

I have only experienced this issue once and it was affecting our production cluster, so I had to remedy the situation right away. Therefore, I do not know if the executor processes were still running. It seems like this was a one-off issue caused by the upgrade.

Ok, that's information in itself. Thanks.

@ORuessel
Copy link

ORuessel commented Aug 21, 2024

Hello everybody,
i have the same situation. We upgraded the nomad cluster in two steps from 1.5.2 to 1.7.3 to 1.8.2. In our situation we have "long live" running jobs over one day,
At the end of the day, we shut down and suspend in AWS the client nodes. On the next morning sometimes we get the issue, that on one different node the process is still running and not fully shutdown by nomad.

Aug 21, '24 11:43:28 +0200	Killing	Sent interrupt. Waiting 5s before force killing
Aug 21, '24 11:43:21 +0200	Restarting	Task restarting in 16s
Aug 21, '24 11:43:21 +0200	Terminated	Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:64431->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host."
Aug 21, '24 11:43:19 +0200	Started	Task started by client
Aug 21, '24 11:43:01 +0200	Restarting	Task restarting in 18s
Aug 21, '24 11:43:01 +0200	Terminated	Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:64411->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host."
Aug 21, '24 11:43:00 +0200	Started	Task started by client
Aug 21, '24 11:42:44 +0200	Restarting	Task restarting in 15s
Aug 21, '24 11:42:44 +0200	Terminated	Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:64400->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host."
Aug 21, '24 11:42:43 +0200	Started	Task started by client

There was a old instance already running. After stop the process it was not possible to start the process in a normal way. only restart the hole machine was a temporary solution. The logfiles of nomad are crashed for today at this point i cannot provide at the moment and will monitor this situation and will deliver logs afterwards.

@tgross i have multiple times per week this issue, old instances on the servers are running with version 1.8.2. How can i check the executor process is still running ?

@tk2232
Copy link

tk2232 commented Sep 16, 2024

I've encountered the same issue on Windows 10 and 11.
After downgrading from Nomad 1.8.3 to 1.7.7, everything worked as expected again.

@tgross tgross added the hcc/jira label Oct 3, 2024
@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Oct 3, 2024
@tgross tgross removed their assignment Oct 3, 2024
@tgross
Copy link
Member

tgross commented Oct 3, 2024

I've updated this issue to make sure it gets some attention on our internal tracker. I also have some suspicion that this may be related to #23969

@CullenShane
Copy link

CullenShane commented Oct 4, 2024

I was able to reproduce this issue on Nomad 1.8.1 and 1.8.4 on Windows 11 with a minimal batch job.

job "issue-23668" {
  datacenters = ["sally"]
  type = "batch"
  
  task "task" {
    driver = "raw_exec"

    config {
      command = "PowerShell"
      args = [
        "-Command",
        "Start-Sleep -Seconds 5"
      ]
    }
  }
} 

Failure logs:

2024-10-04T17:02:35.616-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0cbf4574-ebb0-b8d0-d6c5-140291560959 task=task type=Received msg="Task received by client" failed=false
2024-10-04T17:02:35.630-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0cbf4574-ebb0-b8d0-d6c5-140291560959 task=task type="Task Setup" msg="Building Task Directory" failed=false
2024-10-04T17:02:35.727-0500 [WARN]  client.alloc_runner.task_runner: some environment variables not available for rendering: alloc_id=0cbf4574-ebb0-b8d0-d6c5-140291560959 task=task keys=""
2024-10-04T17:02:35.727-0500 [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:PowerShell Args:[-Command Start-Sleep -Seconds 5] OverrideCgroupV2: OverrideCgroupV1:map[] OOMScoreAdj:0}"
2024-10-04T17:02:35.805-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0cbf4574-ebb0-b8d0-d6c5-140291560959 task=task type=Started msg="Task started by client" failed=false
2024-10-04T17:02:37.433-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0cbf4574-ebb0-b8d0-d6c5-140291560959 task=task type=Terminated msg="Exit Code: 0, Exit Message: \"executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:59504->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host.\"" failed=false
2024-10-04T17:02:37.437-0500 [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 127.0.0.1:14000: connectex: No connection could be made because the target machine actively refused it.\""
2024-10-04T17:02:37.530-0500 [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=0cbf4574-ebb0-b8d0-d6c5-140291560959 driver=raw_exec task_name=task plugin=C:\Nomad\nomad.exe id=13416 error="exit status 2"
2024-10-04T17:02:37.530-0500 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=0cbf4574-ebb0-b8d0-d6c5-140291560959 task=task reason="Restart within policy" delay=17.91890335s
2024-10-04T17:02:37.530-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0cbf4574-ebb0-b8d0-d6c5-140291560959 task=task type=Restarting msg="Task restarting in 17.91890335s" failed=false

Once I downgraded to 1.7.7 the issue went away.

Success on 1.7.7:

    2024-10-04T17:05:16.995-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5dd66d9b-0ad6-5581-356b-d28f0e56a653 task=task type=Received msg="Task received by client" failed=false
    2024-10-04T17:05:17.007-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5dd66d9b-0ad6-5581-356b-d28f0e56a653 task=task type="Task Setup" msg="Building Task Directory" failed=false
    2024-10-04T17:05:17.105-0500 [WARN]  client.alloc_runner.task_runner: some environment variables not available for rendering: alloc_id=5dd66d9b-0ad6-5581-356b-d28f0e56a653 task=task keys=""
    2024-10-04T17:05:17.107-0500 [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:PowerShell Args:[-Command Start-Sleep -Seconds 5]}"
    2024-10-04T17:05:17.185-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5dd66d9b-0ad6-5581-356b-d28f0e56a653 task=task type=Started msg="Task started by client" failed=false
    2024-10-04T17:05:22.478-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=5dd66d9b-0ad6-5581-356b-d28f0e56a653 task=task type=Terminated msg="Exit Code: 0" failed=false
    2024-10-04T17:05:22.490-0500 [INFO]  client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=5dd66d9b-0ad6-5581-356b-d28f0e56a653 driver=raw_exec task_name=task plugin=C:\Nomad\nomad.exe id=10828
    2024-10-04T17:05:22.490-0500 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=5dd66d9b-0ad6-5581-356b-d28f0e56a653 task=task reason="Restart unnecessary as task terminated successfully"
    2024-10-04T17:05:22.506-0500 [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=5dd66d9b-0ad6-5581-356b-d28f0e56a653 task=task plugin=C:\Nomad\nomad.exe id=8764
    2024-10-04T17:05:22.506-0500 [INFO]  client.gc: marking allocation for GC: alloc_id=5dd66d9b-0ad6-5581-356b-d28f0e56a653

@tgross tgross self-assigned this Oct 10, 2024
@tgross tgross moved this from Needs Roadmapping to In Progress in Nomad - Community Issues Triage Oct 10, 2024
@tgross
Copy link
Member

tgross commented Oct 14, 2024

@CullenShane I was hoping your reproduction would be the clue I needed, but I think it's possibly related but not quite the same. It looks like we're simply returning an error spuriously. I ran that job and used process explorer to look at the process tree. The task was PID6000, and the executor.out log for the batch process reads like this:

{"@level":"debug","@message":"plugin address","@timestamp":"2024-10-14T15:59:21.207440Z","address":"127.0.0.1:14000","network":"tcp"}
{"@level":"debug","@message":"shutdown requested","@module":"executor","@timestamp":"2024-10-14T16:00:21.488858Z","grace_period_ms":0,"signal":""}
{"@level":"warn","@message":"failed to shutdown due to inability to find process","@module":"executor","@timestamp":"2024-10-14T16:00:21.488858Z","error":"executor failed to find process: OpenProcess: The parameter is incorrect.","pid":6000}

The client logs report this as:

2024-10-14T14:21:38.993Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unknown desc = executor failed to find process: OpenProcess: The parameter is incorrect."

At the point we're calling Shutdown, process 6000 has already exited. But of course it has! In theory there's nothing to do so we should be able to return nil rather than an error, but I'm worried doing so may be leaking child processes. I need to verify that.

In the reports above the executor fails (the parent of the task itself), and then the task isn't stopping as it should. For folks who have seen this, it would be helpful if you could get the executor.out logs from the failed allocations.

(If this were the bug, it would be is the exact opposite bug as #11958)

@tgross
Copy link
Member

tgross commented Oct 14, 2024

Ok, I tried some more to reproduce this and I want to make sure that I've documented here a specific bug where the proximate cause is the executor process fails so that folks can report whether this is what they're seeing.

I'm running Nomad as a Windows service and deploying a web application echo.exe as a raw_exec service job:

job "example" {

  task "task" {
    driver = "raw_exec"

    config {
      command = "C:/Users/Administrator/echo/echo.exe"
    }
  }
}

Killing the echo.exe process is uninteresting and the task restarts as expected.

Instead, let's look at what happens if we kill the executor. Before doing so, using Process Explorer I can see my process tree looks like this:

(1060) C:/Users/Administrator/nomad.exe agent -dev -config=C:/Users/Administrator/config.hcl
├── (3836) C:/Users/Administrator/nomad.exe logmon
│   └── (3740) C:/Windows/system32/conhost.exe 0x4
└── (3424) C:/Users/Administrator/nomad.exe executor
    ├── (4420) C:/Windows/system32/conhost.exe 0x4
    └── (2240) C:/Users/Administrator/echo/echo.exe

Our client logs look like the following. Note this tells us where to find the executor.out log (which can also be read via nomad alloc fs):

2024-10-14T18:33:59.355Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:C:/Users/Administrator/echo/echo.exe Args:[] OverrideCgroupV2: OverrideCgroupV1:map[] OOMScoreAdj:0}"
2024-10-14T18:33:59.355Z [DEBUG] client.driver_mgr.raw_exec.executor: starting plugin: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd driver=raw_exec task_name=task path=C:\Users\Administrator\nomad.exe args=["C:\Users\Administrator\nomad.exe", "executor", "{"LogFile":"C:\\Windows\\SystemTemp\\NomadClient183761340\\ac14131d-352a-d5b5-9270-6ec3dc0267bd\\task\\executor.out","LogLevel":"debug","FSIsolation":false,"Compute":{"tc":4600,"nc":2}}"]
2024-10-14T18:33:59.358Z [DEBUG] client.driver_mgr.raw_exec.executor: plugin started: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd driver=raw_exec task_name=task path=C:\Users\Administrator\nomad.exe pid=3424
2024-10-14T18:33:59.358Z [DEBUG] client.driver_mgr.raw_exec.executor: waiting for RPC address: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd driver=raw_exec task_name=task plugin=C:\Users\Administrator\nomad.exe
2024-10-14T18:33:59.455Z [DEBUG] client.driver_mgr.raw_exec.executor: using plugin: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd driver=raw_exec task_name=task version=2
2024-10-14T18:33:59.479Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd task=task type=Started msg="Task started by client" failed=false

Now kill the echo.exe process via Process Explorer and we'll see the following in our client logs:

2024-10-14T18:41:59.936Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd task=task type=Terminated msg="Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:50900->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host."" failed=false
2024-10-14T18:41:59.936Z [DEBUG] client.driver_mgr.raw_exec.executor.stdio: received EOF, stopping recv loop: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd driver=raw_exec task_name=task err="rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:50900->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host."
2024-10-14T18:41:59.936Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial tcp 127.0.0.1:14000: connectex: No connection could be made because the target machine actively refused it.""
2024-10-14T18:41:59.945Z [ERROR] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd driver=raw_exec task_name=task plugin=C:\Users\Administrator\nomad.exe id=3424 error="exit status 1"
2024-10-14T18:41:59.945Z [DEBUG] client.driver_mgr.raw_exec.executor: plugin exited: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd driver=raw_exec task_name=task
2024-10-14T18:41:59.945Z [INFO] client.alloc_runner.task_runner: restarting task: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd task=task reason="Restart within policy" delay=15.884342135s

At this point, we can see in Process Explorer that the echo.exe process has been leaked. Because it's bound to a socket, the restart will fail. If we wait 15s for the restart, we see the following in the client logs:

2024-10-14T18:42:15.833Z [DEBUG] client.driver_mgr.raw_exec.executor: starting plugin: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd driver=raw_exec task_name=task path=C:\Users\Administrator\nomad.exe args=["C:\Users\Administrator\nomad.exe", "executor", "{"LogFile":"C:\\Windows\\SystemTemp\\NomadClient183761340\\ac14131d-352a-d5b5-9270-6ec3dc0267bd\\task\\executor.out","LogLevel":"debug","FSIsolation":false,"Compute":{"tc":4600,"nc":2}}"]
2024-10-14T18:42:15.837Z [DEBUG] client.driver_mgr.raw_exec.executor: plugin started: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd driver=raw_exec task_name=task path=C:\Users\Administrator\nomad.exe pid=3236
2024-10-14T18:42:15.837Z [DEBUG] client.driver_mgr.raw_exec.executor: waiting for RPC address: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd driver=raw_exec task_name=task plugin=C:\Users\Administrator\nomad.exe
2024-10-14T18:42:15.934Z [DEBUG] client.driver_mgr.raw_exec.executor: using plugin: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd driver=raw_exec task_name=task version=2
2024-10-14T18:42:15.957Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd task=task type=Started msg="Task started by client" failed=false
2024-10-14T18:42:16.012Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=ac14131d-352a-d5b5-9270-6ec3dc0267bd task=task type=Terminated msg="Exit Code: 1" failed=false
2024-10-14T18:42:16.012Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unknown desc = executor failed to find process: OpenProcess: The parameter is incorrect."

The executor.out logs look like the following:

{"@level":"debug","@message":"plugin address","@timestamp":"2024-10-14T18:33:59.452893Z","address":"127.0.0.1:14000","network":"tcp"}
{"@level":"debug","@message":"plugin address","@timestamp":"2024-10-14T18:42:15.932447Z","address":"127.0.0.1:14000","network":"tcp"}
{"@level":"debug","@message":"shutdown requested","@module":"executor","@timestamp":"2024-10-14T18:42:16.012855Z","grace_period_ms":0,"signal":""}
{"@level":"warn","@message":"failed to shutdown due to inability to find process","@module":"executor","@timestamp":"2024-10-14T18:42:16.012940Z","error":"executor failed to find process: OpenProcess: The parameter is incorrect.","pid":5580}

Note that there are no logs in the executor.out for when we kill the executor, and we get the "OpenProcess" error only when the task is restarted and fails. All that follows from the Task Events we see as well. Here the first Terminated event is from the killed executor -- the Nomad client records that the go-plugin connection to the executor has been lost. And then the subsequent Terminated events is from the application failing (as expected, because we've leaked the process bound to the port).

Recent Events:
Time                  Type            Description
2024-10-14T18:42:34Z  Not Restarting  Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2024-10-14T18:42:34Z  Terminated      Exit Code: 1
2024-10-14T18:42:34Z  Started         Task started by client
2024-10-14T18:42:16Z  Restarting      Task restarting in 18.621692608s
2024-10-14T18:42:16Z  Terminated      Exit Code: 1
2024-10-14T18:42:15Z  Started         Task started by client
2024-10-14T18:41:59Z  Restarting      Task restarting in 15.884342135s
2024-10-14T18:41:59Z  Terminated      Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:50900->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host."
2024-10-14T18:33:59Z  Started         Task started by client
2024-10-14T18:33:59Z  Task Setup      Building Task Directory

So my hypothesis is:

  • The executor is failing ("wsarecv: An existing connection was forcibly closed by the remote host").
  • Then the workload process is leaked.
  • This prevents the new workload process from succeeding.
  • The error messages we see in the logs from that are themselves weird and misleading ("executor failed to find process: OpenProcess: The parameter is incorrect").

This makes for 3 bugs in a trenchcoat:

  • The executor should not be failing. However, I did notice while testing with 1.8.x that the executor's resource usage was climbing. This is not the case with a build that includes windows: trade heap for stack to build process tree for stats in linear space #24182. So that might be fixed already but it's hard to tell why we'd be seeing this otherwise. If folks have executor.out logs that contain stack traces that'd be new information.
  • The workload process should not be leaked.
  • The error handling reporting here is wrong.

If folks have any information from their environments that contradicts that hypothesis, I'd appreciate sharing it because that would be helpful to know.

@tgross
Copy link
Member

tgross commented Oct 15, 2024

I've got a fix up here for the executor orphaning the child processes: #24214. I'll open a new issue for improving the UX of the error messages.

@tgross tgross added this to the 1.9.x milestone Oct 15, 2024
tgross added a commit that referenced this issue Oct 15, 2024
On Windows, if the `raw_exec` driver's executor exits, the child processes are
not also killed. Create a Windows "job object" (not to be confused with a Nomad
job) and add the executor to it. Child processes of the executor will inherit
the job automatically. When the handle to the job object is freed (on executor
exit), the job itself is destroyed and this causes all processes in that job to
exit.

Fixes: #23668
Ref: https://learn.microsoft.com/en-us/windows/win32/procthread/job-objects
@tgross tgross modified the milestones: 1.9.x, 1.9.1 Oct 15, 2024
@tgross
Copy link
Member

tgross commented Oct 15, 2024

Opened #24220 and #24219 as follow-ups from this issue.

@tgross
Copy link
Member

tgross commented Oct 16, 2024

The fix for the child process being orphaned will ship in Nomad 1.9.1 (with Enterprise backports). The executor failure itself is likely fixed by #24182, and that's why we're seeing it suddenly in 1.8.x even though the underlying bug here is quite old. (And, as it turns out, exists on unixish systems too).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment