Fix #6914 job_agent recover existing sbatch jobs #7404

Merged: 45 commits merged into master on Jan 6, 2025

Conversation

@robnagler (Member) commented Dec 20, 2024

  • Fix #7308 (Remove SIREPO_FEATURE_CONFIG_UI_WEBSOCKET=0 test case): ui_websocket default is True and removed False case from test.sh
  • Fix #7385 (Remove supervisor _run task): job_supervisor run returns immediately and is not a task
  • job_supervisor run_status_op pends until run or status watcher complete
  • run_status_update is new op that is sent asynchronously from agent to supervisor
  • job_agent separate out logic for run/state; reconnects to sbatch job
  • job_cmd restructured and more error handling
  • job_cmd centralized dispatch in _process_msg (see the sketch below)
  • job_cmd._do_compute more robust and supports separate run/status
  • job documents more ops and statuses
  • Added max_procs=4 to test.sh to parallelize tests
  • Fixed global state checks (mpiexec) to allow parallel test execution
  • Increased timeouts to allow for delays during parallel test execution
  • Improve arg validation in simulation_db.json_filename
  • sbatchLoginService commented out invalid state transitions
  • SIREPO.srlog includes time

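A minimal sketch of the "centralized dispatch in _process_msg" idea, purely illustrative: the class, op, and handler names below are assumptions and do not reflect Sirepo's actual job_cmd code. The point is that every incoming message is routed through one method, so unknown ops and handler failures are handled uniformly.

```python
# Hypothetical illustration only; names are made up, not Sirepo's job_cmd API.
class _JobCmd:
    def _process_msg(self, msg):
        # Single entry point: look up a handler by op name and run it,
        # converting unknown ops and exceptions into error replies.
        handler = getattr(self, f"_op_{msg['op_name']}", None)
        if handler is None:
            return {"state": "error", "error": f"unknown op={msg['op_name']}"}
        try:
            return handler(msg)
        except Exception as e:
            return {"state": "error", "error": str(e)}

    def _op_run(self, msg):
        # Start the computation; progress is reported by a separate status op.
        return {"state": "running"}

    def _op_run_status(self, msg):
        # Report current status without (re)starting the job.
        return {"state": "running", "percent_complete": 42}


print(_JobCmd()._process_msg({"op_name": "run_status"}))
```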
@robnagler (Member Author)

GitHub Actions speed-up with max_procs=4 is 3x (8 vs. 25 minutes). The docker pull, pip install, fmt, etc. take 2.5 minutes, so the speed-up is actually linear. I'm going to add SIREPO_MPI_CORES=2, because I think this will test the code better.
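(A rough check of the linearity claim using the numbers above: the parallelizable portion is roughly 25 − 2.5 = 22.5 minutes, so with max_procs=4 the expected wall time is 22.5 / 4 + 2.5 ≈ 8 minutes, which matches the observed run.)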

@robnagler requested a review from e-carlin on December 20, 2024 16:56
@robnagler (Member Author)

@e-carlin I'm still testing. Good to get started on the review now, though.

@robnagler (Member Author)

SIREPO_MPI_CORES=2 doesn't change the speed. I think it makes GH Actions a better test.

@robnagler (Member Author)

@e-carlin ready for a review. Tests pass, and it seems to work on NERSC. I've done a lot more testing on NERSC than locally. I didn't test docker, but I don't think I modified it.

@e-carlin (Member) left a comment


I'm working my way through reviewing. Probably another day. I left some initial comments.

Some quirks I noticed:

  • If I'm running a sim (doesn't need to be under sbatch) and I kill -9 it from the terminal, the GUI reports it as canceled. Seems like it should be an error.
  • If I kill -9 an agent (again, doesn't need to be under sbatch), the GUI continues to report "running: awaiting output", even after a refresh.
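A minimal sketch of the distinction behind the first quirk, under the assumption that the job process is launched via Python's subprocess machinery (this is not Sirepo's actual status mapping): a child killed by a signal reports a negative returncode, which could be surfaced as an error rather than canceled.

```python
import signal
import subprocess

# Hypothetical classification; Sirepo's real job_cmd/job_agent logic differs.
def classify_exit(returncode):
    if returncode == 0:
        return "completed"
    if returncode < 0:
        # Negative returncode means the child died from a signal
        # (e.g. SIGKILL from `kill -9`), arguably an error, not a cancel.
        return f"error (killed by {signal.Signals(-returncode).name})"
    return "error"

p = subprocess.run(["bash", "-c", "kill -9 $$"])
print(classify_exit(p.returncode))  # error (killed by SIGKILL)
```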

@robnagler (Member Author)

Added to #7406 (comment).

@e-carlin (Member) left a comment


Two errors while running simulations:

  • openmc > aurora > wait for volume extraction > visualization > vagrant cluster > login > start > error: [No such file or directory]: open('/var/tmp/vagrant/sirepo/user/ZSLW4c4Y/openmc/ZSLW4c4Y-VqEsWQZE-openmcAnimation/in.json', 'r')
  • flash > blast2 > run setup and compile > visualization > vagrant cluster > login > start > error: /home/vagrant/.pyenv/versions/py3/bin/python: can't open file '/var/tmp/vagrant/sirepo/user/6NiqqZff/flash/6NiqqZff-ykM8ISjL-animation/parameters.py': [Errno 2] No such file or directory

@e-carlin (Member) left a comment


I've reviewed everything. Just a few more comments.

The code works well. There are a lot of changes and a lot of cases so I'm sure there are some I didn't exercise.

@robnagler (Member Author)

> I've reviewed everything. Just a few more comments.

Thank you. I know it was a lot and very complicated.

> The code works well. There are a lot of changes and a lot of cases so I'm sure there are some I didn't exercise.

I appreciate the testing.

@robnagler requested a review from e-carlin on January 2, 2025 15:36
@e-carlin (Member) left a comment


I ran into the same openmc error

#7404 (review)

@robnagler (Member Author)

The fix for #7404 had to write in.json. Refactored that code. openmc works now. I didn't test flash.

@robnagler requested a review from e-carlin on January 4, 2025 00:44
@e-carlin (Member) left a comment


Openmc and flash work.

I can't create a reproducible example, but this has happened to me 3 times: the supervisor can get into a state where it no longer responds to SIGTERM.

~$ ps uww | grep job_supervisor
vagrant   426434  6.5  2.0 596624 164596 pts/2   Sl+  16:41   0:32 /home/vagrant/.pyenv/versions/3.9.15/envs/py3/bin/python /home/vagrant/.pyenv/versions/py3/bin/sirepo job_supervisor
~$ kill -SIGTERM 426434
~$ ps uww | grep job_supervisor
vagrant   426434  6.5  2.0 596624 164596 pts/2   Sl+  16:41   0:33 /home/vagrant/.pyenv/versions/3.9.15/envs/py3/bin/python /home/vagrant/.pyenv/versions/py3/bin/sirepo job_supervisor

When I send SIGTERM to the supervisor, it logs:

Jan 06 16:49:33 426434     0 sirepo/job_driver/local.py:87:kill LocalDriver(a=foCR k=sequential u=d2dF []) pid=427224

The closest I can get to a reproducible example is that it seems to only happen when the supervisor is signaled while a job is running.
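A minimal sketch of the suspected failure mode described above, purely hypothetical and not the supervisor's actual shutdown code: if the SIGTERM handler's shutdown path awaits termination of a running job and that await never completes, the process keeps running and appears to ignore the signal.

```python
import asyncio
import signal

# Hypothetical illustration of the hang; all names are made up.
async def kill_running_job():
    # Stand-in for a driver/job kill that never finishes while a job is running.
    await asyncio.Event().wait()  # this event is never set

async def main():
    done = asyncio.Event()

    async def shutdown():
        await kill_running_job()  # hangs here
        done.set()                # never reached, so SIGTERM appears ignored

    loop = asyncio.get_running_loop()
    loop.add_signal_handler(signal.SIGTERM, lambda: asyncio.create_task(shutdown()))
    await done.wait()

asyncio.run(main())
```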

@robnagler (Member Author)

created #7416

@robnagler (Member Author)

Should we merge?

@e-carlin (Member) commented Jan 6, 2025

Good with me!

@robnagler merged commit 6fee980 into master on Jan 6, 2025
3 checks passed
@robnagler deleted the 6914-sbatch-recover branch on January 6, 2025 22:06