Fix #6914 job_agent recover existing sbatch jobs #7404

Merged: 45 commits merged into master on Jan 6, 2025

Conversation

@robnagler (Member) commented Dec 20, 2024

  • Fix #7308 (Remove SIREPO_FEATURE_CONFIG_UI_WEBSOCKET=0 test case): ui_websocket default is True and removed False case from test.sh
  • Fix #7385 (Remove supervisor _run task): job_supervisor run returns immediately and is not a task
  • job_supervisor run_status_op pends until run or status watcher complete
  • run_status_update is new op that is sent asynchronously from agent to supervisor
  • job_agent separate out logic for run/state; reconnects to sbatch job
  • job_cmd restructured and more error handling
  • job_cmd centralized dispatch in _process_msg (see the sketch below)
  • job_cmd._do_compute more robust and supports separate run/status
  • job documents more ops and statuses
  • Added max_procs=4 to test.sh to parallelize tests
  • Fixed global state checks (mpiexec) to allow parallel test execution
  • Increased timeouts to allow for delays during parallel test execution
  • Improve arg validation in simulation_db.json_filename
  • sbatchLoginService commented out invalid state transitions
  • SIREPO.srlog includes time

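A minimal sketch of the "centralized dispatch in _process_msg" idea, purely illustrative: the class, op, and handler names below are assumptions and do not reflect Sirepo's actual job_cmd code. The point is that every incoming message is routed through one method, so unknown ops and handler failures are handled uniformly.

```python
# Hypothetical illustration only; names are made up, not Sirepo's job_cmd API.
class _JobCmd:
    def _process_msg(self, msg):
        # Single entry point: look up a handler by op name and run it,
        # converting unknown ops and exceptions into error replies.
        handler = getattr(self, f"_op_{msg['op_name']}", None)
        if handler is None:
            return {"state": "error", "error": f"unknown op={msg['op_name']}"}
        try:
            return handler(msg)
        except Exception as e:
            return {"state": "error", "error": str(e)}

    def _op_run(self, msg):
        # Start the computation; progress is reported by a separate status op.
        return {"state": "running"}

    def _op_run_status(self, msg):
        # Report current status without (re)starting the job.
        return {"state": "running", "percent_complete": 42}


print(_JobCmd()._process_msg({"op_name": "run_status"}))
```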
@robnagler (Member Author)

GitHub Actions speed-up with max_procs=4 is 3x (8 vs. 25 minutes). The docker pull, pip install, fmt, etc. take 2.5 minutes, so the speed-up is actually linear. I'm going to add SIREPO_MPI_CORES=2, because I think this will test the code better.
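(A rough check of the linearity claim using the numbers above: the parallelizable portion is roughly 25 − 2.5 = 22.5 minutes, so with max_procs=4 the expected wall time is 22.5 / 4 + 2.5 ≈ 8 minutes, which matches the observed run.)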

@robnagler requested a review from e-carlin on December 20, 2024 16:56
@robnagler (Member Author)

@e-carlin I'm still testing. Good to get started on the review now, though.

@robnagler (Member Author)

SIREPO_MPI_CORES=2 doesn't change the speed. I think it makes GH Actions a better test.

@robnagler (Member Author)

@e-carlin ready for a review. Tests pass, and it seems to work on NERSC. I've done a lot more testing on NERSC than locally. I didn't test docker, but I don't think I modified it.

@e-carlin (Member) left a comment


I'm working my way through reviewing. Probably another day. I left some initial comments.

Some quirks I noticed:

  • If I'm running a sim (doesn't need to be under sbatch) and I kill -9 it from the terminal, the GUI reports it as canceled. Seems like it should be an error.
  • If I kill -9 an agent (again, doesn't need to be under sbatch), the GUI continues to report "running: awaiting output", even after a refresh.
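A minimal sketch of the distinction behind the first quirk, under the assumption that the job process is launched via Python's subprocess machinery (this is not Sirepo's actual status mapping): a child killed by a signal reports a negative returncode, which could be surfaced as an error rather than canceled.

```python
import signal
import subprocess

# Hypothetical classification; Sirepo's real job_cmd/job_agent logic differs.
def classify_exit(returncode):
    if returncode == 0:
        return "completed"
    if returncode < 0:
        # Negative returncode means the child died from a signal
        # (e.g. SIGKILL from `kill -9`), arguably an error, not a cancel.
        return f"error (killed by {signal.Signals(-returncode).name})"
    return "error"

p = subprocess.run(["bash", "-c", "kill -9 $$"])
print(classify_exit(p.returncode))  # error (killed by SIGKILL)
```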

@robnagler (Member Author)

Added to #7406 (comment).

@e-carlin (Member) left a comment


Two errors while running simulations:

  • openmc > aurora > wait for volume extraction > visualization > vagrant cluster > login > start > error: [No such file or directory]: open('/var/tmp/vagrant/sirepo/user/ZSLW4c4Y/openmc/ZSLW4c4Y-VqEsWQZE-openmcAnimation/in.json', 'r')
  • flash > blast2 > run setup and compile > visualization > vagrant cluster > login > start > error: /home/vagrant/.pyenv/versions/py3/bin/python: can't open file '/var/tmp/vagrant/sirepo/user/6NiqqZff/flash/6NiqqZff-ykM8ISjL-animation/parameters.py': [Errno 2] No such file or directory

@e-carlin (Member) left a comment


I've reviewed everything. Just a few more comments.

The code works well. There are a lot of changes and a lot of cases so I'm sure there are some I didn't exercise.

@robnagler (Member Author)

> I've reviewed everything. Just a few more comments.

Thank you. I know it was a lot and very complicated.

> The code works well. There are a lot of changes and a lot of cases so I'm sure there are some I didn't exercise.

I appreciate the testing.

@robnagler requested a review from e-carlin on January 2, 2025 15:36
@e-carlin (Member) left a comment


I ran into the same openmc error

#7404 (review)

@robnagler (Member Author)

The fix for #7404 had to write in.json. Refactored that code. openmc works now. I didn't test flash.

@robnagler requested a review from e-carlin on January 4, 2025 00:44
@e-carlin (Member) left a comment


Openmc and flash work.

I can't create a reproducible example, but this has happened to me 3 times: the supervisor can get into a state where it no longer responds to SIGTERM.

~$ ps uww | grep job_supervisor
vagrant   426434  6.5  2.0 596624 164596 pts/2   Sl+  16:41   0:32 /home/vagrant/.pyenv/versions/3.9.15/envs/py3/bin/python /home/vagrant/.pyenv/versions/py3/bin/sirepo job_supervisor
~$ kill -SIGTERM 426434
~$ ps uww | grep job_supervisor
vagrant   426434  6.5  2.0 596624 164596 pts/2   Sl+  16:41   0:33 /home/vagrant/.pyenv/versions/3.9.15/envs/py3/bin/python /home/vagrant/.pyenv/versions/py3/bin/sirepo job_supervisor

When I send SIGTERM to the supervisor, it logs:

Jan 06 16:49:33 426434     0 sirepo/job_driver/local.py:87:kill LocalDriver(a=foCR k=sequential u=d2dF []) pid=427224

The closest I can get to a reproducible example is that it seems to only happen when the supervisor is signaled while a job is running.
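A minimal sketch of the suspected failure mode described above, purely hypothetical and not the supervisor's actual shutdown code: if the SIGTERM handler's shutdown path awaits termination of a running job and that await never completes, the process keeps running and appears to ignore the signal.

```python
import asyncio
import signal

# Hypothetical illustration of the hang; all names are made up.
async def kill_running_job():
    # Stand-in for a driver/job kill that never finishes while a job is running.
    await asyncio.Event().wait()  # this event is never set

async def main():
    done = asyncio.Event()

    async def shutdown():
        await kill_running_job()  # hangs here
        done.set()                # never reached, so SIGTERM appears ignored

    loop = asyncio.get_running_loop()
    loop.add_signal_handler(signal.SIGTERM, lambda: asyncio.create_task(shutdown()))
    await done.wait()

asyncio.run(main())
```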

@robnagler (Member Author)

created #7416

@robnagler (Member Author)

Should we merge?

@e-carlin (Member) commented Jan 6, 2025

Good with me!

@robnagler merged commit 6fee980 into master on Jan 6, 2025
3 checks passed
@robnagler deleted the 6914-sbatch-recover branch on January 6, 2025 22:06