Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkpointing #125

Merged
merged 1 commit into from
Nov 15, 2023
Merged

Checkpointing #125

merged 1 commit into from
Nov 15, 2023

Conversation

akoumjian
Copy link
Collaborator

Wanted to push this early so implementation can be discussed. This required more changes than I anticipated.
It needs many unit tests.
It is blocked from being fully implemented until we can get the other stages to accept the quivr tables as inputs / outputs.

What this adds:

  • Filtering observations for those within test orbit radius is now its own "stage", with a data product that it produces
  • A checkpoint_dir that is used for storing config and checkpointed data files. Intermediate results are written to disk if it is specified.
  • a compare_configs function that checks to make sure a checkpointed instance is using the original config with an override boolean allow_config_override. Only runs when checkpoint_dir is being used
  • Adds a LinkTestOrbitStageResult which contains references to result[s], a name which describes the stage result, and optional path[s] to the data products on disk. This allows a dynamic caller of link_test_orbit to know what type of results are being yielded back, it can analyze the results in memory if it chooses and know that the result files are ready to store elsewhere if checkpoint_dir is being used.
  • A partial implementation of checkpointing. load_initial_checkpoint_values is run near the beginning to check the state of things. It will assign the current CheckpointData based on what it sees. This requires adding a control flow to link_test_orbit to always check what stage the checkpoint: CheckpointData is at. It also requires updating the checkpoint after each stage so that the following stages are run.
  • Isolated the ray initialization code into its own function, just for cleanliness

Additional thoughts:
There is a bit of a game of ping pong with use_ray and whether we are passing ObjectRef or the objects themselves. Even if we always use ray and get rid of that boolean, there will be some of this. The checkpointing is also going to suffer this a bit as it becomes the main container to move the inputs along the pipeline. We anticipated this and I'm not sure there is a clearly correct solution. I suggest we push forward with it until everything is updated to use quivr tables at the edges and checkpointing is complete, then a pattern will hopefully emerge.

@akoumjian akoumjian changed the base branch from main to v2.0-link-aims-sample November 1, 2023 13:50
Base automatically changed from v2.0-link-aims-sample to main November 2, 2023 13:52
@akoumjian akoumjian changed the base branch from main to v2.0-fitted-orbits November 6, 2023 16:49
@akoumjian akoumjian changed the title WIP: Checkpointing Checkpointing Nov 6, 2023
thor/main.py Show resolved Hide resolved
Base automatically changed from v2.0-fitted-orbits to main November 9, 2023 21:00
@moeyensj
Copy link
Owner

Thanks for this @akoumjian. I'm going to merge it and fix typing issues in a later PR.

@moeyensj moeyensj merged commit 4971bd5 into main Nov 15, 2023
0 of 3 checks passed
@moeyensj moeyensj deleted the checkpointing branch November 15, 2023 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants