Checkpointing #125

akoumjian · 2023-11-01T13:49:54Z

I wanted to push this early so the implementation can be discussed. It required more changes than I anticipated.
It still needs many unit tests.
It cannot be fully implemented until the other stages accept quivr tables as inputs and outputs.

What this adds:

Filtering observations to those within the test orbit radius is now its own "stage" that produces its own data product.
A checkpoint_dir that is used for storing the config and checkpointed data files. If it is specified, intermediate results are written to disk.
A compare_configs function that checks that a checkpointed instance is using the original config, with a boolean allow_config_override to override the check. It only runs when checkpoint_dir is being used.
A LinkTestOrbitStageResult that contains references to the result[s], a name that describes the stage result, and optional path[s] to the data products on disk. This lets a dynamic caller of link_test_orbit know what type of results are being yielded back. The caller can analyze the results in memory if it chooses, and it knows the result files are ready to store elsewhere if checkpoint_dir is being used.
A partial implementation of checkpointing. load_initial_checkpoint_values runs near the beginning to check the state of things and assigns the current CheckpointData based on what it sees. This requires adding control flow to link_test_orbit that always checks what stage checkpoint: CheckpointData is at. It also requires updating the checkpoint after each stage so that the following stages are run.
The ray initialization code is isolated into its own function, just for cleanliness.
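
For discussion, here is a rough sketch of how the config check, the stage result, and the checkpoint-driven control flow could fit together. This is not the code in this PR. The stage names, the JSON files, the StageResult class (a stand-in for LinkTestOrbitStageResult), and run_stages (a stand-in for link_test_orbit) are all assumptions made for illustration, and the signature of compare_configs here is a guess.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Iterator, Optional

# Hypothetical stage order; the real stage names in thor may differ.
STAGES = ("filter_observations", "range_and_transform", "cluster_and_link")


def compare_configs(saved: dict, current: dict,
                    allow_config_override: bool = False) -> dict:
    """Refuse to resume a checkpointed run with a different config
    unless the caller explicitly allows the override."""
    if saved != current and not allow_config_override:
        raise ValueError("config differs from the checkpointed config")
    return current


@dataclass
class StageResult:
    """Stand-in for LinkTestOrbitStageResult: a descriptive name, the
    in-memory result, and an optional path to the data product on disk."""
    name: str
    result: Any
    path: Optional[Path] = None


def run_stages(config: dict, checkpoint_dir: Optional[Path] = None,
               allow_config_override: bool = False) -> Iterator[StageResult]:
    """Yield one StageResult per stage, reusing files found in checkpoint_dir."""
    if checkpoint_dir is not None:
        checkpoint_dir.mkdir(parents=True, exist_ok=True)
        config_path = checkpoint_dir / "config.json"
        if config_path.exists():
            # The config check only runs when a checkpoint directory is in use.
            saved = json.loads(config_path.read_text())
            config = compare_configs(saved, config, allow_config_override)
        config_path.write_text(json.dumps(config))

    for stage in STAGES:
        path = checkpoint_dir / f"{stage}.json" if checkpoint_dir is not None else None
        if path is not None and path.exists():
            # Stage finished in an earlier run: reload instead of recomputing.
            result = json.loads(path.read_text())
        else:
            result = {"stage": stage, "computed": True}  # placeholder for real work
            if path is not None:
                path.write_text(json.dumps(result))
        yield StageResult(name=stage, result=result, path=path)
```

A caller can iterate over run_stages(config, checkpoint_dir) and branch on each result's name, using result in memory and path only when it is not None. Whether already-completed stages should be yielded again on resume is a design choice this sketch makes for simplicity.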

Additional thoughts:
There is a bit of a game of ping pong with use_ray and whether we are passing an ObjectRef or the objects themselves. Even if we always use ray and get rid of that boolean, there will be some of this. Checkpointing is also going to suffer from this a bit, as it becomes the main container that moves the inputs along the pipeline. We anticipated this, and I'm not sure there is a clearly correct solution. I suggest we push forward until everything is updated to use quivr tables at the edges and checkpointing is complete; hopefully a pattern will emerge then.
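
One possible way to contain the ObjectRef-versus-object ping pong is a single helper that stages call on their inputs. This is only a sketch of the idea and not something this PR implements. The name resolve is hypothetical; ray.ObjectRef and ray.get are the only ray calls it relies on, and the guarded import just lets the snippet run where ray is not installed.

```python
try:
    import ray
except ImportError:  # keep the sketch usable without ray installed
    ray = None


def resolve(value):
    """Return the underlying object whether `value` is a ray ObjectRef
    or already a plain in-memory object (hypothetical helper)."""
    if ray is not None and isinstance(value, ray.ObjectRef):
        # Blocks until the object is available in the object store.
        return ray.get(value)
    return value
```

With something like this, a stage can call resolve(observations) once at its entry point and not care whether use_ray was set upstream.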

moeyensj · 2023-11-15T15:57:52Z

Thanks for this @akoumjian. I'm going to merge it and fix typing issues in a later PR.

akoumjian changed the base branch from main to v2.0-link-aims-sample November 1, 2023 13:50

akoumjian force-pushed the checkpointing branch from 830aa26 to 70751e2 Compare November 1, 2023 13:56

Base automatically changed from v2.0-link-aims-sample to main November 2, 2023 13:52

akoumjian force-pushed the checkpointing branch from bf54de9 to c990789 Compare November 6, 2023 16:49

akoumjian changed the base branch from main to v2.0-fitted-orbits November 6, 2023 16:49

akoumjian changed the title from "WIP: Checkpointing" to "Checkpointing" Nov 6, 2023

akoumjian force-pushed the checkpointing branch from c990789 to 21f75c3 Compare November 8, 2023 15:59

moeyensj reviewed Nov 9, 2023

Review thread on thor/main.py (resolved)

moeyensj force-pushed the v2.0-fitted-orbits branch from f7bf0ec to 797a022 Compare November 9, 2023 20:54

Base automatically changed from v2.0-fitted-orbits to main November 9, 2023 21:00

akoumjian force-pushed the checkpointing branch from 71216d3 to 8ad99b8 Compare November 9, 2023 21:34

Use checkpointing and working directory

4b7fbe6

akoumjian force-pushed the checkpointing branch from 574294f to 4b7fbe6 Compare November 15, 2023 01:39

moeyensj merged commit 4971bd5 into main Nov 15, 2023
0 of 3 checks passed

moeyensj deleted the checkpointing branch November 15, 2023 15:58
