make --fast robust against credential or wheel updates #4289

cg505 · 2024-11-07T22:31:44Z

Introduce a "config hash", which is a deterministic hash of the ray cluster config and all the referenced file mounts. We expect this value to stay the same when running launch on an existing cluster if all else is equal.
Store the current config hash for each cluster into the global_user_state DB.
Add a flag to provisioning that allows the provisioning path to short-circuit if the calculated config hash is the same as the one that should already be present on the cluster.
Use this new path for --fast.

The result is that --fast will reprovision the cluster if some important things change (such as new cloud credentials or a new version of the skypilot wheel).

However, this does cause some performance regression in the --fast case since we need to go through the initial provisioning stages. That costs about 4s on my machine. I am looking into optimizing this - it's mostly an unnecessary roundtrip checking that the cluster is up.
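For readers following along, here is a minimal sketch of what a deterministic config hash could look like. It is not the code in this PR: the PR hashes the generated cluster YAML via `_deterministic_cluster_yaml_hash`, whereas this sketch takes an already-parsed dict, and the names `sketch_config_hash` and `_update_with_path` are hypothetical. The idea it illustrates is that both the config and the contents of every referenced file mount feed the hash, in a stable order.

```python
import hashlib
import json
import os
from typing import Any, Dict


def _update_with_path(hasher, path: str) -> None:
    """Feed a file or a directory tree into the hasher in a stable order."""
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            dirs.sort()  # Make the traversal order deterministic.
            for name in sorted(files):
                full = os.path.join(root, name)
                hasher.update(os.path.relpath(full, path).encode())
                with open(full, 'rb') as f:
                    hasher.update(f.read())
    else:
        with open(path, 'rb') as f:
            hasher.update(f.read())


def sketch_config_hash(config: Dict[str, Any]) -> str:
    """Hash the config plus the contents of every local file mount.

    Assumes config['file_mounts'] maps remote paths to local paths.
    """
    hasher = hashlib.sha256()
    # sort_keys makes the serialization independent of dict ordering.
    hasher.update(json.dumps(config, sort_keys=True).encode())
    file_mounts = config.get('file_mounts', {})
    for remote in sorted(file_mounts):
        hasher.update(remote.encode())
        _update_with_path(hasher, os.path.expanduser(file_mounts[remote]))
    return hasher.hexdigest()
```

Under these assumptions, rewriting a mounted credential file changes the hash even though the config itself is unchanged, which is the property --fast relies on.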


Michaelvll

Thanks for adding this @cg505! It looks mostly good to me. Left several comments

Michaelvll · 2024-11-11T21:58:39Z

sky/backends/backend.py

+        stream_logs: bool,
+        cluster_name: str,
+        retry_until_up: bool = False,
+        skip_if_config_hash_matches: Optional[str] = None


I'm a bit hesitant to add a new argument to the API. Do we really need it?

I was trying to avoid it, but I don't really see another good way to do this. We need to somehow get this deep into the provision call stack.


Michaelvll

Thanks for the update @cg505! It looks mostly good to me! Left several comments, mostly nits.

Can we also add a smoke test for --fast?


Michaelvll · 2024-11-13T00:25:14Z

sky/backends/cloud_vm_ray_backend.py

+                        task, to_provision_config, dryrun, stream_logs,
+                        skip_if_config_hash_matches)


Should we just store the skip_if_no_updates in the to_provision_config and read the bool from the underlying functions?

I feel like it complicates the to_provision_config too much.
Currently, to_provision_config is set from _check_existing_cluster and it's not modified anywhere else, which is helpful to reason about. It seems irrelevant to pass skip_if_no_updates into _check_existing_cluster, and bad to modify to_provision_config once it has been created.

We could move this code on L2755-2759 into provision_with_retries, and pass skip_if_no_updates directly. But this complicates provision_with_retries, which currently only deals with cloud-specific retry things.

So my opinion is that this is the cleanest way to do it, but I don't feel that strongly.

It seems we do store the prev_config_hash in to_provision_config at the moment. Should we just pass the skip_if_no_cluster_updates to the underlying function, instead of the hash itself?

Yeah, this is what I meant by the second paragraph in my comment above. It means that provision_with_retries has to handle more business logic besides just setting up retries and calling provision. But we could do it.

I am leaning towards pushing down the functionality, as we are only comparing it in that function, i.e. keeping the function deep.
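To make the "push it down" option concrete, here is a rough sketch of the comparison living inside the provisioning function, with only a boolean passed in from above. The function name, its signature, and the `do_provision` callback are hypothetical stand-ins and do not mirror the real `provision_with_retries`.

```python
from typing import Callable, Optional


def sketch_provision(config_hash: Optional[str],
                     prev_config_hash: Optional[str],
                     skip_if_no_cluster_updates: bool,
                     do_provision: Callable[[], None]) -> bool:
    """Return True if provisioning ran, False if it was skipped."""
    # Skip only when the caller opted in, the new hash could be computed,
    # and it matches the hash recorded for the existing cluster.
    if (skip_if_no_cluster_updates and config_hash is not None and
            config_hash == prev_config_hash):
        return False
    do_provision()
    return True
```

With this shape, callers never handle the hash directly; a missing hash on either side simply falls through to normal provisioning.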


Michaelvll

Thank you for the updates @cg505! It looks mostly good to me.


Michaelvll · 2024-11-16T01:52:52Z

sky/global_user_state.py

+        'select name, launched_at, handle, last_use, status, autostop, '
+        'metadata, to_down, owner, cluster_hash, storage_mounts_metadata, '
+        'cluster_ever_up, config_hash from clusters order by launched_at desc'


Can we have constants for all the columns? See: https://github.com/skypilot-org/skypilot/blob/master/sky/jobs/state.py#L131-L164

Sure, I'll do it in a separate PR. I made this same change in #4332 so this is already in master.


Michaelvll

Thanks @cg505! It looks mostly good to me. Left some comments mostly for readability and code style.

We should make sure the backward compatibility tests pass. Also, can we add a quick smoke test for --fast?

Michaelvll · 2024-11-24T19:13:25Z

sky/backends/backend.py

+            to_provision: Resource config to provision. Should only be None if
+                cluster_name refers to an existing cluster, whose resources will
+                be used.


This does not necessarily need to be True.

Suggested change

-            to_provision: Resource config to provision. Should only be None if
-                cluster_name refers to an existing cluster, whose resources will
-                be used.
+            to_provision: Resource config to provision.

Hmm... if to_provision is None and cluster_name refers to a non-existent cluster, what do we even do? We have no resource information then.

Michaelvll · 2024-11-24T19:18:37Z

sky/backends/backend_utils.py

+        - 'config_hash': Hash of the cluster config and file mounts
+          contents. Can be missing if we failed to calculate the hash for some
+          reason


Suggested change

-        - 'config_hash': Hash of the cluster config and file mounts
-          contents. Can be missing if we failed to calculate the hash for some
-          reason
+        - 'config_hash': Hash of the cluster config and file mounts
+          contents. Can be missing if we fail to calculate the hash for some
+          reason, which is fine, i.e., we do not do the provisioning
+          optimization.

I think we should not expect this to fail. It's fine with me to catch the exception in case it does fail, to avoid blocking the user, but it is still bad and unexpected, and will lead to degraded performance for the user.

Michaelvll · 2024-11-24T19:20:31Z

sky/backends/backend_utils.py

+        config_dict['config_hash'] = _deterministic_cluster_yaml_hash(
+            tmp_yaml_path)
+    except Exception as e:  # pylint: disable=broad-except
+        logger.warning(f'Failed to calculate config_hash: {e}')


We can keep this in debug as a user may not need to know the details.

See above comment #4289 (comment). I think this is pretty bad and a warning is appropriate.
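The behavior being debated can be sketched as follows: a hash failure is logged at warning level and the key is simply left out, so the later comparison cannot match and provisioning proceeds as usual. The helper names and the `compute_hash` callback are hypothetical; only the try/except shape is taken from the diff above.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger(__name__)


def sketch_add_config_hash(config_dict: Dict[str, Any],
                           compute_hash: Callable[[], str]) -> None:
    """Record the hash if possible; never block the user on a failure."""
    try:
        config_dict['config_hash'] = compute_hash()
    except Exception as e:  # pylint: disable=broad-except
        # Unexpected, and it disables the --fast optimization, hence a
        # warning rather than a debug message.
        logger.warning(f'Failed to calculate config_hash: {e}')


def sketch_can_skip(config_dict: Dict[str, Any],
                    prev_hash: Optional[str]) -> bool:
    """A missing hash never matches, so provisioning is not skipped."""
    new_hash = config_dict.get('config_hash')
    return new_hash is not None and new_hash == prev_hash
```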


Michaelvll · 2024-11-24T19:32:51Z

sky/execution.py

+                stream_logs=stream_logs,
+                cluster_name=cluster_name,
+                retry_until_up=retry_until_up,
+                skip_if_no_cluster_updates=skip_unnecessary_provisioning)


How about we keep the argument name the same through the functions, to keep it easier to read?


cg505 added 3 commits November 6, 2024 17:09

add config_dict['config_hash'] output to write_cluster_config

d833bcb

fix docstring for write_cluster_config

fab22d0

This used to be true, but since skypilot-org#2943, 'ray' is the only provisioner. Add other keys that are now present instead.

when using --fast, check if config_hash matches, and if not, provision

0ce6821

cg505 force-pushed the config-hash branch from ef81fb0 to 0ce6821 Compare November 7, 2024 23:49

cg505 added 2 commits November 7, 2024 20:29

mock hashing method in unit test

9ecdcf1

This is needed since some files in the fake file mounts don't actually exist, like the wheel path.

Merge branch 'master' into config-hash

ed414b3

Michaelvll self-requested a review November 11, 2024 06:41

cg505 mentioned this pull request Nov 11, 2024

[ux] cache cluster status of autostop or spot clusters for 2s #4332

Merged


Michaelvll reviewed Nov 11, 2024


cg505 added 3 commits November 12, 2024 14:31

check config hash within provision with lock held

90ddda0

address other PR review comments

d12ca45

Merge branch 'master' of github.com:skypilot-org/skypilot into config-hash

8bf62ab

cg505 requested a review from Michaelvll November 12, 2024 23:13

Michaelvll reviewed Nov 13, 2024


cg505 mentioned this pull request Nov 13, 2024

remove empty file mount from yaml config #4351

Open

cg505 and others added 3 commits November 13, 2024 15:53

rename to skip_if_no_cluster_updates

1b020cc

Co-authored-by: Zhanghao Wu

add assert details

c1fd5ce

Co-authored-by: Zhanghao Wu

address PR comments and update docstrings

6282697

cg505 force-pushed the config-hash branch from 9a75e5b to 6282697 Compare November 14, 2024 00:40

cg505 requested a review from Michaelvll November 14, 2024 00:41

cg505 added 2 commits November 14, 2024 20:33

fix test

ccadf28

Merge branch 'master' of github.com:skypilot-org/skypilot into config-hash

0da9764

cg505 mentioned this pull request Nov 15, 2024

[UX] Launch on existing cluster should be very fast #4157

Open

Michaelvll reviewed Nov 16, 2024


cg505 and others added 3 commits November 15, 2024 19:15

Merge branch 'master' of github.com:skypilot-org/skypilot into config-hash

7877045

update docstrings

0a898b6

Co-authored-by: Zhanghao Wu

address PR comments

143295d

cg505 requested a review from Michaelvll November 16, 2024 03:40

fix lint and tests

1688a84

Michaelvll approved these changes Nov 24, 2024


Update sky/backends/cloud_vm_ray_backend.py

787c3fc

Co-authored-by: Zhanghao Wu

cg505 requested a review from Michaelvll December 3, 2024 21:27

refactor skip_if_no_cluster_update var

df321b8
