-
Notifications
You must be signed in to change notification settings - Fork 518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make --fast robust against credential or wheel updates #4289
base: master
Are you sure you want to change the base?
Conversation
This used to be true, but since skypilot-org#2943, 'ray' is the only provisioner. Add other keys that are now present instead.
This is needed since some files in the fake file mounts don't actually exist, like the wheel path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this @cg505! It looks mostly good to me. Left several comments
sky/backends/backend.py
Outdated
stream_logs: bool, | ||
cluster_name: str, | ||
retry_until_up: bool = False, | ||
skip_if_config_hash_matches: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit hesitating to add a new argument to the API. Do we really need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to avoid it, but I don't really see another good way to do this. We need to somehow get this deep into the provision call stack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @cg505! It looks mostly good to me! Left several comments, mostly nits.
Can we also add a smoke test for --fast
?
sky/backends/cloud_vm_ray_backend.py
Outdated
task, to_provision_config, dryrun, stream_logs, | ||
skip_if_config_hash_matches) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just store the skip_if_no_updates
in the to_provision_config
and read the bool from the underlying functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like it complicates the to_provision_config
too much.
Currently, to_provision_config
is set from _check_existing_cluster
and it's not modified anywhere else, which is helpful to reason about. It seems irrelevant to pass skip_if_no_updates
into _check_existing_cluster
, and bad to modify to_provision_config
once it has been created.
We could move this code on L2755-2759 into provision_with_retries
, and pass skip_if_no_updates
directly. But this complicates provision_with_retries
, which currently only deals with cloud-specific retry things.
So my opinion is that this is the cleanest way to do it, but I don't feel that strongly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we do store the prev_config_hash
in to_provision_config
at the moment. Should we just pass the skip_if_no_cluster_updates
to the underlying function, instead of the hash itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is what I meant by the second paragraph in my comment above. It means that provision_with_retries has to handle more business logic besides just setting up retries and calling provision. But we could do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am leaning towards to pushing down the functionality as we are only comparing it in that function, i.e. keeping the function deep.
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the updates @cg505! It looks mostly good to me.
sky/backends/cloud_vm_ray_backend.py
Outdated
task, to_provision_config, dryrun, stream_logs, | ||
skip_if_config_hash_matches) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we do store the prev_config_hash
in to_provision_config
at the moment. Should we just pass the skip_if_no_cluster_updates
to the underlying function, instead of the hash itself?
sky/global_user_state.py
Outdated
'select name, launched_at, handle, last_use, status, autostop, ' | ||
'metadata, to_down, owner, cluster_hash, storage_mounts_metadata, ' | ||
'cluster_ever_up, config_hash from clusters order by launched_at desc' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have a constants for all the columns, see: https://github.com/skypilot-org/skypilot/blob/master/sky/jobs/state.py#L131-L164
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll do it in a separate PR. I made this same change in #4332 so this is already in master.
Co-authored-by: Zhanghao Wu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cg505! It looks mostly good to me. Left some comments mostly for readability and code style.
We should make sure the backward compatibility tests pass. Also, can we add a quick smoke test to test this --fast
.
to_provision: Resource config to provision. Should only be None if | ||
cluster_name refers to an existing cluster, whose resources will | ||
be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not necessarily need to be True.
to_provision: Resource config to provision. Should only be None if | |
cluster_name refers to an existing cluster, whose resources will | |
be used. | |
to_provision: Resource config to provision. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... if to_provision is None and cluster_name refers to a non-existent cluster, what do we even do? We have no resource information then.
- 'config_hash': Hash of the cluster config and file mounts | ||
contents. Can be missing if we failed to calculate the hash for some | ||
reason |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 'config_hash': Hash of the cluster config and file mounts | |
contents. Can be missing if we failed to calculate the hash for some | |
reason | |
- 'config_hash': Hash of the cluster config and file mounts | |
contents. Can be missing if we fail to calculate the hash for some | |
reason, which is fine, i.e., we do not do the provisioning | |
optimization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should not expect this to fail. It's fine with me to catch the exception in case it does fail, to avoid blocking the user, but it is still bad and unexpected, and will lead to degraded performance for the user.
config_dict['config_hash'] = _deterministic_cluster_yaml_hash( | ||
tmp_yaml_path) | ||
except Exception as e: # pylint: disable=broad-except | ||
logger.warning(f'Failed to calculate config_hash: {e}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can keep this in debug
as a user may not need to know the details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above comment #4289 (comment). I think this is pretty bad and a warning is appropriate.
sky/backends/cloud_vm_ray_backend.py
Outdated
task, to_provision_config, dryrun, stream_logs, | ||
skip_if_config_hash_matches) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am leaning towards to pushing down the functionality as we are only comparing it in that function, i.e. keeping the function deep.
sky/execution.py
Outdated
stream_logs=stream_logs, | ||
cluster_name=cluster_name, | ||
retry_until_up=retry_until_up, | ||
skip_if_no_cluster_updates=skip_unnecessary_provisioning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we keep the argument name the same through the functions, to keep it easier to read?
Co-authored-by: Zhanghao Wu <[email protected]>
The result is that --fast will reprovision the cluster if some important things change (such as new cloud credentials or a new version of the skypilot wheel).
However, this does cause some performance regression in the --fast case since we need to go through the initial provisioning stages. That costs about 4s on my machine. I am looking into optimizing this - it's mostly an unnecessary roundtrip checking that the cluster is up.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_minimal
pytest tests/test_smoke.py:: --managed-jobs
conda deactivate; bash -i tests/backward_compatibility_tests.sh