Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make --fast robust against credential or wheel updates #4289

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

cg505
Copy link
Collaborator

@cg505 cg505 commented Nov 7, 2024

  • Introduce a "config hash", which is a deterministic hash of the ray cluster config and all the referenced file mounts. We expect this value to stay the same when running launch on an existing cluster if all else is equal.
  • Store the current config hash for each cluster into the global_user_state DB.
  • Add a flag to provisioning that allows the provisioning path to short-circuit if the calculated config hash is the same as the one that should already be present on the cluster.
  • Use this new path for --fast.

The result is that --fast will reprovision the cluster if some important things change (such as new cloud credentials or a new version of the skypilot wheel).

However, this does cause some performance regression in the --fast case since we need to go through the initial provisioning stages. That costs about 4s on my machine. I am looking into optimizing this - it's mostly an unnecessary roundtrip checking that the cluster is up.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Manually test that updating credential files is caught
    • Manually test that new skypilot wheel is caught
    • Test happy path
    • Test for managed jobs controller
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests:
    • pytest tests/test_smoke.py::test_minimal
    • pytest tests/test_smoke.py:: --managed-jobs
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

This is needed since some files in the fake file mounts don't actually exist,
like the wheel path.
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this @cg505! It looks mostly good to me. Left several comments

stream_logs: bool,
cluster_name: str,
retry_until_up: bool = False,
skip_if_config_hash_matches: Optional[str] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit hesitating to add a new argument to the API. Do we really need it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to avoid it, but I don't really see another good way to do this. We need to somehow get this deep into the provision call stack.

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/execution.py Outdated Show resolved Hide resolved
sky/execution.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @cg505! It looks mostly good to me! Left several comments, mostly nits.

Can we also add a smoke test for --fast?

sky/backends/backend.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
Comment on lines 2803 to 2804
task, to_provision_config, dryrun, stream_logs,
skip_if_config_hash_matches)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just store the skip_if_no_updates in the to_provision_config and read the bool from the underlying functions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it complicates the to_provision_config too much.
Currently, to_provision_config is set from _check_existing_cluster and it's not modified anywhere else, which is helpful to reason about. It seems irrelevant to pass skip_if_no_updates into _check_existing_cluster, and bad to modify to_provision_config once it has been created.

We could move this code on L2755-2759 into provision_with_retries, and pass skip_if_no_updates directly. But this complicates provision_with_retries, which currently only deals with cloud-specific retry things.

So my opinion is that this is the cleanest way to do it, but I don't feel that strongly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we do store the prev_config_hash in to_provision_config at the moment. Should we just pass the skip_if_no_cluster_updates to the underlying function, instead of the hash itself?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is what I meant by the second paragraph in my comment above. It means that provision_with_retries has to handle more business logic besides just setting up retries and calling provision. But we could do it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am leaning towards to pushing down the functionality as we are only comparing it in that function, i.e. keeping the function deep.

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates @cg505! It looks mostly good to me.

sky/backends/backend.py Outdated Show resolved Hide resolved
sky/backends/backend.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
Comment on lines 2803 to 2804
task, to_provision_config, dryrun, stream_logs,
skip_if_config_hash_matches)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we do store the prev_config_hash in to_provision_config at the moment. Should we just pass the skip_if_no_cluster_updates to the underlying function, instead of the hash itself?

Comment on lines 619 to 621
'select name, launched_at, handle, last_use, status, autostop, '
'metadata, to_down, owner, cluster_hash, storage_mounts_metadata, '
'cluster_ever_up, config_hash from clusters order by launched_at desc'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll do it in a separate PR. I made this same change in #4332 so this is already in master.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cg505! It looks mostly good to me. Left some comments mostly for readability and code style.

We should make sure the backward compatibility tests pass. Also, can we add a quick smoke test to test this --fast.

Comment on lines +61 to +63
to_provision: Resource config to provision. Should only be None if
cluster_name refers to an existing cluster, whose resources will
be used.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not necessarily need to be True.

Suggested change
to_provision: Resource config to provision. Should only be None if
cluster_name refers to an existing cluster, whose resources will
be used.
to_provision: Resource config to provision.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... if to_provision is None and cluster_name refers to a non-existent cluster, what do we even do? We have no resource information then.

Comment on lines +654 to +656
- 'config_hash': Hash of the cluster config and file mounts
contents. Can be missing if we failed to calculate the hash for some
reason
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- 'config_hash': Hash of the cluster config and file mounts
contents. Can be missing if we failed to calculate the hash for some
reason
- 'config_hash': Hash of the cluster config and file mounts
contents. Can be missing if we fail to calculate the hash for some
reason, which is fine, i.e., we do not do the provisioning
optimization.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not expect this to fail. It's fine with me to catch the exception in case it does fail, to avoid blocking the user, but it is still bad and unexpected, and will lead to degraded performance for the user.

config_dict['config_hash'] = _deterministic_cluster_yaml_hash(
tmp_yaml_path)
except Exception as e: # pylint: disable=broad-except
logger.warning(f'Failed to calculate config_hash: {e}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep this in debug as a user may not need to know the details.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment #4289 (comment). I think this is pretty bad and a warning is appropriate.

Comment on lines 2803 to 2804
task, to_provision_config, dryrun, stream_logs,
skip_if_config_hash_matches)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am leaning towards to pushing down the functionality as we are only comparing it in that function, i.e. keeping the function deep.

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/execution.py Outdated
stream_logs=stream_logs,
cluster_name=cluster_name,
retry_until_up=retry_until_up,
skip_if_no_cluster_updates=skip_unnecessary_provisioning)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we keep the argument name the same through the functions, to keep it easier to read?

@cg505 cg505 requested a review from Michaelvll December 3, 2024 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants