Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race between status update and instance creation can cause resource leak #4431

Closed
cg505 opened this issue Dec 3, 2024 · 1 comment
Closed
Assignees

Comments

@cg505
Copy link
Collaborator

cg505 commented Dec 3, 2024

The cloud may not immediately reflect instance creation after sending the request. This can lead to the following race:

  1. [process 1] sky launch a new cluster
  2. [process 1] sky launch grabs the cluster lock
  3. [process 1] sky launch adds cluster to the state database as INIT
  4. [process 2] sky status -r is launched and tries to grab the cluster lock
  5. [process 1] sky launch makes the cloud API calls to create the instances
  6. [process 1] sky launch immediately dies (e.g. is SIGKILLed), maybe even before getting a response from the cloud API. The lock is released.
  7. [process 2] sky status -r obtains the lock
  8. [process 2] sky status -r queries the instances, which the cloud API does not show yet
  9. The cloud creates the instances. Future queries will show them.
  10. [process 2] Since no instances were found, sky status -r thinks the cluster is terminated and deletes it from the state database.

Apparently AWS takes some very small amount of time to reflect instance state after instance creation.
I was able to reproduce in AWS by crashing sky launch as soon as the create instance request is sent (step 6 in the above race description).
Other clouds may have similar behaviors. We really can't ensure the request is received before the lock is released in this case. For instance, the create instance request may still be on the wire and could be racing with the list instance request.

This is pretty hard to fix reliably. However, I think we can reasonably assume there's very limited set of cases where this could arise:

  • We are terminating or checking the status of a cluster that was recently created. (Say, in the past 60s.)
    • Typically there is not a lot of time between cluster "creation" (adding to state database) and actually creating the instances, but we may want to tune this.
  • The cluster is INIT, but we don't see any instances in the cloud. Also, we don't see any previously terminated instances since the cluster was created.

In this case, we can just wait a very short amount of time (maybe as short as 1 second, but hard to know) and double check that the instances do not exist.

@cg505
Copy link
Collaborator Author

cg505 commented Dec 3, 2024

This is possibly the cause of #4410.

@cg505 cg505 self-assigned this Dec 3, 2024
cg505 added a commit to cg505/skypilot that referenced this issue Dec 5, 2024
cg505 added a commit that referenced this issue Dec 8, 2024
* if a newly-created cluster is missing from the cloud, wait before deleting

Addresses #4431.

* confirm cluster actually terminates before deleting from the db

* avoid deleting cluster data outside the primary provision loop

* tweaks

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* use usage_intervals for new cluster detection

get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.

* fix terminating/stopping state for Lambda and Paperspace

* Revert "use usage_intervals for new cluster detection"

This reverts commit aa6d2e9.

* check cloud.STATUS_VERSION before calling query_instances

* avoid try/catch when querying instances

* update comments

---------

Co-authored-by: Zhanghao Wu <[email protected]>
zpoint pushed a commit to zpoint/skypilot that referenced this issue Dec 9, 2024
…g#4443)

* if a newly-created cluster is missing from the cloud, wait before deleting

Addresses skypilot-org#4431.

* confirm cluster actually terminates before deleting from the db

* avoid deleting cluster data outside the primary provision loop

* tweaks

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* use usage_intervals for new cluster detection

get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.

* fix terminating/stopping state for Lambda and Paperspace

* Revert "use usage_intervals for new cluster detection"

This reverts commit aa6d2e9.

* check cloud.STATUS_VERSION before calling query_instances

* avoid try/catch when querying instances

* update comments

---------

Co-authored-by: Zhanghao Wu <[email protected]>
@cg505 cg505 closed this as completed Dec 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant