Race between status update and instance creation can cause resource leak #4431

cg505 · 2024-12-03T03:12:11Z

The cloud may not immediately reflect instance creation after sending the request. This can lead to the following race:

1. [process 1] sky launch a new cluster
2. [process 1] sky launch grabs the cluster lock
3. [process 1] sky launch adds cluster to the state database as INIT
4. [process 2] sky status -r is launched and tries to grab the cluster lock
5. [process 1] sky launch makes the cloud API calls to create the instances
6. [process 1] sky launch immediately dies (e.g. is SIGKILLed), maybe even before getting a response from the cloud API. The lock is released.
7. [process 2] sky status -r obtains the lock
8. [process 2] sky status -r queries the instances, which the cloud API does not show yet
9. The cloud creates the instances. Future queries will show them.
10. [process 2] Since no instances were found, sky status -r thinks the cluster is terminated and deletes it from the state database.

Apparently AWS takes a very small amount of time to reflect a new instance's state after it is created.
I was able to reproduce this on AWS by crashing sky launch as soon as the create instance request is sent (step 6 in the above race description).
Other clouds may behave similarly. In this case we really can't ensure the request has been received before the lock is released. For instance, the create instance request may still be on the wire and could be racing with the list instances request.
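
To make the visibility lag concrete, here is a toy model of the race. It is only a sketch: `FakeCloud`, `visibility_delay`, `simulate_race`, and the explicit `now` clock are hypothetical names for illustration, not SkyPilot or AWS APIs, and the 0.5s delay is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Toy eventually consistent cloud (hypothetical, not a real API)."""
    visibility_delay: float  # seconds before a created instance is listable
    _created_at: dict = field(default_factory=dict)

    def create_instance(self, name: str, now: float) -> None:
        # The request is accepted at `now`, but list calls won't show it yet.
        self._created_at[name] = now

    def list_instances(self, now: float) -> list:
        # Only instances older than the visibility delay are returned.
        return sorted(name for name, created in self._created_at.items()
                      if now - created >= self.visibility_delay)


def simulate_race(visibility_delay: float = 0.5) -> dict:
    cloud = FakeCloud(visibility_delay=visibility_delay)
    # Process 1 records the cluster, sends the create request, then dies.
    state_db = {'my-cluster': 'INIT'}
    cloud.create_instance('my-cluster-head', now=0.0)
    # Process 2 refreshes right afterwards and sees no instances.
    seen_by_refresh = cloud.list_instances(now=0.1)
    if not seen_by_refresh:
        del state_db['my-cluster']  # cluster is forgotten locally...
    # ...but a later query shows the instance, which is now leaked.
    leaked = cloud.list_instances(now=1.0)
    return {'state_db': state_db,
            'seen_by_refresh': seen_by_refresh,
            'leaked': leaked}
```

In this model the state database ends up empty while the instance still exists, which is the resource leak described above.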

This is pretty hard to fix reliably. However, I think we can reasonably assume there's a very limited set of cases where this could arise:

1. We are terminating or checking the status of a cluster that was recently created (say, in the past 60s).
   - Typically there is not a lot of time between cluster "creation" (adding to state database) and actually creating the instances, but we may want to tune this.
2. The cluster is INIT, but we don't see any instances in the cloud. Also, we don't see any previously terminated instances since the cluster was created.

In this case, we can just wait a very short amount of time (maybe as short as 1 second, but hard to know) and double check that the instances do not exist.
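
A minimal sketch of that wait-and-recheck idea follows. It is not the actual SkyPilot implementation: `confirm_no_instances`, its parameters, and the injected `query_instances` callable are hypothetical, the 60s window and 1s delay are just the rough numbers mentioned above, and the check for previously terminated instances is left out for brevity.

```python
import time
from typing import Callable, List


def confirm_no_instances(
    query_instances: Callable[[], List[str]],
    cluster_age_seconds: float,
    new_cluster_window: float = 60.0,  # assumed "recently created" cutoff
    recheck_delay: float = 1.0,        # assumed wait before the second query
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if the cluster can be treated as terminated."""
    if query_instances():
        # Instances are visible, so the cluster is definitely not gone.
        return False
    if cluster_age_seconds >= new_cluster_window:
        # Old cluster: an empty listing is trusted as-is.
        return True
    # New cluster: the cloud may not reflect the create request yet,
    # so wait briefly and query a second time before giving up on it.
    sleep(recheck_delay)
    return not query_instances()
```

The caller would only delete the cluster from the state database when this returns True. Passing `sleep` as a parameter keeps the sketch testable without real delays.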

cg505 · 2024-12-03T03:29:52Z

This is possibly the cause of #4410.


* if a newly-created cluster is missing from the cloud, wait before deleting
  Addresses #4431.
* confirm cluster actually terminates before deleting from the db
* avoid deleting cluster data outside the primary provision loop
* tweaks
* Apply suggestions from code review
  Co-authored-by: Zhanghao Wu <[email protected]>
* use usage_intervals for new cluster detection
  get_cluster_duration will include the total duration of the cluster since its initial launch, while launched_at may be reset by sky launch on an existing cluster. So this is a more accurate method to check.
* fix terminating/stopping state for Lambda and Paperspace
* Revert "use usage_intervals for new cluster detection"
  This reverts commit aa6d2e9.
* check cloud.STATUS_VERSION before calling query_instances
* avoid try/catch when querying instances
* update comments

Co-authored-by: Zhanghao Wu <[email protected]>


cg505 self-assigned this Dec 3, 2024

cg505 added a commit to cg505/skypilot that referenced this issue Dec 5, 2024

if a newly-created cluster is missing from the cloud, wait before del…

4daf500

…eting Addresses skypilot-org#4431.

cg505 mentioned this issue Dec 5, 2024

[robustness] cover some potential resource leakage cases #4443

Merged

5 tasks

cg505 closed this as completed Dec 9, 2024
