Race between status update and instance creation can cause resource leak #4431
Comments
This is possibly the cause of #4410.
cg505 added a commit to cg505/skypilot that referenced this issue on Dec 5, 2024
cg505 added a commit that referenced this issue on Dec 8, 2024:
* if a newly-created cluster is missing from the cloud, wait before deleting
  Addresses #4431.
* confirm cluster actually terminates before deleting from the db
* avoid deleting cluster data outside the primary provision loop
* tweaks
* Apply suggestions from code review
  Co-authored-by: Zhanghao Wu <[email protected]>
* use usage_intervals for new cluster detection
  get_cluster_duration will include the total duration of the cluster since its initial launch, while launched_at may be reset by sky launch on an existing cluster. So this is a more accurate method to check.
* fix terminating/stopping state for Lambda and Paperspace
* Revert "use usage_intervals for new cluster detection"
  This reverts commit aa6d2e9.
* check cloud.STATUS_VERSION before calling query_instances
* avoid try/catch when querying instances
* update comments

---------

Co-authored-by: Zhanghao Wu <[email protected]>
zpoint pushed a commit to zpoint/skypilot that referenced this issue on Dec 9, 2024:
…g#4443)
The cloud may not immediately reflect instance creation after sending the request. This can lead to the following race:
Apparently AWS takes some very small amount of time to reflect instance state after instance creation.
I was able to reproduce this on AWS by crashing `sky launch` as soon as the create instance request is sent (step 6 in the above race description).
Other clouds may have similar behaviors. We really can't ensure the request is received before the lock is released in this case. For instance, the create instance request may still be on the wire and could be racing with the list instance request.
This is pretty hard to fix reliably. However, I think we can reasonably assume there's a very limited set of cases where this could arise:
In this case, we can just wait a very short amount of time (maybe as short as 1 second, but it's hard to know) and double-check that the instances do not exist.
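
A rough sketch of that "wait and double-check" idea, in case it helps the discussion. This is not SkyPilot code: `confirm_instances_absent`, `list_instances`, `grace_period_s` and `retries` are all hypothetical names, and `list_instances` stands in for whatever cloud "list instances" call is used. The 1-second default is just the guess from above, not a measured value.

```python
import time
from typing import Callable, List


def confirm_instances_absent(
    list_instances: Callable[[], List[str]],
    grace_period_s: float = 1.0,
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if every check finds no instances.

    `list_instances` is a hypothetical stand-in for a cloud
    "list instances" query; it is not a real SkyPilot API.
    """
    for _ in range(retries):
        # If the cloud reports any instance, the cluster is not gone.
        if list_instances():
            return False
        # Give a possibly in-flight create request time to become visible.
        sleep(grace_period_s)
    # One final check after the last wait.
    return not list_instances()
```

The caller would only drop the cluster record when this returns True. Repeating the check a few times is an assumption on my part; a single delayed re-check may well be enough, and no finite wait can fully rule out a request that is still on the wire.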