Commit
* [perf] optimizations for sky jobs launch (#4341)
* cache AWS get_user_identities
  With SSO enabled (and maybe without?) this takes about a second. We already use an lru_cache for Azure, do the same here.
* skip optimization for sky jobs launch --yes
  The only reason we call optimize for jobs_launch is to give a preview of the resources we expect to use, and give the user an opportunity to back out if it's not what they expect. If you use --yes or -y, you don't have a chance to back out and you're probably running from a script, where you don't care. Optimization can take ~2 seconds, so just skip it.
* update logging
* address PR comments
* [ux] cache cluster status of autostop or spot clusters for 2s (#4332)
* add status_updated_at to DB
* don't refresh autostop/spot cluster if it's recently been refreshed
* update locking mechanism for status check to early exit
* address PR comments
* add warning about cluster status lock timeout
* [k8s] fix managed job issue on k8s (#4357)
  Signed-off-by: nkwangleiGIT <[email protected]>
* [Core] Add `NO_UPLOAD` for `remote_identity` (#4307)
* Add skip flag to remote_identity
* Rename to NO_UPLOAD
* Fixes
* lint
* comments
* Add comments
* lint
* Add Lambda's GH200 instance type (#4377)
  Add GH200 instance type
* [FluidStack] Fix provisioning and add new gpu types (#4359)
  [FluidStack] Fix provisioning and add new gpu types
* Add new `provisioning` status to fix failed deployments
* Add H100 SXM5 GPU mapping
* [ux] display human-readable name for controller (#4376)
* [k8s] Handle apt update log not existing (#4381)
  do not panic if file does not exist, it may be written soon
* Support event based smoke test instead of sleep time based to reduce flaky test and faster test (#4284)
* event based smoke test
* more event based smoke test
* more test cases
* more test cases with managed jobs
* bug fix
* bump up seconds
* merge master and resolve conflict
* restore sleep for fail test case
* [UX] user-friendly message shown if Kubernetes is not enabled. (#4336)
  try except
* [Jobs] Disable deduplication for logs (#4388)
  Disable dedup
* [OCI] set zone in the ProvisionRecord (#4383)
* fix: Add zone to the ProvisionRecord
* fix
* [Examples] Specify version for vllm cuz vllm v0.6.4.post1 has issue (#4391)
* [OCI] Specify vllm version because the latest vllm v0.6.4.post1 has issue
* version for vllm-flash-attn
* [docs] Specify compartment for OCI resources. (#4384)
* [docs] Specify compartment for OCI resources.
* Add link to compartment definition page
* [k8s] Improve multi-node provisioning time (nimbus) (#4393)
* Tracking k8s events with timeline
* Remove SSH wait
* Parallelize pod creation and status check
* Parallelize labelling, add docs on optimizing base image, bump default provision timeout
* More parallelization, batching and optimizations
* lint
* correctness
* Fix double launch bug
* fix num threads
* Add fd limit warning
* [k8s] Move setup and ray start to pod args to make them async (#4389)
* move scripts to args
* Avoid ray setup
* fix
* Add checks for ray healthiness
* remove bc installation
* wait for healthy
* add todo
* fix
* fix
* format
* format
* remove unnecessary logging
* print out error setup
* Add comment
* clean up the logging
* style
* Fixes for ubuntu images
* format
* remove unused comments
* Optimize ray start
* add comments
* Add comments
* Fix comments and logging
* missing end_epoch
* Add logging
* Longer timeout and trigger ray start
* Fixes for the ray port and AWS credential setup
* Update netcat-openbsd, comments
* _NUM_THREADS rename
* add num_nodes to calculate timeout
* lint
* revert
* use uv for pip install and for venv creation (#4394)
* use uv for pip install and for venv creation
  uv is a tool that can replace pip and venv (and some other stuff we're not using I think). It's written in Rust and in testing is significantly faster for many operations, especially things like `pip list` or `pip install skypilot` when skypilot or all its dependencies are already installed.
* add comment to SKY_PIP_CMD
* sudo handling for ray
* Add comment in dockerfile
* fix pod checks
* lint
---------
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Christopher Cooper <[email protected]>
* [Core] Skip worker ray start for multinode (#4390)
* Optimize ray start
* add comments
* update logging
* remove `uv` from runtime setup due to azure installation issue (#4401)
* [k8s] Skip listing all pods to speed up optimizer (#4398)
* Reduce API calls
* lint
* [k8s] Nimbus backward compatibility (#4400)
* Add nimbus backward compatibility
* add uv backcompat
* add uv backcompat
* add uv backcompat
* lint
* merge
* merge
* [Storage] Call `sync_file_mounts` when either rsync or storage file_mounts are specified (#4317)
  do file mounts if storage is specified
* [k8s] Support in-cluster and kubeconfig auth simultaneously (#4188)
* per-context SA + incluster auth fixes
* lint
* Support both incluster and kubeconfig
* wip
* Ignore kubeconfig when context is not specified, add su, mounting kubeconfig
* lint
* comments
* fix merge issues
* lint
* Fix Spot instance on Azure (#4408)
* [UX] Allow disabling ports in CLI (#4378)
  [UX] Allow disabling ports
* [AWS] Get rid of credential files if `remote_identity: SERVICE_ACCOUNT` specified (#4395)
* syntax
* minor
* Fix OD instance on Azure (#4411)
* [UX] Remove K80 and M60 from common GPU list (#4382)
* Remove K80 and M60 from GPU list
* Fix kubernetes instance type with space
* comments
* format
* format
* remove mi25
* Event based smoke tests -- managed jobs (#4386)
* event based smoke test
* more event based smoke test
* more test cases
* more test cases with managed jobs
* bug fix
* bump up seconds
* merge master and resolve conflict
* more test case
* support test_managed_jobs_pipeline_failed_setup
* support test_managed_jobs_recovery_aws
* managed job status
* bug fix
* test managed job cancel
* test_managed_jobs_storage
* more test cases
* resolve pr comment
* private member function
* bug fix
* interface change
* bug fix
* bug fix
* raise error on empty status
* [k8s] Fix in-cluster auth namespace fetching (#4420)
* Fix incluster auth namespace fetching
* Fixes
* [k8s] Update comparison page image (#4415)
  Update image
* Add a pre commit config to help format before pushing (#4258)
* pre commit config
* yapf version
* fix
* mypy check all files
* skip smoke_test.py
* add doc
* better format
* newline format
* sync with format.sh
* comment fix
* fix the pylint hook for pre-commit (#4422)
* fix the pylint hook
* remove default arg
* change name
* limit pylint files
* [k8s] Fix resources.image_id backward compatibility (#4425)
* Fix back compat
* Fix back compat for image_id + regions
* lint
* comments
* [Tests] Move tests to uv to speed up the dependency installation by >10x (#4424)
* correct cache for pypi
* Add doc cache and test cache
* Add examples folder
* fix policy path
* use uv for pylint
* Fix azure cli
* disable cache
* use venv
* set venv
* source instead
* rename doc build
* Move to uv
* Fix azure cli
* Add -e
* Update .github/workflows/format.yml
  Co-authored-by: Christopher Cooper <[email protected]>
* Update .github/workflows/mypy.yml
  Co-authored-by: Christopher Cooper <[email protected]>
* Update .github/workflows/pylint.yml
  Co-authored-by: Christopher Cooper <[email protected]>
* Update .github/workflows/pytest.yml
  Co-authored-by: Christopher Cooper <[email protected]>
* Update .github/workflows/test-doc-build.yml
  Co-authored-by: Christopher Cooper <[email protected]>
* fix pytest yml
* Add merge group
---------
Co-authored-by: Christopher Cooper <[email protected]>
* fix db
* fix launch
* remove transaction id
* format
* format
* format
* test doc build
* doc build
* update readme for test kubernetes example (#4426)
* update readme
* fetch version from gcloud
* rename var to GKE_VERSION
* subnetwork also use REGION
* format
* fix types
* fix
* format
* fix types
* [k8s] Fix `show-gpus` availability map when nvidia drivers are not installed (#4429)
* Fix availability map
* Fix availability map
* fix types
* avoid catching ValueError during failover (#4432)
* avoid catching ValueError during failover
  If the cloud api raises ValueError or a subclass of ValueError during instance termination, we will assume the cluster was downed. Fix this by introducing a new exception ClusterDoesNotExist that we can catch instead of the more general ValueError.
* add unit test
* lint
* [Core] Execute setup when `--detach-setup` and no `run` section (#4430)
* Execute setup when --detach-setup and no run section
* Update sky/backends/cloud_vm_ray_backend.py
  Co-authored-by: Tian Xia <[email protected]>
* add comments
* Fix types
* format
* minor
* Add test for detach setup only
---------
Co-authored-by: Tian Xia <[email protected]>
* wait for cleanup
* [Jobs] Allow logs for finished jobs and add `sky jobs logs --refresh` for restarting jobs controller (#4380)
* Stream logs for finished jobs
* Allow stream logs for finished jobs
* Read files after the indicator lines
* Add refresh for `sky jobs logs`
* fix log message
* address comments
* Add smoke test
* fix smoke
* fix jobs queue smoke test
* fix storage
* fix merge issue
* fix merge issue
* Fix merging issue
* format
---------
Signed-off-by: nkwangleiGIT <[email protected]>
Co-authored-by: Christopher Cooper <[email protected]>
Co-authored-by: Lei <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Cody Brownstein <[email protected]>
Co-authored-by: mjibril <[email protected]>
Co-authored-by: zpoint <[email protected]>
Co-authored-by: Hysun He <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
Co-authored-by: zpoint <[email protected]>
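The failover fix described under #4432 can be sketched as follows. Only the exception name `ClusterDoesNotExist` comes from the commit message; its base class here, and the helpers `terminate` and `teardown_during_failover`, are assumptions made for the example.

```python
class ClusterDoesNotExist(Exception):
    """Dedicated error; the base class used here is an assumption."""


def terminate(cluster_name, known_clusters, cloud_terminate):
    """Hypothetical helper: terminate a cluster through a cloud API callable."""
    if cluster_name not in known_clusters:
        raise ClusterDoesNotExist(cluster_name)
    # Any ValueError raised by the cloud API propagates to the caller.
    cloud_terminate(cluster_name)


def teardown_during_failover(cluster_name, known_clusters, cloud_terminate):
    """Treat only a missing cluster as already down."""
    try:
        terminate(cluster_name, known_clusters, cloud_terminate)
    except ClusterDoesNotExist:
        return 'already-gone'
    return 'terminated'
```

Catching the narrow exception means an unrelated `ValueError` from the cloud API surfaces as a failure instead of being read as "cluster was downed".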