SkyPilot v0.3.0
SkyPilot v0.3.0: LLM Support, New Clouds, Enhanced Production-Readiness
We are excited to release SkyPilot v0.3, the most significant release thus far in the project's history.
v0.3 focuses on:
- LLM support (Vicuna, LLaMA)
- New clouds (Lambda Cloud; IBM; Cloudflare R2)
- Enhanced production readiness
See the release blog post for a deep-dive into highlights.
Release notes below are as compared to v0.2 (full changelog).
Release Highlights
- LLM support
- Vicuna LLM chatbot trained using SkyPilot for $300 on spot instances!
- Full finetuning & serving YAMLs released here to build off of!
- Serve your own LLaMA LLM chatbot on any cloud: full example, blog, repo
- Significantly expanded GPU availability by leveraging the widest selection of clouds (see below)
- More clouds, more choices: delivering the highest GPU availability & cost savings
- Lambda Cloud is now supported!
- IBM Cloud is now supported!
- This brings the first hyperscaler cloud after AWS/GCP/Azure to SkyPilot. (#1598)
- Cloudflare R2 object store is now supported!
- This brings zero-egress cost object storage to SkyPilot. (#1736)
- To use it, see setup docs and usage docs.
- Managed Spot is made significantly more robust via a host of fixes/enhancements.
- Cluster leakage prevention and detection are significantly improved.
- CLI/API & Backend shipped many new features: `sky cost-report`; fine-grained optimizer; user identity; AWS SSO; private IP-only VPCs; Ray runtime is decoupled from user's Ray clusters; ...
CLI/API
New Features
- New CLI `sky cost-report`: show the estimated cost of launched clusters (#1301, #1621, #1780, #1680, #1788)
- Experimental: Costs for clusters with auto{stop,down} scheduled may not be accurate.
- New resource filtering support in `sky launch` / YAML `resources:` field
- Add `--detach-setup` and `--detach-run` to `sky launch` #1379
- Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes #1297
- Add autodown (#1217, #1254)
- Support calling `sky status` / `sky.status()` on specific clusters #1568
- Support `--region` in `sky show-gpus` #1187
- Support passing AMIs for different regions in `image_id` field under `resources` #1384
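To give a feel for the kind of estimate `sky cost-report` shows, here is a minimal sketch. It is not SkyPilot's implementation: it assumes the estimate is the hourly price times the time a cluster was up, and the `ClusterUsage` record, prices, and timestamps are all hypothetical. It also illustrates why a scheduled autostop/autodown can skew the number: if the stop time is unknown, the estimate keeps growing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterUsage:
    """Hypothetical usage record; not a SkyPilot data structure."""
    name: str
    hourly_price: float          # USD per hour for the whole cluster
    launched_at: float           # seconds since epoch
    stopped_at: Optional[float]  # None if the stop time is unknown


def estimated_cost(usage: ClusterUsage, now: float) -> float:
    """Estimate cost as hourly price x hours up (assumed formula)."""
    end = usage.stopped_at if usage.stopped_at is not None else now
    hours_up = max(end - usage.launched_at, 0.0) / 3600.0
    return usage.hourly_price * hours_up


# Made-up numbers: a cluster at $3.06/hour that was up for 2 hours.
known = ClusterUsage('train', 3.06, launched_at=0.0, stopped_at=7200.0)
# Same cluster, but its autostop time was never recorded.
unknown = ClusterUsage('train', 3.06, launched_at=0.0, stopped_at=None)

cost_known = estimated_cost(known, now=36000.0)      # billed for 2 hours
cost_unknown = estimated_cost(unknown, now=36000.0)  # billed for 10 hours
```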
Enhancements
- Improvements to `sky show-gpus`
- Check that the image exists and that its size fits in the OS disk #1508
- Make `sky down -p` bypass identity mismatch errors. #1892
Fixes
- Make repeated `sky {cpu,gpu,tpu}node` commands correctly reuse existing cluster if possible #1787
- Fix errors from empty 'resources' field in YAML. #1816
- Make autostop more robust for AWS custom images that by default export 2 credential env vars (#1880, #1894, #1946)
Managed spot
New Features
- Latest in-progress spot jobs are shown in `sky status` (#1270, #1467, #1691)
- Detailed reasons for failed spot jobs are exposed in `sky spot queue -a` (#1655)
Enhancements
- Make `sky spot launch` default `-r/--retry-until-up` to True. #1781
- Make job termination/cancellation significantly more robust (#1433, #1745)
- Catch "pre-launch" errors early (e.g., invalid cluster names, no cloud access) to avoid unnecessary retries (#1714)
- `sky start` on the spot controller resets the default autostop #1453
- `sky spot queue` displays job states with colors (#1473)
- `sky spot queue` no longer shows a cached (and possibly stale) version of the jobs (#1742)
- Disallow `sky down` on spot controller when in-progress spot jobs exist #1667
- New state `FAILED_SETUP` for spot jobs that fail during `setup` (#1479)
- New state `CANCELLING` for spot jobs that are being cancelled (#1785)
- Keep env var `SKYPILOT_JOB_ID` the same for all recoveries of the same job #1400
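Because `SKYPILOT_JOB_ID` stays the same across recoveries, a task can use it as a stable key, for example to resume from the same checkpoint directory after a preemption. A minimal sketch follows; only the environment variable name comes from these notes, while the helper name, the `/checkpoints` root, and the `'local'` fallback are hypothetical.

```python
import os


def checkpoint_dir(root: str = '/checkpoints') -> str:
    """Return a per-job checkpoint directory that survives recoveries.

    `root` is a hypothetical path (e.g., a mounted bucket). The fallback
    value 'local' is an assumption for runs outside of SkyPilot.
    """
    job_id = os.environ.get('SKYPILOT_JOB_ID', 'local')
    return os.path.join(root, job_id)
```

A training script could create this directory on startup and load the newest checkpoint found in it, so that a recovered job continues where the preempted one left off.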
Fixes
- Robustness fixes (#851, #1329, #1411, #1545, #1738, #1757, #1798, #1951, ...)
- Fixes for spot TPUs (#1249, #1470, #1500, #1555, #1717)
- Fix spot jobs with the same name (`-n`) possibly overwriting each other #1782
- Make spot job failover only use the regions in `ssh_proxy_command` if specified #1792
- Fix failing to launch spot jobs when spot controller is created with AWS SSO #1817
TPU
Robustness is enhanced for TPUs in various modes: VMs, pods, spot (#1500, #1279, #1359, #1483, #1562, ...).
Provisioner
Enhancements
- Cluster leakage prevention is significantly improved!
- Disable unattended-upgrade (nondeterministic APT lock) on cluster start
- Generate valid cluster names when username has invalid characters #1526
Fixes
- GCP/provisioner: Handle the occasional RESOURCE_NOT_FOUND error. #1842
- Robustness fixes (#1236, #1287, #1619, #1969)
Storage
New Features
- Cloudflare R2 is now supported! #1736
- R2 is an S3-compatible object store with zero egress fee.
- To use it, see setup docs and usage docs.
- Support multiple paths in the `source` of a storage mount, e.g., `source: [~/mydir/myfile.txt, ~/datasets]` #1311 #1677
Enhancements
- Exclude uploading `.git` folder for cloud storage mounts #1494
- If a `file_mounts` destination path is a relative path, it is treated as being under workdir #1315
- Upgrade GCSFuse version to 0.42.3 #1829
- Mounting options improvements (#1312, #1296, #1320)
- API improvements (#1223, #1239)
- UX/logging improvements (#1200, #1285, #1457, #1833, #1857, #1908, #1858)
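The relative-path rule for `file_mounts` destinations can be pictured with a small sketch. This is an illustration, not SkyPilot's code: the function name is hypothetical, the default `~/sky_workdir` location is an assumption, and treating `~`-prefixed paths as non-relative is also an assumption.

```python
import posixpath


def resolve_mount_destination(dst: str,
                              workdir: str = '~/sky_workdir') -> str:
    """Resolve a file_mounts destination path.

    Absolute and '~'-prefixed paths are returned unchanged (assumption);
    any other path is treated as relative and placed under the workdir.
    """
    if dst.startswith('/') or dst.startswith('~'):
        return dst
    # Join with the workdir and collapse segments such as './'.
    return posixpath.normpath(posixpath.join(workdir, dst))
```

For instance, a destination of `data/train` would resolve to a path under the workdir, while `/data` would be left as-is.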
Fixes
- Fix `sky storage delete` for externally deleted buckets #1875
- Disallow single files for upload to Storage #1231
- Fix rsync for paths with spaces #1190
Backend
New Features
- New feature: Fine-grained optimizer
- Optimizing & provisioning retries at the granularity of regions/zones #975
- In other words, SkyPilot now automatically recognizes and optimizes across the cost differences between zones (e.g., AWS zones have different prices for the same spot instance type) or regions
- New feature: User identity is associated with each cluster (#1513, #1550, #1809)
- Identities are e.g., different AWS profiles / GCP projects
- With this, users are free to switch across identities, and SkyPilot will properly protect each cluster
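The idea behind the fine-grained optimizer (rank individual zones by price, then retry provisioning at that granularity) can be sketched as below. This is a simplified illustration under stated assumptions, not the optimizer's actual logic: the prices are made up, and `try_provision` is a hypothetical callback standing in for a real provisioning attempt.

```python
from typing import Callable, Dict, List, Optional, Tuple

Location = Tuple[str, str]  # (region, zone)


def rank_locations(prices: Dict[Location, float]) -> List[Location]:
    """Order (region, zone) pairs from cheapest to most expensive."""
    return sorted(prices, key=lambda loc: (prices[loc], loc))


def provision_cheapest(
    prices: Dict[Location, float],
    try_provision: Callable[[Location], bool],
) -> Optional[Location]:
    """Try each zone in price order; return the first that succeeds."""
    for location in rank_locations(prices):
        if try_provision(location):
            return location
    return None


# Made-up spot prices (USD/hour) for one instance type.
example_prices = {
    ('us-east-1', 'us-east-1a'): 0.95,
    ('us-east-1', 'us-east-1b'): 0.80,
    ('us-west-2', 'us-west-2a'): 0.88,
}
# Pretend the cheapest zone is out of capacity.
unavailable = {('us-east-1', 'us-east-1b')}
chosen = provision_cheapest(example_prices,
                            lambda loc: loc not in unavailable)
```

In this example the cheapest zone fails, so the sketch falls through to the next-cheapest zone, which happens to be in a different region.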
Enhancements
- Ray runtime on SkyPilot clusters is upgraded to v2.4.0 (#1734)
- All existing clusters are automatically upgraded on their next `sky launch/start`
- Local client's ray requirement is updated to `ray[default]>=2.2.0,<=2.4.0` to fix some dependency conflicts with click/grpcio/protobuf
- Ray cluster used by the SkyPilot runtime now occupies non-default ports #1790
- This means user's Ray applications no longer share the same Ray cluster and can use different versions
- Retry any internal commands that throw rsync transient errors #1839
- Make SkyPilot runtime setup more robust against various images (#1532, #1625)
- Use 1.1.1.1 in addition to 8.8.8.8 for network check. #1800 #1852
- Set SKYPILOT_JOB_ID for all tasks #1377
- Add environment variable `SKY_NUM_GPUS_PER_NODE` #1337
- Task logs now show SSH name and rank #1380
- Set `PYTHONUNBUFFERED=1` in task execution to disable python output buffer by default #1290
- Add mypy checks to codebase (#1549, #1564, #1519)
- Drop Python 3.6 support, switch to Python>=3.7 #1952
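As an illustration of how a task might consume the environment variables referenced in these notes (`SKY_NUM_GPUS_PER_NODE` and `SKY_NODE_RANK`), here is a hedged sketch that derives a simple distributed-training layout. It assumes both variables hold integers; the function name, the caller-supplied `num_nodes`, and the fallback values are hypothetical.

```python
import os
from typing import Dict


def distributed_layout(num_nodes: int) -> Dict[str, int]:
    """Derive a simple distributed-training layout from task env vars.

    Reads SKY_NODE_RANK and SKY_NUM_GPUS_PER_NODE, assuming both hold
    integers. The fallbacks (rank 0, 1 GPU) are assumptions for runs
    outside of a SkyPilot task.
    """
    node_rank = int(os.environ.get('SKY_NODE_RANK', '0'))
    gpus_per_node = int(os.environ.get('SKY_NUM_GPUS_PER_NODE', '1'))
    return {
        'world_size': num_nodes * gpus_per_node,
        # Global rank of the first GPU worker on this node.
        'first_global_rank': node_rank * gpus_per_node,
        'gpus_per_node': gpus_per_node,
    }
```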
Fixes
- Fix SKY_NODE_RANK environment variable #1291
- Make `.ssh/config` more robust (#1763, #1683)
- Mitigate the "database is locked" problem (#1509, #1576)
- Many other fixes (#1224, #1257, #1325, #1421, #1695, #1806, #1713, #1595, ...)
Cloud: AWS
New Features
- Bring-Your-Own-VPC: support private IP-only VPCs, per-region or global SSH proxy commands (#1512, #1569, #1575, #1747)
- AWS SSO is supported #1489
- Support for tagging AWS EC2 nodes
Enhancements
- Reduces AWS spot controller cost by 20% by using lower IOPS #1235
- Update default images for AWS and GCP #1608
- Allow longer cluster name for AWS/Azure clusters #1648
Fixes
- Avoid key pair permission issue by using cloud-init for authorized keys #1427
- Correctly detect credentials provided by temporary env vars. #1963
- (SSO) Retry status fetch on 'Unable to locate credentials' error. #1988
Cloud: GCP
Enhancements
- Add basic permission checks as part of `sky check` #1772
- Automatic GCP VPC creation if no VPC available (#1594, #1640)
- GCP/Azure: add `skypilot-user` tags to VMs on these clouds. #1593
- Update default images for AWS and GCP #1608
Fixes
Cloud: Azure
Enhancements
- GCP/Azure: add `skypilot-user` tags to VMs on these clouds. #1593
- Allow longer cluster name for AWS/Azure clusters #1648
Fixes
- [Azure] Patch azure cli timeout issue #1288
Catalog
New Features
- AWS catalog is refreshed periodically via GitHub actions, so users get up-to-date prices (#1451)
- On any call that uses the catalog, SkyPilot automatically pulls the latest catalog from the catalog repo if it detects the local copy is too old
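The "too old" check for the local catalog copy could look roughly like the sketch below. This is only an illustration: the notes do not say how staleness is determined, so the modification-time comparison, the threshold value, and the function name are all assumptions.

```python
import os
import time
from typing import Optional

# Hypothetical threshold; the notes do not state what "too old" means.
MAX_CATALOG_AGE_SECONDS = 7 * 3600


def catalog_needs_refresh(path: str,
                          max_age: float = MAX_CATALOG_AGE_SECONDS,
                          now: Optional[float] = None) -> bool:
    """Return True if the local catalog file is missing or too old."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    # Compare the file's last-modified time against the allowed age.
    return now - os.path.getmtime(path) > max_age
```

A caller would fetch a fresh copy from the catalog repo whenever this returns `True`, and otherwise read the local file.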
Enhancements
- New GCP catalog & fetcher: all regions are now fetched! #1642
- Action required: `mv ~/.sky/catalogs/v5/gcp ~/.sky/catalogs/v5/gcp.backup` so that new catalogs will be auto-fetched
- Add a Lambda Cloud catalog fetcher #1722
- AWS: Enable p4de (A100-80GB x8 GPUs) in the catalog #1827
Fixes
Thanks to all contributors!
New contributors: @dongreenberg, @turian, @scruel, @vivekkhimani, @stephenbalaban, @landscapepainter, @cblmemo, @Saikrishna-Achalla, @datlife, @Cohen-J-Omer (IBM Cloud support!), @zetavg.
Many thanks to all contributors who contributed to this release!
@Michaelvll, @concretevitamin, @romilbhardwaj, @infwinston, @ewzeng, @michaelzhiluo, @WoosukKwon, @iojw, @sumanthgenz, @landscapepainter, @suquark, @dongreenberg, @cblmemo, @mraheja, @vivekkhimani, @turian, @stephenbalaban, @scruel, @lhqing, @datlife, @Saikrishna-Achalla, @Cohen-J-Omer, @zetavg