SkyPilot v0.3.0

@concretevitamin released this 30 May 17:29

SkyPilot v0.3.0: LLM Support, New Clouds, Enhanced Production-Readiness

We are excited to release SkyPilot v0.3, the most significant release thus far in the project's history.

v0.3 focuses on:

  • LLM support (Vicuna, LLaMA)
  • New clouds (Lambda Cloud; IBM; Cloudflare R2)
  • Enhanced production readiness

See the release blog post for a deep-dive into highlights.

Release notes below are as compared to v0.2 (full changelog).

Release Highlights

  • LLM support
    • Vicuna LLM chatbot trained using SkyPilot for $300 on spot instances!
    • Serve your own LLaMA LLM chatbot on any cloud: full example, blog, repo
    • Significantly expanded GPU availability by leveraging the widest selection of clouds (see below)
  • More clouds, more choices: delivering the highest GPU availability & cost savings
    • Lambda Cloud is now supported!
      • This brings high-end GPUs at lower costs to SkyPilot. (#1557, #1838)
      • Simply run sky check to set it up. Docs here.
    • IBM Cloud is now supported!
      • This brings the first hyperscaler cloud after AWS/GCP/Azure to SkyPilot. (#1598)
    • Cloudflare R2 object store is now supported!
  • Managed Spot is made significantly more robust via a host of fixes/enhancements.
  • Cluster leakage prevention and detection are significantly improved.
  • CLI/API & Backend shipped many new features:
    • sky cost-report; fine-grained optimizer; user identity; AWS SSO; private IP-only VPCs; Ray runtime decoupled from users' Ray clusters; ...

CLI/API

New Features

  • New CLI sky cost-report: show the estimated cost of launched clusters (#1301, #1621, #1780, #1680, #1788)
    • Experimental: Costs for clusters with auto{stop,down} scheduled may not be accurate.
  • New resource filtering support in sky launch / YAML resources: field
    • Add vCPUs --cpus support #1622
    • Add memory --memory support #1746
    • Add disk performance tier --disk-tier support #1812
  • Add --detach-setup and --detach-run to sky launch #1379
  • Add --retry-until-up, --region, --zone, and --idle-minutes-to-autostop for interactive nodes #1297
  • Add autodown (#1217, #1254)
  • Support calling sky status/sky.status() on specific clusters #1568
  • Support --region in sky show-gpus #1187
  • Support passing AMIs for different regions in image_id field under resources #1384
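
As a rough illustration of how the new resource filters combine on the command line, the sketch below assembles a sky launch invocation from optional filters. The helper name build_launch_command and the example values ("8", "32", "high") are assumptions made for illustration; only the --cpus, --memory, and --disk-tier flags come from the notes above. The command is printed, not executed.

```python
import shlex
from typing import List, Optional


def build_launch_command(task_yaml: str,
                         cpus: Optional[str] = None,
                         memory: Optional[str] = None,
                         disk_tier: Optional[str] = None) -> List[str]:
    """Assemble a `sky launch` argv; filters left as None are omitted."""
    cmd = ["sky", "launch"]
    if cpus is not None:
        cmd += ["--cpus", cpus]
    if memory is not None:
        cmd += ["--memory", memory]
    if disk_tier is not None:
        cmd += ["--disk-tier", disk_tier]
    # The task YAML path goes last (hypothetical file name in the demo below).
    cmd.append(task_yaml)
    return cmd


if __name__ == "__main__":
    # Example values are assumptions; check `sky launch --help` for accepted forms.
    argv = build_launch_command("task.yaml", cpus="8", memory="32", disk_tier="high")
    print(shlex.join(argv))
```

The same filters can be set under the resources: field of a task YAML instead of being passed as flags.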

Enhancements

  • Improvements to sky show-gpus
    • Show DEVICE_MEMORY for AWS & Lambda #1825
    • Support querying specific number/type of devices sky show-gpus <gpu>:<num> (same syntax as sky launch --gpus) #1924
  • Check that the image exists and that its size fits in the OS disk #1508
  • Make sky down -p bypass identity mismatch errors. #1892
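
To make the <gpu>:<num> syntax concrete, here is a minimal sketch of how such a string could be split into an accelerator name and a count. This is not SkyPilot's own parser; parse_gpu_spec is a hypothetical helper, and it assumes the count, when present, is a positive integer.

```python
from typing import Optional, Tuple


def parse_gpu_spec(spec: str) -> Tuple[str, Optional[int]]:
    """Split '<gpu>' or '<gpu>:<num>' into (name, count or None)."""
    name, sep, count = spec.partition(":")
    if not name:
        raise ValueError(f"missing accelerator name in {spec!r}")
    if not sep:
        # No ':<num>' suffix: the caller did not ask for a specific count.
        return name, None
    if not count.isdigit() or int(count) == 0:
        raise ValueError(f"count must be a positive integer in {spec!r}")
    return name, int(count)


if __name__ == "__main__":
    print(parse_gpu_spec("V100:8"))
    print(parse_gpu_spec("A100"))
```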

Fixes

  • Make repeated sky {cpu,gpu,tpu}node commands correctly reuse existing cluster if possible #1787
  • Fix errors from empty 'resources' field in YAML. #1816
  • Make autostop more robust for AWS custom images that by default export 2 credential env vars (#1880, #1894, #1946)

Managed spot

New Features

  • Latest in-progress spot jobs are shown in sky status (#1270, #1467, #1691)
  • Detailed reasons for failed spot jobs are exposed in sky spot queue -a (#1655)

Enhancements

  • Make sky spot launch default -r/--retry-until-up to True. #1781
  • Make job termination/cancellation significantly more robust (#1433, #1745)
  • Catch "pre-launch" errors early (e.g., invalid cluster names, no cloud access) to avoid unnecessary retries (#1714)
  • sky start on the spot controller resets the default autostop #1453
  • sky spot queue displays job states with colors (#1473)
  • sky spot queue no longer shows a cached (and possibly stale) version of the jobs (#1742)
  • Disallow sky down on spot controller when in-progress spot jobs exist #1667
  • New state FAILED_SETUP for spot jobs that fail during setup (#1479)
  • New state CANCELLING for spot jobs that are being cancelled (#1785)
  • Keep env var SKYPILOT_JOB_ID the same for all recoveries of the same job #1400
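
Because SKYPILOT_JOB_ID now stays the same across recoveries, a task can use it to find its earlier checkpoints after a preemption. The sketch below shows one possible pattern; the helper name checkpoint_dir, the "local-run" fallback, and the /checkpoints base path are assumptions for illustration, not part of SkyPilot.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping


def checkpoint_dir(base: str, env: Mapping[str, str] = os.environ) -> str:
    """Return a per-job checkpoint directory that survives spot recoveries."""
    # SKYPILOT_JOB_ID is identical for every recovery of the same job, so a
    # recovered run resolves to the same directory as the preempted one.
    job_id = env.get("SKYPILOT_JOB_ID") or "local-run"  # fallback is an assumption
    return str(PurePosixPath(base) / job_id)


if __name__ == "__main__":
    # '/checkpoints' is a hypothetical mount point for persistent storage.
    print(checkpoint_dir("/checkpoints"))
```

For this to help, the base directory has to live on storage that outlasts the preempted VM, such as a mounted bucket.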

TPU

Robustness is enhanced for TPUs in various modes: VMs, pods, spot (#1500, #1279, #1359, #1483, #1562, ...).

Provisioner

Enhancements

  • Cluster leakage prevention is significantly improved!
    • Skip Ray's launch hash check, which caused many leaks (#1671)
    • Launch existing cluster in the same zone to avoid leakage (#1700)
    • Existing cluster's cluster YAML will keep certain fields unchanged across re-launch (#1235, #1251)
    • Fix leakage of an existing cluster when it fails to start #1497
  • Disable unattended-upgrade (nondeterministic APT lock) on cluster start
    • Previously, apt install ... in setup could fail nondeterministically because the APT lock was held by background unattended upgrades
    • Now: on AWS, cloud-init ensures unattended-upgrade is disabled at boot (#1949, #1954); on other clouds, we kill the processes (#1347)
  • Generate valid cluster names when username has invalid characters #1526

Storage

New Features

  • Cloudflare R2 is now supported! #1736
    • R2 is an S3-compatible object store with zero egress fee.
    • To use it, see setup docs and usage docs.
  • Support multiple paths in the source of a storage mount, e.g., source: [~/mydir/myfile.txt, ~/datasets] #1311 #1677

Fixes

  • Fix sky storage delete for externally deleted buckets #1875
  • Disallow single files for upload to Storage #1231
  • Fix rsync for paths with spaces #1190

Backend

New Features

  • New feature: Fine-grained optimizer
    • Optimizing & provisioning retries at the granularity of regions/zones #975
    • In other words, SkyPilot now automatically recognizes and optimizes across cost differences between regions and between zones (e.g., AWS zones have different prices for the same spot instance type)
  • New feature: User identity is associated with each cluster (#1513, #1550, #1809)
    • Identities are, for example, different AWS profiles or GCP projects
    • With this, users are free to switch across identities, and SkyPilot will properly protect each cluster

Enhancements

  • Ray runtime on SkyPilot clusters is upgraded to v2.4.0 (#1734)
    • Each existing cluster is automatically upgraded on its next sky launch/start
    • Local client's ray requirement is updated to ray[default]>=2.2.0,<=2.4.0 to fix some dependency conflicts with click/grpcio/protobuf
  • Ray cluster used by the SkyPilot runtime now occupies non-default ports #1790
    • This means users' Ray applications no longer share the Ray cluster used by the SkyPilot runtime and can use a different Ray version
  • Retry any internal commands that throw rsync transient errors #1839
  • Make SkyPilot runtime setup more robust against various images (#1532, #1625)
  • Use 1.1.1.1 in addition to 8.8.8.8 for network check. #1800 #1852
  • Set SKYPILOT_JOB_ID for all tasks #1377
  • Add environment variable SKY_NUM_GPUS_PER_NODE #1337
  • Task logs now show SSH name and rank #1380
  • Set PYTHONUNBUFFERED=1 in task execution to disable python output buffer by default #1290
  • Add mypy checks to codebase (#1549, #1564, #1519)
  • Drop Python 3.6 support, switch to Python>=3.7 #1952
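
A task script can read SKY_NUM_GPUS_PER_NODE to size its worker processes without hard-coding the GPU count. The sketch below is one way to do that; the helper name gpus_per_node and the default of 0 when the variable is unset (for example, when running outside SkyPilot) are assumptions.

```python
import os
from typing import Mapping


def gpus_per_node(env: Mapping[str, str] = os.environ, default: int = 0) -> int:
    """Read the per-node GPU count exported to tasks by SkyPilot."""
    raw = env.get("SKY_NUM_GPUS_PER_NODE")
    if raw is None or raw.strip() == "":
        # Unset or empty: fall back to the caller-supplied default (an assumption).
        return default
    return int(raw)


if __name__ == "__main__":
    print(f"launching {gpus_per_node()} GPU worker(s) on this node")
```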

Cloud: AWS

New Features

  • Bring-Your-Own-VPC: support private IP-only VPCs, per-region or global SSH proxy commands (#1512, #1569, #1575, #1747)
  • AWS SSO is supported #1489
  • Support for tagging AWS EC2 nodes
    • Tag all SkyPilot AWS nodes with "skypilot-user: $USER". #1515
    • Support custom instance tags in ~/.sky/config.yaml. #1992

Enhancements

  • Reduce AWS spot controller cost by 20% by using lower IOPS #1235
  • Update default images for AWS and GCP #1608
  • Allow longer cluster name for AWS/Azure clusters #1648

Fixes

  • Avoid key pair permission issue by using cloud-init for authorized keys #1427
  • Correctly detect credentials provided by temporary env vars. #1963
  • (SSO) Retry status fetch on 'Unable to locate credentials' error. #1988

Cloud: GCP

Enhancements

  • Add basic permission checks as part of sky check #1772
  • Automatic GCP VPC creation if no VPC available (#1594, #1640)
  • GCP/Azure: add skypilot-user tags to VMs on these clouds. #1593
  • Update default images for AWS and GCP #1608

Cloud: Azure

Enhancements

  • GCP/Azure: add skypilot-user tags to VMs on these clouds. #1593
  • Allow longer cluster name for AWS/Azure clusters #1648

Fixes

  • [Azure] Patch azure cli timeout issue #1288

Catalog

New Features

  • AWS catalog is refreshed periodically via GitHub actions, so users get up-to-date prices (#1451)
    • On any call that uses the catalog, SkyPilot automatically pulls the latest catalog from the catalog repo if it detects the local copy is too old
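
The "too old" check described above can be pictured as a comparison between a file's modification time and a maximum age. The sketch below illustrates that idea only; the release notes do not state SkyPilot's actual logic or threshold, so catalog_is_stale and max_age_hours are hypothetical.

```python
import os
import time
from typing import Optional


def catalog_is_stale(path: str, max_age_hours: float,
                     now: Optional[float] = None) -> bool:
    """Return True if the local catalog is missing or older than the limit."""
    if not os.path.exists(path):
        # No local copy at all: treat as stale so a fresh one gets fetched.
        return True
    now = time.time() if now is None else now
    age_seconds = now - os.path.getmtime(path)
    return age_seconds > max_age_hours * 3600


if __name__ == "__main__":
    # Hypothetical catalog file name and threshold, for illustration only.
    catalog = os.path.expanduser("~/.sky/catalogs/v5/aws/vms.csv")
    print(catalog_is_stale(catalog, max_age_hours=7))
```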

Enhancements

  • New GCP catalog & fetcher: all regions are now fetched! #1642
    • Action required: mv ~/.sky/catalogs/v5/gcp ~/.sky/catalogs/v5/gcp.backup so that new catalogs will be auto-fetched
  • Add a Lambda Cloud catalog fetcher #1722
  • AWS: Enable p4de (A100-80GB x8 GPUs) in the catalog #1827

Thanks to all contributors!

New contributors: @dongreenberg, @turian, @scruel, @vivekkhimani, @stephenbalaban, @landscapepainter, @cblmemo, @Saikrishna-Achalla, @datlife, @Cohen-J-Omer (IBM Cloud support!), @zetavg.

Many thanks to everyone who contributed to this release:

@Michaelvll, @concretevitamin, @romilbhardwaj, @infwinston, @ewzeng, @michaelzhiluo, @WoosukKwon, @iojw, @sumanthgenz, @landscapepainter, @suquark, @dongreenberg, @cblmemo, @mraheja, @vivekkhimani, @turian, @stephenbalaban, @scruel, @lhqing, @datlife, @Saikrishna-Achalla, @Cohen-J-Omer, @zetavg