Merge master (#40)
* [perf] optimizations for sky jobs launch (#4341)

* cache AWS get_user_identities

With SSO enabled (and maybe without?) this takes about a second. We already use
an lru_cache for Azure, do the same here.

* skip optimization for sky jobs launch --yes

The only reason we call optimize for jobs_launch is to give a preview of the
resources we expect to use, and give the user an opportunity to back out if it's
not what they expect. If you use --yes or -y, you don't have a chance to back
out and you're probably running from a script, where you don't care.
Optimization can take ~2 seconds, so just skip it.

* update logging

* address PR comments
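
The identity-caching change in #4341 above is ordinary memoization. A minimal sketch, assuming a hypothetical `slow_identity_lookup` that stands in for the roughly one-second cloud call; only `functools.lru_cache` is a real API here, and none of this is SkyPilot's actual code:

```python
import functools
import time

CALLS = {"count": 0}  # counts how often the slow path actually runs


def slow_identity_lookup():
    """Hypothetical stand-in for a slow identity call (e.g. with SSO enabled)."""
    CALLS["count"] += 1
    time.sleep(0.05)  # shortened delay for the example
    return [["example-user-id", "example-account-id"]]


@functools.lru_cache(maxsize=1)
def get_user_identities():
    """Return the identities; only the first call pays for the slow lookup."""
    return slow_identity_lookup()


if __name__ == "__main__":
    get_user_identities()
    get_user_identities()
    print("slow lookups performed:", CALLS["count"])
```

The trade-off is that the cached value lives for the whole process, so a credential switch made mid-process would not be seen until `get_user_identities.cache_clear()` is called.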

* [ux] cache cluster status of autostop or spot clusters for 2s (#4332)

* add status_updated_at to DB

* don't refresh autostop/spot cluster if it's recently been refreshed

* update locking mechanism for status check to early exit

* address PR comments

* add warning about cluster status lock timeout
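
The short-lived status cache described in #4332 above can be sketched as a timestamp check before the expensive refresh. This is an illustration under assumptions: `ClusterRecord`, `get_status`, and `query_cloud` are hypothetical names, and only the `status_updated_at` field and the 2-second window come from the commit message.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

_STATUS_TTL_SECONDS = 2.0  # taken from the "2s" in the PR title


@dataclass
class ClusterRecord:
    """Hypothetical row of the cluster table."""
    name: str
    status: str
    status_updated_at: Optional[float] = None  # Unix time of the last refresh


def get_status(record: ClusterRecord,
               query_cloud: Callable[[str], str],
               now: Callable[[], float] = time.time) -> str:
    """Return the cached status if it was refreshed within the TTL."""
    fresh = (record.status_updated_at is not None and
             now() - record.status_updated_at < _STATUS_TTL_SECONDS)
    if fresh:
        return record.status
    # Stale or never refreshed: ask the cloud and record when we did so.
    record.status = query_cloud(record.name)
    record.status_updated_at = now()
    return record.status
```

Passing the clock in as `now` keeps the sketch testable without sleeping; the locking that the PR also mentions is deliberately left out.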

* [k8s] fix managed job issue on k8s (#4357)

Signed-off-by: nkwangleiGIT <[email protected]>

* [Core] Add `NO_UPLOAD` for `remote_identity` (#4307)

* Add skip flag to remote_identity

* Rename to NO_UPLOAD

* Fixes

* lint

* comments

* Add comments

* lint

* Add Lambda's GH200 instance type (#4377)

Add GH200 instance type

* [FluidStack] Fix provisioning and add new gpu types (#4359)

[FluidStack] Fix provisioning and add new gpu types

    * Add new `provisioning` status to fix failed deployments

    * Add H100 SXM5 GPU mapping

* [ux] display human-readable name for controller (#4376)

* [k8s] Handle apt update log not existing (#4381)

do not panic if file does not exist, it may be written soon

* Support event-based smoke tests instead of sleep-based ones, to reduce flakiness and speed up tests (#4284)

* event based smoke test

* more event based smoke test

* more test cases

* more test cases with managed jobs

* bug fix

* bump up seconds

* merge master and resolve conflict

* restore sleep for fail test case
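
The idea behind the event-based tests in #4284 above is to poll for a condition with a timeout rather than sleep for a fixed time. A minimal sketch; `wait_until` is a hypothetical helper written for this illustration, not a function from the SkyPilot test suite:

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float,
               interval: float = 0.05) -> float:
    """Poll `condition` until it is true; return the seconds waited.

    Raises TimeoutError if the condition is still false after `timeout`.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        if condition():
            return time.monotonic() - start
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

A test that used to sleep for a worst-case duration can instead call `wait_until(lambda: job_status() == "SUCCEEDED", timeout=300)`, where `job_status` is again a placeholder, and return as soon as the state is reached.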

* [UX] user-friendly message shown if Kubernetes is not enabled. (#4336)

try except

* [Jobs] Disable deduplication for logs (#4388)

Disable dedup

* [OCI] set zone in the ProvisionRecord (#4383)

* fix: Add zone to the ProvisionRecord

* fix

* [Examples] Specify version for vllm because vllm v0.6.4.post1 has an issue (#4391)

* [OCI] Specify vllm version because the latest vllm v0.6.4.post1 has an issue

* version for vllm-flash-attn

* [docs] Specify compartment for OCI resources. (#4384)

* [docs] Specify compartment for OCI resources.

* Add link to compartment definition page

* [k8s] Improve multi-node provisioning time (nimbus) (#4393)

* Tracking k8s events with timeline

* Remove SSH wait

* Parallelize pod creation and status check

* Parallelize labelling, add docs on optimizing base image, bump default provision timeout

* More parallelization, batching and optimizations

* lint

* correctness

* Fix double launch bug

* fix num threads

* Add fd limit warning
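
The pod-creation parallelization listed in #4393 above can be pictured with a bounded thread pool. This is a sketch under assumptions: `create_pod` is a hypothetical stand-in for one Kubernetes API call, and the cap of 4 threads is illustrative rather than the value SkyPilot uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

_NUM_THREADS = 4  # illustrative cap on concurrent API calls


def create_pod(index: int) -> str:
    """Hypothetical stand-in for a single pod-creation request."""
    return f"pod-{index}"


def create_pods_in_parallel(num_pods: int) -> List[str]:
    """Create pods concurrently; results come back in submission order."""
    if num_pods <= 0:
        return []
    workers = min(_NUM_THREADS, num_pods)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order even if calls finish out of order.
        return list(pool.map(create_pod, range(num_pods)))
```

Capping the pool size matters because each in-flight request holds a connection, which is presumably related to the file-descriptor warning the PR adds.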

* [k8s] Move setup and ray start to pod args to make them async (#4389)

* move scripts to args

* Avoid ray setup

* fix

* Add checks for ray healthiness

* remove bc installation

* wait for healthy

* add todo

* fix

* fix

* format

* format

* remove unnecessary logging

* print out error setup

* Add comment

* clean up the logging

* style

* Fixes for ubuntu images

* format

* remove unused comments

* Optimize ray start

* add comments

* Add comments

* Fix comments and logging

* missing end_epoch

* Add logging

* Longer timeout and trigger ray start

* Fixes for the ray port and AWS credential setup

* Update netcat-openbsd, comments

* _NUM_THREADS rename

* add num_nodes to calculate timeout

* lint

* revert

* use uv for pip install and for venv creation (#4394)

* use uv for pip install and for venv creation

uv is a tool that can replace pip and venv (and some other stuff we're not using
I think). It's written in Rust and in testing is significantly faster for many
operations, especially things like `pip list` or `pip install skypilot` when
skypilot or all its dependencies are already installed.

* add comment to SKY_PIP_CMD

* sudo handling for ray

* Add comment in dockerfile

* fix pod checks

* lint

---------

Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Christopher Cooper <[email protected]>

* [Core] Skip worker ray start for multinode (#4390)

* Optimize ray start

* add comments

* update logging

* remove `uv` from runtime setup due to azure installation issue (#4401)

* [k8s] Skip listing all pods to speed up optimizer (#4398)

* Reduce API calls

* lint

* [k8s] Nimbus backward compatibility (#4400)

* Add nimbus backward compatibility

* add uv backcompat

* add uv backcompat

* add uv backcompat

* lint

* merge

* merge

* [Storage] Call `sync_file_mounts` when either rsync or storage file_mounts are specified  (#4317)

do file mounts if storage is specified

* [k8s] Support in-cluster and kubeconfig auth simultaneously (#4188)

* per-context SA + incluster auth fixes

* lint

* Support both incluster and kubeconfig

* wip

* Ignore kubeconfig when context is not specified, add su, mounting kubeconfig

* lint

* comments

* fix merge issues

* lint

* Fix Spot instance on Azure (#4408)

* [UX] Allow disabling ports in CLI (#4378)

[UX] Allow disabling ports

* [AWS] Get rid of credential files if `remote_identity: SERVICE_ACCOUNT` specified (#4395)

* syntax

* minor

* Fix OD instance on Azure (#4411)

* [UX] Remove K80 and M60 from common GPU list (#4382)

* Remove K80 and M60 from GPU list

* Fix kubernetes instance type with space

* comments

* format

* format

* remove mi25

* Event based smoke tests -- managed jobs (#4386)

* event based smoke test

* more event based smoke test

* more test cases

* more test cases with managed jobs

* bug fix

* bump up seconds

* merge master and resolve conflict

* more test case

* support test_managed_jobs_pipeline_failed_setup

* support test_managed_jobs_recovery_aws

* managed job status

* bug fix

* test managed job cancel

* test_managed_jobs_storage

* more test cases

* resolve pr comment

* private member function

* bug fix

* interface change

* bug fix

* bug fix

* raise error on empty status

* [k8s] Fix in-cluster auth namespace fetching (#4420)

* Fix incluster auth namespace fetching

* Fixes

* [k8s] Update comparison page image (#4415)

Update image

* Add a pre commit config to help format before pushing (#4258)

* pre commit config

* yapf version

* fix

* mypy check all files

* skip smoke_test.py

* add doc

* better format

* newline format

* sync with format.sh

* comment fix

* fix the pylint hook for pre-commit (#4422)

* fix the pylint hook

* remove default arg

* change name

* limit pylint files

* [k8s] Fix resources.image_id backward compatibility (#4425)

* Fix back compat

* Fix back compat for image_id + regions

* lint

* comments

* [Tests] Move tests to uv to speed up the dependency installation by >10x (#4424)

* correct cache for pypi

* Add doc cache and test cache

* Add examples folder

* fix policy path

* use uv for pylint

* Fix azure cli

* disable cache

* use venv

* set venv

* source instead

* rename doc build

* Move to uv

* Fix azure cli

* Add -e

* Update .github/workflows/format.yml

Co-authored-by: Christopher Cooper <[email protected]>

* Update .github/workflows/mypy.yml

Co-authored-by: Christopher Cooper <[email protected]>

* Update .github/workflows/pylint.yml

Co-authored-by: Christopher Cooper <[email protected]>

* Update .github/workflows/pytest.yml

Co-authored-by: Christopher Cooper <[email protected]>

* Update .github/workflows/test-doc-build.yml

Co-authored-by: Christopher Cooper <[email protected]>

* fix pytest yml

* Add merge group

---------

Co-authored-by: Christopher Cooper <[email protected]>

* fix db

* fix launch

* remove transaction id

* format

* format

* format

* test doc build

* doc build

* update readme for test kubernetes example (#4426)

* update readme

* fetch version from gcloud

* rename var to GKE_VERSION

* subnetwork also use REGION

* format

* fix types

* fix

* format

* fix types

* [k8s] Fix `show-gpus` availability map when nvidia drivers are not installed (#4429)

* Fix availability map

* Fix availability map

* fix types

* avoid catching ValueError during failover (#4432)

* avoid catching ValueError during failover

If the cloud api raises ValueError or a subclass of ValueError during instance
termination, we will assume the cluster was downed. Fix this by introducing a
new exception ClusterDoesNotExist that we can catch instead of the more general
ValueError.

* add unit test

* lint
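
The failover fix in #4432 above comes down to catching a dedicated exception instead of the broad `ValueError`. A minimal sketch: only the name `ClusterDoesNotExist` comes from the commit message; making it a subclass of `ValueError` is an assumption, and `terminate` and `teardown_during_failover` are hypothetical helpers.

```python
class ClusterDoesNotExist(ValueError):
    """The cluster is already gone. (Subclassing ValueError is an assumption.)"""


def terminate(cluster_exists: bool, api_fails: bool) -> None:
    """Hypothetical termination call with two distinct failure modes."""
    if not cluster_exists:
        raise ClusterDoesNotExist("cluster not found")
    if api_fails:
        # An unrelated error that happens to be a ValueError.
        raise ValueError("cloud API rejected the request")


def teardown_during_failover(cluster_exists: bool, api_fails: bool) -> str:
    try:
        terminate(cluster_exists, api_fails)
    except ClusterDoesNotExist:
        # Only this specific case means the cluster is already down.
        return "already down"
    return "terminated"
```

With the narrower `except`, an unrelated `ValueError` from the cloud API propagates to the caller instead of being mistaken for a successful teardown.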

* [Core] Execute setup when `--detach-setup` and no `run` section (#4430)

* Execute setup when --detach-setup and no run section

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Tian Xia <[email protected]>

* add comments

* Fix types

* format

* minor

* Add test for detach setup only

---------

Co-authored-by: Tian Xia <[email protected]>

* wait for cleanup

* [Jobs] Allow logs for finished jobs and add `sky jobs logs --refresh` for restarting jobs controller (#4380)

* Stream logs for finished jobs

* Allow stream logs for finished jobs

* Read files after the indicator lines

* Add refresh for `sky jobs logs`

* fix log message

* address comments

* Add smoke test

* fix smoke

* fix jobs queue smoke test

* fix storage

* fix merge issue

* fix merge issue

* Fix merging issue

* format

---------

Signed-off-by: nkwangleiGIT <[email protected]>
Co-authored-by: Christopher Cooper <[email protected]>
Co-authored-by: Lei <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Cody Brownstein <[email protected]>
Co-authored-by: mjibril <[email protected]>
Co-authored-by: zpoint <[email protected]>
Co-authored-by: Hysun He <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
Co-authored-by: zpoint <[email protected]>
10 people authored Dec 3, 2024
1 parent aae6ae5 commit 4b2dd86
Showing 82 changed files with 2,140 additions and 878 deletions.
20 changes: 13 additions & 7 deletions .github/workflows/format.yml
@@ -22,29 +22,35 @@ jobs:
         python-version: ["3.8"]
     steps:
     - uses: actions/checkout@v3
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+    - name: Install the latest version of uv
+      uses: astral-sh/setup-uv@v4
       with:
+        version: "latest"
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
       run: |
-        python -m pip install --upgrade pip
-        pip install yapf==0.32.0
-        pip install toml==0.10.2
-        pip install black==22.10.0
-        pip install isort==5.12.0
+        uv venv --seed ~/test-env
+        source ~/test-env/bin/activate
+        uv pip install yapf==0.32.0
+        uv pip install toml==0.10.2
+        uv pip install black==22.10.0
+        uv pip install isort==5.12.0
     - name: Running yapf
       run: |
+        source ~/test-env/bin/activate
         yapf --diff --recursive ./ --exclude 'sky/skylet/ray_patches/**' \
             --exclude 'sky/skylet/providers/ibm/**'
     - name: Running black
       run: |
+        source ~/test-env/bin/activate
         black --diff --check sky/skylet/providers/ibm/
     - name: Running isort for black formatted files
       run: |
+        source ~/test-env/bin/activate
         isort --diff --check --profile black -l 88 -m 3 \
             sky/skylet/providers/ibm/
     - name: Running isort for yapf formatted files
       run: |
+        source ~/test-env/bin/activate
         isort --diff --check ./ --sg 'sky/skylet/ray_patches/**' \
             --sg 'sky/skylet/providers/ibm/**'
23 changes: 0 additions & 23 deletions .github/workflows/mypy-generic.yml

This file was deleted.

15 changes: 10 additions & 5 deletions .github/workflows/mypy.yml
@@ -12,6 +12,8 @@ on:
       - master
       - 'releases/**'
       - restapi
+  merge_group:
+
 jobs:
   mypy:
     runs-on: ubuntu-latest
@@ -20,15 +22,18 @@ jobs:
         python-version: ["3.8"]
     steps:
     - uses: actions/checkout@v3
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+    - name: Install the latest version of uv
+      uses: astral-sh/setup-uv@v4
       with:
+        version: "latest"
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
       run: |
-        python -m pip install --upgrade pip
-        pip install mypy==$(grep mypy requirements-dev.txt | cut -d'=' -f3)
-        pip install $(grep types- requirements-dev.txt | tr '\n' ' ')
+        uv venv --seed ~/test-env
+        source ~/test-env/bin/activate
+        uv pip install mypy==$(grep mypy requirements-dev.txt | cut -d'=' -f3)
+        uv pip install $(grep types- requirements-dev.txt | tr '\n' ' ')
     - name: Running mypy
       run: |
+        source ~/test-env/bin/activate
         mypy $(cat tests/mypy_files.txt)
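
The `grep mypy requirements-dev.txt | cut -d'=' -f3` expression in the install step above works because a `==` pin splits into three `=`-delimited fields, the second of which is empty. A small Python sketch of that splitting; `cut_field` is a hypothetical helper and `mypy==0.991` is an assumed example line, not read from the repository:

```python
def cut_field(line: str, delimiter: str, field: int) -> str:
    """Mimic `cut -d<delimiter> -f<field>` for one line (fields are 1-indexed)."""
    if delimiter not in line:
        # Without -s, cut prints lines that contain no delimiter unchanged.
        return line
    parts = line.split(delimiter)
    # A field past the end of the line yields empty output.
    return parts[field - 1] if field <= len(parts) else ""


if __name__ == "__main__":
    pin = "mypy==0.991"  # assumed example of a pinned requirement
    print([cut_field(pin, "=", i) for i in (1, 2, 3)])
```

The approach is sensitive to the pin format: a `>=` constraint or a second line that also matches `mypy` would change what the third field contains.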
16 changes: 10 additions & 6 deletions .github/workflows/pylint.yml
@@ -22,16 +22,20 @@ jobs:
         python-version: ["3.8"]
     steps:
     - uses: actions/checkout@v3
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+    - name: Install the latest version of uv
+      uses: astral-sh/setup-uv@v4
       with:
+        version: "latest"
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
       run: |
-        python -m pip install --upgrade pip
-        pip install ".[all]"
-        pip install pylint==2.14.5
-        pip install pylint-quotes==0.2.3
+        uv venv --seed ~/test-env
+        source ~/test-env/bin/activate
+        uv pip install --prerelease=allow "azure-cli>=2.65.0"
+        uv pip install ".[all]"
+        uv pip install pylint==2.14.5
+        uv pip install pylint-quotes==0.2.3
     - name: Analysing the code with pylint
       run: |
+        source ~/test-env/bin/activate
         pylint --load-plugins pylint_quotes sky
31 changes: 13 additions & 18 deletions .github/workflows/pytest.yml
@@ -35,26 +35,21 @@ jobs:
     steps:
       - name: Checkout repository
         uses: actions/checkout@v3
-
-      - name: Install Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
+      - name: Install the latest version of uv
+        uses: astral-sh/setup-uv@v4
         with:
+          version: "latest"
           python-version: ${{ matrix.python-version }}
-
-      - name: Cache dependencies
-        uses: actions/cache@v3
-        if: startsWith(runner.os, 'Linux')
-        with:
-          path: ~/.cache/pip
-          key: ${{ runner.os }}-pip-pytest-${{ matrix.python-version }}
-          restore-keys: |
-            ${{ runner.os }}-pip-pytest-${{ matrix.python-version }}
       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          pip install -e ".[all]"
-          pip install pytest pytest-xdist pytest-env>=0.6 memory-profiler==0.61.0
+          uv venv --seed ~/test-env
+          source ~/test-env/bin/activate
+          uv pip install --prerelease=allow "azure-cli>=2.65.0"
+          # Use -e to include examples and tests folder in the path for unit
+          # tests to access them.
+          uv pip install -e ".[all]"
+          uv pip install pytest pytest-xdist pytest-env>=0.6 memory-profiler==0.61.0
       - name: Run tests with pytest
-        run: SKYPILOT_DISABLE_USAGE_COLLECTION=1 SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK=1 pytest -n 0 --dist no ${{ matrix.test-path }}
+        run: |
+          source ~/test-env/bin/activate
+          SKYPILOT_DISABLE_USAGE_COLLECTION=1 SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK=1 pytest -n 0 --dist no ${{ matrix.test-path }}
17 changes: 11 additions & 6 deletions .github/workflows/test-doc-build.yml
@@ -11,27 +11,32 @@ on:
     branches:
       - master
       - 'releases/**'
+      - restapi
   merge_group:
 
 jobs:
-  format:
+  doc-build:
     runs-on: ubuntu-latest
     strategy:
       matrix:
         python-version: ["3.10"]
     steps:
     - uses: actions/checkout@v3
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+    - name: Install the latest version of uv
+      uses: astral-sh/setup-uv@v4
       with:
+        version: "latest"
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
       run: |
-        python -m pip install --upgrade pip
-        pip install .
+        uv venv --seed ~/test-env
+        source ~/test-env/bin/activate
+        uv pip install --prerelease=allow "azure-cli>=2.65.0"
+        uv pip install ".[all]"
         cd docs
-        pip install -r ./requirements-docs.txt
+        uv pip install -r ./requirements-docs.txt
     - name: Build documentation
       run: |
+        source ~/test-env/bin/activate
         cd ./docs
         ./build.sh
74 changes: 74 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,74 @@
+# Ensure this configuration aligns with format.sh and requirements.txt
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v5.0.0
+    hooks:
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+      - id: check-yaml
+      - id: check-added-large-files
+
+  - repo: https://github.com/psf/black
+    rev: 22.10.0 # Match the version from requirements
+    hooks:
+      - id: black
+        name: black (IBM specific)
+        files: "^sky/skylet/providers/ibm/.*" # Match only files in the IBM directory
+
+  - repo: https://github.com/pycqa/isort
+    rev: 5.12.0 # Match the version from requirements
+    hooks:
+      # First isort command
+      - id: isort
+        name: isort (general)
+        args:
+          - "--sg=build/**" # Matches "${ISORT_YAPF_EXCLUDES[@]}"
+          - "--sg=sky/skylet/providers/ibm/**"
+        files: "^(sky|tests|examples|llm|docs)/.*" # Only match these directories
+      # Second isort command
+      - id: isort
+        name: isort (IBM specific)
+        args:
+          - "--profile=black"
+          - "-l=88"
+          - "-m=3"
+        files: "^sky/skylet/providers/ibm/.*" # Only match IBM-specific directory
+
+  - repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v0.991 # Match the version from requirements
+    hooks:
+      - id: mypy
+        args:
+          # From tests/mypy_files.txt
+          - "sky"
+          - "--exclude"
+          - "sky/benchmark|sky/callbacks|sky/skylet/providers/azure|sky/resources.py|sky/backends/monkey_patches"
+        pass_filenames: false
+        additional_dependencies:
+          - types-PyYAML
+          - types-requests<2.31 # Match the condition in requirements.txt
+          - types-setuptools
+          - types-cachetools
+          - types-pyvmomi
+
+  - repo: https://github.com/google/yapf
+    rev: v0.32.0 # Match the version from requirements
+    hooks:
+      - id: yapf
+        name: yapf
+        exclude: (build/.*|sky/skylet/providers/ibm/.*) # Matches exclusions from the script
+        args: ['--recursive', '--parallel'] # Only necessary flags
+        additional_dependencies: [toml==0.10.2]
+
+  - repo: https://github.com/pylint-dev/pylint
+    rev: v2.14.5 # Match the version from requirements
+    hooks:
+      - id: pylint
+        additional_dependencies:
+          - pylint-quotes==0.2.3 # Match the version from requirements
+        name: pylint
+        args:
+          - --rcfile=.pylintrc # Use your custom pylint configuration
+          - --load-plugins=pylint_quotes # Load the pylint-quotes plugin
+        files: ^sky/ # Only include files from the 'sky/' directory
+        exclude: ^sky/skylet/providers/ibm/
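
The `files` and `exclude` keys in the pylint hook above are Python regular expressions that pre-commit searches for in each repository-relative path. A minimal sketch of how the two patterns combine; `is_linted` and the sample paths are hypothetical, and the patterns are copied from the config:

```python
import re

# Patterns copied from the pylint hook in the config above.
FILES = re.compile(r"^sky/")
EXCLUDE = re.compile(r"^sky/skylet/providers/ibm/")


def is_linted(path: str) -> bool:
    """A path is selected if it matches `files` and does not match `exclude`."""
    return bool(FILES.search(path)) and not EXCLUDE.search(path)


if __name__ == "__main__":
    # Sample paths are made up for illustration.
    for path in ("sky/cli.py", "sky/skylet/providers/ibm/utils.py", "tests/test_x.py"):
        print(path, is_linted(path))
```

Because both patterns are anchored with `^`, a file such as `docs/sky/notes.py` would not be selected even though it contains `sky/`.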
1 change: 1 addition & 0 deletions CONTRIBUTING.md
@@ -78,6 +78,7 @@ It has some convenience features which you might find helpful (see [Dockerfile](
 - If relevant, add tests for your changes. For changes that touch the core system, run the [smoke tests](#testing) and ensure they pass.
 - Follow the [Google style guide](https://google.github.io/styleguide/pyguide.html).
 - Ensure code is properly formatted by running [`format.sh`](https://github.com/skypilot-org/skypilot/blob/master/format.sh).
+- [Optional] You can also install pre-commit hooks by running `pre-commit install` to automatically format your code on commit.
 - Push your changes to your fork and open a pull request in the SkyPilot repository.
 - In the PR description, write a `Tested:` section to describe relevant tests performed.
 
2 changes: 1 addition & 1 deletion Dockerfile_k8s
@@ -7,7 +7,7 @@ ARG DEBIAN_FRONTEND=noninteractive
 
 # Initialize conda for root user, install ssh and other local dependencies
 RUN apt update -y && \
-    apt install git gcc rsync sudo patch openssh-server pciutils nano fuse socat netcat curl -y && \
+    apt install git gcc rsync sudo patch openssh-server pciutils nano fuse socat netcat-openbsd curl -y && \
     rm -rf /var/lib/apt/lists/* && \
     apt remove -y python3 && \
     conda init
3 changes: 2 additions & 1 deletion Dockerfile_k8s_gpu
@@ -7,7 +7,7 @@ ARG DEBIAN_FRONTEND=noninteractive
 # We remove cuda lists to avoid conflicts with the cuda version installed by ray
 RUN rm -rf /etc/apt/sources.list.d/cuda* && \
     apt update -y && \
-    apt install git gcc rsync sudo patch openssh-server pciutils nano fuse unzip socat netcat curl -y && \
+    apt install git gcc rsync sudo patch openssh-server pciutils nano fuse unzip socat netcat-openbsd curl -y && \
     rm -rf /var/lib/apt/lists/*
 
 # Setup SSH and generate hostkeys
@@ -36,6 +36,7 @@ SHELL ["/bin/bash", "-c"]
 
 # Install conda and other dependencies
 # Keep the conda and Ray versions below in sync with the ones in skylet.constants
+# Keep this section in sync with the custom image optimization recommendations in our docs (kubernetes-getting-started.rst)
 RUN curl https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x86_64.sh -o Miniconda3-Linux-x86_64.sh && \
     bash Miniconda3-Linux-x86_64.sh -b && \
     eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true && conda activate base && \
8 changes: 8 additions & 0 deletions docs/source/getting-started/installation.rst
@@ -267,6 +267,14 @@ The :code:`~/.oci/config` file should contain the following fields:
   # Note that we should avoid using full home path for the key_file configuration, e.g. use ~/.oci instead of /home/username/.oci
   key_file=~/.oci/oci_api_key.pem
 
+By default, the provisioned nodes will be in the root `compartment <https://docs.oracle.com/en/cloud/foundation/cloud_architecture/governance/compartments.html>`__. To specify the `compartment <https://docs.oracle.com/en/cloud/foundation/cloud_architecture/governance/compartments.html>`_ other than root, create/edit the file :code:`~/.sky/config.yaml`, put the compartment's OCID there, as the following:
+
+.. code-block:: text
+
+  oci:
+    default:
+      compartment_ocid: ocid1.compartment.oc1..aaaaaaaa......
+
 Lambda Cloud
 ~~~~~~~~~~~~~~~~~~
 
2 changes: 1 addition & 1 deletion docs/source/reference/comparison.rst
@@ -46,7 +46,7 @@ SkyPilot provides faster iteration for interactive development. For example, a c
 * :strong:`With SkyPilot, a single command (`:literal:`sky launch`:strong:`) takes care of everything.` Behind the scenes, SkyPilot provisions pods, installs all required dependencies, executes the job, returns logs, and provides SSH and VSCode access to debug.
 
 
-.. figure:: https://blog.skypilot.co/ai-on-kubernetes/images/k8s_vs_skypilot_iterative_v2.png
+.. figure:: https://i.imgur.com/xfCfz4N.png
    :align: center
    :width: 95%
    :alt: Iterative Development with Kubernetes vs SkyPilot