Merge master (#40)

* [perf] optimizations for sky jobs launch (#4341) * cache AWS get_user_identities With SSO enabled (and maybe without?) this takes about a second. We already use an lru_cache for Azure, do the same here. * skip optimization for sky jobs launch --yes The only reason we call optimize for jobs_launch is to give a preview of the resources we expect to use, and give the user an opportunity to back out if it's not what they expect. If you use --yes or -y, you don't have a chance to back out and you're probably running from a script, where you don't care. Optimization can take ~2 seconds, so just skip it. * update logging * address PR comments
* [ux] cache cluster status of autostop or spot clusters for 2s (#4332) * add status_updated_at to DB * don't refresh autostop/spot cluster if it's recently been refreshed * update locking mechanism for status check to early exit * address PR comments * add warning about cluster status lock timeout
* [k8s] fix managed job issue on k8s (#4357) Signed-off-by: nkwangleiGIT <[email protected]>
* [Core] Add `NO_UPLOAD` for `remote_identity` (#4307) * Add skip flag to remote_identity * Rename to NO_UPLOAD * Fixes * lint * comments * Add comments * lint
* Add Lambda's GH200 instance type (#4377) Add GH200 instance type
* [FluidStack] Fix provisioning and add new gpu types (#4359) [FluidStack] Fix provisioning and add new gpu types * Add new `provisioning` status to fix failed deployments * Add H100 SXM5 GPU mapping
* [ux] display human-readable name for controller (#4376)
* [k8s] Handle apt update log not existing (#4381) do not panic if file does not exist, it may be written soon
* Support event based smoke test instead of sleep time based to reduce flaky test and faster test (#4284) * event based smoke test * more event based smoke test * more test cases * more test cases with managed jobs * bug fix * bump up seconds * merge master and resolve conflict * restore sleep for fail test case
* [UX] user-friendly message shown if Kubernetes is not enabled. (#4336) try except
* [Jobs] Disable deduplication for logs (#4388) Disable dedup
* [OCI] set zone in the ProvisionRecord (#4383) * fix: Add zone to the ProvisionRecord * fix
* [Examples] Specify version for vllm cuz vllm v0.6.4.post1 has issue (#4391) * [OCI] Specify vllm version because the latest vllm v0.6.4.post1 has issue * version for vllm-flash-attn
* [docs] Specify compartment for OCI resources. (#4384) * [docs] Specify compartment for OCI resources. * Add link to compartment definition page
* [k8s] Improve multi-node provisioning time (nimbus) (#4393) * Tracking k8s events with timeline * Remove SSH wait * Parallelize pod creation and status check * Parallelize labelling, add docs on optimizing base image, bump default provision timeout * More parallelization, batching and optimizations * lint * correctness * Fix double launch bug * fix num threads * Add fd limit warning
* [k8s] Move setup and ray start to pod args to make them async (#4389) * move scripts to args * Avoid ray setup * fix * Add checks for ray healthiness * remove bc installation * wait for healthy * add todo * fix * fix * format * format * remove unnecessary logging * print out error setup * Add comment * clean up the logging * style * Fixes for ubuntu images * format * remove unused comments * Optimize ray start * add comments * Add comments * Fix comments and logging * missing end_epoch * Add logging * Longer timeout and trigger ray start * Fixes for the ray port and AWS credential setup * Update netcat-openbsd, comments * _NUM_THREADS rename * add num_nodes to calculate timeout * lint * revert
* use uv for pip install and for venv creation (#4394) * use uv for pip install and for venv creation uv is a tool that can replace pip and venv (and some other stuff we're not using I think). It's written in rust and in testing is significantly faster for many operations, especially things like `pip list` or `pip install skypilot` when skypilot or all its dependencies are already installed. * add comment to SKY_PIP_CMD * sudo handling for ray * Add comment in dockerfile * fix pod checks * lint --------- Co-authored-by: Zhanghao Wu <[email protected]> Co-authored-by: Christopher Cooper <[email protected]>
* [Core] Skip worker ray start for multinode (#4390) * Optimize ray start * add comments * update logging
* remove `uv` from runtime setup due to azure installation issue (#4401)
* [k8s] Skip listing all pods to speed up optimizer (#4398) * Reduce API calls * lint
* [k8s] Nimbus backward compatibility (#4400) * Add nimbus backward compatibility * add uv backcompat * add uv backcompat * add uv backcompat * lint * merge * merge
* [Storage] Call `sync_file_mounts` when either rsync or storage file_mounts are specified (#4317) do file mounts if storage is specified
* [k8s] Support in-cluster and kubeconfig auth simultaneously (#4188) * per-context SA + incluster auth fixes * lint * Support both incluster and kubeconfig * wip * Ignore kubeconfig when context is not specified, add su, mounting kubeconfig * lint * comments * fix merge issues * lint
* Fix Spot instance on Azure (#4408)
* [UX] Allow disabling ports in CLI (#4378) [UX] Allow disabling ports
* [AWS] Get rid of credential files if `remote_identity: SERVICE_ACCOUNT` specified (#4395) * syntax * minor
* Fix OD instance on Azure (#4411)
* [UX] Remove K80 and M60 from common GPU list (#4382) * Remove K80 and M60 from GPU list * Fix kubernetes instance type with space * comments * format * format * remove mi25
* Event based smoke tests -- managed jobs (#4386) * event based smoke test * more event based smoke test * more test cases * more test cases with managed jobs * bug fix * bump up seconds * merge master and resolve conflict * more test case * support test_managed_jobs_pipeline_failed_setup * support test_managed_jobs_recovery_aws * managed job status * bug fix * test managed job cancel * test_managed_jobs_storage * more test cases * resolve pr comment * private member function * bug fix * interface change * bug fix * bug fix * raise error on empty status
* [k8s] Fix in-cluster auth namespace fetching (#4420) * Fix incluster auth namespace fetching * Fixes
* [k8s] Update comparison page image (#4415) Update image
* Add a pre commit config to help format before pushing (#4258) * pre commit config * yapf version * fix * mypy check all files * skip smoke_test.py * add doc * better format * newline format * sync with format.sh * comment fix
* fix the pylint hook for pre-commit (#4422) * fix the pylint hook * remove default arg * change name * limit pylint files
* [k8s] Fix resources.image_id backward compatibility (#4425) * Fix back compat * Fix back compat for image_id + regions * lint * comments
* [Tests] Move tests to uv to speed up the dependency installation by >10x (#4424) * correct cache for pypi * Add doc cache and test cache * Add examples folder * fix policy path * use uv for pylint * Fix azure cli * disable cache * use venv * set venv * source instead * rename doc build * Move to uv * Fix azure cli * Add -e * Update .github/workflows/format.yml Co-authored-by: Christopher Cooper <[email protected]> * Update .github/workflows/mypy.yml Co-authored-by: Christopher Cooper <[email protected]> * Update .github/workflows/pylint.yml Co-authored-by: Christopher Cooper <[email protected]> * Update .github/workflows/pytest.yml Co-authored-by: Christopher Cooper <[email protected]> * Update .github/workflows/test-doc-build.yml Co-authored-by: Christopher Cooper <[email protected]> * fix pytest yml * Add merge group --------- Co-authored-by: Christopher Cooper <[email protected]>
* fix db * fix launch * remove transaction id * format * format * format * test doc build * doc build
* update readme for test kubernetes example (#4426) * update readme * fetch version from gcloud * rename var to GKE_VERSION * subnetwork also use REGION * format * fix types * fix * format * fix types
* [k8s] Fix `show-gpus` availability map when nvidia drivers are not installed (#4429) * Fix availability map * Fix availability map * fix types
* avoid catching ValueError during failover (#4432) * avoid catching ValueError during failover If the cloud api raises ValueError or a subclass of ValueError during instance termination, we will assume the cluster was downed. Fix this by introducing a new exception ClusterDoesNotExist that we can catch instead of the more general ValueError. * add unit test * lint
* [Core] Execute setup when `--detach-setup` and no `run` section (#4430) * Execute setup when --detach-setup and no run section * Update sky/backends/cloud_vm_ray_backend.py Co-authored-by: Tian Xia <[email protected]> * add comments * Fix types * format * minor * Add test for detach setup only --------- Co-authored-by: Tian Xia <[email protected]>
* wait for cleanup
* [Jobs] Allow logs for finished jobs and add `sky jobs logs --refresh` for restarting jobs controller (#4380) * Stream logs for finished jobs * Allow stream logs for finished jobs * Read files after the indicator lines * Add refresh for `sky jobs logs` * fix log message * address comments * Add smoke test * fix smoke * fix jobs queue smoke test * fix storage * fix merge issue * fix merge issue * Fix merging issue * format
---------
Signed-off-by: nkwangleiGIT <[email protected]>
Co-authored-by: Christopher Cooper <[email protected]>
Co-authored-by: Lei <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Cody Brownstein <[email protected]>
Co-authored-by: mjibril <[email protected]>
Co-authored-by: zpoint <[email protected]>
Co-authored-by: Hysun He <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
Co-authored-by: zpoint <[email protected]>
skypilot-org · Dec 3, 2024 · 4b2dd86
1 parent aae6ae5
commit 4b2dd86
Showing 82 changed files with 2,140 additions and 878 deletions.
diff --git a/.github/workflows/format.yml b/.github/workflows/format.yml
@@ -22,29 +22,35 @@ jobs:
         python-version: ["3.8"]
     steps:
     - uses: actions/checkout@v3
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+    - name: Install the latest version of uv
+      uses: astral-sh/setup-uv@v4
       with:
+        version: "latest"
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
       run: |
-        python -m pip install --upgrade pip
-        pip install yapf==0.32.0
-        pip install toml==0.10.2
-        pip install black==22.10.0
-        pip install isort==5.12.0
+        uv venv --seed ~/test-env
+        source ~/test-env/bin/activate
+        uv pip install yapf==0.32.0
+        uv pip install toml==0.10.2
+        uv pip install black==22.10.0
+        uv pip install isort==5.12.0
     - name: Running yapf
       run: |
+        source ~/test-env/bin/activate
         yapf --diff --recursive ./ --exclude 'sky/skylet/ray_patches/**' \
             --exclude 'sky/skylet/providers/ibm/**'
     - name: Running black
       run: |
+        source ~/test-env/bin/activate
         black --diff --check sky/skylet/providers/ibm/
     - name: Running isort for black formatted files
       run: |
+        source ~/test-env/bin/activate
         isort --diff --check --profile black -l 88 -m 3 \
             sky/skylet/providers/ibm/
     - name: Running isort for yapf formatted files
       run: |
+        source ~/test-env/bin/activate
         isort --diff --check ./ --sg 'sky/skylet/ray_patches/**' \
             --sg 'sky/skylet/providers/ibm/**'
diff --git a/.github/workflows/mypy-generic.yml b/.github/workflows/mypy-generic.yml
diff --git a/.github/workflows/mypy.yml b/.github/workflows/mypy.yml
@@ -12,6 +12,8 @@ on:
       - master
       - 'releases/**'
       - restapi
+  merge_group:
+
 jobs:
   mypy:
     runs-on: ubuntu-latest
@@ -20,15 +22,18 @@ jobs:
         python-version: ["3.8"]
     steps:
     - uses: actions/checkout@v3
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+    - name: Install the latest version of uv
+      uses: astral-sh/setup-uv@v4
       with:
+        version: "latest"
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
       run: |
-        python -m pip install --upgrade pip
-        pip install mypy==$(grep mypy requirements-dev.txt | cut -d'=' -f3)
-        pip install $(grep types- requirements-dev.txt | tr '\n' ' ')
+        uv venv --seed ~/test-env
+        source ~/test-env/bin/activate
+        uv pip install mypy==$(grep mypy requirements-dev.txt | cut -d'=' -f3)
+        uv pip install $(grep types- requirements-dev.txt | tr '\n' ' ')
     - name: Running mypy
       run: |
+        source ~/test-env/bin/activate
         mypy $(cat tests/mypy_files.txt)
diff --git a/.github/workflows/pylint.yml b/.github/workflows/pylint.yml
@@ -22,16 +22,20 @@ jobs:
         python-version: ["3.8"]
     steps:
     - uses: actions/checkout@v3
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+    - name: Install the latest version of uv
+      uses: astral-sh/setup-uv@v4
       with:
+        version: "latest"
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
       run: |
-        python -m pip install --upgrade pip
-        pip install ".[all]"
-        pip install pylint==2.14.5
-        pip install pylint-quotes==0.2.3
+        uv venv --seed ~/test-env
+        source ~/test-env/bin/activate
+        uv pip install --prerelease=allow "azure-cli>=2.65.0"
+        uv pip install ".[all]"
+        uv pip install pylint==2.14.5
+        uv pip install pylint-quotes==0.2.3
     - name: Analysing the code with pylint
       run: |
+        source ~/test-env/bin/activate
         pylint --load-plugins pylint_quotes sky
diff --git a/.github/workflows/pytest.yml b/.github/workflows/pytest.yml
@@ -35,26 +35,21 @@ jobs:
     steps:
       - name: Checkout repository
         uses: actions/checkout@v3
-
-      - name: Install Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
+      - name: Install the latest version of uv
+        uses: astral-sh/setup-uv@v4
         with:
+          version: "latest"
           python-version: ${{ matrix.python-version }}
-
-      - name: Cache dependencies
-        uses: actions/cache@v3
-        if: startsWith(runner.os, 'Linux')
-        with:
-          path: ~/.cache/pip
-          key: ${{ runner.os }}-pip-pytest-${{ matrix.python-version }}
-          restore-keys: |
-            ${{ runner.os }}-pip-pytest-${{ matrix.python-version }}
-
       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          pip install -e ".[all]"
-          pip install pytest pytest-xdist pytest-env>=0.6 memory-profiler==0.61.0
-
+          uv venv --seed ~/test-env
+          source ~/test-env/bin/activate
+          uv pip install --prerelease=allow "azure-cli>=2.65.0"
+          # Use -e to include examples and tests folder in the path for unit
+          # tests to access them.
+          uv pip install -e ".[all]"
+          uv pip install pytest pytest-xdist pytest-env>=0.6 memory-profiler==0.61.0
       - name: Run tests with pytest
-        run: SKYPILOT_DISABLE_USAGE_COLLECTION=1 SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK=1 pytest -n 0 --dist no ${{ matrix.test-path }}
+        run: |
+          source ~/test-env/bin/activate
+          SKYPILOT_DISABLE_USAGE_COLLECTION=1 SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK=1 pytest -n 0 --dist no ${{ matrix.test-path }}
diff --git a/.github/workflows/test-doc-build.yml b/.github/workflows/test-doc-build.yml
@@ -11,27 +11,32 @@ on:
     branches:
       - master
       - 'releases/**'
+      - restapi
   merge_group:
 
 jobs:
-  format:
+  doc-build:
     runs-on: ubuntu-latest
     strategy:
       matrix:
         python-version: ["3.10"]
     steps:
     - uses: actions/checkout@v3
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
+    - name: Install the latest version of uv
+      uses: astral-sh/setup-uv@v4
       with:
+        version: "latest"
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
       run: |
-        python -m pip install --upgrade pip
-        pip install .
+        uv venv --seed ~/test-env
+        source ~/test-env/bin/activate
+        uv pip install --prerelease=allow "azure-cli>=2.65.0"
+        uv pip install ".[all]"
         cd docs
-        pip install -r ./requirements-docs.txt
+        uv pip install -r ./requirements-docs.txt
     - name: Build documentation
       run: |
+        source ~/test-env/bin/activate
         cd ./docs
         ./build.sh
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,74 @@
+# Ensure this configuration aligns with format.sh and requirements.txt
+repos:
+-   repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v5.0.0
+    hooks:
+    -   id: trailing-whitespace
+    -   id: end-of-file-fixer
+    -   id: check-yaml
+    -   id: check-added-large-files
+
+-   repo: https://github.com/psf/black
+    rev: 22.10.0  # Match the version from requirements
+    hooks:
+    -   id: black
+        name: black (IBM specific)
+        files: "^sky/skylet/providers/ibm/.*"  # Match only files in the IBM directory
+
+-   repo: https://github.com/pycqa/isort
+    rev: 5.12.0  # Match the version from requirements
+    hooks:
+    # First isort command
+    -   id: isort
+        name: isort (general)
+        args:
+          - "--sg=build/**"  # Matches "${ISORT_YAPF_EXCLUDES[@]}"
+          - "--sg=sky/skylet/providers/ibm/**"
+        files: "^(sky|tests|examples|llm|docs)/.*"  # Only match these directories
+    # Second isort command
+    -   id: isort
+        name: isort (IBM specific)
+        args:
+          - "--profile=black"
+          - "-l=88"
+          - "-m=3"
+        files: "^sky/skylet/providers/ibm/.*"  # Only match IBM-specific directory
+
+-   repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v0.991  # Match the version from requirements
+    hooks:
+    -   id: mypy
+        args:
+            # From tests/mypy_files.txt
+            - "sky"
+            - "--exclude"
+            - "sky/benchmark|sky/callbacks|sky/skylet/providers/azure|sky/resources.py|sky/backends/monkey_patches"
+        pass_filenames: false
+        additional_dependencies:
+            - types-PyYAML
+            - types-requests<2.31  # Match the condition in requirements.txt
+            - types-setuptools
+            - types-cachetools
+            - types-pyvmomi
+
+-   repo: https://github.com/google/yapf
+    rev: v0.32.0  # Match the version from requirements
+    hooks:
+    -   id: yapf
+        name: yapf
+        exclude: (build/.*|sky/skylet/providers/ibm/.*)  # Matches exclusions from the script
+        args: ['--recursive', '--parallel']  # Only necessary flags
+        additional_dependencies: [toml==0.10.2]
+
+-   repo: https://github.com/pylint-dev/pylint
+    rev: v2.14.5  # Match the version from requirements
+    hooks:
+    -   id: pylint
+        additional_dependencies:
+            - pylint-quotes==0.2.3  # Match the version from requirements
+        name: pylint
+        args:
+            - --rcfile=.pylintrc  # Use your custom pylint configuration
+            - --load-plugins=pylint_quotes  # Load the pylint-quotes plugin
+        files: ^sky/  # Only include files from the 'sky/' directory
+        exclude: ^sky/skylet/providers/ibm/
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -78,6 +78,7 @@ It has some convenience features which you might find helpful (see [Dockerfile](
 - If relevant, add tests for your changes. For changes that touch the core system, run the [smoke tests](#testing) and ensure they pass.
 - Follow the [Google style guide](https://google.github.io/styleguide/pyguide.html).
 - Ensure code is properly formatted by running [`format.sh`](https://github.com/skypilot-org/skypilot/blob/master/format.sh).
+  - [Optional] You can also install pre-commit hooks by running `pre-commit install` to automatically format your code on commit.
 - Push your changes to your fork and open a pull request in the SkyPilot repository.
 - In the PR description, write a `Tested:` section to describe relevant tests performed.
 

diff --git a/Dockerfile_k8s b/Dockerfile_k8s
@@ -7,7 +7,7 @@ ARG DEBIAN_FRONTEND=noninteractive
 
 # Initialize conda for root user, install ssh and other local dependencies
 RUN apt update -y && \
-    apt install git gcc rsync sudo patch openssh-server pciutils nano fuse socat netcat curl -y && \
+    apt install git gcc rsync sudo patch openssh-server pciutils nano fuse socat netcat-openbsd curl -y && \
     rm -rf /var/lib/apt/lists/* && \
     apt remove -y python3 && \
     conda init

diff --git a/Dockerfile_k8s_gpu b/Dockerfile_k8s_gpu
@@ -7,7 +7,7 @@ ARG DEBIAN_FRONTEND=noninteractive
 # We remove cuda lists to avoid conflicts with the cuda version installed by ray
 RUN rm -rf /etc/apt/sources.list.d/cuda* && \
     apt update -y && \
-    apt install git gcc rsync sudo patch openssh-server pciutils nano fuse unzip socat netcat curl -y && \
+    apt install git gcc rsync sudo patch openssh-server pciutils nano fuse unzip socat netcat-openbsd curl -y && \
     rm -rf /var/lib/apt/lists/*
 
 # Setup SSH and generate hostkeys
@@ -36,6 +36,7 @@ SHELL ["/bin/bash", "-c"]
 
 # Install conda and other dependencies
 # Keep the conda and Ray versions below in sync with the ones in skylet.constants
+# Keep this section in sync with the custom image optimization recommendations in our docs (kubernetes-getting-started.rst)
 RUN curl https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x86_64.sh -o Miniconda3-Linux-x86_64.sh && \
     bash Miniconda3-Linux-x86_64.sh -b && \
     eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true && conda activate base && \

diff --git a/docs/source/getting-started/installation.rst b/docs/source/getting-started/installation.rst
@@ -267,6 +267,14 @@ The :code:`~/.oci/config` file should contain the following fields:
   # Note that we should avoid using full home path for the key_file configuration, e.g. use ~/.oci instead of /home/username/.oci
   key_file=~/.oci/oci_api_key.pem
 
+By default, the provisioned nodes will be in the root `compartment <https://docs.oracle.com/en/cloud/foundation/cloud_architecture/governance/compartments.html>`__. To specify the `compartment <https://docs.oracle.com/en/cloud/foundation/cloud_architecture/governance/compartments.html>`_ other than root, create/edit the file :code:`~/.sky/config.yaml`, put the compartment's OCID there, as the following:
+
+.. code-block:: text
+
+  oci:
+    default:
+      compartment_ocid: ocid1.compartment.oc1..aaaaaaaa......
+
 
 Lambda Cloud
 ~~~~~~~~~~~~~~~~~~

diff --git a/docs/source/reference/comparison.rst b/docs/source/reference/comparison.rst
@@ -46,7 +46,7 @@ SkyPilot provides faster iteration for interactive development. For example, a c
 * :strong:`With SkyPilot, a single command (`:literal:`sky launch`:strong:`) takes care of everything.` Behind the scenes, SkyPilot provisions pods, installs all required dependencies, executes the job, returns logs, and provides SSH and VSCode access to debug.
 
 
-.. figure:: https://blog.skypilot.co/ai-on-kubernetes/images/k8s_vs_skypilot_iterative_v2.png
+.. figure:: https://i.imgur.com/xfCfz4N.png
     :align: center
     :width: 95%
     :alt: Iterative Development with Kubernetes vs SkyPilot