Configure workflow-run-job-linux to use sccache-dist build cluster #2672

Draft: wants to merge 54 commits into base: main
Commits (54)
c3263ac
Configure workflow-run-job-linux to use sccache-dist build cluster [s…
trxcllnt Oct 30, 2024
3d8e058
try parallelism=(nproc * 2) [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Nov 1, 2024
3c706f5
try on cpu4 runners with 4x parallelism [skip-vdc] [skip-docs] [skip-…
trxcllnt Nov 1, 2024
476fb43
turn off SCCACHE_NO_CACHE [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Nov 1, 2024
9868491
initializeCommand should not pass two arguments to `bash -c` [skip-vd…
trxcllnt Nov 1, 2024
d389ba4
build uncached on cpu8 with -j16 [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Nov 1, 2024
4425803
test fewer jobs [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Nov 1, 2024
3ef25fe
test all jobs, use cpu4 runners [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Nov 12, 2024
569d179
use cpu16 for tests, only use build cluster for build jobs [skip-vdc]…
trxcllnt Nov 13, 2024
8d68433
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Nov 23, 2024
95c726e
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Nov 23, 2024
3aeaefe
update cuda12.6ext-gcc13 devcontainer [skip-vdc] [skip-docs] [skip-ra…
trxcllnt Nov 23, 2024
8c01095
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Nov 24, 2024
1983388
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Nov 27, 2024
3d36247
use -j64 [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Nov 27, 2024
e884b4e
test with cpu16 runners [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Nov 27, 2024
31c3817
include hidden files (.ninja_log) in job artifact [skip-vdc] [skip-do…
trxcllnt Nov 27, 2024
47135b9
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Nov 30, 2024
5a976d2
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Dec 2, 2024
7085ad2
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Dec 2, 2024
6e1c181
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Dec 3, 2024
97a5c91
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Dec 3, 2024
c3aaba3
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Dec 28, 2024
bcfa0b5
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Dec 28, 2024
4eb6e78
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Dec 28, 2024
f684fe3
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Jan 13, 2025
32ef519
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Jan 14, 2025
501b259
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Jan 14, 2025
6529947
add script to print dist status table [skip-vdc] [skip-docs] [skip-ra…
trxcllnt Jan 14, 2025
8a1e909
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Jan 15, 2025
2ca46f7
include timestamp in dist stats
trxcllnt Jan 15, 2025
c901c62
include quotes in csv output [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Jan 15, 2025
4ca3499
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Jan 16, 2025
74bb161
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Jan 16, 2025
288e3b5
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Jan 16, 2025
82d5a45
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Jan 17, 2025
bc89a11
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Jan 17, 2025
bdd9395
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Jan 24, 2025
dd3d91f
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Jan 24, 2025
f0cf283
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Jan 27, 2025
eb245ca
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Jan 28, 2025
574d6fe
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Jan 28, 2025
d9a6dc8
use 4-core runners [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Jan 29, 2025
f2537f9
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Jan 31, 2025
5ddbbc7
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Jan 31, 2025
f845766
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Feb 3, 2025
3af5ded
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Feb 3, 2025
cedfeaf
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Feb 4, 2025
88c964a
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Feb 4, 2025
7dabc2f
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Feb 6, 2025
aeb23df
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Feb 6, 2025
eae4a82
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Feb 6, 2025
e22e8a3
Merge branch 'main' of github.com:NVIDIA/cccl into fea/use-sccache-bu…
trxcllnt Feb 11, 2025
fb9ffc9
bump sccache version [skip-vdc] [skip-docs] [skip-rapids]
trxcllnt Feb 11, 2025
9 changes: 9 additions & 0 deletions .github/actions/workflow-build/build-workflow.py
@@ -424,6 +424,15 @@ def generate_dispatch_job_runner(matrix_job, job_type):

job_info = get_job_type_info(job_type)
if not job_info["gpu"]:
# Use smaller 4-core runners for build jobs if we can
if job_type == "build":
# ClangCUDA, MSVC, and NVHPC should use 16-core runners
if (
("clang" not in matrix_job["cudacxx"]) and
("msvc" not in matrix_job["cxx"]) and
("nvhpc" not in matrix_job["cxx"])
):
return f"{runner_os}-{cpu}-cpu4"
return f"{runner_os}-{cpu}-cpu16"

gpu = get_gpu(matrix_job["gpu"])
186 changes: 162 additions & 24 deletions .github/actions/workflow-run-job-linux/action.yml
@@ -38,6 +38,10 @@ inputs:
host:
description: "The host compiler to use when selecting a devcontainer."
required: true
# This token must have the "read:enterprise" scope
dist-token:
description: "The token used to authenticate with the sccache-dist build cluster."
required: false

runs:
using: "composite"
Expand Down Expand Up @@ -72,12 +76,15 @@ runs:
# Dereferencing the command from an env var instead of a GHA input avoids issues with escaping
# semicolons and other special characters (e.g. `-arch "60;70;80"`).
COMMAND: "${{inputs.command}}"
DIST_TOKEN: "${{inputs.dist-token}}"
AWS_ACCESS_KEY_ID: "${{env.AWS_ACCESS_KEY_ID}}"
AWS_SESSION_TOKEN: "${{env.AWS_SESSION_TOKEN}}"
AWS_SECRET_ACCESS_KEY: "${{env.AWS_SECRET_ACCESS_KEY}}"
run: |
echo "[host] github.workspace: ${{github.workspace}}"
echo "[host] runner.temp: ${{runner.temp}}"
echo "[container] GITHUB_WORKSPACE: ${GITHUB_WORKSPACE:-}"
echo "[container] RUNNER_TEMP: ${RUNNER_TEMP:-}"
echo "[container] PWD: $(pwd)"

# Necessary because we're doing docker-outside-of-docker:
@@ -87,12 +94,14 @@
ln -s "$(pwd)" "${{github.workspace}}"
cd "${{github.workspace}}"

mkdir artifacts
echo "[container] new PWD: $(pwd)"

cat <<'EOF' > ci.sh
cat <<"EOF" > "$RUNNER_TEMP/ci.sh"
#! /usr/bin/env bash
set -euo pipefail
echo -e "\e[1;34mRunning as '$(whoami)' user in $(pwd):\e[0m"
# Print current dist status to verify we're connected
echo -e "\e[1;34mBuild cluster:\n$(./ci/sccache_dist_status.sh | sed 's/\"//g' | column -t -s, -R $(seq -s, 1 12))\e[0m"
echo -e "\e[1;34m${COMMAND}\e[0m"
eval "${COMMAND}"
exit_code=$?
Expand Down Expand Up @@ -128,16 +137,18 @@ runs:
find_and_copy "sccache_stats.json" || :
EOF

chmod +x ci.sh
chmod +x "$RUNNER_TEMP/ci.sh"

mkdir "$RUNNER_TEMP/.aws";
mkdir -p "$RUNNER_TEMP/.aws"

cat <<EOF > "$RUNNER_TEMP/.aws/config"
[default]
bucket=rapids-sccache-devs
region=us-east-2
EOF

chmod 0664 "$RUNNER_TEMP/.aws/config"

cat <<EOF > "$RUNNER_TEMP/.aws/credentials"
[default]
aws_access_key_id=$AWS_ACCESS_KEY_ID
@@ -146,32 +157,128 @@
EOF

chmod 0600 "$RUNNER_TEMP/.aws/credentials"
chmod 0664 "$RUNNER_TEMP/.aws/config"

declare -a gpu_request=()
mkdir -p "$RUNNER_TEMP/.config/sccache"

# Configure the sccache client
cat <<EOF > "$RUNNER_TEMP/.config/sccache/config"
server_startup_timeout_ms = $((5 * 60 * 1000))
[cache.disk]
size = 0
[cache.disk.preprocessor_cache_mode]
use_preprocessor_cache_mode = false
EOF

chmod 0664 "$RUNNER_TEMP/.config/sccache/config"

declare -a extra_launch_args=(
# When we cache, use a separate namespace
--env "SCCACHE_S3_KEY_PREFIX=cccl-test-sccache-dist"
)

OS="$(uname -s)"
CPUS="$(nproc --all)"
ARCH="$(dpkg --print-architecture)"

# Only use the build cluster for build jobs.
# Everything should be cached for test jobs.
not_test_job="$(grep -q '"./ci/test_' <<< "$COMMAND" || echo $?)"

# Temporary: don't use sccache-dist for NVHPC or clang-cuda
# until sccache packages up a correct toolchain for the server
not_nvhpc="$(grep -q 'nvhpc' <<< "${{inputs.host}}" || echo $?)"
not_clang_cuda="$(grep -q '\-cuda "clang' <<< "$COMMAND" || echo $?)"

# If a test job, over-subscribe -j to download more cache objects at once
if test -z "${not_test_job:+x}"; then
extra_launch_args+=(
--env "PARALLEL_LEVEL=$((CPUS * 2))"
)
else
extra_launch_args+=(
# Repopulate the cache
--env "SCCACHE_RECACHE=1"
)
fi

# Download new sccache binary
mkdir -p "$RUNNER_TEMP/bin"
curl -fsSL \
"https://github.com/trxcllnt/sccache/releases/download/v0.9.1-rapids.25/sccache-v0.9.1-rapids.25-$(uname -m)-unknown-linux-musl.tar.gz" \
| tar -C "$RUNNER_TEMP/bin" -zf - --wildcards --strip-components=1 -x '*/sccache'

# If this is not a test job and not one of the excluded compilers, use the build cluster
if test -n "${not_nvhpc:+x}" \
&& test -n "${not_test_job:+x}" \
&& test -n "${not_clang_cuda:+x}" \
&& test -n "${DIST_TOKEN:+x}"; then

# Configure sccache client to talk to the build cluster
cat <<EOF >> "$RUNNER_TEMP/.config/sccache/config"
[dist]
# Never fallback to building locally
max_retries = inf
# Retry failed builds 10 times before building locally
# max_retries = 10
scheduler_url = "https://${ARCH}.${OS,,}.sccache.gha-runners.nvidia.com"

# Build cluster auth
[dist.auth]
type = "token"
token = "${DIST_TOKEN}"

# Build cluster network config
[dist.net]
connect_timeout = 10
request_timeout = 1200
# connection_pool = true
EOF

extra_launch_args+=(
# Over-subscribe -j to keep the build cluster busy
# --env "PARALLEL_LEVEL=$((CPUS * 4))"
# --env "PARALLEL_LEVEL=64"
--env "PARALLEL_LEVEL=1000"

# Uncomment to repopulate the cache
# --env "SCCACHE_RECACHE=1"

# Uncomment to not use the cache at all
# --env "SCCACHE_NO_CACHE=1"

# Instruct sccache to write debug logs to a file we can upload
--env "SCCACHE_SERVER_LOG=sccache=debug"
--env "SCCACHE_ERROR_LOG=/home/coder/cccl/sccache.log"

# Mount in new sccache binary
--volume "${{runner.temp}}/bin/sccache:/usr/bin/sccache:ro"
)

if ! grep -q '11.1' <<< "${{inputs.cuda}}"; then
# Compile device objects in parallel
extra_launch_args+=(
--env "NVCC_APPEND_FLAGS=-t=100"
)
fi
fi

# Explicitly pass which GPU to use if on a GPU runner
if [[ "${RUNNER}" = *"-gpu-"* ]]; then
gpu_request+=(--gpus "device=${NVIDIA_VISIBLE_DEVICES}")
extra_launch_args+=(--gpus "device=${NVIDIA_VISIBLE_DEVICES}")
fi

host_path() {
sed "s@/__w@$(dirname "$(dirname "${{github.workspace}}")")@" <<< "$1"
}

# If the image contains "cudaXX.Yext"...
if [[ "${IMAGE}" =~ cuda[0-9.]+ext ]]; then
cuda_ext_request="--cuda-ext"
extra_launch_args+=(--cuda-ext)
fi

# Launch this container using the host's docker daemon
set -x

${{github.event.repository.name}}/.devcontainer/launch.sh \
--docker \
--cuda ${{inputs.cuda}} \
--host ${{inputs.host}} \
${cuda_ext_request:-} \
"${gpu_request[@]}" \
--env "CI=$CI" \
--env "AWS_ROLE_ARN=" \
--env "COMMAND=$COMMAND" \
@@ -185,33 +292,64 @@
--env "GITHUB_WORKSPACE=$GITHUB_WORKSPACE" \
--env "GITHUB_REPOSITORY=$GITHUB_REPOSITORY" \
--env "GITHUB_STEP_SUMMARY=$GITHUB_STEP_SUMMARY" \
--volume "${{github.workspace}}/ci.sh:/ci.sh" \
--volume "${{github.workspace}}/artifacts:/artifacts" \
--volume "$(host_path "$RUNNER_TEMP")/.aws:/root/.aws" \
--volume "${{runner.temp}}/ci.sh:/ci.sh:ro" \
--volume "${{runner.temp}}/.aws:/root/.aws" \
--volume "${{runner.temp}}/.config:/root/.config:ro" \
--volume "$(dirname "$(dirname "${{github.workspace}}")"):/__w" \
"${extra_launch_args[@]}" \
-- /ci.sh

- name: Prepare job artifacts
- if: ${{ always() }}
name: Create job artifact dir
shell: bash --noprofile --norc -euo pipefail {0}
run: |
echo "Prepare job artifacts"
result_dir="jobs/${{inputs.id}}"
mkdir -p "$result_dir"
echo "result_dir=$result_dir" >> "$GITHUB_ENV"

- if: ${{ success() }}
name: Record job success
shell: bash --noprofile --norc -euo pipefail {0}
run: |
touch "$result_dir/success"

artifacts_exist="$(ls -A artifacts)"
if [ "$artifacts_exist" ]; then
cp -rv artifacts/* "$result_dir"
fi
- if: ${{ always() }}
name: Prepare job artifacts
shell: bash --noprofile --norc -euo pipefail {0}
run: |
echo "Prepare job artifacts"

# chmod all temp contents 777 so the runner can delete them
find "$RUNNER_TEMP/" -exec chmod 0777 {} \;

# Finds a matching file in the repo directory and copies it to the results directory.
find_and_copy() {
pat="$1"
dir="${{github.event.repository.name}}"
filepath="$(find "$dir/" -type f -path "$dir/$pat" -print -quit)"
if [[ -z "$filepath" ]]; then
echo "File with pattern '$dir/$pat' does not exist in repo directory."
return 1
fi
cp -v "$filepath" "$result_dir"
}

# Ignore failures
find_and_copy "sccache.log" || :
find_and_copy "build/*/.ninja_log" || :
find_and_copy "build/*/build.ninja" || :
find_and_copy "build/*/rules.ninja" || :
find_and_copy "build/*/sccache_stats.json" || :

echo "::group::Job artifacts"
tree "$result_dir"
echo "::endgroup::"

- name: Upload job artifacts
- if: ${{ always() }}
name: Upload job artifacts
uses: actions/upload-artifact@v4
with:
name: jobs-${{inputs.id}}
path: jobs
compression-level: 0
include-hidden-files: true
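
The gate above that decides whether to route compiles to the build cluster relies on a shell idiom: `grep -q ... || echo $?` leaves a variable empty when the pattern matches and sets it to "1" otherwise, and `${var:+x}` turns that into something `test -z`/`test -n` can check. A minimal standalone sketch of that idiom, using hypothetical COMMAND and HOST_COMPILER values and omitting the clang-cuda and DIST_TOKEN checks:

#! /usr/bin/env bash
set -euo pipefail

# Hypothetical inputs for illustration only.
COMMAND='"./ci/build_cub.sh" -cxx g++-13'
HOST_COMPILER="gcc13"

# `grep -q` exits 0 on a match and 1 otherwise; `|| echo $?` therefore leaves
# the flag empty when the pattern matches and sets it to "1" when it does not.
not_test_job="$(grep -q '"./ci/test_' <<< "$COMMAND" || echo $?)"
not_nvhpc="$(grep -q 'nvhpc' <<< "$HOST_COMPILER" || echo $?)"

# "${var:+x}" expands to "x" only when var is non-empty, so:
#   test -z "${not_test_job:+x}"  ->  COMMAND is a test job
#   test -n "${not_test_job:+x}"  ->  COMMAND is a build job
if test -n "${not_test_job:+x}" && test -n "${not_nvhpc:+x}"; then
  echo "build job with an eligible compiler: route compiles to sccache-dist"
else
  echo "test job or excluded compiler: compile locally against the S3 cache"
fi
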
41 changes: 31 additions & 10 deletions .github/actions/workflow-run-job-windows/action.yml
@@ -63,37 +63,58 @@
[System.Environment]::SetEnvironmentVariable('SCCACHE_S3_NO_CREDENTIALS','${{env.SCCACHE_S3_NO_CREDENTIALS}}');
git config --global --add safe.directory '${{steps.paths.outputs.MOUNT_REPO}}';
${{inputs.command}}"
- name: Prepare job artifacts

- if: ${{ always() }}
name: Create job artifact dir
shell: bash --noprofile --norc -euo pipefail {0}
id: done
run: |
echo "SUCCESS=true" | tee -a "${GITHUB_OUTPUT}"

result_dir="jobs/${{inputs.id}}"
mkdir -p "$result_dir"
echo "result_dir=$result_dir" >> "$GITHUB_ENV"

- if: ${{ success() }}
name: Record job success
shell: bash --noprofile --norc -euo pipefail {0}
run: |
touch "$result_dir/success"

- if: ${{ always() }}
name: Prepare job artifacts
shell: bash --noprofile --norc -euo pipefail {0}
run: |
echo "Prepare job artifacts"

# chmod all temp contents 777 so the runner can delete them
find "$RUNNER_TEMP/" -exec chmod 0777 {} \;

# Finds a matching file in the repo directory and copies it to the results directory.
find_and_copy() {
filename="$1"
filepath="$(find ${{github.event.repository.name}} -name "${filename}" -print -quit)"
pat="$1"
dir="${{github.event.repository.name}}"
filepath="$(find "$dir/" -type f -path "$dir/$pat" -print -quit)"
if [[ -z "$filepath" ]]; then
echo "${filename} does not exist in repo directory."
echo "File with pattern '$dir/$pat' does not exist in repo directory."
return 1
fi
cp -v "$filepath" "$result_dir"
}

find_and_copy "sccache_stats.json" || true # Ignore failures
# Ignore failures
find_and_copy "sccache.log" || true
find_and_copy "build/*/.ninja_log" || true
find_and_copy "build/*/build.ninja" || true
find_and_copy "build/*/rules.ninja" || true
find_and_copy "build/*/sccache_stats.json" || true

echo "::group::Job artifacts"
find "$result_dir" # Tree not available in this image.
ls -l "$result_dir"
echo "::endgroup::"

- name: Upload job artifacts
- if: ${{ always() }}
name: Upload job artifacts
uses: actions/upload-artifact@v4
with:
name: jobs-${{inputs.id}}
path: jobs
compression-level: 0
include-hidden-files: true
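
One note on the reworked `find_and_copy` helper above: switching from `find -name "$filename"` to `find -type f -path "$dir/$pat"` is what allows callers to pass path globs such as `build/*/.ninja_log`, because `-path` matches the glob against the whole path rather than just the basename. A small sketch under a hypothetical directory layout:

#! /usr/bin/env bash
set -euo pipefail

# Hypothetical repo layout created in a scratch directory.
dir="$(mktemp -d)/cccl"
mkdir -p "$dir/build/cub-cpp17" "$dir/build/thrust-cpp17"
touch "$dir/build/cub-cpp17/.ninja_log" "$dir/build/thrust-cpp17/.ninja_log"

# -path matches the entire path against the glob; -print -quit stops at the
# first match, mirroring the helper's copy-one-file behavior.
find "$dir/" -type f -path "$dir/build/*/.ninja_log" -print -quit
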
2 changes: 2 additions & 0 deletions .github/workflows/ci-workflow-nightly.yml
@@ -55,6 +55,7 @@ jobs:
name: ${{ matrix.name }}
if: ${{ toJSON(fromJSON(needs.build-workflow.outputs.workflow)['linux_two_stage']['keys']) != '[]' }}
needs: build-workflow
secrets: inherit
permissions:
id-token: write
contents: read
Expand Down Expand Up @@ -85,6 +86,7 @@ jobs:
name: ${{ matrix.name }}
if: ${{ toJSON(fromJSON(needs.build-workflow.outputs.workflow)['linux_standalone']['keys']) != '[]' }}
needs: build-workflow
secrets: inherit
permissions:
id-token: write
contents: read
2 changes: 2 additions & 0 deletions .github/workflows/ci-workflow-pull-request.yml
@@ -74,6 +74,7 @@ jobs:
${{ !contains(github.event.head_commit.message, '[skip-matrix]') &&
toJSON(fromJSON(needs.build-workflow.outputs.workflow)['linux_two_stage']['keys']) != '[]' }}
needs: build-workflow
secrets: inherit
permissions:
id-token: write
contents: read
Expand Down Expand Up @@ -108,6 +109,7 @@ jobs:
${{ !contains(github.event.head_commit.message, '[skip-matrix]') &&
toJSON(fromJSON(needs.build-workflow.outputs.workflow)['linux_standalone']['keys']) != '[]' }}
needs: build-workflow
secrets: inherit
permissions:
id-token: write
contents: read