Roll forward PR #1275 with fixes (#1298)
```
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

* [gpu] strict driver and cuda version assignment

Roll forward #1275

gpu/install_gpu_driver.sh
  * updated supported versions
  * moved all code into functions, which are called at the footer of
    the installer
  * install cuda and driver exclusively from run files
  * extract cuda and driver version from urls if supplied
  * support supplying cuda version as x.y.z instead of just x.y
  * build nccl from source
  * poll dpkg lock status for up to 60 seconds
  * cache build artifacts from kernel driver and nccl
  * use consistent arguments to curl
  * create is_complete and mark_complete functions to allow re-running
  * Tested more CUDA minor versions
  * Printing warnings when combination provided is known to fail
  * only install build dependencies on build cache miss
  * added optional pytorch install option
  * renamed metadata attribute cert_modulus_md5sum to modulus_md5sum
  * verified that proprietary kernel drivers work with older dataproc images
  * clear dkms key immediately after use
  * cache .run files to GCS to reduce fetches from origin
  * Install nvidia container toolkit and select container runtime
  * tested installer on clusters without GPUs attached
  * fixed a problem with ops agent not installing ; using venv
  * Older CapacityScheduler does not permit use of gpu resources ;
    switch to FairScheduler on 2.0 and below
  * caching result of nvidia-smi in spark.executor.resource.gpu.discoveryScript
  * setting some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf
  * Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
  * Hold all NVIDIA-related packages from upgrading unintentionally
  * skipping proxy setup if http-proxy metadata not set
  * added function to check secure-boot and os version compatibility
  * harden sshd config
  * install spark rapids acceleration libraries

gpu/manual-test-runner.sh
  * order commands correctly

gpu/run-bazel-tests.sh
  * do not retry flaky tests

gpu/test_gpu.py
  * clearer test skipping logic
  * added instructions on how to test pyspark
  * remove skip of rocky9 tests

* Correct test failures on 2.0-debian10

gpu/install_gpu_driver.sh

* Do not use fair scheduler for 2.0 clusters
* comment out spark-defaults.conf config options as guidance for tuning

gpu/test_gpu.py

* There are now three tests run from the verify_instance_spark function
* * Run the SparkPi example with no parameters specified
* * Run the JavaIndexToStringExample with many parameters specified
* * Run the JavaIndexToStringExample with few parameters specified

-----BEGIN PGP SIGNATURE-----

iQGzBAEBCgAdFiEEWBh4gudL5t7O9mieFuBp2E4LHyAFAmeuX5wACgkQFuBp2E4L
HyBnxQv/fWnbrBx0NuZQJGJt8qfuja5zSbmZL2XdgqLEkzv+y78jrIWX4wQYVDni
Hy5aN8HIRttslitNj+f4et0XSpxRFSvwJ/JZ362RMCUVUrNG/W6p+haIzPkzJz2+
0SgAaAE8JL8NOjPgCqLD7ZnaHBsA8ZPq9lXJkktkzdxzo6+jCoPY8GHELg5Cfm2e
x8mzKMwgRWIOPiW3kzvxIEJCdkQ+oM+18TyWfdal/QKDNvNTepVHeSCwzgrUEq6y
lXv+DsAfI8s8zLp1WQQt5fV+eLO66ey98RIpGedLlKhuOaTAVOyq+6ZrPva2RQEd
2QTEYWRyRynST+Cy/fLST/rZhRKoA4U0WLEru2XtIXuGU6UIdZT4ob2VEk25hxaH
FHpi3zoHzK28sx6v7qM7DuGYgyUwhL+mVddWXdwIvPvDXbJsf2ATFCCqGbCOjLYA
WXKeGg69BrERvjbQqVcpppyy2mw+CMBEPLGix7VwmVnJdU1zIqp0vvmqUD66D3gY
McpSvQyd
=llel
-----END PGP SIGNATURE-----
```
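The commit message mentions `is_complete` and `mark_complete` functions that allow re-running the installer, and polling the dpkg lock for up to 60 seconds. A minimal sketch of how such helpers could look follows. It is not the code from gpu/install_gpu_driver.sh: the marker-file approach, the `STATE_DIR` location, and the `run_once` and `poll_until` names are assumptions made for illustration.

```shell
# Hypothetical sketch. STATE_DIR, run_once and poll_until are illustrative
# names and are not taken from gpu/install_gpu_driver.sh.
STATE_DIR="${STATE_DIR:-/tmp/gpu-installer-state}"

# Succeed if the named step was already marked complete.
is_complete() {
  [ -f "${STATE_DIR}/$1.complete" ]
}

# Record that the named step finished by creating a marker file.
mark_complete() {
  mkdir -p "${STATE_DIR}" && touch "${STATE_DIR}/$1.complete"
}

# Run a command once: skip it when its marker exists, mark it on success.
run_once() {
  local step="$1"
  shift
  if is_complete "${step}"; then
    echo "skipping ${step}: already complete"
    return 0
  fi
  "$@" && mark_complete "${step}"
}

# Retry a command once per second until it succeeds or the timeout
# (in seconds) elapses. Returns 1 on timeout.
poll_until() {
  local timeout="$1"
  shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    [ "$(date +%s)" -ge "${deadline}" ] && return 1
    sleep 1
  done
  return 0
}
```

Under these assumptions, a step would be wrapped as `run_once install_driver install_driver_fn`, and a lock wait as `poll_until 60 dpkg_lock_is_free`, where `install_driver_fn` and `dpkg_lock_is_free` are placeholder functions the caller supplies. A failed step leaves no marker, so the next run retries it.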
cjac authored Feb 13, 2025
1 parent 85c1dde commit 7e87522
Showing 6 changed files with 1,538 additions and 514 deletions.
14 changes: 10 additions & 4 deletions gpu/Dockerfile
@@ -15,19 +15,25 @@ RUN apt-get -qq update \
 curl jq less screen > /dev/null 2>&1 && apt-get clean

 # Install bazel signing key, repo and package
-ENV bazel_kr_path=/usr/share/keyrings/bazel-release.pub.gpg
-ENV bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8"
+ENV bazel_kr_path=/usr/share/keyrings/bazel-keyring.gpg \
+bazel_version=7.4.0 \
+bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8" \
+DEBIAN_FRONTEND=noninteractive

 RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg \
 | gpg --dearmor -o "${bazel_kr_path}" \
 && echo "deb [arch=amd64 signed-by=${bazel_kr_path}] ${bazel_repo_data}" \
 | dd of=/etc/apt/sources.list.d/bazel.list status=none \
 && apt-get update -qq

-RUN apt-get autoremove -y -qq && \
-apt-get install -y -qq default-jdk python3-setuptools bazel > /dev/null 2>&1 && \
+RUN apt-get autoremove -y -qq > /dev/null 2>&1 && \
+apt-get install -y -qq default-jdk python3-setuptools bazel-${bazel_version} > /dev/null 2>&1 && \
 apt-get clean

+# Set bazel-${bazel_version} as the default bazel alternative in this container
+RUN update-alternatives --install /usr/bin/bazel bazel /usr/bin/bazel-${bazel_version} 1 && \
+update-alternatives --set bazel /usr/bin/bazel-${bazel_version}
+
 # Install here any utilities you find useful when troubleshooting
 RUN apt-get -y -qq install emacs-nox vim uuid-runtime > /dev/null 2>&1 && apt-get clean
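
The Dockerfile change pins bazel to `bazel_version` and makes that build the default via `update-alternatives`. One way to confirm the pin took effect inside the container is to compare the pinned value with what `bazel --version` reports. The helper below is a sketch, not part of this commit; the name `bazel_version_matches` is hypothetical, and it assumes the reported string has the form `bazel <version>`.

```shell
# Hypothetical helper, not part of this commit.
# Succeeds when the reported version string ("bazel <version>") matches
# the expected pinned version.
bazel_version_matches() {
  local expected="$1"
  local reported="$2"
  # Strip the leading "bazel " prefix, then compare the remainder.
  [ "${reported#bazel }" = "${expected}" ]
}
```

Inside the image this could be called as `bazel_version_matches "${bazel_version}" "$(bazel --version)"`. Passing the reported string as an argument keeps the comparison testable on a machine without bazel installed.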
