Roll forward PR #1275 with fixes (#1298)

```
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

* [gpu] strict driver and cuda version assignment

Roll forward #1275

gpu/install_gpu_driver.sh
  * updated supported versions
  * moved all code into functions, which are called at the footer of the installer
  * install cuda and driver exclusively from run files
  * extract cuda and driver version from urls if supplied
  * support supplying cuda version as x.y.z instead of just x.y
  * build nccl from source
  * poll dpkg lock status for up to 60 seconds
  * cache build artifacts from kernel driver and nccl
  * use consistent arguments to curl
  * create is_complete and mark_complete functions to allow re-running
  * Tested more CUDA minor versions
  * Printing warnings when combination provided is known to fail
  * only install build dependencies on build cache miss
  * added optional pytorch install option
  * renamed metadata attribute cert_modulus_md5sum to modulus_md5sum
  * verified that proprietary kernel drivers work with older dataproc images
  * clear dkms key immediately after use
  * cache .run files to GCS to reduce fetches from origin
  * Install nvidia container toolkit and select container runtime
  * tested installer on clusters without GPUs attached
  * fixed a problem with ops agent not installing; using venv
  * Older CapacityScheduler does not permit use of gpu resources; switch to FairScheduler on 2.0 and below
  * caching result of nvidia-smi in spark.executor.resource.gpu.discoveryScript
  * setting some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf
  * Installing gcc-12 on ubuntu22 to fix kernel driver FTBFS
  * Hold all NVIDIA-related packages from upgrading unintentionally
  * skipping proxy setup if http-proxy metadata not set
  * added function to check secure-boot and os version compatibility
  * harden sshd config
  * install spark rapids acceleration libraries

gpu/manual-test-runner.sh
  * order commands correctly

gpu/run-bazel-tests.sh
  * do not retry flakey tests

gpu/test_gpu.py
  * clearer test skipping logic
  * added instructions on how to test pyspark
  * remove skip of rocky9 tests

* Correct test failures on 2.0-debian10

gpu/install_gpu_driver.sh
  * Do not use fair scheduler for 2.0 clusters
  * comment out spark-defaults.conf config options as guidance for tuning

gpu/test_gpu.py
  * There are now three tests run from the verify_instance_spark function
    * Run the SparkPi example with no parameters specified
    * Run the JavaIndexToStringExample with many parameters specified
    * Run the JavaIndexToStringExample with few parameters specified
-----BEGIN PGP SIGNATURE-----

iQGzBAEBCgAdFiEEWBh4gudL5t7O9mieFuBp2E4LHyAFAmeuX5wACgkQFuBp2E4L
HyBnxQv/fWnbrBx0NuZQJGJt8qfuja5zSbmZL2XdgqLEkzv+y78jrIWX4wQYVDni
Hy5aN8HIRttslitNj+f4et0XSpxRFSvwJ/JZ362RMCUVUrNG/W6p+haIzPkzJz2+
0SgAaAE8JL8NOjPgCqLD7ZnaHBsA8ZPq9lXJkktkzdxzo6+jCoPY8GHELg5Cfm2e
x8mzKMwgRWIOPiW3kzvxIEJCdkQ+oM+18TyWfdal/QKDNvNTepVHeSCwzgrUEq6y
lXv+DsAfI8s8zLp1WQQt5fV+eLO66ey98RIpGedLlKhuOaTAVOyq+6ZrPva2RQEd
2QTEYWRyRynST+Cy/fLST/rZhRKoA4U0WLEru2XtIXuGU6UIdZT4ob2VEk25hxaH
FHpi3zoHzK28sx6v7qM7DuGYgyUwhL+mVddWXdwIvPvDXbJsf2ATFCCqGbCOjLYA
WXKeGg69BrERvjbQqVcpppyy2mw+CMBEPLGix7VwmVnJdU1zIqp0vvmqUD66D3gY
McpSvQyd
=llel
-----END PGP SIGNATURE-----
```
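The commit message names three mechanisms without showing them: `is_complete` and `mark_complete` for re-runnable phases, polling the dpkg lock for up to 60 seconds, and extracting an x.y.z version from a supplied URL. The block below is a minimal sketch of how such helpers could look. It is not the code from `gpu/install_gpu_driver.sh`. The state directory, the sentinel file naming, the `run_once`, `wait_for_dpkg_lock`, `extract_version`, and `major_minor` names, and the use of `fuser` (from psmisc) against `/var/lib/dpkg/lock-frontend` are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names and paths are assumptions, not taken from
# gpu/install_gpu_driver.sh.

# Directory holding one sentinel file per completed phase (assumed location).
STATE_DIR="${STATE_DIR:-/tmp/gpu-installer-state}"

# Succeeds if the named phase was already marked complete.
is_complete() {
  [[ -f "${STATE_DIR}/$1.complete" ]]
}

# Records that the named phase finished.
mark_complete() {
  mkdir -p "${STATE_DIR}"
  touch "${STATE_DIR}/$1.complete"
}

# Hypothetical wrapper: runs a command once per phase name, so the installer
# can be re-run without repeating phases that already succeeded.
run_once() {
  local phase="$1"; shift
  if is_complete "${phase}"; then
    echo "skipping ${phase}: already complete"
    return 0
  fi
  "$@" && mark_complete "${phase}"
}

# Polls once per second until no process holds the lock file, giving up
# after the timeout (default 60 seconds). Assumes fuser is installed.
wait_for_dpkg_lock() {
  local timeout="${1:-60}" lock="${2:-/var/lib/dpkg/lock-frontend}"
  local waited=0
  while fuser "${lock}" >/dev/null 2>&1; do
    if (( waited >= timeout )); then
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}

# Prints the first x.y.z version found in a URL; prints nothing if none.
extract_version() {
  local url="$1"
  local re='([0-9]+\.[0-9]+\.[0-9]+)'
  if [[ "${url}" =~ ${re} ]]; then
    echo "${BASH_REMATCH[1]}"
  fi
}

# Reduces x.y.z to x.y, for callers that only need major.minor.
major_minor() {
  echo "$1" | cut -d. -f1,2
}
```

A phase would then be wrapped as `run_once install_driver some_install_function`, where `some_install_function` is a placeholder. Using sentinel files rather than inspecting installed packages keeps the re-run check cheap and independent of the package manager, at the cost of trusting that a marked phase really left the system in the expected state.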
GoogleCloudDataproc · Feb 13, 2025 · 7e87522
1 parent 85c1dde
Showing 6 changed files with 1,538 additions and 514 deletions.
diff --git a/gpu/Dockerfile b/gpu/Dockerfile
@@ -15,19 +15,25 @@ RUN apt-get -qq update \
      curl jq less screen > /dev/null 2>&1  && apt-get clean
 
 # Install bazel signing key, repo and package
-ENV bazel_kr_path=/usr/share/keyrings/bazel-release.pub.gpg
-ENV bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8"
+ENV bazel_kr_path=/usr/share/keyrings/bazel-keyring.gpg \
+    bazel_version=7.4.0 \
+    bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8" \
+    DEBIAN_FRONTEND=noninteractive
 
 RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg \
       | gpg --dearmor -o "${bazel_kr_path}" \
     && echo "deb [arch=amd64 signed-by=${bazel_kr_path}] ${bazel_repo_data}" \
       | dd of=/etc/apt/sources.list.d/bazel.list status=none \
     && apt-get update -qq
 
-RUN apt-get autoremove -y -qq && \
-    apt-get install -y -qq default-jdk python3-setuptools bazel > /dev/null 2>&1 && \
+RUN apt-get autoremove -y -qq > /dev/null 2>&1 && \
+    apt-get install -y -qq default-jdk python3-setuptools bazel-${bazel_version} > /dev/null 2>&1 && \
     apt-get clean
 
+# Set bazel-${bazel_version} as the default bazel alternative in this container
+RUN update-alternatives --install /usr/bin/bazel bazel /usr/bin/bazel-${bazel_version} 1 && \
+    update-alternatives --set bazel /usr/bin/bazel-${bazel_version}
+
 # Install here any utilities you find useful when troubleshooting
 RUN apt-get -y -qq install emacs-nox vim uuid-runtime > /dev/null 2>&1 && apt-get clean