Commit
```
[gpu] strict driver and cuda version assignment

Roll forward #1275

gpu/install_gpu_driver.sh
* updated supported versions
* moved all code into functions, which are called at the foot of the installer
* install CUDA and the driver exclusively from .run files
* extract the CUDA and driver versions from URLs if supplied (sketch below)
* support supplying the CUDA version as x.y.z instead of just x.y
* build NCCL from source
* poll dpkg lock status for up to 60 seconds (sketch below)
* cache build artifacts from the kernel driver and NCCL builds
* use consistent arguments to curl
* create is_complete and mark_complete functions to allow re-running (sketch below)
* tested more CUDA minor versions
* print warnings when the supplied combination is known to fail
* only install build dependencies on a build-cache miss
* added an optional PyTorch install option
* renamed the metadata attribute cert_modulus_md5sum to modulus_md5sum
* verified that proprietary kernel drivers work with older Dataproc images
* clear the dkms key immediately after use
* cache .run files to GCS to reduce fetches from origin (sketch below)
* install the NVIDIA container toolkit and select a container runtime
* tested the installer on clusters without GPUs attached
* fixed a problem with the Ops Agent not installing; now using a venv
* the older CapacityScheduler does not permit use of GPU resources; switch to FairScheduler on 2.0 and below
* cache the result of nvidia-smi in spark.executor.resource.gpu.discoveryScript (sketch below)
* set some reasonable defaults in /etc/spark/conf.dist/spark-defaults.conf
* install gcc-12 on ubuntu22 to fix a kernel-driver FTBFS
* hold all NVIDIA-related packages from upgrading unintentionally (sketch below)
* skip proxy setup if the http-proxy metadata attribute is not set (sketch below)
* added a function to check secure-boot and OS version compatibility
* harden the sshd config
* install the Spark RAPIDS acceleration libraries

gpu/manual-test-runner.sh
* order commands correctly

gpu/run-bazel-tests.sh
* do not retry flaky tests

gpu/test_gpu.py
* clearer test-skipping logic
* added instructions on how to test pyspark
* remove the skip of rocky9 tests
* correct test failures on 2.0-debian10

gpu/install_gpu_driver.sh
* do not use the fair scheduler for 2.0 clusters
* comment out spark-defaults.conf config options as guidance for tuning

gpu/test_gpu.py
* three tests now run from the verify_instance_spark function:
  * run the SparkPi example with no parameters specified
  * run the JavaIndexToStringExample with many parameters specified
  * run the JavaIndexToStringExample with few parameters specified
```
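Extracting versions from a supplied URL can be done with a bash regex; the URL and variable names below are illustrative assumptions, not the installer's actual code. NVIDIA's .run installer URLs encode both versions as cuda_<cuda-version>_<driver-version>_linux.run:

```
# Hypothetical sketch of URL-based version extraction.
url="https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run"
if [[ "${url}" =~ cuda_([0-9]+\.[0-9]+\.[0-9]+)_([0-9.]+)_linux\.run ]]; then
  CUDA_VERSION="${BASH_REMATCH[1]}"    # 12.4.1
  DRIVER_VERSION="${BASH_REMATCH[2]}"  # 550.54.15
fi
```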
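Polling the dpkg lock for up to 60 seconds might look roughly like this; the function name and sleep interval are assumptions:

```
# Hypothetical sketch, not the installer's exact implementation.
wait_for_dpkg_lock() {
  local deadline=$((SECONDS + 60))
  # fuser exits 0 while another process (e.g. unattended-upgrades) holds the lock
  while fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
    (( SECONDS >= deadline )) && { echo "dpkg lock still held after 60s" >&2; return 1; }
    sleep 5
  done
}

wait_for_dpkg_lock && apt-get -y install build-essential
```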
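The is_complete/mark_complete pair presumably guards each phase with a marker file so the installer can be re-run safely; the marker directory and step name here are assumptions:

```
STEP_DIR=/var/lib/gpu-install   # assumed location for completion markers

is_complete()   { test -f "${STEP_DIR}/$1.done"; }
mark_complete() { mkdir -p "${STEP_DIR}" && touch "${STEP_DIR}/$1.done"; }

if ! is_complete driver; then
  # ... install the kernel driver from its .run file here ...
  mark_complete driver
fi
```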
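Caching .run files in GCS, using the consistent curl arguments mentioned above, could follow a check-then-fetch pattern; the bucket name, curl flags, and helper function are hypothetical:

```
CACHE_BUCKET="gs://example-init-actions-cache"   # assumed bucket
CURL_OPTS=(-fsSL --retry 3)                      # consistent curl arguments

fetch_run_file() {
  local url="$1" name; name="$(basename "$1")"
  if gsutil -q stat "${CACHE_BUCKET}/${name}"; then
    gsutil cp "${CACHE_BUCKET}/${name}" "/tmp/${name}"   # cache hit
  else
    curl "${CURL_OPTS[@]}" -o "/tmp/${name}" "${url}"    # miss: fetch from origin
    gsutil cp "/tmp/${name}" "${CACHE_BUCKET}/${name}"   # seed the cache
  fi
}
```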
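A discovery script that caches nvidia-smi output avoids re-probing the GPUs on every executor launch. Spark expects the script to print JSON of the form {"name":"gpu","addresses":[...]}; the cache path below is an assumption:

```
#!/usr/bin/env bash
# Hypothetical cached script for spark.executor.resource.gpu.discoveryScript.
CACHE=/var/run/spark-gpu-resources.json   # assumed cache location
if [[ ! -s "${CACHE}" ]]; then
  # Wrap each GPU index in quotes and join with commas: "0","1",...
  ADDRS=$(nvidia-smi --query-gpu=index --format=csv,noheader \
            | sed 's/.*/"&"/' | paste -sd, -)
  echo "{\"name\":\"gpu\",\"addresses\":[${ADDRS}]}" > "${CACHE}"
fi
cat "${CACHE}"
```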
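Holding the NVIDIA packages keeps unattended upgrades from replacing the pinned driver/CUDA combination; the glob patterns below are assumptions about which package families matter:

```
# Hypothetical sketch for Debian/Ubuntu images.
dpkg-query -W -f='${Package}\n' 'nvidia*' 'libnvidia*' 'cuda*' 2>/dev/null \
  | xargs -r apt-mark hold
```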
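Skipping proxy setup when the http-proxy attribute is absent is a simple check against the GCE metadata server; exporting both proxy variables is an assumption about how the rest of the installer consumes the value:

```
# Hypothetical sketch; curl -f makes a missing attribute return empty.
PROXY=$(curl -fs -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/http-proxy" || true)
if [[ -n "${PROXY}" ]]; then
  export http_proxy="${PROXY}" https_proxy="${PROXY}"
fi
```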