This repository contains mini applications for GPU Bootcamps. This bootcamp focuses on multi-GPU programming models.
Scaling applications to multiple GPUs across multiple nodes requires one to be adept at not just the programming models and optimization techniques, but also at performing root-cause analysis using in-depth profiling to identify and minimize bottlenecks. In this bootcamp, participants will learn to improve the performance of an application step-by-step, taking cues from profilers along the way. Moreover, understanding of the underlying technologies and communication topology will help us utilize high-performance NVIDIA libraries to extract more performance out of the system.
- Overview of single-GPU code and Nsight Systems Profiler
- Single Node Multi-GPU:
  - CUDA Memcpy and Peer-to-Peer Memory Access
  - Intra-node topology
  - CUDA Streams and Events
- Multi-Node Multi-GPU:
  - Introduction to MPI and Multi-Node execution overview
  - MPI with CUDA Memcpy
  - CUDA-aware MPI
  - Supplemental: Configuring MPI in a containerized environment
- NVIDIA Collective Communications Library (NCCL)
- NVSHMEM Library
This bootcamp requires a multi-node system with multiple GPUs in each node (at least 2 GPUs per node).
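One quick way to confirm the per-node GPU count is to count the lines printed by `nvidia-smi -L`, which lists one GPU per line. The sketch below assumes `nvidia-smi` is on the PATH of the node being checked; the helper name `count_gpus` is made up for illustration.

```shell
#!/bin/sh
# count_gpus (hypothetical helper): reads `nvidia-smi -L` output on stdin
# and prints how many lines look like "GPU <index>: ...".
count_gpus() {
    awk '/^GPU [0-9]+:/ { n++ } END { print n + 0 }'
}

# Only query the driver when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_count=$(nvidia-smi -L | count_gpus)
    if [ "$gpu_count" -ge 2 ]; then
        echo "Found $gpu_count GPUs on this node: OK"
    else
        echo "Found $gpu_count GPU(s) on this node: at least 2 are needed"
    fi
fi
```

Since the requirement applies to each node, the check would need to be run on every node of the system, for example through the job launcher.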
The total bootcamp material takes approximately 8 hours.
A multi-node installation of the NVIDIA HPC SDK is desired. Refer to the NVIDIA HPC SDK Installation Guide for detailed instructions. Ensure that your installation contains HPCX with UCX.
After installation, add the HPC SDK to your environment as follows (the paths below are for HPC SDK 21.5):
# Add HPC-SDK to PATH:
export PATH="<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/compilers/bin:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/cuda/bin:$PATH"
# Add HPC-SDK to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH="<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/nvshmem/lib:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/nccl/lib:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/mpi/lib:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/math_libs/lib64:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/compilers/lib:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/cuda/extras/CUPTI/lib64:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/cuda/lib64:$LD_LIBRARY_PATH"
# Add CUDA and NVSHMEM home directory paths:
export CUDA_HOME=<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/cuda
export NVSHMEM_HOME=<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/nvshmem
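Because the exports above contain several long placeholder paths, a typo is easy to miss. The following sketch lists every `LD_LIBRARY_PATH` entry that is not an existing directory. The function name `missing_path_entries` is hypothetical and not part of the HPC SDK.

```shell
#!/bin/sh
# missing_path_entries (hypothetical helper): takes a colon-separated path
# list as $1 and prints each non-empty entry that is not a directory.
# The body runs in a subshell so the IFS and globbing changes do not leak.
missing_path_entries() (
    IFS=:
    set -f   # keep characters such as * from being expanded while splitting
    for entry in $1; do
        if [ -n "$entry" ] && [ ! -d "$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
)

# Prints nothing when every entry exists.
missing_path_entries "${LD_LIBRARY_PATH:-}"
```

An entry that still contains an unreplaced placeholder such as `<path-to-nvidia-hpc-sdk>` would show up in this output.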
Note: If you don't use the Slurm workload manager, remove the --with-slurm flag from the configure command below.
Then, install OpenMPI as follows:
# Download and extract the OpenMPI tarball
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
tar -xvzf openmpi-4.1.1.tar.gz
cd openmpi-4.1.1/
mkdir -p build
# Configure OpenMPI
./configure --prefix=$PWD/build --with-libevent=internal --with-xpmem --with-cuda=<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/cuda/ --with-slurm --enable-mpi1-compatibility --with-verbs --with-hcoll=<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/hpcx/hpcx-2.8.1/hcoll/lib --with-ucx=<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/hpcx/hpcx-2.8.1/ucx/
# Install OpenMPI
make all install
Now, add OpenMPI to the environment:
export PATH="<path-to-openmpi>/build/bin/:$PATH"
export LD_LIBRARY_PATH="<path-to-openmpi>/build/lib:$LD_LIBRARY_PATH"
Ensure that the custom-built OpenMPI is in use by running which mpirun, which should point to the mpirun binary in the <path-to-openmpi>/build/bin directory.
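The same check can be scripted. This is a minimal sketch that uses `command -v` (the POSIX counterpart of `which`) and compares the result against the expected directory. Both the helper `mpirun_in_dir` and the variable `OPENMPI_BIN` are assumptions introduced here, not names used by the bootcamp.

```shell
#!/bin/sh
# mpirun_in_dir (hypothetical helper)
#   $1: directory expected to contain mpirun, e.g. <path-to-openmpi>/build/bin
#   $2: path that mpirun actually resolves to
# Succeeds only if $2 is exactly $1/mpirun (a trailing slash on $1 is ignored).
mpirun_in_dir() {
    case "$2" in
        "${1%/}/mpirun") return 0 ;;
        *) return 1 ;;
    esac
}

# OPENMPI_BIN is an assumed variable; set it to <path-to-openmpi>/build/bin.
if [ -n "${OPENMPI_BIN:-}" ]; then
    found=$(command -v mpirun || true)
    if mpirun_in_dir "$OPENMPI_BIN" "$found"; then
        echo "Using custom-built mpirun: $found"
    else
        echo "mpirun resolves to '${found:-nothing}', expected ${OPENMPI_BIN%/}/mpirun"
    fi
fi
```

If a different mpirun is reported, the PATH export above probably did not take effect in the current shell, or another MPI installation appears earlier in PATH.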
Multi-node compatible versions of the following are required:
We have tested all the code with CUDA driver 460.32.03 and CUDA 11.3.0.0, OpenMPI 4.1.1, HPCX 2.8.1, Singularity 3.6.1, NCCL 2.9.9.1, and NVSHMEM 2.1.2. Note that OpenMPI in our cluster was compiled with CUDA, HCOLL, and UCX support.
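Whether a given OpenMPI build has CUDA support can be checked with `ompi_info`, which reports an `mpi_built_with_cuda_support` parameter. The sketch below assumes that the `ompi_info` on the PATH belongs to the OpenMPI build in question; `is_cuda_aware` is a helper name chosen here for illustration.

```shell
#!/bin/sh
# is_cuda_aware (hypothetical helper): reads `ompi_info --parsable --all`
# output on stdin and succeeds if the build reports CUDA support.
is_cuda_aware() {
    grep -q 'mpi_built_with_cuda_support:value:true'
}

# Skip the check entirely when no OpenMPI installation is on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info --parsable --all | is_cuda_aware; then
        echo "OpenMPI was built with CUDA support"
    else
        echo "OpenMPI was NOT built with CUDA support"
    fi
fi
```

This only covers the CUDA part of the note above; it does not verify HCOLL or UCX support.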
As this bootcamp covers multi-node CUDA-aware MPI concepts, it is primarily designed to run without any containers. After the prerequisite software has been installed, follow these steps to install and run Jupyter Lab:
# Install Miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p <my_dir>
# Add conda to PATH
export PATH=$PATH:<my_dir>/bin/
# Install Jupyter Lab
conda install -c conda-forge jupyterlab
# Run Jupyter Lab
jupyter lab --notebook-dir=<path-to-gpubootcamp-repo>/hpc/multi_gpu_nways/labs/ --port=8000 --ip=0.0.0.0 --no-browser --NotebookApp.token=""
After running Jupyter Lab, open http://localhost:8000 in a web browser and start the introduction.ipynb notebook.
This material is designed to primarily run in containerless environments, that is, directly on the cluster. Thus, building the Singularity container is OPTIONAL.
If containerization is desired, follow the steps outlined in the notebook MPI in Containerized Environments.
Follow the steps below to build the Singularity container image and run Jupyter Lab:
# Build the container
singularity build multi_gpu_nways.simg Singularity
# Run Jupyter Lab
singularity run --nv multi_gpu_nways.simg jupyter lab --notebook-dir=<path-to-gpubootcamp-repo>/hpc/multi_gpu_nways/labs/ --port=8000 --ip=0.0.0.0 --no-browser --NotebookApp.token=""
Then, access Jupyter Lab at http://localhost:8000.
If compiling any program throws an error about CUDA, NCCL, NVSHMEM, or MPI libraries or header files not being found, ensure that LD_LIBRARY_PATH is correctly set. Moreover, make sure the environment variables CUDA_HOME, NCCL_HOME, and NVSHMEM_HOME are set, either during installation or manually inside each Makefile.
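A quick way to see which of these variables is missing is sketched below. It reports, for each variable, whether it is set and whether it points to an existing directory. The helper name `check_home_var` is hypothetical.

```shell
#!/bin/sh
# check_home_var (hypothetical helper): succeeds if the variable named by $1
# is set and points to an existing directory; prints a short status line.
check_home_var() {
    # Indirect lookup: copy the value of the variable whose name is in $1.
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "$1 is not set"
        return 1
    elif [ ! -d "$value" ]; then
        echo "$1=$value (directory not found)"
        return 1
    fi
    echo "$1=$value"
}

# Report on all three variables, continuing even if one of them fails.
for name in CUDA_HOME NCCL_HOME NVSHMEM_HOME; do
    check_home_var "$name" || true
done
```

Variables set only inside a Makefile are not visible to this script, so a "not set" line is a problem only if the corresponding Makefile does not define the variable either.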
- Please go through the list of existing bugs/issues or file a new issue on GitHub.
Please join OpenACC Slack Channel to raise questions.
If you observe any errors or issues, please file an issue on the GPUBootcamp GitHub repository.