So far we have used the simple text-based profiler nvprof. NVIDIA now
provide a much more sophisticated suite of tools for profiling
GPU codes. There are two elements. First, Nsight Systems gives a
timeline overview of the whole application, both host and device:
https://developer.nvidia.com/nsight-systems
Second, Nsight Compute provides detailed profiling of individual kernels:
https://developer.nvidia.com/nsight-compute
The first port-of-call should always be Nsight Systems.
To use it, compile as usual, and prefix the run command with nsys profile:
nsys profile -o systems ./a.out
which should produce a file systems.nsys-rep. This can be
read into the Nsight Systems user interface.
The usual mode of operation is to copy the report file back to your local machine.
A bare profile shows CUDA API (host) activity and device activity (memory copies and kernels). It can be useful to identify particular sections of code.
NVTX (NVIDIA Tools Extension) markers can be added to host code as follows. Include the header file
#include "nvToolsExt.h"
Identify the region of code of interest, and add a range start and end with, e.g.,
nvtxRangeId_t id = nvtxRangeStartA("MY ASCII LABEL");
/* ... code of interest ... */
nvtxRangeEnd(id);
and recompile (you may need to add -lnvToolsExt at the link stage).
This will cause a coloured bar to appear in the profile indicating the relevant duration.
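As a minimal sketch of how such a range might be wrapped around a host routine, the following uses a pair of macros so that the same source still compiles when NVTX is not available. The names USE_NVTX, RANGE_START, RANGE_END, range_id_t and host_sum are hypothetical and chosen for illustration only; only nvtxRangeStartA(), nvtxRangeEnd() and nvtxRangeId_t come from NVTX itself.

```c
/* Sketch only: USE_NVTX, RANGE_START, RANGE_END, range_id_t and
 * host_sum are illustrative names, not part of NVTX. */

#ifdef USE_NVTX
#include "nvToolsExt.h"
typedef nvtxRangeId_t range_id_t;
#define RANGE_START(label) nvtxRangeStartA(label)
#define RANGE_END(id)      nvtxRangeEnd(id)
#else
/* Fallback: the markers compile to no-ops without the NVTX header */
typedef int range_id_t;
#define RANGE_START(label) ((void) (label), 0)
#define RANGE_END(id)      ((void) (id))
#endif

/* Host-side work that we would like to see as a labelled range */
double host_sum(const double * x, int n) {

  range_id_t id = RANGE_START("host_sum");  /* range opens here */
  double sum = 0.0;

  for (int i = 0; i < n; i++) {
    sum += x[i];
  }

  RANGE_END(id);                            /* range closes here */

  return sum;
}
```

Without -DUSE_NVTX on the compile line the markers have no effect; with it, the header must be found and the link stage needs -lnvToolsExt as described above.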
If one has concerns about the performance of a particular kernel, or kernels, then one needs to turn to Nsight Compute.
For basic information use
ncu -o default ./a.out
which should produce a file default.ncu-rep. This report file is
loaded into Nsight Compute.
For more detailed information one can use, e.g.,
ncu --set detailed -o detailed ./a.out
and for full information use
ncu --set full -o full ./a.out
which will run additional passes of the kernel to collect more metrics.
Note that running ncu can be quite time-consuming, and if using
a real application, a small problem size should be selected in the
first instance. One can also use filters to limit the information
collected (e.g., the --kernel-name option restricts collection to
a named kernel).
For the simple matrix operation we developed earlier, first try running Nsight Systems, and then Nsight Compute with the various options.
Have a go at adding some NVTX markers to highlight a region of host code.