- Week 1
- 1.1: Course Overview
- 1.2: Introduction to Heterogeneous Parallel Computing
- 1.3: Portability and Scalability in Heterogeneous Parallel Computing
- 1.4: Introduction to CUDA, Data Parallelism and Threads
- 1.5: Introduction to CUDA, Memory Allocation and Data Movement API Functions
- 1.6: Introduction to CUDA, Kernel-Based SPMD Parallel Programming
- 1.7: Kernel-based Parallel Programming, Multidimensional Kernel Configuration
- 1.8: Kernel-based Parallel Programming, Basic Matrix-Matrix Multiplication
Recommended Textbook/Notes
- D. Kirk and W. Hwu: Programming Massively Parallel Processors - A Hands-on Approach, 2nd Edition
- Lab assignments
- NVIDIA: NVIDIA CUDA C Programming Guide v5.0
- to learn the major differences between latency-oriented devices (CPU cores) and throughput-oriented devices (GPU cores)
- to understand why winning applications increasingly use both types of devices
- CPUs have more local cache
- GPUs have a large number of registers, mostly to support a large number of threads.
- GPUs tend to have a large number of SIMD execution units.
- SIMD (Single Instruction, Multiple Data): a style of instruction set that applies one operation to multiple operands packed into wide registers.
- CPUs tend to have sophisticated control logic, which is why there is a big purple block labelled "control" in the die diagram.
- GPUs tend to have very simple control but a large number of threads to manage.
- Powerful ALUs (Arithmetic Logic Units)
- Reduced operation latency
- Large caches
- Convert long latency memory accesses to short latency cache accesses
- Sophisticated control
- Branch prediction for reduced branch latency
- Data forwarding for reduced data latency
- Small caches
- To boost memory throughput, the caches are not used to hold data for future accesses; they are designed as staging units for the data of a large number of threads.
- Simple Control
- No branch prediction
- No data forwarding
- Energy efficient ALUs
- Many of them; long latency, but heavily pipelined for high throughput
- Require a massive number of threads to tolerate latencies.
- Many threads can request data at the same time; these requests get combined, so the latency is amortized across them.
- CPUs for sequential parts where latency matters
- CPUs can be 10+X faster than GPUs for sequential code
- GPUs for parallel parts where throughput wins
- GPUs can be 10+X faster than CPUs for parallel code
In the future, software development costs will grow much faster than hardware costs, so future systems must minimize software redevelopment. This leads to the considerations of scalability and portability.
Keys to Software Cost Control:
- Scalability
- The same application runs efficiently on new generations of cores (coreA 2.0)
- The same application runs efficiently on more of the same cores ( coreA *4 )
- Portability
- The same application runs efficiently on different types of cores (coreA, coreB, coreC)
- The same application runs efficiently on systems with different organizations and interfaces
CUDA C is a heterogeneous parallel programming interface that enables exploitation of data parallelism:
- Hierarchical thread organization
- Main interfaces for launching parallel execution
- Thread index to data index mapping
The phenomenon of data parallelism is that different parts of the data can be processed independently of each other.
A very simple example is vector addition: when we add two vectors together, the element pairs can be added independently of one another.
The parallel execution model of CUDA, like that of OpenCL, is based on a host-plus-device arrangement. The basic concept is that when we start executing an application, it begins executing on the host, which is typically a CPU core. When we reach a parallel part of the application, we have an opportunity to use a throughput-oriented device. This is done by writing specialized functions called kernel functions, which are very similar to functions in the C programming language. Execution moves back and forth between host and device; "device" in CUDA means a parallel execution device, and most of the time the device corresponds to a throughput-oriented GPU.
CUDA is really a programming language at the C level; it is actually designed as an extension to the C language. More recently, more and more C++ features have also become available in CUDA.
- An Instruction Set Architecture (ISA) is a contract between the hardware and the software.
- As the name suggests, it is a set of instructions that the architecture (hardware) can execute.
- A program is a set of instructions stored in memory that can be read, interpreted, and executed by the hardware.
- Program instructions operate on data stored in memory or provided by I/O devices.
Von-Neumann Processor:
A thread is a "virtualized" and "abstracted" Von-Neumann Processor.
The kernel function we write is that program, and the hardware will effectively generate a large number of these Von-Neumann processors.
The number of real processors may be much, much smaller than the number of threads that a CUDA program creates. Some of them will be actively executing and some will not; this is what we call context switching.
- A CUDA kernel is executed by a grid (array) of threads
- All threads in a grid run the same kernel code (SPMD)
- Each thread has indices that it uses to compute memory addresses and make control decisions
How a Single Thread Block Works:
threads 0-255:
Threads that execute the same code form a thread block. All the threads run the same code, but every thread has a different thread index value, which is used to compute memory addresses and make control decisions.
In this example, there are 256 threads in the thread block, each with a unique thread index from 0 to 255. There is a piece of code in the kernel, shown as the middle box, which first calculates the variable i based on the thread index. This i variable is private to each thread.
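A minimal sketch (my own, not from the lecture) of what such a single-block kernel could look like: each thread derives its private i directly from threadIdx.x, assuming one block of 256 threads and a hypothetical kernel name vecAddSingleBlock.
__global__
void vecAddSingleBlock(float* A, float* B, float* C, int n) {
    int i = threadIdx.x;              // private to each thread, 0..255
    if (i < n) C[i] = A[i] + B[i];    // each thread handles one element
}
// launch (syntax covered in 1.6): vecAddSingleBlock<<<1, 256>>>(d_A, d_B, d_C, n);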
How Multiple Thread Blocks Work:
- Divide the thread array into multiple blocks
- Threads within a block cooperate via shared memory (exchange data), atomic operations (update the same variable), and barrier synchronization (force the other threads to wait); a minimal sketch follows this list
- Threads in different blocks do not interact
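A minimal sketch (my own, not from the lecture) illustrating all three cooperation mechanisms inside one block; the kernel name blockSumKernel and the fixed 256-thread block size are assumptions for illustration only.
__global__
void blockSumKernel(const float* in, float* total, int n) {
    __shared__ float partial[256];                  // shared memory: visible to the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                // barrier: wait until every thread has loaded
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();                            // barrier between reduction steps
    }
    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);               // atomic: all blocks update the same variable safely
}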
Here every thread not only has a thread index but also a block index. These are predefined CUDA variables that we can use in a kernel, and they are initialized by the hardware for each thread.
In this case, the block dimension is 256.
The thread index and block index are not just one-dimensional. In CUDA, the block index can be a 1D, 2D, or 3D variable, and so can the thread index, so each thread carries up to six index components in total. This is why we use threadIdx.x to calculate i: .x means the first dimension, .y the second, and .z the third.
A multidimensional index is very convenient for image processing, solving three-dimensional partial differential equations, and so on.
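As a small sketch of the idea (my own example; the PictureKernel in 1.7 below uses the same pattern), a 2D thread index maps naturally onto pixel coordinates:
__global__
void touchPixels(float* d_Img, int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // .y = second dimension
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // .x = first dimension
    if (row < height && col < width)
        d_Img[row * width + col] = 0.0f;               // each thread owns one pixel
}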
Vector addition:
Normal C:
void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    int i;
    for (i = 0; i < n; i++) {
        h_C[i] = h_A[i] + h_B[i];
    }
}
int main() {
    // allocation and initialization of h_A, h_B, h_C omitted
    vecAdd(h_A, h_B, h_C, N);
}
Parallel CUDA code:
#include <cuda.h>
void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    // 1. Allocate device memory for A, B and C;
    //    copy A and B to device memory
    // 2. Kernel launch code - the device performs the
    //    actual vector addition
    // 3. Copy C from the device memory back to the host
}
Instead of performing the actual computation, this function is going to call a kernel function that will be executed on the device. Before calling that function, this function also needs to do some outsourcing activity: it needs to copy data from the host memory into the device memory so that the device is ready to process the data. Eventually, after the device completes its computation, it needs to copy the C vector back into the host memory.
- device code can
- R/W per-thread registers
- R/W all-shared global memory
- host code can
- Transfer data to/from per grid global memory
Each device has many, many threads. Remember, these threads are virtualized Von Neumann processors, so you can think of them as processors, and each of these processors has a set of registers. These registers hold variables that are private to the thread.
All threads have access to a shared global memory. Importantly, the host code can allocate memory in the global memory and request data copies between host memory and global memory.
- cudaMalloc()
- Allocates object in the device global memory
- 2 parameters
- address of a pointer to the allocated object
- Unlike standard C malloc, which returns a pointer, cudaMalloc() requires the address of a pointer (a pointer to a pointer).
- The reason for this difference is that all CUDA API functions return an error code.
- size of allocated object in terms of bytes
- cudaFree()
- Free object from device global memory
- takes 1 parameter:
- pointer to freed object
- Note the difference from the cudaMalloc() parameter: here it is a plain pointer, not a pointer to a pointer.
- cudaMemcpy()
- memory data transfer
- Require 4 parameters
- pointer to destination
- pointer to source
- number of bytes to be copied
- Type/Direction of transfer
- Transfer to device is asynchronous
#include <cuda.h>
void vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);   // C is the output; no host-to-device copy needed

    // kernel invocation code - to be shown later

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
In general, when we actually try to get performance from this kind of code, we cannot afford to copy data back and forth before and after each kernel invocation. For real applications, we tend to keep all the data resident in device memory and just keep launching kernels that compute on that device memory. This example simply shows all the pieces that can be involved.
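A minimal sketch of this pattern (my own; the <<<...>>> launch syntax is covered in 1.6 below, and scaleKernel/offsetKernel are hypothetical names): copy in once, launch several kernels on data that stays resident in device memory, copy out once.
__global__ void scaleKernel(float* d_X, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_X[i] *= s;
}
__global__ void offsetKernel(float* d_X, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_X[i] += b;
}

void process(float* h_X, int n) {
    int size = n * sizeof(float);
    float* d_X;
    cudaMalloc((void**)&d_X, size);
    cudaMemcpy(d_X, h_X, size, cudaMemcpyHostToDevice);   // copy in once
    scaleKernel<<<(n-1)/256 + 1, 256>>>(d_X, 2.0f, n);     // data stays on the device
    offsetKernel<<<(n-1)/256 + 1, 256>>>(d_X, 1.0f, n);    // between kernel launches
    cudaMemcpy(h_X, d_X, size, cudaMemcpyDeviceToHost);    // copy out once at the end
    cudaFree(d_X);
}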
In practice, I encourage you to always check for error conditions.
cudaError_t err = cudaMalloc( (void**)&d_A,size );
if (err != cudaSuccess) {
printf( "%s in %s at line %d\n" ,
cudaGetErrorString(err) ,
__FILE__ , __LINE__; )
exit(EXIT_FAILURE);
}
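One common way to avoid repeating this check (a sketch of my own, not part of the lecture) is to wrap it in a macro and apply it to every CUDA API call that returns cudaError_t:
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            printf("%s in %s at line %d\n",                       \
                   cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// usage: CUDA_CHECK(cudaMalloc((void**)&d_A, size));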
Declare a CUDA kernel:
// Compute vector sum C=A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
if (i<n) C[i]=A[i]+B[i];
}
This is the vector addition kernel that we will eventually be launching from the host code.
- __global__ keyword
- Every CUDA kernel function needs to be preceded by __global__
- __global__ tells the compiler that it is a kernel function
// only 1-dimensional grid
int vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    // d_A, d_B, d_C allocations and copies omitted
    ...
    // 2nd part: launch a kernel
    // run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}
- <<< ceil(n/256.0) , 256>>>
- we need to supply a set of configuration parameters to the kernel
- here we actually have 2 configuration parameters
- number of blocks in the grid
- number of threads in a block
- In other words, we want 256 threads in each thread block. If n = 1000, we actually generate 1024 threads, so in the vecAddKernel function above the index i may be >= 1000 (e.g. 1023). That is why the test if (i < n) is necessary.
int vecAdd(float* h_A, float* h_B, float* h_C, int n) {
    dim3 DimGrid((n-1)/256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
}
- dim3 only means that a variable has 3 parts (x, y, z); each part is an integer value
- DimGrid and DimBlock are dim3 variables used to initialize those x, y, z values
- in this case, we're using a 1D grid, so we only need to set the x values of DimGrid and DimBlock; the y and z values are 1
In the host code, you have a host function; the host function can launch a kernel by calling the kernel function just like any other C function, but with additional configuration parameters.
Once we launch a kernel by calling the kernel function, the hardware will generate a grid of threads to execute the kernel function. Every thread will execute this kernel function (and there may be extra threads beyond the data size).
All the threads have built-in variables: blockIdx, blockDim and threadIdx. These are predefined and preinitialized by the hardware, and they allow the threads to generate different data indices so that each thread processes a different part of the data.
keyword | Example | Executed on | Only callable from
---|---|---|---
__device__ | float DeviceFunc() | Device | Device
__global__ | void KernelFunc() | Device | Host
__host__ | float HostFunc() | Host | Host
- A kernel function must return void
- __device__ and __host__ can be used together: the function is compiled into both a host function and a device function, and it can be called from both host and device code
- __host__ is optional if used alone
- The __device__ keyword tells the compiler that it is a device function
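A small sketch (my own example, not from the lecture) of a function declared with both qualifiers, callable from a kernel as well as from host code; square and squareKernel are hypothetical names.
__host__ __device__ float square(float x) {
    return x * x;                         // compiled for both host and device
}

__global__ void squareKernel(float* d_X, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_X[i] = square(d_X[i]);   // device-side call
}

// host-side call: float y = square(3.0f);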
In every file, we can have both host functions and device functions, so NVCC, the NVIDIA C compiler, will automatically identify all the host functions and kernel functions and separate them into two compilation paths.
The host functions go through the host path and are compiled by a host C compiler and linker, such as GCC or Intel ICC.
The second path is for the device functions and kernel functions. These are compiled into a binary format called PTX, which is very similar to Java bytecode.
At run time, a JIT compiler takes the PTX binary and generates the real ISA (instruction set architecture) binary for the particular device that you are using.
Both types of executables are then used on the heterogeneous computing platform: the host binary runs on the CPU, while the device binary runs on the GPU.
This picture was shown earlier. We assume the user has specified a 2D grid, reflected by the 2D indices of the blocks, and that each block contains a 3D arrangement of threads, so each thread in this picture has three thread indices.
Example: a 62 x 76 picture (62 pixels in y, 76 pixels in x).
We assume that we're going to use a 16 x 16 thread block, 16 threads in y and 16 threads in x, to process the picture.
So we need to launch a total of 4 thread blocks in the y dimension to cover all 62 pixels, which gives us 64 threads.
In the x dimension, we need to launch 5 thread blocks, which gives us 80 threads.
Source code of a PictureKernel:
__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m) {
    // n is the picture width (columns), m is the height (rows)
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((Row < m) && (Col < n)) {
        d_Pout[Row * n + Col] = 2.0f * d_Pin[Row * n + Col];
    }
}
Host Code for Launching PictureKernel:
// assume the picture is m x n (m rows, n columns)
// the input d_Pin has been allocated on and copied to the device
// the output d_Pout has been allocated on the device
...
dim3 DimGrid( (n-1)/16 + 1, (m-1)/16 + 1, 1 );
dim3 DimBlock( 16, 16, 1 );
PictureKernel <<< DimGrid, DimBlock >>> (d_Pin, d_Pout, n, m);
A matrix multiplication example:
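As a preview, here is a minimal sketch (my own, along the lines of the basic kernel covered in 1.8) of a square matrix-matrix multiplication P = M * N, where each thread computes one element of P as a dot product; Width is the shared dimension of the square matrices.
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)   // dot product of one row of M and one column of N
            Pvalue += d_M[Row * Width + k] * d_N[k * Width + Col];
        d_P[Row * Width + Col] = Pvalue;
    }
}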