- Week 3
- Lecture 3.1: Performance Considerations - DRAM Bandwidth
- Lecture 3.2: Performance Considerations - Memory Coalescing
- Lecture 3.3: Parallel Computation Patterns - Convolution
- Lecture 3.4: Parallel Computation Patterns - Tiled Convolution
- Lecture 3.5: Parallel Computation Patterns - 2D Tiled Convolution Kernel
DRAM data throughput is like a highway toll plaza: to keep traffic flowing, many toll booths operate in parallel. In the same spirit, DRAM is designed to operate in burst mode.
- Each address space is partitioned into burst sections – Whenever a location is accessed, all other locations in the same section are also delivered to the processor
- For example: a 16-byte address space with 4-byte burst sections
- every time location 5 is accessed, locations 4, 6, and 7 of the same burst section are delivered to the processor along with location 5
- In practice, we have at least a 4GB address space and 128-byte burst sections
- accessing any single location causes the entire 128-byte burst section containing it to be delivered to the processor
- When all threads of a warp execute a load, if all accessed locations fall into the same burst section, only one DRAM request will be made and the access is fully coalesced.
- When the accessed locations spread across burst section boundaries, coalescing fails: multiple DRAM requests are made and the access is not fully coalesced.
- In modern GPUs, cache memories alleviate many of the uncoalesced accesses of the second category (the case on the right). So we will focus on the situation where threads do not access consecutive memory locations (the case on the left).
- Accesses in a warp are to consecutive locations if the index in an array access is in the form of – A[(terms independent of threadIdx.x)+ threadIdx.x];
- i is the loop counter in the inner-product loop of the kernel code
- A is m×n, B is n×k
- Col = blockIdx.x*blockDim.x + threadIdx.x
Coalescing is not about the accesses of each individual thread. It is about the locations accessed by neighboring threads of the same warp at the same point in time. This is another coalesced access pattern.
So whenever adjacent threads access adjacent locations in memory, then all move down together and again access adjacent locations, we have a perfectly coalesced memory access pattern (sketched in the code below).
A[Row][tx]
B[ty][Col]
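The A[Row][tx] and B[ty][Col] accesses above correspond to the tile-loading phase of tiled matrix multiplication. A minimal sketch of that phase, assuming row-major A (m×n) and B (n×k) and square tiles; the kernel name and the TILE_WIDTH value of 16 are illustrative, not from the lecture:

#define TILE_WIDTH 16

__global__ void tiled_matmul(const float *A, const float *B, float *C,
                             int m, int n, int k)
{
    __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;
    float Pvalue = 0.0f;
    for (int t = 0; t < (n - 1) / TILE_WIDTH + 1; ++t) {
        // A[Row][t*TILE_WIDTH + tx]: the address varies with tx, so adjacent
        // threads of a warp read adjacent locations (coalesced)
        if (Row < m && t * TILE_WIDTH + tx < n)
            ds_A[ty][tx] = A[Row * n + t * TILE_WIDTH + tx];
        else
            ds_A[ty][tx] = 0.0f;
        // B[t*TILE_WIDTH + ty][Col]: Col contains threadIdx.x, so again coalesced
        if (t * TILE_WIDTH + ty < n && Col < k)
            ds_B[ty][tx] = B[(t * TILE_WIDTH + ty) * k + Col];
        else
            ds_B[ty][tx] = 0.0f;
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_A[ty][i] * ds_B[i][tx];
        __syncthreads();
    }
    if (Row < m && Col < k)
        C[Row * k + Col] = Pvalue;
}

In both loads the index contains threadIdx.x, so each warp touches consecutive locations and the loads fall into as few DRAM bursts as possible.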
- Often performed as a filter that transforms signals and pixels into more desirable values.
- Some filters smooth out the signal values so that one can see the big-picture trend
- Others can be used to detect or sharpen boundaries and edges of objects in images
- An array operation, where each output data element is a weighted sum of a collection of neighboring input elements
- In general, when we perform convolution, we will transform an input array into an output array of the same size
- To compute each output array element, we take the corresponding input element and some of its neighboring elements in the input array and perform a weighted sum calculation
- The weights used in the weighted sum calculation are defined by an input mask array, commonly referred to as the convolution kernel
- we will refer to these mask arrays as convolution masks, to avoid confusion with the kernel functions of CUDA
- the same convolution mask is typically used for all elements of the array
- a mask of 5 elements
- Commonly used for audio processing
- Mask size is usually an odd number of elements for symmetry
- Calculation of P[2] = 1*3 + 2*4 + 3*5 + 4*4 + 5*3 = 57 (checked by the sequential sketch below)
- P[3] = 2*3 + 3*4 + 4*5 + 5*4 + 6*3 = 76
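As a concrete check of these two calculations, here is a minimal sequential (host-side) sketch of the weighted sum, assuming the input array N = {1, 2, 3, 4, 5, 6, 7} and the mask M = {3, 4, 5, 4, 3} from the example, with out-of-range elements treated as 0:

#include <stdio.h>

// Sequential reference for 1D convolution; "ghost" elements are treated as 0.
void convolution_1D_host(const float *N, const float *M, float *P,
                         int Mask_Width, int Width)
{
    for (int i = 0; i < Width; i++) {
        float Pvalue = 0.0f;
        int N_start_point = i - (Mask_Width / 2);
        for (int j = 0; j < Mask_Width; j++) {
            if (N_start_point + j >= 0 && N_start_point + j < Width)
                Pvalue += N[N_start_point + j] * M[j];
        }
        P[i] = Pvalue;
    }
}

int main(void)
{
    float N[7] = {1, 2, 3, 4, 5, 6, 7};   // assumed input values from the example
    float M[5] = {3, 4, 5, 4, 3};         // 5-element convolution mask
    float P[7];
    convolution_1D_host(N, M, P, 5, 7);
    printf("P[2] = %.0f, P[3] = %.0f\n", P[2], P[3]);  // prints 57 and 76
    return 0;
}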
- Calculation of output elements near the boundaries (beginning and end) of the input array needs to deal with “ghost” elements
- Different policies are possible (e.g., treat ghost elements as 0, or replicate the boundary values)
- This kernel forces all elements outside the image to 0
// N: input array
// M: mask array
// P: output array
// Mask_Width: number of elements in the mask array
// Width: number of elements in the input array
__global__ void convolution_1D_basic_kernel(float *N, float *M, float *P,
                                            int Mask_Width, int Width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float Pvalue = 0;
    int N_start_point = i - (Mask_Width / 2);
    for (int j = 0; j < Mask_Width; j++) {
        // skip "ghost" elements outside the input array (treated as 0)
        if (N_start_point + j >= 0 && N_start_point + j < Width) {
            Pvalue += N[N_start_point + j] * M[j];
        }
    }
    if (i < Width)   // guard threads mapped past the end of the output array
        P[i] = Pvalue;
}
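A hedged sketch of how this kernel might be launched from the host; the block size of 256 and the device pointer names are illustrative, and the allocations and copies are assumed to have been done already:

// d_N, d_M, d_P: device buffers allocated with cudaMalloc and filled with
// cudaMemcpy beforehand (not shown); Width and Mask_Width as defined above.
int threadsPerBlock = 256;                          // assumed block size
int blocks = (Width - 1) / threadsPerBlock + 1;     // enough blocks to cover all outputs
convolution_1D_basic_kernel<<<blocks, threadsPerBlock>>>(d_N, d_M, d_P,
                                                         Mask_Width, Width);
cudaDeviceSynchronize();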
Similar to the 1D case, but the input and output tiles are defined differently in order to manage the complexity.
- Each thread block calculates an output tile
- Each output tile width is O_TILE_WIDTH
- For each thread,
- index_o = blockIdx.x*O_TILE_WIDTH + threadIdx.x
- O_TILE_WIDTH is 4 in this example
- Each input tile has all values needed to calculate the corresponding output tile.
- input tile size depends on the mask width
- in practice, the input tile size is relatively close to the output tile size
- Size each thread block to cover the input tile; blockDim.x is 8 in this example
#define O_TILE_WIDTH 1020
#define BLOCK_WIDTH (O_TILE_WIDTH + 4) // 1024 threads, a power of two
- The Mask_Width is 5 in this example
- In general, the block width should be the output tile width + (mask width - 1); see the kernel sketch below
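A minimal sketch of the tiled 1D convolution kernel this sizing implies, assuming one thread per input-tile element, ghost elements forced to 0, and the mask passed as a const __restrict__ pointer; the exact code is an illustration, not the lecture's:

#define MASK_WIDTH 5
#define O_TILE_WIDTH 4
#define BLOCK_WIDTH (O_TILE_WIDTH + MASK_WIDTH - 1)

__global__ void convolution_1D_tiled_kernel(float *N, const float * __restrict__ M,
                                            float *P, int Width)
{
    __shared__ float N_ds[BLOCK_WIDTH];
    int tx = threadIdx.x;
    int index_o = blockIdx.x * O_TILE_WIDTH + tx;   // output index for this thread
    int index_i = index_o - (MASK_WIDTH / 2);       // corresponding input index

    // every thread loads one input-tile element; ghost elements become 0
    if (index_i >= 0 && index_i < Width)
        N_ds[tx] = N[index_i];
    else
        N_ds[tx] = 0.0f;
    __syncthreads();

    // only the first O_TILE_WIDTH threads compute output elements
    if (tx < O_TILE_WIDTH && index_o < Width) {
        float Pvalue = 0.0f;
        for (int j = 0; j < MASK_WIDTH; j++)
            Pvalue += N_ds[tx + j] * M[j];
        P[index_o] = Pvalue;
    }
}

With O_TILE_WIDTH = 4 and MASK_WIDTH = 5 this reproduces the blockDim.x = 8 configuration above: the last MASK_WIDTH - 1 threads of each block only help load the input tile and do not produce outputs.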
Understanding the benefit of the tiled algorithm for the convolution pattern:
N_ds: 2 3 4 5 6 7 8 9
- Mask_Width is 5
- Unlike in the matrix multiplication example, the convolution calculation does not have the same amount of reuse for every input element loaded into shared memory (the overall saving is quantified below)
- Element 2 is only used in the calculation of output element 4 (once)
- Element 3 is used by threads 4,5 ( twice )
- Element 4 is used by threads 4,5,6 ( 3X )
- ...
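Ignoring ghost elements, an untiled kernel issues O_TILE_WIDTH * Mask_Width global-memory loads to produce one output tile, while the tiled kernel loads only the O_TILE_WIDTH + Mask_Width - 1 elements of the input tile into shared memory. For the small example (O_TILE_WIDTH = 4, Mask_Width = 5) that is 20 loads versus 8, a 2.5x reduction; for the realistic O_TILE_WIDTH = 1020 it is 5100 versus 1024, and the ratio approaches Mask_Width as the output tile grows.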
- It is sometimes desirable to pad each row of a 2D matrix to a multiple of the DRAM burst size
- So each row starts at the DRAM burst boundary
- Effectively adding columns
- This is usually done automatically by the matrix allocation function (see the sketch below)
- Pitch can be different for different hardware
- In this example, assume the DRAM burst is four elements
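A minimal sketch of pitched allocation with the CUDA runtime: cudaMallocPitch chooses the padded row size for the current hardware, and cudaMemcpy2D copies a tightly packed host image into it. The image dimensions and the h_img host pointer are illustrative:

// Allocate a 2D image so that every row starts on an aligned (burst) boundary.
float *d_img;
size_t pitch;                         // padded row size in bytes, chosen by the runtime
int width  = 1000;                    // illustrative image width in elements
int height = 800;                     // illustrative image height
cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);

// h_img: tightly packed host image of width*height floats (assumed allocated elsewhere).
cudaMemcpy2D(d_img, pitch, h_img, width * sizeof(float),
             width * sizeof(float), height, cudaMemcpyHostToDevice);

// Inside a kernel, element (row, col) is then addressed using the pitch:
// float v = ((float*)((char*)d_img + row * pitch))[col];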
// Image Matrix Structure declaration
typedef struct {
    int width;
    int height;
    int pitch;
    int channels;
    float *data;
} * wbImage_t;
- This type will only be used in the host code of the machine problem
- by the time you invoke your kernel, you should have extracted the data pointer, width, height, and pitch, and passed them to the kernel (see the host-side sketch below)
#define O_TILE_WIDTH 12
#define BLOCK_WIDTH (O_TILE_WIDTH + 4)
dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);
dim3 dimGrid((wbImage_getWidth(N) - 1) / O_TILE_WIDTH + 1,
             (wbImage_getHeight(N) - 1) / O_TILE_WIDTH + 1, 1);
- In general, BLOCK_WIDTH should be O_TILE_WIDTH + (MASK_WIDTH-1)
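A hedged sketch of the host-side steps implied above, assuming the kernel signature shown at the end of this section; the accessors wbImage_getData, wbImage_getPitch, and wbImage_getChannels, and the device buffer names, are assumptions for illustration:

// Extract the image fields on the host and launch the 2D convolution kernel.
float *hostInput  = wbImage_getData(N);      // assumed accessor; copied into deviceInput
int imageWidth    = wbImage_getWidth(N);
int imageHeight   = wbImage_getHeight(N);
int imagePitch    = wbImage_getPitch(N);     // assumed accessor; pass it too if the kernel indexes padded rows
int imageChannels = wbImage_getChannels(N);  // assumed accessor

// deviceInput, deviceOutput, deviceMask: device buffers allocated with
// cudaMalloc and filled with cudaMemcpy beforehand (not shown here).
convolution_2D_kernel<<<dimGrid, dimBlock>>>(deviceOutput, deviceInput,
                                             imageHeight, imageWidth,
                                             imageChannels, deviceMask);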
- Mask is used by all threads but not modified in the convolution kernel
- All threads in a warp access the same locations at each point in time
- CUDA devices provide constant memory whose contents are aggressively cached
- Cached values are broadcast to all threads in a warp
- Effectively magnifies memory bandwidth without consuming shared memory
- Use of the const and __restrict__ qualifiers for the mask parameter informs the compiler that it is eligible for constant caching (an alternative using __constant__ memory is sketched at the end of this section)
For example:
__global__ void convolution_2D_kernel(float *P, float *N,
                                      int height, int width, int channels,
                                      const float * __restrict__ M)
{
    // kernel body omitted here
}
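Another common way to get constant caching for the mask is to place it in __constant__ memory and copy it there with cudaMemcpyToSymbol. A minimal sketch, assuming a 5x5 mask; MAX_MASK_WIDTH, the array name Mc, and the host array h_M are illustrative:

#define MAX_MASK_WIDTH 5

// Mask in constant memory: aggressively cached and broadcast to all threads of a warp.
__constant__ float Mc[MAX_MASK_WIDTH * MAX_MASK_WIDTH];

// Host side: copy the mask into constant memory before launching the kernel.
void copy_mask_to_device(const float *h_M)   // h_M: host mask array, assumed filled
{
    cudaMemcpyToSymbol(Mc, h_M,
                       MAX_MASK_WIDTH * MAX_MASK_WIDTH * sizeof(float));
}

// Kernels then read Mc[...] directly instead of taking the mask as a pointer argument.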