
Project 2: Xiaomao Ding #12

Open · wants to merge 7 commits into master
98 changes: 93 additions & 5 deletions README.md
@@ -3,11 +3,99 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Xiaomao Ding
* Tested on: Windows 8.1, i7-4700MQ @ 2.40GHz 8.00GB, GT 750M 2047MB (Personal Computer)

### (TODO: Your README)
# Intro
The code in this repo implements stream compaction and scan algorithms on the GPU in CUDA, along with CPU versions in C++ for performance comparison. The scan algorithm performs a parallel prefix sum on the GPU. For more information, see this [NVIDIA link](http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html).

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
![Image of Prefix Sum](http://http.developer.nvidia.com/GPUGems3/elementLinks/39fig02.jpg)

*Image from NVIDIA*
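
For reference, the scan here is an exclusive prefix sum: output element i is the sum of all input elements strictly before i. A minimal sequential sketch of the operation (the CPU baseline behaves like this, though it is not the repo's exact code):

```
// Exclusive prefix sum: out[0] = 0, out[i] = in[0] + ... + in[i-1].
void exclusiveScan(int n, int *out, const int *in) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        out[i] = sum;  // total of everything strictly before i
        sum += in[i];
    }
}
// e.g. [38 19 38 37 ...] -> [0 38 57 95 ...], matching the test output below.
```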

# Performance Analysis
The sections below discuss the performance of the algorithms in this repository.

### Optimal Block Size
Each GPU algorithm was tested using an array of 2^14 integers. The optimal block size was found to be 128-256, as shown below. All calculations following this section use a block size of 128. Performance was timed using CUDA events (a sketch of the timing pattern follows the plot below).

| Block Size | Naive GPU scan (ms) | Efficient GPU scan (ms) | Efficient GPU Compaction (ms)|
| :------------- |-------------:| -----:|-----:|
| 64 | 0.124 | 0.527 |0.531 |
| 128 | 0.094 | 0.484 |0.412 |
| 256 | 0.095 | 0.473 |0.423 |
| 512 | 0.102 | 0.471 |0.454 |
| 1024 | 0.109 | 0.495 |0.487 |

![Plot of block size](https://github.com/xnieamo/Project2-Stream-Compaction/blob/master/images/blockSizePlot.png)
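
The CUDA-event timing follows this pattern; this is a minimal sketch, with the kernel name and launch configuration as placeholders rather than the repo's actual code:

```
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// hypothetical launch; in the repo this would be one of the scan kernels
kernScanStep<<<numBlocks, blockSize>>>(n, dev_out, dev_in);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // block until the stop event completes
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```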

### Performance comparisons
This section describes the performance of the various implementations of scan and stream compaction in this repository. All times in the table are in milliseconds. I get a stack overflow error when trying to run the algorithms with more than 2^16 array entries (see the note after the plot below), so that is the maximum array size presented here.

| Array Size | CPU scan | Naive GPU scan | Efficient GPU scan | CPU Compact w/o scan | CPU compact w/ scan | Efficient GPU compact | Thrust |
|:------|-----------:|----------------:|---------------------:|----------------------:|---------------------:|-----------------------:|--------:|
| 2^12 | 0.015623 | 0.051032 | 0.298 | 0.0090072 | 0.0312529 | 0.263 | 0.352 |
| 2^14 | 0.062499 | 0.0928 | 0.422 | 0.0468755 | 0.1716863 | 0.425 | 0.502 |
| 2^16 | 0.2343767 | 0.342 | 1.15 | 0.250018 | 0.6718685 | 1.127 | 1.325 |

![Plot of various runtimes](https://github.com/xnieamo/Project2-Stream-Compaction/blob/master/images/performanceChart.png)
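
The stack overflow mentioned above is most likely because main.cpp allocates the three test arrays on the stack (`int a[SIZE], b[SIZE], c[SIZE];`, visible in the diff below); three arrays of 2^17 ints would exceed the default 1 MB stack on Windows. A sketch of a heap-based alternative, assuming the scan/compact functions keep their current raw-pointer signatures:

```
#include <vector>

std::vector<int> a(SIZE), b(SIZE), c(SIZE);
// pass a.data(), b.data(), c.data() wherever the raw arrays were used, e.g.
StreamCompaction::CPU::scan(SIZE, b.data(), a.data());
```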

Because this is the work-efficient algorithm described in GPU Gems implemented without any optimizations, it actually runs SLOWER than the naive scan! Looking at the NVIDIA Nsight runtime analysis, the Thrust implementation appears to use asynchronous memory transfers, which seems to allow the CPU to issue calls while a kernel is running. Surprisingly, the Thrust implementation is still slower than the efficient GPU implementation (runtimes were taken from the Nsight analysis, discounting the initial and final memcpy operations).

In the case of the work-efficient algorithm, one issue that affects runtime is that more and more threads idle as the upsweep and downsweep progress. Aside from that, a main bottleneck in my implementation is memory transfer between host and device. The stream compaction algorithm needs to set the last element of the scan buffer to 0; instead of doing this via a kernel, I transfer the buffer back to the host. This is an expensive memory transfer and adds roughly 0.100 ms to the runtime. Another cost that takes about as long as the calculation itself is the cudaLaunch function, which I believe is the CUDA runtime call behind each kernel launch, so part of this time is per-launch overhead. Setting the last element to 0 on the GPU instead might save about 25% of my runtime!
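
A minimal sketch of that suggested fix, zeroing the last element directly on the device instead of round-tripping through the host (the buffer name is hypothetical):

```
// cudaMemset writes bytes, so a sizeof(int)-byte memset of 0 zeroes
// exactly one int at the end of the device-side scan buffer.
cudaMemset(dev_scan + n - 1, 0, sizeof(int));
```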

With the naive GPU scan, there aren't really many addressable bottlenecks. The calculation just takes that long.

For the CPU implementations, the compaction w/o scan runs faster because it only needs to perform a single comparison per element. The w/ scan version adds a large amount of unnecessary work (on the CPU), which makes it run much slower. This shows that GPU and CPU algorithms, and the way we should go about implementing code on each kind of machine, differ by quite a lot!
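
For contrast, a single-pass CPU compaction touches each element only once; this sketch illustrates the idea (not the repo's exact code):

```
// Copy every nonzero element of idata to odata, packed at the front;
// return how many elements were kept. One comparison per element.
int compactWithoutScan(int n, int *odata, const int *idata) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (idata[i] != 0) {
            odata[count++] = idata[i];
        }
    }
    return count;
}
```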

### Program output
Finally, here is the output of the tests used to validate my implementations, run on an array of 2^16 elements. They all pass, woohoo!

```

****************
** SCAN TESTS **
****************
[ 38 19 38 37 5 47 15 35 0 12 3 0 42 ... 35 0 ]
==== cpu scan, power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
==== cpu scan, non-power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604305 1604316 ]
passed
==== naive scan, power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
passed
==== naive scan, non-power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 0 0 ]
passed
==== work-efficient scan, power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
passed
==== work-efficient scan, non-power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604305 1604316 ]
passed
==== thrust scan, power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604374 1604409 ]
passed
==== thrust scan, non-power-of-two ====
[ 0 38 57 95 132 137 184 199 234 234 246 249 249 ... 1604305 1604316 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 3 2 1 3 1 1 1 2 0 1 0 2 ... 1 0 ]
==== cpu compact without scan, power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 1 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 3 1 ]
passed
==== cpu compact with scan ====
[ 2 3 2 1 3 1 1 1 2 1 2 1 1 ... 1 1 ]
passed
==== work-efficient compact, power-of-two ====
passed
==== work-efficient compact, non-power-of-two ====
passed
```
Binary file added images/blockSizePlot.png
Binary file added images/performanceChart.png
44 changes: 33 additions & 11 deletions src/main.cpp
@@ -7,19 +7,20 @@
*/

#include <cstdio>
#include <iostream>
#include <chrono>
#include <stream_compaction/cpu.h>
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

int main(int argc, char* argv[]) {
const int SIZE = 1 << 8;
const int SIZE = 1 << 16;
const int NPOT = SIZE - 3;
int a[SIZE], b[SIZE], c[SIZE];

// Scan tests

printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
@@ -31,7 +32,13 @@ int main(int argc, char* argv[]) {

zeroArray(SIZE, b);
printDesc("cpu scan, power-of-two");
StreamCompaction::CPU::scan(SIZE, b, a);
//auto begin = std::chrono::high_resolution_clock::now();
//for (int i = 0; i < 1000; i++){
StreamCompaction::CPU::scan(SIZE, b, a);
//}
//auto end = std::chrono::high_resolution_clock::now();
//std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() << "ns" << std::endl;

printArray(SIZE, b, true);

zeroArray(SIZE, c);
@@ -43,37 +50,37 @@ int main(int argc, char* argv[]) {
zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
StreamCompaction::Thrust::scan(NPOT, c, a);
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

printf("\n");
@@ -91,7 +98,14 @@ int main(int argc, char* argv[]) {

zeroArray(SIZE, b);
printDesc("cpu compact without scan, power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);

//begin = std::chrono::high_resolution_clock::now();
//for (int i = 0; i < 1000; i++){
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
//}
//end = std::chrono::high_resolution_clock::now();
//std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() << "ns" << std::endl;

expectedCount = count;
printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, b);
@@ -105,7 +119,15 @@ int main(int argc, char* argv[]) {

zeroArray(SIZE, c);
printDesc("cpu compact with scan");
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);


//begin = std::chrono::high_resolution_clock::now();
//for (int i = 0; i < 1000; i++){
count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
//}
//end = std::chrono::high_resolution_clock::now();
//std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() << "ns" << std::endl;

printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

2 changes: 1 addition & 1 deletion stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES

cuda_add_library(stream_compaction
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_30
)
67 changes: 38 additions & 29 deletions stream_compaction/common.cu
@@ -1,39 +1,48 @@
#include "common.h"

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
cudaError_t err = cudaGetLastError();
if (cudaSuccess == err) {
return;
}

fprintf(stderr, "CUDA error");
if (file) {
fprintf(stderr, " (%s:%d)", file, line);
}
fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
exit(EXIT_FAILURE);
cudaError_t err = cudaGetLastError();
if (cudaSuccess == err) {
return;
}

fprintf(stderr, "CUDA error");
if (file) {
fprintf(stderr, " (%s:%d)", file, line);
}
fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
exit(EXIT_FAILURE);
}


namespace StreamCompaction {
namespace Common {

/**
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
}
namespace Common {

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
}
/**
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/

}
__global__ void kernMapToBoolean(int n, int *bools, const int *idata){
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n) return;

bools[index] = 1;
if (idata[index] == 0) bools[index] = 0;
}


/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n) return;

if (bools[index] == 1) odata[indices[index]] = idata[index];
}

}
}
2 changes: 2 additions & 0 deletions stream_compaction/common.h
@@ -3,6 +3,8 @@
#include <cstdio>
#include <cstring>
#include <cmath>
#include <cuda.h>
#include <cuda_runtime.h>

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)