
University of Pennsylvania, [CIS 565: GPU Programming and Architecture](http://www.seas.upenn.edu/~cis565/)

Implemented by [Gabriel Naghi](https://www.linkedin.com/in/gabriel-naghi-78ab4738) on Windows 7, Xeon E5-1630 @ 3.70GHz 32GB, GeForce GTX 1070 4095MB (MOR103-56 in SIG Lab)

# Project 1 - Flocking

Over the course of this project, we implemented a flocking simulation. It is intended to roughly mimic the behavior of groups of fish or birds, known throughout the code base as Boids.

There are 3 components to the flocking algorithm:

  1. Boids gravitate toward the local center of gravity within a radius r1.
  2. Boids maintain a minimum distance r2 from their neighbors.
  3. Boids attempt to match the velocity of their neighbors within a radius r3.

We implemented three different methods of calculating the effects of these rules. The first, the naive implementation, checks every other boid for each boid and applies each rule if the other boid is within its area of effect. The second implementation uses a uniform grid, which sorts the boid indices by the "sector" (grid cell) of the scene they occupy and only checks the relevant adjacent cells for neighbors. The final implementation also uses a uniform grid, but removes one layer of indirection by reshuffling the position and velocity data itself, rather than storing a pointer back to each boid's original location, maximizing data coherence.
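As a rough sketch of what the naive version does per boid (this is not the repo's exact kernel; the kernel name, parameter names, and constant values below are assumptions in the style of the course base code):

```cuda
#define GLM_FORCE_CUDA
#include <glm/glm.hpp>

// Placeholder tuning constants -- not the values used in this project.
#define rule1Distance 5.0f
#define rule2Distance 3.0f
#define rule3Distance 5.0f
#define rule1Scale 0.01f
#define rule2Scale 0.1f
#define rule3Scale 0.1f
#define maxSpeed 1.0f

// One thread per boid: scan all N boids and accumulate the three rules.
__global__ void kernUpdateVelocityBruteForce(int N, const glm::vec3 *pos,
                                             const glm::vec3 *vel1,
                                             glm::vec3 *vel2) {
  int self = blockIdx.x * blockDim.x + threadIdx.x;
  if (self >= N) return;

  glm::vec3 perceivedCenter(0.0f);  // rule 1: cohesion
  glm::vec3 separate(0.0f);         // rule 2: separation
  glm::vec3 perceivedVel(0.0f);     // rule 3: alignment
  int rule1Count = 0, rule3Count = 0;

  for (int other = 0; other < N; ++other) {  // O(N) per boid, O(N^2) overall
    if (other == self) continue;
    float d = glm::distance(pos[other], pos[self]);
    if (d < rule1Distance) { perceivedCenter += pos[other]; ++rule1Count; }
    if (d < rule2Distance) { separate -= (pos[other] - pos[self]); }
    if (d < rule3Distance) { perceivedVel += vel1[other]; ++rule3Count; }
  }

  glm::vec3 dv(0.0f);
  if (rule1Count > 0) dv += (perceivedCenter / (float)rule1Count - pos[self]) * rule1Scale;
  dv += separate * rule2Scale;
  if (rule3Count > 0) dv += (perceivedVel / (float)rule3Count) * rule3Scale;

  // Ping-pong buffers: read from vel1, write to vel2, then clamp the speed.
  glm::vec3 newVel = vel1[self] + dv;
  float speed = glm::length(newVel);
  vel2[self] = (speed > maxSpeed) ? newVel * (maxSpeed / speed) : newVel;
}
```

The grid variants replace the inner loop over all N boids with a loop over only the boids in the handful of grid cells adjacent to the boid's own cell.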

## Performance Analysis

My performance analysis was not done in an efficient manner. If I had to do this again, I would alter the program to take command-line arguments for the parameters (N_FOR_VIS and blockSize) and print the time elapsed between events. I would then write a script to iterate through my test cases.
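A minimal sketch of that parameter plumbing, assuming a hypothetical entry point (the actual program hard-codes N_FOR_VIS and blockSize):

```cuda
#include <cstdlib>

// Hypothetical: allow a benchmark script to override the hard-coded constants.
// Usage: ./cis565_boids [numBoids] [blockSize]
int main(int argc, char *argv[]) {
  int numBoids  = (argc > 1) ? std::atoi(argv[1]) : 5000;  // default N_FOR_VIS
  int blockSize = (argc > 2) ? std::atoi(argv[2]) : 128;   // default blockSize
  // ... initialize the simulation with numBoids and launch kernels with blockSize ...
  return 0;
}
```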

But alas, I did not do that, and instead relied on the Nsight performance analysis tools to take timing readings. I didn't have a chance to sum up all the results, but they are displayed below.

Essentially, what we are trying to optimize here is the time it takes to prepare the new velocities for the updatePos kernel, which is standard across implementations. This is the time interval I am trying to show in the results below.
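Outside of Nsight, that interval could also be bracketed with CUDA events; a sketch, assuming the simulation's existing device buffers and launch parameters (dev_pos, dev_vel1, dev_vel2, N, fullBlocksPerGrid, and blockSize are assumed names):

```cuda
// Hypothetical helper: time whatever produces the new velocities for updatePos.
// The buffers and launch parameters are assumed to exist elsewhere in the program.
float timeVelocityUpdate() {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(
      N, dev_pos, dev_vel1, dev_vel2);
  cudaEventRecord(stop);

  cudaEventSynchronize(stop);   // wait until the kernel has finished
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;                    // elapsed time in milliseconds
}
```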

The metrics below clearly indicate that performance is inversely proportional to the number of boids. This is because as the number of boids rises, so does the population density; as a result, each boid has that many more neighbors for which to calculate the three rules. Moreover, since in the naive implementation each boid must evaluate every other boid, the cost grows quadratically with the boid count.

Interestingly enough, increasing the number of threads per block improved the performance of the brute-force algorithm up to a point (best time at 256), while it generally hurt performance for both uniform grid algorithms (best time at 128). This might be because the brute-force implementation's performance is dominated by computation, since it does work for every boid against every other boid, whereas the grid implementations are bottlenecked by memory. Perhaps increasing the number of threads per block allows for more simultaneous computation but also causes more contention for memory bandwidth, which would explain the performance impact well.
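For context, the block size only changes how the same amount of work is carved up; the grid size is derived from it, roughly like this (variable names assumed):

```cuda
// More threads per block means fewer, larger blocks for the same N boids.
dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(N, dev_pos, dev_vel1, dev_vel2);
```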

Implementing the coherent uniform grid definitely resulted in a performance increase. This is the result we expected, since it cuts out a memory access and instead uses a uniform addressing scheme. I found it a bit surprising, since we are still required to do a memory access, albeit in the form of data relocation; perhaps it has to do with not needing to flush a data set out of cache.
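To make the indirection concrete, here is a rough sketch of the inner loop over one grid cell in each grid variant (buffer names are approximations of the project's buffers, not verbatim):

```cuda
// Scattered grid: one extra level of indirection. particleArrayIndices maps
// the sorted grid ordering back to each boid's original slot in pos/vel1.
for (int i = cellStart; i < cellEnd; ++i) {
  int boidIdx = particleArrayIndices[i];   // extra global memory read
  glm::vec3 otherPos = pos[boidIdx];       // gathers from scattered locations
  glm::vec3 otherVel = vel1[boidIdx];
  // ... apply the three rules ...
}

// Coherent grid: pos/vel1 were reshuffled into grid order beforehand, so the
// boids in a cell sit next to each other in memory and are read contiguously.
for (int i = cellStart; i < cellEnd; ++i) {
  glm::vec3 otherPos = coherentPos[i];
  glm::vec3 otherVel = coherentVel[i];
  // ... apply the three rules ...
}
```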

## Implementation vs Boid Count

### Naive Implementation

Fortunately, only one kernel call occurs between position updates in the naive implementation.

| # Boids | Time Elapsed |
| ------- | ------------ |
| 500     | 1.2 ms       |
| 5,000   | 11.2 ms      |
| 50,000  | crashed CUDA |

### Uniform Grid Implementation

Nsight timing captures (screenshots not reproduced here) for 500; 5,000; 50,000; 500,000; and 5,000,000 boids.

### Coherent Grid Implementation

Nsight timing captures (screenshots not reproduced here) for 500; 5,000; 50,000; 500,000; and 5,000,000 boids.

## Block Sizes

### Naive Implementation

| Threads Per Block | Time Elapsed |
| ----------------- | ------------ |
| 128               | 11.2 ms      |
| 256               | 9.6 ms       |
| 512               | 12.2 ms      |
| 1024              | 12.2 ms      |

### Scattered Grid Implementation

Nsight timing captures (screenshots not reproduced here) for 128 threads per block (same as scattered/50,000 above), 256 threads per block, and 512 threads per block. For reasons unknown, attempting to launch the program with a block size of 1024 crashed it at the point where it would have done the grid search.

### Coherent Grid Implementation

Nsight timing captures (screenshots not reproduced here) for 128 threads per block (same as coherent/50,000 above), 256 threads per block, and 512 threads per block. A block size of 1024 crashed the program, as in the scattered implementation.

## Big Bugs

### Black Hole Boids

My coherent grid search had a bug where, instead of moving away from neighbors as per rule 2, boids would gravitate toward them. This resulted in some boids clumping together and, as they moved around, sucking in any boids that came within their rule2Distance event horizon.
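One plausible way to produce exactly that behavior is a flipped sign in the rule-2 accumulation; purely as an illustration (this is not the actual diff from the repo):

```cuda
// Buggy: adding the offset pulls the boid toward its neighbor.
separate += (otherPos - selfPos);

// Intended rule 2: subtracting the offset pushes the boid away.
separate -= (otherPos - selfPos);
```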

### Boid outta Hell

Due to a faulty type declaration, boids that were being set with their own values were instead getting huge negative values, most likely resulting from an implicit float-to-int cast. This resulted in red boids zipping around at the top of the scene like embers above a fire.
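As an illustration of the hazard (again, not the repo's actual line), an out-of-range implicit float-to-int conversion is one way to end up with huge negative values:

```cuda
#include <cstdio>

int main() {
  float big = 1.0e12f;  // far outside the range of int
  int truncated = big;  // implicit float->int conversion is undefined for
                        // out-of-range values; in practice it often yields INT_MIN
  printf("%d\n", truncated);  // commonly prints -2147483648
  return 0;
}
```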
