Commit

Merge pull request #706 from RSE-Sheffield/ns-rse/705-hide-body-gpu
Add excerpt_separator
ns-rse authored Sep 14, 2023
2 parents b153141 + a46dbfd commit 122aa69
Showing 1 changed file with 4 additions and 2 deletions.
@@ -8,7 +8,7 @@ tags: GPU FLAMEGPU benchmarking
category:
link:
description:
-social_image:
+social_image:
type: text
excerpt_separator: <!--more-->
---
@@ -40,6 +40,8 @@ The existing A100 nodes each contain 4 GPUs which are directly connected to one
The NVLink interconnect offers higher memory bandwidth for GPU to GPU communication, which combined with twice as many GPUs per node may lead to shorter application run-times than offered by the H100 nodes.
If even more GPUs are required moving to the Tier 2 systems may be required, with Jade 2 offering up to 8 GPUs per Job, and Bede being the only current option for multi-node GPU jobs, with up to 128 GPUs per job.

+<!--more-->
+
Within Stanage, software may need recompiling to run on the H100 nodes, or new versions of libraries may be required. For more information see the [HPC Documentation][stanage-using-gpus].

Carl Kennedy and Nicholas Musembi of the Research and Innovation Team in IT Services have [benchmarked these new GPUs using popular machine learning frameworks][h100-rcg-ml-benchmark], however not all HPC workloads exhibit the same performance characteristics as machine learning.
@@ -144,7 +146,7 @@ When using Run-time compilation, performance improves significantly. This is in
Using the much more work efficient Spatial 3D communication strategy, simulation run-times are significantly quicker than any of the brute-force benchmarks, with the largest simulations taking at most `0.944`s rather than `1457`s.
On average, each agent is only reading `204.5` messages, rather than all `1000000` messages each agent must read in the bruteforce case.
This greatly reduces the number of global memory reads performed and subsequently the impact of RTC is diminished although still significant.
-As the initial density of the simulations and communication radius are maintained as the population is scaled, the average number of relevant messages is roughly comparable at each scale, resulting in a more linear relationship between simulation time and population size.
+As the initial density of the simulations and communication radius are maintained as the population is scaled, the average number of relevant messages is roughly comparable at each scale, resulting in a more linear relationship between simulation time and population size.

![Figure 4: Circles Spatial3D - Mean Simulation Time (s) against Population Size](/assets/images/2023-08-18-benchmarking-flamegpu2-on-h100-a100-and-v100-gpus/plot-h100-a100-v100-cuda-118-fixed-density-circles_spatial3D.png)
