---
layout: distill
title: "All About Rooflines"
description: "When we run algorithms on hardware, we're bounded by three things: how fast it can do math (OPs/second), the bandwidth available for moving data around (bytes/second), and the total memory available to store data (bytes). These “roofline” constraints let us upper and lower bound the time of a given computation."
date: 2025-02-04
future: true
htmlwidgets: true
hidden: false
section_number: 1
previous_section_url: ".."
previous_section_name: "Part 0: Introduction"
next_section_url: "../tpus"
next_section_name: "Part 2: TPUs"
bibliography: main.bib
giscus_comments: true
_styles: >
  .fake-img {
    background: #bbb;
    border: 1px solid rgba(0, 0, 0, 0.1);
    box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
    margin-bottom: 12px;
  }
  .fake-img p {
    font-family: monospace;
    color: white;
    text-align: left;
    margin: 12px 0;
    text-align: center;
    font-size: 16px;
  }
---
Let's start with an extremely simple question: why does an algorithm take 50ms instead of 50s or 5ms? What is actually happening within the model that takes substantial time and how long should we expect it to take?
**Computation:** A deep learning model is effectively a bunch of matrix multiplications, each composed of floating-point multiplication and addition ‘operations' (FLOPs). Our accelerator speed determines how long these take to compute:

$$T_\text{math} = \frac{\text{Computation FLOPs}}{\text{Accelerator FLOPs/s}}$$
For instance, an NVIDIA H100 can perform about `9.89e14` bfloat16 FLOPs/s (bf16 is short for bfloat16, a 16-bit floating-point format often used in ML), while a TPU v6e can perform `9.1e14` FLOPs/s. That means doing 1e12 FLOPs will take (roughly) `1e12 / 9.89e14 = 1.01ms` on an H100 and `1e12 / 9.1e14 = 1.1ms` on a TPU v6e. (Note that these chips are priced differently, and this comparison does not normalize to cost.)
**Communication within a chip:** Within an accelerator, tensors need to be transferred between on-chip memory (HBM) and the compute cores. You'll see the bandwidth of this link referred to as "HBM bandwidth" (NVIDIA also calls this "memory bandwidth"). On an H100, this is about 3.35TB/s and on TPU v6e it is about 1.6TB/s.
**Communication between chips:** When we distribute a model across multiple accelerators, tensors frequently need to be transferred between them. There are often a few options for this on our hardware (ICI, DCN, and PCIe), each with different bandwidths.
Whether the communication happens within a chip or between chips, we measure it in bytes/s and estimate the total communication time with:

$$T_\text{comms} = \frac{\text{Communication Bytes}}{\text{Network/Memory Bandwidth Bytes/s}}$$
Typically (but not always), computation within a single chip can be overlapped with communication both within a chip and between chips. This means we can lower-bound training and inference time by using the maximum of computation and communication time, and upper-bound it with their sum:

$$T_\text{lower} = \max(T_\text{math}, T_\text{comms})$$

$$T_\text{upper} = T_\text{math} + T_\text{comms}$$

In practice, we optimize against the maximum since the algebra is simpler and we can usually come close to this bound by overlapping our communication and computation. If we optimize with the maximum in mind, then the lower and upper bounds differ by at most a factor of 2, since $$T_\text{upper} \leq 2 T_\text{lower}$$.
If we assume we can perfectly overlap communication and computation, then when $$T_\text{math} > T_\text{comms}$$ we achieve full utilization of our hardware's compute; we call this being "compute-bound". When $$T_\text{comms} > T_\text{math}$$, we tend to be "communication-bound" (or "comms-bound"), and at least some fraction of our accelerator's FLOPs/s is wasted waiting for data to be moved around.
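To make this concrete, here is a minimal Python sketch of these bounds. The helper name and the example workload are our own (not from any library), using TPU v5e-like figures of about `1.97e14` FLOPs/s and `8.2e11` bytes/s of HBM bandwidth, which we discuss below:

```python
def roofline_times(flops, comm_bytes, accel_flops_per_s, bandwidth_bytes_per_s):
    """Estimate compute time, communication time, and runtime bounds."""
    t_math = flops / accel_flops_per_s
    t_comms = comm_bytes / bandwidth_bytes_per_s
    lower = max(t_math, t_comms)   # perfect overlap of compute and comms
    upper = t_math + t_comms       # no overlap at all
    return t_math, t_comms, lower, upper

# Hypothetical workload: 1e12 FLOPs and 1e10 bytes of HBM traffic on a TPU v5e
# (~1.97e14 FLOPs/s, ~8.2e11 bytes/s of HBM bandwidth).
t_math, t_comms, lower, upper = roofline_times(1e12, 1e10, 1.97e14, 8.2e11)
print(f"T_math={t_math:.2e}s T_comms={t_comms:.2e}s bounds=[{lower:.2e}, {upper:.2e}]s")
```

In this made-up example the workload is comms-bound: the roughly 12ms of HBM traffic dominates the roughly 5ms of math, so overlap alone cannot hide it.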
**Definition:** the arithmetic intensity of an algorithm is given by the ratio of the total FLOPs it performs to the number of bytes it needs to communicate — either within a chip or between chips.
Arithmetic intensity measures the "FLOPs per byte" of a given operation. To a first order, when our arithmetic intensity is high, $$T_\text{math}$$ is large relative to $$T_\text{comms}$$ and we typically use most of the available FLOPs; when it is low, we spend most of our time waiting on communication and waste FLOPs.
The ratio of an accelerator's peak FLOPs/s to its bandwidth is its own critical arithmetic intensity: the intensity an algorithm needs in order to achieve the hardware's peak FLOPs/s. For example, a TPU v5e can perform `1.97e14` bfloat16 FLOPs/s and load `8.2e11` bytes/s from HBM, so its critical intensity is roughly `1.97e14 / 8.2e11 ≈ 240` FLOPs/byte. That means if an algorithm has a lower arithmetic intensity than 240 FLOPs/byte, it will be bound by byte loading and thus we won't make good use of our hardware. (This is only true if the algorithm loads its weights from HBM and runs in the MXU. As we'll discuss in the next section, we can sometimes store parameters in VMEM, which has a much higher bandwidth. Many algorithms also run in the VPU, which has different performance characteristics.) Let's look at one such example:
**Example (dot product):** to compute the dot product of two vectors in bfloat16 precision, `x • y: bf16[N], bf16[N] → bf16[1]`, we need to load $$2N + 2N$$ bytes from memory (2 bytes per element), perform $$N$$ multiplications and $$N - 1$$ additions, and write $$2$$ bytes back to memory. So the arithmetic intensity is

$$\text{Intensity}(\text{dot product}) = \frac{\text{Total FLOPs}}{\text{Total Bytes}} = \frac{N + N - 1}{2N + 2N + 2} = \frac{2N - 1}{4N + 2} \rightarrow \frac{1}{2}$$

as $$N \rightarrow \infty$$. Half a FLOP per byte is far below the roughly 240 FLOPs/byte our hardware can sustain, so the dot product is firmly communication-bound: most of its runtime is spent loading bytes, not doing math.
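As a quick numerical check of this limit, here is a small sketch (our own helper, just for illustration) that evaluates the dot-product intensity for a few sizes and compares it against a ~240 FLOPs/byte critical intensity:

```python
def dot_product_intensity(n):
    """Arithmetic intensity of a length-n bfloat16 dot product."""
    flops = 2 * n - 1                 # n multiplies + (n - 1) adds
    bytes_moved = 2 * n + 2 * n + 2   # load two bf16 vectors, write 2 bytes back
    return flops / bytes_moved

CRITICAL_INTENSITY = 240  # approximate TPU v5e HBM roofline, from above

for n in [128, 8192, 1_000_000]:
    intensity = dot_product_intensity(n)
    bound = "compute" if intensity > CRITICAL_INTENSITY else "comms"
    print(f"N={n:>9}: {intensity:.3f} FLOPs/byte ({bound}-bound)")
```

The intensity approaches 1/2 from below as N grows, so the dot product stays comms-bound at every size.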
We can visualize the tradeoff between memory and compute using a roofline plot, which plots the peak achievable FLOPs/s (throughput) of an algorithm on our hardware (the y-axis) against the arithmetic intensity of that algorithm (the x-axis). Here's an example log-log plot:
{% include figure.liquid path="assets/img/roofline-improved.png" class="img-fluid" caption="Figure: an example roofline plot showing two algorithms with different arithmetic intensities (Algo 1 and Algo 2) and their corresponding theoretical peak throughput under different bandwidths (BW1 and BW2). In the red area, an algorithm is bandwidth bound at both bandwidths and is wasting some fraction of the hardware's peak FLOPs/s. The yellow area is bandwidth-bound only at the lower bandwidth (BW1). The green area is compute-bound at all bandwidths. Here, we are using the peak FLOPs/s of the accelerator and increasing bandwidth or improving intensity yield no benefit." %}
Above, as the intensity increases (moving left to right), we initially see a linear increase in the performance of our algorithm (in FLOPs/s) until we hit the critical arithmetic intensity of the hardware, 240 in the case of the TPU v5e. Any algorithm with a lower intensity will be bandwidth (BW) bound and limited by the peak memory bandwidth (shown in red). Any algorithm to the right will fully utilize our FLOPs (shown in green). Here, Algo 1 is comms-bound and uses only a fraction of the total hardware FLOPs/s. Algo 2 is compute-bound. We can generally improve the performance of an algorithm either by increasing its arithmetic intensity or by increasing the memory bandwidth available (moving from BW1 to BW2).
Let's look at our soon-to-be favorite algorithm: matrix multiplication (aka matmul). We write $$X * Y \rightarrow Z$$, where $$X$$ has shape $$\text{bf16}[B, D]$$, $$Y$$ has shape $$\text{bf16}[D, F]$$, and $$Z$$ has shape $$\text{bf16}[B, F]$$. To perform the matmul, we need to load $$2BD + 2DF$$ bytes, perform $$2BDF$$ FLOPs (each of the $$BF$$ output entries is a dot product of length $$D$$, costing about $$2D$$ FLOPs), and write $$2BF$$ bytes back, so

$$\text{Intensity}(\text{matmul}) = \frac{2BDF}{2BD + 2DF + 2BF} = \frac{BDF}{BD + DF + BF}$$

We can get a nice simplification if we assume our local "batch size" $$B$$ is small relative to $$D$$ and $$F$$. Then the $$DF$$ term dominates the denominator and we get

$$\frac{BDF}{BD + DF + BF} \approx \frac{BDF}{DF} = B$$

so the intensity of a matmul is roughly its batch size, and the matmul becomes compute-bound once $$B$$ exceeds the hardware's critical intensity of roughly 240. This is a reasonable assumption for Transformer matmuls, since $$D$$ and $$F$$ are typically in the thousands or tens of thousands while our local (per-chip) batch size in tokens is usually much smaller.
**Takeaway:** for a bfloat16 matmul to be compute-bound on most TPUs, we need our local batch size in tokens to be greater than 240.
This comes with a few notable caveats we'll explore in the problems below, particularly with respect to quantization (e.g., if we quantize our activations but still do full-precision FLOPs), but it's a good rule to remember. For GPUs, this number is slightly higher (closer to 300), but the same conclusion generally holds. We'll discuss the lower-level GPU and TPU details in the next section.
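As a rough illustration (the helper and the example shapes here are our own, not from any particular model), we can compute the exact matmul intensity and compare it to the ~240 threshold:

```python
def matmul_intensity(b, d, f, bytes_per_value=2):
    """Arithmetic intensity of a [B, D] x [D, F] matmul (bfloat16 by default)."""
    flops = 2 * b * d * f
    bytes_moved = bytes_per_value * (b * d + d * f + b * f)
    return flops / bytes_moved

CRITICAL_INTENSITY = 1.97e14 / 8.2e11   # ~240 FLOPs/byte on a TPU v5e

for b in [64, 240, 512]:
    intensity = matmul_intensity(b, d=8192, f=32768)
    bound = "compute" if intensity > CRITICAL_INTENSITY else "comms"
    print(f"B={b:>4}: intensity ~ {intensity:.0f} FLOPs/byte ({bound}-bound)")
```

Note that the exact intensity is slightly below $$B$$ because of the $$BD$$ and $$BF$$ terms, so a batch size right at 240 can still land just under the threshold.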
All the rooflines we've discussed so far have been memory-bandwidth rooflines, all within a single chip. This shouldn't be taken as a rule. In fact, most of the rooflines we'll care about in this book involve communication between chips: usually matrix multiplications that involve matrices sharded across multiple TPUs.
To pick a somewhat contrived example, say we want to multiply two big matrices $$X \sim \text{bf16}[B, D]$$ and $$Y \sim \text{bf16}[D, F]$$ that are split evenly across two TPUs along the $$D$$ dimension. To do this multiplication, we can multiply half of each matrix on each TPU (i.e. `X[:, :D // 2] @ Y[:D // 2, :]` on TPU 0 and `X[:, D // 2:] @ Y[D // 2:, :]` on TPU 1) and then copy the resulting "partial sums" to the other TPU and add them together. Say we can copy `4.5e10` bytes in each direction and perform `1.97e14` FLOPs/s on each chip. What are $$T_\text{math}$$ and $$T_\text{comms}$$?

$$T_\text{math}$$ is half of what it would be on a single chip, since each TPU does half the work:

$$T_\text{math} = \frac{2BDF / 2}{1.97e14} = \frac{BDF}{1.97e14}$$

Now what about $$T_\text{comms}$$? Here it refers to communication between chips: each TPU has to send its $$\text{bf16}[B, F]$$ partial sum, i.e. $$2BF$$ bytes, over the inter-chip link:

$$T_\text{comms} = \frac{2BF}{4.5e10}$$

Therefore we become compute-bound (now with respect to the inter-chip network) when $$T_\text{math} > T_\text{comms}$$, i.e. when

$$\frac{BDF}{1.97e14} > \frac{2BF}{4.5e10} \iff D > \frac{2 \cdot 1.97e14}{4.5e10} \approx 8755$$

Note that, unlike before, the threshold now depends on $$D$$ rather than on the batch size $$B$$.
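Here is a minimal sketch of this inter-chip roofline (our own code; the example shapes are arbitrary, while the bandwidth and FLOPs/s are the numbers above):

```python
ACCEL_FLOPS = 1.97e14   # FLOPs/s per chip
ICI_BW = 4.5e10         # bytes/s in each direction between the two chips

def sharded_matmul_times(b, d, f):
    """T_math and T_comms for the 2-way (D-sharded) matmul described above."""
    t_math = (2 * b * d * f / 2) / ACCEL_FLOPS   # each chip does half the FLOPs
    t_comms = (2 * b * f) / ICI_BW               # each chip sends a bf16[B, F] partial sum
    return t_math, t_comms

print("critical D:", 2 * ACCEL_FLOPS / ICI_BW)   # ~8755
for d in [4096, 8192, 16384]:
    t_math, t_comms = sharded_matmul_times(b=512, d=d, f=32768)
    print(f"D={d:>6}: {'compute' if t_math > t_comms else 'comms'}-bound")
```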
**Question 1 [int8 matmul]:** Say we want to do the matmul $$\text{int8}[B, D] \cdot \text{int8}[D, F] \rightarrow \text{int8}[B, F]$$, i.e. with inputs and outputs stored in int8 (1 byte per parameter) instead of bfloat16.
- How many bytes need to be loaded from memory? How many need to be written back to memory?
- How many total OPs are performed?
- What is the arithmetic intensity?
- What is a roofline estimate for $$T_\text{math}$$ and $$T_\text{comms}$$? What are reasonable upper and lower bounds for the runtime of the whole operation?
Throughout you can assume our HBM bandwidth is `8.1e11` bytes/s and our int8 peak OPs/s is `3.94e14`.
{% details Click here for the answer. %}
- Because we're storing our parameters in int8, we have 1 byte per parameter, so we have $$BD + DF$$ bytes loaded from HBM and $$BF$$ bytes written back.
- The FLOPs count is the same as in bfloat16 (though in theory int8 OPs/s should be faster), so this is still $$2BDF$$ FLOPs.
- The arithmetic intensity is $$2BDF / (BD + DF + BF)$$. If we make the same assumption as above about $$B \ll D$$ and $$B \ll F$$, we get an arithmetic intensity of $$2B$$, meaning our rule becomes $$B > \text{HBM int8 arithmetic intensity} / 2$$. Using the numbers given, this int8 intensity is `3.94e14 / 8.1e11 = 486`, so the rule is $$B > 486 / 2 = 243$$. Note that this is basically unchanged!
- $$T_\text{math} = 2BDF / 3.94e14$$ and $$T_\text{comms} = (BD + DF + BF) / 8.1e11$$, so a reasonable lower bound on the runtime is $$\max(T_\text{math}, T_\text{comms})$$ and an upper bound is $$T_\text{math} + T_\text{comms}$$.
{% enddetails %}
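If you want to sanity-check these numbers, here is a tiny script of our own (the example shapes are arbitrary) that plugs them in:

```python
INT8_OPS = 3.94e14   # int8 OPs/s
HBM_BW = 8.1e11      # HBM bytes/s

def int8_matmul_bounds(b, d, f):
    """Lower/upper runtime bounds for the int8[B, D] x int8[D, F] matmul."""
    t_math = 2 * b * d * f / INT8_OPS
    t_comms = (b * d + d * f + b * f) / HBM_BW
    return max(t_math, t_comms), t_math + t_comms

print("critical batch size:", INT8_OPS / HBM_BW / 2)   # ~243
print(int8_matmul_bounds(b=256, d=8192, f=32768))      # lower and upper bounds in seconds
```

At B = 256, just above the ~243 threshold, the two times come out nearly equal, as expected.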
**Question 2 [int8 + bf16 matmul]:** In practice we sometimes use different weight vs. activation quantization, so we might quantize our weights in int8 but keep activations in bfloat16 (and consequently perform the matmul in bfloat16). At what batch size do we become compute-bound? As above, assume `1.97e14` bfloat16 FLOPs/s.
Hint: this means specifically `bfloat16[B, D] * int8[D, F] -> bfloat16[B, F]`, where $$B$$ is the "batch size".
{% details Click here for the answer. %}
Again assuming $$B$$ is small, we have $$2BDF$$ bfloat16 FLOPs but only $$DF$$ bytes of weights to load (instead of $$2DF$$ in bfloat16), so the arithmetic intensity is roughly $$2BDF / DF = 2B$$. This means we become compute-bound when $$2B > 240$$, i.e. when $$B > 120$$, a lower threshold than in the pure-bfloat16 case.
{% enddetails %}
**Question 3:** For the problem above, make a roofline plot of peak FLOPs/s vs. $$B$$ for several values of $$D$$ and $$F$$.
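One way you might sketch such a plot (a rough matplotlib example of ours, not the book's code, using the hardware numbers from above):

```python
import numpy as np
import matplotlib.pyplot as plt

BF16_FLOPS = 1.97e14   # bfloat16 FLOPs/s
HBM_BW = 8.1e11        # HBM bytes/s

def peak_throughput(b, d, f, act_bytes, w_bytes):
    """Roofline: achievable FLOPs/s for a [B, D] x [D, F] matmul, capped by HBM."""
    comm_bytes = act_bytes * (b * d + b * f) + w_bytes * d * f
    intensity = 2 * b * d * f / comm_bytes
    return np.minimum(BF16_FLOPS, intensity * HBM_BW)

b = np.logspace(0, 12, 200, base=2)   # batch sizes from 1 to 4096
for d, f in [(4096, 16384), (8192, 32768)]:
    # bf16 activations + int8 weights (Question 2) vs. everything in bf16
    plt.plot(b, peak_throughput(b, d, f, act_bytes=2, w_bytes=1),
             label=f"bf16 act + int8 w, D={d}, F={f}")
    plt.plot(b, peak_throughput(b, d, f, act_bytes=2, w_bytes=2), "--",
             label=f"bf16, D={d}, F={f}")
plt.xscale("log"); plt.yscale("log")
plt.xlabel("batch size B"); plt.ylabel("peak FLOPs/s")
plt.legend(); plt.show()
```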
**Question 4:** What if we wanted to perform $$\text{int8}[B, D] \cdot \text{int8}[B, D, F] \rightarrow \text{int8}[B, F]$$, i.e. a "batched" matmul where each batch element has its own weight matrix? What is the arithmetic intensity of this operation?
{% details Click here for the answer. %}
Let's start by looking at the total FLOPs and comms.
- Total FLOPs: the FLOP count is basically the same, since we're doing the same number of $$BD \times DF$$ matmuls (this is discussed more in Section 4). So this is just $$2BDF$$.
- Total comms: we have a lot more comms here: $$BD + BDF + BF$$ bytes.
- Therefore, our arithmetic intensity is now $$2BDF / (BD + BDF + BF)$$. Since $$BDF$$ dominates the denominator, this is roughly $$2$$. So instead of depending on the batch size, the intensity is essentially constant. This is bad, because it means we'll basically always be comms-bound no matter what.
{% enddetails %}
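A few lines of our own code confirm that this intensity barely moves with batch size:

```python
# Intensity of the batched matmul int8[B, D] x int8[B, D, F] -> int8[B, F].
def batched_matmul_intensity(b, d, f):
    return 2 * b * d * f / (b * d + b * d * f + b * f)

for b in [1, 16, 256, 4096]:
    print(b, round(batched_matmul_intensity(b, d=8192, f=32768), 4))  # ~2 for every B
```

In fact $$B$$ cancels exactly here, so the intensity sits just below 2 regardless of batch size.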
**Problem 5 [Memory Rooflines for GPUs]:** Using the spec sheet provided by NVIDIA for the H100, calculate the batch size at which a matrix multiplication will become compute-bound. Note that the Tensor Core FLOPs numbers quoted there are twice the true value, since they're only achievable with structured sparsity.
{% details Click here for the answer. %}
From the spec sheet, we see that the reported bfloat16 FLOPs value is `1.979e15` FLOPs/s, with an asterisk noting "with sparsity". The true value without sparsity is half this, meaning close to `1e15` FLOPs/s. The memory bandwidth is 3.35TB/s, or `3.35e12` bytes/s. Thus the critical batch size is `1e15 / 3.35e12 = 298`, rather similar to the TPU.
{% enddetails %}