First draft of methodology for modeling generative AI systems (#221)
* wip on AI model
* split ai to new section
* wip
* wip
* first draft of training
* split out cluster and nvidia
* fine tuning updated
* inference service
* fix tests
* add model for token to energy
* update inference model to include LoRA; update overview to include data from BDavy
* break out datacenter more clearly in cluster defintiion
* update water use estimates for a100
* update cluster link & fix typos
* break out memory usage for more granular calculation
* Split out foundation components, fix various typos
* Update overview.mdx LR updating phase description. Testing for update process going forward
* Update overview.mdx Remove extra word

Co-authored-by: lratliff3 <[email protected]>
Showing 16 changed files with 905 additions and 4 deletions.
@@ -0,0 +1,130 @@
---
title: "Server cluster"
description: "Methodology for calculating the water, energy, and embodied emissions of a server cluster based on time"
---

## Overview

A server cluster is a group of servers in a single datacenter or cloud platform. Calculating the aggregate emissions of the cluster creates a logical entity that can be used to model how software uses the cluster. Note that a cluster is assumed to have uniform utilization. This higher-level abstraction is necessary to model the relationships between the different components of a computing system.

As an example of what constitutes an AI cluster, Meta has [documented their genAI infrastructure](https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/), which serves as a useful illustration of a scaled, purpose-built training cluster.

## Inputs: Defining a cluster

A cluster is defined by:
- Number of servers/instances in the cluster (if static)
- Cloud instance type or server details (see below)
- Cloud region or datacenter details (see below)

### Server details
A server is defined by:
- CPU manufacturer and model
- GPU manufacturer and model (see [gpu specs](https://github.com/mlco2/impact/blob/master/data/gpus.csv))
- Memory in GB
- Number of CPUs
- Number of GPUs

### Datacenter details
A datacenter is defined by:
- PUE
- WUE
- Grid region
- On-site or dedicated renewable energy by hour
- Overhead equipment (racks, networking gear, etc.) embodied emissions per server-hour (usage is included in PUE) - see this [tour of a Meta datacenter](https://metainfrahardware.com/#/web/1)

### Example

| Component | Disclosed data |
| --------- | -------------- |
| GPU | Nvidia A100 80GB |
| Server | HPE Apollo 6500 Gen10 Plus |
| Number of GPUs | 384 |
| Number of servers | 48 |
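
As a minimal sketch of how these inputs could be captured in code, the following represents the example cluster above. The field names, CPU model, memory size, and datacenter values are illustrative assumptions rather than disclosed data.

```python
from dataclasses import dataclass

@dataclass
class Server:
    cpu_model: str
    gpu_model: str
    memory_gb: int
    num_cpus: int
    num_gpus: int

@dataclass
class Datacenter:
    pue: float
    wue_l_per_kwh: float
    grid_region: str

@dataclass
class Cluster:
    num_servers: int
    server: Server
    datacenter: Datacenter

# The example above: 48 HPE Apollo 6500 servers with 8 A100s each (384 GPUs total).
example = Cluster(
    num_servers=48,
    server=Server(
        cpu_model="AMD EPYC 7763",   # assumed; not disclosed above
        gpu_model="Nvidia A100 80GB",
        memory_gb=512,               # assumed; not disclosed above
        num_cpus=2,
        num_gpus=8,
    ),
    datacenter=Datacenter(pue=1.2, wue_l_per_kwh=0.4, grid_region="US West"),  # assumed values
)
```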

## Outputs: Calculating cluster impact

The cluster methodology produces the following outputs:
- Embodied emissions per hour reserved
- Manufacturing water consumption per hour reserved
- Usage energy coefficients for the energy equation below:
  - idle cluster power
  - net CPU TDP (CPU max power - CPU idle power)
  - net GPU TDP (GPU max power - GPU idle power)
  - number of CPUs
  - number of GPUs
- Peak throughput-α (as described by [OpenCarbonEval](https://arxiv.org/pdf/2405.12843))
- Peak TFLOPs/s

### Embodied emissions

From [LLMCarbon: Modeling the End-to-end Carbon Footprint of Large Language Models](https://arxiv.org/pdf/2309.14393), the embodied carbon of a chip can be estimated from its area: "The Carbon emitted Per unit Area (CPA) is contingent on various semiconductor fabrication parameters, including yield, energy consumption per unit area during manufacturing, emissions from chemicals utilized in hardware production, and emissions associated with raw material sourcing for fabrication."

The table below lists representative values shared by the paper. By aggregating all of the components of the technical infrastructure used to train or operate a model, the total embodied emissions can be calculated.

| Hardware | Process | Area or capacity | CPA |
| -------- | ------- | ---------------- | --- |
| CPU | TSMC 16nm | 147 mm² | 1 kgCO2/cm² |
| DRAM | Micron 18nm | 256 GB | 0.4 kgCO2/GB |
| SSD | Samsung 20nm | 32 TB | 0.018 kgCO2/GB |
| TPUv3 | TSMC 16nm | 700 mm² | 1 kgCO2/cm² |
| TPUv4 | TSMC 7nm | 400 mm² | 1.6 kgCO2/cm² |
| V100 | TSMC 12nm | 815 mm² | 1.2 kgCO2/cm² |
| H100 | TSMC 4nm | 814 mm² | 1.8 kgCO2/cm² |
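
As a worked illustration of the CPA approach, the sketch below converts die area to embodied carbon. The A100 die area (826 mm²), the reuse of the 7nm CPA from the table, and applying the DRAM factor to HBM are assumptions for illustration only.

```python
# Area-based embodied carbon: die area x CPA (per LLMCarbon).
def chip_embodied_kgco2(die_area_mm2: float, cpa_kgco2_per_cm2: float) -> float:
    return (die_area_mm2 / 100.0) * cpa_kgco2_per_cm2  # 100 mm² = 1 cm²

gpu_die = chip_embodied_kgco2(826, 1.6)  # ≈ 13.2 kgCO2 for the GPU die (assumed area and CPA)
hbm = 80 * 0.4                           # 80 GB of HBM at the DRAM factor → 32 kgCO2 (assumption)
print(f"A100 die + memory ≈ {gpu_die + hbm:.1f} kgCO2")
```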

### Energy use

The energy calculation uses derived data from the cluster definition:
- The TDP of the GPU (provided by the manufacturer)
- The TDP of the CPU (provided by the manufacturer)
- The TDP of the memory (provided by the manufacturer)
- The idle power draw of the server (see [Cloud Carbon Footprint](https://www.cloudcarbonfootprint.org) for common cloud instances). This power draw should include the NIC, SSD, and other components in the server. Boavizta has [some tools](https://doc.api.boavizta.org/Explanations/devices/server/) to help model this.

The power draw of the cluster E (in watts), given the GPU utilization G and the CPU utilization C (both in percent), is:
```
E(G,C) = (idle cluster power) + (memory TDP)
       + (C/100) x (net CPU TDP) x (number of CPUs)
       + (G/100) x (net GPU TDP) x (number of GPUs)
```

Note that the resulting energy must be multiplied by the datacenter PUE to include facility overhead, and by the WUE to estimate water consumption.

#### Energy per GPU-hour
The energy use for one GPU-hour (in kWh), assuming 100% GPU utilization and no incremental CPU load, is:
```
E(gpu-hour) = E(100,0) / 1000 / (number of GPUs)
            = ((idle cluster power + memory TDP) / (number of GPUs) + (net GPU TDP)) / 1000
```
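
Both formulas above can be sketched directly in code. The idle power, memory TDP, and net TDP figures used for the 384-GPU example cluster are assumed for illustration only, not manufacturer data.

```python
def cluster_power_w(gpu_util_pct: float, cpu_util_pct: float, *,
                    idle_cluster_w: float, memory_tdp_w: float,
                    net_cpu_tdp_w: float, num_cpus: int,
                    net_gpu_tdp_w: float, num_gpus: int) -> float:
    """E(G,C): instantaneous cluster power draw in watts (before PUE)."""
    return (idle_cluster_w + memory_tdp_w
            + (cpu_util_pct / 100) * net_cpu_tdp_w * num_cpus
            + (gpu_util_pct / 100) * net_gpu_tdp_w * num_gpus)

def kwh_per_gpu_hour(**kw) -> float:
    """E(gpu-hour): kWh for one GPU-hour at 100% GPU and idle CPU."""
    return cluster_power_w(100, 0, **kw) / 1000 / kw["num_gpus"]

# Illustrative numbers only (assumed, not disclosed in the example above).
params = dict(idle_cluster_w=48 * 500, memory_tdp_w=48 * 60,
              net_cpu_tdp_w=200, num_cpus=96,
              net_gpu_tdp_w=300, num_gpus=384)
print(kwh_per_gpu_hour(**params))  # ≈ 0.37 kWh per GPU-hour, before PUE
```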

### Embodied emissions per hour

The embodied emissions per reserved hour are calculated using:
- The embodied emissions of the server (see [Towards Green AI](https://arxiv.org/pdf/2407.10237) for an example PCF)
- The embodied emissions of the GPU
- The projected use life of the server (up to 6 years for cloud platforms, but we suggest using 4 years for AI instances given the pace of change)
- The projected utilization of the servers, noting that utilization means "time reserved", not "time active"

```
EmbEm(h) = ((number of GPUs) x (GPU embodied emissions) +
            (number of servers) x (server embodied emissions))
           / (use life in hours)
           / (utilization)
```
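
A minimal sketch of this amortization follows. The per-GPU and per-server embodied figures, the 4-year life, and the 80% reservation rate are placeholders, not recommended values.

```python
HOURS_PER_YEAR = 8766  # average year, including leap years

def embodied_emissions_per_hour(num_gpus: int, gpu_kgco2: float,
                                num_servers: int, server_kgco2: float,
                                use_life_years: float, utilization: float) -> float:
    """EmbEm(h): embodied kgCO2e attributed to each reserved cluster-hour."""
    total = num_gpus * gpu_kgco2 + num_servers * server_kgco2
    return total / (use_life_years * HOURS_PER_YEAR) / utilization

# Placeholder inputs for the 384-GPU example cluster.
print(embodied_emissions_per_hour(384, 150, 48, 2500, 4, 0.8))  # ≈ 6.3 kgCO2e per reserved hour
```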

### Embodied water use

The embodied water use of the CPU, GPU, and memory chips can be derived from manufacturer sustainability reporting or industry averages, generally based on die size. See [NVIDIA A100](/nvidia_a100#water_use) as an example.

Using:
- The manufacturing water use of the CPU
- The manufacturing water use of the GPU
- The manufacturing water use of the memory chips

The embodied water use is:
```
EmbH2O(h) = ((number of GPUs) x (water use per GPU) +
             (number of CPUs) x (water use per CPU) +
             (number of memory chips) x (water use per memory chip))
            / (use life in hours)
            / (utilization)

(manufacturing water use per chip) = (water use per wafer mask layer per wafer) x (wafer mask layers) / (chips per wafer)
```
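
The per-chip water relation at the end of the block above can be sketched as follows. The wafer figures (water per mask layer, layer count, good dies per wafer) are illustrative assumptions.

```python
def water_per_chip_liters(liters_per_layer_per_wafer: float,
                          mask_layers: int, chips_per_wafer: int) -> float:
    """Manufacturing water attributed to a single chip from wafer-level data."""
    return liters_per_layer_per_wafer * mask_layers / chips_per_wafer

# Illustrative values only: 60 L per mask layer per wafer, 80 layers, 60 good dies per wafer.
print(water_per_chip_liters(60, 80, 60))  # ≈ 80 L of manufacturing water per chip
```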
@@ -0,0 +1,98 @@
---
title: "Fine-tuning"
description: "Methodology for calculating the normalized, amortized emissions from fine-tuning AI models"
---

## Overview

From [Energy and Carbon Considerations of Fine-Tuning BERT](https://arxiv.org/pdf/2311.10267):

> We find that pre-training BERT is equivalent to anywhere from 400 (MNLI) to 45,000 (RTE) fine-tuning runs depending on the dataset size, and that number of training tokens is a reasonable heuristic for estimating fine-tuning energy use. The “true” number of training tokens seen, accounting for dynamic padding of sequences to the maximum length in a batch, is a better predictor than relying on the mean or median number of tokens per example. Further comparison of fine-tuning inference energy intensity across tasks confirms that example sequence length holds a much stronger influence on energy intensity in the fine-tuning phase than in the inference phase, in alignment with expectations from previous work.

> We find that, controlling for hardware, energy consumption scales most predictably with wall clock time and number of tokens encountered during training (including the pad tokens added to sequences to match the maximum sequence length in a batch).

## Disclosure of fine-tuning costs

To assess the environmental impact of fine-tuning a model, developers should disclose the technical infrastructure used for fine-tuning and the duration of this training process.

Infrastructure data:
- [Fine-tuning cluster](/cluster) details
- Managed service used (eg AWS Bedrock)
- Physical location of the datacenter where the fine-tuning occurred

Operational data:
- Base model
- Total fine-tuning time
- GPU and CPU utilization during fine-tuning
- Total fine-tuning tokens (including padding), if the total time is not available, for instance when using a managed service
- Start time

Usage data:
- Expected use life in days
- Expected inferences per day

### Example disclosure

| Component | Disclosed data |
| --------- | -------------- |
| Base model | Llama 2 |
| GPU | Nvidia A100 80GB |
| Server | HPE Apollo 6500 Gen10 Plus |
| Number of GPUs | 4 |
| Number of servers | 1 |
| Server location | AWS US West (Oregon) |
| Total reserved time | 12 hours |
| Average CPU utilization | 12% |
| Average GPU utilization | 47% |

## Normalization of disclosed data

When the disclosed data is missing or incomplete, we need to use predictive or heuristic data to fill in the gaps, as sketched in the code after the table below.

| Missing data point | Mechanism to replace |
| ------------------ | -------------------- |
| GPU model | Use the most common GPU for the training year (for instance, 2022 is Nvidia A100) |
| Server model | Use the most common server or instance type for the training year |
| Cluster size | Assume 1 server for fine-tuning |
| Location | Use the US as a relatively high-carbon country |
| Datacenter PUE | Use location average |
| Datacenter WUE | Use location average |
| Total fine-tuning time | Predict from number of tokens and model |
| Start time | Use the published model date minus the total reserved time |
| GPU and CPU utilization | Predict from model |
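
A sketch of how these replacement heuristics could be applied to a partial disclosure follows. The dictionary keys and the default PUE/WUE numbers are placeholders, not recommended figures.

```python
# Fill gaps in a partial fine-tuning disclosure using the heuristics in the table above.
DEFAULTS_BY_YEAR = {
    2022: {"gpu_model": "Nvidia A100 80GB", "server_model": "HPE Apollo 6500 Gen10 Plus"},
}

def normalize(disclosure: dict, training_year: int) -> dict:
    d = dict(disclosure)
    year_defaults = DEFAULTS_BY_YEAR.get(training_year, {})
    d.setdefault("gpu_model", year_defaults.get("gpu_model"))
    d.setdefault("server_model", year_defaults.get("server_model"))
    d.setdefault("cluster_size", 1)      # assume a single server for fine-tuning
    d.setdefault("location", "US")       # relatively high-carbon default
    d.setdefault("pue", 1.2)             # placeholder location average
    d.setdefault("wue_l_per_kwh", 0.5)   # placeholder location average
    return d

print(normalize({"base_model": "Llama 2", "tokens": 48_123}, training_year=2022))
```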

### Example normalization: AWS Bedrock fine-tuning

When a managed service is used, we need to make some assumptions about the underlying execution.

| Component | Disclosed data |
| --------------- | -------------- |
| Base model | Llama 2 |
| Managed service | AWS Bedrock |
| Region | US West (Oregon) |
| Start time | July 6, 2024 17:01 |
| Tokens | 48,123 |

TODO: model a standard AWS instance for this use case and document the token-to-time prediction.

## Calculation of carbon emissions and water use

Use the same calculations outlined in [Training](/training#calculation-of-carbon-emissions).

## Amortization of fine-tuning impact across use life

To amortize the fine-tuning impact, we need to estimate the number of inferences that the model will perform during its use life. This applies both to fine-tuning a base model and to fine-tuning a previously fine-tuned model (aka continuous fine-tuning), except that in the latter case the use life should be considered the time until the next fine-tuning is performed (eg one day).

```
EmissionsPerInference(fine-tuning) = Em(fine-tuning) / (inferences per day) / (use life days)
```

### Example
A model is fine-tuned daily using 12.8 kgCO2e and 18.3 L H2O. On average, the model performs 1000 inferences a day.

```
EmPerInf(fine-tuning) = (12.8 kgCO2e) / (1000 inf/d) / (1 d)
                      = 12.8 gCO2e/inf
H2OPerInf(fine-tuning) = (18.3 LH2O) / (1000 inf/d) / (1 d)
                       = 18.3 mlH2O/inf
```
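
The same amortization can be expressed as a small function, reproducing the example above; the function and parameter names are illustrative.

```python
def per_inference(impact_per_finetune: float, inferences_per_day: float,
                  use_life_days: float) -> float:
    """Amortize one fine-tuning run's impact across the inferences it serves."""
    return impact_per_finetune / inferences_per_day / use_life_days

print(per_inference(12.8, 1000, 1) * 1000)  # 12.8 gCO2e per inference (kg → g)
print(per_inference(18.3, 1000, 1) * 1000)  # 18.3 mL of water per inference (L → mL)
```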