First draft of methodology for modeling generative AI systems (#221)
* wip on AI model

* split ai to new section

* wip

* wip

* first draft of training

* split out cluster and nvidia

* fine tuning updated

* inference service

* fix tests

* add model for token to energy

* update inference model to include LoRA; update overview to include data from BDavy

* break out datacenter more clearly in cluster definition

* update water use estimates for a100

* update cluster link & fix typos

* break out memory usage for more granular calculation

* Split out foundation components, fix various typos

* Update overview.mdx

LR updating phase description. Testing for update process going forward

* Update overview.mdx

Remove extra word

---------

Co-authored-by: lratliff3 <[email protected]>
bokelley and lratliff3 authored Oct 3, 2024
1 parent 14bd9e2 commit 6379647
Showing 16 changed files with 905 additions and 4 deletions.
85 changes: 85 additions & 0 deletions defaults/docs-defaults.yaml
@@ -1,5 +1,55 @@
# AUTO_GENERATED from docs/snippets
defaults:
country_l_h20_per_kwh:
AR: 11.450000000
AU: 4.730000000
AT: 5.720000000
BE: 2.620000000
BR: 18.600000000
BG: 4.620000000
CA: 8.170000000
CL: 10.520000000
CN: 6.010000000
CY: 1.290000000
CZ: 3.210000000
DK: 3.190000000
EE: 2.200000000
FI: 4.530000000
FR: 3.670000000
DE: 1.950000000
GB: 2.330000000
GR: 5.140000000
HU: 3.690000000
IS: 6.140000000
IN: 3.440000000
ID: 2.260000000
IE: 1.480000000
IT: 4.840000000
JP: 2.310000000
LV: 2.550000000
LT: 5.900000000
LU: 12.780000000
MY: 1.680000000
MT: 1.120000000
MX: 5.300000000
NL: 3.430000000
NZ: 14.970000000
"NO": 6.660000000
PL: 2.510000000
PT: 9.590000000
RO: 7.360000000
RU: 3.460000000
SK: 5.790000000
SI: 2.980000000
KR: 1.890000000
ES: 6.210000000
SE: 6.030000000
CH: 5.660000000
TW: 1.490000000
TH: 2.890000000
TR: 4.920000000
UA: 2.130000000
US: 3.130000000
default_audio_bitrate_kbps:
pc: 160
phone: 160
@@ -107,7 +157,13 @@ defaults:
default_emissions_per_bid_request_gco2_per_imp: 0.114420000
default_emissions_per_creative_request_gco2_per_imp: 0.000300000
default_emissions_per_rtdp_request_gco2_per_imp: 0.010000000
default_gpu_embodied_emissions_gco2e_per_second:
a100: 0.003360000
default_gpu_watts_pre_pue:
a100: 428
default_image_compression_ratio: 10
default_inferences_per_request:
gpt3_chat: 925
default_network_embodied_emissions_gco2e_per_kb:
scope3:
fixed: 0.000004430
@@ -277,6 +333,8 @@ defaults:
social: 0.150000000
streaming-video: 0.049000000
web: 0.049000000
default_s_per_inference:
gpt3: 0.002240000
default_time_in_view_seconds: 6
default_usage_kwh_per_gb:
scope3:
@@ -310,3 +368,30 @@ defaults:
JAPAC: 0.000300000
LATAM: 0.000100000
NAMER: 0.000100000
us_grid_subregion_l_h20_per_kwh:
akgd: 3.420000000
akms: 15.370000000
aznm: 4.970000000
camx: 5.190000000
erct: 1.270000000
frcc: 1.490000000
hims: 2.400000000
hioa: 1.450000000
mroe: 3.060000000
mrow: 3.100000000
newe: 4.110000000
nwpp: 9.480000000
nycw: 1.850000000
nyli: 1.630000000
nyup: 8.080000000
rfce: 2.310000000
rfcm: 2.510000000
rfcw: 2.220000000
rmpa: 2.580000000
spno: 1.730000000
spso: 2.030000000
srmv: 2.240000000
srmw: 2.670000000
srso: 2.290000000
srtv: 3.640000000
srvc: 2.370000000
5 changes: 5 additions & 0 deletions docs/calculations.mdx
@@ -8,6 +8,7 @@ import TimeInViewDefaults from "/snippets/defaults_time_in_view.mdx";
import AdPlatformDefaults from "/snippets/defaults_ad_platform.mdx";
import NetworkTrafficDefaults from "/snippets/defaults_network_traffic.mdx";
import ChannelMappingDefaults from "/snippets/defaults_channel_mapping.mdx";
import WaterUseDefaults from "/snippets/defaults_wue.mdx";

# Detailed walkthrough of calculations

@@ -222,6 +223,10 @@ Observations from various channels

<ChannelMappingDefaults />

### Water consumption by country and grid region

<WaterUseDefaults />

## Lookups from external sources

### Carbon intensity by country, region, and UTC Date/Time
130 changes: 130 additions & 0 deletions docs/cluster.mdx
@@ -0,0 +1,130 @@
---
title: "Server cluster"
description: "Methodology for calculating the water, energy, and embodied emissions of a server cluster based on time"
---

## Overview

A server cluster is a group of servers in a single datacenter or cloud platform. Treating the cluster as a single logical entity makes it possible to calculate its aggregate emissions and to model how software uses it. Note that a cluster is assumed to have uniform utilization; modeling the relationships between the different components of a computing system requires a higher-level abstraction.

As an example of what constitutes an AI cluster, Meta has [documented their genAI infrastructure](https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/), serving as a decent illustration of what a scaled, purpose-built training cluster looks like.

## Inputs: Defining a cluster

A cluster is defined by:
- Number of servers/instances in the cluster (if static)
- Cloud instance type or server details (see below)
- Cloud region or datacenter details (see below)

### Server details
A server is defined by:
- CPU manufacturer and model
- GPU manufacturer and model (see [gpu specs](https://github.com/mlco2/impact/blob/master/data/gpus.csv))
- Memory in GB
- Number of CPUs
- Number of GPUs

### Datacenter details
A datacenter is defined by:
- PUE
- WUE
- Grid region
- On-site or dedicated renewable energy by hour
- Overhead equipment (racks, networking gear, etc.) embodied emissions per server-hour (usage energy is included in PUE); see this [tour of a Meta datacenter](https://metainfrahardware.com/#/web/1)

### Example

| Component | Disclosed data |
| --------- | -------------- |
| GPU | Nvidia A100 80GB |
| Server | HPE Apollo 6500 Gen10 Plus |
| Number of GPUs | 384 |
| Number of servers | 48 |

## Outputs: Calculating cluster impact

The cluster methodology produces the following outputs:
- Embodied emissions per hour reserved
- Manufacturing water consumption per hour reserved
- Usage energy coefficients for the equation below:
- idle cluster power
- net CPU TDP (CPU max power - CPU idle power)
- net GPU TDP (GPU max power - GPU idle power)
- number of CPUs
- number of GPUs
- Peak throughput-α (as described by [OpenCarbonEval](https://arxiv.org/pdf/2405.12843))
- Peak TFLOP/s

### Embodied emissions

From [LLMCarbon: Modeling the End-to-end Carbon Footprint of Large Language Models](https://arxiv.org/pdf/2309.14393), the embodied carbon from a chip can be estimated based on its area: "The Carbon emitted Per unit Area (CPA) is contingent on various semiconductor fabrication parameters, including yield, energy consumption per unit area during manufacturing, emissions from chemicals utilized in hardware production, and emissions associated with raw material sourcing for fabrication."

These are the representative values shared by the paper. By aggregating all of the components of the technical infrastructure used to train or operate a model, the total embodied emissions can be calculated.

| hardware | description | unit | CPA |
| -------- | ----------- | ---- | --- |
| CPU | TSMC 16nm | 147 mm² | 1 kgCO2/cm² |
| DRAM | Micron 18nm | 256 GB | 0.4 kgCO2/GB |
| SSD | Samsung 20nm | 32 TB | 0.018 kgCO2/GB |
| TPUv3 | TSMC 16nm | 700 mm² | 1 kgCO2/cm² |
| TPUv4 | TSMC 7nm | 400 mm² | 1.6 kgCO2/cm² |
| V100 | TSMC 12nm | 815 mm² | 1.2 kgCO2/cm² |
| H100 | TSMC 4nm | 814 mm² | 1.8 kgCO2/cm² |
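
As a worked illustration of the area-based estimate (our own arithmetic, not from the paper), here is a minimal Python sketch that multiplies die area by CPA for the accelerators in the table:

```python
# Illustrative sketch: estimate embodied carbon of a die as area x CPA,
# using the representative values from the table above.
CPA_KGCO2_PER_CM2 = {"TPUv3": 1.0, "TPUv4": 1.6, "V100": 1.2, "H100": 1.8}
DIE_AREA_MM2 = {"TPUv3": 700, "TPUv4": 400, "V100": 815, "H100": 814}

def die_embodied_kgco2(chip: str) -> float:
    """Embodied carbon of the die alone (excludes DRAM, packaging, board)."""
    area_cm2 = DIE_AREA_MM2[chip] / 100  # 100 mm^2 per cm^2
    return area_cm2 * CPA_KGCO2_PER_CM2[chip]

print(round(die_embodied_kgco2("H100"), 1))  # ~14.7 kgCO2 for the H100 die
```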

### Energy use

The energy calculation uses derived data from the cluster definition:
- The TDP of the GPU (provided by the manufacturer)
- The TDP of the CPU (provided by the manufacturer)
- The TDP of the memory (provided by the manufacturer)
- The idle power draw of the server (see [Cloud Carbon Footprint](https://www.cloudcarbonfootprint.org) for common cloud instances). This power draw should include the NIC, SSD, and other components in the server. Boavizta has [some tools](https://doc.api.boavizta.org/Explanations/devices/server/) to help model this.

The power draw of the cluster E, in watts, as a function of the GPU utilization G and the CPU utilization C (both in percent), is:
```
E(G,C) = (idle cluster power) + (memory TDP)
         + (C/100) x (net CPU TDP) x (number of CPUs)
         + (G/100) x (net GPU TDP) x (number of GPUs)
```
To convert to energy, multiply by the duration in hours.

Note that the result must be multiplied by the datacenter PUE (for total energy) or WUE (for water consumption).


#### Energy per GPU-hour
The energy use for one GPU-hour, assuming 100% GPU utilization and no incremental CPU load, is:
```
E(gpu-hour) = E(100,0) / 1000 / (number of GPUs)
            = ((idle cluster power + memory TDP) / (number of GPUs) + (net GPU TDP)) / 1000
```
Dividing by 1000 converts watts to kilowatts, so the result is in kWh per GPU-hour.
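
A minimal sketch of the two equations above, with placeholder wattages (the idle and TDP figures below are assumptions for illustration, not vendor disclosures):

```python
def cluster_power_watts(idle_w, memory_tdp_w,
                        net_cpu_tdp_w, n_cpus, cpu_util_pct,
                        net_gpu_tdp_w, n_gpus, gpu_util_pct):
    """Cluster power draw E(G, C) in watts, before applying PUE."""
    return (idle_w + memory_tdp_w
            + cpu_util_pct / 100 * net_cpu_tdp_w * n_cpus
            + gpu_util_pct / 100 * net_gpu_tdp_w * n_gpus)

def kwh_per_gpu_hour(idle_w, memory_tdp_w, net_gpu_tdp_w, n_gpus):
    """Energy per GPU-hour at 100% GPU and 0% incremental CPU, in kWh."""
    return ((idle_w + memory_tdp_w) / n_gpus + net_gpu_tdp_w) / 1000

# Assumed example: 48 servers idling at 500 W each, 384 GPUs with a
# 300 W net TDP, memory TDP folded into idle power for simplicity.
print(kwh_per_gpu_hour(idle_w=48 * 500, memory_tdp_w=0,
                       net_gpu_tdp_w=300, n_gpus=384))  # ~0.3625 kWh
```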

### Amortized embodied emissions

Using:
- The embodied emissions of the server (see [Towards Green AI](https://arxiv.org/pdf/2407.10237) for an example PCF)
- The embodied emissions of the GPU
- The projected use life of the server (up to 6 years for cloud platforms, though we suggest 4 years for AI instances given the pace of change)
- The projected utilization of the servers, noting that utilization means "time reserved", not "time active"

```
EmbEm(h) = ((number of GPUs) x (GPU embodied emissions) +
(number of servers) x (server embodied emissions))
/ (use life in hours)
/ (utilization)
```
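
A sketch of this amortization for the example cluster above (384 GPUs, 48 servers); the per-unit embodied figures and the 80% reserved-time utilization are placeholders:

```python
def embodied_gco2e_per_hour(n_gpus, gpu_kgco2e, n_servers, server_kgco2e,
                            use_life_years=4, utilization=0.8):
    """Embodied emissions per hour reserved, in gCO2e."""
    use_life_hours = use_life_years * 365 * 24
    total_kgco2e = n_gpus * gpu_kgco2e + n_servers * server_kgco2e
    return total_kgco2e * 1000 / use_life_hours / utilization

# Placeholder footprints: 150 kgCO2e per GPU, 2500 kgCO2e per server.
print(round(embodied_gco2e_per_hour(384, 150, 48, 2500)))  # ~6336 gCO2e/hour
```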

### Embodied water use

The embodied water use of the CPU, GPU, and memory chips can be derived from manufacturer sustainability reporting or industry averages, generally based on die size. See [NVIDIA A100](/nvidia_a100#water_use) as an example.

Using:
- The manufacturing water use of the CPU
- The manufacturing water use of the GPU
- The manufacturing water use of the memory chips

The embodied water use is:
```
EmbH2O(h) = ((number of GPUs) x (water use per GPU) +
             (number of CPUs) x (water use per CPU) +
             (number of memory chips) x (water use per memory chip))
            / (use life in hours)
            / (utilization)

where:

(manufacturing water use per chip) = (water use per wafer mask layer per wafer) x (wafer mask layers) / (chips per wafer)
```
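
A companion sketch for the water calculation, including the per-chip derivation from wafer-level data; every value a caller would pass is an assumption to be sourced from manufacturer reporting:

```python
def water_l_per_chip(l_per_mask_layer_per_wafer, mask_layers, chips_per_wafer):
    """Manufacturing water per chip, derived from per-wafer data."""
    return l_per_mask_layer_per_wafer * mask_layers / chips_per_wafer

def embodied_water_l_per_hour(n_gpus, l_per_gpu, n_cpus, l_per_cpu,
                              n_mem, l_per_mem, use_life_hours, utilization):
    """Embodied water per hour reserved, in liters."""
    total_l = n_gpus * l_per_gpu + n_cpus * l_per_cpu + n_mem * l_per_mem
    return total_l / use_life_hours / utilization
```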
98 changes: 98 additions & 0 deletions docs/fine_tuning.mdx
@@ -0,0 +1,98 @@
---
title: "Fine-tuning"
description: "Methodology for calculating the normalized, amortized emissions from fine-tuning AI models"
---

## Overview

From [Energy and Carbon Considerations of Fine-Tuning BERT](https://arxiv.org/pdf/2311.10267):
We find that pre-training BERT is equivalent to anywhere from 400 (MNLI) to 45,000 (RTE) fine-tuning runs depending on the dataset size, and that number of training tokens is a reasonable heuristic for estimating fine-tuning energy use. The “true” number of training tokens seen, accounting for dynamic padding of sequences to the maximum length in a batch, is a better predictor than relying on the mean or median number of tokens per example. Further comparison of fine-tuning inference energy intensity across tasks confirms that example sequence length holds a much stronger influence on energy intensity in the fine-tuning phase than in the inference phase, in alignment with expectations from previous work.

We find that, controlling for hardware, energy consumption scales most predictably with wall clock time and number of tokens encountered during training (including the pad tokens added to sequences to match the maximum sequence length in a batch).

## Disclosure of fine-tuning costs

To assess the environmental impact of fine-tuning a model, developers should disclose the technical infrastructure used for fine-tuning and the duration of this training process.

Infrastructure data:
- [Fine-tuning cluster](/cluster) details
- Managed service used (eg AWS Bedrock)
- Physical location of the datacenter where the fine-tuning occurred

Operational data:
- Base model
- Total fine-tuning time
- GPU and CPU utilization during fine-tuning
- Total fine-tuning tokens (including padding), if the total time is not available (for instance, when using a managed service)
- Start time

Usage data:
- Expected use life in days
- Expected inferences per day

### Example disclosure

| Component | Disclosed data |
| --------- | -------------- |
| Base model | Llama 2 |
| GPU | Nvidia A100 80GB |
| Server | HPE Apollo 6500 Gen10 Plus |
| Number of GPUs | 4 |
| Number of servers | 1 |
| Server location | AWS US West (Oregon) |
| Total reserved time | 12 hours |
| Average CPU utilization | 12% |
| Average GPU utilization | 47% |

## Normalization of disclosed data

When disclosed data is not present or not complete, we need to use predictive or heuristic data to fill in the gaps.

| Missing data point | Mechanism to replace |
| - | - |
| GPU model | Use the most common GPU for the training year (for instance, 2022 is Nvidia A100) |
| Server model | Use the most common server or instance type for the training year |
| Cluster size | Assume 1 server for fine-tuning |
| Location | Use the US as a relatively high-carbon country |
| Datacenter PUE | Use location average |
| Datacenter WUE | Use location average |
| Total fine-tuning time | Predict from number of tokens and model |
| Start time | Use the published model date minus the total reserved time |
| GPU and CPU utilization | Predict from model |

### Example normalization: AWS Bedrock fine-tuning

When a managed service is used, we need to make some assumptions about the underlying execution.

| Component | Disclosed data |
| --------------- | -------------- |
| Base model | Llama 2 |
| Managed service | AWS Bedrock |
| Region | US West (Oregon) |
| Start time | July 6, 2024 17:01 |
| Tokens | 48,123 |

TODO: model a standard AWS instance for this use case and document the token-to-time prediction.

## Calculation of carbon emissions and water use

Use the same calculations outlined in [Training](/training#calculation-of-carbon-emissions).

## Amortization of fine-tuning impact across use life

To amortize the fine-tuning impact, we need to estimate the number of inferences that the model will perform during its use life. This applies both to fine-tuning a base model and to fine-tuning a previously fine-tuned model (aka continuous fine-tuning), except that in the latter case the use life should be considered the time until the next fine-tuning is performed (eg one day).

```
EmissionsPerInference(fine-tuning) = Em(fine-tuning) / (inferences per day) / (use life days)
```

### Example
A model is fine-tuned daily, consuming 12.8 kgCO2e and 18.3 L of water. On average, the model performs 1000 inferences per day.

```
EmPerInf(fine-tuning)  = (12.8 kgCO2e) / (1000 inf/d) / (1 d)
                       = 12.8 gCO2e/inf
H2OPerInf(fine-tuning) = (18.3 L H2O) / (1000 inf/d) / (1 d)
                       = 18.3 mL H2O/inf
```
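
The same arithmetic as a sketch, with the unit conversions made explicit (values taken from the example above):

```python
def per_inference(total_impact, inferences_per_day, use_life_days):
    """Amortize a recurring fine-tuning impact over expected inferences."""
    return total_impact / (inferences_per_day * use_life_days)

print(per_inference(12.8e3, 1000, 1))  # 12.8 kgCO2e = 12800 g -> 12.8 gCO2e/inf
print(per_inference(18.3e3, 1000, 1))  # 18.3 L H2O = 18300 mL -> 18.3 mL/inf
```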
Binary file added docs/images/table1_carbon_emissions.png