Update PyTorch Llama3 70B recipe to calculate metrics from profile #29


Status: Open · wants to merge 1 commit into base `main`
8 changes: 4 additions & 4 deletions training/trillium/Llama3-70B-PyTorch/GCE/README.md
@@ -28,17 +28,17 @@ gcloud alpha compute tpus tpu-vm create $TPU_NAME \

The following setup runs the training job with Llama 3 70B on GCE TPUs using
the docker image from this registry
-(`us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-xla/llama3-70b:jan15built`).
-The docker image uses torch and torch_xla nightly build from 09/28/2024
+(`us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-xla/llama3-70b:feb14build`).
+The docker image uses torch and torch_xla nightly build from 02/11/2025
Comment on lines +31 to +32

Could we not create a label for the currently used test, and then rotate that between different versions? This could avoid possible human error and remove the requirement to change the version.

and comes with all the package dependency needed to run the model training.
All the command below should run from your own machine (not the TPU host you
-created).
+created). The Dockerfile used to build the image is at https://github.com/AI-Hypercomputer/tpu-recipes/blob/main/training/trillium/Llama3-70B-PyTorch/GCE/tpu.Dockerfile

1. git clone and navigate to this README repo and run training script:

```bash
git clone --depth 1 https://github.com/AI-Hypercomputer/tpu-recipes.git
-cd training/trillium/GCE/Llama3-70B-PyTorch
+cd training/trillium/Llama3-70B-PyTorch/GCE
```

2. Edit `env.sh` to add the hugging face token and/or setup the training parameters.
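As a hypothetical illustration of step 2 (the variable name `HF_TOKEN` and the file contents are assumptions for illustration, not taken from the actual recipe), `env.sh` might export the Hugging Face token like this:

```shell
# Hypothetical env.sh sketch: export the Hugging Face token so the training
# job can download the gated Llama 3 weights. Variable name is an assumption.
cat > /tmp/env.sh <<'EOF'
export HF_TOKEN="hf_your_token_here"   # placeholder, not a real token
EOF

# Source it and confirm the variable is visible to subsequent commands.
. /tmp/env.sh
echo "$HF_TOKEN"
```

Sourcing (rather than executing) the file is what makes the exported variables visible to the training commands that follow in the same shell.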
3 changes: 1 addition & 2 deletions training/trillium/Llama3-70B-PyTorch/GCE/host.sh
@@ -1,7 +1,6 @@
#!/bin/bash

-DOCKER_IMAGE=us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-xla/llama3-70b:jan15built
-
+DOCKER_IMAGE=us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-xla/llama3-70b:feb14build
worker_id=$(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/attributes/agent-worker-number" -H 'Metadata-Flavor: Google')

cat >> /dev/null <<EOF
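The `worker_id` line above queries the GCE metadata server, which only resolves from inside a GCE VM. A hedged sketch of the same lookup with a fallback for machines outside GCE (the `-f` flag, `--max-time` bound, and the default value `0` are my additions, not part of the recipe):

```shell
# Query the per-worker metadata attribute. -f makes HTTP errors fail the
# command, and --max-time bounds the wait when metadata.google.internal
# does not resolve. Outside GCE this falls back to worker 0.
worker_id=$(curl -sf --max-time 2 \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/agent-worker-number" \
  -H 'Metadata-Flavor: Google' || echo 0)
echo "worker_id=${worker_id}"
```

The `Metadata-Flavor: Google` header is required by the metadata server; requests without it are rejected as a protection against accidental or cross-origin reads.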
5 changes: 2 additions & 3 deletions training/trillium/Llama3-70B-PyTorch/GCE/tpu.Dockerfile
@@ -1,6 +1,5 @@
# Base package containing nightly PyTorch/XLA
-ARG BASE_IMAGE=us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm
-FROM ${BASE_IMAGE}
+FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_cxx11_20250211
Collaborator

Did you try running with the 20250211 base image on the full pod? Context: pytorch/xla#8683

Collaborator Author

Good catch! Let me try running on full pod as well.


# Install transformers library
ARG TRANSFORMERS_REPO=https://github.com/pytorch-tpu/transformers.git
@@ -10,7 +9,7 @@ RUN git clone "${TRANSFORMERS_REPO}" transformers && cd transformers && git chec

# Install transformers dependencies
WORKDIR /workspace/transformers
-RUN pip3 install git+file://$PWD accelerate datasets evaluate "huggingface_hub[cli]" \
+RUN pip3 install git+file://$PWD accelerate datasets protobuf evaluate "huggingface_hub[cli]" \
"torch_xla[pallas]" \
-f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html \
-f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
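The hunk header above shows the clone-and-checkout pattern this Dockerfile uses to pin the `transformers` fork. A minimal Dockerfile sketch of that reproducibility pattern (the commit value is a placeholder; the recipe's actual pinned SHA is truncated in the hunk header and is not reproduced here):

```dockerfile
# Pin the transformers fork at an explicit commit so image rebuilds are
# reproducible. TRANSFORMERS_COMMIT is a placeholder, not the real pin.
ARG TRANSFORMERS_REPO=https://github.com/pytorch-tpu/transformers.git
ARG TRANSFORMERS_COMMIT=put-pinned-sha-here
RUN git clone "${TRANSFORMERS_REPO}" transformers \
    && cd transformers \
    && git checkout "${TRANSFORMERS_COMMIT}"
```

Pinning a commit (like pinning the dated `20250211` base image above) trades freshness for the ability to rebuild the exact same image later.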