Merge pull request #411 from rsepassi/push
v1.2.8
Showing 45 changed files with 2,373 additions and 909 deletions.

# Running on Cloud TPUs

Tensor2Tensor supports running on Google Cloud Platform's TPUs, chips
specialized for ML training.

Not all models are supported yet, but so far we've tested Transformer (a
sequence model) as well as Xception (an image model).

To run on TPUs, you need to be part of the alpha program; if you're not, these
commands won't work for you currently, but access will expand soon, so get
excited for your future ML supercomputers in the cloud.

## Tutorial: Transformer En-De translation on TPU

Set your default zone to a TPU-enabled zone. TPU machines are only available in
certain zones for now.
```
gcloud config set compute/zone us-central1-f
```

Launch a GCE instance; this will run the Python trainer.
```
gcloud compute instances create $USER-vm \
  --machine-type=n1-standard-8 \
  --image-family=tf-nightly \
  --image-project=ml-images \
  --scopes=https://www.googleapis.com/auth/cloud-platform
```

Launch the TPU instance; the Python program will connect to this to train on
the TPU device.
```
TPU_IP=10.240.0.2
gcloud alpha compute tpus create \
  $USER-tpu \
  --range=${TPU_IP/%2/0}/29 \
  --version=nightly
```
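
For reference, `${TPU_IP/%2/0}` is ordinary bash suffix substitution: it
replaces a trailing `2` with `0`, so `--range` becomes the /29 block containing
the TPU's address. A quick check in any bash shell:
```
# ${TPU_IP/%2/0} rewrites "10.240.0.2" to "10.240.0.0" (suffix match only).
TPU_IP=10.240.0.2
echo ${TPU_IP/%2/0}/29   # prints 10.240.0.0/29
```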

To see all running TPU instances: `gcloud alpha compute tpus list`. Each
`TPU_IP` should be unique within the list and follow the format `10.240.i.2`.
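
For example, if `10.240.0.2` is already taken, a second TPU could use the next
`i` (the name `$USER-tpu-2` here is just illustrative):
```
# Pick the next unused 10.240.i.2 address for the new TPU.
TPU_IP=10.240.1.2
gcloud alpha compute tpus create \
  $USER-tpu-2 \
  --range=${TPU_IP/%2/0}/29 \
  --version=nightly
```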

Generate data and write it to GCS. If you already have the data locally, use
`gsutil cp` to copy it to GCS instead, as sketched below.
```
DATA_DIR=gs://my-bucket/t2t/data/
t2t-datagen --problem=translate_ende_wmt8k --data_dir=$DATA_DIR
```
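
A minimal sketch of the copy route, assuming your data lives in a hypothetical
local directory such as `/tmp/t2t_data`:
```
# Recursively copy locally generated files into the GCS bucket.
gsutil cp -r /tmp/t2t_data/* $DATA_DIR
```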

SSH in with port forwarding for TensorBoard:
```
gcloud compute ssh $USER-vm -L 6006:localhost:6006
```

Now that you're on the cloud instance, install T2T:
```
pip install tensor2tensor
```

Set up some variables used below. `TPU_IP` and `DATA_DIR` should be the same as
what was used above. Note that `DATA_DIR` and `OUT_DIR` must be GCS buckets.
```
TPU_IP=<IP of TPU machine>
DATA_DIR=gs://my-bucket/t2t/data/
OUT_DIR=gs://my-bucket/t2t/training/
TPU_MASTER=grpc://$TPU_IP:8470
```

Launch TensorBoard in the background so you can monitor training:
```
tensorboard --logdir=$OUT_DIR > /tmp/tensorboard_logs.txt 2>&1 &
```
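
To confirm it started (an optional check, not part of the original
walkthrough), peek at the log file; TensorBoard prints the address it is
serving on:
```
# Should end with TensorBoard's startup message pointing at localhost:6006.
tail /tmp/tensorboard_logs.txt
```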

Train and evaluate:
```
t2t-tpu-trainer \
  --master=$TPU_MASTER \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --problems=translate_ende_wmt8k \
  --model=transformer \
  --hparams_set=transformer_tiny_tpu \
  --train_steps=10 \
  --eval_steps=10 \
  --local_eval_frequency=10 \
  --iterations_per_loop=10
```

The above command will train for 10 steps, then evaluate for 10 steps. You can
(and should) increase the number of total training steps with the
`--train_steps` flag. Evaluation will happen every `--local_eval_frequency`
steps, each time for `--eval_steps` steps. When you increase the number of
training steps, also increase `--iterations_per_loop`, which controls how
frequently the TPU machine returns control to the Python code (1000 seems like
a fine number).
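
Following that advice, a longer run might look like this (the step counts here
are illustrative, not taken from the original doc):
```
# Same flags as above, with more training steps and a larger TPU loop size.
t2t-tpu-trainer \
  --master=$TPU_MASTER \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --problems=translate_ende_wmt8k \
  --model=transformer \
  --hparams_set=transformer_tiny_tpu \
  --train_steps=100000 \
  --eval_steps=10 \
  --local_eval_frequency=1000 \
  --iterations_per_loop=1000
```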

Back on your local machine, open your browser and navigate to `localhost:6006`
for TensorBoard.

Voila. Enjoy your new supercomputer.

The release itself is a one-line version bump in `setup.py`:
```
@@ -5,7 +5,7 @@
 setup(
     name='tensor2tensor',
-    version='1.2.7',
+    version='1.2.8',
     description='Tensor2Tensor',
     author='Google Inc.',
     author_email='[email protected]',
```
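
Once the release is published to PyPI, users can pick it up with a standard pip
upgrade (shown for completeness; not part of the diff):
```
pip install --upgrade tensor2tensor
```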