Merge in Sockeye Autopilot (#405)
* Merge Sockeye Autopilot.

* Typing cleanup.

* Update version, changelog.

* Update description of Autopilot in changelog.
mjdenkowski authored May 21, 2018
1 parent 3f8cb0b commit fea3d59
Showing 11 changed files with 2,176 additions and 9 deletions.
16 changes: 11 additions & 5 deletions CHANGELOG.md
@@ -10,8 +10,14 @@ Note that Sockeye has checks in place to not translate with an old model that wa

Each version section may have subsections for: _Added_, _Changed_, _Removed_, _Deprecated_, and _Fixed_.

## [1.18.13]
## [1.18.14]
### Added
- Introduced Sockeye Autopilot for single-command end-to-end system building.
See the [Autopilot documentation](https://github.com/awslabs/sockeye/tree/master/contrib/autopilot) and run with: `sockeye-autopilot`.
Autopilot is a `contrib` module with its own tests that are run periodically.
It is not included in the comprehensive tests run for every commit.

## [1.18.13]
### Fixed
- Fixed two bugs with training resumption:
1. removed overly strict assertion in the data iterator for model states before the first checkpoint.
@@ -20,7 +26,7 @@ Each version section may have subsections for: _Added_, _Changed_, _Removed
### Added
- Added support for config files. Command line parameters have precedence over the values read from the config file.
Minimal working example:
`python -m sockeye.train --config config.yaml` with contents of `config.yaml` as follows:
```yaml
source: source.txt
target: target.txt
```

@@ -92,7 +98,7 @@ Each version section may have subsections for: _Added_, _Changed_, _Removed
### Changed
- Removed combined linear projection of keys & values in source attention transformer layers for
performance improvements.
- The topk operator is performed in a single operation during batch decoding instead of running in a loop over each
sentence, bringing speed benefits in batch decoding.

## [1.18.1]
@@ -167,7 +173,7 @@ For each metric the mean and standard deviation will be reported across files.
and `--input-factors` a list of files containing token-parallel factors.
At test time, an exception is raised if the number of expected factors does not
match the factors passed along with the input.

- Removed bias parameters from multi-head attention layers of the transformer.

## [1.16.6]
@@ -225,7 +231,7 @@ features, benefitting the beam search implementation.
- New CLI `sockeye.prepare_data` for preprocessing the training data only once before training,
potentially splitting large datasets into shards. At training time only one shard is loaded into memory at a time,
limiting the maximum memory usage.

### Changed
- Instead of using the ```--source``` and ```--target``` arguments ```sockeye.train``` now accepts a
```--prepared-data``` argument pointing to the folder containing the preprocessed and sharded data. Using the raw
7 changes: 5 additions & 2 deletions README.md
@@ -8,7 +8,7 @@

This package contains the Sockeye project,
a sequence-to-sequence framework for Neural Machine Translation based on Apache MXNet Incubating.
It implements state-of-the-art encoder-decoder architectures, such as
- Deep Recurrent Neural Networks with Attention [[Bahdanau, '14](https://arxiv.org/abs/1409.0473)]
- Transformer Models with self-attention [[Vaswani et al, '17](https://arxiv.org/abs/1706.03762)]
- Fully convolutional sequence-to-sequence models [[Gehring et al, '17](https://arxiv.org/abs/1705.03122)]
@@ -112,7 +112,7 @@ where `${CUDA_VERSION}` can be `75` (7.5), `80` (8.0), `90` (9.0), or `91` (9.1)
In order to write training statistics to a Tensorboard event file for visualization, you can optionally install mxboard
(````pip install mxboard````). To visualize these, run the Tensorboard tool (`pip install tensorboard tensorflow`) with
the logging directory pointed to the training output folder: `tensorboard --logdir <model>`

If you want to create alignment plots you will need to install matplotlib (````pip install matplotlib````).

In general you can install all optional dependencies from the Sockeye source folder using:
@@ -131,6 +131,9 @@ For example *sockeye-train* can also be invoked as

## First Steps

To easily train popular model types on known data sets, see the [Sockeye Autopilot documentation](https://github.com/awslabs/sockeye/tree/master/contrib/autopilot).
To manually train and run translation models on your own data, read on.

### Train

In order to train your first Neural Machine Translation model you will need two sets of parallel files: one for training
132 changes: 132 additions & 0 deletions contrib/autopilot/README.md
@@ -0,0 +1,132 @@
# Sockeye Autopilot

This module provides automated end-to-end system building for popular model types on public data sets.
These capabilities can also be used independently: users can provide their own data for model training, or use Autopilot to download and pre-process public data for other purposes.
All intermediate files are preserved as plain text and commands are recorded, letting users take over at any point for further experimentation.

## Quick Start

If Sockeye is installed via pip or from source, Autopilot can be run directly:

```bash
> sockeye-autopilot
```

This is equivalent to:

```bash
> python -m contrib.autopilot.autopilot
```

With a single command, Autopilot can download and pre-process training data, then train and evaluate a translation model.
For example, to build a transformer model on the WMT14 English-German benchmark, run:

```bash
> sockeye-autopilot --task wmt14_en_de --model transformer
```

By default, systems are built under `$HOME/sockeye_autopilot`.
The `--workspace` argument can specify a different location.
Also by default, a single GPU is used for training and decoding.
The `--gpus` argument can specify a larger number of GPUs for parallel training or `0` for CPU mode only.
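
For example, a build with a custom workspace and four GPUs might look like this (the path and GPU count are illustrative):

```bash
> sockeye-autopilot --task wmt14_en_de --model transformer \
    --workspace /data/my_workspace \
    --gpus 4
```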

Autopilot populates the following sub-directories in a workspace:

- cache: raw downloaded files from public data sets.
- third_party: downloaded third party tools for data pre-processing (currently [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer) and [subword-nmt](https://github.com/rsennrich/subword-nmt))
- logs: log files for various steps.
- systems: contains a single directory for each task, such as "wmt14_en_de". Task directories contain (after a successful build):
- data: raw, tokenized, and byte-pair encoded data for train, dev, and test sets.
- model.bpe: byte-pair encoding model
- model.*: directory for each Sockeye model built, such as "model.transformer"
- results: decoding output and BLEU scores. When starting with raw data, the .sacrebleu file contains a score that can be compared against official WMT results.
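
As an illustration, a finished build might leave a layout like the following (a sketch; exact directory contents vary by task and model):

```bash
> ls ~/sockeye_autopilot/systems/wmt14_en_de
data  model.bpe  model.transformer  results
```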

### Custom Data

Models can be built using custom data with any level of pre-processing.
For example, to use custom German-English raw data, run:

```bash
> sockeye-autopilot --model transformer \
--custom-task my_task \
--custom-text-type raw \
--custom-lang de en \
--custom-train train.de train.en \
--custom-dev dev.de dev.en \
--custom-test test.de test.en
```

Pre-tokenized or byte-pair encoded data can be used with `--custom-text-type tok` and `--custom-text-type bpe`.
The `--custom-task` argument is used for directory naming.
A custom number of BPE operations can be specified with `--custom-bpe-op`.
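
For example, pre-tokenized data with a custom number of BPE operations might be plugged in as follows (file names and the operation count are illustrative):

```bash
> sockeye-autopilot --model transformer \
    --custom-task my_task \
    --custom-text-type tok \
    --custom-lang de en \
    --custom-train train.tok.de train.tok.en \
    --custom-dev dev.tok.de dev.tok.en \
    --custom-test test.tok.de test.tok.en \
    --custom-bpe-op 32000
```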

### Data Preparation Only

To use Autopilot for data preparation only, simply provide `none` as the model type:

```bash
> sockeye-autopilot --task wmt14_en_de --model none
```

## Automation Steps

This section describes the steps Autopilot runs as part of each system build.
Builds can be stopped with CTRL+C and re-started.
Some steps are atomic while others (such as translation model training) can be resumed.
Each completed step records its success, so a re-started build picks up from the last finished step.
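
Resuming requires no extra arguments: re-running the original command picks up the build where it left off.

```bash
> sockeye-autopilot --task wmt14_en_de --model transformer
# ... interrupted with CTRL+C ...
> sockeye-autopilot --task wmt14_en_de --model transformer
# Completed steps are detected and skipped; resumable steps continue from where they stopped.
```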

### Checkout Third Party Tools

If the task requires tokenization, check out the [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer).
If the task requires byte-pair encoding, check out the [subword-nmt](https://github.com/rsennrich/subword-nmt) module.
Store git checkouts of these tools in the third_party directory for re-use with future tasks in the same workspace.
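
The effect is roughly the following (a sketch; Autopilot manages these checkouts itself, and the target directory names are illustrative):

```bash
> git clone https://github.com/moses-smt/mosesdecoder.git \
    ~/sockeye_autopilot/third_party/mosesdecoder
> git clone https://github.com/rsennrich/subword-nmt.git \
    ~/sockeye_autopilot/third_party/subword-nmt
```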

NOTE: These tools have different open source licenses than Sockeye.
See the included license files for more information.

### Download Data

Download to the cache directory all raw files referenced by the current task (if not already present).
See `RAW_FILES` and `TASKS` in `tasks.py` for examples of tasks referencing various publicly available data files.

### Populate Input Files

For known tasks, populate parallel train, dev, and test files under "data/raw" by extracting lines from raw files downloaded in the previous step.
For custom tasks, copy the user-provided data.
Train and dev files are concatenated while test sets are preserved as separate files.

This step includes Unicode whitespace normalization to ensure that only ASCII newlines are considered as line breaks (spurious Unicode newlines are a known issue in some noisy public data).

### Tokenize Data

If data is not pre-tokenized, run the Moses tokenizer and store the results in "data/tok".
For known tasks, use the listed `src_lang` and `trg_lang` (see `TASKS` in `tasks.py`).
For custom tasks, use the provided `--custom-lang` arguments.
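
The underlying call is similar to running the Moses tokenizer script by hand (a sketch; file names are illustrative):

```bash
> third_party/mosesdecoder/scripts/tokenizer/tokenizer.perl -l de \
    < data/raw/train.de > data/tok/train.de
```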

### Byte-Pair Encode Data

If the data is not already byte-pair encoded, learn a BPE model "model.bpe" and apply it to the data, storing the results in "data/bpe".
For known tasks, use the listed number of operations `bpe_op`.
For custom tasks, use the provided `--custom-bpe-op` argument.
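
This is roughly equivalent to the following subword-nmt calls (a sketch, assuming a joint BPE model learned over both sides and 32,000 operations; file names are illustrative):

```bash
> cat data/tok/train.de data/tok/train.en | \
    python third_party/subword-nmt/learn_bpe.py -s 32000 > model.bpe
> python third_party/subword-nmt/apply_bpe.py -c model.bpe \
    < data/tok/train.de > data/bpe/train.de
```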

### Train Translation Model

Run `sockeye.train` and `sockeye.average` to learn a translation model on the byte-pair encoded data.
Use the arguments listed for the provided `--model` argument and specify "model.MODEL" (e.g., "model.transformer") as the model directory.
See `MODELS` in `models.py` for examples of training arguments.

This step can take several days and progress can be checked via the log file or tensorboard.
This step also supports resuming from a partially trained model.
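
A simplified sketch of the underlying calls (the real arguments come from `MODELS` in `models.py`; file names are illustrative):

```bash
> python -m sockeye.train -s data/bpe/train.de -t data/bpe/train.en \
    -vs data/bpe/dev.de -vt data/bpe/dev.en \
    -o model.transformer
# Average the best checkpoints into a single parameter file:
> python -m sockeye.average model.transformer -o model.transformer/params.averaged
```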

### Translate Test Sets

Run `sockeye.translate` to decode each test set using the specified settings.
See `DECODE_ARGS` in `models.py` for decoding settings.
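
A simplified sketch (the real settings come from `DECODE_ARGS` in `models.py`; file names and beam size are illustrative):

```bash
> python -m sockeye.translate -m model.transformer --beam-size 5 \
    < data/bpe/test.0.de > results/test.0.transformer.beam-5.bpe
```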

### Evaluate Translations

Provide the following outputs to the user under "results":

- test.N.MODEL.SETTINGS.bpe.bleu: BLEU score of raw decoder output against byte-pair encoded references
- test.N.MODEL.SETTINGS.tok.bleu: BLEU score of word-level decoder output against tokenized references
- test.N.MODEL.SETTINGS.detok.sacrebleu: BLEU score of detokenized decoder output against raw references using [SacreBLEU](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu). These scores are directly comparable to those reported in WMT evaluations.
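
For example, detokenized output can be re-scored by hand with the standalone SacreBLEU CLI (an assumption here; Autopilot uses the bundled copy linked above), which reads the hypothesis from stdin:

```bash
> pip install sacrebleu
> sacrebleu data/raw/test.0.en < results/test.0.transformer.beam-5.detok
```
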
17 changes: 17 additions & 0 deletions contrib/autopilot/__init__.py
@@ -0,0 +1,17 @@
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You may not
# use this file except in compliance with the License. A copy of the License
# is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is distributed on
# an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

from contrib.autopilot import autopilot
from contrib.autopilot import tasks
from contrib.autopilot import models
from contrib.autopilot import third_party
