Merge in Sockeye Autopilot (#405)
* Merge Sockeye Autopilot.

* Typing cleanup.

* Update version, changelog.

* Update description of Autopilot in changelog.
mjdenkowski authored May 21, 2018
1 parent 3f8cb0b commit fea3d59
Showing 11 changed files with 2,176 additions and 9 deletions.
16 changes: 11 additions & 5 deletions CHANGELOG.md
@@ -10,8 +10,14 @@ Note that Sockeye has checks in place to not translate with an old model that wa

Each version section may have subsections for: _Added_, _Changed_, _Removed_, _Deprecated_, and _Fixed_.

## [1.18.13]
## [1.18.14]
### Added
- Introduced Sockeye Autopilot for single-command end-to-end system building.
See the [Autopilot documentation](https://github.com/awslabs/sockeye/tree/master/contrib/autopilot) and run with: `sockeye-autopilot`.
Autopilot is a `contrib` module with its own tests that are run periodically.
It is not included in the comprehensive tests run for every commit.

## [1.18.13]
### Fixed
- Fixed two bugs with training resumption:
1. removed overly strict assertion in the data iterator for model states before the first checkpoint.
@@ -20,7 +26,7 @@ Each version section may have subsections for: _Added_, _Changed_, _Removed
### Added
- Added support for config files. Command line parameters have precedence over the values read from the config file.
Minimal working example:
`python -m sockeye.train --config config.yaml` with contents of `config.yaml` as follows:
```yaml
source: source.txt
target: target.txt
```

@@ -92,7 +98,7 @@ Each version section may have subsections for: _Added_, _Changed_, _Removed
### Changed
- Removed combined linear projection of keys & values in source attention transformer layers for
performance improvements.
- The topk operator is performed in a single operation during batch decoding instead of running in a loop over each
sentence, bringing speed benefits in batch decoding.

## [1.18.1]
@@ -167,7 +173,7 @@ For each metric the mean and standard deviation will be reported across files.
and `--input-factors` a list of files containing token-parallel factors.
At test time, an exception is raised if the number of expected factors does not
match the factors passed along with the input.

- Removed bias parameters from multi-head attention layers of the transformer.

## [1.16.6]
@@ -225,7 +231,7 @@ features, benefitting the beam search implementation.
- New CLI `sockeye.prepare_data` for preprocessing the training data only once before training,
potentially splitting large datasets into shards. At training time only one shard is loaded into memory at a time,
limiting the maximum memory usage.

### Changed
- Instead of using the ```--source``` and ```--target``` arguments ```sockeye.train``` now accepts a
```--prepared-data``` argument pointing to the folder containing the preprocessed and sharded data. Using the raw
7 changes: 5 additions & 2 deletions README.md
@@ -8,7 +8,7 @@

This package contains the Sockeye project,
a sequence-to-sequence framework for Neural Machine Translation based on Apache MXNet Incubating.
It implements state-of-the-art encoder-decoder architectures, such as
- Deep Recurrent Neural Networks with Attention [[Bahdanau, '14](https://arxiv.org/abs/1409.0473)]
- Transformer Models with self-attention [[Vaswani et al, '17](https://arxiv.org/abs/1706.03762)]
- Fully convolutional sequence-to-sequence models [[Gehring et al, '17](https://arxiv.org/abs/1705.03122)]
@@ -112,7 +112,7 @@ where `${CUDA_VERSION}` can be `75` (7.5), `80` (8.0), `90` (9.0), or `91` (9.1)
In order to write training statistics to a Tensorboard event file for visualization, you can optionally install mxboard
(````pip install mxboard````). To visualize these, run the Tensorboard tool (`pip install tensorboard tensorflow`) with
the logging directory pointed to the training output folder: `tensorboard --logdir <model>`

If you want to create alignment plots you will need to install matplotlib (````pip install matplotlib````).

In general you can install all optional dependencies from the Sockeye source folder using:
@@ -131,6 +131,9 @@ For example *sockeye-train* can also be invoked as

## First Steps

To easily train popular model types on known data sets, see the [Sockeye Autopilot documentation](https://github.com/awslabs/sockeye/tree/master/contrib/autopilot).
To manually train and run translation models on your own data, read on.

### Train

In order to train your first Neural Machine Translation model you will need two sets of parallel files: one for training
132 changes: 132 additions & 0 deletions contrib/autopilot/README.md
@@ -0,0 +1,132 @@
# Sockeye Autopilot

This module provides automated end-to-end system building for popular model types on public data sets.
These capabilities can also be used independently: users can provide their own data for model training, or use Autopilot to download and pre-process public data for other purposes.
All intermediate files are preserved as plain text and commands are recorded, letting users take over at any point for further experimentation.

## Quick Start

If Sockeye is installed via pip or from source, Autopilot can be run directly:

```bash
> sockeye-autopilot
```

This is equivalent to:

```bash
> python -m contrib.autopilot.autopilot
```

With a single command, Autopilot can download and pre-process training data, then train and evaluate a translation model.
For example, to build a transformer model on the WMT14 English-German benchmark, run:

```bash
> sockeye-autopilot --task wmt14_en_de --model transformer
```

By default, systems are built under `$HOME/sockeye_autopilot`.
The `--workspace` argument can specify a different location.
Also by default, a single GPU is used for training and decoding.
The `--gpus` argument can specify a larger number of GPUs for parallel training or `0` for CPU mode only.
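
For example, a build with a custom workspace and four GPUs might look like this (the path and GPU count are illustrative):

```bash
> sockeye-autopilot --task wmt14_en_de --model transformer \
    --workspace /data/my_workspace \
    --gpus 4
```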

Autopilot populates the following sub-directories in a workspace:

- cache: raw downloaded files from public data sets.
- third_party: downloaded third party tools for data pre-processing (currently [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer) and [subword-nmt](https://github.com/rsennrich/subword-nmt))
- logs: log files for various steps.
- systems: contains a single directory for each task, such as "wmt14_en_de". Task directories contain (after a successful build):
- data: raw, tokenized, and byte-pair encoded data for train, dev, and test sets.
- model.bpe: byte-pair encoding model
- model.*: directory for each Sockeye model built, such as "model.transformer"
- results: decoding output and BLEU scores. When starting with raw data, the .sacrebleu file contains a score that can be compared against official WMT results.
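
As an illustration, a finished build might leave a layout like the following (a sketch; exact directory contents vary by task and model):

```bash
> ls ~/sockeye_autopilot/systems/wmt14_en_de
data  model.bpe  model.transformer  results
```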

### Custom Data

Models can be built using custom data with any level of pre-processing.
For example, to use custom German-English raw data, run:

```bash
> sockeye-autopilot --model transformer \
--custom-task my_task \
--custom-text-type raw \
--custom-lang de en \
--custom-train train.de train.en \
--custom-dev dev.de dev.en \
--custom-test test.de test.en
```

Pre-tokenized or byte-pair encoded data can be used with `--custom-text-type tok` and `--custom-text-type bpe`.
The `--custom-task` argument is used for directory naming.
A custom number of BPE operations can be specified with `--custom-bpe-op`.
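
For example, pre-tokenized data with a custom number of BPE operations might be plugged in as follows (file names and the operation count are illustrative):

```bash
> sockeye-autopilot --model transformer \
    --custom-task my_task \
    --custom-text-type tok \
    --custom-lang de en \
    --custom-train train.tok.de train.tok.en \
    --custom-dev dev.tok.de dev.tok.en \
    --custom-test test.tok.de test.tok.en \
    --custom-bpe-op 32000
```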

### Data Preparation Only

To use Autopilot for data preparation only, simply provide `none` as the model type:

```bash
> sockeye-autopilot --task wmt14_en_de --model none
```

## Automation Steps

This section describes the steps Autopilot runs as part of each system build.
Builds can be stopped with CTRL+C and re-started.
Some steps are atomic while others (such as translation model training) can be resumed.
Each completed step records its success, so a re-started build picks up from the last finished step.
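
Resuming requires no extra arguments: re-running the original command picks up the build where it left off.

```bash
> sockeye-autopilot --task wmt14_en_de --model transformer
# ... interrupted with CTRL+C ...
> sockeye-autopilot --task wmt14_en_de --model transformer
# Completed steps are detected and skipped; resumable steps continue from where they stopped.
```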

### Checkout Third Party Tools

If the task requires tokenization, check out the [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer).
If the task requires byte-pair encoding, check out the [subword-nmt](https://github.com/rsennrich/subword-nmt) module.
Store git checkouts of these tools in the third_party directory for re-use with future tasks in the same workspace.
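
The effect is roughly the following (a sketch; Autopilot manages these checkouts itself, and the target directory names are illustrative):

```bash
> git clone https://github.com/moses-smt/mosesdecoder.git \
    ~/sockeye_autopilot/third_party/mosesdecoder
> git clone https://github.com/rsennrich/subword-nmt.git \
    ~/sockeye_autopilot/third_party/subword-nmt
```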

NOTE: These tools have different open source licenses than Sockeye.
See the included license files for more information.

### Download Data

Download to the cache directory all raw files referenced by the current task (if not already present).
See `RAW_FILES` and `TASKS` in `tasks.py` for examples of tasks referencing various publicly available data files.

### Populate Input Files

For known tasks, populate parallel train, dev, and test files under "data/raw" by extracting lines from raw files downloaded in the previous step.
For custom tasks, copy the user-provided data.
Train and dev files are concatenated while test sets are preserved as separate files.

This step includes Unicode whitespace normalization to ensure that only ASCII newlines are considered as line breaks (spurious Unicode newlines are a known issue in some noisy public data).

### Tokenize Data

If data is not pre-tokenized, run the Moses tokenizer and store the results in "data/tok".
For known tasks, use the listed `src_lang` and `trg_lang` (see `TASKS` in `tasks.py`).
For custom tasks, use the provided `--custom-lang` arguments.
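
The underlying call is similar to running the Moses tokenizer script by hand (a sketch; file names are illustrative):

```bash
> third_party/mosesdecoder/scripts/tokenizer/tokenizer.perl -l de \
    < data/raw/train.de > data/tok/train.de
```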

### Byte-Pair Encode Data

If the data is not already byte-pair encoded, learn a BPE model "model.bpe" and apply it to the data, storing the results in "data/bpe".
For known tasks, use the listed number of operations `bpe_op`.
For custom tasks, use the provided `--custom-bpe-op` argument.
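
This is roughly equivalent to the following subword-nmt calls (a sketch, assuming a joint BPE model learned over both sides and 32,000 operations; file names are illustrative):

```bash
> cat data/tok/train.de data/tok/train.en | \
    python third_party/subword-nmt/learn_bpe.py -s 32000 > model.bpe
> python third_party/subword-nmt/apply_bpe.py -c model.bpe \
    < data/tok/train.de > data/bpe/train.de
```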

### Train Translation Model

Run `sockeye.train` and `sockeye.average` to learn a translation model on the byte-pair encoded data.
Use the arguments listed for the provided `--model` argument and specify "model.MODEL" (e.g., "model.transformer") as the model directory.
See `MODELS` in `models.py` for examples of training arguments.

This step can take several days and progress can be checked via the log file or tensorboard.
This step also supports resuming from a partially trained model.
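
A simplified sketch of the underlying calls (the real arguments come from `MODELS` in `models.py`; file names are illustrative):

```bash
> python -m sockeye.train -s data/bpe/train.de -t data/bpe/train.en \
    -vs data/bpe/dev.de -vt data/bpe/dev.en \
    -o model.transformer
# Average the best checkpoints into a single parameter file:
> python -m sockeye.average model.transformer -o model.transformer/params.averaged
```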

### Translate Test Sets

Run `sockeye.translate` to decode each test set using the specified settings.
See `DECODE_ARGS` in `models.py` for decoding settings.
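
A simplified sketch (the real settings come from `DECODE_ARGS` in `models.py`; file names and beam size are illustrative):

```bash
> python -m sockeye.translate -m model.transformer --beam-size 5 \
    < data/bpe/test.0.de > results/test.0.transformer.beam-5.bpe
```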

### Evaluate Translations

Provide the following outputs to the user under "results":

- test.N.MODEL.SETTINGS.bpe.bleu: BLEU score of raw decoder output against byte-pair encoded references
- test.N.MODEL.SETTINGS.tok.bleu: BLEU score of word-level decoder output against tokenized references
- test.N.MODEL.SETTINGS.detok.sacrebleu: BLEU score of detokenized decoder output against raw references using [SacreBLEU](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu). These scores are directly comparable to those reported in WMT evaluations.
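
For example, detokenized output can be re-scored by hand with the standalone SacreBLEU CLI (an assumption here; Autopilot uses the bundled copy linked above), which reads the hypothesis from stdin:

```bash
> pip install sacrebleu
> sacrebleu data/raw/test.0.en < results/test.0.transformer.beam-5.detok
```
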
17 changes: 17 additions & 0 deletions contrib/autopilot/__init__.py
@@ -0,0 +1,17 @@
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You may not
# use this file except in compliance with the License. A copy of the License
# is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is distributed on
# an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

from contrib.autopilot import autopilot
from contrib.autopilot import tasks
from contrib.autopilot import models
from contrib.autopilot import third_party
