Artifacts for pretraining and finetuning

The following artifacts are provided to make pretraining and finetuning BERT models easier:

  • Preprocessed data
  • Pretrained BERT-base and BERT-large model checkpoints

Preprocessed Data

The Wikipedia corpus used for BERT pretraining is preprocessed following the data prep instructions and uploaded to https://bertonazuremlwestus2.blob.core.windows.net/public2/bert_data.tar.gz (66 GB). The data files have a sequence length of 512. The directory structure is as follows; this hierarchy is assumed by the implementation in train.py.

bert_data
│   bert-base.json
│   bert-large.json
│   bert-base-single-node.json
│   bert-large-single-node.json
│
└───512
│   │
│   └───wiki_pretrain
│       │   wikipedia_segmented_part_0.bin
│       │   wikipedia_segmented_part_1.bin
│       │   ...
│       │   wikipedia_segmented_part_98.bin
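For reference, here is a minimal sketch (Python standard library only) of downloading the archive and checking that the extracted tree matches the layout above. For a 66 GB file, a resumable client such as azcopy is likely a better choice than urlretrieve; the shard count of 99 is inferred from the part_0 … part_98 names shown above.

```python
import tarfile
import urllib.request
from pathlib import Path

DATA_URL = "https://bertonazuremlwestus2.blob.core.windows.net/public2/bert_data.tar.gz"

def download_and_extract(dest_dir: str = ".") -> Path:
    """Download the ~66 GB archive and extract it under dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "bert_data.tar.gz"
    if not archive.exists():
        # Plain, non-resumable download; azcopy is preferable for a file this large.
        urllib.request.urlretrieve(DATA_URL, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return dest / "bert_data"

def check_layout(root: Path) -> None:
    """Check the directory hierarchy assumed by train.py (per the tree above)."""
    for cfg in ("bert-base.json", "bert-large.json",
                "bert-base-single-node.json", "bert-large-single-node.json"):
        assert (root / cfg).is_file(), f"missing config file: {cfg}"
    shards = sorted((root / "512" / "wiki_pretrain").glob("wikipedia_segmented_part_*.bin"))
    assert len(shards) == 99, f"expected 99 shard files, found {len(shards)}"

if __name__ == "__main__":
    check_layout(download_and_extract("."))
```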

Individual data files from the wiki_pretrain directory are available at the following URLs:

Use the following command to transfer the data to your own blob storage:

azcopy copy "https://bertonazuremlwestus2.blob.core.windows.net/public2" "https://<destination-storage-account-name>.blob.core.windows.net/<container-name>?<SAS token>" --recursive

See the documentation on Azure Blob Shared Access Signatures and azcopy for more details.
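If you only need individual shards rather than a full container mirror, a sketch along the following lines with the azure-storage-blob Python package may also work. It assumes the public2 container permits anonymous listing (if only blob-level reads are allowed, you will need the exact blob URLs), and the shard name prefix below is a guess based on the file names above, not a confirmed blob path.

```python
from pathlib import Path
from azure.storage.blob import ContainerClient  # pip install azure-storage-blob

# Public container that hosts the artifacts; no credential is passed,
# so this relies on anonymous read access (an assumption, not confirmed).
CONTAINER_URL = "https://bertonazuremlwestus2.blob.core.windows.net/public2"

def download_blobs(prefix: str, out_dir: str = "bert_data_shards") -> None:
    """Download every blob whose name starts with `prefix` into out_dir."""
    container = ContainerClient.from_container_url(CONTAINER_URL)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for blob in container.list_blobs(name_starts_with=prefix):
        target = out / Path(blob.name).name
        with open(target, "wb") as f:
            container.download_blob(blob.name).readinto(f)
        print(f"downloaded {blob.name} -> {target}")

if __name__ == "__main__":
    # Hypothetical prefix; inspect container.list_blobs() to find the real layout.
    download_blobs("wikipedia_segmented_part_")
```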

Pretrained BERT Model Checkpoints

The models pretrained on AzureML, based on the original BERT implementation, are available at the following locations: