The following artifacts are provided to make pretraining and finetuning of BERT models easier:
- Preprocessed data
- Pretrained BERT-base and BERT-large model checkpoints
The Wikipedia corpus used for BERT pretraining was preprocessed following the data prep instructions and uploaded (66 GB). The data files use a sequence length of 512. The implementation assumes the directory hierarchy shown below:
│ bert-base.json
│ bert-large.json
│ bert-base-single-node.json
│ bert-large-single-node.json
│ │
│ └───wiki_pretrain
│ │ wikipedia_segmented_part_0.bin
│ │ wikipedia_segmented_part_1.bin
│ │ ...
│ │ wikipedia_segmented_part_98.bin
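
After downloading, it can be useful to confirm that every data file is present. The following is a minimal sketch, assuming the elided entries in the listing above are the contiguous parts 0 through 98 (99 files in total). The helper names `expected_shard_names` and `missing_shards` are hypothetical and not part of this repository.

```python
from pathlib import Path

# Assumption: the listing above covers parts 0..98 with no gaps.
NUM_SHARDS = 99


def expected_shard_names(num_shards=NUM_SHARDS):
    """Return the file names expected inside the wiki_pretrain directory."""
    return [f"wikipedia_segmented_part_{i}.bin" for i in range(num_shards)]


def missing_shards(wiki_pretrain_dir):
    """Return the expected file names that are absent from wiki_pretrain_dir."""
    root = Path(wiki_pretrain_dir)
    return [name for name in expected_shard_names() if not (root / name).is_file()]
```

Calling `missing_shards` with the local path of your `wiki_pretrain` directory returns an empty list when the download is complete.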
Individual data files from the wiki_pretrain directory are available at the following URLs:
- wikipedia_segmented_part_0.bin
- wikipedia_segmented_part_1.bin
- wikipedia_segmented_part_2.bin
- ...
- wikipedia_segmented_part_98.bin
Use the script below to transfer the data to your private blob:
azcopy copy "" "https://<destination-storage-account-name>.blob.core.windows.net/<container-name>?<SAS token>" --recursive
See more about Azure Blob Shared Access Signature and azcopy.
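
If you prefer to assemble the transfer command programmatically, the sketch below builds the same `azcopy copy ... --recursive` invocation as an argument list. It assumes the destination uses the standard Azure Blob endpoint form `https://<account>.blob.core.windows.net/<container>?<SAS token>`. The function name `build_azcopy_command` and its parameters are hypothetical, and the source URL is left for you to supply.

```python
import shlex


def build_azcopy_command(source_url, account, container, sas_token):
    """Build the argument list for a recursive azcopy copy.

    Assumption: the destination is a standard Azure Blob Storage endpoint.
    """
    # Accept the SAS token with or without its leading "?".
    token = sas_token.lstrip("?")
    destination = f"https://{account}.blob.core.windows.net/{container}?{token}"
    return ["azcopy", "copy", source_url, destination, "--recursive"]


def render_command(args):
    """Render the argument list as a shell-safe string for logging or pasting."""
    return shlex.join(args)
```

The list returned by `build_azcopy_command` can be passed directly to `subprocess.run(args, check=True)`, which avoids shell-quoting problems with the `&` characters that SAS tokens usually contain.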
The models pretrained in AzureML based on the original BERT implementation are available at the following locations: