diff --git a/README.md b/README.md
index da8f792a0..569361f60 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,4 @@
 # BERT
-
 **\*\*\*\*\* New March 11th, 2020: Smaller BERT Models \*\*\*\*\***
 
 This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).
@@ -79,15 +78,15 @@
 the pre-processing code. In the original pre-processing code, we randomly
 select WordPiece tokens to mask. For example:
 
-`Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head`
-`Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil
+`Input Text: the man jumped up, put his basket on Phil ##am ##mon ' s head`
+`Original Masked Input: [MASK] man [MASK] up, put his [MASK] on Phil
 [MASK] ##mon ' s head`
 
 The new technique is called Whole Word Masking. In this case, we always mask
-*all* of the the tokens corresponding to a word at once. The overall masking
+*all* of the tokens corresponding to a word at once. The overall masking
 rate remains the same.
 
-`Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK]
+`Whole Word Masked Input: the man [MASK] up, put his basket on [MASK] [MASK]
 [MASK] ' s head`
 
 The training is identical -- we still predict each masked WordPiece token
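
For context on the Whole Word Masking change touched by the hunk above, here is a minimal Python sketch of the idea: WordPiece continuation tokens (those starting with `##`) are grouped with the word they belong to, so a word is either masked in full or not at all, while the overall masking rate stays roughly the same. This is an illustrative sketch only, not the repo's actual pre-processing code; the `whole_word_mask` helper, the 15% rate, and the fixed seed are assumptions made for the example.

```python
import random


def whole_word_mask(tokens, mask_rate=0.15, rng=None):
    """Illustrative sketch of whole word masking over WordPiece tokens.

    Tokens starting with "##" are continuations of the previous word,
    so each word is either fully masked or left untouched.
    """
    # Fixed seed here is an assumption for reproducible illustration only.
    rng = rng or random.Random(12345)

    # Group token indices into whole words.
    words = []
    for i, token in enumerate(tokens):
        if token.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # Mask whole words until roughly mask_rate of the tokens are covered.
    budget = max(1, int(round(len(tokens) * mask_rate)))
    output = list(tokens)
    masked = 0
    for word in rng.sample(words, len(words)):
        if masked >= budget:
            break
        for i in word:
            output[i] = "[MASK]"
        masked += len(word)
    return output


tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
print(" ".join(whole_word_mask(tokens)))
```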