add German RoBERTa model (GottBERT) (facebookresearch#2992)
Summary:
# Before submitting

- There is no related issue for this pull request.
- [x] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [x] Did you make sure to update the docs?
- We did not see a need for tests.

## What does this PR do?
Add German RoBERTa model (GottBERT)

Pull Request resolved: facebookresearch#2992

Reviewed By: alexeib

Differential Revision: D25494927

Pulled By: myleott

fbshipit-source-id: b6790124d7c3c8dc387c141706cd8a527cc950ab
scheiblr authored and facebook-github-bot committed Dec 12, 2020
1 parent 032a404 commit f3d5045
Showing 7 changed files with 124 additions and 2 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -60,6 +60,7 @@ We provide reference implementations of various sequence modeling papers:

### What's New:

* December 2020: [GottBERT model and code released](examples/gottbert/README.md)
* November 2020: Adopted the [Hydra](https://github.com/facebookresearch/hydra) configuration framework
* [see documentation explaining how to use it for new and existing projects](docs/hydra_integration.md)
* November 2020: [fairseq 0.10.0 released](https://github.com/pytorch/fairseq/releases/tag/v0.10.0)
64 changes: 64 additions & 0 deletions examples/gottbert/README.md
@@ -0,0 +1,64 @@
# GottBERT: a pure German language model

## Introduction

[GottBERT](http://arxiv.org/abs/2012.02110) is a RoBERTa-based language model pretrained on 145GB of German text.

## Example usage

### fairseq
##### Load GottBERT from torch.hub (PyTorch >= 1.1):
```python
import torch
gottbert = torch.hub.load('pytorch/fairseq', 'gottbert-base')
gottbert.eval() # disable dropout (or leave in train mode to finetune)
```

##### Load GottBERT (for PyTorch 1.0 or custom models):
```bash
# Download the pretrained model
wget https://dl.gottbert.de/fairseq/models/gottbert-base.tar.gz
tar -xzvf gottbert-base.tar.gz
```

```python
# Load the model in fairseq
from fairseq.models.roberta import GottbertModel
gottbert = GottbertModel.from_pretrained('/path/to/gottbert')
gottbert.eval()  # disable dropout (or leave in train mode to finetune)
```

##### Filling masks:
```python
masked_line = 'Gott ist <mask> ! :)'
gottbert.fill_mask(masked_line, topk=3)
# [('Gott ist gut ! :)', 0.3642110526561737, ' gut'),
# ('Gott ist überall ! :)', 0.06009674072265625, ' überall'),
# ('Gott ist großartig ! :)', 0.0370681993663311, ' großartig')]
```
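
Not part of the original README, but a quick sanity check on tokenization: `encode` and `decode` are expected to round-trip the input (a sketch assuming the `gottbert` object loaded above).

```python
# Sketch (not from the original README): encode/decode round-trip using the
# gottbert hub interface loaded above.
tokens = gottbert.encode('Hallo Welt !')
gottbert.decode(tokens)  # expected to return 'Hallo Welt !'
```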

##### Extract features from GottBERT:

```python
# Extract the last layer's features
line = "Der erste Schluck aus dem Becher der Naturwissenschaft macht atheistisch , aber auf dem Grunde des Bechers wartet Gott !"
tokens = gottbert.encode(line)
last_layer_features = gottbert.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 27, 768])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = gottbert.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 13
assert torch.all(all_layers[-1] == last_layer_features)
```
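
A hedged addition, not part of the original README: if a single fixed-size vector per sentence is needed, one common approach is to mean-pool the last layer's token features.

```python
# Sketch: mean-pool token features into one sentence embedding.
# last_layer_features has shape [1, 27, 768] for the example above.
sentence_embedding = last_layer_features.mean(dim=1)
assert sentence_embedding.size() == torch.Size([1, 768])
```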
## Citation
If you use our work, please cite:

```bibtex
@misc{scheible2020gottbert,
title={GottBERT: a pure German Language Model},
author={Raphael Scheible and Fabian Thomczyk and Patric Tippmann and Victor Jaravine and Martin Boeker},
year={2020},
eprint={2012.02110},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
1 change: 1 addition & 0 deletions examples/roberta/README.md
@@ -8,6 +8,7 @@ RoBERTa iterates on BERT's pretraining procedure, including training the model l

### What's New:

- December 2020: German model (GottBERT) is available: [GottBERT](https://github.com/pytorch/fairseq/tree/master/examples/gottbert).
- January 2020: Italian model (UmBERTo) is available from Musixmatch Research: [UmBERTo](https://github.com/musixmatchresearch/umberto).
- November 2019: French model (CamemBERT) is available: [CamemBERT](https://github.com/pytorch/fairseq/tree/master/examples/camembert).
- November 2019: Multilingual encoder (XLM-RoBERTa) is available: [XLM-R](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
8 changes: 6 additions & 2 deletions fairseq/data/encoders/hf_byte_bpe.py
@@ -7,6 +7,7 @@

from fairseq.data.encoders import register_bpe
from fairseq.dataclass import FairseqDataclass
from fairseq import file_utils


@dataclass
@@ -28,9 +29,12 @@ def __init__(self, cfg):
"Please install huggingface/tokenizers with: " "pip install tokenizers"
)

bpe_vocab = file_utils.cached_path(cfg.bpe_vocab)
bpe_merges = file_utils.cached_path(cfg.bpe_merges)

self.bpe = ByteLevelBPETokenizer(
cfg.bpe_vocab,
cfg.bpe_merges,
bpe_vocab,
bpe_merges,
add_prefix_space=cfg.bpe_add_prefix_space,
)

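Context for the change above (editor's note, hedged): `ByteLevelBPETokenizer` expects local files, and wrapping the config values in `file_utils.cached_path` lets `bpe_vocab`/`bpe_merges` also point at remote resources. A minimal sketch, with a hypothetical URL:

```python
# Minimal sketch of what cached_path provides (URL below is hypothetical).
from fairseq import file_utils

# An existing local path is returned unchanged; a URL is downloaded once,
# cached, and resolved to the local cache path.
local_vocab = file_utils.cached_path('https://example.org/vocab.json')
```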
2 changes: 2 additions & 0 deletions fairseq/hub_utils.py
@@ -60,6 +60,8 @@ def from_pretrained(
"code": "bpe_codes",
"bpecodes": "bpe_codes",
"sentencepiece.bpe.model": "sentencepiece_model",
"merges.txt": "bpe_merges",
"vocab.json": "bpe_vocab",
}.items():
path = os.path.join(model_path, file)
if os.path.exists(path):
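A hedged sketch of what the two new entries do (mirroring, not copying, the surrounding `from_pretrained` logic): files found in the extracted model directory are forwarded as keyword arguments, so a GottBERT archive shipping `vocab.json` and `merges.txt` configures the `hf_byte_bpe` tokenizer automatically.

```python
# Sketch mirroring the file-detection loop in hub_utils.from_pretrained.
import os

def detect_bpe_files(model_path):
    kwargs = {}
    for file, arg in {
        'merges.txt': 'bpe_merges',  # mapping added in this commit
        'vocab.json': 'bpe_vocab',   # mapping added in this commit
    }.items():
        path = os.path.join(model_path, file)
        if os.path.exists(path):
            kwargs[arg] = path
    return kwargs
```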
1 change: 1 addition & 0 deletions fairseq/models/roberta/__init__.py
@@ -6,4 +6,5 @@
from .hub_interface import * # noqa
from .model import * # noqa
from .model_camembert import * # noqa
from .model_gottbert import * # noqa
from .model_xlmr import * # noqa
49 changes: 49 additions & 0 deletions fairseq/models/roberta/model_gottbert.py
@@ -0,0 +1,49 @@
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
"""
GottBERT: a pure German Language Model
"""

from fairseq.models import register_model

from .hub_interface import RobertaHubInterface
from .model import RobertaModel


@register_model('gottbert')
class GottbertModel(RobertaModel):

@classmethod
def hub_models(cls):
return {
'gottbert-base': 'https://dl.gottbert.de/fairseq/models/gottbert-base.tar.gz',
}

@classmethod
def from_pretrained(cls,
model_name_or_path,
checkpoint_file='model.pt',
data_name_or_path='.',
bpe='hf_byte_bpe',
bpe_vocab='vocab.json',
bpe_merges='merges.txt',
bpe_add_prefix_space=False,
**kwargs
):
from fairseq import hub_utils

x = hub_utils.from_pretrained(
model_name_or_path,
checkpoint_file,
data_name_or_path,
archive_map=cls.hub_models(),
bpe=bpe,
load_checkpoint_heads=True,
bpe_vocab=bpe_vocab,
bpe_merges=bpe_merges,
bpe_add_prefix_space=bpe_add_prefix_space,
**kwargs,
)
return RobertaHubInterface(x['args'], x['task'], x['models'][0])
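
For reference, a short usage sketch (consistent with the README above): the shorthand `'gottbert-base'` resolves through `hub_models()` to the tarball URL, and the `hf_byte_bpe` defaults pick up `vocab.json`/`merges.txt` from the extracted archive.

```python
# Usage sketch; assumes network access for the initial download.
from fairseq.models.roberta import GottbertModel

gottbert = GottbertModel.from_pretrained('gottbert-base')
gottbert.eval()  # disable dropout for inference
```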
