Commit: Merge remote-tracking branch 'origin/dev-1.x' into 1.x
Showing 45 changed files with 1,975 additions and 294 deletions.

configs/beit/README.md
@@ -0,0 +1,38 @@
# BEiT

> [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254)
<!-- [ALGORITHM] -->

## Abstract

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).

<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/203688351-adac7146-4e71-4ab6-8958-5cfe643a2dc5.png" width="70%"/>
</div>
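
As a rough illustration of the masked image modeling setup described in the abstract (a toy sketch only, not the BEiT implementation: the patch split, the roughly 40% masking ratio, and the tokenizer stand-in are all assumptions):

```python
import torch

# Split a 224x224 RGB image into 16x16 patches: (3, 196, 16, 16).
img = torch.rand(3, 224, 224)
patches = img.unfold(1, 16, 16).unfold(2, 16, 16).reshape(3, -1, 16, 16)
num_patches = patches.shape[1]

# Mask a random subset of patches (ratio assumed here, not taken from the paper text above).
mask = torch.zeros(num_patches, dtype=torch.bool)
mask[torch.randperm(num_patches)[:num_patches * 2 // 5]] = True

# In BEiT, the pre-training target for each masked patch is the discrete visual-token id
# produced by a separate image tokenizer; `tokenizer` below is a hypothetical placeholder.
# targets = tokenizer(img)[mask]
```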

## Results and models

### ImageNet-1k

| Model       | Pretrain     | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config                                  | Download                                                                                                        |
| :---------: | :----------: | :-------: | :------: | :-------: | :-------: | :-------------------------------------: | :-------------------------------------------------------------------------------------------------------------: |
| BEiT-base\* | ImageNet-21k | 86.53     | 17.58    | 85.28     | 97.59     | [config](./beit-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth) |

*Models with \* are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit). The config files of these models are only for inference.*

For the BEiT self-supervised pre-training algorithm, please refer to the [MMSelfSup page](https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/configs/selfsup/beit) for more information.
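
Since these configs are for inference only, a minimal usage sketch might look like the following (assuming the `init_model`/`inference_model` helpers in `mmcls.apis` of the 1.x pre-releases, a locally downloaded checkpoint, and a sample image path; adjust to your installation):

```python
from mmcls.apis import inference_model, init_model

# Placeholder paths: the config ships in this folder, the checkpoint is the
# converted weight linked in the table above, downloaded locally.
config = 'configs/beit/beit-base-p16_8xb64_in1k.py'
checkpoint = 'beit-base_3rdparty_in1k_20221114-c0a4df23.pth'
image = 'demo/demo.JPEG'

model = init_model(config, checkpoint, device='cpu')
result = inference_model(model, image)
print(result)
```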

## Citation

```bibtex
@article{beit,
  title={{BEiT}: {BERT} Pre-Training of Image Transformers},
  author={Hangbo Bao and Li Dong and Furu Wei},
  year={2021},
  eprint={2106.08254},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

configs/beit/beit-base-p16_8xb64_in1k.py
@@ -0,0 +1,44 @@
_base_ = [
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]

data_preprocessor = dict(
    num_classes=1000,
    # RGB format normalization parameters
    mean=[127.5, 127.5, 127.5],
    std=[127.5, 127.5, 127.5],
    # convert image from BGR to RGB
    to_rgb=True,
)

model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='BEiT',
        arch='base',
        img_size=224,
        patch_size=16,
        avg_token=True,
        output_cls_token=False,
        use_abs_pos_emb=False,
        use_rel_pos_bias=True,
        use_shared_rel_pos_bias=False,
    ),
    neck=None,
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=768,
        loss=dict(
            type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
    ),
    init_cfg=[
        dict(type='TruncNormal', layer='Linear', std=.02),
        dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
    ],
    train_cfg=dict(augments=[
        dict(type='Mixup', alpha=0.8),
        dict(type='CutMix', alpha=1.0)
    ]))
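
To sanity-check this config, one can load and build it offline. This is a sketch assuming `mmengine`'s `Config` loader and the `register_all_modules` helper and `MODELS` registry present in the mmcls 1.x pre-releases; adjust if the API differs in your version:

```python
from mmengine.config import Config
from mmcls.registry import MODELS
from mmcls.utils import register_all_modules

register_all_modules()  # register mmcls models under the default scope

cfg = Config.fromfile('configs/beit/beit-base-p16_8xb64_in1k.py')
model = MODELS.build(cfg.model)  # builds the ImageClassifier with the BEiT backbone

num_params = sum(p.numel() for p in model.parameters())
print(f'{type(model).__name__} with {num_params / 1e6:.2f}M parameters')
```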

configs/beit/metafile.yml
@@ -0,0 +1,41 @@
Collections:
  - Name: BEiT
    Metadata:
      Architecture:
        - Attention Dropout
        - Convolution
        - Dense Connections
        - Dropout
        - GELU
        - Layer Normalization
        - Multi-Head Attention
        - Scaled Dot-Product Attention
        - Tanh Activation
    Paper:
      URL: https://arxiv.org/abs/2106.08254
      Title: 'BEiT: BERT Pre-Training of Image Transformers'
    README: configs/beit/README.md
    Code:
      URL: https://github.com/open-mmlab/mmclassification/blob/dev-1.x/mmcls/models/backbones/beit.py
      Version: v1.0.0rc4

Models:
  - Name: beit-base_3rdparty_in1k
    In Collection: BEiT
    Metadata:
      FLOPs: 17581219584
      Parameters: 86530984
      Training Data:
        - ImageNet-21k
        - ImageNet-1k
    Results:
      - Dataset: ImageNet-1k
        Task: Image Classification
        Metrics:
          Top 1 Accuracy: 85.28
          Top 5 Accuracy: 97.59
    Weights: https://download.openmmlab.com/mmclassification/v0/beit/beit-base_3rdparty_in1k_20221114-c0a4df23.pth
    Converted From:
      Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beit/beit_base_patch16_224_pt22k_ft22kto1k.pth
      Code: https://github.com/microsoft/unilm/tree/master/beit
    Config: configs/beit/beit-base-p16_8xb64_in1k.py
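
The raw FLOPs and Parameters counts stored in the metafile correspond to the rounded Flops(G) and Params(M) columns of the README table; a quick check:

```python
flops = 17581219584   # FLOPs as stored in the metafile
params = 86530984     # parameter count as stored in the metafile

print(f'Flops(G): {flops / 1e9:.2f}')    # 17.58
print(f'Params(M): {params / 1e6:.2f}')  # 86.53
```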

configs/beitv2/README.md
@@ -0,0 +1,38 @@
# BEiT V2

> [BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers](https://arxiv.org/abs/2208.06366)
<!-- [ALGORITHM] -->

## Abstract

Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy which associates discrete image patches to enhance global semantic representation. Experiments on image classification and semantic segmentation show that BEiT v2 outperforms all compared MIM methods. On ImageNet-1K (224 size), the base-size BEiT v2 achieves 85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K (224 size) fine-tuning, and 56.7% mIoU on ADE20K for semantic segmentation.

<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/203912182-5967a520-d455-49ea-bc67-dcbd500d76bf.png" width="70%"/>
</div>

## Results and models

### ImageNet-1k

| Model         | Pretrain                   | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config                                    | Download                                                                                                          |
| :-----------: | :------------------------: | :-------: | :------: | :-------: | :-------: | :---------------------------------------: | :---------------------------------------------------------------------------------------------------------------: |
| BEiTv2-base\* | ImageNet-1k & ImageNet-21k | 86.53     | 17.58    | 86.47     | 97.99     | [config](./beitv2-base-p16_8xb64_in1k.py) | [model](https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth) |

*Models with \* are converted from the [official repo](https://github.com/microsoft/unilm/tree/master/beit2). The config files of these models are only for inference.*

For the BEiTv2 self-supervised pre-training algorithm, please refer to the [MMSelfSup page](https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/configs/selfsup/beitv2) for more information.
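
To confirm a downloaded checkpoint before running inference, one option is to load it directly with PyTorch. This is a sketch; the `state_dict` wrapping is an assumption about how the converted file is packaged, hence the fallback:

```python
import torch
from torch.hub import load_state_dict_from_url

url = ('https://download.openmmlab.com/mmclassification/v0/beit/'
       'beitv2-base_3rdparty_in1k_20221114-73e11905.pth')

ckpt = load_state_dict_from_url(url, map_location='cpu')
state = ckpt.get('state_dict', ckpt)  # converted checkpoints are often wrapped in a dict
print(f'{len(state)} tensors, '
      f'{sum(v.numel() for v in state.values()) / 1e6:.2f}M parameters')
```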

## Citation

```bibtex
@article{beitv2,
  title={{BEiT v2}: Masked Image Modeling with Vector-Quantized Visual Tokenizers},
  author={Zhiliang Peng and Li Dong and Hangbo Bao and Qixiang Ye and Furu Wei},
  year={2022},
  eprint={2208.06366},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

configs/beitv2/beitv2-base-p16_8xb64_in1k.py
@@ -0,0 +1,35 @@
_base_ = [
    '../_base_/datasets/imagenet_bs64_swin_224.py',
    '../_base_/schedules/imagenet_bs1024_adamw_swin.py',
    '../_base_/default_runtime.py'
]

model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='BEiT',
        arch='base',
        img_size=224,
        patch_size=16,
        avg_token=True,
        output_cls_token=False,
        use_abs_pos_emb=False,
        use_rel_pos_bias=True,
        use_shared_rel_pos_bias=False,
    ),
    neck=None,
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=768,
        loss=dict(
            type='LabelSmoothLoss', label_smooth_val=0.1, mode='original'),
    ),
    init_cfg=[
        dict(type='TruncNormal', layer='Linear', std=.02),
        dict(type='Constant', layer='LayerNorm', val=1., bias=0.),
    ],
    train_cfg=dict(augments=[
        dict(type='Mixup', alpha=0.8),
        dict(type='CutMix', alpha=1.0)
    ]))

configs/beitv2/metafile.yml
@@ -0,0 +1,41 @@
Collections:
  - Name: BEiTv2
    Metadata:
      Architecture:
        - Attention Dropout
        - Convolution
        - Dense Connections
        - Dropout
        - GELU
        - Layer Normalization
        - Multi-Head Attention
        - Scaled Dot-Product Attention
        - Tanh Activation
    Paper:
      URL: https://arxiv.org/abs/2208.06366
      Title: 'BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers'
    README: configs/beitv2/README.md
    Code:
      URL: https://github.com/open-mmlab/mmclassification/blob/dev-1.x/mmcls/models/backbones/beit.py
      Version: v1.0.0rc4

Models:
  - Name: beitv2-base_3rdparty_in1k
    In Collection: BEiTv2
    Metadata:
      FLOPs: 17581219584
      Parameters: 86530984
      Training Data:
        - ImageNet-21k
        - ImageNet-1k
    Results:
      - Dataset: ImageNet-1k
        Task: Image Classification
        Metrics:
          Top 1 Accuracy: 86.47
          Top 5 Accuracy: 97.99
    Weights: https://download.openmmlab.com/mmclassification/v0/beit/beitv2-base_3rdparty_in1k_20221114-73e11905.pth
    Converted From:
      Weights: https://conversationhub.blob.core.windows.net/beit-share-public/beitv2/beitv2_base_patch16_224_pt1k_ft21kto1k.pth
      Code: https://github.com/microsoft/unilm/tree/master/beit2
    Config: configs/beitv2/beitv2-base-p16_8xb64_in1k.py