- [2019 ArXiv] M-BERT: Injecting Multimodal Information in the BERT Structure, [paper], [bibtex].
- [2019 NeurIPS] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, [paper], [bibtex], sources: [jiasenlu/vilbert_beta].
- [2019 ArXiv] VisualBERT: A Simple and Performant Baseline for Vision and Language, [paper], [bibtex], sources: [uclanlp/visualbert].
- [2019 EMNLP] LXMERT: Learning Cross-Modality Encoder Representations from Transformers, [paper], [bibtex], sources: [airsplay/lxmert].
- [2019 CVPR] Multi-Task Learning of Hierarchical Vision-Language Representation, [paper], [bibtex].
- [2020 AAAI] Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, [paper], [bibtex].
- [2020 AAAI] Unified Vision-Language Pre-Training for Image Captioning and VQA, [paper], [bibtex], sources: [LuoweiZhou/VLP].
- [2020 ECCV] UNITER: UNiversal Image-TExt Representation Learning, [paper], [bibtex], sources: [ChenRocks/UNITER] (see the single-stream input sketch after this list).
- [2020 ACMMM] DeVLBert: Learning Deconfounded Visio-Linguistic Representations, [paper], [bibtex], sources: [shengyuzhang/DeVLBert].
- [2020 ICLR] VL-BERT: Pre-training of Generic Visual-Linguistic Representations, [paper], [bibtex], sources: [jackroos/VL-BERT].
- [2020 ICLR] Variational Hetero-Encoder Randomized GANs for Joint Image-Text Modeling, [paper], [bibtex].
- [2020 ECCV] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, [paper], [bibtex], sources: [microsoft/Oscar].
- [2020 ECCV] Learning Visual Representations with Caption Annotations, [paper], [bibtex], [homepage].
- [2020 ArXiv] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, [paper], [bibtex].
- [2020 ArXiv] ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data, [paper], [bibtex].
- [2020 ArXiv] Contrastive Learning of Medical Visual Representations from Paired Images and Text, [paper], [bibtex], sources: [edreisMD/ConVIRT-pytorch] (see the contrastive-loss sketch after this list).
- [2021 ArXiv] SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels, [paper], [bibtex].
- [2021 AAAI] VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning, [paper], [bibtex].
- [2021 CVPR] Causal Attention for Vision-Language Tasks, [paper], [bibtex], [supplementary], sources: [yangxuntu/lxmertcatt].
- [2021 CVPR] VirTex: Learning Visual Representations from Textual Annotations, [paper], [bibtex], [homepage], sources: [kdexd/virtex].
- [2021 TKDD] DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention, [paper], [bibtex].
- [2021 ICML] CLIP: Learning Transferable Visual Models From Natural Language Supervision, [paper], [bibtex], [slides], sources: [openai/CLIP] (see the zero-shot usage sketch after this list).
- [2021 NeurIPS] Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, [paper], [bibtex], sources: [salesforce/ALBEF] (see the momentum-update sketch after this list).
- [2020 CVPR] 12-in-1: Multi-Task Vision and Language Representation Learning, [paper], [bibtex], [supplementary].
- [2021 ICCV] UniT: Multimodal Multitask Learning with a Unified Transformer, [paper], [bibtex], [homepage], sources: [facebookresearch/mmf].
- [2021 CVPR] M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training, [paper], [bibtex], sources: [microsoft/M3P].
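
Several of the single-stream models above (VL-BERT, UNITER, Oscar) share one core mechanism: detector region features are projected into the text embedding space, tagged with a modality embedding, and the concatenated sequence is processed by a single shared transformer encoder. Below is a minimal PyTorch sketch of that input construction; all names, dimensions, and hyper-parameters are illustrative placeholders, not values from any of the linked repositories (position and box-coordinate embeddings are omitted for brevity).

```python
import torch
import torch.nn as nn

class SingleStreamVLInput(nn.Module):
    """Builds one joint sequence from word-piece ids and region features.

    Sizes are illustrative (2048-d Faster R-CNN regions, BERT-base hidden
    size); position/box embeddings are omitted for brevity.
    """

    def __init__(self, vocab_size=30522, hidden=768, region_dim=2048):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)  # regions -> text space
        self.type_emb = nn.Embedding(2, hidden)           # 0 = text, 1 = image
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids, region_feats):
        # token_ids: (B, T) int64; region_feats: (B, R, region_dim) float
        txt = self.word_emb(token_ids)
        img = self.region_proj(region_feats)
        seq = torch.cat([txt, img], dim=1)                # (B, T+R, hidden)
        types = torch.cat(
            [torch.zeros_like(token_ids),
             torch.ones(region_feats.shape[:2], dtype=torch.long,
                        device=token_ids.device)], dim=1)
        return self.encoder(seq + self.type_emb(types))   # one shared encoder

out = SingleStreamVLInput()(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
print(out.shape)  # torch.Size([2, 52, 768])
```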
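
Both ConVIRT and CLIP pre-train by pulling matched image-text pairs together and pushing apart all other pairings in the batch, via a symmetric InfoNCE objective. A minimal contrastive-loss sketch over a batch of already-encoded embeddings; the function name and temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img = F.normalize(img_emb, dim=-1)            # unit vectors: dot = cosine
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (N, N); matches on the diagonal
    targets = torch.arange(img.size(0), device=img.device)
    # average the image-to-text and text-to-image cross-entropy terms
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = paired_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```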
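
The openai/CLIP repository linked above exposes a small inference API; the zero-shot usage sketch below follows its documented pattern for classifying an image against text prompts (the image path and prompt strings are placeholders):

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "example.jpg" is a placeholder; substitute any local image
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)   # image-text similarity logits
    probs = logits_per_image.softmax(dim=-1)   # zero-shot class probabilities
print(probs)
```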
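
ALBEF's momentum distillation keeps an exponential-moving-average copy of the encoders and uses its predictions as soft pseudo-targets for noisy web pairs. A minimal momentum-update sketch of the EMA step itself, with an illustrative momentum value and a toy module standing in for the real encoders:

```python
import copy
import torch

@torch.no_grad()
def ema_update(student, teacher, m=0.995):
    """Move the teacher's weights toward the student's by a small step."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(m).add_(p_s.data, alpha=1 - m)

# toy stand-in for the real image/text encoders
student = torch.nn.Linear(16, 16)
teacher = copy.deepcopy(student)   # teacher starts as an exact copy
ema_update(student, teacher)       # called once per training step
```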