LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
This is the official PyTorch implementation of LLMDet.
🎉🎉🎉 Our paper is accepted by CVPR 2025, congratulations and many thanks to the co-authors!
If you find our work helpful, please kindly give us a star🌟
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that an open-vocabulary detector co-training with a large language model by generating image-level detailed captions for each image can further improve performance. To achieve the goal, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin, enjoying superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits.
Model | APmini | APr | APc | APf | APval | APr | APc | APf |
---|---|---|---|---|---|---|---|---|
LLMDet Swin-T only p5 | 44.5 | 38.6 | 39.3 | 50.3 | 34.6 | 25.5 | 29.9 | 43.8 |
LLMDet Swin-T | 44.7 | 37.3 | 39.5 | 50.7 | 34.9 | 26.0 | 30.1 | 44.3 |
LLMDet Swin-B | 48.3 | 40.8 | 43.1 | 54.3 | 38.5 | 28.2 | 34.3 | 47.8 |
LLMDet Swin-L | 51.1 | 45.1 | 46.1 | 56.6 | 42.0 | 31.6 | 38.8 | 50.2 |
LLMDet Swin-L (chunk size 80) | 52.4 | 44.3 | 48.8 | 57.1 | 43.2 | 32.8 | 40.5 | 50.8 |
NOTE:
- APmini: evaluated on LVIS `minival`.
- APval: evaluated on LVIS `val 1.0`.
- AP is fixed AP.
- All checkpoints and logs can be found on Hugging Face and ModelScope.
- Other benchmarks are tested using `LLMDet Swin-T only p5`.
Note: other environments may also work.
- pytorch==2.2.1+cu121
- transformers==4.37.2
- numpy==1.22.2 (numpy must be lower than 1.24; numpy==1.23 or 1.22 is recommended)
- mmcv==2.2.0, mmengine==0.10.5
- timm, deepspeed, pycocotools, lvis, jsonlines, fairscale, nltk, peft, wandb
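A quick way to confirm your environment matches the pins above is to check the installed versions from Python. This is a minimal sketch, assuming the `packaging` package (normally installed alongside pip) is available:

```python
# Sanity-check the dependency versions pinned above.
from packaging.version import Version
import numpy, torch, transformers, mmcv, mmengine

assert Version(numpy.__version__) < Version("1.24"), \
    "numpy must be < 1.24 (1.22 or 1.23 recommended)"
print("torch       :", torch.__version__)         # expected 2.2.1+cu121
print("transformers:", transformers.__version__)  # expected 4.37.2
print("mmcv        :", mmcv.__version__)          # expected 2.2.0
print("mmengine    :", mmengine.__version__)      # expected 0.10.5
```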
```
|--huggingface
| |--bert-base-uncased
| |--siglip-so400m-patch14-384
| |--my_llava-onevision-qwen2-0.5b-ov-2
| |--mm_grounding_dino
| | |--grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
| | |--grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth
| | |--grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth
|--grounding_data
| |--coco
| | |--annotations
| | | |--instances_train2017_vg_merged6.jsonl
| | | |--instances_val2017.json
| | | |--lvis_v1_minival_inserted_image_name.json
| | | |--lvis_od_val.json
| | |--train2017
| | |--val2017
| |--flickr30k_entities
| | |--flickr_train_vg7.jsonl
| | |--flickr30k_images
| |--gqa
| | |--gqa_train_vg7.jsonl
| | |--images
| |--llava_cap
| | |--LLaVA-ReCap-558K_tag_box_vg7.jsonl
| | |--images
| |--v3det
| | |--annotations
| | | |--v3det_2023_v1_train_vg7.jsonl
| | |--images
|--LLMDet (code)
```
- Pretrained models
  - `bert-base-uncased` and `siglip-so400m-patch14-384` are downloaded directly from Hugging Face (see the sketch after this list).
  - To fully reproduce our results, please download `my_llava-onevision-qwen2-0.5b-ov-2` from Hugging Face or ModelScope; it was slightly fine-tuned by us during early exploration. The original `llava-onevision-qwen2-0.5b-ov` can also reproduce our results, but users then need to pretrain the projector.
  - Since LLMDet is fine-tuned from `mm_grounding_dino`, please download their Swin-T, Swin-B, and Swin-L checkpoints for training.
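If you prefer scripted downloads, the two encoders can be fetched into the `huggingface/` directory from the layout above. A minimal sketch, assuming the public Hugging Face repo ids `bert-base-uncased` and `google/siglip-so400m-patch14-384`:

```python
# Fetch the two pretrained encoders into the huggingface/ directory shown above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="bert-base-uncased",
                  local_dir="huggingface/bert-base-uncased")
snapshot_download(repo_id="google/siglip-so400m-patch14-384",
                  local_dir="huggingface/siglip-so400m-patch14-384")
```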
- Grounding data (GroundingCap-1M)
  - `coco`: You can download it from the COCO official website or from OpenDataLab.
  - `lvis`: LVIS shares the same images with COCO. You can download the minival annotation file from here and the val 1.0 annotation file from here.
  - `flickr30k_entities`: Flickr30k images.
  - `gqa`: GQA images.
  - `llava_cap`: images.
  - `v3det`: The V3Det dataset can be downloaded from OpenDataLab.
  - Our generated jsonls can be found on Hugging Face or ModelScope. A quick layout check is sketched after this list.
- For other evaluation datasets, please refer to MM-GDINO.
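Before training, it is worth verifying that the annotation files sit where the directory tree above expects them. A minimal sketch, run from the directory that contains `grounding_data/` (paths taken from the tree above):

```python
# Check that the GroundingCap-1M annotation files are in the expected locations.
from pathlib import Path

ROOT = Path("grounding_data")
EXPECTED = [
    "coco/annotations/instances_train2017_vg_merged6.jsonl",
    "coco/annotations/instances_val2017.json",
    "coco/annotations/lvis_v1_minival_inserted_image_name.json",
    "coco/annotations/lvis_od_val.json",
    "flickr30k_entities/flickr_train_vg7.jsonl",
    "gqa/gqa_train_vg7.jsonl",
    "llava_cap/LLaVA-ReCap-558K_tag_box_vg7.jsonl",
    "v3det/annotations/v3det_2023_v1_train_vg7.jsonl",
]

missing = [p for p in EXPECTED if not (ROOT / p).exists()]
print("missing files:", missing if missing else "none")
```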
```bash
# training (8 GPUs, mixed precision)
bash dist_train.sh configs/grounding_dino_swin_t.py 8 --amp

# testing the Swin-T checkpoint on 8 GPUs
bash dist_test.sh configs/grounding_dino_swin_t.py tiny.pth 8
```
```python
# One-time download of the NLTK resources required for the
# Phrase Grounding and Referential Expression Comprehension evaluations below.
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('stopwords')
```
- For Phrase Grounding and Referential Expression Comprehension, users should first download the `nltk` resources listed above.
- If you do not want to load the LLM during inference, please set `lmm=None` in the config (see the sketch below).
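If you would rather generate a dedicated inference config than edit the file by hand, the override can be scripted with `mmengine`. This is a minimal sketch; the exact location of the `lmm` key inside the config is an assumption, so adjust it to wherever `lmm` is defined in your config file:

```python
from mmengine.config import Config

cfg = Config.fromfile("configs/grounding_dino_swin_t.py")
# Disable the large multi-modal model so it is not loaded at inference time.
# NOTE: placing `lmm` under `model` is an assumption; move this line to
# wherever `lmm` appears in the config you use.
cfg.model.lmm = None
cfg.dump("configs/grounding_dino_swin_t_no_lmm.py")
```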
- Open-Vocabulary Object Detection

```bash
python image_demo.py images/demo.jpeg \
    configs/grounding_dino_swin_t.py --weight tiny.pth \
    --text 'apple .' -c --pred-score-thr 0.4
```

- Phrase Grounding

```bash
python image_demo.py images/demo.jpeg \
    configs/grounding_dino_swin_t.py --weight tiny.pth \
    --text 'There are many apples here.' --pred-score-thr 0.35
```

- Referential Expression Comprehension

```bash
python image_demo.py images/demo.jpeg \
    configs/grounding_dino_swin_t.py --weight tiny.pth \
    --text 'red apple.' --tokens-positive -1 --pred-score-thr 0.4
```
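If you need the demo programmatically, the same calls can be made through MMDetection's `DetInferencer`, which `image_demo.py` is built on in MM-GDINO-style codebases. This is a minimal sketch under that assumption; the keyword names follow MMDetection 3.x and may need adjusting for this repo's `image_demo.py`:

```python
# Programmatic counterpart of the open-vocabulary detection demo above,
# assuming the MMDetection 3.x DetInferencer API.
from mmdet.apis import DetInferencer

inferencer = DetInferencer(model="configs/grounding_dino_swin_t.py",
                           weights="tiny.pth")
results = inferencer("images/demo.jpeg",
                     texts="apple .",      # same prompt as the first demo command
                     pred_score_thr=0.4,
                     out_dir="outputs/")
print(results["predictions"][0])
```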
LLMDet is released under the Apache 2.0 license. Please see the LICENSE file for more information.
If you find our work helpful for your research, please consider citing our paper.
```bibtex
@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}
```
Our LLMDet is heavily inspired by many outstanding prior works. Thanks to the authors of these projects for open-sourcing their assets!