Authors: Benjamin Schneider, Florian Kerschbaum, Wenhu Chen @ TIGER-Lab
- [2025/3/24] Added scripts to easily fetch our datasets from the HF hub, including our large (200 GB) pretraining dataset. Our training script now pulls these datasets directly from the hub, making it easy to train your own models / adapters. We also added a batched inference embedding function (example in batched_demo.py).
- [2025/3/4] Release of the ABC paper, along with the first release of our 🤗 Model and Datasets on Hugging Face (more to come, stay tuned!).
ABC's Design
We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions.
ABC is designed to give the user maximum control over how images are represented in embeddings. If you need to use natural language to specify which aspects of an image you want emphasized and represented, ABC is the perfect model for you!
The key behind ABC's training is that we pretrain the model on a large dataset of difficult embedding samples, where each batch contains many candidates that are relevant but not quite correct. The pretrained model is therefore able to generate embeddings that capture subtle differences. After a short finetuning stage, the model is ideal for tasks like VQA, where differences in user instructions result in different correct answers.
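To make this concrete, here is a minimal sketch of in-batch contrastive training with hard negatives, the general recipe described above. This is not the paper's exact objective; the loss form, temperature, and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, cand_emb, temperature=0.05):
    """query_emb: (B, D) query embeddings (e.g. image + instruction);
    cand_emb: (B, D) candidate embeddings. Candidate i is the positive for
    query i; every other row in the batch acts as a (hard) negative."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(cand_emb, dim=-1)
    logits = q @ c.T / temperature      # (B, B) similarity matrix
    labels = torch.arange(q.size(0))    # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors standing in for model outputs.
loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```

Mining batches so that the off-diagonal candidates are relevant-but-wrong is what pushes the embeddings to encode the subtle differences mentioned above.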
ABC produces high-quality embeddings: it achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on zero-shot classification and VQA tasks in the Massive Multimodal Embedding Benchmark.
| Model | Supports Instructions | Base Model | Training Dataset |
|---|---|---|---|
| ABC-Qwen2VL-Instruct | ✅ | ABC-Qwen2VL-Pretrain | TIGER-Lab/ABC-VG-Instruct |
| ABC-Qwen2VL-Pretrain | ❌ | Qwen2VL-Instruct | TIGER-Lab/ABC-Pretrain |
- ABC-VG-Instruct: A custom dataset for multimodal finetuning. Contains multiple instructions per image, each corresponding to a different aspect of that image.
- ABC-Pretrain: Multimodal pretraining dataset with mined negatives.
Install Dependencies:
git clone $
cd ABC
pip install -r requirements.txt
Start making multimodal embeddings!
python -i ./quick_start.py
Check out our paper for additional evaluations!
Our datasets are hosted on the Hugging Face Hub. The text data and dataset metadata can be fetched using HF's `load_dataset` utility.
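For example (the repo id below comes from the table above; the "train" split name is an assumption):

```python
from datasets import load_dataset

# Fetch the text/metadata portion of the pretraining set from the HF hub.
# Repo id taken from the table above; the "train" split name is an assumption.
pretrain = load_dataset("TIGER-Lab/ABC-Pretrain", split="train")
print(pretrain[0])  # inspect a single record
```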
To fetch the images from our datasets, we provide scripts in the `fetch_datasets` directory. These scripts pull the pretraining/finetuning image data off the hub and unpack it in your Hugging Face datasets cache (under a directory called `tigerlab`). Run `python ./fetch_datasets/pretrain.py` to get the pretraining dataset or `python ./fetch_datasets/instruct.py` to get the finetuning dataset.
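If you want to check where the unpacked images land, the Hugging Face datasets cache location can be read from the `datasets` library; the `tigerlab` subdirectory name comes from the description above:

```python
import os
from datasets import config

# The fetch scripts unpack image data under the HF datasets cache,
# in a directory called "tigerlab" (per the description above).
print(os.path.join(config.HF_DATASETS_CACHE, "tigerlab"))
```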
1. Install all requirements.
pip install -r training_requirements.txt
2. Download the appropriate dataset.
Either the pretraining dataset:
python ./fetch_datasets/pretrain.py
or the instruction finetuning dataset:
python ./fetch_datasets/instruct.py
3. Update Config
Find the config you want to run in the `config` folder (currently the example configs are nested under the `qwen` folder, one for pretraining and one for finetuning). At minimum, change the `output_dir` field to where you want the checkpoints to be saved (see the example below).
Feel free to change any other settings in your chosen config. 😊
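For example, you could point `output_dir` somewhere writable with a small script like this; it assumes the config is plain JSON (as the `.json` extension suggests) and uses the config path from the example in step 4:

```python
import json

cfg_path = "./config/qwen/QwenVL-8B-Instruct.json"  # example config path from step 4
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["output_dir"] = "/path/to/checkpoints"  # choose your own location

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```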
4. Run the training script
The `scripts` directory contains a file for training the model with different GPU / system config settings:
./scripts/qwen_finetune.sh {GPU} {PORT} {CONFIG_PATH}
for example:
./scripts/qwen_finetune.sh 0,1 44000 ./config/qwen/QwenVL-8B-Instruct.json
This runs our pretraining on GPUs 0 and 1, with communication over port 44000. The script still works if you only want to specify a single GPU for your training, as shown below.
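For example, to run the same config on a single GPU (GPU 0) over the same port:
./scripts/qwen_finetune.sh 0 44000 ./config/qwen/QwenVL-8B-Instruct.json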
If you encounter any problems, please open an issue on this repo. 😊
If you find this work helpful, please consider citing:
@misc{schneider2025abcachievingbettercontrol,
      title={ABC: Achieving Better Control of Multimodal Embeddings using VLMs},
      author={Benjamin Schneider and Florian Kerschbaum and Wenhu Chen},
      year={2025},
      eprint={2503.00329},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.00329},
}