This repository provides the models, code, and data for our paper: On Domain-Specific Post-Training for Multimodal Large Language Models.
We investigate domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation.

1. Data Synthesis: Using open-source models, we develop a visual instruction synthesizer that effectively generates diverse visual instruction tasks from domain-specific image-caption pairs. Our synthetic tasks surpass those generated by manual rules, GPT-4, and GPT-4V in enhancing the domain-specific performance of MLLMs.
2. Training Pipeline: While two-stage training (first on image-caption pairs, then on visual instruction tasks) is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training.
3. Task Evaluation: We conduct experiments in two domains, biomedicine and food, by post-training MLLMs of different sources and scales (e.g., Qwen2-VL-2B, LLaVA-v1.6-8B, Llama-3.2-11B) and then evaluating their performance on various domain-specific tasks.
*********************** Updates *************************
- [2025/1/4] Updated ALL models, code and data to reproduce our results
- [2025/1/3] Released the post-training guide
- [2024/12/16] Released the data synthesis guide
- [2024/12/13] Released the evaluation guide
- [2024/11/29] Released our paper
Model | Repo ID in HF 🤗 | Domain | Base Model | Training Data | Evaluation Benchmark |
---|---|---|---|---|---|
AdaMLLM-med-2B | AdaptLLM/biomed-Qwen2-VL-2B-Instruct | Biomedicine | Qwen2-VL-2B-Instruct | biomed-visual-instructions | biomed-VQA-benchmark |
AdaMLLM-food-2B | AdaptLLM/food-Qwen2-VL-2B-Instruct | Food | Qwen2-VL-2B-Instruct | food-visual-instructions | food-VQA-benchmark |
AdaMLLM-med-8B | AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B | Biomedicine | open-llava-next-llama3-8b | biomed-visual-instructions | biomed-VQA-benchmark |
AdaMLLM-food-8B | AdaptLLM/food-LLaVA-NeXT-Llama3-8B | Food | open-llava-next-llama3-8b | food-visual-instructions | food-VQA-benchmark |
AdaMLLM-med-11B | AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct | Biomedicine | Llama-3.2-11B-Vision-Instruct | biomed-visual-instructions | biomed-VQA-benchmark |
AdaMLLM-food-11B | AdaptLLM/food-Llama-3.2-11B-Vision-Instruct | Food | Llama-3.2-11B-Vision-Instruct | food-visual-instructions | food-VQA-benchmark |
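To quickly try one of the checkpoints listed above, the sketch below loads AdaMLLM-med-2B with Hugging Face transformers. This is a minimal sketch rather than the repository's official usage example: it assumes the adapted checkpoint keeps the base Qwen2-VL-2B-Instruct architecture and chat template, and the image path and question are placeholders.

```python
# Minimal sketch (not the repo's official script): run AdaMLLM-med-2B with
# Hugging Face transformers, assuming the adapted checkpoint keeps the base
# Qwen2-VL-2B-Instruct architecture and processor.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "AdaptLLM/biomed-Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("path/to/your_image.jpg")  # placeholder: any local RGB image

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the key findings in this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Drop the prompt tokens before decoding the generated answer.
answer_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```

For the other base models, follow the loading instructions on the corresponding model cards on Hugging Face.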
We create two separate conda environments: `adamllm` and `vllm`.
- Clone this repo:

      git clone https://github.com/bigai-ai/QA-Synthesizer.git
      cd QA-Synthesizer

- Install the package:

      conda create -n adamllm python=3.10 -y
      conda activate adamllm
      pip install --upgrade pip
      pip install -e .

- Install additional packages for training:

      pip install -e ".[train]"
      pip install flash-attn --no-build-isolation
      conda deactivate
Install vLLM with pip or from source. As recommended in the official vLLM documentation, install vLLM in a fresh conda environment:

    conda create -n vllm python=3.10 -y
    conda activate vllm
    pip install vllm  # Ensure vllm>=0.6.2 for compatibility with Llama-3.2; if Llama-3.2 models are not used, vllm==0.6.1 is sufficient
    conda deactivate
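Optionally, you can sanity-check that the `vllm` environment satisfies the version note above; a trivial check such as the following is enough (the threshold only matters if you use Llama-3.2 models):

```python
# Optional sanity check for the vllm environment: print the installed vLLM
# version and compare it against the note above (>=0.6.2 for Llama-3.2).
import vllm

print("vLLM version:", vllm.__version__)
```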
The steps in Synthesis.md reproduce our visual instruction synthesizer and our synthetic data.
The steps in Post-Train.md reproduce our domain-adapted models.
See Evaluation.md to reproduce our results and evaluate any MLLMs compatible with vLLM.
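As a rough illustration of what vLLM-based inference on these models looks like outside the provided scripts, here is a hedged sketch of offline generation in the `vllm` environment. The chat-style prompt below simply follows the base Qwen2-VL template and the image path is a placeholder; Evaluation.md remains the authoritative reference for reproducing our results.

```python
# Rough sketch of offline multimodal inference with vLLM (not the evaluation
# pipeline itself). Assumes a Qwen2-VL-based checkpoint; the prompt follows the
# base model's chat format and the image path is a placeholder.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="AdaptLLM/biomed-Qwen2-VL-2B-Instruct")
sampling = SamplingParams(temperature=0.0, max_tokens=256)

image = Image.open("path/to/your_image.jpg")  # placeholder: any local RGB image
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "What does this image show?<|im_end|>\n<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```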
LICENSE AGREEMENT
Last revision: Sep, 2023
You are granted the right to use the code and/or Database under the following terms, as listed in this document (“Beijing Institute for General Artificial Intelligence BIGAI License Agreement”):
· The code and/or data is strictly for non-commercial academic research only.
· Any commercial use of the code or data requires prior contact and negotiation of licensing fees with the original authors or Beijing Institute for General Artificial Intelligence (BIGAI).
· Any new access to the code and/or data shall be established through this form or the official method of distributing the code and/or data. The code and/or data may not be redistributed, in whole or part, or in any format without written prior permission. A reference to the code and/or data or this License Agreement must be made if you publish information.
· The code and/or data is provided as is. The authors assume no liability or responsibility.
· The authors reserve the right to revise this License Agreement, in whole or in part, at any time without prior notice.
· You warrant that you have the authorization to enter into this License Agreement.
· You must comply with the terms enforced by the corporations whose products were used in collecting the code and/or data. These terms include, but are not limited to, restricting the use of the code and/or data to non-commercial academic research.
If you find our work helpful, please cite us:
@article{adamllm,
title={On Domain-Specific Post-Training for Multimodal Large Language Models},
author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
journal={arXiv preprint arXiv:2411.19930},
year={2024}
}
@inproceedings{adaptllm,
title={Adapting Large Language Models via Reading Comprehension},
author={Daixuan Cheng and Shaohan Huang and Furu Wei},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=y886UXPEZ0}
}