Pritam Sarkar
Sayna Ebrahimi
Ali Etemad
Ahmad Beirami
Sercan Ö. Arık
Tomas Pfister
# Create and activate a Python 3.10 environment
conda create -n halva python=3.10 -y
conda activate halva
# Install the dependencies
pip install --upgrade pip
pip install -r req.txt
# Load CUDA (on clusters with environment modules) and install flash attention
module load cuda/11.7.1
pip install flash-attn --no-build-isolation
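As an optional sanity check (not part of the original setup), the commands below verify that PyTorch can see the GPU and that flash-attn imports cleanly:

# Optional sanity check: confirm CUDA is visible to PyTorch and flash-attn imports.
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import flash_attn; print('flash-attn import OK')"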
We share a minimal setup to quickly try out HALVA! See this notebook.
Generative data-augmented contrastive samples
- Vision-language instructions and their correct and hallucinated responses are available here: data
- Download the images from Visual Genome and save both part 1 and part 2 as data/vg/VG_100K and data/vg/VG_100K_2, respectively (a download sketch follows below).
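A hedged sketch of how the two Visual Genome image archives could be fetched and placed into the expected directories; the archive URLs are placeholders to be taken from the official Visual Genome download page, and the extracted folder names should be verified after unzipping:

# Sketch only: fill in the archive URLs from the official Visual Genome download page.
VG_PART1_URL="<visual-genome-part1-images-zip-url>"
VG_PART2_URL="<visual-genome-part2-images-zip-url>"
mkdir -p data/vg
wget -O /tmp/vg_part1.zip "${VG_PART1_URL}"
wget -O /tmp/vg_part2.zip "${VG_PART2_URL}"
# Assuming the archives extract to VG_100K/ and VG_100K_2/; verify after unzipping.
unzip -q /tmp/vg_part1.zip -d data/vg/
unzip -q /tmp/vg_part2.zip -d data/vg/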
Reference samples
- A random subset of llava_v1_5_mix665k.json. For reproducibility, we share the exact subset used in our study: ref data
- Image sources (a directory-layout sketch follows this list):
  - MSCOCO - download them as data/MSCOCO2017
  - TextVQA - download them as data/textvqa
  - GQA - download them as data/gqa
  - OCR-VQA - download them as data/ocr_vqa
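A minimal sketch of the expected data directory layout, assuming the default paths used above and in the evaluation sections; adjust if you store the datasets elsewhere:

# Create the expected data directories, then place the downloaded images inside them.
mkdir -p data/vg/VG_100K data/vg/VG_100K_2
mkdir -p data/MSCOCO2017 data/textvqa data/gqa data/ocr_vqa
mkdir -p data/MSCOCO2014/val2014   # validation images used later for CHAIR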
- The base model LLaVA-v1.5 weights can be found here: 7B and 13B.
- We use 4 A100 80GB GPUs for training, which takes about 1.5 hours for the 7B variant and 3 hours for the 13B variant. If you are using different GPUs, please make sure to match our default effective batch size (batch_size x gradient accumulation steps) for optimal performance with the default hyperparameters; see the sketch after the list below.
- The following training scripts can be used to train HALVA with LLaVA-1.5 as the base model:
  - HALVA-7B: src/hallava_7b.sh
  - HALVA-13B: src/hallava_13b.sh
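As a rough illustration of the effective-batch-size note above, the sketch below shows how one might recompute gradient accumulation steps for a different GPU count; the variable names and numbers are illustrative only, and the actual values should be read from the provided training scripts:

# Illustrative only: keep NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM equal to the default effective batch size.
NUM_GPUS=2                     # e.g. 2 GPUs instead of the default 4
PER_DEVICE_BATCH=16            # example per-GPU batch size; read the real value from src/hallava_7b.sh
DEFAULT_EFFECTIVE_BATCH=128    # example default effective batch size; read the real value from the script
GRAD_ACCUM=$(( DEFAULT_EFFECTIVE_BATCH / (NUM_GPUS * PER_DEVICE_BATCH) ))
echo "Set gradient accumulation steps to ${GRAD_ACCUM} in the training script."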
Choose the HALVA variant and its base model. We provide sample validation scripts for evaluation; please make sure to update the paths based on your setup.
MODEL="halva13b-lora"
MODEL_BASE="liuhaotian/llava-v1.5-13b"
# OR
MODEL="halva7b-lora"
MODEL_BASE="liuhaotian/llava-v1.5-7b"
- Download the validation images from MSCOCO2014 and store them as data/MSCOCO2014/val2014. We use the same 500 images for validation as used in prior work.
- You can use the given sample script for evaluation.
##### run chair
bash src/evaluate_hall/chair.sh ${MODEL} ${MODEL_BASE}
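If helpful, here is a hedged sketch for fetching the COCO 2014 validation images referenced above; please verify the archive URL against the official MSCOCO download page:

# Sketch only: verify the URL on the official COCO download page before use.
mkdir -p data/MSCOCO2014
wget -O /tmp/val2014.zip http://images.cocodataset.org/zips/val2014.zip
unzip -q /tmp/val2014.zip -d data/MSCOCO2014/   # assuming the archive extracts to val2014/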
- MME-Hall is a subset of MME consisting of existence, count, position, and color.
- You can follow the official instructions for MME evaluation: link, and download the MME benchmark.
- Once the data is downloaded, you can use the given sample script for evaluation.
##### run mme
bash src/evaluate_hall/mme.sh ${MODEL} ${MODEL_BASE}
- Download the validation images from the source repo AMBER and keep them as data/amber/image/.
- Download the annotation data directory and save it as eval_hall/amber/data.
- Once the data is downloaded, you can use the given sample script for evaluation.
##### run amber evaluation on 4 GPUs in parallel if available, else run sequentially by removing & from the end
bash src/evaluate_hall/amber.sh g ${MODEL} ${MODEL_BASE} 0 &
bash src/evaluate_hall/amber.sh da ${MODEL} ${MODEL_BASE} 1 &
bash src/evaluate_hall/amber.sh dr ${MODEL} ${MODEL_BASE} 2 &
bash src/evaluate_hall/amber.sh de ${MODEL} ${MODEL_BASE} 3 &
wait
# get amber f1 for all discriminative tasks
bash src/evaluate_hall/amber_f1.sh ${MODEL}
- The validation data will be directly downloaded from HuggingFace. You can use the given sample script for evaluation.
##### run mmhal-bench
bash src/evaluate_hall/mmhal.sh ${MODEL} ${MODEL_BASE} 0
- Download the validation images from link and save them in data/hallusion_bench.
- Download the annotation files from link and save them in eval_hall/hallusion_bench.
- For more details, you can check the official repo. You can use the given sample script for evaluation.
##### run hallusion-bench
bash src/evaluate_hall/hallusionbench.sh ${MODEL} ${MODEL_BASE} 0
In addition to the above evaluations on hallucination benchmarks, we also evaluate on general vision-language benchmarks. For those, we directly follow the LLaVA repo.
The above instructions mainly apply to the LLaVA-1.5-based checkpoints; the corresponding VILA code can be found inside the *_vila directories.
If you find this repository useful, please consider giving a star ⭐ and citing our work using the BibTeX entry below:
@misc{sarkar2024halva,
title={Data-Augmented Phrase-Level Alignment for Mitigating Object Hallucination},
author={Pritam Sarkar and Sayna Ebrahimi and Ali Etemad and Ahmad Beirami and Sercan Ö. Arık and Tomas Pfister},
year={2024},
eprint={2405.18654},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This code base is built upon LLaVA and VILA.
You may directly contact me at [email protected] or connect with me through LinkedIn.