Code for the paper: VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap
```
conda create --name vdgd python=3.10
conda activate vdgd
pip install -r requirements.txt
```
All datasets must be placed in `datasets/`. For the dataset format, please refer to `datasets/amber.jsonl`.

Link for the dataset images: https://drive.google.com/drive/folders/1iPRBrvBwSdF0xEhsonCHsrZ6p0_x3_Bn?usp=share_link

Place the extracted images into their respective folders, for example: `amber_images -> datasets/AMBER/image`.
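To sanity-check the dataset layout before running inference, a minimal sketch like the following can be used (it only assumes the JSONL format and folder layout described above; it makes no assumptions about field names):

```python
import json
from pathlib import Path

# Minimal sanity check: the JSONL file parses and the image folder exists.
dataset_file = Path("datasets/amber.jsonl")
image_dir = Path("datasets/AMBER/image")

with dataset_file.open() as f:
    first = json.loads(next(f))

print("Fields in the first record:", sorted(first.keys()))
print("Image folder present:", image_dir.is_dir())
```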
This section describes the commands and setup required for running inference with multiple LVLMs. LLaVA and MplugOwl2 require the specific setup from their respective repositories; CogVLM and InternLM dependencies are already covered by `requirements.txt`.
```
cd inference_files/
```

LLaVA (Repository Setup):
```
python llava_inference.py <model_path> <dataset_name> <output_file_name> <sampling_flag>
python llava_v1_inference.py <dataset_name> <output_file_name> <sampling_flag>
```

CogVLM:
```
python cogvlm_inference.py --file_name <dataset_name> --out_file_name <output_file_name>
```

MplugOwl2 (Repository Setup):
```
python mlpug_owl2_inference.py <dataset_name> <output_file_name> <sampling_flag>
```

InternLM:
```
python internlm_inference.py <dataset_name> <output_file_name> <sampling_flag>
```

Examples:
```
python llava_inference.py liuhaotian/llava-v1.6-vicuna-7b amber llava_16_amber 0
python llava_v1_inference.py amber llava_v1_amber 0
python cogvlm_inference.py --file_name amber --out_file_name cogvlm_amber
```
Supported arguments:
- `model_path` - only for the LLaVA inference file; pass the path to a LLaVA 1.5 or LLaVA 1.6 model.
- `dataset_name` - file prefix in the `datasets/` folder.
- `output_file_name` - name of the output file, which will be saved in `inference_generations/`.
- `sampling_flag` - a 1 or 0 value that sets the sampling arguments for inference (see the sketch below).
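The exact generation settings live in each inference script; as a rough illustration only (not the repository's code, and the concrete temperature/top-p values are assumptions), a `sampling_flag` of 1 typically maps to sampling-style `generate()` kwargs while 0 maps to greedy decoding:

```python
def generation_kwargs(sampling_flag: int) -> dict:
    """Illustrative mapping of the 0/1 sampling flag to HuggingFace generate() kwargs.
    The actual values used by the inference scripts may differ."""
    if sampling_flag == 1:
        return {"do_sample": True, "temperature": 0.7, "top_p": 0.9}
    return {"do_sample": False}  # greedy decoding
```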
This section describes how to evaluate the quality of generations using GPT-4V.
```
cd gpt_evaluations/
python evaluate_gpt.py <inference_file_name>
```
Supported arguments:
- `inference_file_name` - file prefix of the LVLM output located in `inference_generations/`.
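`evaluate_gpt.py` handles the actual evaluation; as a rough, hypothetical sketch of what a GPT-4V quality check looks like with the official `openai` client (the prompt, model name, and scoring scheme below are placeholders, not the ones used by the script):

```python
import base64
from openai import OpenAI  # official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def evaluate_generation(image_path: str, generation: str) -> str:
    """Ask GPT-4V to rate one LVLM generation (placeholder prompt and rubric)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Rate the following image description for accuracy on a 1-5 scale:\n{generation}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```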
This section provides code for token analysis of LVLMs and base LLMs.
Token analysis for each dataset entry will be stored at `./AlignTDS/src/demo/docs/{model_generated_dataset_name}_tp_justeval/`.
```
cd AlignTDS/
pip install -r requirements.txt  # run this in a new environment; it includes support for LLaVA and CogVLM
sh run.sh <llm_model_name> <shard_size> <num_gpus> <model_generated_dataset_name> <dataset_length>
```
Example:
```
sh run.sh llava_1.6 126 8 llava_1.6_amber 1004
```
Supported arguments:
- `llm_model_name` - one of `llava_v1`, `llava_1.5`, `llava_1.6`, `cogvlm`.
- `shard_size` - `dataset_length / num_gpus` (see the note below).
- `model_generated_dataset_name` - name of the file in `./AlignTDS/data/` to run logit analysis on.
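The example above suggests that `shard_size` is `dataset_length / num_gpus` rounded up to an integer (1004 / 8 = 125.5 → 126); a quick way to compute it:

```python
import math

dataset_length, num_gpus = 1004, 8
shard_size = math.ceil(dataset_length / num_gpus)
print(shard_size)  # 126, matching the example above
```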
This section requires the setup of LLaMA-Factory and `meta-llama/Meta-Llama-3-8B`. We use the following prompt to identify all the visual elements:
```
I will provide you with a response from an AI agent which has been asked to describe an image. Please identify all the phrases that in the image description that constitute the image. These phrases might be foreground and background objects, adverbial phrases, etc. Return them as comma separated values. There should not be any additional information other than these values in the output. The response is as follows: {response}.
```
- `response` - the output generated by the LVLM.
Example output of LLaMA-Factory:
```
{"predict": "three individuals, a meadow, a park, wildflowers"}
{"predict": "canoe, calm waters, sky"}
{"predict": "young child, gray pants, grass, yellow flowers"}
```
```
cd hallucination_categorization/
```
For MMMU open-ended generation:
```
python gpt_categorize_hallucinations_mmmu.py <model_generated_dataset_name> <object_file_path> <gpt_eval_file_name>
```
For AMBER generation:
```
python gpt_categorize_hallucinations_amber.py <model_generated_dataset_name> <object_file_path> <gpt_eval_file_name>
```
Supported arguments:
- `model_generated_dataset_name` - the same argument used in the logit analysis section above.
- `object_file_path` - the file generated by LLaMA-Factory.
- `gpt_eval_file_name` - the same argument as `<inference_file_name>` in the GPT evaluation section.
This section details the execution of the VDGD algorithm for LVLM inference.

Copy the code from `LLaVA-Align/transformers_utils.py` to `./conda/envs/vdgd/lib/python3.10/site-packages/transformers/generation/utils.py`.
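The site-packages path above varies by machine and environment; a hedged sketch for locating the installed `transformers/generation/utils.py` and overwriting it with the patched file (run from the repository root; a backup of the original is kept first):

```python
import shutil
import transformers.generation.utils as gen_utils

# Locate the installed transformers/generation/utils.py for the active environment.
target = gen_utils.__file__
print("Patching:", target)

shutil.copy(target, target + ".bak")                      # keep a backup of the original
shutil.copy("LLaVA-Align/transformers_utils.py", target)  # overwrite with the patched version
```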
```
cd ./LLaVA-Align/experiments/eval/sampling/
```
For generating description logits:
```
python generate_desc_logits.py \
    --model_path liuhaotian/llava-v1.5-7b \
    --amateur_model_path meta-llama/Llama-2-7b-chat-hf \
    --question_file vallu_benchmark.jsonl \
    --answers_file desc_out.jsonl \
    --logit_out_file logits_out.pkl \
    --use_dd
```
Supported arguments:
- `model_path` - path to the LVLM model.
- `amateur_model_path` - path to the base LLM.
- `question_file` - path to the inference dataset.
- `answers_file` - path to the output inference generation.
- `logit_out_file` - path at which to store the description logits in a `.pkl` file.
- `use_dd` - flag to use debiased decoding.
Logits are stored at `./LLaVA-Align/experiments/eval/sampling/description_logits/logits_{question_file}.pkl`.
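The pickle can be inspected directly; its internal structure depends on `generate_desc_logits.py`, so the sketch below only assumes it is a standard pickle file:

```python
import pickle

with open("description_logits/logits_vallu_benchmark.pkl", "rb") as f:
    desc_logits = pickle.load(f)

print(type(desc_logits))  # inspect the stored object before running VDGD inference
```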
For VDGD inference:
```
python vdgd.py \
    --model_path liuhaotian/llava-v1.5-7b \
    --question_file vallu_benchmark.jsonl \
    --answers_file output.jsonl \
    --desc_file desc_out.jsonl \
    --logits_file description_logits/logits_vallu_benchmark.pkl \
    --decoding_type "vdgd" \
    --kl_reduction "avg"
```
Supported arguments:
- `model_path` - path to the LVLM model.
- `question_file` - path to the inference dataset.
- `answers_file` - path to the output inference generation.
- `desc_file` - path to the description file generated above.
- `logits_file` - path to the description logits file.
- `decoding_type` - type of decoding to use: `vdgd`, `gd` (greedy), or `sd` (sampling).
- `kl_reduction` - KL divergence reduction to use: `avg`, `sum`, or `min` (see the sketch below).
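As a rough illustration of what the `kl_reduction` options mean (a conceptual sketch, not the repository's implementation), given per-position KL divergences between two next-token distributions, the three options differ only in how those values are aggregated:

```python
import torch
import torch.nn.functional as F

def kl_reduction_demo(p_logits: torch.Tensor, q_logits: torch.Tensor, reduction: str = "avg") -> torch.Tensor:
    """Illustrative only: per-position KL(P || Q) between two sets of logits,
    aggregated with the avg / sum / min options named above."""
    log_p = F.log_softmax(p_logits, dim=-1)
    log_q = F.log_softmax(q_logits, dim=-1)
    kl_per_pos = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # one KL value per position
    if reduction == "avg":
        return kl_per_pos.mean()
    if reduction == "sum":
        return kl_per_pos.sum()
    if reduction == "min":
        return kl_per_pos.min()
    raise ValueError(f"unknown reduction: {reduction}")
```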
Please download the dataset from here.
We use the code from the following repositories: LLaVA, MplugOwl2, AlignTDS, LLaMA-Factory and LLaVA-Align.
Please cite the above repositories if you find their code useful.
```
@misc{ghosh2024vdgdmitigatinglvlmhallucinations,
      title={VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap},
      author={Sreyan Ghosh and Chandra Kiran Reddy Evuru and Sonal Kumar and Utkarsh Tyagi and Oriol Nieto and Zeyu Jin and Dinesh Manocha},
      year={2024},
      eprint={2405.15683},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2405.15683},
}
```