
Contributors Forks Stargazers Issues Unlicense License LinkedIn


Logo

Video-QA

Open-ended video question answering on the NExT-GQA dataset, built on the LLaVA-1.6 and GPT-4o mini models and using a sliding-window sampling method.
Explore the docs »

View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Requirements
  3. Datasets
  4. Usage
  5. Results
  6. Contributing
  7. License
  8. Cite

🔥 About The Project

This repository implements open-ended video question answering on the NExT-GQA dataset with the LLaVA-1.6 and GPT-4o mini models, using a sliding-window sampling strategy to select frames from each video.

(back to top)

🧐 Requirements

Environment:

Operating System: 
Conda Version:
Python Version: 
CUDA Version: 

Main Python packages:

tqdm
moviepy
opencv-python
openai==1.14.0
torch==2.2.0
bitsandbytes==0.42.0
flash_attn==2.5.3
transformers==4.36.2
transformers-stream-generator==0.0.4
torchvision==0.17.0
pytorchvideo @ git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d

Run the following command to install the required packages:

pip install -r requirements.txt

Configure the object tracking module:

Copy the files from the SAMTrack directory into your site-packages path to enable the object tracking functionality.
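
A minimal sketch of that copy step in Python, assuming the SAMTrack directory sits at the repository root and that the first path reported by the site module is the right destination for your environment:

# copy_samtrack.py - minimal sketch (assumes ./SAMTrack exists at the repo root)
import shutil
import site
from pathlib import Path

src = Path("./SAMTrack")               # tracking module shipped with this repo
dst = Path(site.getsitepackages()[0])  # first site-packages path of the active environment

for item in src.iterdir():
    target = dst / item.name
    if item.is_dir():
        # copy package directories, overwriting any previous install
        shutil.copytree(item, target, dirs_exist_ok=True)
    else:
        shutil.copy2(item, target)
print(f"Copied SAMTrack files into {dst}")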

(back to top)

🤗 Datasets

We use NExT-GQA, a large-scale video question-answer dataset, which you can access and download from here.

(back to top)

🎯 Usage

Run the following command to evaluate without sliding-window sampling (frames are sampled uniformly across the entire video):

python eval_gpt4v_openended.py --path_qa_pair_csv ./data/open_ended_qa/Next_GQA.csv --path_video ./data/NextGQAvideo/%s.mp4 --path_result ./result_NextGQA_gpt4/ --api_key4 <your gpt4o-mini api key> --api_key3 <your gpt3 api key>
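
For reference, a minimal sketch of uniform frame sampling with opencv-python; the helper name, the example video path, and the choice of 6 frames are illustrative assumptions, not the exact logic of the evaluation script:

# uniform_sampling.py - illustrative sketch, not the exact logic of eval_gpt4v_openended.py
import cv2

def sample_uniform_frames(video_path, num_frames=6):
    """Return `num_frames` BGR frames spaced evenly across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # pick evenly spaced frame indices over [0, total)
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

frames = sample_uniform_frames("./data/NextGQAvideo/example.mp4", num_frames=6)
print(f"sampled {len(frames)} frames")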

Run the following command to evaluate without any video input:

python eval_gpt4v_openended_novideo.py --path_qa_pair_csv ./data/open_ended_qa/Next_GQA.csv --path_video ./data/NextGQAvideo/%s.mp4 --path_result ./result_NextGQA_gpt4_novideo/ --api_key4 <your gpt4o-mini api key> --api_key3 <your gpt3 api key>

Run the following command to evaluate without evidence segments (i.e., the segments containing the ground-truth evidence are removed from the video):

python eval_gpt4v_openended_woevidence_separate.py --path_qa_pair_csv ./data/open_ended_qa/Next_GQA.csv --path_video ./data/NextGQAvideo/%s.mp4 --path_result ./result_NextGQA_gpt4_woevidence/ --api_key4 <your gpt4o-mini api key> --api_key3 <your gpt3 api key>
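
A minimal sketch of how an evidence segment can be cut out of a clip with moviepy; the ground-truth interval and file names are illustrative assumptions:

# remove_evidence.py - illustrative sketch of cutting a ground-truth segment out of a video
from moviepy.editor import VideoFileClip, concatenate_videoclips

video = VideoFileClip("./data/NextGQAvideo/example.mp4")
gt_start, gt_end = 12.0, 18.5  # hypothetical ground-truth evidence interval (seconds)

# keep everything before and after the evidence segment, then stitch the parts together
parts = [video.subclip(0, gt_start), video.subclip(gt_end, video.duration)]
edited = concatenate_videoclips(parts)
edited.write_videofile("example_woevidence.mp4", audio=False)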

Run the following command to evaluate with ground-truth grounding (extracting 6 frames, separate):

python eval_gpt4v_openended_separate_ground.py --path_qa_pair_csv ./data/open_ended_qa/Next_GQA.csv --path_video ./data/NextGQAvideo/%s.mp4 --path_result ./result_NextGQA_gpt4_separate_ground/ --api_key4 <your gpt4o-mini api key> --api_key3 <your gpt3 api key>

Run the following command to evaluate answer selection by perplexity under the sliding-window method (stride 15 / window 30, extracting 6 frames per window, separate):

python eval_gpt4v_openended_sliding_separate.py --path_qa_pair_csv ./data/open_ended_qa/Next_GQA.csv --path_video ./data/NextGQAvideo/%s.mp4 --path_result ./result_NextGQA_gpt4_separate/ --api_key4 <your gpt4o-mini api key> --api_key3 <your gpt3 api key>
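
A minimal sketch of the sliding-window split described above (30-second windows with a 15-second stride, 6 frames per window); the helper name and the frame spacing inside each window are assumptions for illustration:

# sliding_windows.py - illustrative sketch of the 30 s window / 15 s stride sampling
def sliding_windows(duration, window=30.0, stride=15.0, frames_per_window=6):
    """Yield (start, end, frame_timestamps) tuples covering the whole video."""
    start = 0.0
    while start < duration:
        end = min(start + window, duration)
        # spread the sampled frames evenly inside the current window
        step = (end - start) / frames_per_window
        timestamps = [start + step * (i + 0.5) for i in range(frames_per_window)]
        yield start, end, timestamps
        start += stride

for s, e, ts in sliding_windows(duration=75.0):
    print(f"window [{s:5.1f}, {e:5.1f}] -> frames at {[round(t, 1) for t in ts]}")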

Run the following command to evaluate with Object Segmentation & Tracking (SAMTrack) added under ground-truth conditions:

python eval_gpt4v_openended_separate_ground_track.py --path_qa_pair_csv ./data/open_ended_qa/Next_GQA.csv --path_video ./data/NextGQAvideo/%s.mp4 --path_result ./result_NextGQA_gpt4_separate_ground_samtrack/ --api_key4 <your gpt4o-mini api key> --api_key3 <your gpt3 api key>

Run the following command to evaluate answer selection by confidence (with a maximum score of 1000) under the sliding-window method (stride 15 / window 30, extracting 6 frames per window, separate). This is the current best configuration (QA-Acc: 39.80, IoP: 27.12, GQA: 13.2):

python eval_gpt4v_openended_sliding_separate_confidence.py --path_qa_pair_csv ./data/open_ended_qa/Next_GQA.csv --path_video ./data/NextGQAvideo/%s.mp4 --path_result ./result_NextGQA_gpt4_separate_confidence/ --api_key4 <your gpt4o-mini api key> --api_key3 <your gpt3 api key>
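
A minimal sketch of the confidence-based selection: each window yields a candidate answer with a self-reported confidence score capped at 1000, and the highest-scoring answer is kept. The result structure below is a hypothetical illustration, not output from the script:

# select_by_confidence.py - illustrative sketch of picking the answer with the highest confidence
# Each entry is a hypothetical (window, answer, confidence) result; confidence is capped at 1000.
window_results = [
    {"window": (0, 30),  "answer": "the boy picks up the toy", "confidence": 620},
    {"window": (15, 45), "answer": "the boy waves at the dog", "confidence": 940},
    {"window": (30, 60), "answer": "the boy sits on the sofa", "confidence": 410},
]

best = max(window_results, key=lambda r: min(r["confidence"], 1000))  # clamp to the 1000 ceiling
print(best["answer"], best["window"])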

(back to top)

🚨 Results

To be added ...

🤓 Contributing

Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Top contributors:

contrib.rocks image

(back to top)

😋 License

Distributed under the Unlicense. See LICENSE.txt for more information.

(back to top)

📝 Cite

To be added ...

(back to top)
