Does that mean VideoChat2 HD training doesn't use sub-images, and uses resized full images instead? If so, how does that work with the vision encoder's 224x224 input setup?
Hello,
Thank you for the great work!
For stage 4 (instruction tuning with HD data), the current code seems to resize/crop images to 224x224:
https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/scripts/videochat_mistral/config_7b_hd_stage4.py#L21
https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/dataset/__init__.py#L73
which means training actually uses 224x224 frames. Is that right? If so, what does the "HD" refer to? Or did I miss something?
Thank you!
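
For reference, here is a minimal sketch of the kind of 224x224 training transform the linked config appears to select. This is illustrative torchvision code, not the repo's actual pipeline; `image_res` and the crop/normalization parameters are assumptions (the normalization values are the standard CLIP statistics).

```python
# Illustrative sketch only (NOT the repo's actual code): a standard
# 224x224 training transform of the kind the linked config seems to use.
from torchvision import transforms

image_res = 224  # assumed value of the config's resolution field

train_transform = transforms.Compose([
    # Randomly crop a region and resize it to image_res x image_res,
    # so every frame reaching the vision encoder is 224x224.
    transforms.RandomResizedCrop(
        image_res,
        scale=(0.5, 1.0),  # assumed crop range, for illustration
        interpolation=transforms.InterpolationMode.BICUBIC,
    ),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    # Standard CLIP normalization statistics.
    transforms.Normalize(
        mean=(0.48145466, 0.4578275, 0.40821073),
        std=(0.26862954, 0.26130258, 0.27577711),
    ),
])
```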
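For context on the sub-image question in the title: one common way "HD" is implemented in models with a fixed 224x224 vision encoder is to split each high-resolution frame into a grid of 224x224 tiles and encode each tile separately, often alongside a downscaled global view. The sketch below illustrates that general technique; `tile_frame` is a hypothetical helper, and nothing here is taken from VideoChat2's code.

```python
# Hypothetical sketch of HD tiling: split a high-res frame into a grid of
# 224x224 sub-images so a fixed-resolution ViT can consume each tile.
# This shows the general technique only; it is NOT VideoChat2's code.
import torch
import torch.nn.functional as F

def tile_frame(frame: torch.Tensor, tile: int = 224) -> torch.Tensor:
    """frame: (C, H, W) tensor. Returns (N, C, tile, tile) sub-images."""
    c, h, w = frame.shape
    # Pick a grid that roughly preserves the aspect ratio, then resize so
    # the grid divides evenly (padding is another common choice).
    grid_h = max(1, round(h / tile))
    grid_w = max(1, round(w / tile))
    frame = F.interpolate(
        frame.unsqueeze(0),
        size=(grid_h * tile, grid_w * tile),
        mode="bicubic",
        align_corners=False,
    ).squeeze(0)
    # Cut into non-overlapping tile x tile sub-images.
    tiles = frame.unfold(1, tile, tile).unfold(2, tile, tile)        # (C, gh, gw, t, t)
    tiles = tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, tile, tile)  # (gh*gw, C, t, t)
    return tiles

# Example: a 448x672 frame becomes a 2x3 grid of six 224x224 sub-images,
# each of which a 224x224 vision encoder can process independently.
subimages = tile_frame(torch.rand(3, 448, 672))
print(subimages.shape)  # torch.Size([6, 3, 224, 224])
```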