VSTAR

This is the official implementation of the ACL 2023 paper "VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions".

Dataset

Schedule

  • release dialogues
  • release features (ResNet, RCNN)
  • release test data (2024.07.17)
  • release metadata (genres, keywords, storylines, characters: names, avatars)
  • release frames

Dialogues

Supported languages: English, Simplified Chinese (简体中文)

  • Downloads

Storage: Train (196 MB); Valid (11.6 MB); Test (24 MB)

Links: BaiduNetDisk or GoogleDrive

  • Statistics

|       | Clips   | Dialogues | Scenes/clip | Topics/clip |
|-------|---------|-----------|-------------|-------------|
| Train | 172,041 | 4,319,381 | 2.42        | 3.68        |
| Val   | 9,753   | 250,311   | 2.64        | 4.29        |
| Test  | 9,779   | 250,436   | 2.56        | 4.12        |
  • Format
{
	"dialogs":[
		{
			"clip_id": "Friends_S01E01_clip_000",
			"dialog": ["hi", ...],
			"scene": [1, 1, 1, 1, 1, 1, 2, 2, ...],
			"session": [1, 1, 1, 2, 2, 2, 3, 3, ...]
		},
		...
]
}
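For example, the dialogue files can be read with plain json. This is a minimal sketch: the file name train.json and the assumption that scene/session labels align one-to-one with utterances follow from the format shown above.

import json

# assumes train.json follows the format shown above
with open("train.json", encoding="utf-8") as f:
    data = json.load(f)

for clip in data["dialogs"]:
    utterances = clip["dialog"]   # list of utterance strings
    scenes = clip["scene"]        # scene id for each utterance
    sessions = clip["session"]    # topic session id for each utterance
    assert len(utterances) == len(scenes) == len(sessions)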

Feature

  • Downloads

Storage: RCNN (246.2 GB), ResNet (109 GB)

Links: BaiduNetDisk

  • Format

File Structure:

# [name of TV show]_S[season]E[episode]_clip_[clip id].npy
├── Friends_S01E01
│   ├── Friends_S01E01_clip_000.npy
│   ├── Friends_S01E01_clip_001.npy
│   └── ...
└── ...

ResNet:

# numpy.load("Friends_S01E01_clip_000.npy")
(num_of_frames * 1000) # one 1000-d ResNet feature vector per frame

RCNN:

# numpy.load("Friends_S01E01_clip_000.npy", allow_pickle=True).item()
{
	"feature": (9 * num_of_frames * 2048), # array(float32), features of the top 9 objects per frame
	"size": (num_of_frames * 2), # list(int), size of the original frame
	"box": (9 * num_of_frames * 4), # array(float32), bounding boxes
	"obj_id": (9 * num_of_frames), # list(int), object ids
	"obj_conf": (9 * num_of_frames), # array(float32), object confidence scores
	"obj_num": (num_of_frames) # list(int), number of objects per frame
}
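For quick inspection, both feature releases can be read back with numpy. This is a minimal sketch based on the shapes listed above; the clip name is the example from the file structure, loaded from the ResNet and RCNN release directories respectively.

import numpy as np

# ResNet release: one array per clip, shape (num_of_frames, 1000)
resnet_fea = np.load("Friends_S01E01_clip_000.npy")
print(resnet_fea.shape)

# RCNN release: a pickled dict per clip (see the format above)
rcnn_fea = np.load("Friends_S01E01_clip_000.npy", allow_pickle=True).item()
for key, value in rcnn_fea.items():
    print(key, np.shape(value))  # np.shape works for both arrays and lists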
  • Feature Extraction Tools

Please refer to OpenViDial_extract_features.

Installation

pip install -r requirements.txt

Scene Segmentation

  • Preprocess

Move train.json, valid.json, and test.json into the inputs/full directory.

Run the following script to convert the original annotations into the binary segmentation format expected by our baselines (see our paper for details):

cd inputs/full
python preprocess.py
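To illustrate what the binary format refers to, the sketch below derives a per-utterance boundary label (1 when the scene/session id changes, 0 otherwise) from the label lists in the dataset format. This is a hypothetical illustration only; preprocess.py is the authoritative conversion.

# hypothetical illustration: binary boundary labels from scene ids
def to_boundary_labels(ids):
    return [0] + [int(cur != prev) for prev, cur in zip(ids, ids[1:])]

print(to_boundary_labels([1, 1, 1, 1, 1, 1, 2, 2]))
# [0, 0, 0, 0, 0, 0, 1, 0]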
  • Train
python train_seg.py \
	--video 1 \
	--exp_set EXP_LOG \
	--train_batch_size 4
  • Infer
python generate_seg.py \
	--ckptid SAVED_CKPT_ID \
	--gpuid 0 \
	--exp_set EXP_LOG \
	--video 1

Topic Segmentation

  • Train
python train_seg.py \
	--video 0 \
	--exp_set EXP_LOG \
	--train_batch_size 4
  • Infer
python generate_seg.py \
	--ckptid SAVED_CKPT_ID \
	--gpuid 0 \
	--exp_set EXP_LOG \
	--video 0

Dialogue Generation

To use coco_caption for evaluation, run the following script to generate the reference file:

cd inputs/full
python coco_caption_reformat.py

For evaluation details, please refer to https://github.com/tylin/coco-caption.
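As a quick sanity check, the same metrics can be computed with the pip-installable pycocoevalcap port of coco-caption. This is a minimal sketch: the {id: [sentence]} dicts and the example sentences are illustrative, and inputs are assumed to be pre-tokenized.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# references and model outputs keyed by example id (illustrative values)
gts = {"0": ["how are you doing"], "1": ["we were on a break"]}
res = {"0": ["how are you"], "1": ["we were on a break"]}

bleu, _ = Bleu(4).compute_score(gts, res)   # list of BLEU-1..4 scores
cider, _ = Cider().compute_score(gts, res)  # single corpus-level score
print("BLEU:", bleu, "CIDEr:", cider)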

  • Train
python train_gen.py \
	--train_batch_size 4 \
	--model bart \
	--exp_set EXP_LOG \
	--video 1 \
	--fea_type resnet

  • Infer
python generate.py \
	--ckptid SAVED_CKPT_ID \
	--gpuid 0 \
	--exp_set EXP_LOG \
	--video 1 \
	--sess 1 \
	--batch_size 4

Citation

@misc{wang2023vstar,
    title={VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions},
    author={Yuxuan Wang and Zilong Zheng and Xueliang Zhao and Jinpeng Li and Yueqian Wang and Dongyan Zhao},
    year={2023},
    eprint={2305.18756},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}