Xiaofen Xing, Bolun Cai, Yinhu Zhao, Shuzhen Li, Zhiwei He, Weiquan Fan
We propose a novel hierarchical recall model fusing multiple modalities (audio, video and text) for bipolar disorder classification, where patients with different mania levels are recalled layer by layer. To address the complex data distribution of the challenge, the proposed framework combines multiple models, modalities and layers to perform domain adaptation for each patient and hard-sample mining for special patients. Experimental results show that our framework achieves competitive performance, with an Unweighted Average Recall (UAR) of 59.26% on the test set and 64.29% on the development set.
If you use this code in your research, please cite:
@inproceedings{avec2018hrf,
  author    = {Xing, Xiaofen and Cai, Bolun and Zhao, Yinhu and Li, Shuzhen and He, Zhiwei and Fan, Weiquan},
  title     = {Multi-modality Hierarchical Recall Based on GBDTs for Bipolar Disorder Classification},
  booktitle = {Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop},
  series    = {AVEC'18},
  location  = {Seoul, Republic of Korea},
  pages     = {31--37},
  numpages  = {7},
  year      = {2018},
}
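For intuition, the layer-by-layer recall described in the abstract can be pictured as a cascade of binary classifiers: each layer recalls one mania level, and the remaining samples fall through to the next layer. The sketch below is a minimal illustration with scikit-learn's `GradientBoostingClassifier`, not the actual challenge pipeline (which fuses multiple models and modalities in the scripts under `./code`).

```python
# Illustrative cascade: each layer is a binary GBDT that recalls one
# mania level; unrecalled samples fall through to the next layer.
# This is a sketch of the idea only, not the repository's pipeline.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_cascade(X, y, levels=(1, 2, 3)):
    layers, mask = [], np.ones(len(y), dtype=bool)
    for level in levels[:-1]:                   # last level is the fallback
        clf = GradientBoostingClassifier()
        clf.fit(X[mask], (y[mask] == level).astype(int))
        layers.append((level, clf))
        mask[mask] = clf.predict(X[mask]) == 0  # keep unrecalled samples
    return layers, levels[-1]

def predict_cascade(X, layers, fallback):
    pred = np.full(len(X), fallback)
    mask = np.ones(len(X), dtype=bool)
    for level, clf in layers:
        hit = np.zeros(len(X), dtype=bool)
        hit[mask] = clf.predict(X[mask]) == 1   # samples this layer recalls
        pred[hit] = level
        mask &= ~hit
    return pred
```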
pip install -r requirements.txt
cp -r <path-to LLDs_audio_eGeMAPS> ./data/LLDs_audio_eGeMAPS
cp -r <path-to LLDs_audio_opensmile_MFCCs> ./data/LLDs_audio_opensmile_MFCCs
cp -r <path-to VAD_turns> ./data/VAD_turns
cp -r <path-to LLDs_video_openFace>/*.csv ./data/AU
cp <path-to recordings_video>/*/*.mp4 ./data/video
sh ./code/features_extraction.sh
sh ./code/model_generation.sh
sh ./code/model_test.sh
Note: the predictions are written in sample order, i.e., dev_001, dev_002, ..., dev_060 in `./result/predictions_dev.csv`, and test_001, test_002, ..., test_054 in `./result/predictions_test.csv`.
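For reference, UAR is simply the macro-averaged (unweighted) recall over the classes. A minimal scoring sketch, assuming the prediction file holds one label per row in the order above, and assuming a hypothetical `labels_dev.csv` in the same order:

```python
import pandas as pd
from sklearn.metrics import recall_score

# Both files are assumed to hold one label per row, in the dev_001 ...
# dev_060 order described above; labels_dev.csv is hypothetical.
pred = pd.read_csv('./result/predictions_dev.csv', header=None).iloc[:, 0]
true = pd.read_csv('labels_dev.csv', header=None).iloc[:, 0]

# UAR = Unweighted Average Recall = macro-averaged recall.
print('UAR: %.4f' % recall_score(true, pred, average='macro'))
```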
`features_extraction_video_body_action.py` extracts body action features for each video, which takes more than one day to run. Therefore, we have prepared these features in `./features/body_action_features.csv`; you can also run the script yourself.
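For example, the precomputed features can be loaded directly (the exact column layout is whatever the extraction script emits):

```python
import pandas as pd

# Load the precomputed body-action features shipped with the repository;
# regenerate with features_extraction_video_body_action.py if needed.
body_feats = pd.read_csv('./features/body_action_features.csv')
print(body_feats.shape)
```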
2. Emotion extracted by Face++
Emotions are extracted from the raw official videos with the Face++ toolkit, and we have prepared the entire extracted emotion data in `./data/emotion`. If emotion features are required for new data, please apply for a Face++ account and acquire an API key and secret. Sample code for obtaining emotion features is available as:
python faceplusplus_emotion.py -k <key> -s <secret>
Note: this code extracts emotions for each video, which takes more than one day. In addition, because the Face++ API has been updated recently, the extracted features may differ slightly from ours. Therefore, we have prepared these features in `./data/emotion`.
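For orientation, a minimal sketch of the per-frame call, following the public Face++ Detect v3 API (as noted above, the API has been updated, so response fields may differ):

```python
import cv2
import requests

API_URL = 'https://api-us.faceplusplus.com/facepp/v3/detect'

def frame_emotion(frame, key, secret):
    """Query Face++ Detect for the emotion scores of one video frame."""
    ok, buf = cv2.imencode('.jpg', frame)        # JPEG-encode the frame
    resp = requests.post(
        API_URL,
        data={'api_key': key,
              'api_secret': secret,
              'return_attributes': 'emotion'},
        files={'image_file': ('frame.jpg', buf.tobytes())},
    )
    faces = resp.json().get('faces', [])
    # Each detected face carries scores for anger, disgust, fear,
    # happiness, neutral, sadness and surprise.
    return faces[0]['attributes']['emotion'] if faces else None
```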
3. Topic extracted by GCP
The Turkish transcripts (in `./turkish_text_json`) with word time offsets are generated by Automatic Speech Recognition (https://cloud.google.com/speech/), and the English transcripts (in `./translation_orgin`) are translated by Neural Machine Translation (https://cloud.google.com/translate/). The corrected English transcripts with topic segmentation are as follows:
| Topic | Descriptor |
|-------|------------|
| #1 | Describe why you come here |
| #2 | Depict Van Gogh’s Depression |
| #3 | Describe the worst memory |
| #4 | Count 1-30 |
| #5 | Count 1-30 again (often faster) |
| #6 | Depict Dengel’s Home Sweet Home |
| #7 | Describe the best memory |
A sentence that does not belong to any of these 7 topics is labeled as Topic #0.
A concatenated table (`7topics_data.csv`) can be generated by:
python concat_all_transcripts.py
Accurate topic timestamps can then be obtained directly from `7topics_data.csv`.
- `time_7topics.csv` -- seven topic timestamps obtained from `7topics_data.csv`.
- `time_3topics.csv`, `times_3topics_video.csv` -- three topic timestamps obtained from `7topics_data.csv` according to the rule: Topics 1-3 form the first part, Topics 4-5 the second part, and Topics 6-7 the third part. An illustrative sketch follows this list.
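As an illustration, the timestamp extraction and the 7-to-3 grouping might look like the sketch below; the column names `topic`, `start` and `end` are assumptions, so check the output of `concat_all_transcripts.py` for the actual schema:

```python
import pandas as pd

# Assumed schema: one word per row with columns 'topic', 'start', 'end'.
df = pd.read_csv('7topics_data.csv')

# Seven-topic timestamps: first and last word time offset per topic.
seven = (df[df['topic'] > 0]
         .groupby('topic')
         .agg(start=('start', 'min'), end=('end', 'max')))

# Three-part grouping: Topics 1-3 -> part 1, 4-5 -> part 2, 6-7 -> part 3.
parts = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
three = (df.assign(part=df['topic'].map(parts))
         .dropna(subset=['part'])
         .groupby('part')
         .agg(start=('start', 'min'), end=('end', 'max')))
```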
If topic features are required for new data, we provide a script for Automatic Speech Recognition of Turkish audio and a script for Neural Machine Translation of Turkish text. A Google Cloud Platform (GCP) account is needed.
- `translation_ASR.sh` -- the script for Automatic Speech Recognition of Turkish audio.
- `translation_check.sh` -- the script for Neural Machine Translation of Turkish text.
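For new data, these steps amount to two standard GCP calls. A minimal sketch with the official `google-cloud-speech` and `google-cloud-translate` Python client libraries (the repository's scripts are shell wrappers and may invoke the REST API instead):

```python
from google.cloud import speech, translate_v2

def transcribe_turkish(gcs_uri):
    """Turkish ASR with word time offsets via Cloud Speech-to-Text."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        language_code='tr-TR',
        enable_word_time_offsets=True,    # per-word timestamps
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    return client.long_running_recognize(config=config, audio=audio).result()

def translate_to_english(turkish_text):
    """Turkish -> English via Cloud Translation."""
    client = translate_v2.Client()
    return client.translate(turkish_text,
                            source_language='tr',
                            target_language='en')['translatedText']
```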