Paper Title
A Pre-trained Audio-Visual Transformer for Emotion Recognition
Paper Authors
Paper Abstract
In this paper, we introduce a pre-trained audio-visual Transformer trained on more than 500k utterances from nearly 4000 celebrities from the VoxCeleb2 dataset for human behavior understanding. The model aims to capture and extract useful information from the interactions between human facial and auditory behaviors, with application in emotion recognition. We evaluate the model performance on two datasets, namely CREMA-D (emotion classification) and MSP-IMPROV (continuous emotion regression). Experimental results show that fine-tuning the pre-trained model improves emotion classification accuracy by 5-7% and the Concordance Correlation Coefficient (CCC) in continuous emotion recognition by 0.03-0.09, compared to the same model trained from scratch. We also demonstrate the robustness of fine-tuning the pre-trained model in a low-resource setting. With only 10% of the original training set provided, fine-tuning the pre-trained model leads to at least 10% higher emotion recognition accuracy and a CCC improvement of at least 0.1 for continuous emotion recognition.
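For reference, the regression results above are scored with the Concordance Correlation Coefficient. The following is a minimal sketch of the standard CCC definition as it might be applied to continuous emotion predictions on MSP-IMPROV; the function name concordance_ccc and the NumPy-based implementation are illustrative assumptions, not the paper's own evaluation code.

import numpy as np

def concordance_ccc(preds, labels):
    # Concordance Correlation Coefficient:
    #   2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
    # Illustrative helper, not taken from the paper's evaluation pipeline.
    preds = np.asarray(preds, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    mean_p, mean_l = preds.mean(), labels.mean()
    var_p, var_l = preds.var(), labels.var()
    cov = np.mean((preds - mean_p) * (labels - mean_l))
    return 2.0 * cov / (var_p + var_l + (mean_p - mean_l) ** 2)

A CCC of 1 indicates perfect agreement between predicted and reference emotion ratings, so an absolute gain of 0.03-0.09 (or 0.1 in the low-resource setting) is measured on this scale.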