Paper Title
Dawn of the transformer era in speech emotion recognition: closing the valence gap
Paper Authors
Paper Abstract
Recent advances in transformer-based architectures that are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have paid limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without the use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Furthermore, our investigations reveal that transformer-based architectures are more robust to small perturbations compared to a CNN-based baseline, and fair with respect to biological sex groups, but not towards individual speakers. Finally, we are the first to show that their extraordinary success on valence is based on implicit linguistic information learnt during fine-tuning of the transformer layers, which explains why they perform on par with recent multimodal approaches that explicitly utilise textual information. Our findings collectively paint the following picture: transformer-based architectures constitute the new state of the art in SER, but further advances are needed to mitigate remaining robustness and individual speaker issues. To make our findings reproducible, we release the best performing model to the community.
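The fine-tuning setup the abstract describes, a pre-trained speech transformer regressing the three emotional dimensions (arousal, dominance, valence), can be sketched as below. This is a minimal illustration under our own assumptions, not the authors' released code: the checkpoint name, mean pooling over time, and the single linear head are illustrative choices.

```python
import torch
from transformers import Wav2Vec2Model


class DimensionalSER(torch.nn.Module):
    """Hypothetical sketch: pre-trained wav2vec 2.0 encoder plus a
    small regression head for arousal, dominance and valence."""

    def __init__(self, checkpoint: str = "facebook/wav2vec2-large-robust"):
        super().__init__()
        # Pre-trained encoder; its transformer layers are fine-tuned
        # along with the head during training.
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        self.head = torch.nn.Linear(hidden, 3)  # arousal, dominance, valence

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: raw 16 kHz audio, shape (batch, samples)
        states = self.encoder(waveform).last_hidden_state  # (B, T, H)
        pooled = states.mean(dim=1)  # average pooling over time
        return self.head(pooled)     # (B, 3)


# Dummy usage: one second of random 16 kHz "audio".
model = DimensionalSER()
wave = torch.randn(1, 16000)
print(model(wave).shape)  # torch.Size([1, 3])
```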
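The evaluation metric quoted above, the concordance correlation coefficient (CCC), combines Pearson correlation with a penalty for mean and variance mismatch between predictions and gold labels; 1 indicates perfect agreement. A minimal NumPy implementation for reference (the function name `ccc` is ours, not from the paper's code):

```python
import numpy as np


def ccc(pred: np.ndarray, gold: np.ndarray) -> float:
    """Concordance correlation coefficient (Lin, 1989):
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mu_p, mu_g = pred.mean(), gold.mean()
    cov = ((pred - mu_p) * (gold - mu_g)).mean()
    return 2 * cov / (pred.var() + gold.var() + (mu_p - mu_g) ** 2)
```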