Paper Title
Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching
Paper Authors
Paper Abstract
Automatic emotion recognition is an active research topic with a wide range of applications. Due to the high cost of manual annotation and the inevitable ambiguity of labels, emotion recognition datasets remain limited in both scale and quality. One of the key challenges is therefore how to build effective models with limited data resources. Previous works have explored different approaches to tackle this challenge, including data enhancement, transfer learning, and semi-supervised learning. However, these existing approaches suffer from weaknesses such as training instability, large performance loss during transfer, or only marginal improvement. In this work, we propose a novel semi-supervised multi-modal emotion recognition model based on cross-modality distribution matching, which leverages abundant unlabeled data to enhance model training under the assumption that the inner emotional state is consistent across modalities at the utterance level. We conduct extensive experiments to evaluate the proposed model on two benchmark datasets, IEMOCAP and MELD. The experimental results demonstrate that the proposed semi-supervised learning model can effectively utilize unlabeled data and combine multiple modalities to boost emotion recognition performance, outperforming other state-of-the-art approaches under the same conditions. The proposed model also achieves competitive performance compared with existing approaches that exploit additional auxiliary information such as speaker identity and interaction context.
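To make the core idea concrete, the following is a minimal illustrative sketch of a semi-supervised objective built on the abstract's assumption that all modalities of one unlabeled utterance share the same underlying emotion. It is not the authors' implementation: the function names, the use of symmetric KL divergence as the matching criterion, and the weighting factor `lambda_u` are assumptions for exposition only.

```python
# Sketch: supervised cross-entropy on labeled utterances plus a
# cross-modal distribution-matching term on unlabeled utterances.
# Assumes each modality (e.g., audio, text, video) produces its own
# utterance-level emotion logits. Not the paper's exact formulation.
import torch
import torch.nn.functional as F


def supervised_loss(logits_per_modality, labels):
    """Cross-entropy on labeled utterances, averaged over modalities."""
    return sum(F.cross_entropy(logits, labels)
               for logits in logits_per_modality) / len(logits_per_modality)


def distribution_matching_loss(logits_per_modality):
    """Symmetric KL between the predicted emotion distributions of every
    pair of modalities for the same (unlabeled) utterance."""
    probs = [F.softmax(logits, dim=-1) for logits in logits_per_modality]
    log_probs = [F.log_softmax(logits, dim=-1) for logits in logits_per_modality]
    loss, num_pairs = 0.0, 0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            loss = loss + F.kl_div(log_probs[i], probs[j], reduction="batchmean") \
                        + F.kl_div(log_probs[j], probs[i], reduction="batchmean")
            num_pairs += 1
    return loss / max(num_pairs, 1)


# Hypothetical combined objective for one training step:
# total_loss = supervised_loss(labeled_logits, labels) \
#            + lambda_u * distribution_matching_loss(unlabeled_logits)
```

In this kind of setup, the unlabeled term supplies a training signal without labels: whenever two modality-specific classifiers disagree on an unlabeled utterance, minimizing the matching loss pushes their predicted emotion distributions toward each other.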