Paper Title
Make Acoustic and Visual Cues Matter: CH-SIMS v2.0 Dataset and AV-Mixup Consistent Module
Paper Authors
Paper Abstract
Multimodal sentiment analysis (MSA), which aims to improve text-based sentiment analysis with associated acoustic and visual modalities, is an emerging research area due to its potential applications in Human-Computer Interaction (HCI). However, existing research observes that the acoustic and visual modalities contribute much less than the textual modality, a phenomenon termed text predominance. In this work, we therefore emphasize making non-verbal cues matter for the MSA task. Firstly, from the resource perspective, we present the CH-SIMS v2.0 dataset, an extension and enhancement of CH-SIMS. Compared with the original dataset, CH-SIMS v2.0 doubles its size with another 2121 refined video segments carrying both unimodal and multimodal annotations, and collects 10161 unlabelled raw video segments with rich acoustic and visual emotion-bearing context to highlight non-verbal cues for sentiment prediction. Secondly, from the model perspective, benefiting from the unimodal annotations and the unsupervised data in CH-SIMS v2.0, we propose the Acoustic Visual Mixup Consistent (AV-MC) framework. The designed modality mixup module can be regarded as an augmentation that mixes the acoustic and visual modalities from different videos. By drawing previously unobserved multimodal contexts along with the text, the model learns to be aware of different non-verbal contexts for sentiment prediction. Our evaluations demonstrate that both CH-SIMS v2.0 and the AV-MC framework enable further research on discovering emotion-bearing acoustic and visual cues and pave the way toward interpretable end-to-end HCI applications in real-world scenarios.
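To make the modality mixup idea more concrete, the following is a minimal, hypothetical sketch of a mixup-style augmentation over the acoustic and visual modalities of two videos. The Beta-sampled mixing coefficient, the feature dimensions, the label interpolation, and all function names are illustrative assumptions for exposition, not the authors' exact AV-MC implementation.

```python
import torch

def modality_mixup(acoustic_a, visual_a, label_a,
                   acoustic_b, visual_b, label_b, alpha=0.5):
    """Mixup-style augmentation over the non-verbal modalities (sketch only).

    Interpolates the acoustic and visual feature sequences of two videos with
    a Beta-sampled coefficient and mixes their sentiment labels accordingly.
    The text modality of the anchor video would be kept unchanged, so the
    model sees the same text paired with an unobserved non-verbal context.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed_acoustic = lam * acoustic_a + (1.0 - lam) * acoustic_b
    mixed_visual = lam * visual_a + (1.0 - lam) * visual_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_acoustic, mixed_visual, mixed_label

# Illustrative usage: two videos with 50-frame acoustic/visual features
# (feature dimensions here are placeholders, not the dataset's actual dims).
a1, v1 = torch.randn(50, 74), torch.randn(50, 35)
a2, v2 = torch.randn(50, 74), torch.randn(50, 35)
y1, y2 = torch.tensor(0.8), torch.tensor(-0.4)   # sentiment scores
mixed_a, mixed_v, mixed_y = modality_mixup(a1, v1, y1, a2, v2, y2)
```

Pairing the original text with such mixed acoustic and visual inputs exposes the model to non-verbal contexts that never co-occurred with that text in the training data, which is the intuition behind using mixup to counter text predominance.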