Paper Title
Dyadic Speech-based Affect Recognition using DAMI-P2C Parent-child Multimodal Interaction Dataset
Paper Authors
Paper Abstract
Automatic speech-based affect recognition of individuals in dyadic conversation is a challenging task, in part because of its heavy reliance on manual pre-processing. Traditional approaches frequently require hand-crafted speech features and segmentation of speaker turns. In this work, we design end-to-end deep learning methods to recognize each person's affective expression in an audio stream with two speakers, automatically discovering the features and time regions relevant to the target speaker's affect. We integrate a local attention mechanism into the end-to-end architecture and compare the performance of three attention implementations -- one mean-pooling and two weighted-pooling methods. Our results show that the proposed weighted-pooling attention solutions are able to learn to focus on the regions containing the target speaker's affective information and successfully extract the individual's valence and arousal intensity. Here we introduce and use the "Dyadic Affect in Multimodal Interaction - Parent to Child" (DAMI-P2C) dataset, collected in a study of 34 families in which a parent and a child (3-7 years old) engage in reading storybooks together. In contrast to existing public datasets for affect recognition, each instance for both speakers in the DAMI-P2C dataset is annotated for perceived affect by three labelers. To encourage more research on the challenging task of multi-speaker affect sensing, we make the annotated DAMI-P2C dataset publicly available, including acoustic features of the dyads' raw audio, affect annotations, and a diverse set of developmental, social, and demographic profiles for each dyad.
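The abstract contrasts a mean-pooling baseline with weighted-pooling (attention) aggregation of frame-level features for predicting a target speaker's valence and arousal. Below is a minimal sketch, in PyTorch, of what such a weighted-pooling head could look like; the module name, feature dimensions, and regression heads are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): weighted-pooling attention over
# frame-level acoustic features, producing valence/arousal estimates for one
# target speaker. All shapes and names are hypothetical.
import torch
import torch.nn as nn


class WeightedPoolingAffect(nn.Module):
    def __init__(self, feat_dim: int = 128, attn_dim: int = 64):
        super().__init__()
        # Frame-level encoder output is assumed to be (batch, time, feat_dim).
        self.score = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        # Two regression outputs: valence and arousal intensity.
        self.head = nn.Linear(feat_dim, 2)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, time, 1)
        pooled = (weights * frames).sum(dim=1)              # (batch, feat_dim)
        return self.head(pooled)                            # (batch, 2)


# Example: a batch of 4 audio windows, 300 frames each, 128-dim features.
model = WeightedPoolingAffect()
out = model(torch.randn(4, 300, 128))  # -> (4, 2): [valence, arousal]

# A mean-pooling baseline would simply replace the weighted sum with
# frames.mean(dim=1), giving every frame equal influence regardless of
# which speaker is active.
```

The intuition matches the abstract's claim: learned frame weights let the model down-weight regions dominated by the non-target speaker, whereas uniform averaging cannot.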