Paper Title
Multi-Modal Emotion Detection with Transfer Learning
Paper Authors
Paper Abstract
Automated emotion detection in speech is a challenging task due to the complex interdependence between words and the manner in which they are spoken. It is made more difficult by the available datasets; their small size and incompatible labeling idiosyncrasies make it hard to build generalizable emotion detection systems. To address these two challenges, we present a multi-modal approach that first transfers learning from related tasks in speech and text to produce robust neural embeddings and then uses these embeddings to train a pLDA classifier that is able to adapt to previously unseen emotions and domains. We begin by training a multilayer TDNN on the task of speaker identification with the VoxCeleb corpora and then fine-tune it on the task of emotion identification with the Crema-D corpus. Using this network, we extract speech embeddings for Crema-D from each of its layers, generate and concatenate text embeddings for the accompanying transcripts using a fine-tuned BERT model, and then train an LDA-pLDA classifier on the resulting dense representations. We exhaustively evaluate the predictive power of every component: the TDNN alone, speech embeddings from each of its layers alone, text embeddings alone, and every combination thereof. Our best variant, trained on only VoxCeleb and Crema-D and evaluated on IEMOCAP, achieves an EER of 38.05%. Including a portion of IEMOCAP during training produces a 5-fold averaged EER of 25.72% (for comparison, 44.71% of the gold-label annotations include at least one annotator who disagrees).
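To make the final classification stage of the abstract concrete, the sketch below shows the fusion step it describes: concatenating per-utterance speech embeddings (as would come from a TDNN layer) with text embeddings (as would come from a fine-tuned BERT model), then projecting the fused vectors with LDA before PLDA scoring. This is not the authors' implementation; the embedding arrays are random placeholders with toy dimensions, scikit-learn's LinearDiscriminantAnalysis stands in for the LDA step, and the PLDA scorer is only indicated in a comment since no standard scikit-learn implementation exists.

```python
"""Minimal sketch of the multi-modal LDA-pLDA classification stage.

All inputs are random placeholders; in the described pipeline they would be
speech embeddings extracted from a TDNN layer and text embeddings from a
fine-tuned BERT model, one vector per utterance.
"""
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Toy sizes (assumptions, not the paper's dimensions).
n_utts, speech_dim, text_dim, n_emotions = 400, 128, 192, 6

speech_emb = rng.normal(size=(n_utts, speech_dim))   # stand-in for TDNN-layer embeddings
text_emb = rng.normal(size=(n_utts, text_dim))       # stand-in for BERT transcript embeddings
labels = rng.integers(0, n_emotions, size=n_utts)    # stand-in for emotion labels

# Multi-modal fusion by concatenating the speech and text embedding spaces.
fused = np.concatenate([speech_emb, text_emb], axis=1)

# LDA projects the fused vectors into a low-dimensional, class-discriminative
# space (at most n_classes - 1 components).
lda = LinearDiscriminantAnalysis(n_components=n_emotions - 1)
projected = lda.fit_transform(fused, labels)

# A PLDA model would then be trained on `projected` to score utterances against
# emotion classes; that step is omitted here.
print(projected.shape)  # (400, 5)
```

The concatenation-then-LDA ordering mirrors the abstract's description of training a single LDA-pLDA classifier on the joint dense representation rather than fusing modality-specific scores afterwards.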