Paper Title

Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning

Authors

Rui Liu, Berrak Sisman, Björn Schuller, Guanglai Gao, Haizhou Li

Abstract

Emotion classification of speech and assessment of the emotion strength are required in applications such as emotional text-to-speech and voice conversion. An emotion attribute ranking function based on Support Vector Machines (SVM) was previously proposed to predict emotion strength for emotional speech corpora. However, the trained ranking function does not generalize to new domains, which limits the scope of applications, especially for out-of-domain or unseen speech. In this paper, we propose a data-driven deep learning model, StrengthNet, to improve the generalization of emotion strength assessment to seen and unseen speech. This is achieved by fusing emotional data from various domains. We follow a multi-task learning network architecture that includes an acoustic encoder, a strength predictor, and an auxiliary emotion predictor. Experiments show that the emotion strength predicted by the proposed StrengthNet is highly correlated with ground-truth scores for both seen and unseen speech. We release the source code at: https://github.com/ttslr/StrengthNet.
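The multi-task architecture described in the abstract (a shared acoustic encoder feeding a strength predictor and an auxiliary emotion predictor) can be sketched as a minimal forward pass. This is an illustrative sketch only, not the released StrengthNet implementation: the layer shapes, head designs, variable names, and the use of plain NumPy are all assumptions made for clarity.

```python
# Minimal multi-task sketch (assumed shapes/names, not the authors' code):
# a shared encoder produces frame-level features; one head predicts an
# utterance-level emotion strength score, another predicts emotion class.
import numpy as np

rng = np.random.default_rng(0)

N_FRAMES, N_MELS = 120, 80       # assumed mel-spectrogram input shape
HID, N_EMOTIONS = 32, 5          # assumed hidden size / number of emotions

# Shared acoustic encoder: a single dense layer with tanh, applied per frame.
W_enc = rng.standard_normal((N_MELS, HID)) * 0.1
W_str = rng.standard_normal((HID, 1)) * 0.1            # strength head
W_emo = rng.standard_normal((HID, N_EMOTIONS)) * 0.1   # auxiliary emotion head

def forward(mel):
    """Return (utterance strength in [0, 1], emotion class probabilities)."""
    h = np.tanh(mel @ W_enc)                            # (frames, HID) shared features

    # Strength head: sigmoid frame scores, mean-pooled to one utterance score.
    frame_scores = 1.0 / (1.0 + np.exp(-(h @ W_str)))
    strength = float(frame_scores.mean())

    # Auxiliary emotion head: mean-pool features, then softmax over classes.
    logits = h.mean(axis=0) @ W_emo
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return strength, probs

mel = rng.standard_normal((N_FRAMES, N_MELS))           # stand-in for real features
strength, emotion_probs = forward(mel)
print(strength, emotion_probs.argmax())
```

In a trained model, both heads would share the encoder's gradients, so the auxiliary emotion-classification loss regularizes the features used for strength prediction; that shared-representation idea is what the multi-task design in the abstract relies on.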
