Paper title
"I have vxxx bxx connexxxn!": Facing Packet Loss in Deep Speech Emotion Recognition
Authors
Abstract
In applications that rely on speech emotion recognition, frame loss can be a severe issue: the audio stream loses some data frames for a variety of reasons, such as low bandwidth. In this contribution, we investigate for the first time the effects of frame loss on the performance of speech emotion recognition. We report extensive, reproducible experiments on the popular RECOLA corpus using a state-of-the-art end-to-end deep neural network that consists mainly of convolutional blocks and recurrent layers. A simple environment based on a Markov chain model, governed by two main parameters, is used to model the loss mechanism. We explore matched, mismatched, and multi-condition training settings. As one would expect, the matched setting yields the best performance, while the mismatched setting yields the lowest. Furthermore, we introduce frame loss as a data augmentation technique, a general-purpose strategy for overcoming the effects of frame loss. It can be used during training, and we observed that it produces models that are more robust against frame loss in run-time environments.
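
To make the loss mechanism and the augmentation idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract only says that a Markov chain with two main parameters is used, so the sketch assumes a two-state chain (frame received / frame lost). The names `p_loss`, `p_recover`, `simulate_loss_mask`, and `apply_frame_loss` are hypothetical, and replacing lost frames with zeros is likewise an assumption.

```python
import random


def simulate_loss_mask(num_frames, p_loss, p_recover, seed=None):
    """Return a list of booleans, True where a frame is lost.

    Assumed two-state Markov chain (parameterization is hypothetical,
    not taken from the paper):
      p_loss    = P(next frame lost     | current frame received)
      p_recover = P(next frame received | current frame lost)
    """
    rng = random.Random(seed)
    lost = False  # the chain starts in the "received" state
    mask = []
    for _ in range(num_frames):
        if lost:
            # Stay in the lost state unless the chain recovers.
            lost = rng.random() >= p_recover
        else:
            # Move to the lost state with probability p_loss.
            lost = rng.random() < p_loss
        mask.append(lost)
    return mask


def apply_frame_loss(samples, frame_len, mask):
    """Return a copy of `samples` with every lost frame set to zero.

    Assumption: a lost frame is replaced by silence (zeros).
    """
    out = list(samples)
    for i, lost in enumerate(mask):
        if lost:
            start = i * frame_len
            end = min(start + frame_len, len(out))
            for j in range(start, end):
                out[j] = 0.0
    return out


if __name__ == "__main__":
    # Augment one training utterance: 8 frames of 4 samples each.
    audio = [1.0] * 32
    mask = simulate_loss_mask(num_frames=8, p_loss=0.2, p_recover=0.5, seed=0)
    augmented = apply_frame_loss(audio, frame_len=4, mask=mask)
    print(mask)
    print(augmented)
```

Under this assumed parameterization, the long-run fraction of lost frames is `p_loss / (p_loss + p_recover)` and the mean length of a loss burst is `1 / p_recover` frames. Matched, mismatched, and multi-condition settings would then correspond to using the same, different, or several parameter pairs at training and test time.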