Paper Title

Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition

Paper Authors

Zihan Zhao, Yanfeng Wang, Yu Wang

Paper Abstract

Research on and applications of multimodal emotion recognition have become increasingly popular in recent years. However, multimodal emotion recognition faces the challenge of data scarcity. To address this problem, we propose to use transfer learning, leveraging state-of-the-art pre-trained models, including wav2vec 2.0 and BERT, for this task. Multi-level fusion approaches are explored, including co-attention-based early fusion and late fusion of models trained on the two embeddings. In addition, a multi-granularity framework is proposed to further boost performance: it extracts not only frame-level speech embeddings but also segment-level embeddings at the phone, syllable, and word levels. By combining our co-attention-based early fusion model and late fusion model with the multi-granularity feature extraction framework, we obtain results that outperform the best baseline approaches by 1.3% in unweighted accuracy (UA) on the IEMOCAP dataset.
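
To make the fusion idea concrete, below is a minimal PyTorch sketch of co-attention-based early fusion. It is not the authors' implementation: the class name CoAttentionFusion, the 768-dimensional embeddings, the mean-pooling step, and the four-class output (a common IEMOCAP setup) are illustrative assumptions. In practice, the speech and text sequences would come from pretrained wav2vec 2.0 and BERT encoders rather than random tensors.

import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Sketch of co-attention early fusion: speech and text embeddings
    cross-attend to each other, then the attended sequences are pooled
    and concatenated for emotion classification."""
    def __init__(self, speech_dim=768, text_dim=768, num_heads=8, num_classes=4):
        super().__init__()
        # Cross-attention in both directions (speech -> text, text -> speech).
        self.speech_to_text = nn.MultiheadAttention(speech_dim, num_heads, batch_first=True)
        self.text_to_speech = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(speech_dim + text_dim, num_classes)

    def forward(self, speech_emb, text_emb):
        # speech_emb: (B, T_s, 768), e.g. wav2vec 2.0 frame features.
        # text_emb:   (B, T_t, 768), e.g. BERT token features.
        attended_speech, _ = self.speech_to_text(query=speech_emb, key=text_emb, value=text_emb)
        attended_text, _ = self.text_to_speech(query=text_emb, key=speech_emb, value=speech_emb)
        # Mean-pool each attended sequence and concatenate (one common pooling choice).
        fused = torch.cat([attended_speech.mean(dim=1), attended_text.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Toy usage with random tensors standing in for encoder outputs.
model = CoAttentionFusion()
speech = torch.randn(2, 200, 768)  # stand-in for frame-level wav2vec 2.0 features
text = torch.randn(2, 30, 768)     # stand-in for token-level BERT features
logits = model(speech, text)       # (2, 4): one score per emotion class

Late fusion, by contrast, would train separate classifiers on the speech and text embeddings and combine their predictions, for example by averaging the output scores.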
