Paper Title
An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos
Paper Authors
Paper Abstract
Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attention into an audio 2D CNN. Further, we design a special classification loss, i.e., polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet.
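To make the loss design concrete, below is a minimal PyTorch sketch of a polarity-consistent cross-entropy loss in the spirit described by the abstract: standard cross-entropy with an extra penalty whenever the predicted emotion's polarity disagrees with the ground-truth polarity. The `POLARITY` class-to-polarity mapping, the `penalty` weight, and the multiplicative form of the penalty term are illustrative assumptions, not the paper's published formulation (see the released source code for the authors' exact loss).

```python
import torch
import torch.nn.functional as F

# Assumed class-to-polarity mapping (0 = positive, 1 = negative) for an
# 8-class setup such as VideoEmotion-8; the actual grouping is hypothetical.
POLARITY = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])

def polarity_consistent_ce(logits, targets, penalty=0.5):
    """Cross-entropy scaled up when the predicted emotion's polarity
    disagrees with the ground-truth polarity -- a sketch of the idea,
    not the paper's exact formula."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample CE
    preds = logits.argmax(dim=1)
    pol = POLARITY.to(logits.device)
    mismatch = (pol[preds] != pol[targets]).float()  # 1 where polarities differ
    return (ce * (1.0 + penalty * mismatch)).mean()

# Usage: logits would come from the fused visual-audio head; labels are
# integer emotion indices.
logits = torch.randn(4, 8)
targets = torch.tensor([0, 3, 5, 7])
loss = polarity_consistent_ce(logits, targets)
```

The intuition behind scaling rather than adding a separate term is that a polarity-inconsistent prediction (e.g., predicting a negative emotion for a positive video) is penalized more heavily than a within-polarity confusion, which steers the attention modules toward polarity-discriminative evidence.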