Paper Title
Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks
Paper Authors
Paper Abstract
Facial expressions are one of the most powerful ways of depicting specific patterns in human behavior and describing the human emotional state. Despite the impressive advances of affective computing over the last decade, automatic video-based systems for facial expression recognition still cannot properly handle variations in facial expression among individuals, as well as cross-cultural and demographic aspects. Indeed, recognizing facial expressions is a difficult task even for humans. In this paper, we investigate the suitability of state-of-the-art deep learning architectures based on convolutional neural networks (CNNs) for continuous emotion recognition using long video sequences captured in-the-wild. This study focuses on deep learning models that allow encoding spatiotemporal relations in videos while considering a complex and multi-dimensional emotion space, where values of valence and arousal must be predicted. We have developed and evaluated convolutional recurrent neural networks, combining 2D-CNNs and long short-term memory units, as well as inflated 3D-CNN models, which are built by inflating the weights of a pre-trained 2D-CNN model during fine-tuning on application-specific videos. Experimental results on the challenging SEWA-DB dataset show that these architectures can be effectively fine-tuned to encode spatiotemporal information from successive raw pixel images and achieve state-of-the-art results on this dataset.
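The convolutional recurrent architecture described above (a 2D-CNN extracting per-frame features, followed by recurrent units regressing frame-level valence and arousal) can be sketched as follows. This is a minimal illustration assuming PyTorch; the class name, layer sizes, and feature dimensions are hypothetical and are not the authors' implementation.

```python
import torch
import torch.nn as nn


class CNNLSTMRegressor(nn.Module):
    """Illustrative CNN-LSTM: per-frame 2D-CNN features are fed to an LSTM,
    which regresses valence and arousal for every frame of the sequence.
    All dimensions here are placeholders, not the paper's configuration."""

    def __init__(self, feat_dim=64, hidden=32):
        super().__init__()
        # Small 2D-CNN applied independently to each frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # LSTM models temporal dependencies across frame features.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Two continuous outputs per frame: valence and arousal.
        self.head = nn.Linear(hidden, 2)

    def forward(self, x):
        # x: (batch, time, 3, H, W) raw pixel frames
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        seq, _ = self.lstm(feats)
        return self.head(seq)  # (batch, time, 2)
```

For a batch of 2 clips of 5 RGB frames at 32x32 resolution, `CNNLSTMRegressor()(torch.zeros(2, 5, 3, 32, 32))` yields a tensor of shape `(2, 5, 2)`, i.e. one (valence, arousal) pair per frame. The inflated 3D-CNN variant would instead copy pre-trained 2D convolution kernels along the temporal axis, an idea this sketch does not cover.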