Paper Title
Interpreting video features: a comparison of 3D convolutional networks and convolutional LSTM networks
Paper Authors
Paper Abstract
A number of techniques for interpretability have been presented for deep learning in computer vision, typically with the goal of understanding what the networks have based their classification on. However, interpretability for deep video architectures is still in its infancy and we do not yet have a clear concept of how to decode spatiotemporal features. In this paper, we present a study comparing how 3D convolutional networks and convolutional LSTM networks learn features across temporally dependent frames. This is the first comparison of two video models that both convolve to learn spatial features but have principally different methods of modeling time. Additionally, we extend the concept of meaningful perturbation introduced by \cite{MeaningFulPert} to the temporal dimension, to identify the temporal part of a sequence most meaningful to the network for a classification decision. Our findings indicate that the 3D convolutional model concentrates on shorter events in the input sequence, and places its spatial focus on fewer, contiguous areas.
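To make the temporal extension of meaningful perturbation concrete, below is a minimal, hypothetical PyTorch sketch; it is not the paper's implementation. It assumes a video classifier `model` that takes a tensor of shape (1, T, C, H, W) and returns class logits, and it uses a preservation-style objective: learn a per-frame mask that keeps as few frames as possible while still preserving the target-class score. The function name `temporal_meaningful_perturbation`, the average-pooling stand-in for the original method's Gaussian-blur baseline, and all hyperparameters are illustrative assumptions.

```python
import torch


def temporal_meaningful_perturbation(model, video, target_class,
                                     steps=300, lam=0.05, lr=0.1):
    """Hypothetical sketch: learn a per-frame mask m in [0, 1]^T such that
    keeping only the masked frames (and replacing the rest with a blurred
    baseline) preserves the target-class score, while an L1 penalty keeps
    the mask small. `video` is assumed to have shape (1, T, C, H, W)."""
    T = video.shape[1]
    # Blurred baseline: spatial average pooling of each frame, a simple
    # stand-in for the Gaussian blur used in the spatial method.
    baseline = torch.nn.functional.avg_pool2d(
        video.flatten(0, 1), kernel_size=11, stride=1, padding=5
    ).view_as(video)
    mask_logits = torch.zeros(T, requires_grad=True)
    opt = torch.optim.Adam([mask_logits], lr=lr)
    for _ in range(steps):
        m = torch.sigmoid(mask_logits).view(1, T, 1, 1, 1)  # 1 = keep frame
        perturbed = m * video + (1 - m) * baseline
        score = torch.softmax(model(perturbed), dim=-1)[0, target_class]
        # Maximize the preserved score while penalizing mask size: the
        # smallest set of kept frames that retains the classification
        # marks the temporally most relevant part of the sequence.
        loss = -score + lam * torch.sigmoid(mask_logits).sum() / T
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logits).detach()
```

Under this sketch, frames where the learned mask stays near 1 are those the optimization could not discard without hurting the classification, i.e. the temporal part of the sequence most meaningful to the network's decision.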