Paper Title
Interpreting video features: a comparison of 3D convolutional networks and convolutional LSTM networks
Paper Authors
Paper Abstract
A number of techniques for interpretability have been presented for deep learning in computer vision, typically with the goal of understanding what the networks have based their classification on. However, interpretability for deep video architectures is still in its infancy and we do not yet have a clear concept of how to decode spatiotemporal features. In this paper, we present a study comparing how 3D convolutional networks and convolutional LSTM networks learn features across temporally dependent frames. This is the first comparison of two video models that both convolve to learn spatial features but have principally different methods of modeling time. Additionally, we extend the concept of meaningful perturbation introduced by \cite{MeaningFulPert} to the temporal dimension, to identify the temporal part of a sequence most meaningful to the network for a classification decision. Our findings indicate that the 3D convolutional model concentrates on shorter events in the input sequence, and places its spatial focus on fewer, contiguous areas.
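To make the temporal extension of meaningful perturbation concrete, below is a minimal, hypothetical PyTorch sketch; it is not the paper's implementation. It assumes a video classifier `model` that takes a tensor of shape (1, T, C, H, W) and returns class logits, and it uses a preservation-style objective: learn a per-frame mask that keeps as few frames as possible while still preserving the target-class score. The function name `temporal_meaningful_perturbation`, the average-pooling stand-in for the original method's Gaussian-blur baseline, and all hyperparameters are illustrative assumptions.

```python
import torch


def temporal_meaningful_perturbation(model, video, target_class,
                                     steps=300, lam=0.05, lr=0.1):
    """Hypothetical sketch: learn a per-frame mask m in [0, 1]^T such that
    keeping only the masked frames (and replacing the rest with a blurred
    baseline) preserves the target-class score, while an L1 penalty keeps
    the mask small. `video` is assumed to have shape (1, T, C, H, W)."""
    T = video.shape[1]
    # Blurred baseline: spatial average pooling of each frame, a simple
    # stand-in for the Gaussian blur used in the spatial method.
    baseline = torch.nn.functional.avg_pool2d(
        video.flatten(0, 1), kernel_size=11, stride=1, padding=5
    ).view_as(video)
    mask_logits = torch.zeros(T, requires_grad=True)
    opt = torch.optim.Adam([mask_logits], lr=lr)
    for _ in range(steps):
        m = torch.sigmoid(mask_logits).view(1, T, 1, 1, 1)  # 1 = keep frame
        perturbed = m * video + (1 - m) * baseline
        score = torch.softmax(model(perturbed), dim=-1)[0, target_class]
        # Maximize the preserved score while penalizing mask size: the
        # smallest set of kept frames that retains the classification
        # marks the temporally most relevant part of the sequence.
        loss = -score + lam * torch.sigmoid(mask_logits).sum() / T
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logits).detach()
```

Under this sketch, frames where the learned mask stays near 1 are those the optimization could not discard without hurting the classification, i.e. the temporal part of the sequence most meaningful to the network's decision.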