Paper Title
Dissected 3D CNNs: Temporal Skip Connections for Efficient Online Video Processing
Paper Authors
Paper Abstract
Convolutional Neural Networks with 3D kernels (3D-CNNs) currently achieve state-of-the-art results in video recognition tasks due to their strength in extracting spatiotemporal features from video frames. Many successful 3D-CNN architectures have successively surpassed the state of the art. However, nearly all of them are designed to operate offline, which creates several serious handicaps during online operation. First, conventional 3D-CNNs are not dynamic, since their output features represent the complete input clip rather than the most recent frame in the clip. Second, they do not preserve temporal resolution, due to their inherent temporal downsampling. Last, 3D-CNNs are constrained to a fixed temporal input size, which limits their flexibility. To address these drawbacks, we propose dissected 3D-CNNs, in which the intermediate volumes of the network are dissected and propagated over the depth (time) dimension for future calculations, substantially reducing the number of computations during online operation. For action classification, the dissected versions of ResNet models perform 77-90% fewer computations during online operation while achieving ~5% better classification accuracy on the Kinetics-600 dataset than conventional 3D-ResNet models. Moreover, we demonstrate the advantages of dissected 3D-CNNs by deploying our approach on several vision tasks, where it consistently improves performance.
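
To make the "dissect and propagate" idea concrete, below is a minimal PyTorch sketch of a streaming temporal convolution under assumptions from the abstract: a causal 3D convolution with no temporal downsampling that caches its most recent intermediate volumes and reuses them when the next frame arrives. The class name `StreamingConv3d`, the cache-handling details, and the cold-start padding are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class StreamingConv3d(nn.Module):
    """3D convolution that processes one frame at a time.

    Instead of re-convolving a whole clip for every new frame, the layer
    keeps the last (kernel_depth - 1) input volumes as a cache and
    concatenates them with the incoming frame, so each step costs roughly
    1/clip_length of the offline computation and the output always refers
    to the most recent frame.
    """

    def __init__(self, in_ch, out_ch, kernel_depth=3, spatial_kernel=3):
        super().__init__()
        self.kernel_depth = kernel_depth
        # No temporal padding or stride: temporal resolution is preserved.
        self.conv = nn.Conv3d(
            in_ch, out_ch,
            kernel_size=(kernel_depth, spatial_kernel, spatial_kernel),
            padding=(0, spatial_kernel // 2, spatial_kernel // 2),
        )
        self.cache = None  # holds the last (kernel_depth - 1) frames

    def forward(self, frame):
        # frame: (batch, channels, 1, height, width) -- one time step.
        if self.cache is None:
            # Cold start (assumed): replicate the first frame as padding.
            self.cache = frame.repeat(1, 1, self.kernel_depth - 1, 1, 1)
        clip = torch.cat([self.cache, frame], dim=2)  # depth = kernel_depth
        # Propagate the most recent volumes to the next call (dissection).
        self.cache = clip[:, :, 1:].detach()
        return self.conv(clip)                        # (B, C_out, 1, H, W)


if __name__ == "__main__":
    layer = StreamingConv3d(in_ch=3, out_ch=8)
    stream = torch.randn(1, 3, 16, 112, 112)          # a 16-frame clip
    for t in range(stream.shape[2]):                  # frame-by-frame
        out = layer(stream[:, :, t:t + 1])
        print(t, tuple(out.shape))                    # (1, 8, 1, 112, 112)
```

In this sketch, an offline 3D-CNN would reprocess all 16 frames for every new frame, while the cached layer touches only one new frame per step; stacking such layers is one plausible way to realize the per-layer savings the abstract reports for the full network.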