Paper Title

Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification

Authors

Youngwan Lee, Hyung-Il Kim, Kimin Yun, Jinyoung Moon

Abstract

Video classification research has recently attracted attention in two areas: temporal modeling and efficient 3D architectures. However, temporal modeling methods are not efficient, while efficient 3D architectures pay less attention to temporal modeling. To bridge the gap between them, we propose an efficient temporal-modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D. T-OSA is devised to build a feature hierarchy by aggregating temporal features with different temporal receptive fields. Stacking T-OSA modules enables the network itself to model short-range as well as long-range temporal relationships across frames without any external modules. Inspired by kernel factorization and channel factorization, we also design a depthwise spatiotemporal factorization module, named D(2+1)D, that decomposes a 3D depthwise convolution into a spatial and a temporal depthwise convolution, making our network more lightweight and efficient. Using the proposed temporal modeling method (T-OSA) and the efficient factorized component (D(2+1)D), we construct two types of VoV3D networks, VoV3D-M and VoV3D-L. Thanks to the efficiency and effectiveness of its temporal modeling, VoV3D-L has 6x fewer model parameters and requires 16x less computation while surpassing a state-of-the-art temporal modeling method on both Something-Something and Kinetics-400. Furthermore, VoV3D shows better temporal modeling ability than a state-of-the-art efficient 3D architecture, X3D, at comparable model capacity. We hope that VoV3D can serve as a baseline for efficient video classification.
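The D(2+1)D factorization described in the abstract replaces a single 3D depthwise convolution with a spatial (1 x k x k) depthwise convolution followed by a temporal (t x 1 x 1) depthwise convolution. A minimal sketch of the resulting parameter saving (the channel count and kernel sizes below are illustrative assumptions, not the paper's exact configuration):

```python
# Sketch (not the authors' code): parameter counts for a 3D depthwise
# convolution versus a D(2+1)D-style factorization of it.

def depthwise3d_params(channels: int, t: int, k: int) -> int:
    """A single t x k x k depthwise 3D conv: one t*k*k kernel per channel."""
    return channels * t * k * k

def d2plus1d_params(channels: int, t: int, k: int) -> int:
    """Factorized form: a 1 x k x k spatial depthwise conv followed by
    a t x 1 x 1 temporal depthwise conv, each with one kernel per channel."""
    return channels * k * k + channels * t

if __name__ == "__main__":
    C, t, k = 64, 3, 3  # assumed example values
    print(depthwise3d_params(C, t, k))  # 64 * 3 * 3 * 3 = 1728
    print(d2plus1d_params(C, t, k))     # 64 * 9 + 64 * 3 = 768
```

Because depthwise convolution applies one kernel per channel, the factorized parameter count grows as k*k + t instead of t*k*k per channel, which is where the lightweight design comes from.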
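To see why T-OSA yields features with diverse temporal receptive fields, consider stacked temporal convolutions inside one module: the i-th successive convolution widens the temporal span, and one-shot aggregation (as in VoVNet-style OSA) concatenates all intermediate outputs at once. A hedged sketch of the receptive-field arithmetic, with assumed depth and kernel size:

```python
# Sketch (illustrative, not the authors' implementation): temporal
# receptive fields exposed when one-shot aggregation collects the
# outputs of several stacked temporal convolutions.

def temporal_receptive_fields(num_convs: int, t: int) -> list:
    """After i stacked stride-1 temporal convs of kernel size t, the
    output at depth i spans 1 + i*(t-1) frames. One-shot aggregation
    concatenates all depths, so the module sees every span at once."""
    return [1 + i * (t - 1) for i in range(1, num_convs + 1)]

if __name__ == "__main__":
    # assumed example: 4 stacked convs with temporal kernel size 3
    print(temporal_receptive_fields(4, 3))  # [3, 5, 7, 9]
```

Stacking such modules compounds these spans further, which is how the abstract's claim of modeling both short-range and long-range relations without external modules can be read.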
