Paper Title


STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition

Paper Authors

Xu Li, Jingwen Wang, Lin Ma, Kaihao Zhang, Fengzong Lian, Zhanhui Kang, Jinjun Wang

Paper Abstract


Effective and efficient spatio-temporal modeling is essential for action recognition. Existing methods suffer from a trade-off between model performance and model complexity. In this paper, we present a novel Spatio-Temporal Hybrid Convolution Network (denoted as "STH") which simultaneously encodes spatial and temporal video information at a small parameter cost. Different from existing works that extract spatial and temporal information sequentially or in parallel with different convolutional layers, we divide the input channels into multiple groups and interleave the spatial and temporal operations in one convolutional layer, which deeply incorporates spatial and temporal cues. Such a design enables efficient spatio-temporal modeling while maintaining a small model scale. STH-Conv is a general building block that can be plugged into existing 2D CNN architectures such as ResNet and MobileNet by replacing the conventional 2D-Conv blocks (2D convolutions). The STH network achieves competitive or even better performance than its competitors on benchmark datasets such as Something-Something (V1 & V2), Jester, and HMDB-51. Moreover, STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
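The abstract's central idea, splitting input channels into groups and interleaving spatial and temporal operations inside a single convolutional layer, can be sketched as follows. This is a minimal illustrative sketch and not the authors' implementation: the mean filters stand in for learned 2D spatial and 1D temporal convolutions, and the names (`spatial_op`, `temporal_op`, `sth_like_layer`) as well as the simple alternating group assignment are assumptions made for illustration only.

```python
def spatial_op(frames):
    """Per-frame 3x3 mean filter (stand-in for a learned 2D spatial conv)."""
    out = []
    for f in frames:  # f: H x W grid for one time step
        H, W = len(f), len(f[0])
        g = [[0.0] * W for _ in range(H)]
        for i in range(H):
            for j in range(W):
                vals = [f[ii][jj]
                        for ii in range(max(0, i - 1), min(H, i + 2))
                        for jj in range(max(0, j - 1), min(W, j + 2))]
                g[i][j] = sum(vals) / len(vals)
        out.append(g)
    return out

def temporal_op(frames):
    """3-frame temporal mean (stand-in for a learned 1D temporal conv)."""
    T = len(frames)
    H, W = len(frames[0]), len(frames[0][0])
    out = []
    for t in range(T):
        window = frames[max(0, t - 1):min(T, t + 2)]
        g = [[sum(f[i][j] for f in window) / len(window)
              for j in range(W)] for i in range(H)]
        out.append(g)
    return out

def sth_like_layer(clip, n_groups=4):
    """clip: list over channels, each a list over T frames of HxW grids.
    Channels are split into groups, and spatial and temporal operations
    alternate across the groups, so both cues are mixed in one layer."""
    out = []
    group_size = max(1, len(clip) // n_groups)
    for c, channel in enumerate(clip):
        op = spatial_op if (c // group_size) % 2 == 0 else temporal_op
        out.append(op(channel))
    return out
```

Because the two operations act on disjoint channel groups of the same layer input, the parameter cost stays close to that of a plain 2D convolution, which is the efficiency argument the abstract makes.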
