Paper Title
Multi-Temporal Convolutions for Human Action Recognition in Videos
Paper Authors
Paper Abstract
Effective extraction of temporal patterns is crucial for the recognition of temporally varying actions in video. We argue that the fixed-size spatio-temporal convolution kernels used in convolutional neural networks (CNNs) can be improved to extract informative motions that are executed at different time scales. To address this challenge, we present a novel spatio-temporal convolution block that is capable of extracting spatio-temporal patterns at multiple temporal resolutions. Our proposed multi-temporal convolution (MTConv) blocks utilize two branches that focus on brief and prolonged spatio-temporal patterns, respectively. The extracted time-varying features are aligned in a third branch, with respect to global motion patterns, through recurrent cells. The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture, yielding a substantial reduction in computational cost. Extensive experiments on the Kinetics, Moments in Time, and HACS action recognition benchmark datasets demonstrate the competitive performance of MTConvs compared to the state of the art, with a significantly lower computational footprint.
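To make the three-branch idea concrete, the following is a minimal NumPy sketch of an MTConv-style block, not the authors' implementation. It assumes per-channel features of shape (time, channels), stands in a short-kernel and a long-kernel temporal convolution for the brief and prolonged branches, uses a simple exponential moving average as a stand-in for the recurrent alignment branch, and fuses the branches by averaging; all function names, kernel sizes, and the fusion rule are illustrative assumptions.

```python
import numpy as np

def temporal_conv(x, kernel):
    """1-D convolution over the time axis with 'same' padding.
    x: (T, C) feature sequence; kernel: (K,) temporal filter shared across channels."""
    K = len(kernel)
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # Weighted sum of a K-frame temporal window, per channel.
        out[t] = np.tensordot(kernel, xp[t:t + K], axes=(0, 0))
    return out

def recurrent_branch(x, decay=0.5):
    """Toy recurrent branch: an exponential moving average over time,
    standing in for the recurrent cells that align global motion patterns."""
    h = np.zeros(x.shape[1])
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1.0 - decay) * x[t]
        out[t] = h
    return out

def mtconv_block(x, short_k=3, long_k=7):
    """Hypothetical MTConv-style block: a short- and a long-kernel temporal
    branch capture brief and prolonged patterns; a recurrent branch aligns
    them with global motion; the three outputs are fused by averaging."""
    short = temporal_conv(x, np.ones(short_k) / short_k)
    long_ = temporal_conv(x, np.ones(long_k) / long_k)
    rec = recurrent_branch(x)
    return (short + long_ + rec) / 3.0
```

Because each branch preserves the (T, C) shape, the fused output can replace the input feature map in a larger network, which is what allows such a block to be dropped into an existing 3D-CNN architecture.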