Paper Title

Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition

Paper Authors

Kiyoon Kim, Shreyank N. Gowda, Oisin Mac Aodha, Laura Sevilla-Lara

Paper Abstract

We address the problem of capturing temporal information for video classification in 2D networks, without increasing their computational cost. Existing approaches focus on modifying the architecture of 2D networks (e.g. by including filters in the temporal dimension to turn them into 3D networks, or using optical flow, etc.), which increases computation cost. Instead, we propose a novel sampling strategy, where we re-order the channels of the input video, to capture short-term frame-to-frame changes. We observe that without bells and whistles, the proposed sampling strategy improves performance on multiple architectures (e.g. TSN, TRN, TSM, and MVFNet) and datasets (CATER, Something-Something-V1 and V2), up to 24% over the baseline of using the standard video input. In addition, our sampling strategies do not require training from scratch and do not increase the computational cost of training and testing. Given the generality of the results and the flexibility of the approach, we hope this can be widely useful to the video understanding community. Code is available on our website: https://github.com/kiyoon/channel_sampling.
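To make the idea concrete, below is a minimal sketch of one way the channel-sampling idea described in the abstract can be realized: stacking grayscale versions of neighboring frames into the three input channels, so each "frame" fed to an unmodified 2D network already encodes short-term motion. This is only an illustrative assumption, not the authors' exact implementation; the function name `grayscale_stack_channels` and the `stride` parameter are hypothetical, and the actual sampling strategies are in the linked repository.

```python
import numpy as np

def grayscale_stack_channels(video: np.ndarray, stride: int = 1) -> np.ndarray:
    """Illustrative channel-sampling sketch (not the paper's exact code).

    Replaces the RGB channels of each sampled frame with grayscale versions of
    three nearby frames, so a single input frame carries frame-to-frame changes
    across its channels while keeping the standard (T, H, W, 3) input shape.
    """
    T = video.shape[0]
    # ITU-R BT.601 luma weights for RGB -> grayscale, shape (T, H, W).
    gray = video @ np.array([0.299, 0.587, 0.114], dtype=video.dtype)
    out = np.empty_like(video)
    for t in range(T):
        prev_t = max(t - stride, 0)          # clamp at clip boundaries
        next_t = min(t + stride, T - 1)
        # Channels become [gray(t - stride), gray(t), gray(t + stride)].
        out[t] = np.stack([gray[prev_t], gray[t], gray[next_t]], axis=-1)
    return out

# Example: an 8-frame 224x224 RGB clip; the output keeps the same shape, so it
# can be fed to a 2D backbone (e.g. TSN or TSM) without architectural changes.
clip = np.random.rand(8, 224, 224, 3).astype(np.float32)
stacked = grayscale_stack_channels(clip)
print(stacked.shape)  # (8, 224, 224, 3)
```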
