Paper Title

Real-time Online Video Detection with Temporal Smoothing Transformers

Paper Authors

Yue Zhao, Philipp Krähenbühl

Paper Abstract

Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both long-term dynamics and short-term changes of video. Unfortunately, in most existing methods, the computational complexity grows linearly or quadratically with the length of the considered dynamics. This issue is particularly pronounced in transformer-based architectures. To address this issue, we reformulate the cross-attention in a video transformer through the lens of kernels and apply two kinds of temporal smoothing kernels: a box kernel or a Laplace kernel. The resulting streaming attention reuses much of the computation from frame to frame, and only requires a constant-time update each frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead. Specifically, it runs $6\times$ faster than equivalent sliding-window-based transformers with 2,048 frames in a streaming setting. Furthermore, thanks to the increased temporal span, TeSTra achieves state-of-the-art results on THUMOS'14 and EPIC-Kitchens-100, two standard online action detection and action anticipation datasets. A real-time version of TeSTra outperforms all but one prior approach on the THUMOS'14 dataset.
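The constant-time update the abstract describes follows from the kernel choice: with a Laplace (exponential-decay) kernel, the weighted sum over all past frames obeys a simple recurrence, and with a box kernel, a sliding window can be maintained by adding the newest frame and subtracting the oldest. The sketch below illustrates both recurrences on raw feature vectors; it is a simplified illustration of the idea, not TeSTra's actual cross-attention, and the class and parameter names (`LaplaceStream`, `BoxStream`, `lam`, `window`) are invented for this example.

```python
import numpy as np
from collections import deque


class LaplaceStream:
    """Streaming weighted average under a Laplace kernel
    w_i = exp(-lam * (t - i)). Each frame costs O(1): the running
    numerator and denominator are decayed, then the new value is added.
    Illustrative sketch only, not the paper's attention module."""

    def __init__(self, dim, lam=0.1):
        self.decay = np.exp(-lam)
        self.num = np.zeros(dim)  # running weighted sum of values
        self.den = 0.0            # running sum of kernel weights

    def update(self, v):
        self.num = self.decay * self.num + v
        self.den = self.decay * self.den + 1.0
        return self.num / self.den


class BoxStream:
    """Streaming average under a box kernel (uniform weights over the
    last `window` frames), maintained with add-new / subtract-old
    updates, so each frame also costs O(1)."""

    def __init__(self, dim, window=4):
        self.buf = deque(maxlen=window)
        self.sum = np.zeros(dim)

    def update(self, v):
        if len(self.buf) == self.buf.maxlen:
            self.sum -= self.buf[0]   # evict the oldest frame's value
        self.buf.append(v)            # a full deque drops buf[0] here
        self.sum += v
        return self.sum / len(self.buf)


if __name__ == "__main__":
    frames = [np.array([float(i)]) for i in range(10)]

    # Streaming Laplace result matches the direct weighted average.
    lam = 0.1
    laplace = LaplaceStream(dim=1, lam=lam)
    for v in frames:
        out = laplace.update(v)
    t = len(frames) - 1
    w = np.exp(-lam * (t - np.arange(len(frames))))
    direct = (w[:, None] * np.stack(frames)).sum(axis=0) / w.sum()
    print(np.allclose(out, direct))  # True

    # Box kernel over the last 4 of 0..9 gives mean(6,7,8,9) = 7.5.
    box = BoxStream(dim=1, window=4)
    for v in frames:
        box_out = box.update(v)
    print(box_out)  # [7.5]
```

The point of the recurrences is that per-frame cost is independent of the temporal span, which is why TeSTra can cover long histories without the linear or quadratic growth of sliding-window attention.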
