Paper Title

Real-time Online Video Detection with Temporal Smoothing Transformers

Paper Authors

Yue Zhao, Philipp Krähenbühl

Paper Abstract

Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both long-term dynamics and short-term changes of video. Unfortunately, in most existing methods, the computational complexity grows linearly or quadratically with the length of the considered dynamics. This issue is particularly pronounced in transformer-based architectures. To address this issue, we reformulate the cross-attention in a video transformer through the lens of kernels and apply two kinds of temporal smoothing kernels: a box kernel or a Laplace kernel. The resulting streaming attention reuses much of the computation from frame to frame, and only requires a constant-time update each frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead. Specifically, it runs $6\times$ faster than equivalent sliding-window-based transformers with 2,048 frames in a streaming setting. Furthermore, thanks to the increased temporal span, TeSTra achieves state-of-the-art results on THUMOS'14 and EPIC-Kitchens-100, two standard online action detection and action anticipation datasets. A real-time version of TeSTra outperforms all but one prior approach on the THUMOS'14 dataset.
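The constant-time update the abstract describes follows from the kernel choice: with a Laplace (exponential-decay) kernel, the weighted sum over all past frames obeys a simple recurrence, and with a box kernel, a sliding window can be maintained by adding the newest frame and subtracting the oldest. The sketch below illustrates both recurrences on raw feature vectors; it is a simplified illustration of the idea, not TeSTra's actual cross-attention, and the class and parameter names (`LaplaceStream`, `BoxStream`, `lam`, `window`) are invented for this example.

```python
import numpy as np
from collections import deque


class LaplaceStream:
    """Streaming weighted average under a Laplace kernel
    w_i = exp(-lam * (t - i)). Each frame costs O(1): the running
    numerator and denominator are decayed, then the new value is added.
    Illustrative sketch only, not the paper's attention module."""

    def __init__(self, dim, lam=0.1):
        self.decay = np.exp(-lam)
        self.num = np.zeros(dim)  # running weighted sum of values
        self.den = 0.0            # running sum of kernel weights

    def update(self, v):
        self.num = self.decay * self.num + v
        self.den = self.decay * self.den + 1.0
        return self.num / self.den


class BoxStream:
    """Streaming average under a box kernel (uniform weights over the
    last `window` frames), maintained with add-new / subtract-old
    updates, so each frame also costs O(1)."""

    def __init__(self, dim, window=4):
        self.buf = deque(maxlen=window)
        self.sum = np.zeros(dim)

    def update(self, v):
        if len(self.buf) == self.buf.maxlen:
            self.sum -= self.buf[0]   # evict the oldest frame's value
        self.buf.append(v)            # a full deque drops buf[0] here
        self.sum += v
        return self.sum / len(self.buf)


if __name__ == "__main__":
    frames = [np.array([float(i)]) for i in range(10)]

    # Streaming Laplace result matches the direct weighted average.
    lam = 0.1
    laplace = LaplaceStream(dim=1, lam=lam)
    for v in frames:
        out = laplace.update(v)
    t = len(frames) - 1
    w = np.exp(-lam * (t - np.arange(len(frames))))
    direct = (w[:, None] * np.stack(frames)).sum(axis=0) / w.sum()
    print(np.allclose(out, direct))  # True

    # Box kernel over the last 4 of 0..9 gives mean(6,7,8,9) = 7.5.
    box = BoxStream(dim=1, window=4)
    for v in frames:
        box_out = box.update(v)
    print(box_out)  # [7.5]
```

The point of the recurrences is that per-frame cost is independent of the temporal span, which is why TeSTra can cover long histories without the linear or quadratic growth of sliding-window attention.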
