Pyramid Region-based Slot Attention Network for Temporal Action Proposal Generation

Paper title

Pyramid Region-based Slot Attention Network for Temporal Action Proposal Generation

Authors

Shuaicheng Li, Feng Zhang, Rui-Wei Zhao, Rui Feng, Kunlin Yang, Lingbo Liu, Jun Hou

Abstract

Temporal action proposal generation aims to localize action instances in untrimmed videos by their start and end frames, and it has been found to benefit largely from proper exploitation of temporal and semantic context. Recent efforts consider temporal context and similarity-based semantic context through self-attention modules. However, they still suffer from cluttered background information and limited contextual feature learning. In this paper, we propose a novel Pyramid Region-based Slot Attention (PRSlot) module to address these issues. Instead of using similarity computation, our PRSlot module directly learns local relations in an encoder-decoder manner and generates an enhanced representation of a local region, called a slot, based on attention over the input features. Specifically, given the input snippet-level features, the PRSlot module takes the target snippet as the query and its surrounding region as the key, and then generates a slot representation for each query-key slot by aggregating the local snippet context with a parallel pyramid strategy. Based on PRSlot modules, we present a novel Pyramid Region-based Slot Attention Network, termed PRSA-Net, to learn a unified visual representation with rich temporal and semantic context for better proposal generation. Extensive experiments are conducted on two widely adopted benchmarks, THUMOS14 and ActivityNet-1.3. Our PRSA-Net outperforms other state-of-the-art methods. In particular, we improve AR@100 from the previous best of 50.67% to 56.12% for proposal generation, and raise the mAP at 0.5 tIoU from 51.9% to 58.7% for action detection on THUMOS14. Code is available at https://github.com/handhand123/PRSA-Net
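
The detection result above is reported at a temporal IoU (tIoU) threshold of 0.5. The abstract does not define tIoU, so the following is only a minimal sketch of the standard definition (overlap duration divided by union duration of two segments). The function names `temporal_iou` and `is_match`, and the (start, end) tuple layout in seconds, are assumptions made for illustration and are not taken from the PRSA-Net code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds.

    Assumed layout: each segment is a (start, end) tuple with start <= end.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """Hypothetical helper: a proposal counts as a match at the given tIoU threshold."""
    return temporal_iou(pred, gt) >= threshold
```

For example, a predicted segment (2.0, 10.0) against a ground-truth segment (0.0, 10.0) overlaps for 8 seconds out of a 10-second union, so it would count as a match at the 0.5 threshold.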
