Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection

Paper Title

Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection

Authors

Li, Hongsheng; Zhu, Guangming; Zhen, Wu; Ni, Lan; Shen, Peiyi; Zhang, Liang; Wang, Ning; Hua, Cong

Abstract

The key to Human-Object Interaction (HOI) recognition is to infer the relationship between humans and objects. Recently, image-based HOI detection has made significant progress. However, there is still room for improvement in video HOI detection. Existing one-stage methods use well-designed end-to-end networks to detect a video segment and directly predict an interaction, which makes model learning and further optimization of the network more complex. This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as input in the form of a spatio-temporal graph with human and object nodes. Unlike existing methods, our proposed network predicts the difference between interactive and non-interactive pairs through explicit spatial parsing, and then performs interaction recognition. Moreover, we propose a learnable and differentiable Dynamic Temporal Module (DTM) to emphasize the keyframes of the video and suppress redundant frames. Furthermore, the experimental results show that SPDTP can pay more attention to active human-object pairs and valid keyframes. Overall, we achieve state-of-the-art performance on the CAD-120 and Something-Else datasets.
