Paper Title
Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics
Paper Authors

Paper Abstract
Human-Object Interaction (HOI) detection is an essential task for understanding human-centric images from a fine-grained perspective. Although end-to-end HOI detection models are thriving, their paradigm of parallel human/object detection and verb class prediction loses the merit of two-stage methods: the object-guided hierarchy. The object in an HOI triplet gives direct clues about the verb to be predicted. In this paper, we aim to boost end-to-end models with object-guided statistical priors. Specifically, we propose to utilize a Verb Semantic Model (VSM) and use semantic aggregation to profit from this object-guided hierarchy. A Similarity KL (SKL) loss is proposed to optimize the VSM to align with the HOI dataset's priors. To overcome the static semantic embedding problem, we propose to generate cross-modality-aware visual and semantic features via Cross-Modal Calibration (CMC). The above modules combine to compose the Object-guided Cross-modal Calibration Network (OCN). Experiments conducted on two popular HOI detection benchmarks demonstrate the significance of incorporating statistical prior knowledge and produce state-of-the-art performance. More detailed analysis indicates that the proposed modules serve as a stronger verb predictor and a superior way of utilizing prior knowledge. The code is available at \url{https://github.com/JacobYuan7/OCN-HOI-Benchmark}.
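To make the SKL idea concrete, here is a minimal, hypothetical sketch of a Similarity-KL-style loss: the similarity between object and verb semantic embeddings is softmaxed into a predicted verb distribution per object class and pulled toward the empirical P(verb | object) counted from the dataset. All names, shapes, and the temperature hyperparameter are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def skl_loss(obj_emb, verb_emb, verb_prior, tau=0.1):
    """Hedged sketch of a Similarity-KL (SKL) loss (hypothetical shapes).

    obj_emb:    (O, D) object-class semantic embeddings.
    verb_emb:   (V, D) verb semantic embeddings (the VSM being optimized).
    verb_prior: (O, V) empirical P(verb | object) from training-set
                statistics; each row sums to 1.
    tau:        softmax temperature (assumed hyperparameter).
    """
    # Cosine similarity between every object class and every verb embedding.
    sim = F.normalize(obj_emb, dim=-1) @ F.normalize(verb_emb, dim=-1).t()  # (O, V)
    # Convert similarities into a predicted verb distribution per object.
    log_pred = F.log_softmax(sim / tau, dim=-1)
    # KL(prior || predicted), averaged over object classes, aligns the
    # similarity structure of the VSM with the dataset's object-guided prior.
    return F.kl_div(log_pred, verb_prior, reduction="batchmean")
```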
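Likewise, a plausible reading of Cross-Modal Calibration is a pair of cross-attention passes in which each modality attends to the other, so the verb semantic embeddings are no longer static but conditioned on the image. The sketch below assumes this cross-attention form; layer names, dimensions, and the residual wiring are illustrative guesses rather than the authors' architecture.

```python
import torch
from torch import nn

class CrossModalCalibration(nn.Module):
    """Hedged sketch of a CMC-style block: visual tokens are calibrated by
    semantic embeddings and vice versa via cross-attention (assumed design)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.sem_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_sem = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, sem):
        # vis: (B, N, D) visual tokens; sem: (B, V, D) verb semantic embeddings.
        vis_out, _ = self.sem_to_vis(vis, sem, sem)  # vision attends to semantics
        sem_out, _ = self.vis_to_sem(sem, vis, vis)  # semantics attend to vision
        # Residual connections keep the original features and add the
        # cross-modality-aware calibration on top.
        return vis + vis_out, sem + sem_out
```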