Paper Title
Tackling the Unannotated: Scene Graph Generation with Bias-Reduced Models
Paper Authors
Paper Abstract
Predicting a scene graph that captures visual entities and their interactions in an image has been considered a crucial step towards full scene comprehension. Recent scene graph generation (SGG) models have shown their capability of capturing the most frequent relations among visual entities. However, the state-of-the-art results are still far from satisfactory; e.g., models can obtain 31% in overall recall R@100, whereas the equally important mean class-wise recall mR@100 is only around 8% on Visual Genome (VG). The discrepancy between R and mR urges a shift in focus from pursuing a high R to pursuing a high mR with a still-competitive R. We suspect that the observed discrepancy stems from both the annotation bias and the sparse annotations in VG, in which many visual entity pairs are either not annotated at all or annotated with only a single relation when multiple ones could be valid. To address this issue, we propose a novel SGG training scheme that capitalizes on self-learned knowledge. It involves two relation classifiers, one offering a less biased setting for the other to build on. The proposed scheme can be applied to most existing SGG models and is straightforward to implement. We observe significant relative improvements in mR (between +6.6% and +20.4%) and competitive or better R (between -2.4% and +0.3%) across all standard SGG tasks.
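The gap between R@K and mR@K that the abstract highlights becomes concrete when the two measures are written out. Below is a minimal sketch (not the paper's evaluation code) of both metrics for a single image, assuming the predicted (subject, predicate, object) triplets are already sorted by confidence; the tuple representation and function names are illustrative assumptions.

```python
from collections import defaultdict

def recall_at_k(gt_triplets, pred_triplets, k=100):
    """Overall recall R@K: fraction of ground-truth triplets recovered
    among the top-K predictions, pooled over all predicate classes."""
    top_k = set(pred_triplets[:k])
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / max(len(gt_triplets), 1)

def mean_recall_at_k(gt_triplets, pred_triplets, k=100):
    """Mean class-wise recall mR@K: recall is computed per predicate class
    first, then averaged, so rare predicates count as much as frequent
    ones such as 'on' or 'has'."""
    top_k = set(pred_triplets[:k])
    per_class = defaultdict(lambda: [0, 0])  # predicate -> [hits, gt count]
    for subj, pred, obj in gt_triplets:
        per_class[pred][1] += 1
        if (subj, pred, obj) in top_k:
            per_class[pred][0] += 1
    recalls = [hits / total for hits, total in per_class.values()]
    return sum(recalls) / len(recalls) if recalls else 0.0
```

Because R@K pools all triplets, a model that predicts only a handful of head predicates can still score a high R, while mR@K drops whenever the long tail of predicate classes is missed. In the standard benchmark both metrics are additionally averaged over images, which this single-image sketch omits.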