Paper Title
Learning Object Detection from Captions via Textual Scene Attributes
Paper Authors
Abstract
Object detection is a fundamental task in computer vision, requiring large annotated datasets that are difficult to collect, as annotators need to label objects and their bounding boxes. Thus, it is a significant challenge to use cheaper forms of supervision effectively. Recent work has begun to explore image captions as a source for weak supervision, but to date, in the context of object detection, captions have only been used to infer the categories of the objects in the image. In this work, we argue that captions contain much richer information about the image, including attributes of objects and their relations. Namely, the text represents a scene of the image, as described recently in the literature. We present a method that uses the attributes in this "textual scene graph" to train object detectors. We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets, outperforming recent approaches.