Paper Title

GRiT: A Generative Region-to-text Transformer for Object Understanding

Paper Authors

Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang

Paper Abstract


This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where region locates objects and text describes objects. For example, the text in object detection denotes class names while that in dense captioning refers to descriptive sentences. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions. With the same model architecture, GRiT can understand objects via not only simple nouns, but also rich descriptive sentences including object attributes or actions. Experimentally, we apply GRiT to object detection and dense captioning tasks. GRiT achieves 60.4 AP on COCO 2017 test-dev for object detection and 15.5 mAP on Visual Genome for dense captioning. Code is available at https://github.com/JialianW/GRiT
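The abstract describes a three-stage pipeline: a visual encoder extracts image features, a foreground object extractor proposes regions, and a single text decoder emits either a class name (object detection) or a descriptive sentence (dense captioning) for each region. The sketch below illustrates that <region, text> formulation only; all component bodies are illustrative stubs, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class RegionText:
    """GRiT's unified output: a region locating an object, text describing it."""
    box: Box
    text: str

def visual_encoder(image) -> list:
    # Stand-in for the image-feature backbone (e.g. a ViT in the paper).
    return [("features_of", image)]

def foreground_object_extractor(features) -> List[Box]:
    # Stand-in for the class-agnostic proposal stage: candidate object boxes.
    return [(10.0, 20.0, 110.0, 220.0)]

def text_decoder(features, box: Box, task: str) -> str:
    # Stand-in for the autoregressive decoder. The key idea is that the SAME
    # head generates either a simple noun or a rich descriptive sentence.
    if task == "detection":
        return "dog"  # class name
    return "a brown dog running on grass"  # dense caption

def grit(image, task: str = "detection") -> List[RegionText]:
    # Full pipeline: encode -> localize -> describe each region as text.
    feats = visual_encoder(image)
    return [RegionText(box, text_decoder(feats, box, task))
            for box in foreground_object_extractor(feats)]
```

Running `grit(img, task="detection")` yields `<region, text>` pairs whose text fields are class names, while `task="dense_captioning"` yields the same boxes paired with sentences; this mirrors how the paper reuses one architecture for both tasks (task names here are hypothetical labels, not the paper's API).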
