Paper Title
AttnGrounder: Talking to Cars with Attention
Paper Authors
Paper Abstract
We propose Attention Grounder (AttnGrounder), a single-stage, end-to-end trainable model for visual grounding. Visual grounding aims to localize a specific object in an image given a natural language text query. Unlike previous methods, which use the same text representation for every image region, we use a visual-text attention module that relates each word in the query to every region in the corresponding image to construct a region-dependent text representation. Furthermore, to improve the localization ability of our model, we use the visual-text attention module to generate an attention mask around the referred object. The attention mask is trained as an auxiliary task, using a rectangular mask generated from the provided ground-truth coordinates as the target. We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over existing methods.
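The two ideas in the abstract, a region-dependent text representation and a rectangular mask target for the auxiliary task, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the dot-product scoring, the cell-centre rule for building the mask, and the binary cross-entropy loss are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def region_text_features(regions, words):
    """For every region vector, attend over all word vectors.

    Returns one (weights, text_vector) pair per region, so each region
    gets its own text representation instead of a shared one.
    Dot-product scoring is an assumption, not taken from the paper.
    """
    dim = len(words[0])
    out = []
    for region in regions:
        weights = softmax([dot(region, w) for w in words])
        text = [sum(wt * w[d] for wt, w in zip(weights, words)) for d in range(dim)]
        out.append((weights, text))
    return out


def rectangular_mask(box, image_size, grid_size):
    """Binary target mask on a feature grid.

    A cell is 1 when its centre falls inside the ground-truth box
    (x1, y1, x2, y2), given in pixels. The cell-centre rule is an
    assumption made for this sketch.
    """
    x1, y1, x2, y2 = box
    img_w, img_h = image_size
    cols, rows = grid_size
    cell_w, cell_h = img_w / cols, img_h / rows
    mask = []
    for row in range(rows):
        cy = (row + 0.5) * cell_h
        mask.append([
            1 if x1 <= (col + 0.5) * cell_w <= x2 and y1 <= cy <= y2 else 0
            for col in range(cols)
        ])
    return mask


def mask_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a predicted mask and the target.

    Stands in for the auxiliary loss; the actual loss is not stated
    in the abstract.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count
```

In this sketch, `region_text_features` would be called with one feature vector per image region and one embedding per query word, and `mask_bce` would compare a predicted attention mask against the output of `rectangular_mask` for the annotated box.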