MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond

Paper Title

MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond

Authors

Duy-Kien Nguyen, Vedanuj Goswami, Xinlei Chen

Abstract

This paper focuses on visual counting, which aims to predict the number of occurrences given a natural image and a query (e.g., a question or a category). Unlike most prior works that use explicit, symbolic models, which can be computationally expensive and limited in generalization, we propose a simple and effective alternative by revisiting modulated convolutions that fuse the query and the image locally. Following the design of the residual bottleneck, we call our method MoVie, short for Modulated conVolutional bottlenecks. Notably, MoVie reasons implicitly and holistically and needs only a single forward pass during inference. Nevertheless, MoVie showcases strong performance for counting: 1) advancing the state of the art on counting-specific VQA tasks while being more efficient; 2) outperforming prior art on difficult benchmarks like COCO for common object counting; 3) helping us secure first place in the 2020 VQA challenge when integrated as a module for 'number' related questions in generic VQA models. Finally, we show evidence that modulated convolutions such as MoVie can serve as a general mechanism for reasoning tasks beyond counting.
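
The idea of fusing the query and the image locally can be pictured as applying the same query-derived, per-channel scaling at every spatial location of a feature map, followed by pooling into a single vector. The pure-Python sketch below is only an illustration under stated assumptions: the residual, scale-only form `v + gamma * v`, the helper names (`project_query`, `modulate`, `global_average_pool`), and the toy weights are all hypothetical and are not taken from the paper, whose model places the modulation inside learned convolutional bottlenecks.

```python
# Illustrative sketch only. The modulation form, names, shapes, and weights
# are assumptions for exposition, not the authors' implementation.

def project_query(query, weights):
    """Hypothetical linear map from a query embedding to per-channel scales."""
    return [sum(w * q for w, q in zip(row, query)) for row in weights]


def modulate(feature_map, gamma):
    """Apply the same per-channel scale at every spatial location.

    feature_map is an H x W x C nested list; gamma has C entries.
    Assumed residual form: out = v + gamma * v, so gamma = 0 is the identity.
    """
    return [
        [[v + g * v for v, g in zip(pixel, gamma)] for pixel in row]
        for row in feature_map
    ]


def global_average_pool(feature_map):
    """Average over all locations, giving one C-dimensional vector."""
    pixels = [pixel for row in feature_map for pixel in row]
    channels = len(pixels[0])
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(channels)]


# Toy 2 x 2 feature map with 2 channels, and a 3-dimensional query embedding.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
query = [1.0, 0.0, 1.0]
weights = [[0.5, 0.0, 0.5],   # produces the scale for channel 0
           [0.0, 1.0, 0.0]]   # produces the scale for channel 1

gamma = project_query(query, weights)   # one scale per channel
fused = modulate(features, gamma)       # query fused at each location
pooled = global_average_pool(fused)     # single vector for a count predictor
```

In this toy setup the query amplifies channel 0 and leaves channel 1 unchanged, and the whole computation is one pass over the feature map. In a real model, a learned predictor would map the pooled vector to a count.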
