Paper Title

Language-Mediated, Object-Centric Representation Learning

Paper Authors

Ruocheng Wang, Jiayuan Mao, Samuel J. Gershman, Jiajun Wu

Paper Abstract

We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object discovery and segmentation, notably MONet and Slot Attention. While these algorithms learn an object-centric representation just by reconstructing the input image, LORL enables them to further learn to associate the learned representations with concepts, i.e., words for object categories, properties, and spatial relationships, from language input. These object-centric concepts derived from language facilitate the learning of object-centric representations. LORL can be integrated with various unsupervised object discovery algorithms that are language-agnostic. Experiments show that the integration of LORL consistently improves the performance of unsupervised object discovery methods on two datasets with the help of language. We also show that concepts learned by LORL, in conjunction with object discovery methods, aid downstream tasks such as referring expression comprehension.
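The abstract describes a joint objective: a slot-based discovery backbone reconstructs the image, while a language-grounding head ties each slot to concept words mentioned in a caption. Below is a minimal, hypothetical sketch of that idea in PyTorch. The names `ToySlotEncoder`, `ConceptGrounding`, `lorl_style_loss`, and `lang_weight` are illustrative assumptions, not the authors' implementation; a real system would plug in MONet or Slot Attention in place of the toy encoder.

```python
# Illustrative sketch only: reconstruction + language-grounded concept alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySlotEncoder(nn.Module):
    """Stand-in for an unsupervised object discovery backbone.

    Maps an image to K per-object slot vectors plus a reconstruction.
    A real system would use MONet or Slot Attention here.
    """
    def __init__(self, num_slots=4, slot_dim=64, image_size=32):
        super().__init__()
        self.num_slots, self.slot_dim = num_slots, slot_dim
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * image_size * image_size, 256),
            nn.ReLU(),
            nn.Linear(256, num_slots * slot_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(num_slots * slot_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 3 * image_size * image_size),
        )

    def forward(self, image):
        b = image.shape[0]
        slots = self.encoder(image).view(b, self.num_slots, self.slot_dim)
        recon = self.decoder(slots.view(b, -1)).view_as(image)
        return slots, recon


class ConceptGrounding(nn.Module):
    """Scores how well each slot matches each mentioned concept word (e.g., 'red', 'cube')."""
    def __init__(self, num_concepts, slot_dim=64, concept_dim=64):
        super().__init__()
        self.concept_embeddings = nn.Embedding(num_concepts, concept_dim)
        self.project = nn.Linear(slot_dim, concept_dim)

    def forward(self, slots, concept_ids):
        slot_feats = F.normalize(self.project(slots), dim=-1)                       # (B, K, D)
        concept_feats = F.normalize(self.concept_embeddings(concept_ids), dim=-1)   # (B, C, D)
        # Cosine similarity between every slot and every mentioned concept.
        return torch.einsum('bkd,bcd->bkc', slot_feats, concept_feats)


def lorl_style_loss(image, concept_ids, backbone, grounding, lang_weight=1.0):
    """Joint objective: image reconstruction + language-grounded concept alignment.

    Each mentioned concept should be claimed by at least one slot, so we take a
    soft maximum over slots for every concept and push that score up.
    """
    slots, recon = backbone(image)
    recon_loss = F.mse_loss(recon, image)
    scores = grounding(slots, concept_ids)     # (B, K, C)
    per_concept = scores.logsumexp(dim=1)      # soft max over slots, per concept
    lang_loss = -per_concept.mean()
    return recon_loss + lang_weight * lang_loss


# Tiny usage example with random data.
backbone = ToySlotEncoder()
grounding = ConceptGrounding(num_concepts=20)
images = torch.rand(2, 3, 32, 32)
concepts = torch.randint(0, 20, (2, 3))        # three mentioned words per image
loss = lorl_style_loss(images, concepts, backbone, grounding)
loss.backward()
```

The key design point this sketch tries to mirror is that the language term shares the same slot representations as the reconstruction term, so word supervision can shape how the backbone carves the scene into objects.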
