Using Navigational Information to Learn Visual Representations

Authors

Lizhen Zhu, Brad Wyble, James Z. Wang

Abstract

Children learn to build a visual representation of the world from unsupervised exploration and we hypothesize that a key part of this learning ability is the use of self-generated navigational information as a similarity label to drive a learning objective for self-supervised learning. The goal of this work is to exploit navigational information in a visual environment to provide performance in training that exceeds the state-of-the-art self-supervised training. Here, we show that using spatial and temporal information in the pretraining stage of contrastive learning can improve the performance of downstream classification relative to conventional contrastive learning approaches that use instance discrimination to discriminate between two alterations of the same image or two different images. We designed a pipeline to generate egocentric-vision images from a photorealistic ray-tracing environment (ThreeDWorld) and record relevant navigational information for each image. Modifying the Momentum Contrast (MoCo) model, we introduced spatial and temporal information to evaluate the similarity of two views in the pretraining stage instead of instance discrimination. This work reveals the effectiveness and efficiency of contextual information for improving representation learning. The work informs our understanding of the means by which children might learn to see the world without external supervision.
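
The idea of replacing instance discrimination with navigational similarity can be illustrated with a small sketch. This is not the authors' implementation: the frame dictionaries, the `max_dist` and `max_dt` thresholds, the rule that a pair must be close in both space and time, and the helper names are all assumptions made for illustration. The loss is a generic InfoNCE-style contrastive loss in plain Python, not MoCo's queue-based version.

```python
import math

def is_positive_pair(a, b, max_dist=1.0, max_dt=2.0):
    """Hypothetical rule: two frames count as similar views if they were
    captured close together in space AND time (thresholds are assumptions)."""
    dist = math.dist(a["pos"], b["pos"])   # Euclidean distance between positions
    dt = abs(a["t"] - b["t"])              # time gap between captures
    return dist <= max_dist and dt <= max_dt

def positive_mask(frames, max_dist=1.0, max_dt=2.0):
    """Boolean matrix: mask[i][j] is True when frame j is a positive for frame i."""
    n = len(frames)
    return [
        [i != j and is_positive_pair(frames[i], frames[j], max_dist, max_dt)
         for j in range(n)]
        for i in range(n)
    ]

def contrastive_loss(sim, mask, temperature=0.07):
    """InfoNCE-style loss, averaged over anchors that have at least one positive.
    sim[i][j] is the similarity between the embeddings of frames i and j."""
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for j, s in enumerate(row) if j != i]
        positives = [s / temperature for j, s in enumerate(row)
                     if j != i and mask[i][j]]
        if not positives:
            continue  # no navigational neighbour for this anchor
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(-sum(p - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0

# Toy data: frames 0 and 1 are near each other; frame 2 is far away in space and time.
frames = [
    {"pos": (0.0, 0.0), "t": 0.0},
    {"pos": (0.5, 0.0), "t": 1.0},
    {"pos": (10.0, 10.0), "t": 50.0},
]
mask = positive_mask(frames)

# Embedding similarities that agree / disagree with the navigational labels.
sim_aligned = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
sim_misaligned = [[1.0, 0.1, 0.9], [0.1, 1.0, 0.9], [0.9, 0.9, 1.0]]
loss_aligned = contrastive_loss(sim_aligned, mask)
loss_misaligned = contrastive_loss(sim_misaligned, mask)
```

In this toy setup the loss is lower when frames recorded near each other also have similar embeddings, which is the behavior a navigation-based similarity label is meant to encourage during pretraining.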
