Paper Title
Probing Contextual Language Models for Common Ground with Visual Representations
Paper Authors
Paper Abstract
The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent are contextual representations of concrete nouns aligned with corresponding visual representations? We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Moreover, they are effective in retrieving specific instances of image patches; textual context plays an important role in this process. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans. We hope our analyses inspire future research in understanding and improving the visual capabilities of language models.
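To make the probing setup concrete, below is a minimal sketch of how such a probe could be wired up: a frozen contextual embedding of a concrete noun is projected into the visual feature space and scored against candidate image-patch features, with the matching patch treated as the positive among non-matching ones. This is an illustrative assumption, not the paper's exact architecture; the class and function names (TextToVisionProbe, contrastive_loss), dimensions, and loss choice are hypothetical.

```python
# Illustrative sketch of a text-to-vision probe (assumptions, not the paper's exact method).
# Assumes frozen contextual noun embeddings (e.g., from a BERT-style model) and frozen
# visual features for candidate image patches; only a linear projection is learned.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToVisionProbe(nn.Module):
    """Projects a contextual noun embedding into the visual feature space."""
    def __init__(self, text_dim: int = 768, visual_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(text_dim, visual_dim)

    def forward(self, text_emb: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        # text_emb:    (batch, text_dim)               contextual embedding of the noun
        # patch_feats: (batch, num_patches, visual_dim) candidate image-patch features
        query = F.normalize(self.proj(text_emb), dim=-1)   # (batch, visual_dim)
        keys = F.normalize(patch_feats, dim=-1)            # (batch, num_patches, visual_dim)
        # Cosine similarity between the projected noun and each candidate patch.
        return torch.einsum("bd,bnd->bn", query, keys)

def contrastive_loss(scores: torch.Tensor, target_idx: torch.Tensor) -> torch.Tensor:
    """Treat the matching patch as the positive class among the candidates."""
    return F.cross_entropy(scores, target_idx)

# Usage sketch: the matching patch sits at index 0 for every example.
probe = TextToVisionProbe()
text_emb = torch.randn(4, 768)           # e.g., contextual embeddings of a concrete noun
patch_feats = torch.randn(4, 16, 2048)   # e.g., features of 16 candidate image patches
scores = probe(text_emb, patch_feats)
loss = contrastive_loss(scores, torch.zeros(4, dtype=torch.long))
loss.backward()
```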