Paper Title


Deriving Visual Semantics from Spatial Context: An Adaptation of LSA and Word2Vec to Generate Object and Scene Embeddings from Images

Authors

Matthias S. Treder, Juan Mayor-Torres, Christoph Teufel

Abstract


Embeddings are an important tool for the representation of word meaning. Their effectiveness rests on the distributional hypothesis: words that occur in the same context carry similar semantic information. Here, we adapt this approach to index visual semantics in images of scenes. To this end, we formulate a distributional hypothesis for objects and scenes: Scenes that contain the same objects (object context) are semantically related. Similarly, objects that appear in the same spatial context (within a scene or subregions of a scene) are semantically related. We develop two approaches for learning object and scene embeddings from annotated images. In the first approach, we adapt LSA and Word2Vec's Skipgram and CBOW models to generate two sets of embeddings from object co-occurrences in whole images, one for objects and one for scenes. The representational space spanned by these embeddings suggests that the distributional hypothesis holds for images. In an initial application of this approach, we show that our image-based embeddings improve scene classification models such as ResNet18 and VGG-11 (3.72\% improvement on Top5 accuracy, 4.56\% improvement on Top1 accuracy). In the second approach, rather than analyzing whole images of scenes, we focus on co-occurrences of objects within subregions of an image. We illustrate that this method yields a sensible hierarchical decomposition of a scene into collections of semantically related objects. Overall, these results suggest that object and scene embeddings from object co-occurrences and spatial context yield semantically meaningful representations as well as computational improvements for downstream applications such as scene classification.
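The LSA-style adaptation described in the abstract can be sketched in a few lines: treat each annotated scene as a "document" and each object label as a "word", build a scene-by-object count matrix, and factorize it with a truncated SVD so that rows yield scene embeddings and columns yield object embeddings. This is a minimal toy illustration, not the authors' implementation; the scene names, object labels, and embedding dimension below are all illustrative assumptions.

```python
# Toy sketch of an LSA adaptation to annotated scenes (illustrative data,
# not the paper's dataset or code). Scenes play the role of documents,
# object labels play the role of words.
import numpy as np

scenes = {
    "kitchen_01": ["stove", "sink", "fridge", "cup"],
    "kitchen_02": ["stove", "fridge", "cup", "table"],
    "office_01":  ["desk", "chair", "monitor", "cup"],
    "office_02":  ["desk", "chair", "monitor", "keyboard"],
}

objects = sorted({o for objs in scenes.values() for o in objs})
obj_index = {o: i for i, o in enumerate(objects)}

# Count matrix X: rows = scenes, columns = objects.
X = np.zeros((len(scenes), len(objects)))
for r, objs in enumerate(scenes.values()):
    for o in objs:
        X[r, obj_index[o]] += 1

# Truncated SVD: scene embeddings from U, object embeddings from V.
k = 2  # embedding dimension (an assumption for this toy example)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scene_emb = U[:, :k] * S[:k]    # one row per scene
object_emb = Vt[:k].T * S[:k]   # one row per object

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Objects sharing spatial context should end up closer in embedding space
# than objects that never co-occur.
stove_fridge = cos(object_emb[obj_index["stove"]],
                   object_emb[obj_index["fridge"]])
stove_monitor = cos(object_emb[obj_index["stove"]],
                    object_emb[obj_index["monitor"]])
```

In this toy data, `stove` and `fridge` occur in exactly the same scenes, so their embeddings coincide (cosine similarity of 1), while `stove` and `monitor` never co-occur and land in different regions of the space, mirroring the distributional hypothesis the paper formulates for images.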
