Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents

Paper title

Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents

Authors

Gregory Yauney, Jack Hessel, David Mimno

Abstract

Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations. Such granular annotation is rare, expensive, and unavailable in most domain-specific contexts. In contrast, unlabeled multi-image, multi-sentence documents are abundant. Can lexical grounding be learned from such documents, even though they have significant lexical and visual overlap? Working with a case study dataset of real estate listings, we demonstrate the challenge of distinguishing highly correlated grounded terms, such as "kitchen" and "bedroom", and introduce metrics to assess this document similarity. We present a simple unsupervised clustering-based method that increases precision and recall beyond object detection and image tagging baselines when evaluated on labeled subsets of the dataset. The proposed method is particularly effective for local contextual meanings of a word, for example associating "granite" with countertops in the real estate dataset and with rocky landscapes in a Wikipedia dataset.
