Paper Title

Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO

Authors

Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, Yinfei Yang

Abstract

By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, datasets have limited cross-modal associations: images are not paired with other images, captions are only paired with other captions of the same image, there are no negative associations and there are missing positive cross-modal associations. This undermines research into how inter-modality learning impacts intra-modality tasks. We address this gap with Crisscrossed Captions (CxC), an extension of the MS-COCO dataset with human semantic similarity judgments for 267,095 intra- and inter-modality pairs. We report baseline results on CxC for strong existing unimodal and multimodal models. We also evaluate a multitask dual encoder trained on both image-caption and caption-caption pairs that crucially demonstrates CxC's value for measuring the influence of intra- and inter-modality learning.
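The multitask dual encoder described in the abstract is commonly trained with an in-batch contrastive (softmax) loss over paired embeddings. The following is a minimal numpy sketch of that general technique, not the paper's actual implementation; the function name, temperature value, and batching details are illustrative assumptions:

```python
import numpy as np

def inbatch_contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric in-batch softmax loss for a dual encoder (illustrative sketch).

    image_emb, text_emb: (batch, dim) L2-normalized embeddings. Row i of each
    matrix is a matched image-caption pair; every other row in the batch
    serves as a negative for it.
    """
    # Scaled cosine similarities between all pairs in the batch.
    logits = image_emb @ text_emb.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])               # matched pairs lie on the diagonal

    def xent(l):
        # Cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the two retrieval directions: image->text and text->image.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Training on both image-caption and caption-caption pairs, as the abstract describes, amounts to applying a loss of this shape to each pair type and summing the resulting objectives.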
