Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos

Paper Title

Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos

Authors

Oriol, Benet; Luque, Jordi; Diego, Ferran; Giro-i-Nieto, Xavier

Abstract

In this work, we propose an effective approach for training unique embedding representations by combining three simultaneous modalities: images, spoken narratives, and textual narratives. The proposed methodology departs from a baseline system that produces an embedding space trained with only spoken narratives and image cues. Our experiments on the EPIC-Kitchen and Places Audio Caption datasets show that introducing the human-generated textual transcriptions of the spoken narratives helps the training procedure and yields better embedding representations. The triad of speech, image, and words allows for a better estimate of the point embedding and improves performance on tasks such as image and speech retrieval, even when the third modality, text, is not present in the task.
