Paper Title

A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks

Paper Authors

Angela S. Lin, Sudha Rao, Asli Celikyilmaz, Elnaz Nouri, Chris Brockett, Debadeepta Dey, Bill Dolan

Paper Abstract

Many high-level procedural tasks can be decomposed into sequences of instructions that vary in their order and choice of tools. In the cooking domain, the web offers many partially-overlapping text and video recipes (i.e. procedures) that describe how to make the same dish (i.e. high-level task). Aligning instructions for the same dish across different sources can yield descriptive visual explanations that are far richer semantically than conventional textual instructions, providing commonsense insight into how real-world procedures are structured. Learning to align these different instruction sets is challenging because: a) different recipes vary in their order of instructions and use of ingredients; and b) video instructions can be noisy and tend to contain far more information than text instructions. To address these challenges, we first use an unsupervised alignment algorithm that learns pairwise alignments between instructions of different recipes for the same dish. We then use a graph algorithm to derive a joint alignment between multiple text and multiple video recipes for the same dish. We release the Microsoft Research Multimodal Aligned Recipe Corpus containing 150K pairwise alignments between recipes across 4,262 dishes with rich commonsense information.
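The pairwise alignment step described above can be illustrated with a small sketch. Note this is not the paper's method: the paper learns alignments with an unsupervised algorithm, whereas here instruction similarity is approximated by simple token-overlap (Jaccard) and a monotonic alignment is recovered with a Needleman-Wunsch-style dynamic program. All function names, the scoring scheme, and the `gap` penalty are illustrative assumptions.

```python
def jaccard(a, b):
    """Token-overlap similarity between two instruction strings
    (a stand-in assumption for the learned similarity in the paper)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def align_recipes(recipe_a, recipe_b, gap=-0.1):
    """Needleman-Wunsch-style monotonic alignment of two instruction lists.
    Returns a list of (i, j) index pairs of aligned instructions."""
    n, m = len(recipe_a), len(recipe_b)
    # score[i][j]: best score aligning the first i vs. first j instructions.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + jaccard(recipe_a[i - 1], recipe_b[j - 1]),
                score[i - 1][j] + gap,   # skip an instruction in recipe_a
                score[i][j - 1] + gap,   # skip an instruction in recipe_b
            )
    # Trace back through the table to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + jaccard(recipe_a[i - 1], recipe_b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

recipe_a = ["preheat the oven", "mix flour and sugar", "bake for 30 minutes"]
recipe_b = ["mix the flour and sugar together", "bake 30 minutes in the oven"]
print(align_recipes(recipe_a, recipe_b))  # → [(1, 0), (2, 1)]
```

In this toy example the two "mix" and "bake" instructions are paired while the unmatched "preheat" step is left unaligned, mirroring how recipes for the same dish only partially overlap. The paper's graph algorithm would then combine many such pairwise alignments into a joint alignment across all text and video recipes for a dish.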
