Paper Title

Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism

Authors

Hao Wang, Doyen Sahoo, Chenghao Liu, Ke Shu, Palakorn Achananuparp, Ee-Peng Lim, Steven C. H. Hoi

Abstract

Food retrieval is an important task to perform analysis of food-related information, where we are interested in retrieving relevant information about the queried food item such as ingredients, cooking instructions, etc. In this paper, we investigate cross-modal retrieval between food images and cooking recipes. The goal is to learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another. Two major challenges in addressing this problem are 1) large intra-variance and small inter-variance across cross-modal food data; and 2) difficulties in obtaining discriminative recipe representations. To address these two problems, we propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities. Besides, we exploit a self-attention mechanism to improve the embedding of recipes. We evaluate the performance of the proposed method on the large-scale Recipe1M dataset, and show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
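For concreteness, below is a minimal PyTorch sketch of the two ideas named in the abstract: a semantic-consistency term that aligns the class-probability outputs of the image and recipe branches, and a self-attention layer over recipe tokens. The class names (`SemanticConsistencyLoss`, `RecipeSelfAttention`), the symmetric-KL form of the alignment, and all dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticConsistencyLoss(nn.Module):
    """Align the semantic (class) probability outputs of the image and
    recipe branches. The symmetric-KL form is an assumption made for
    illustration; the paper may define the alignment differently."""

    def forward(self, image_logits, recipe_logits):
        log_p_img = F.log_softmax(image_logits, dim=-1)
        log_p_rec = F.log_softmax(recipe_logits, dim=-1)
        # Symmetric KL divergence between the two modalities' predictions.
        kl_ir = F.kl_div(log_p_img, log_p_rec.exp(), reduction="batchmean")
        kl_ri = F.kl_div(log_p_rec, log_p_img.exp(), reduction="batchmean")
        return 0.5 * (kl_ir + kl_ri)

class RecipeSelfAttention(nn.Module):
    """Self-attention pooling over recipe token embeddings (e.g. encoded
    ingredients and instruction steps). Dimensions are hypothetical."""

    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len, dim); attend each token to all others,
        # then mean-pool into a single recipe embedding.
        attended, _ = self.attn(tokens, tokens, tokens)
        return attended.mean(dim=1)

if __name__ == "__main__":
    imgs, recs = torch.randn(4, 10), torch.randn(4, 10)  # logits over 10 classes
    tokens = torch.randn(4, 20, 1024)                     # 20 recipe tokens per sample
    print(SemanticConsistencyLoss()(imgs, recs))          # scalar consistency loss
    print(RecipeSelfAttention()(tokens).shape)            # torch.Size([4, 1024])
```

In a setup like this, the consistency term would typically be added to a standard cross-modal retrieval loss (e.g. a triplet loss over the joint embedding space), so that the two branches agree not only in embedding distance but also in their semantic predictions.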
