Paper Title
Compositional Learning of Image-Text Query for Image Retrieval
Paper Authors
Paper Abstract
In this paper, we investigate the problem of retrieving images from a database based on a multi-modal (image-text) query. Specifically, the query text prompts some modification in the query image, and the task is to retrieve images with the desired modifications. For instance, a user of an e-commerce platform is interested in buying a dress, which should look similar to her friend's dress, but the dress should be of white color with a ribbon sash. In this case, we would like the algorithm to retrieve some dresses with the desired modifications of the query dress. We propose an autoencoder-based model, ComposeAE, to learn the composition of image and text query for retrieving images. We adopt a deep metric learning approach and learn a metric that pushes the composition of the source image and text query closer to the target image. We also propose a rotational symmetry constraint on the optimization problem. Our approach is able to outperform the state-of-the-art method TIRG \cite{TIRG} on three benchmark datasets, namely: MIT-States, Fashion200k and Fashion IQ. In order to ensure fair comparison, we introduce strong baselines by enhancing the TIRG method. To ensure reproducibility of the results, we publish our code here: \url{https://github.com/ecom-research/ComposeAE}.
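To make the general setup concrete, the sketch below illustrates the basic idea of composing an image embedding and a text embedding into a single query embedding and training it with a batch-based softmax cross-entropy metric learning loss that pulls each composed query towards its target image. This is only a minimal sketch of the overall approach described in the abstract, not the authors' ComposeAE architecture (which additionally uses an autoencoder and a rotational symmetry constraint; see the repository linked above). The module names, feature dimensions, fusion MLP, and the specific loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ComposedQueryModel(nn.Module):
    """Toy composition model: fuse an image embedding and a text embedding
    into a single query embedding comparable with target-image embeddings.
    (Hypothetical stand-in, not the ComposeAE architecture.)"""

    def __init__(self, img_dim=512, txt_dim=300, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # stand-in for a CNN image encoder
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # stand-in for a text encoder
        self.compose = nn.Sequential(                   # simple MLP fusion of the two modalities
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, img_feat, txt_feat):
        z_img = self.img_proj(img_feat)
        z_txt = self.txt_proj(txt_feat)
        return self.compose(torch.cat([z_img, z_txt], dim=-1))


def batch_classification_loss(composed, target_emb, temperature=0.07):
    """Softmax cross-entropy over the batch: each composed query should be
    most similar (by cosine similarity) to its own target image embedding."""
    q = F.normalize(composed, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # i-th query matches i-th target
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    model = ComposedQueryModel()
    img = torch.randn(8, 512)   # source-image features (random here, for shape checking)
    txt = torch.randn(8, 300)   # modification-text features
    tgt = torch.randn(8, 512)   # target-image features
    target_emb = model.img_proj(tgt)  # embed targets in the shared space
    loss = batch_classification_loss(model(img, txt), target_emb)
    loss.backward()
    print(float(loss))
```

At retrieval time, the same idea applies: embed all database images once, compose the query image with the modification text, and rank the database by cosine similarity to the composed query embedding.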