Paper title
Training and challenging models for text-guided fashion image retrieval
Paper authors
Paper abstract
Retrieving relevant images from a catalog based on a query image together with a modifying caption is a challenging multimodal task that can particularly benefit domains like apparel shopping, where fine details and subtle variations may be best expressed through natural language. We introduce a new evaluation dataset, Challenging Fashion Queries (CFQ), as well as a modeling approach that achieves state-of-the-art performance on the existing Fashion IQ (FIQ) dataset. CFQ complements existing benchmarks by including relative captions with positive and negative labels of caption accuracy and conditional image similarity, where others provided only positive labels with a combined meaning. We demonstrate the importance of multimodal pretraining for the task and show that domain-specific weak supervision based on attribute labels can augment generic large-scale pretraining. While previous modality fusion mechanisms lose the benefits of multimodal pretraining, we introduce a residual attention fusion mechanism that improves performance. We release CFQ and our code to the research community.
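The abstract does not detail the residual attention fusion mechanism. A minimal sketch of one plausible form, in which the pretrained image representation is preserved by adding a text-conditioned cross-attention update as a residual (all function and variable names here are hypothetical, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention_fusion(img_tokens, txt_tokens, w_q, w_k, w_v):
    """Hypothetical residual attention fusion sketch.

    img_tokens: (n_img, d) image token embeddings from a pretrained encoder
    txt_tokens: (n_txt, d) modifying-caption token embeddings
    The residual connection keeps the pretrained image features intact
    when the attention update contributes little.
    """
    q = img_tokens @ w_q                  # queries from image tokens
    k = txt_tokens @ w_k                  # keys from caption tokens
    v = txt_tokens @ w_v                  # values from caption tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    residual = attn @ v                   # text-conditioned update
    return img_tokens + residual          # residual add preserves pretraining

rng = np.random.default_rng(0)
d = 16
img = rng.normal(size=(4, d))
txt = rng.normal(size=(7, d))
w_q, w_k, w_v = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
fused = residual_attention_fusion(img, txt, w_q, w_k, w_v)
print(fused.shape)  # (4, 16)
```

The residual form means that zeroing the attention branch recovers the original image embedding exactly, which is one way a fusion mechanism could avoid discarding the benefits of multimodal pretraining, as the abstract claims.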