FashionViL: Fashion-Focused Vision-and-Language Representation Learning

Paper Title

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

Authors

Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, Tao Xiang

Abstract

Large-scale Vision-and-Language (V+L) pre-training for representation learning has proven effective in boosting various downstream V+L tasks. However, in the fashion domain, existing V+L methods are inadequate because they overlook the unique characteristics of both fashion V+L data and fashion downstream tasks. In this work, we propose a novel fashion-focused V+L representation learning framework, dubbed FashionViL. It contains two novel fashion-specific pre-training tasks designed to exploit two intrinsic attributes of fashion V+L data. First, in contrast to other domains where a V+L data point contains only a single image-text pair, a fashion data point may contain multiple images. We thus propose a Multi-View Contrastive Learning task that pulls the visual representation of one image closer to the compositional multimodal representation of another image+text. Second, fashion text (e.g., a product description) often contains rich fine-grained concepts (attributes/noun phrases). To exploit this, a Pseudo-Attributes Classification task is introduced to encourage the learned unimodal (visual/textual) representations of the same concept to be adjacent. Further, fashion V+L tasks uniquely include ones that do not conform to the common one-stream or two-stream architectures (e.g., text-guided image retrieval). We thus propose a flexible, versatile V+L model architecture consisting of a modality-agnostic Transformer, so that it can be adapted to any downstream task. Extensive experiments show that FashionViL achieves a new state of the art across five downstream tasks. Code is available at https://github.com/BrandonHanx/mmf.
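
The Multi-View Contrastive Learning idea in the abstract can be illustrated with a small contrastive-loss sketch. This is not the authors' implementation: the abstract only says that the visual representation of one image is pulled closer to the multimodal representation of another image+text of the same product. The symmetric InfoNCE form, the temperature value of 0.07, and every function name below are assumptions made for illustration, and the embeddings are taken as given rather than produced by a model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (epsilon guards against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, 1e-12) for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i].

    Every other key in the batch acts as a negative for queries[i].
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def multi_view_contrastive_loss(view_embeddings, fused_embeddings, temperature=0.07):
    """Symmetric loss between two sets of embeddings, one row per product.

    view_embeddings[i]:  embedding of one image of product i (hypothetical input).
    fused_embeddings[i]: embedding of another image of product i fused with its
                         text (hypothetical input).
    """
    forward = info_nce(view_embeddings, fused_embeddings, temperature)
    backward = info_nce(fused_embeddings, view_embeddings, temperature)
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    views = [[1.0, 0.0], [0.0, 1.0]]
    # Matching rows give a loss near zero; swapping them gives a large loss.
    print(multi_view_contrastive_loss(views, views))
    print(multi_view_contrastive_loss(views, views[::-1]))
```

When the rows of the two inputs describe the same products in the same order, the loss is close to zero; when they are mismatched, it grows. That is the behaviour a "pull closer" objective needs.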
