Paper Title
Do Not Mask What You Do Not Need to Mask: a Parser-Free Virtual Try-On
Paper Authors
Paper Abstract
The 2D virtual try-on task has recently attracted great interest from the research community, both for its direct potential applications in online shopping and for its inherent, unaddressed scientific challenges. The task requires fitting an in-shop cloth image onto the image of a person, which is highly challenging because it involves cloth warping, image compositing, and image synthesis. Casting virtual try-on as a supervised task faces a difficulty: available datasets are composed of pairs of pictures (cloth, person wearing the cloth). Thus, no ground truth is available when the cloth on the person changes. State-of-the-art models work around this by masking the cloth information on the person using both a human parser and a pose estimator. Image synthesis modules are then trained to reconstruct the person image from the masked person image and the cloth image. This procedure has several caveats: first, human parsers are prone to errors; second, parsing is a costly pre-processing step that also has to be applied at inference time; finally, it makes the task harder than it needs to be, since the mask covers information that should be kept, such as hands or accessories. In this paper, we propose a novel student-teacher paradigm in which the teacher is trained in the standard way (reconstruction) before guiding the student to focus on the initial task (changing the cloth). The student additionally learns from an adversarial loss, which pushes it to follow the distribution of real images. Consequently, the student exploits information that is masked from the teacher. A student trained without the adversarial loss would not use this information. In addition, removing both the human parser and the pose estimator at inference time makes real-time virtual try-on possible.
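The student objective described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the exact loss terms, so the L1 distillation distance, the non-saturating adversarial term, and the names `student_loss` and `adv_weight` are all hypothetical choices, not the paper's formulation.

```python
import numpy as np


def student_loss(student_out, teacher_out, disc_score_fake, adv_weight=1.0):
    """Hypothetical student objective: distillation term plus adversarial term.

    student_out, teacher_out: image arrays of identical shape.
    disc_score_fake: discriminator outputs in (0, 1) for the student images.
    """
    # Distillation (assumed L1): the teacher's try-on result acts as a
    # pseudo ground truth for the student on the "change the cloth" task.
    distill = np.mean(np.abs(student_out - teacher_out))
    # Adversarial term (assumed non-saturating form): -log D(student image).
    # It rewards images the discriminator takes for real ones.
    eps = 1e-8
    adv = -np.mean(np.log(disc_score_fake + eps))
    return distill + adv_weight * adv, distill, adv


rng = np.random.default_rng(0)
teacher_out = rng.random((2, 3, 8, 8))   # stand-in for teacher try-on images
student_out = teacher_out.copy()         # a student that copies the teacher
disc_scores = np.array([0.5, 0.5])       # an undecided discriminator

total, distill, adv = student_loss(student_out, teacher_out, disc_scores)
```

In this sketch, setting `adv_weight=0` reduces the student to pure imitation of the teacher, which corresponds to the abstract's remark that a student trained without the adversarial loss would not use the information hidden from the teacher.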