Do Not Mask What You Do Not Need to Mask: a Parser-Free Virtual Try-On

Abstract

The 2D virtual try-on task has recently attracted great interest from the research community, both for its direct potential applications in online shopping and for its inherent, unaddressed scientific challenges. This task requires fitting an in-shop cloth image onto the image of a person, which is highly challenging because it involves cloth warping, image compositing, and synthesizing. Casting virtual try-on as a supervised task faces a difficulty: available datasets are composed of pairs of pictures (cloth, person wearing the cloth). Thus, we have no access to ground truth when the cloth on the person changes. State-of-the-art models solve this by masking the cloth information on the person with both a human parser and a pose estimator. Then, image synthesis modules are trained to reconstruct the person image from the masked person image and the cloth image. This procedure has several caveats: firstly, human parsers are prone to errors; secondly, it is a costly pre-processing step, which also has to be applied at inference time; finally, it makes the task harder than it is, since the mask covers information that should be kept, such as hands or accessories. In this paper, we propose a novel student-teacher paradigm where the teacher is trained in the standard way (reconstruction) before guiding the student to focus on the initial task (changing the cloth). The student additionally learns from an adversarial loss, which pushes it to follow the distribution of the real images. Consequently, the student exploits information that is masked to the teacher. A student trained without the adversarial loss would not use this information. Also, getting rid of both the human parser and the pose estimator at inference time allows obtaining a real-time virtual try-on.
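
As a rough illustration of the student objective described above, the sketch below combines a distillation term (student output versus teacher output) with an adversarial term. The abstract does not give the exact losses, so everything here is an assumption made for illustration: the L1 distance, the non-saturating generator loss, the weight `adv_weight`, and all function names are hypothetical and are not taken from the paper.

```python
import math


def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat pixel lists."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def adversarial_generator_loss(disc_score):
    """Non-saturating generator loss -log D(x) (assumed form, not from the paper).

    disc_score is the discriminator's probability that the student output is real.
    """
    eps = 1e-12  # avoid log(0)
    return -math.log(max(disc_score, eps))


def student_loss(student_out, teacher_out, disc_score, adv_weight=1.0):
    """Hypothetical student objective: distillation term plus weighted adversarial term."""
    distill = l1_distance(student_out, teacher_out)  # follow the teacher's try-on result
    adv = adversarial_generator_loss(disc_score)     # follow the real-image distribution
    return distill + adv_weight * adv
```

With `adv_weight=0.0` the student only imitates the teacher, which corresponds to the case the abstract says would not exploit the information masked to the teacher.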
