Abstract
Recently, data-driven single-view reconstruction methods have shown great progress in modeling 3D dressed humans. However, such methods suffer heavily from the depth ambiguities and occlusions inherent to single-view inputs. In this paper, we tackle this problem by considering a small set of input views and investigate the best strategy to exploit information from these views. We propose a data-driven end-to-end approach that reconstructs an implicit 3D representation of dressed humans from sparse camera views. Specifically, we introduce three key components: first, a spatially consistent reconstruction that allows for arbitrary placement of the person in the input views using a perspective camera model; second, an attention-based fusion layer that learns to aggregate visual information from several viewpoints; and third, a mechanism that encodes local 3D patterns under the multi-view context. In the experiments, we show that the proposed approach outperforms the state of the art on standard data both quantitatively and qualitatively. To demonstrate the spatially consistent reconstruction, we apply our approach to dynamic scenes. Additionally, we apply our method to real data acquired with a multi-camera platform and demonstrate that our approach can obtain results comparable to multi-view stereo with dramatically fewer views.