Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild
Abstract
We propose a method to learn 3D deformable object categories from raw single-view images, without external supervision. The method is based on an autoencoder that factors each input image into depth, albedo, viewpoint and illumination. In order to disentangle these components without supervision, we use the fact that many object categories have, at least in principle, a symmetric structure. We show that reasoning about illumination allows us to exploit the underlying object symmetry even if the appearance is not symmetric due to shading. Furthermore, we model objects that are probably, but not certainly, symmetric by predicting a symmetry probability map, learned end-to-end with the other components of the model. Our experiments show that this method can recover the 3D shape of human faces, cat faces and cars from single-view images very accurately, without any supervision or a prior shape model. On benchmarks, we demonstrate superior accuracy compared to another method that uses supervision at the level of 2D image correspondences.
1 Introduction
Understanding the 3D structure of images is key in many computer vision applications. Furthermore, while many deep networks appear to understand images as 2D textures [14], 3D modelling can explain away much of the variability of natural images and potentially improve image understanding in general. Motivated by these facts, we consider the problem of learning 3D models for deformable object categories.
We study this problem under two challenging conditions. The first condition is that no 2D or 3D ground truth information (such as keypoints, segmentation, depth maps, or prior knowledge of a 3D model) is available. Learning without external supervision removes the bottleneck of collecting image annotations, which is often a major obstacle to deploying deep learning for new applications. The second condition is that the algorithm must use an unconstrained collection of single-view images; in particular, it should not require multiple views of the same instance. Learning from single-view images is useful because in many applications, and especially for deformable objects, we only have a source of still images to work with. Consequently, our learning algorithm ingests a number of single-view images of a deformable object category and produces as output a deep network that can estimate the 3D shape of any instance given a single image of it (Fig. 1).
We formulate this as an autoencoder that internally decomposes the image into albedo, depth, illumination and viewpoint, without direct supervision for any of these factors. However, without further assumptions, decomposing images into these four factors is ill-posed. In search of minimal assumptions to achieve this, we note that many object categories are symmetric (e.g. almost all animals and many handcrafted objects). Assuming an object is perfectly symmetric, one can obtain a virtual second view of it by simply mirroring the image. In fact, if correspondences between the pair of mirrored images were available, 3D reconstruction could be achieved by stereo reconstruction [Mukherjee94, 10, 52, 48, 12]. Motivated by this, we seek to leverage symmetry as a geometric cue to constrain the decomposition.
However, specific object instances are in practice never fully symmetric, neither in shape nor appearance. Shape is non-symmetric due to variations in pose or other details (e.g. hair style or expressions on a human face), and albedo can also be non-symmetric (e.g. asymmetric texture of cat faces). Even when both shape and albedo are symmetric, the appearance may still not be, due to asymmetric illumination.
We address this issue in two ways. First, we explicitly model illumination to exploit the underlying symmetry. Furthermore, we show that, by doing so, the model can exploit illumination as an additional cue for recovering the shape. Second, we augment the model to reason about potential lack of symmetry in the objects. To do this, the model predicts, along with the other factors, a map of the probability that each pixel has a symmetric counterpart in the image.
We combine these elements in an end-to-end learning formulation, where all components, including the confidence maps, are learned from raw RGB data only. We also show that symmetry can be enforced by flipping internal representations, which is particularly useful for reasoning about symmetries probabilistically.
We demonstrate the quality of our method on several datasets, including human faces, cat faces and cars. We provide a thorough ablation study using a synthetic face dataset to obtain the necessary 3D ground truth. On real images, we achieve higher fidelity reconstruction results compared to other methods [44, 50] that do not rely on 2D or 3D ground truth information, nor prior knowledge of a 3D model of the instance or class. In addition, we also outperform a recent state-of-the-art method [38] that uses keypoint supervision for 3D reconstruction on real faces, while our method uses no external supervision at all. Finally, we demonstrate that our trained face model generalizes to non-natural images such as face paintings and cartoon drawings without fine-tuning.
2 Related Work
Table 1: Comparison with prior work in terms of supervision, goals and data.

| Paper | Supervision | Goals | Data |
|---|---|---|---|
| [42] | 3D scans | 3DMM | Face |
| [58] | 3DV, I | Prior on 3DV, predict from I | ShapeNet, Ikea |
| [1] | 3DP | Prior on 3DP | ShapeNet |
| [43] | 3DM | Prior on 3DM | Face |
| [15] | 3DMM, 2DKP, I | Refine 3DMM fit to I | Face |
| [13] | 3DMM, 2DKP, I | Fit 3DMM to I+2DKP | Face |
| [16] | 3DMM | Fit 3DMM to 3D scans | Face |
| [25] | 3DMM, 2DKP | Pred. 3DMM from I | Humans |
| [46] | 3DMM, 2DS+KP | Pred. N, A, L from I | Face |
| [56] | 3DMM, I | Pred. 3DM, VP, T, E from I | Face |
| [45] | 3DMM, 2DKP, I | Fit 3DMM to I | Face |
| [11] | 2DS | Prior on 3DV, pred. from I | Model/ScanNet |
| [27] | I, 2DS, VP | Prior on 3DV | ScanNet, PAS3D |
| [26] | I, 2DS+KP | Pred. 3DM, T, VP from I | Birds |
| [7] | I, 2DS | Pred. 3DM, T, L, VP from I | ShapeNet, Birds |
| [20] | I, 2DS | Pred. 3DV, VP from I | ShapeNet, others |
| [50] | I | Prior on 3DM, T, I | Face |
| [44] | I | Pred. 3DM, VP, T$^{\dagger}$ from I | Face |
| [19] | I | Pred. V, L, VP from I | ShapeNet |
| Ours | I | Pred. D, L, A, VP from I | Face, others |
In order to assess our contribution in relation to the vast literature on image-based 3D reconstruction, it is important to consider three aspects of each approach: which information is used, which assumptions are made, and what the output is. Below and in Table 1 we compare our contribution to prior works based on these factors.
Our method uses single-view images of an object category as training data, assumes that the objects belong to a specific class (e.g. human faces) which is weakly symmetric, and outputs a monocular predictor capable of decomposing any image of the category into shape, albedo, illumination, viewpoint and symmetry probability.
Structure from Motion.
Traditional methods such as Structure from Motion (SfM) [9] can reconstruct the 3D structure of individual rigid scenes given as input multiple views of each scene and 2D keypoint matches between the views. This can be extended in two ways. First, monocular reconstruction methods can perform dense 3D reconstruction from a single image without 2D keypoints [66, 54, 17]. However, they require multiple views [17] or videos of rigid scenes for training [66]. Second, Non-Rigid SfM (NRSfM) approaches [4, 40] can learn to reconstruct deformable objects by allowing 3D points to deform in a limited manner between views, but require supervision in terms of annotated 2D keypoints for both training and testing. Hence, neither family of SfM approaches can learn to reconstruct deformable objects from raw pixels of a single view.
Shape from X.
Many other monocular cues have been used as alternatives or supplements to SfM for recovering shape from images, such as shading [21, 63], silhouettes [30], texture [57] and symmetry [Mukherjee94, 10]. In particular, our work is inspired by shape from symmetry and shape from shading. Shape from symmetry [Mukherjee94, 10, 52, 48] reconstructs symmetric objects from a single image by using the mirrored image as a virtual second view, provided that symmetric correspondences are available. [48] also shows that it is possible to detect symmetries and correspondences using descriptors. Shape from shading [21, 63] assumes a shading model such as Lambertian reflectance, and reconstructs the surface by exploiting the non-uniform illumination.
Category-specific reconstruction.
Learning-based methods have recently been leveraged to reconstruct objects from a single view, either in the form of a raw image or 2D keypoints (see also Table 1). While this task is ill-posed, it has been shown to be solvable by learning a suitable object prior from the training data [42, 58, 1, 43]. A variety of supervisory signals have been proposed to learn such priors. Besides using 3D ground truth directly, authors have considered using videos [2, 66, Novotny17b, 55] and stereo pairs [17, 36]. Other approaches have used single views with 2D keypoint annotations [31, 26, 38, 49, 6] or object masks [26, 7]. For objects such as human bodies and human faces, some methods [25, 16, 56, 13] have learned reconstructions from raw images, but starting from the knowledge of a predefined shape model such as SMPL [35] or Basel [42]. These prior models are constructed using specialized hardware and/or other forms of supervision, which are often difficult to obtain for deformable objects in the wild, such as animals, and are also limited in the shape details they capture.
Only recently have authors attempted to learn the geometry of object categories from raw, monocular views only. Thewlis et al. [Thewlis17b, Thewlis18] use equivariance to learn dense landmarks, which recovers the 2D geometry of the objects. DAE [47] learns to predict a deformation field by heavily constraining an autoencoder with a small bottleneck embedding, and [44] lifts that to 3D; in post-processing, they further decompose the reconstruction into albedo and shading, obtaining an output similar to ours.
Adversarial learning has been proposed as a way of hallucinating new views of an object. Some of these methods start from 3D representations [58, 1, 67, 43]. Kato et al. [27] train a discriminator on raw images but use viewpoint as additional supervision. HoloGAN [39] only uses raw images but does not obtain an explicit 3D reconstruction. Szabo et al. [50] use adversarial training to reconstruct 3D meshes of the object, but do not assess their results quantitatively. Henzler et al. [20] also learn from raw images, but only experiment with images that contain the object on a white background, which is akin to supervision with 2D silhouettes. In Sec. 4.3, we compare to [44, 50] and demonstrate superior reconstruction results with much higher fidelity.
Since our model generates images from an internal 3D representation, one component of the model is a differentiable renderer. However, with a traditional rendering pipeline, gradients across occlusions and boundaries are not defined. Several soft relaxations have thus been proposed [34, 28, 32]. Here, we use an implementation of [28] (https://github.com/daniilidis-group/neural_renderer).
3 Method
Given an unconstrained collection of images of an object category, such as human faces, our goal is to learn a model $\mathrm{\Phi}$ that receives as input an image of an object instance and produces as output a decomposition of it into 3D shape, albedo, illumination and viewpoint, as illustrated in Fig. 2.
As we have only raw images to learn from, the learning objective is reconstructive: namely, the model is trained so that the combination of the four factors gives back the input image. This results in an autoencoding pipeline where the factors have, due to the way they are recomposed, an explicit photogeometric meaning.
In order to learn such a decomposition without supervision for any of the components, we use the fact that many object categories are bilaterally symmetric. However, the appearance of object instances is never perfectly symmetric. Asymmetries arise from shape deformation, asymmetric albedo and asymmetric illumination. We take two measures to account for these asymmetries. First, we explicitly model asymmetric illumination. Second, our model also estimates, for each pixel in the input image, a confidence score that explains the probability of the pixel having a symmetric counterpart in the image (see conf $\sigma ,{\sigma}^{\prime}$ in Fig. 2).
The following sections describe how this is done, looking first at the photo-geometric autoencoder (Sec. 3.1), then at how symmetries are modelled (Sec. 3.2), followed by details of the image formation (Sec. 3.3) and the supplementary perceptual loss (Sec. 3.4).
3.1 Photo-geometric autoencoding
An image $\mathbf{I}$ is a function $\mathrm{\Omega}\to {\mathbb{R}}^{3}$ defined on a grid $\mathrm{\Omega}=\{0,\mathrm{\dots},W-1\}\times \{0,\mathrm{\dots},H-1\}$, or, equivalently, a tensor in ${\mathbb{R}}^{3\times W\times H}$. We assume that the image is roughly centered on an instance of the object of interest. The goal is to learn a function $\mathrm{\Phi}$, implemented as a neural network, that maps the image $\mathbf{I}$ to four factors $(d,a,w,l)$ comprising a depth map $d:\mathrm{\Omega}\to {\mathbb{R}}_{+}$, an albedo image $a:\mathrm{\Omega}\to {\mathbb{R}}^{3}$, a global light direction $l\in {\mathbb{S}}^{2}$, and a viewpoint $w\in {\mathbb{R}}^{6}$ so that the image can be reconstructed from them.
The image $\mathbf{I}$ is reconstructed from the four factors in two steps, lighting $\mathrm{\Lambda}$ and reprojection $\mathrm{\Pi}$, as follows:
$$\widehat{\mathbf{I}}=\mathrm{\Pi}(\mathrm{\Lambda}(a,d,l),d,w).$$  (1) 
The lighting function $\mathrm{\Lambda}$ generates a version of the object based on the depth map $d$, the light direction $l$ and the albedo $a$ as seen from a canonical viewpoint $w=0$. The viewpoint $w$ represents the transformation between the canonical view and the viewpoint of the actual input image $\mathbf{I}$. Then, the reprojection function $\mathrm{\Pi}$ simulates the effect of a viewpoint change and generates the image $\widehat{\mathbf{I}}$ given the canonical depth $d$ and the shaded canonical image $\mathrm{\Lambda}(a,d,l)$. Learning uses a reconstruction loss which encourages $\mathbf{I}\approx \widehat{\mathbf{I}}$ (Sec. 3.2).
Discussion.
The effect of lighting could be incorporated in the albedo $a$ by interpreting the latter as a texture rather than as the object’s albedo. However, there are two good reasons to avoid this. First, the albedo $a$ is often symmetric even if the illumination causes the corresponding appearance to look asymmetric. Separating them allows us to more effectively incorporate the symmetry constraint described below. Second, shading provides an additional cue on the underlying 3D shape [22, 3]. In particular, unlike the recent work of [47] where a shading map is predicted independently from shape, our model computes the shading based on the predicted depth, mutually constraining each other.
3.2 Probably symmetric objects
Leveraging symmetry for 3D reconstruction requires identifying symmetric object points in an image. Here we do so implicitly, assuming that depth and albedo, which are reconstructed in a canonical frame, are symmetric about a fixed vertical plane. An important beneficial side effect of this choice is that it helps the model discover a ‘canonical view’ for the object, which is important for reconstruction [40].
To do this, we consider the operator that flips a map $a\in {\mathbb{R}}^{C\times W\times H}$ along the horizontal axis (the choice of axis is arbitrary as long as it is fixed): ${[\mathrm{flip}\,a]}_{c,u,v}={a}_{c,W-1-u,v}.$ We then require $d\approx \mathrm{flip}\,{d}^{\prime}$ and $a\approx \mathrm{flip}\,{a}^{\prime}$. While these constraints could be enforced by adding corresponding loss terms to the learning objective, they would be difficult to balance. Instead, we achieve the same effect indirectly, by obtaining a second reconstruction ${\widehat{\mathbf{I}}}^{\prime}$ from the flipped depth and albedo:
$${\widehat{\mathbf{I}}}^{\prime}=\mathrm{\Pi}(\mathrm{\Lambda}({a}^{\prime},{d}^{\prime},l),{d}^{\prime},w),\quad {a}^{\prime}=\mathrm{flip}\,a,\quad {d}^{\prime}=\mathrm{flip}\,d.$$  (2) 
Then, we consider two reconstruction losses encouraging $\mathbf{I}\approx \widehat{\mathbf{I}}$ and $\mathbf{I}\approx {\widehat{\mathbf{I}}}^{\prime}$. Since the two losses are commensurate, they are easy to balance and train jointly. Most importantly, this approach allows us to easily reason about symmetry probabilistically, as explained next.
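The flip operator above can be sketched in a few lines of NumPy. This is only an illustration, assuming maps are stored as arrays of shape (C, W, H) as in the text; the function name `flip` and the toy array are ours, not part of the original implementation.

```python
import numpy as np

def flip(x):
    """Mirror a (C, W, H) map along the horizontal (u) axis.

    Implements [flip x]_{c,u,v} = x_{c,W-1-u,v}.
    """
    return x[:, ::-1, :]

# Toy example: one channel, W = 3 columns (u), H = 2 rows (v).
a = np.arange(6, dtype=np.float64).reshape(1, 3, 2)
a_flipped = flip(a)  # would play the role of a' = flip a in Eq. (2)
```

Applying `flip` twice returns the original map, which is why a perfectly symmetric depth or albedo is left unchanged by the operator.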
The source image $\mathbf{I}$ and the reconstruction $\widehat{\mathbf{I}}$ are compared via the loss:
$$\mathcal{L}(\widehat{\mathbf{I}},\mathbf{I},\sigma )=-\frac{1}{|\mathrm{\Omega}|}\sum _{uv\in \mathrm{\Omega}}\mathrm{ln}\frac{1}{\sqrt{2}{\sigma}_{uv}}\mathrm{exp}-\frac{\sqrt{2}{\mathrm{\ell}}_{1,uv}}{{\sigma}_{uv}},$$  (3) 
where ${\mathrm{\ell}}_{1,uv}=|{\widehat{\mathbf{I}}}_{uv}-{\mathbf{I}}_{uv}|$ is the ${L}_{1}$ distance between the intensity of pixels at location $uv$, and $\sigma \in {\mathbb{R}}_{+}^{W\times H}$ is a confidence map, also estimated by the network $\mathrm{\Phi}$ from the image $\mathbf{I}$, which expresses the aleatoric uncertainty of the model. The loss can be interpreted as the negative log-likelihood of a factorized Laplacian distribution on the reconstruction residuals. Optimizing likelihood causes the model to self-calibrate, learning a meaningful confidence map [29].
Modelling uncertainty is generally useful, but in our case is particularly important when we consider the “symmetric” reconstruction ${\widehat{\mathbf{I}}}^{\prime}$, for which we use the same loss $\mathcal{L}({\widehat{\mathbf{I}}}^{\prime},\mathbf{I},{\sigma}^{\prime})$. Crucially, we use the network to estimate, also from the same input image $\mathbf{I}$, a second confidence map ${\sigma}^{\prime}$. This confidence map allows the model to learn which portions of the input image might not be symmetric. For instance, in some cases hair on a human face is not symmetric as shown in Fig. 2, and ${\sigma}^{\prime}$ can assign a higher reconstruction uncertainty to the hair region where the symmetry assumption is not satisfied. Note that this depends on the specific instance under consideration, and is learned by the model itself.
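To make the role of the confidence map concrete, the following is a minimal NumPy sketch of the Laplacian negative log-likelihood, written as the equivalent sum $\sqrt{2}\,\ell_{1,uv}/\sigma_{uv}+\ln(\sqrt{2}\,\sigma_{uv})$ averaged over the grid. It assumes that the per-pixel $L_1$ distance is the sum of absolute differences over the three colour channels; the function name `laplacian_nll` and the toy inputs are illustrative, not the authors' code.

```python
import numpy as np

def laplacian_nll(recon, image, sigma):
    """Negative log-likelihood of a factorized Laplacian on L1 residuals.

    recon, image: arrays of shape (3, W, H).
    sigma: strictly positive confidence map of shape (W, H).
    """
    # Per-pixel L1 distance (assumption: summed over colour channels).
    l1 = np.abs(recon - image).sum(axis=0)
    # -ln[ 1/(sqrt(2) sigma) * exp(-sqrt(2) l1 / sigma) ], term by term.
    nll = np.sqrt(2.0) * l1 / sigma + np.log(np.sqrt(2.0) * sigma)
    return float(nll.mean())

image = np.zeros((3, 4, 4))
recon = np.ones((3, 4, 4))  # every pixel is off by 1 in each channel, so l1 = 3
loss_confident = laplacian_nll(recon, image, np.full((4, 4), 0.1))
loss_uncertain = laplacian_nll(recon, image, np.full((4, 4), 3.0))
```

For a poorly reconstructed pixel, a larger $\sigma_{uv}$ lowers the loss (here `loss_uncertain < loss_confident`), while the $\ln \sigma_{uv}$ term penalizes declaring every pixel uncertain. For a fixed residual, the per-pixel term is minimized at $\sigma_{uv}=\sqrt{2}\,\ell_{1,uv}$.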
Overall, the learning objective is given by the combination of the two reconstruction errors:
$$\mathcal{E}(\mathrm{\Phi};\mathbf{I})=\mathcal{L}(\widehat{\mathbf{I}},\mathbf{I},\sigma )+{\lambda}_{\text{f}}\mathcal{L}({\widehat{\mathbf{I}}}^{\prime},\mathbf{I},{\sigma}^{\prime}),$$  (4) 
where ${\lambda}_{\text{f}}=0.5$ is a weighting factor, $(d,a,w,l,\sigma ,{\sigma}^{\prime})=\mathrm{\Phi}(\mathbf{I})$ is the output of the neural network, and $\widehat{\mathbf{I}}$ and ${\widehat{\mathbf{I}}}^{\prime}$ are obtained according to Eqs. (1) and (2).
3.3 Image formation model
We now describe the functions $\mathrm{\Pi}$ and $\mathrm{\Lambda}$ in Eq. (1) in more detail. The image is formed by a camera looking at a 3D object. If we denote with $P=({P}_{x},{P}_{y},{P}_{z})\in {\mathbb{R}}^{3}$ a 3D point expressed in the reference frame of the camera, this is mapped to pixel $p=(u,v,1)$ by the following projection:
$$p\propto KP,\quad K=\left[\begin{array}{ccc} f & 0 & {c}_{u} \\ 0 & f & {c}_{v} \\ 0 & 0 & 1 \end{array}\right],\quad \left\{\begin{array}{l} {c}_{u}=\frac{W-1}{2}, \\ {c}_{v}=\frac{H-1}{2}, \\ f=\frac{W-1}{2\mathrm{tan}\frac{{\theta}_{\text{FOV}}}{2}}. \end{array}\right.$$  (5) 
This model assumes a perspective camera with field of view (FOV) ${\theta}_{\text{FOV}}$. We assume a nominal distance of the object from the camera at about $1\mathrm{m}$. Given that the images are cropped around a particular object, we assume a relatively narrow FOV of ${\theta}_{\text{FOV}}\approx {10}^{\circ}$.
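As a small worked example of Eq. (5), the sketch below builds $K$ for a $64\times 64$ image and a $10^{\circ}$ field of view. The helper name `intrinsics` is ours; the formulas for $c_u$, $c_v$ and $f$ follow the equation above.

```python
import numpy as np

def intrinsics(W, H, fov_deg):
    """Camera matrix K of Eq. (5) for a W x H image and a FOV given in degrees."""
    c_u = (W - 1) / 2.0
    c_v = (H - 1) / 2.0
    f = (W - 1) / (2.0 * np.tan(np.deg2rad(fov_deg) / 2.0))
    return np.array([[f, 0.0, c_u],
                     [0.0, f, c_v],
                     [0.0, 0.0, 1.0]])

K = intrinsics(64, 64, 10.0)

# A point on the edge of the viewing cone, at half the FOV from the optical axis.
P_edge = np.array([np.tan(np.deg2rad(5.0)), 0.0, 1.0])
p_edge = K @ P_edge
u_edge = p_edge[0] / p_edge[2]  # homogeneous normalization
```

With these values, a point on the optical axis projects to the image centre $(31.5, 31.5)$, and a point at half the FOV from the axis projects to the last column $u=W-1=63$.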
The depth map $d:\mathrm{\Omega}\to {\mathbb{R}}_{+}$ associates a depth value ${d}_{uv}$ to each pixel $(u,v)\in \mathrm{\Omega}$ in the canonical view. By inverting the camera model (5), we find that this corresponds to the 3D point $P={d}_{uv}\cdot {K}^{-1}p.$
The viewpoint $w\in {\mathbb{R}}^{6}$ represents a Euclidean transformation $(R,T)\in SE(3)$, where ${w}_{1:3}$ and ${w}_{4:6}$ are rotation angles and translations in $x$, $y$ and $z$ axes respectively.
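The back-projection $P=d_{uv}\cdot K^{-1}p$ can be applied to a whole depth map at once. The following NumPy sketch assumes depth maps indexed as `depth[u, v]` with shape (W, H); the function name `backproject` and the constant test depth are illustrative.

```python
import numpy as np

def backproject(depth, K):
    """Lift a (W, H) depth map to 3D points of shape (W, H, 3) via P = d * K^{-1} p."""
    W, H = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H), indexing="ij")
    p = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # homogeneous pixels
    rays = p @ np.linalg.inv(K).T   # K^{-1} p for every pixel
    return depth[..., None] * rays

W = H = 64
f = (W - 1) / (2.0 * np.tan(np.deg2rad(10.0) / 2.0))
K = np.array([[f, 0.0, (W - 1) / 2.0],
              [0.0, f, (H - 1) / 2.0],
              [0.0, 0.0, 1.0]])
points = backproject(np.full((W, H), 1.0), K)
```

Because the third component of $K^{-1}p$ is 1, the $z$ coordinate of each 3D point equals its depth value, and projecting the points with $K$ recovers the original pixel grid.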
The map $(R,T)$ transforms 3D points from the canonical view to the actual view. Thus a pixel $(u,v)$ in the canonical view is mapped to the pixel $({u}^{\prime},{v}^{\prime})$ in the actual view by the warping function ${\eta}_{d,w}:(u,v)\mapsto ({u}^{\prime},{v}^{\prime})$ given by:
$${p}^{\prime}\propto K({d}_{uv}\cdot R{K}^{-1}p+T),$$  (6) 
where ${p}^{\prime}=({u}^{\prime},{v}^{\prime},1).$
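The warp of Eq. (6) can be traced for a single pixel as follows. This is a sketch under stated assumptions: the text does not specify how the three rotation angles are composed, so the order $R=R_zR_yR_x$ used in `rotation_from_angles` is our choice, and both function names are hypothetical.

```python
import numpy as np

def rotation_from_angles(rx, ry, rz):
    """Rotation matrix from angles (radians) about x, y, z (assumed order: Rz @ Ry @ Rx)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

def warp_pixel(u, v, d_uv, K, R, T):
    """Map canonical pixel (u, v) with depth d_uv to (u', v') following Eq. (6)."""
    p = np.array([u, v, 1.0])
    q = K @ (d_uv * (R @ np.linalg.inv(K) @ p) + T)
    return q[0] / q[2], q[1] / q[2]

W = H = 64
f = (W - 1) / (2.0 * np.tan(np.deg2rad(10.0) / 2.0))
K = np.array([[f, 0.0, (W - 1) / 2.0],
              [0.0, f, (H - 1) / 2.0],
              [0.0, 0.0, 1.0]])
R = rotation_from_angles(0.0, np.deg2rad(5.0), 0.0)
u_new, v_new = warp_pixel(31.5, 31.5, 1.0, K, R, np.zeros(3))
```

In this example the image centre at unit depth, rotated by $5^{\circ}$ about the $y$ axis (half of the $10^{\circ}$ FOV), lands on the last column $u'=63$ while $v'$ stays at $31.5$. With $R=I$ and $T=0$ every pixel maps to itself.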
Finally, the reprojection function $\mathrm{\Pi}$ takes as input the depth $d$ and the viewpoint change $w$ and applies the resulting warp to the canonical image $\mathbf{J}$ to obtain the actual image $\widehat{\mathbf{I}}=\mathrm{\Pi}(\mathbf{J},d,w)$ as ${\widehat{\mathbf{I}}}_{{u}^{\prime}{v}^{\prime}}={\mathbf{J}}_{uv},$ where $(u,v)={\eta}_{d,w}^{-1}({u}^{\prime},{v}^{\prime}).$ Note that this requires computing the inverse of the warp ${\eta}_{d,w}$, which is detailed in Sec. 6.1.
The canonical image $\mathbf{J}=\mathrm{\Lambda}(a,d,l)$ is in turn generated as a combination of albedo, normal map and light direction. To do so, given the depth map $d$, we derive the normal map $n:\mathrm{\Omega}\to {\mathbb{S}}^{2}$ by associating to each pixel $(u,v)$ a vector normal to the underlying 3D surface. In order to find this vector, we compute the vectors ${t}_{uv}^{u}$ and ${t}_{uv}^{v}$ tangent to the surface along the $u$ and $v$ directions. For example, the first one is: ${t}_{uv}^{u}={d}_{u+1,v}\cdot {K}^{-1}(p+{e}_{x})-{d}_{u-1,v}\cdot {K}^{-1}(p-{e}_{x})$ where $p$ is defined above and ${e}_{x}=(1,0,0)$. Then the normal is obtained by taking the vector product ${n}_{uv}\propto {t}_{uv}^{u}\times {t}_{uv}^{v}$.
The normal ${n}_{uv}$ is multiplied by the light direction $l$ to obtain a value for the directional illumination and the latter is added to the ambient light. Finally, the result is multiplied by the albedo to obtain the illuminated texture, as follows: ${\mathbf{J}}_{uv}=\left({k}_{s}+{k}_{d}\mathrm{max}\{0,\langle l,{n}_{uv}\rangle \}\right)\cdot {a}_{uv}.$ Here ${k}_{s}$ and ${k}_{d}$ are the scalar coefficients weighting the ambient and diffuse terms, and are predicted by the model with range between 0 and 1 via rescaling a tanh output. The light direction $l={({l}_{x},{l}_{y},1)}^{T}/{({l}_{x}^{2}+{l}_{y}^{2}+1)}^{0.5}$ is modeled as a spherical sector by predicting ${l}_{x}$ and ${l}_{y}$ with tanh.
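The normal computation can be sketched with central differences on the back-projected points, since ${d}_{u\pm 1,v}\cdot K^{-1}(p\pm e_x)$ is simply the 3D point of the neighbouring pixel. The sketch below is an assumption-laden illustration: it only returns normals for interior pixels (how borders are handled is not specified in the text), and the name `normals_from_depth` is ours.

```python
import numpy as np

def normals_from_depth(depth, K):
    """Unit normals for the interior pixels of a (W, H) depth map; output (W-2, H-2, 3)."""
    W, H = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H), indexing="ij")
    p = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    P = depth[..., None] * (p @ np.linalg.inv(K).T)   # 3D point of every pixel
    t_u = P[2:, 1:-1] - P[:-2, 1:-1]                  # tangent along u
    t_v = P[1:-1, 2:] - P[1:-1, :-2]                  # tangent along v
    n = np.cross(t_u, t_v)                            # n is proportional to t_u x t_v
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

W = H = 64
f = (W - 1) / (2.0 * np.tan(np.deg2rad(10.0) / 2.0))
K = np.array([[f, 0.0, (W - 1) / 2.0],
              [0.0, f, (H - 1) / 2.0],
              [0.0, 0.0, 1.0]])
flat_normals = normals_from_depth(np.full((W, H), 1.0), K)
```

For a fronto-parallel plane (constant depth) every normal is $(0,0,1)$ under this orientation convention, which is compatible with a light direction whose third component is positive.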
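The shading step can be written compactly as below. This is a minimal sketch assuming albedo of shape (3, W, H) and normals of shape (W, H, 3); the function names `light_direction` and `shade` and the toy values of $k_s$ and $k_d$ are illustrative.

```python
import numpy as np

def light_direction(l_x, l_y):
    """Unit light direction l = (l_x, l_y, 1) / sqrt(l_x^2 + l_y^2 + 1)."""
    l = np.array([l_x, l_y, 1.0])
    return l / np.linalg.norm(l)

def shade(albedo, normals, light, k_s, k_d):
    """Canonical image J_uv = (k_s + k_d * max(0, <l, n_uv>)) * a_uv."""
    diffuse = np.maximum(0.0, normals @ light)      # (W, H) directional term
    return (k_s + k_d * diffuse)[None] * albedo     # broadcast over colour channels

albedo = np.full((3, 4, 4), 0.5)
normals = np.zeros((4, 4, 3))
normals[..., 2] = 1.0                               # flat surface with normals (0, 0, 1)
J_frontal = shade(albedo, normals, light_direction(0.0, 0.0), k_s=0.2, k_d=0.8)
J_oblique = shade(albedo, normals, light_direction(1.0, 0.0), k_s=0.2, k_d=0.8)
```

With frontal light the flat surface receives the full diffuse term, so `J_frontal` equals the albedo scaled by $k_s+k_d=1$. With the oblique light $\langle l,n\rangle = 1/\sqrt{2}$ and the image is darker.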
3.4 Perceptual loss
The ${L}_{1}$ loss function in Eq. (3) is sensitive to small geometric imperfections and tends to result in blurry reconstructions. We add a perceptual loss term to mitigate this problem. The $k$-th layer of an off-the-shelf image encoder $e$ (VGG16 in our case [Simonyan15]) predicts a representation ${e}^{(k)}(\mathbf{I})\in {\mathbb{R}}^{{C}_{k}\times {W}_{k}\times {H}_{k}}$ where ${\mathrm{\Omega}}_{k}=\{0,\mathrm{\dots},{W}_{k}-1\}\times \{0,\mathrm{\dots},{H}_{k}-1\}$ is the corresponding spatial domain. Similar to Eq. (3), assuming a Gaussian distribution, the perceptual loss is given by:
$${\mathcal{L}}_{\text{p}}^{(k)}(\widehat{\mathbf{I}},\mathbf{I},{\sigma}^{(k)})=-\frac{1}{|{\mathrm{\Omega}}_{k}|}\sum _{uv\in {\mathrm{\Omega}}_{k}}\mathrm{ln}\frac{1}{\sqrt{2\pi {({\sigma}_{uv}^{(k)})}^{2}}}\mathrm{exp}-\frac{{({\mathrm{\ell}}_{uv}^{(k)})}^{2}}{2{({\sigma}_{uv}^{(k)})}^{2}},$$  (7) 
where ${\mathrm{\ell}}_{uv}^{(k)}=|{e}_{uv}^{(k)}(\widehat{\mathbf{I}})-{e}_{uv}^{(k)}(\mathbf{I})|$ for each pixel index $uv$ in the $k$-th layer. We also compute the loss for ${\widehat{\mathbf{I}}}^{\prime}$ using ${\sigma}^{{(k)}^{\prime}}$. Here ${\sigma}^{(k)}$ and ${\sigma}^{{(k)}^{\prime}}$ are additional confidence maps predicted by our model. In practice, we found it is good enough for our purpose to use the features from only one layer, relu3_3 of VGG16. We therefore shorten the notation of perceptual loss to ${\mathcal{L}}_{\text{p}}$. With this, the loss function $\mathcal{L}$ in Eq. (4) is replaced by $\mathcal{L}+{\lambda}_{\text{p}}{\mathcal{L}}_{\text{p}}$ with ${\lambda}_{\text{p}}=1$.
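The Gaussian form of the perceptual loss can be sketched without an actual encoder by treating the feature maps as given arrays. The sketch assumes that $\ell^{(k)}_{uv}$ is the Euclidean norm of the feature difference over channels; the random arrays merely stand in for VGG16 features, and the name `gaussian_nll` is ours.

```python
import numpy as np

def gaussian_nll(feat_recon, feat_image, sigma):
    """Gaussian negative log-likelihood on feature residuals.

    feat_recon, feat_image: arrays of shape (C_k, W_k, H_k).
    sigma: strictly positive confidence map of shape (W_k, H_k).
    """
    # Per-location residual (assumption: L2 norm over the channel axis).
    l = np.linalg.norm(feat_recon - feat_image, axis=0)
    # -ln[ 1/sqrt(2 pi sigma^2) * exp(-l^2 / (2 sigma^2)) ], term by term.
    nll = l ** 2 / (2.0 * sigma ** 2) + np.log(np.sqrt(2.0 * np.pi) * sigma)
    return float(nll.mean())

rng = np.random.default_rng(0)
feat_a = rng.normal(size=(8, 16, 16))   # placeholder for e^(k)(I_hat)
feat_b = rng.normal(size=(8, 16, 16))   # placeholder for e^(k)(I)
loss_unit_sigma = gaussian_nll(feat_a, feat_b, np.ones((16, 16)))
```

With a fixed, uniform $\sigma=1$ the loss is half the mean squared feature distance plus the constant $\tfrac{1}{2}\ln 2\pi$, i.e. a plain $L_2$ loss up to scale and offset.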
4 Experiments
We summarize the key experimental results here, and provide more qualitative evaluations in Sec. 6.3 and the supplementary video. We will release the code, trained models and the synthetic dataset.
4.1 Setup
Datasets.
We test our method on three human face datasets: CelebA [33], 3DFAW [18, 24, 65, 61] and BFM [42]. CelebA is a large-scale human face dataset, consisting of over $200$k images of real human faces in the wild annotated with bounding boxes. 3DFAW contains $23$k images with $66$ 3D keypoint annotations, which we use to evaluate our 3D predictions in Sec. 4.3. We roughly crop the images around the head region and use the official train/val/test splits. BFM (Basel Face Model) is a synthetic face model, which we use to assess the quality of the 3D reconstructions (since the in-the-wild datasets lack ground truth). We follow the protocol of [46] to generate a dataset, sampling shapes, poses, textures, and illumination randomly. We use images from SUN Database [60] as background and save ground truth depth maps for evaluation.
We also test our method on cat faces and synthetic cars. We use two cat datasets [64, parkhi12a]. The first one has $10$k cat images with nine keypoint annotations, and the second one is a collection of dog and cat images, containing $1.2$k cat images with bounding box annotations. We combine the two datasets, crop the images around the cat heads, and split them by $8$:$1$:$1$ into train, validation and test sets. For cars, we render $35$k images of synthetic cars from ShapeNet [5] with random viewpoints and illumination, and randomly split them by $8$:$1$:$1$ into train, validation and test sets.
Metrics.
Since the scale of 3D reconstruction from projective cameras is inherently ambiguous [9], we discount it in the evaluation. Specifically, given the depth map $d$ predicted by our model in the canonical view, we warp it to a depth map $\overline{d}$ in the actual view using the predicted viewpoint and compare the latter to the ground-truth depth map ${d}^{*}$ using the scale-invariant depth error (SIDE) [8] ${E}_{\text{SIDE}}(\overline{d},{d}^{*})={(\frac{1}{WH}{\sum}_{uv}{\mathrm{\Delta}}_{uv}^{2}-{(\frac{1}{WH}{\sum}_{uv}{\mathrm{\Delta}}_{uv})}^{2})}^{\frac{1}{2}}$ where ${\mathrm{\Delta}}_{uv}=\mathrm{log}{\overline{d}}_{uv}-\mathrm{log}{d}_{uv}^{*}$. We compare only valid depth pixels and erode the foreground mask by one pixel to discount rendering artefacts at object boundaries. Additionally, we report the mean angle deviation (MAD) between normals computed from ground truth depth and from the predicted depth, measuring how well the surface is captured by the prediction.
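Both metrics are straightforward to compute once depth maps and normal maps are available. The sketch below is a plain NumPy rendering of the two definitions, assuming an optional boolean mask selects the valid pixels; the function names `side` and `mean_angle_deviation` and the toy inputs are illustrative.

```python
import numpy as np

def side(pred, gt, mask=None):
    """Scale-invariant depth error between two positive depth maps."""
    delta = np.log(pred) - np.log(gt)
    if mask is not None:
        delta = delta[mask]                      # keep valid pixels only
    var = np.mean(delta ** 2) - np.mean(delta) ** 2
    return float(np.sqrt(max(var, 0.0)))         # clamp tiny negative round-off

def mean_angle_deviation(n_pred, n_gt):
    """Mean angle in degrees between two maps of unit normals of shape (..., 3)."""
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

rng = np.random.default_rng(0)
gt_depth = rng.uniform(0.9, 1.1, size=(64, 64))
error_scaled = side(2.0 * gt_depth, gt_depth)    # prediction off by a global scale only
```

A prediction that differs from the ground truth only by a global scale has a constant log-difference, so its SIDE is zero up to floating-point error; this is the sense in which the metric discounts scale.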
Implementation details.
The function $(d,a,w,l,\sigma )=\mathrm{\Phi}(\mathbf{I})$ that extracts depth, albedo, viewpoint, lighting, and confidence maps from the single image $\mathbf{I}$ of the object is implemented using different neural networks. The depth and albedo are generated by encoder-decoder networks, while viewpoint and lighting are regressed using simple encoder networks. The encoder-decoders do not use skip connections because input and output images are not spatially aligned (since the output is in the canonical object space). All four confidence maps are predicted using the same network, using different decoding layers for the photometric and perceptual losses since these are computed at different resolutions. The final activation function is tanh for depth, albedo, viewpoint and lighting and softplus for the confidence maps. The depth prediction is centered on the mean before tanh, as the overall distance is estimated as part of the viewpoint. We do not use any special initialization for any of the predictions, except that 2 border pixels of the depth maps on both the left and the right are clamped at a maximal depth to avoid boundary issues.
We train using Adam over batches of $64$ input images, resized to $64\times 64$ pixels. The size of the output depth and albedo is also $64\times 64$. We train for approximately $50$k iterations. For visualization, depth maps are upsampled to $256$. We include more details in Sec. 6.2.
4.2 Results
Table 2: Comparison with baselines on BFM.

| No | Baseline | SIDE ($\times {10}^{-2}$) $\downarrow $ | MAD (deg.) $\downarrow $ |
|---|---|---|---|
| (1) | Supervised | $0.410$ $\pm 0.103$ | $10.78$ $\pm 1.01$ |
| (2) | Const. null depth | $2.723$ $\pm 0.371$ | $43.34$ $\pm 2.25$ |
| (3) | Average g.t. depth | $1.990$ $\pm 0.556$ | $23.26$ $\pm 2.85$ |
| (4) | Ours (unsupervised) | $0.793$ $\pm 0.140$ | $16.51$ $\pm 1.56$ |

Table 3: Ablation study.

| No | Method | SIDE ($\times {10}^{-2}$) $\downarrow $ | MAD (deg.) $\downarrow $ |
|---|---|---|---|
| (1) | Ours full | $0.793$ $\pm 0.140$ | $16.51$ $\pm 1.56$ |
| (2) | w/o albedo flip | $2.916$ $\pm 0.300$ | $39.04$ $\pm 1.80$ |
| (3) | w/o depth flip | $1.139$ $\pm 0.244$ | $27.06$ $\pm 2.33$ |
| (4) | w/o light | $2.406$ $\pm 0.676$ | $41.64$ $\pm 8.48$ |
| (5) | w/o perc. loss | $0.931$ $\pm 0.269$ | $17.90$ $\pm 2.31$ |
| (6) | w/o confidence | $0.829$ $\pm 0.213$ | $16.39$ $\pm 2.12$ |

Table 4: Asymmetric perturbation.

| Setting | SIDE ($\times {10}^{-2}$) $\downarrow $ | MAD (deg.) $\downarrow $ |
|---|---|---|
| No perturb, no conf. | $0.829$ $\pm 0.213$ | $16.39$ $\pm 2.12$ |
| No perturb, conf. | $0.793$ $\pm 0.140$ | $16.51$ $\pm 1.56$ |
| Perturb, no conf. | $2.141$ $\pm 0.842$ | $26.61$ $\pm 5.39$ |
| Perturb, conf. | $0.878$ $\pm 0.169$ | $17.14$ $\pm 1.90$ |

| Method | Depth Corr. $\uparrow $ |
|---|---|
| Ground truth | $66$ |
| AIGN [53] (supervised, from [38]) | $50.81$ |
| DepthNetGAN [38] (supervised, from [38]) | $58.68$ |
| MOFA [51] (model-based, from [38]) | $15.97$ |
| DepthNet [38] (from [38]) | $26.32$ |
| DepthNet [38] (from GitHub) | $35.77$ |
| Ours | $48.98$ |
| Ours (w/ CelebA pre-training) | $54.65$ |

Comparison with baselines.
Table 2 uses the BFM dataset to compare the depth reconstruction quality obtained by our method, a fully-supervised baseline and two constant baselines. The supervised baseline is a version of our model trained to regress the ground-truth depth maps using an ${L}_{1}$ loss. The trivial baseline predicts a constant uniform depth map, which provides a performance lower bound. The third baseline is a constant depth map obtained by averaging all ground-truth depth maps in the test set. Our method outperforms the two constant baselines by a large margin and approaches the results of supervised training. Improving over the third baseline (which has access to GT information) confirms that the model learns an instance-specific 3D representation.
Ablation.
To understand the influence of the individual parts of the model, we remove them one at a time and evaluate the performance of the ablated model in Table 3.
In the table, row (1) shows the performance of the full model (the same as in Table 2). Row (2) does not flip the albedo. The albedo is then not encouraged to be symmetric in the canonical space, so the model fails to canonicalize the viewpoint of the object and cannot use symmetry cues to recover shape. The performance is as low as the trivial baseline in Table 2. Row (3) does not flip the depth, with a similar effect to row (2). Row (4) predicts a shading map instead of computing it from depth and light direction. This also harms performance significantly because shading cannot be used as a cue to recover shape. Row (5) switches off the perceptual loss, which leads to degraded image quality and hence degraded reconstruction results. Finally, row (6) switches off the confidence maps, using a fixed and uniform value for the confidence; this reduces losses (3) and (7) to the basic ${L}_{1}$ and ${L}_{2}$ losses, respectively. The reconstruction accuracy decreases only slightly (its variance increases more), as faces in BFM are highly symmetric (e.g. they do not have hair). To better understand the effect of the confidence maps, we specifically evaluate on partially asymmetric faces using perturbations.
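The remark about row (6) can be checked numerically. The sketch below assumes that loss (3) has the form of a Laplacian negative log-likelihood with a per-pixel scale $\sigma$; the function names and the three-pixel example are hypothetical. With a fixed, uniform $\sigma$ the loss becomes an affine function of the plain ${L}_{1}$ error, so minimizing one minimizes the other.

```python
import math

SQRT2 = math.sqrt(2.0)

def confidence_l1_loss(pred, target, sigma):
    """Assumed form of the confidence-weighted loss: mean negative
    log-likelihood of a Laplacian with per-pixel scale sigma."""
    total = 0.0
    for p, t, s in zip(pred, target, sigma):
        total += SQRT2 * abs(p - t) / s + math.log(SQRT2 * s)
    return total / len(pred)

def l1_loss(pred, target):
    """Plain mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Three flattened pixels; the last one has a large (asymmetric) error.
pred = [0.2, 0.5, 0.9]
target = [0.2, 0.4, 0.1]

# Uniform confidence: sigma = 1/sqrt(2) makes the log term vanish,
# leaving exactly 2 * L1.
uniform_sigma = [1.0 / SQRT2] * 3
uniform = confidence_l1_loss(pred, target, uniform_sigma)

# Per-pixel confidence: a larger sigma on the outlier pixel lowers the loss.
adaptive_sigma = [1.0 / SQRT2, 1.0 / SQRT2, 1.0]
adaptive = confidence_l1_loss(pred, target, adaptive_sigma)
```

In this toy case the adaptive setting attains a lower loss than the uniform one, which illustrates how a learned confidence map can discount pixels that violate the symmetry assumption.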
Asymmetric perturbation.
In order to demonstrate that our uncertainty modelling allows the model to handle asymmetry, we add asymmetric perturbations to BFM. Specifically, we generate random rectangular color patches with $20\%$ to $50\%$ of the image size and blend them onto the images with $\alpha$ values ranging from $0.5$ to $1$, as shown in Figure 3. We then train our model with and without confidence on these perturbed images, and report the results in Table 4. Without the confidence maps, the model always predicts a symmetric albedo and geometry reconstruction often fails. With our confidence estimates, the model is able to reconstruct the asymmetric faces correctly, with very little loss in accuracy compared to the unperturbed case.
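The perturbation procedure can be sketched as follows. The text fixes only the patch size range and the range of $\alpha$; this sketch additionally assumes that the size range applies to each side of the patch, that position and color are drawn uniformly at random, and that one patch is added per image. The function name `add_patch` is hypothetical.

```python
import random

def add_patch(image, rng):
    """Blend one random rectangular color patch onto an H x W x 3 image
    stored as nested lists with values in [0, 1]. Returns a new image."""
    height, width = len(image), len(image[0])
    # Assumption: each side of the patch covers 20% to 50% of the image side.
    patch_h = int(height * rng.uniform(0.2, 0.5))
    patch_w = int(width * rng.uniform(0.2, 0.5))
    top = rng.randint(0, height - patch_h)
    left = rng.randint(0, width - patch_w)
    color = [rng.random() for _ in range(3)]  # assumed uniform random color
    alpha = rng.uniform(0.5, 1.0)             # blending weight from the text
    out = [[list(px) for px in row] for row in image]
    for i in range(top, top + patch_h):
        for j in range(left, left + patch_w):
            out[i][j] = [(1 - alpha) * c_img + alpha * c_patch
                         for c_img, c_patch in zip(image[i][j], color)]
    return out

rng = random.Random(0)
image = [[[0.5, 0.5, 0.5] for _ in range(64)] for _ in range(64)]
perturbed = add_patch(image, rng)
```

Since the patch is placed independently of the face, it breaks the left-right symmetry of the appearance in a sample-specific way, which is the situation the confidence maps are meant to handle.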
Qualitative results.
In Figure 4 we show reconstruction results of human faces from CelebA and 3DFAW, cat faces from [64, parkhi12a] and synthetic cars from ShapeNet. The 3D shapes are recovered with high fidelity. The reconstructed 3D faces, for instance, contain fine details of the nose, eyes and mouth even in the presence of extreme facial expressions.
To further test generalization, we collected a number of paintings and cartoon drawings of faces from [Crowley15] and the Internet, and tested them with our model trained on the CelebA dataset. As shown in Figure 5, our method still works well even though it has never seen such images during training. Please see Section 6.3 for more qualitative results.
Symmetry and asymmetry detection.
Since our model predicts a canonical view of the objects that is symmetric about the vertical centerline of the image, we can easily visualize the symmetry plane, which is otherwise non-trivial to detect from in-the-wild images. In Figure 6, we render the centerline of the canonical image and warp it to the input viewpoint. The symmetry planes detected by our method are accurate despite the presence of extreme asymmetric texture and lighting effects. We also overlay the predicted confidence map ${\sigma}^{\prime}$ onto the image, confirming that the model assigns low confidence to asymmetric regions in a sample-specific way.
4.3 Comparison with the state of the art
As shown in Table 1, most reconstruction methods in the literature require either image annotations, a prior 3D model of the object, or both. When these assumptions are dropped, the task becomes considerably harder, and there is little prior work that is directly comparable. Of these, [19] only uses synthetic, textureless objects from ShapeNet, [50] reconstructs in-the-wild faces but does not report any quantitative results, and [44] reports quantitative results only on the detection of 2D keypoints, but not on the 3D reconstruction quality. We were not able to obtain code or trained models from [44, 50] for a direct quantitative comparison and thus compare qualitatively.
Qualitative comparison.
In order to establish a side-by-side comparison, we cropped the examples reported in the papers [44, 50] and compare our results with theirs (Figure 7). Our method produces much higher quality reconstructions than both methods, with fine details of the facial expression, whereas [44] recovers 3D shapes poorly and [50] generates unnatural shapes. Note that [50] uses an unconditional GAN that generates high resolution 3D faces from random noise, and cannot recover 3D shapes from images. The input images for [50] in Figure 7 were generated by their method.
3D keypoint depth evaluation.
Next, we compare to the DepthNet model of [38]. This method predicts depth for selected facial keypoints, but uses 2D keypoint annotations as input, a much easier setup than the one we consider here. Still, we compare the quality of the reconstruction of these sparse points obtained by DepthNet and our method. We also compare to the baselines MOFA [51] and AIGN [53] reported in [38]. For a fair comparison, we use their public code which computes the depth correlation score (between $0$ and $66$) on the frontal faces. We use the 2D keypoint locations to sample our predicted depth and then evaluate the same metric. The set of test images from 3DFAW and the preprocessing are identical to [38]. Since 3DFAW is a small dataset with limited variation, we also report results with CelebA pretraining.
In Table 5 we report the results from their paper and the slightly improved results we obtained from their publicly-available implementation. The paper also evaluates a supervised model using a GAN discriminator trained with ground-truth depth information. While our method does not use any supervision, it still outperforms DepthNet and reaches close-to-supervised performance.
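The keypoint-based evaluation can be sketched as follows. The exact metric is defined by the public code of [38], which is not reproduced here; this sketch assumes that the score is the sum, over the 66 keypoints, of the Pearson correlation between predicted and ground-truth keypoint depth across the test images, which would be consistent with the stated range and with the ground-truth score of $66$. Nearest-pixel sampling and all function names are likewise assumptions.

```python
import math

def sample_depth(depth_map, keypoints):
    """Read the predicted depth at each 2D keypoint (x, y), using the
    nearest pixel and clamping to the image border."""
    h, w = len(depth_map), len(depth_map[0])
    values = []
    for x, y in keypoints:
        col = min(max(int(round(x)), 0), w - 1)
        row = min(max(int(round(y)), 0), h - 1)
        values.append(depth_map[row][col])
    return values

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def depth_correlation(pred, gt):
    """Assumed metric: sum over keypoints of the correlation, across
    images, between predicted and true depth. pred[i][k] is the depth
    of keypoint k in image i."""
    num_keypoints = len(pred[0])
    return sum(pearson([p[k] for p in pred], [g[k] for g in gt])
               for k in range(num_keypoints))

# Toy example with 3 images and 2 keypoints (illustrative values only).
gt = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]]
pred = [[2 * a + 1, 0.5 * b] for a, b in gt]  # affine in the true depth
score = depth_correlation(pred, gt)           # maximal: one per keypoint
```

Under this assumed definition the score is invariant to a per-keypoint affine change of the predicted depth, so it measures agreement in relative depth rather than absolute scale.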
4.4 Limitations
While our method is robust in many challenging scenarios (e.g., extreme facial expressions, abstract drawings), we do observe failure cases, as shown in Figure 8. During training, we assume a simple Lambertian shading model, ignoring shadows and specularity, which leads to inaccurate reconstructions under extreme lighting conditions (Figure 8(a)) or on highly non-Lambertian surfaces. Disentangling noisy dark textures and shading (Figure 8(b)) is often difficult. The reconstruction quality is lower for extreme poses (Figure 8(c)), partly due to the poor supervisory signal from the reconstruction loss of side images. This may be improved by imposing constraints from accurate reconstructions of frontal poses.
5 Conclusions
We have presented a method that can learn a 3D model of a deformable object category from an unconstrained collection of single-view images of the object category. The model is able to obtain high-fidelity monocular 3D reconstructions of individual object instances. It is trained with a reconstruction loss, without any supervision, resembling an autoencoder. We have shown that symmetry and illumination are strong cues for shape and help the model to converge to a meaningful reconstruction. Our model outperforms a current state-of-the-art 3D reconstruction method that uses 2D keypoint supervision. As for future work, the model currently represents 3D shape from a canonical viewpoint using a depth map, which is sufficient for objects such as faces that have a roughly convex shape and a natural canonical viewpoint. For more complex objects, it may be possible to extend the model to use either multiple canonical views or a different 3D representation, such as a mesh or a voxel map.
Acknowledgement
We would like to thank Soumyadip Sengupta for sharing with us the code to generate synthetic face datasets, and Mihir Sahasrabudhe for sending us the reconstruction results of Lifting AutoEncoders. We are also indebted to the members of Visual Geometry Group for insightful discussions and comments. This work is jointly supported by Facebook Research and ERC Horizon 2020 Research and Innovation Programme IDIU 638009.
References
[1] (2018) Learning representations and generative models for 3D point clouds. In ICML, Cited by: §2, §2, Table 1.
[2] (2015) Learning to see by moving. In ICCV, Cited by: §2.
[3] (1999) The bas-relief ambiguity. IJCV. Cited by: §3.1.
[4] (2000) Recovering non-rigid 3D shape from image streams. In CVPR, Cited by: §2.
[5] (2015) ShapeNet: an information-rich 3D model repository. arXiv abs/1512.03012. Cited by: §4.1.
[6] (2019) Unsupervised 3D pose estimation with geometric self-supervision. In CVPR, Cited by: §2.
[7] (2019) Learning to predict 3D objects with an interpolation-based differentiable renderer. In NIPS, Cited by: §2, Table 1.
[8] (2014) Depth map prediction from a single image using a multi-scale deep network. In NIPS, Cited by: §4.1.
[9] (2001) The geometry of multiple images. MIT Press. Note: (“with contributions from T. Papadopoulo”) Cited by: §2, §4.1.
[10] (2003) Mirror symmetry $\Rightarrow $ 2-view stereo geometry. IVC. Cited by: §1, §2.
[11] (2017) 3D shape induction from 2D views of multiple objects. In 3DV, Cited by: Table 1.
[12] (2017) Exploiting symmetry and/or Manhattan properties for 3D object structure estimation from single and multiple images. In CVPR, Cited by: §1.
[13] (2019) GANFIT: generative adversarial network fitting for high fidelity 3D face reconstruction. In CVPR, Cited by: §2, Table 1.
[14] (2019) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In ICML, Cited by: §1.
[15] (2019) 3D guided fine-grained face manipulation. In CVPR, Cited by: Table 1.
[16] (2018) Morphable face models - an open framework. In FG, Cited by: §2, Table 1.
[17] (2017) Unsupervised monocular depth estimation with left-right consistency. In CVPR, Cited by: §2, §2.
[18] (2010) Multi-PIE. IVC. Cited by: §4.1.
[19] (2019) Learning single-image 3D reconstruction by generative modelling of shape, pose and shading. IJCV. Cited by: Table 1, §4.3.
[20] (2019) Escaping Plato’s cave using adversarial training: 3D shape from unstructured 2D image collections. In ICCV, Cited by: §2, Table 1.
[21] (1989) Shape from shading. MIT Press, Cambridge Massachusetts. Cited by: §2.
[22] (1975) Obtaining shape from shading information. In The Psychology of Computer Vision, Cited by: §3.1.
[23] (2015) Spatial transformer networks. In NIPS, Cited by: §6.1.
[24] (2015) Dense 3D face alignment from 2D videos in real-time. In FG, Cited by: §4.1.
[25] (2018) End-to-end recovery of human shape and pose. In CVPR, Cited by: §2, Table 1.
[26] (2018) Learning category-specific mesh reconstruction from image collections. In ECCV, Cited by: §2, Table 1.
[27] (2019) Learning view priors for single-view 3D reconstruction. In CVPR, Cited by: §2, Table 1.
[28] (2018) Neural 3D mesh renderer. In CVPR, Cited by: §2, §6.1.
[29] (2017) What uncertainties do we need in Bayesian deep learning for computer vision?. In NIPS, Cited by: §3.2.
[30] (1984) What does the occluding contour tell us about solid shape?. Perception. Cited by: §2.
[31] (2018) Unsupervised adversarial learning of 3D human pose from 2D joint locations. arXiv abs/1803.08244. Cited by: §2.
[32] (2019) Soft rasterizer: a differentiable renderer for image-based 3D reasoning. In ICCV, Cited by: §2.
[33] (2015) Deep learning face attributes in the wild. In ICCV, Cited by: §4.1.
[34] (2014) OpenDR: an approximate differentiable renderer. In ECCV, Cited by: §2.
[35] (2015) SMPL: a skinned multi-person linear model. ACM transactions on graphics (TOG) 34 (6), pp. 248. Cited by: §2.
[36] (2018) Single view stereo matching. In CVPR, Cited by: §2.
[37] (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML, Cited by: 5th item.
[38] (2018) Unsupervised depth estimation, 3D face rotation and replacement. In NIPS, Cited by: §1, §2, §4.3, Table 5.
[39] (2019) HoloGAN: unsupervised learning of 3D representations from natural images. In ICCV, Cited by: §2.
[40] (2019) C3DPO: canonical 3D pose networks for non-rigid structure from motion. In ICCV, Cited by: §2, §3.2.
[41] (2016) Deconvolution and checkerboard artifacts. Distill. Cited by: §6.2.
[42] (2009) A 3D face model for pose and illumination invariant face recognition. In Advanced video and signal based surveillance, Cited by: §2, Table 1, §4.1.
[43] (2018) Generating 3D faces using convolutional mesh autoencoders. In ECCV, Cited by: §2, §2, Table 1.
[44] (2019) Lifting autoencoders: unsupervised learning of a fully-disentangled 3D morphable model using deep non-rigid structure from motion. In ICCV Workshop on Geometry Meets Deep Learning, Cited by: §1, §2, §2, Table 1, 6(b), Figure 7, §4.3, §4.3.
[45] (2019) Learning to regress 3D face shape and expression from an image without 3D supervision. In CVPR, Cited by: Table 1.
[46] (2018) SfSNet: learning shape, reflectance and illuminance of faces in the wild. In CVPR, Cited by: Table 1, §4.1.
[47] (2018) Deforming autoencoders: unsupervised disentangling of shape and appearance. In ECCV, Cited by: §2, §3.1.
[48] (2012) Detecting and reconstructing 3D mirror symmetric objects. In ECCV, Cited by: §1, §2.
[49] (2018) Discovery of latent 3D keypoints via end-to-end geometric reasoning. In NIPS, Cited by: §2.
[50] (2019) Unsupervised generative 3D shape learning from natural images. arXiv abs/1910.00287. Cited by: §1, §2, Table 1, 6(e), Figure 7, §4.3, §4.3.
[51] (2017) MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCV, Cited by: §4.3, Table 5.
[52] (2005) Shape from symmetry. In ICCV, Cited by: §1, §2.
[53] (2017) Adversarial inverse graphics networks: learning 2D-to-3D lifting and image-to-image translation from unpaired supervision. In ICCV, Cited by: §4.3, Table 5.
[54] (2017) Demon: depth and motion network for learning monocular stereo. In CVPR, Cited by: §2.
[55] (2018) Learning depth from monocular videos using direct methods. In CVPR, Cited by: §2.
[56] (2019) An adversarial neuro-tensorial approach for learning disentangled representations. IJCV. Cited by: §2, Table 1.
[57] (1981) Recovering surface shape and orientation from texture. AI. Cited by: §2.
[58] (2016) Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, Cited by: §2, §2, Table 1.
[59] (2018) Group normalization. In ECCV, Cited by: 4th item.
[60] (2010) SUN database: large-scale scene recognition from abbey to zoo. In CVPR, Cited by: §4.1.
[61] (2008) A high-resolution 3D dynamic facial expression database. In FG, Cited by: §4.1.
[62] (2011) Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, Cited by: 2nd item.
[63] (1999) Shape-from-shading: a survey. PAMI. Cited by: §2.
[64] (2008) Cat head detection - how to effectively exploit shape and texture features. In ECCV, Cited by: §4.1, §4.2, §6.3.
[65] (2014) BP4D-spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. IVC 32 (10), pp. 692–706. Cited by: §4.1.
[66] (2017) Unsupervised learning of depth and ego-motion from video. In CVPR, Cited by: §2, §2.
[67] (2018) Visual object networks: image generation with disentangled 3D representations. In NIPS, Cited by: §2.
6 Supplementary Material
6.1 Differentiable rendering layer
As noted in Section 3.3, the reprojection function $\mathrm{\Pi}$ warps the canonical image $\mathbf{J}$ to generate the actual image $\mathbf{I}$. In CNNs, image warping is usually regarded as a simple operation that can be implemented efficiently using a bilinear resampling layer [23]. However, this is true only if we can easily send pixels $({u}^{\prime},{v}^{\prime})$ in the warped image $\mathbf{I}$ back to pixels $(u,v)$ in the source image $\mathbf{J}$, a process also known as backward warping. Unfortunately, in our case the function ${\eta}_{d,w}$ obtained by Eq. (6) in the paper sends pixels in the opposite way.
Implementing a forward warping layer is surprisingly delicate. One way of approaching the problem is to regard this task as a special case of rendering a textured mesh. The Neural Mesh Renderer (NMR) of [28] is a differentiable renderer of this type. In our case, the mesh has one vertex per pixel and each group of $2\times 2$ adjacent pixels is tessellated by two triangles. Empirically, we found the quality of the texture gradients of NMR to be poor in this case, likely because of the high-frequency content in the texture image $\mathbf{J}$.
We solve the problem as follows. First, we use NMR to warp only the depth map $d$, obtaining a version $\overline{d}$ of the depth map as seen from the input viewpoint. This has two advantages: backpropagation through NMR is faster, and the gradients are more stable, probably because the depth map $d$ is comparatively smooth compared to the texture image $\mathbf{J}$. Given the depth map $\overline{d}$, we then use the inverse of Eq. (6) in the paper to find the warp field from the observed viewpoint to the canonical viewpoint, and bilinearly resample the canonical image $\mathbf{J}$ to obtain the reconstruction.
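The final resampling step can be sketched in isolation. The sketch assumes that the warp field has already been computed (here it is simply given as input) and covers only backward warping with bilinear interpolation on a single-channel image; border clamping and the function names are assumptions, and no differentiable renderer is involved.

```python
import math

def bilinear_sample(image, x, y):
    """Sample a single-channel image (list of rows) at continuous
    coordinates (x, y), clamping coordinates to the image border."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1 - fy) * top + fy * bottom

def backward_warp(canonical, warp_field):
    """warp_field[v][u] = (x, y) is the location in the canonical image
    from which output pixel (u, v) reads its value."""
    return [[bilinear_sample(canonical, x, y) for (x, y) in row]
            for row in warp_field]

canonical = [[0.0, 1.0, 2.0],
             [3.0, 4.0, 5.0],
             [6.0, 7.0, 8.0]]
identity = [[(float(u), float(v)) for u in range(3)] for v in range(3)]
shifted = [[(u + 0.5, float(v)) for u in range(3)] for v in range(3)]
same = backward_warp(canonical, identity)
half_pixel = backward_warp(canonical, shifted)
```

Every output pixel receives exactly one value in this scheme, which is why backward warping avoids the holes and collisions that make forward warping delicate.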
6.2 Training details
We report the training details in Table 6, including all hyper-parameter settings, and the detailed network architectures in Tables 7, 8 and 9. We use standard encoder networks for both viewpoint and lighting predictions, and encoder-decoder networks for depth, albedo and confidence predictions. In order to mitigate checkerboard artifacts [41] in the predicted depth and albedo, we add a convolution layer after each deconvolution layer and replace the last deconvolution layer with nearest-neighbor upsampling, followed by $3$ convolution layers. Abbreviations of the operators are defined as follows:

• $\text{Conv}({c}_{in},{c}_{out},k,s,p)$: convolution with ${c}_{in}$ input channels, ${c}_{out}$ output channels, kernel size $k$, stride $s$ and padding $p$.
• $\text{Deconv}({c}_{in},{c}_{out},k,s,p)$: deconvolution [62] with ${c}_{in}$ input channels, ${c}_{out}$ output channels, kernel size $k$, stride $s$ and padding $p$.
• $\text{Upsample}(s)$: nearest-neighbor upsampling with a scale factor of $s$.
• $\text{GN}(n)$: group normalization [59] with $n$ groups.
• $\text{LReLU}(\alpha )$: leaky ReLU [37] with a negative slope of $\alpha $.
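The "Output size" columns of the architecture listings follow from the kernel size, stride and padding of each operator. The sketch below traces the spatial size through the layers of the depth and albedo network, assuming the standard size formulas without dilation or output padding; the helper names and the tuple encoding of layers are hypothetical.

```python
def conv_out(size, k, s, p):
    """Spatial output size of a convolution (no dilation assumed)."""
    return (size + 2 * p - k) // s + 1

def deconv_out(size, k, s, p):
    """Spatial output size of a deconvolution (no output padding assumed)."""
    return (size - 1) * s - 2 * p + k

def trace(size, layers):
    """Return the spatial size after each layer in the list."""
    sizes = []
    for kind, *args in layers:
        if kind == "conv":
            size = conv_out(size, *args)
        elif kind == "deconv":
            size = deconv_out(size, *args)
        elif kind == "upsample":
            size = size * args[0]
        sizes.append(size)
    return sizes

# (kind, k, s, p) for each layer of the depth/albedo encoder and decoder.
encoder = [("conv", 4, 2, 1)] * 4 + [("conv", 4, 1, 0)]
decoder = [
    ("deconv", 4, 1, 0), ("conv", 3, 1, 1),
    ("deconv", 4, 2, 1), ("conv", 3, 1, 1),
    ("deconv", 4, 2, 1), ("conv", 3, 1, 1),
    ("deconv", 4, 2, 1), ("conv", 3, 1, 1),
    ("upsample", 2), ("conv", 3, 1, 1), ("conv", 5, 1, 2), ("conv", 5, 1, 2),
]
encoder_sizes = trace(64, encoder)                 # input images are 64 x 64
decoder_sizes = trace(encoder_sizes[-1], decoder)
```

The convolutions with $(k,s,p)=(3,1,1)$ and $(5,1,2)$ preserve the spatial size, so each convolution inserted after a deconvolution to reduce checkerboard artifacts leaves the resolution schedule unchanged.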
6.3 More qualitative results
We provide more qualitative results in the following. Animations of the rotated 3D reconstructions are included in the supplementary video. Figure 9 shows reconstruction results on human faces from CelebA and 3DFAW. We also show more results on face paintings (Figure 10) and abstract drawings (Figure 11) from [Crowley15] and the Internet. In Figures 12 and 13, we show more examples on cat faces from [64, parkhi12a] and synthetic cars rendered using ShapeNet.
Relighting.
Since our model predicts the intrinsic components of an image, separating the albedo and illumination, we can easily relight the objects with different lighting conditions. In Figure 14, we demonstrate results of the intrinsic decomposition and the relit faces in the canonical view.
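Relighting amounts to re-evaluating the shading with new light parameters while keeping the albedo and geometry fixed. The sketch below assumes a Lambertian model with an ambient coefficient ${k}_{s}$ and a diffuse coefficient ${k}_{d}$, a light direction built from $({l}_{x},{l}_{y})$ with a unit third component before normalization, and surface normals whose third component points towards the viewer; these conventions and the function names are assumptions made for illustration.

```python
import math

def light_direction(lx, ly):
    """Unit light direction from two coefficients (assumed convention:
    third component is 1 before normalization)."""
    norm = math.sqrt(lx * lx + ly * ly + 1.0)
    return (lx / norm, ly / norm, 1.0 / norm)

def shade(albedo, normal, k_s, k_d, light):
    """Lambertian pixel intensity: albedo times the sum of an ambient
    term and a clamped diffuse term."""
    cos_angle = sum(l * n for l, n in zip(light, normal))
    return albedo * (k_s + k_d * max(0.0, cos_angle))

normal = (0.0, 0.0, 1.0)  # surface patch facing the viewer
frontal = shade(0.8, normal, k_s=0.3, k_d=0.7, light=light_direction(0.0, 0.0))
side_lit = shade(0.8, normal, k_s=0.3, k_d=0.7, light=light_direction(1.0, 0.0))
back_facing = shade(0.8, (0.0, 0.0, -1.0), k_s=0.3, k_d=0.7,
                    light=light_direction(0.0, 0.0))
```

Changing only the light arguments produces a new rendering of the same surface, and a patch that faces away from the light receives only the ambient term.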
Testing on videos.
To further assess our model, we apply the model trained on CelebA faces to VoxCeleb [Chung18a] videos frame by frame and include the results in the supplementary video. Our trained model works surprisingly well, producing consistent, smooth reconstructions across different frames and recovering the details of the facial motions accurately.
Parameter  Value/Range 

Optimizer  Adam 
Learning rate  $1\times {10}^{-4}$ 
Number of epochs  $30$ 
Batch size  $64$ 
Loss weight ${\lambda}_{\text{f}}$  $0.5$ 
Loss weight ${\lambda}_{\text{p}}$  $1$ 
Input image size  $64\times 64$ 
Output image size  $64\times 64$ 
Depth map  $(0.9,1.1)$ 
Albedo  $(0,1)$ 
Light coefficient ${k}_{s}$  $(0,1)$ 
Light coefficient ${k}_{d}$  $(0,1)$ 
Light direction ${l}_{x},{l}_{y}$  $(-1,1)$ 
Viewpoint rotation ${w}_{1:3}$  $(-{60}^{\circ},{60}^{\circ})$ 
Viewpoint translation ${w}_{4:6}$  $(-0.1,0.1)$ 
Field of view (FOV)  $10$ 
Encoder  Output size 

Conv(3, 32, 4, 2, 1) + ReLU  32 
Conv(32, 64, 4, 2, 1) + ReLU  16 
Conv(64, 128, 4, 2, 1) + ReLU  8 
Conv(128, 256, 4, 2, 1) + ReLU  4 
Conv(256, 256, 4, 1, 0) + ReLU  1 
Conv(256, ${c}_{out}$, 1, 1, 0) + Tanh $\to output$  1 
Encoder  Output size 

Conv(3, 64, 4, 2, 1) + GN(16) + LReLU(0.2)  32 
Conv(64, 128, 4, 2, 1) + GN(32) + LReLU(0.2)  16 
Conv(128, 256, 4, 2, 1) + GN(64) + LReLU(0.2)  8 
Conv(256, 512, 4, 2, 1) + LReLU(0.2)  4 
Conv(512, 256, 4, 1, 0) + ReLU  1 
Decoder  Output size 
Deconv(256, 512, 4, 1, 0) + ReLU  4 
Conv(512, 512, 3, 1, 1) + ReLU  4 
Deconv(512, 256, 4, 2, 1) + GN(64) + ReLU  8 
Conv(256, 256, 3, 1, 1) + GN(64) + ReLU  8 
Deconv(256, 128, 4, 2, 1) + GN(32) + ReLU  16 
Conv(128, 128, 3, 1, 1) + GN(32) + ReLU  16 
Deconv(128, 64, 4, 2, 1) + GN(16) + ReLU  32 
Conv(64, 64, 3, 1, 1) + GN(16) + ReLU  32 
Upsample(2)  64 
Conv(64, 64, 3, 1, 1) + GN(16) + ReLU  64 
Conv(64, 64, 5, 1, 2) + GN(16) + ReLU  64 
Conv(64, ${c}_{out}$, 5, 1, 2) + Tanh $\to output$  64 
Encoder  Output size 

Conv(3, 64, 4, 2, 1) + GN(16) + LReLU(0.2)  32 
Conv(64, 128, 4, 2, 1) + GN(32) + LReLU(0.2)  16 
Conv(128, 256, 4, 2, 1) + GN(64) + LReLU(0.2)  8 
Conv(256, 512, 4, 2, 1) + LReLU(0.2)  4 
Conv(512, 128, 4, 1, 0) + ReLU  1 
Decoder  Output size 
Deconv(128, 512, 4, 1, 0) + ReLU  4 
Deconv(512, 256, 4, 2, 1) + GN(64) + ReLU  8 
Deconv(256, 128, 4, 2, 1) + GN(32) + ReLU  16 
$\Lsh $ 
16 
Deconv(128, 64, 4, 2, 1) + GN(16) + ReLU  32 
Deconv(64, 64, 4, 2, 1) + GN(16) + ReLU  64 
Conv(64, 2, 5, 1, 2) + SoftPlus $\to output$  64 