# DIST: Rendering Deep Implicit Signed Distance Function with Differentiable Sphere Tracing

###### Abstract

We propose a differentiable sphere tracing algorithm to bridge the gap between inverse graphics methods and the recently proposed deep learning based implicit signed distance function. Due to the nature of the implicit function, the rendering process requires tremendous function queries, which is particularly problematic when the function is represented as a neural network. We optimize both the forward and backward pass of our rendering layer to make it run efficiently with affordable memory consumption on a commodity graphics card. Our rendering method is fully differentiable such that losses can be directly computed on the rendered 2D observations, and the gradients can be propagated backward to optimize the 3D geometry. We show that our rendering method can effectively reconstruct accurate 3D shapes from various inputs, such as sparse depth and multi-view images, through inverse optimization. With the geometry based reasoning, our 3D shape prediction methods show excellent generalization capability and robustness against various noise.

## 1 Introduction

Solving vision problems as an inverse graphics process is one of the most fundamental approaches, where the solution is the visual structure that best explains the given observations. In the realm of 3D geometry understanding, this approach has been used since the earliest days of the field [1, 33, 52].
As a critical component of the inverse graphics based 3D geometric reasoning process, an efficient renderer is required that accurately simulates the observations, *e.g*., a depth map, from an optimizable 3D structure, and that is differentiable so that the error from the partial observation can be back-propagated.

As a natural fit to the deep learning framework, differentiable rendering techniques have drawn great interest recently.
Various solutions for different 3D representations, *e.g*., volume, point cloud, mesh, have been proposed.
However, these 3D representations are all discretized up to a certain resolution, leading to the loss of geometric details and breaking the differentiable properties [22].
Recently, continuous implicit functions have been used to represent the signed distance field [32], which has a strong capacity to encode accurate geometry when combined with deep learning techniques.
Given a latent code as the shape representation, the function can produce a signed distance value for any arbitrary point, and thus enables unlimited resolution and better preserved geometric details for rendering.
However, a differentiable rendering solution for learning-based continuous signed distance function does not exist yet.

In this paper, we propose a differentiable renderer for the continuous implicit signed distance function (SDF) to facilitate 3D shape understanding via geometric reasoning in a deep learning framework (Fig. 1).
Our method can render an implicit SDF represented by a neural network from a latent code into various 2D observations, *e.g*., depth image, surface normal, silhouette, plus other properties encoded, from arbitrary camera viewpoints.
The rendering process is fully differentiable, such that loss functions can be conveniently defined on the rendered images and the observations, and the gradients can be propagated back to the shape generator.
As major applications, our differentiable renderer can be applied to infer the 3D shape from various inputs, *e.g*., multi-view images and single depth image, through an inverse graphics process.
Specifically, given a pre-trained generative model, *e.g*., DeepSDF [32], we search within the latent code space for the 3D shape that produces the rendered images most consistent with the observation.
Extensive experiments show that our geometric reasoning based approaches exhibit significantly better generalization capability than traditional purely learning based approaches, and consistently produce accurate 3D shapes across datasets without finetuning.

Nevertheless, it is challenging to make differentiable rendering work on a learning-based implicit SDF with computationally affordable resources. The main obstacle is that an implicit function provides neither the exact location nor any bound of the surface geometry as in other representations like mesh, volume, and point cloud.

Inspired by traditional ray-tracing based approaches, we adopt the sphere tracing algorithm [13], which marches along each pixel’s ray direction with the queried signed distance until the ray hits the surface, *i.e*., the signed distance equals zero (Fig. 2). However, this is not feasible in the neural network based scenario, where each query on the ray would require a forward pass and a recursive computational graph for back-propagation, which is prohibitive in terms of computation and memory.

To make it work efficiently on a commodity level GPU, we optimize the full life cycle of the rendering process for both forward and backward propagation. In the forward rendering pass, we adopt a coarse-to-fine approach to save computation at the initial steps, an aggressive strategy to speed up the marching, and a safe convergence criterion to prevent unnecessary queries while maintaining resolution. In the backward propagation, we propose a gradient approximation which empirically has negligible impact on system performance but dramatically reduces the computation and memory consumption. By making the rendering tractable, we show how to produce 2D observations with sphere tracing and interact with camera extrinsics in differentiable ways.

To sum up, our major contribution is to enable efficient differentiable rendering on the implicit signed distance function represented as a neural network. It enables accurate 3D shape prediction via geometric reasoning in deep learning frameworks and exhibits outstanding generalization capability. The differentiable renderer could also potentially benefit various vision problems thanks to the marriage of implicit SDF and inverse graphics techniques. The rest of the paper is organized as follows. Section 2 introduces related works on differentiable rendering and implicit continuous functions. In Section 3, we explain the proposed renderer in detail. Section 4 shows the experimental results, followed by a conclusion in Section 5.

## 2 Related Work

3D Representation for Shape Learning
The 3D representation for shape learning is one of the main focuses of the 3D deep learning community.
Early work quantizes shapes into 3D volumes, where each voxel contains either binary occupancy status (occupied / not occupied) [50, 6, 44, 37, 12] or a signed distance value [53, 9, 43].
While voxels are the most straightforward extension from 2D image domain into 3D geometry domain for neural network operations,
they normally require huge memory overhead which leads to relatively low resolutions.
Meshes are also proposed as a more memory efficient representation for 3D shape learning [45, 11, 21, 19], while the topology of meshes is normally fixed and simple.
Many deep learning methods also utilize point clouds as the 3D representation [35, 36]; however, point-based representations lack topology information, which makes it non-trivial to generate 3D meshes.
Very recently, implicit functions, *e.g*., continuous SDF and occupancy function, have been exploited as 3D representations, showing promising performance in terms of high-frequency detail modeling and high resolution [32, 27, 28, 4].
A similar idea has also been used to encode other information such as texture [31, 38] and 4D dynamics [30]. Our work aims to design an efficient and differentiable renderer for the implicit SDF-based representation.

Differentiable Rendering
With the success of deep learning, differentiable rendering has started to draw more attention, as it is essential for end-to-end training.
Depending on 3D representations, different rendering techniques have been proposed.
Early works focus on 3D triangulated mesh as input and leverage standard rasterization [26]. Various approaches try to solve the discontinuity issue near triangle boundaries by smoothing the loss function or approximating the gradient [20, 34, 24, 3].
Solutions for point cloud and 3D volumes are also introduced [46, 17] to work jointly with PointNet [35] and 3D convolutional architecture.
However, the differentiable rendering for the implicit continuous function representation does not exist yet.
Some ray tracing based approaches are related, but they are mostly proposed for explicit representations, such as 3D volume [25, 29, 41] or mesh [22], and not for implicit functions.
Most related to our work, Sitzmann *et al*. [42] propose
an LSTM-based renderer for an implicit scene representation to generate color images, while their model focuses on simulating the rendering process with an LSTM without clear geometric meaning. This method can only generate low-resolution images due to the expensive memory consumption.
In contrast, our method can directly render 3D geometry represented by an implicit SDF to produce high-resolution images. It can also be applied without training to existing deep learning models.

3D Shape Prediction
3D shape prediction from 2D observations is one of the fundamental vision problems. Early works mainly focus on multi-view reconstruction using multi-view stereo methods [39, 14, 40]. These purely geometry-based methods suffer from degraded performance on texture-less regions without prior knowledge [7]. With the progress of deep learning, 3D shapes can be recovered under different settings. The simplest setting is to recover the 3D shape from a single image [6, 10, 49, 18]. These systems rely heavily on priors, and are prone to weak generalization. Deep learning based multi-view shape prediction methods [51, 15, 16, 47, 48] further involve geometric constraints across views in the deep learning framework, which shows better generalization. Another thread of work [9, 8] takes a single depth image as input, and the problem is usually referred to as shape completion. Given the shape prior encoded in the neural network [32], our rendering method can effectively predict accurate 3D object shape from a random initial shape code with various inputs, such as depth and multi-view images, through geometric optimization.

## 3 Differentiable Sphere Tracing

Figure 3: (a) Coarse-to-fine strategy, (b) aggressive marching, (c) convergence criteria.

In this section, we introduce our differentiable rendering method for implicit signed distance function represented as a neural network, such as DeepSDF [32]. In DeepSDF, a network takes a latent code and a 3D location as input, and produces the corresponding signed distance value. Even though such a network can deliver high quality geometry, the explicit surface cannot be directly obtained and requires dense sampling in the 3D space.

Our method is inspired by Sphere Tracing [13] designed for rendering SDF volumes, where rays are shot from the camera pinhole along the direction of each pixel to search for the surface level set according to the signed distance value.
However, it is prohibitive to apply this method directly on the implicit signed distance function represented as a neural network, since each tracing step needs a feedforward neural network and the whole algorithm requires unaffordable computational and memory resources.
To make this idea work in deep learning framework for inverse graphics, we optimize both the forward and backward propagation for efficient training and test-time optimization.
The sphere traced results, *i.e*., the distance along the ray, can be converted into many desired outputs, *e.g*., depth, surface normal, silhouette, and hence losses can be conveniently applied in an end-to-end manner.

### 3.1 Preliminaries - Sphere Tracing

For a self-contained purpose, we first briefly introduce the traditional sphere tracing algorithm [13].
Sphere tracing is a conventional method specifically designed to render depth from volumetric signed distance fields.
For each pixel on the image plane, as shown in Figure 2, a ray ($L$) is shot from the camera center ($\mathbf{c}$) and marches along the direction ($\stackrel{~}{\mathbf{v}}$) with a step size that is equal to the queried signed distance value ($b$).
The ray marches iteratively until it hits or gets sufficiently close to the surface (*i.e*., $|\text{SDF}| < \text{threshold}$).
A more detailed algorithm can be found in Algorithm 3.1.

Algorithm 3.1:

1. Initialize $n=0$, ${d}^{(0)}=0$, ${\mathbf{p}}^{(0)}=\mathbf{c}$.
2. While not converged:
   1. Take the corresponding SDF value ${b}^{(n)}=f(\texttt{round}({\mathbf{p}}^{(n)}))$ of the location ${\mathbf{p}}^{(n)}$ and make the update ${d}^{(n+1)}={d}^{(n)}+{b}^{(n)}$.
   2. ${\mathbf{p}}^{(n+1)}=\mathbf{c}+{d}^{(n+1)}\stackrel{~}{\mathbf{v}}$, $n=n+1$.
   3. Check convergence.
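The loop in Algorithm 3.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: it assumes an analytic sphere SDF (`sphere_sdf`, a hypothetical stand-in for $f$) instead of a voxel lookup with `round`, and the names `sphere_trace`, `eps`, and `max_steps` are chosen here for illustration only.

```python
import numpy as np

def sphere_sdf(p, radius=0.5):
    """Signed distance from point p to a sphere centred at the origin (toy stand-in for f)."""
    return np.linalg.norm(p, axis=-1) - radius

def sphere_trace(c, v, sdf, eps=5e-5, max_steps=50):
    """March one ray from camera centre c along unit direction v.

    Returns the distance travelled along the ray and the number of SDF queries.
    """
    d = 0.0
    for n in range(max_steps):
        p = c + d * v          # current point on the ray
        b = sdf(p)             # queried signed distance
        if abs(b) < eps:       # close enough to the surface: converged
            return d, n + 1
        d += b                 # step forward by the signed distance
    return d, max_steps

c = np.array([0.0, 0.0, -2.0])   # camera centre
v = np.array([0.0, 0.0, 1.0])    # unit ray direction, pointing at the sphere
d, queries = sphere_trace(c, v, sphere_sdf)
```

For this head-on ray the first query already returns the full distance to the surface, so the march converges on the second query.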

### 3.2 Efficient Forward Propagation

Directly applying sphere tracing to an implicit SDF represented by a neural network is prohibitively expensive computationally, because each query of $f$ requires a forward pass of a neural network with considerable capacity. Naive parallelization is not sufficient, since millions of network queries are required for a single rendering at VGA resolution ($640\times 480$). Therefore, we need to cut off unnecessary marching steps and safely speed up the marching progress.

Initialization
Because all the 3D shapes represented by DeepSDF are bounded within the unit sphere, we initialize ${\mathbf{p}}^{(0)}$ to be the intersection between the camera ray and the unit sphere for each pixel.
Pixels with the camera rays that do not intersect with the unit sphere are set as background (*i.e*., infinite depth).
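The initialization step amounts to a ray and unit-sphere intersection, which can be sketched as follows. The function name `unit_sphere_entry` and the handling of a camera placed inside the sphere are assumptions made for this illustration, not details given in the text.

```python
import numpy as np

def unit_sphere_entry(c, v):
    """Distance along each ray to its entry point on the unit sphere.

    c: (3,) camera centre, v: (N, 3) unit ray directions.
    Returns (t, hit): t is the entry distance (inf for background rays),
    hit is a boolean mask of rays that intersect the sphere.
    """
    b = v @ c                              # v . c for every ray
    disc = b**2 - (c @ c - 1.0)            # discriminant of |c + t v|^2 = 1
    root = np.sqrt(np.maximum(disc, 0.0))
    t_near, t_far = -b - root, -b + root
    hit = (disc >= 0.0) & (t_far > 0.0)    # sphere must lie in front of the camera
    # Assumption: if the camera is inside the sphere, start marching at the camera.
    t = np.where(hit, np.maximum(t_near, 0.0), np.inf)
    return t, hit

c = np.array([0.0, 0.0, -2.0])
v = np.array([[0.0, 0.0, 1.0],    # points at the sphere
              [0.0, 1.0, 0.0]])   # misses the sphere: background
t, hit = unit_sphere_entry(c, v)
```

Rays with `hit == False` keep an infinite depth and never enter the marching loop.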

Coarse-to-fine Strategy At the beginning of sphere tracing, rays for different pixels are fairly close to each other, which indicates that they will likely march in a similar way. To leverage this nice property, we propose a coarse-to-fine sphere tracing strategy, which is shown in Fig. 3 (a). We start the sphere tracing from an image with $\frac{1}{4}$ of its original resolution, and split each ray into four after every three marching steps, which is equivalent to doubling the resolution. After six steps, each pixel in the full resolution has a corresponding ray, which keeps marching until convergence.
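A rough sketch of the ray-splitting schedule is given below. It assumes that "$\frac{1}{4}$ of its original resolution" means a quarter of the resolution per side (so that two doublings reach full resolution), and that each child ray simply inherits the distance marched by its parent; both are assumptions of this sketch, and `split_rays` and `coarse_to_fine_schedule` are hypothetical names.

```python
import numpy as np

def split_rays(dist):
    """Split every ray into a 2x2 block of children that inherit its marched distance."""
    return np.repeat(np.repeat(dist, 2, axis=0), 2, axis=1)

def coarse_to_fine_schedule(full_res=512, steps_per_level=3, num_splits=2):
    """Image resolution (per side) used at each of the first marching steps."""
    res = full_res // 2**num_splits
    schedule = []
    for _ in range(num_splits):
        schedule += [res] * steps_per_level   # march a few steps at this resolution
        res *= 2                              # then split each ray into four
    return schedule + [res]                   # full resolution from here on

coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
fine = split_rays(coarse)
schedule = coarse_to_fine_schedule()
```

With the default arguments the first six steps run at reduced resolution, so they cost far fewer queries than six full-resolution steps would.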

Aggressive Marching
After the ray marching begins, we apply an aggressive strategy (Fig. 3 (b)) to speed up the marching progress by updating the ray with $\alpha $ times of the queried signed distance value, where $\alpha =1.5$ in our implementation. This aggressive sampling has several benefits. First, it makes the ray march faster towards the surface, especially when it is far from surface. Second, it accelerates the convergence for the ill-posed condition, where the angle between the surface normal and the ray direction is small.
Third, the ray can pass through the surface such that the space behind it (*i.e*., SDF $<0$) can be sampled. This is crucial for applying supervision on both sides of the surface during optimization.
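The effect of the step multiplier can be illustrated on a toy case. The sketch below assumes a flat surface (the plane $z=0$) seen from height 1, where `cos_theta` is the cosine of the angle between the ray and the plane normal; the function `march` and this setup are illustrative assumptions, not the paper's experiment.

```python
def march(alpha, cos_theta, eps=5e-5, max_steps=1000):
    """Trace one ray towards the plane z = 0 from height 1.

    Returns the list of queried signed distances, one entry per query.
    """
    d, queried = 0.0, []
    for _ in range(max_steps):
        b = 1.0 - d * cos_theta      # signed distance of the current ray point to the plane
        queried.append(b)
        if abs(b) < eps:             # converged
            break
        d += alpha * b               # alpha = 1 is standard sphere tracing
    return queried

slanted_standard = march(alpha=1.0, cos_theta=0.2)
slanted_aggressive = march(alpha=1.5, cos_theta=0.2)
head_on_standard = march(alpha=1.0, cos_theta=1.0)
head_on_aggressive = march(alpha=1.5, cos_theta=1.0)
```

In this toy setting the slanted ray needs fewer queries with $\alpha=1.5$, and the head-on ray overshoots and returns negative signed distances, which is the behaviour that lets points behind the surface be sampled.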

Dynamic Synchronized Inference A naive parallelization for speeding up sphere tracing is to batch the rays together and synchronously update the front end position. However, depending on the 3D shape, some rays may converge earlier than others, thus leading to wasted computation. We maintain a dynamic unfinished mask indicating which rays still require further marching to prevent unnecessary computation.
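The bookkeeping behind the unfinished mask can be sketched with NumPy. This example assumes an analytic sphere SDF in place of the network and a hypothetical `far` bound after which a ray is treated as background; `trace_batch` is an illustrative name, and the query counts only stand in for network forward passes.

```python
import numpy as np

def sphere_sdf(p, radius=0.5):
    """Toy stand-in for the network: SDF of a sphere centred at the origin."""
    return np.linalg.norm(p, axis=-1) - radius

def trace_batch(c, dirs, sdf, dynamic, eps=5e-5, max_steps=50, far=4.0):
    """Sphere-trace a batch of rays; returns (distances, number of SDF queries)."""
    d = np.zeros(len(dirs))
    active = np.ones(len(dirs), dtype=bool)       # the "unfinished" mask
    queries = 0
    for _ in range(max_steps):
        if not active.any():
            break
        # Dynamic mode queries only unfinished rays; naive mode queries every ray.
        idx = np.flatnonzero(active) if dynamic else np.arange(len(dirs))
        b = sdf(c + d[idx, None] * dirs[idx])
        queries += idx.size
        still = active[idx] & (np.abs(b) >= eps)  # rays that have not converged yet
        d[idx] += np.where(still, b, 0.0)
        active[idx] = still & (d[idx] < far)      # rays past `far` count as background
    return d, queries

c = np.array([0.0, 0.0, -2.0])
dirs = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [1.0, 0.0, 0.2]])
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
d_naive, q_naive = trace_batch(c, dirs, sphere_sdf, dynamic=False)
d_dyn, q_dyn = trace_batch(c, dirs, sphere_sdf, dynamic=True)
```

Both modes return the same distances; the dynamic mode simply stops paying for rays that have already finished.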

Convergence Criteria Even with aggressive marching, the ray movement can be extremely slow when close to the surface since $f$ is close to zero. We define a convergence criterion to stop the marching when the accuracy is sufficiently good and the gain is marginal (Fig. 3(c)). To fully maintain the detailed geometry supported by the 2D rendering resolution, it is sufficiently safe to stop when the sampled signed distance value does not confuse one pixel with its neighbors. For an object with a smallest depth of 10 cm captured by a camera with 60 mm focal length, 32 mm sensor width, and a resolution of $512\times 512$, the approximate minimal distance between the corresponding 3D points of two neighboring pixels is ${10}^{-4}$ m (0.1 mm). In practice, we set the convergence threshold $\epsilon$ as $5\times {10}^{-5}$ for most of our experiments.
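The pixel footprint quoted above follows from similar triangles under a pinhole camera model, as the short calculation below shows. The variable names are illustrative; the numbers are the ones given in the text.

```python
sensor_width_m = 32e-3     # 32 mm sensor width
focal_length_m = 60e-3     # 60 mm focal length
resolution_px = 512        # pixels across the sensor
min_depth_m = 0.10         # smallest object depth, 10 cm

# Size of one pixel on the sensor.
pixel_pitch_m = sensor_width_m / resolution_px
# Similar triangles: spacing of neighbouring pixels' 3D points at the given depth.
footprint_m = min_depth_m * pixel_pitch_m / focal_length_m

epsilon = 5e-5             # convergence threshold used in the text
```

The footprint comes out at roughly $10^{-4}$ m, so a threshold of $5\times10^{-5}$ is about half of the spacing between neighbouring pixels at the closest depth.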

### 3.3 Rendering 2D Observations

After all rays converge, we can compute the distance along each ray as the following:

$$d=\alpha \sum _{n=0}^{N-1}f({\mathbf{p}}^{(n)})+(1-\alpha )f({\mathbf{p}}^{(N-1)})={d}^{\prime}+e, \tag{1}$$

where $e=(1-\alpha )f({\mathbf{p}}^{(N-1)})$ is the residual term on the last query. In the following part we will show how this computed ray distance is converted into 2D observations.

Depth and Surface Normal Suppose that we find the 3D surface point $\mathbf{p}=\mathbf{c}+d\stackrel{~}{\mathbf{v}}$ for a pixel $(x,y)$ in the image. Then we can directly get the depth for each pixel as the following:

$${z}_{c}=\frac{d}{\sqrt{{\stackrel{~}{x}}^{2}+{\stackrel{~}{y}}^{2}+1}}, \tag{2}$$

where ${(\stackrel{~}{x},\stackrel{~}{y},1)}^{\top}={K}^{-1}{(x,y,1)}^{\top}$ is the normalized homogeneous coordinate.
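Equation (2) translates directly into code. The sketch below assumes a standard $3\times 3$ pinhole intrinsic matrix `K`; the function name `ray_distance_to_depth` and the example intrinsics are hypothetical.

```python
import numpy as np

def ray_distance_to_depth(d, x, y, K):
    """Convert the distance d along the ray of pixel (x, y) into depth z_c (Equation 2)."""
    x_t, y_t, _ = np.linalg.inv(K) @ np.array([x, y, 1.0])   # normalized coordinates
    return d / np.sqrt(x_t**2 + y_t**2 + 1.0)

# Example intrinsics (assumed values): focal length 500 px, principal point (256, 256).
K = np.array([[500.0,   0.0, 256.0],
              [  0.0, 500.0, 256.0],
              [  0.0,   0.0,   1.0]])

z_center = ray_distance_to_depth(2.0, 256, 256, K)   # ray along the optical axis
z_oblique = ray_distance_to_depth(2.0, 756, 256, K)  # ray at 45 degrees to the axis
```

At the principal point the depth equals the ray distance, while for oblique rays the depth is shorter than the distance travelled along the ray.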

The surface normal of the point $\mathbf{p}(x,y,z)$ can be directly computed as the normalized gradient of the function $f$. Since $f$ is an implicit function, we take the approximation of the gradient by sampling neighboring locations:

$$\mathbf{n}=\frac{1}{2\delta}\left[\begin{array}{c} f(x+\delta ,y,z)-f(x-\delta ,y,z) \\ f(x,y+\delta ,z)-f(x,y-\delta ,z) \\ f(x,y,z+\delta )-f(x,y,z-\delta ) \end{array}\right],\quad \stackrel{~}{\mathbf{n}}=\frac{\mathbf{n}}{|\mathbf{n}|}. \tag{3}$$
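The central-difference normal of Equation (3) can be sketched as follows. The sphere SDF is a toy stand-in for $f$, and the name `sdf_normal` and the value of `delta` are assumptions of this example.

```python
import numpy as np

def sphere_sdf(p, radius=0.5):
    """Toy stand-in for f: SDF of a sphere centred at the origin."""
    return np.linalg.norm(p, axis=-1) - radius

def sdf_normal(f, p, delta=1e-4):
    """Unit surface normal at p from central differences of f (Equation 3)."""
    offsets = delta * np.eye(3)                       # one offset per coordinate axis
    n = np.array([f(p + o) - f(p - o) for o in offsets]) / (2.0 * delta)
    return n / np.linalg.norm(n)

p = np.array([0.0, 0.0, -0.5])        # surface point facing a camera at z = -2
normal = sdf_normal(sphere_sdf, p)
```

Each normal costs six extra queries of $f$, all of which can be batched with the final marching step.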

Silhouette Silhouette is a commonly used supervision for 3D shape prediction. To make the rendering of the silhouette differentiable, we take the minimum absolute signed distance value for each pixel along its ray and subtract the convergence threshold $\epsilon$ from it. This produces a tight approximation of the silhouette, where pixels with positive values belong to the background and pixels with non-positive values belong to the object. Note that directly checking if ray marching stops at infinity can also generate the silhouette, but it is not differentiable.
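A minimal sketch of this silhouette value is shown below. It assumes the SDF values queried along each ray have been stored in a dense array, which is a simplification made for the example; `soft_silhouette` is a hypothetical name.

```python
import numpy as np

def soft_silhouette(sdf_along_ray, eps=5e-5):
    """Per-ray silhouette value: minimum |SDF| along the ray minus the threshold.

    sdf_along_ray: (num_rays, num_samples) signed distances queried during marching.
    Positive output means background, non-positive means the ray touched the surface.
    """
    return np.min(np.abs(sdf_along_ray), axis=1) - eps

samples = np.array([[1.5, 0.3, 1e-5],    # ray that converges onto the surface
                    [1.5, 1.2, 2.0]])    # ray that passes by the object
values = soft_silhouette(samples)
is_background = values > 0.0
```

Because the value is a continuous function of the queried signed distances, a loss on it can pass gradients back to the shape, unlike a hard hit-or-miss test.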

Color and Semantics Recently, it has been shown that texture can also be represented as an implicit function parameterized with a neural network [31]. Beyond color, other spatially varying properties, such as semantics and material, can all potentially be learned by implicit functions. This information can be rendered jointly with the implicit SDF to produce corresponding 2D observations, and some examples are depicted in Fig. 8.

### 3.4 Approximated Gradient Back-Propagation

DeepSDF [32] uses the conditional implicit function to represent a 3D shape as ${f}_{\theta}(\mathbf{p},\mathbf{z})$, where $\theta $ denotes the network parameters, and $\mathbf{z}$ is the latent code representing a certain shape. As a result, each queried point $\mathbf{p}$ in the sphere tracing process is determined by $\theta $ and the shape code $\mathbf{z}$, which requires unrolling the network multiple times and costs huge memory for back-propagation with respect to $\mathbf{z}$:

$$\begin{aligned}\left.\frac{\partial {d}^{\prime}}{\partial \mathbf{z}}\right|_{{\mathbf{z}}_{0}} &=\alpha \sum _{i=0}^{N-1}\left.\frac{\partial {f}_{\theta}({\mathbf{p}}^{(i)}(\mathbf{z}),\mathbf{z})}{\partial \mathbf{z}}\right|_{{\mathbf{z}}_{0}}\\ &=\alpha \sum _{i=0}^{N-1}\left(\frac{\partial {f}_{\theta}({\mathbf{p}}^{(i)}({\mathbf{z}}_{0}),\mathbf{z})}{\partial \mathbf{z}}+\frac{\partial {f}_{\theta}({\mathbf{p}}^{(i)}(\mathbf{z}),{\mathbf{z}}_{0})}{\partial {\mathbf{p}}^{(i)}(\mathbf{z})}\frac{\partial {\mathbf{p}}^{(i)}({\mathbf{z}}_{0})}{\partial \mathbf{z}}\right)\end{aligned} \tag{4}$$

Practically, we ignore the gradients from the residual term $e$ in Equation (1). In order to make back-propagation feasible, we define a loss on the $K$ samples with the minimum absolute SDF value on the ray to encourage more signals near the surface. For each sample, we calculate the gradient with only the first term in Equation (4), as the high-order gradients empirically have less impact on the optimization process. In this way, our differentiable renderer is particularly useful for bridging the gap between the strong shape prior encoded in the network and partial observations. Given a certain observation, we can search for the code that minimizes the difference between the rendering from our network and the observation. This enables a number of applications, which are introduced in the next section.
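Written out per sample, keeping only the first term of Equation (4) amounts to treating the sample location as a constant when differentiating with respect to the latent code. The restatement below is a reading of the paragraph above, under the assumption that ${\mathbf{p}}^{(i)}$ is held fixed at its value for ${\mathbf{z}}_{0}$ (in autodiff terms, a stop-gradient on the sample position):

$$\left.\frac{\partial {f}_{\theta}({\mathbf{p}}^{(i)}(\mathbf{z}),\mathbf{z})}{\partial \mathbf{z}}\right|_{{\mathbf{z}}_{0}}\approx \left.\frac{\partial {f}_{\theta}({\mathbf{p}}^{(i)}({\mathbf{z}}_{0}),\mathbf{z})}{\partial \mathbf{z}}\right|_{{\mathbf{z}}_{0}},\qquad \text{with the term}\quad \frac{\partial {f}_{\theta}}{\partial {\mathbf{p}}^{(i)}}\frac{\partial {\mathbf{p}}^{(i)}}{\partial \mathbf{z}}\quad \text{dropped}.$$

Under this reading, each of the $K$ selected samples would need only a single forward and backward pass through the network, rather than a graph unrolled across all $N$ marching steps.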

## 4 Experiments and Results

In this section, we first verify the efficacy of our differentiable sphere tracing algorithm, and then show that 3D shape understanding can be achieved through geometry based reasoning by our method.

### 4.1 Rendering Efficiency and Quality

Method | size | #step | #query | time |
---|---|---|---|---|
Naive sphere tracing | ${512}^{2}$ | 50 | N/A | N/A |
+ practical grad. | ${512}^{2}$ | 50 | 6.06M | 1.6h |
+ parallel | ${512}^{2}$ | 50 | 6.06M | 3.39s |
+ dynamic | ${512}^{2}$ | 50 | 1.99M | 1.23s |
+ aggressive | ${512}^{2}$ | 50 | 1.43M | 1.08s |
+ coarse-to-fine | ${512}^{2}$ | 50 | 887K | 0.99s |
+ coarse-to-fine | ${512}^{2}$ | 100 | 898K | 1.24s |

Figure 4: Rendered surface normals for parallel, + dynamic, + aggressive, + coarse-to-fine.

Run-time Efficiency In this section, we evaluate the run-time efficiency contributed by each design in our differentiable sphere tracing algorithm. The number of queries and runtime for both forward and backward pass at a resolution of $512\times 512$ on a single NVIDIA GTX-1080Ti are reported in Tab. 1, and the corresponding rendered surface normals are shown in Fig. 4. We can see that the proposed back-propagation prunes the graph and reduces the memory usage significantly, making the rendering tractable with a standard graphics card. The dynamic synchronized inference, aggressive marching and coarse-to-fine strategy all speed up rendering. With all these designs, we can render an image with only 887K query steps within 0.99s when the maximum tracing step is set to 50. The number of query steps only increases slightly when the maximum step is set to 100, indicating that most of the pixels converge safely within 50 steps. Note that related works usually render at a much lower resolution [42].

Figure panels: initial, optimized.

Back-Propagation Effectiveness We conduct sanity checks to verify the effectiveness of the back-propagation with our approximated gradient. We take a pre-trained DeepSDF [32] model and run geometry based optimization to recover the 3D shape and camera extrinsics separately using our differentiable renderer. We first assume camera pose is known and optimize the latent code for 3D shape w.r.t the given ground truth depth, surface normal and silhouette. As can be seen in Fig. 5 (left), the loss drops quickly, and using acceleration strategies does not hurt the optimization. Fig. 5 (right) shows the total loss on the 2D image plane is highly correlated with the Chamfer distance on the predicted 3D shape, indicating that the gradients originated from the 2D observation are successfully back-propagated to the shape. We then assume a known shape (fixed latent code) and optimize the camera pose using depth and a binary silhouette. Fig. 6 shows that a random initial camera pose can be effectively optimized toward the ground truth pose by minimizing the gradients on 2D observation visualized below.

Convergence Criteria The convergence criterion, *i.e*., the threshold on the signed distance at which ray tracing stops, has a direct impact on the rendering quality. Fig. 7 shows the rendering result under different thresholds. As can be seen, rendering with a large threshold dilates the shape, which loses boundary details. Using a small threshold, on the other hand, may produce incomplete geometry. This parameter can be tuned according to the application, but in practice we found our threshold effective in producing complete shapes with details up to the image resolution.

Figure 7: Rendering results under thresholds $\epsilon=5\times {10}^{-2}$, $\epsilon=5\times {10}^{-4}$, $\epsilon=5\times {10}^{-6}$, and $\epsilon=5\times {10}^{-8}$.

Rendering Other Properties Beyond the signed distance function for 3D shape, implicit functions can also encode other spatially varying information. As an example, we train a network to predict both signed distance and color for each 3D location, which grants us the capability of rendering color images. In Fig. 8, we show that with a 512-dim latent code learned from textured meshes as the ground truth, color images can be rendered at arbitrary resolution, camera viewpoints, and illumination. Note that the latent code size is significantly smaller than the mesh (vertices+triangles+texture map), and thus can potentially be used for model compression. Other per-vertex properties, such as semantic segmentation and material, can also be rendered in the same differentiable way.

Figure 8: LR texture, 32x HR texture, HR relighting, HR 2nd view.

### 4.2 3D Shape Prediction

Our differentiable implicit SDF renderer builds up the connection between 3D shape and 2D observations and enables geometry based reasoning. In this section, we show results of 3D shape prediction from a single depth image, or from multi-view color images, using DeepSDF as the shape generator. At a high level, we take a pre-trained DeepSDF and fix the decoder parameters. When given 2D observations, we define proper loss functions and propagate the gradient back to the latent code, as introduced in Section 3.4, to generate the 3D shape. This method does not require any additional training and only needs to run optimization at test time, which is intuitively less vulnerable to overfitting or domain gap issues than purely learning based approaches. In this section, we specifically focus on evaluating the generalization capability while maintaining high shape quality.

#### 4.2.1 3D Shape Prediction from Single Depth Image

With the development of commodity range sensors, dense or sparse depth images can be easily acquired, and several methods have been proposed to solve the problem of 3D shape prediction from a single depth image.
DeepSDF [32] has shown state-of-the-art performance for this task,
but requires offline pre-processing to lift the input 2D depth map into 3D space in order to sample the SDF values with the assistance of the surface normal.
Our differentiable renderer makes 3D shape prediction from a depth image more convenient by directly rendering the depth image given a latent code and comparing it with the given depth.
Moreover, when a silhouette is available, *e.g*., calculated from the depth or provided by the rendering, our renderer can also leverage it as additional supervision.
Formally, we obtain the complete 3D shape by solving the following optimization:

$$\underset{\mathbf{z}}{\mathrm{arg}\,\mathrm{min}}\;{\mathcal{L}}_{d}({\mathcal{R}}_{d}(f(\mathbf{z})),{I}_{d})+{\mathcal{L}}_{s}({\mathcal{R}}_{s}(f(\mathbf{z})),{I}_{s}), \tag{5}$$

where $f(\mathbf{z})$ is the pre-trained neural network encoding shape priors, ${\mathcal{R}}_{d}$ and ${\mathcal{R}}_{s}$ represent the rendering function for depth and silhouette respectively, ${\mathcal{L}}_{d}$ is the ${L}_{1}$ loss of depth observation, and ${\mathcal{L}}_{s}$ is the loss defined based on the differentiably rendered silhouette. In our experiment, the initial latent shape ${\mathbf{z}}_{0}$ is chosen as the mean shape.
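The structure of this inverse optimization can be illustrated on a toy problem. The sketch below makes several loud assumptions: the "decoder" is an analytic sphere whose radius plays the role of the latent code $\mathbf{z}$, only the $L_1$ depth term is used, and the gradient is estimated by finite differences purely for brevity rather than by the approximated back-propagation of Section 3.4. All names (`render_distance`, `depth_loss`) are hypothetical.

```python
import numpy as np

def render_distance(z, c, dirs, eps=1e-6, max_steps=100):
    """Sphere-trace a sphere of radius z (toy stand-in for rendering f(z))."""
    d = np.zeros(len(dirs))
    for _ in range(max_steps):
        b = np.linalg.norm(c + d[:, None] * dirs, axis=1) - z
        d += np.where(np.abs(b) >= eps, b, 0.0)   # converged rays stop moving
    return d

def depth_loss(z, c, dirs, target):
    """L1 loss between rendered and observed ray distances."""
    return np.mean(np.abs(render_distance(z, c, dirs) - target))

c = np.array([0.0, 0.0, -2.0])
dirs = np.stack([np.linspace(-0.1, 0.1, 5), np.zeros(5), np.ones(5)], axis=1)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
target = render_distance(0.5, c, dirs)            # "observation" from the true shape

z, h = 0.3, 1e-3                                  # initial code and finite-difference step
for k in range(100):
    grad = (depth_loss(z + h, c, dirs, target)
            - depth_loss(z - h, c, dirs, target)) / (2.0 * h)
    z -= 0.05 * 0.9**k * grad                     # gradient descent with a decaying step
```

After the loop, `z` should sit close to the radius that generated the observation, mirroring how the latent code is driven towards the shape that explains the given depth.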

We test our method and DeepSDF [32] on 200 models from each of the plane, sofa and table categories of ShapeNet Core [2]. Specifically, for each model, we use the first camera in the dataset of Choy *et al*. [6] to generate dense depth images for testing.
The comparison between DeepSDF and our method is listed in Tab. 2. We can see that our method with only depth supervision performs even better than DeepSDF [32] when a dense depth image is given.
This is probably because DeepSDF samples the 3D space with a pre-defined rule (at fixed distances along the normal direction), which may not necessarily sample the correct locations, especially near object boundaries or thin structures.
In contrast, our differentiable sphere tracing algorithm samples the space adaptively with the current estimation of shape.

Method | dense | 50% | 10% | 100pts | 50pts | 20pts |
---|---|---|---|---|---|---|
sofa | | | | | | |
DeepSDF | 5.37 | 5.56 | 5.50 | 5.93 | 6.03 | 7.63 |
Ours | 4.12 | 5.75 | 5.49 | 5.72 | 5.57 | 6.95 |
Ours (mask) | 4.12 | 3.98 | 4.31 | 3.98 | 4.30 | 4.94 |
plane | | | | | | |
DeepSDF | 3.71 | 3.73 | 4.29 | 4.44 | 4.40 | 5.39 |
Ours | 2.18 | 4.08 | 4.81 | 4.44 | 4.51 | 5.30 |
Ours (mask) | 2.18 | 2.08 | 2.62 | 2.26 | 2.55 | 3.60 |
table | | | | | | |
DeepSDF | 12.93 | 12.78 | 11.67 | 12.87 | 13.76 | 15.77 |
Ours | 5.37 | 12.05 | 11.42 | 11.70 | 13.76 | 15.83 |
Ours (mask) | 5.37 | 5.15 | 5.16 | 5.26 | 6.33 | 7.62 |

Robustness against sparsity The depth from laser scanners can be very sparse, so we also study the robustness of our method and DeepSDF against sparse depth. The results are shown in Tab. 2. Specifically, we randomly sample different percentages or fixed numbers of points from the original dense depth for testing. To make a competitive baseline, we provide DeepSDF ground truth normal to sample SDF, since it cannot be reliably estimated from sparse depth. From the table, we can see that even with very sparse depth observations, our method still recovers accurate shapes and gets consistently better performance than DeepSDF with additional normal information. When silhouette is available, our method achieves significantly better performance and robustness against the sparsity, indicating that our rendering method can back-propagate gradients effectively from the silhouette loss.

#### 4.2.2 3D Shape Prediction from Multiple Images

Figure 9: Video sequence and optimization process.

Our differentiable renderer can also enable geometry based reasoning for shape prediction from multi-view color images. The idea is to leverage cross-view photometric consistency.

Specifically, we first initialize the latent code with a random vector and render depths for each of the input views. We then warp each color image to the other input views using the rendered depth and the known camera pose. The difference between the warped and the input images is then defined as the photometric loss, and the shape can be predicted by minimizing this loss. To sum up, the optimization problem is formulated as follows,

$$\underset{\mathbf{z}}{\mathrm{arg}\,\mathrm{min}}\sum _{i=0}^{N-1}\sum _{j\in {\mathcal{N}}_{i}}\parallel {I}_{i}-{I}_{j\to i}({\mathcal{R}}_{d}^{i}(f(\mathbf{z})))\parallel , \tag{6}$$

where ${\mathcal{R}}_{d}^{i}$ represents the rendered depth image at view $i$, ${\mathcal{N}}_{i}$ is the set of neighboring images of ${I}_{i}$, and ${I}_{j\to i}$ is the image warped from view $j$ to view $i$ using the rendered depth. Note that no mask is required under the multi-view setup. Fig. 9 shows an example of the optimization process of our method. As can be seen, the shape improves gradually as the loss is minimized.
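To make the warping term of the photometric loss concrete, the following is a minimal sketch for a single pair of views. It is not our implementation: it assumes a pinhole camera with shared intrinsics, grayscale images, nearest-neighbor sampling, and an L1 norm, none of which are specified above. The names `warp_to_view_i` and `photometric_loss` are hypothetical. A differentiable version would replace the nearest-neighbor lookup with bilinear interpolation so that gradients can flow to the rendered depth.

```python
import math

def warp_to_view_i(img_j, depth_i, K, R_ij, t_ij):
    """Warp grayscale image j into view i using the depth rendered at view i.

    K = (fx, fy, cx, cy): pinhole intrinsics, assumed shared by both views.
    R_ij (3x3), t_ij (3): map points from camera i's frame to camera j's.
    Pixels without a valid source in image j are returned as None.
    """
    fx, fy, cx, cy = K
    h, w = len(depth_i), len(depth_i[0])
    hj, wj = len(img_j), len(img_j[0])
    warped = [[None] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            d = depth_i[v][u]
            if d is None or d <= 0:
                continue  # the ray did not hit the surface
            # Back-project pixel (u, v) to a 3D point in camera i's frame.
            p = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            # Transform the point into camera j's frame.
            q = [sum(R_ij[r][k] * p[k] for k in range(3)) + t_ij[r]
                 for r in range(3)]
            if q[2] <= 0:
                continue  # point lies behind camera j
            # Project into image j and take the nearest pixel.
            uj = math.floor(fx * q[0] / q[2] + cx + 0.5)
            vj = math.floor(fy * q[1] / q[2] + cy + 0.5)
            if 0 <= uj < wj and 0 <= vj < hj:
                warped[v][u] = img_j[vj][uj]
    return warped

def photometric_loss(img_i, warped):
    """Mean absolute difference over pixels where the warp is defined."""
    diffs = [abs(a - b) for row_i, row_w in zip(img_i, warped)
             for a, b in zip(row_i, row_w) if b is not None]
    return sum(diffs) / len(diffs) if diffs else 0.0

# Toy scene: a fronto-parallel plane at depth 2 with a horizontal ramp texture.
# Camera j is offset so that the plane appears shifted by one pixel.
K = (2.0, 2.0, 0.0, 0.0)
R_ij = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_ij = [1.0, 0.0, 0.0]
img_i = [[10.0 * u for u in range(4)] for _ in range(4)]
img_j = [[10.0 * (u - 1) for u in range(4)] for _ in range(4)]

true_depth = [[2.0] * 4 for _ in range(4)]
wrong_depth = [[1.0] * 4 for _ in range(4)]
loss_true = photometric_loss(img_i, warp_to_view_i(img_j, true_depth, K, R_ij, t_ij))
loss_wrong = photometric_loss(img_i, warp_to_view_i(img_j, wrong_depth, K, R_ij, t_ij))
```

In this toy setup the loss vanishes for the correct depth and is positive for the incorrect one, which is the signal that drives the latent code toward a shape consistent across views.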

Method | car | plane |
---|---|---|
PMO (original) | 0.661 | 1.129 |
PMO (rand init) | 1.187 | 6.124 |
Ours (rand init) | 0.919 | 1.595 |

We take PMO [23] as a competitive baseline, since it also performs deep learning based geometric reasoning, but with a triangular mesh representation. Their model first predicts an initial mesh directly from a selected input view and then applies cross-view photo-consistency to improve its quality. Both the synthetic and the real datasets provided in [23] are used for evaluation.

In Tab. 3, we show a quantitative comparison with PMO on their synthetic test set. Our method achieves results comparable to PMO [23] starting from only random initializations. Note that while PMO uses both the encoder and decoder trained on the PMO training set, our DeepSDF decoder was neither trained nor finetuned on it. Moreover, if the shape code for PMO is also initialized randomly instead of being predicted by their trained image encoder, their performance degrades substantially (from 0.661 to 1.187 on car and from 1.129 to 6.124 on plane), which indicates that our rendering method makes the geometric reasoning more effective.

Generalization Capability. To further evaluate the generalization capability, we compare to PMO on unseen data and initializations. We first evaluate both methods on a test set generated using different camera focal lengths; the quantitative comparison is shown in Fig. 10 (a). It clearly shows that our method generalizes well to the new images, while PMO suffers from overfitting or a domain gap. To further test the effectiveness of the geometric reasoning, we also directly add random noise to the initial latent code. The performance of PMO again drops significantly, while our method is not affected since its initialization is already random (Fig. 10 (b)). Some qualitative results are shown in Fig. 11. Our method produces accurate shapes with detailed surfaces. In contrast, PMO suffers from two main issues: 1) the low-resolution mesh cannot preserve geometric details; 2) their geometric reasoning struggles with the initialization from the image encoder.

We further show a comparison on real data in Fig. 12. Following PMO, we use the provided rough initial similarity transformation to align the camera poses to the canonical frame. As can be seen, both methods perform worse on this challenging dataset. Nevertheless, our method produces shapes with higher quality and correct structure, while PMO only produces a very rough shape. Overall, our method shows better generalization capability and robustness against domain change.

[Figure panels: (a), (b)]

[Figure columns: Video sequence, PMO (rand init), PMO, Ours]

[Figure columns: Video sequence, PMO, Ours]

## 5 Conclusion

We propose a differentiable sphere tracing algorithm to render 2D observations, such as depth, normals, and silhouettes, from implicit signed distance functions parameterized as a neural network. This enables geometric reasoning in 3D shape prediction from both single and multiple views in conjunction with the high-capacity 3D neural representation. Extensive experiments show that our geometry based optimization algorithm produces 3D shapes that are more accurate than those of state-of-the-art methods, generalizes well to new datasets, and is robust to imperfect or partial observations. Promising directions to explore using our renderer include self-supervised learning, recovering other properties jointly with geometry, and neural image rendering.

## References

- [1] (1974) Geometric modeling for computer vision. Technical report STANFORD UNIV CA DEPT OF COMPUTER SCIENCE. Cited by: §1.
- [2] (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §4.2.1, Table 2.
- [3] (2019) Learning to predict 3d objects with an interpolation-based differentiable renderer. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
- [4] (2019) Learning implicit fields for generative shape modeling. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 5939–5948. Cited by: §2.
- [5] (2016) A large dataset of object scans. arXiv preprint arXiv:1602.02481. Cited by: Figure 12.
- [6] (2016) 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In Proc. of European Conference on Computer Vision (ECCV), pp. 628–644. Cited by: §2, §2, §4.2.1, Table 2.
- [7] (2017) Polarimetric multi-view stereo. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 1558–1567. Cited by: §2.
- [8] (2019) Scan2Mesh: from unstructured range scans to 3d meshes. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 5574–5583. Cited by: §2.
- [9] (2017) Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 5868–5877. Cited by: §2, §2.
- [10] (2016) Learning a predictable and generative vector representation for objects. In Proc. of European Conference on Computer Vision (ECCV), pp. 484–499. Cited by: §2.
- [11] (2018) AtlasNet: a papier-mâché approach to learning 3d surface generation. In Proc. of Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
- [12] (2017) Hierarchical surface prediction for 3d object reconstruction. In Proc. of International Conference on 3D Vision (3DV), pp. 412–420. Cited by: §2.
- [13] (1996) Sphere tracing: a geometric method for the antialiased ray tracing of implicit surfaces. The Visual Computer 12 (10). Cited by: Figure 2, §1, §3.1, §3.
- [14] (2008) Multiview photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (3), pp. 548–554. Cited by: §2.
- [15] (2018) Deepmvs: learning multi-view stereopsis. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 2821–2830. Cited by: §2.
- [16] (2019) DPSNet: end-to-end deep plane sweep stereo. arXiv preprint arXiv:1905.00538. Cited by: §2.
- [17] (2018) Unsupervised learning of shape and pose with differentiable point clouds. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2802–2812. Cited by: §2.
- [18] (2017) Scaling cnns for high resolution volumetric reconstruction from a single image. In Proc. of International Conference on Computer Vision (ICCV), pp. 939–948. Cited by: §2.
- [19] (2018) End-to-end recovery of human shape and pose. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 7122–7131. Cited by: §2.
- [20] (2018) Neural 3d mesh renderer. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 3907–3916. Cited by: §2.
- [21] (2017) Using locally corresponding cad models for dense 3d reconstructions from a single image. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 4857–4865. Cited by: §2.
- [22] (2018) Differentiable monte carlo ray tracing through edge sampling. In Proc. of ACM SIGGRAPH, pp. 222. Cited by: §1, §2.
- [23] (2019) Photometric mesh optimization for video-aligned 3d object reconstruction. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 969–978. Cited by: §4.2.2, §4.2.2.
- [24] (2019) Soft rasterizer: differentiable rendering for unsupervised single-view mesh reconstruction. arXiv preprint arXiv:1901.05567. Cited by: §2.
- [25] (2019) Neural volumes: learning dynamic renderable volumes from images. Proc. of ACM SIGGRAPH. Cited by: §2.
- [26] (2014) OpenDR: an approximate differentiable renderer. In Proc. of European Conference on Computer Vision (ECCV), pp. 154–169. Cited by: §2.
- [27] (2019) Occupancy networks: learning 3d reconstruction in function space. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 4460–4470. Cited by: §2.
- [28] (2019) Deep level sets: implicit surface representations for 3d shape inference. arXiv preprint arXiv:1901.06802. Cited by: §2.
- [29] (2018) Rendernet: a deep convolutional network for differentiable rendering from 3d shapes. In Advances in Neural Information Processing Systems (NeurIPS), pp. 7891–7901. Cited by: §2.
- [30] (2019) Occupancy flow: 4d reconstruction by learning particle dynamics. In Proc. of International Conference on Computer Vision (ICCV), pp. 5379–5389. Cited by: §2.
- [31] (2019) Texture fields: learning texture representations in function space. In Proc. of International Conference on Computer Vision (ICCV), Cited by: §2, §3.3.
- [32] (2019) DeepSDF: learning continuous signed distance functions for shape representation. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 165–174. Cited by: §1, §1, §2, §2, §3.4, §3, §4.1, §4.2.1, §4.2.1, Table 1, Table 2.
- [33] (2003) A survey of inverse rendering problems. In Computer graphics forum, Vol. 22, pp. 663–687. Cited by: §1.
- [34] (2019) Pix2Vex: image-to-geometry reconstruction using a smooth differentiable renderer. arXiv preprint arXiv:1903.11149. Cited by: §2.
- [35] (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 652–660. Cited by: §2, §2.
- [36] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5099–5108. Cited by: §2.
- [37] (2017) Octnet: learning deep 3d representations at high resolutions. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 3577–3586. Cited by: §2.
- [38] (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In Proc. of International Conference on Computer Vision (ICCV), Cited by: §2.
- [39] (2006) A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. of Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 519–528. Cited by: §2.
- [40] (2014) A new variational framework for multiview surface reconstruction. In Proc. of European Conference on Computer Vision (ECCV), pp. 719–734. Cited by: §2.
- [41] (2019) Deepvoxels: learning persistent 3d feature embeddings. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 2437–2446. Cited by: §2.
- [42] (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2, §4.1.
- [43] (2018) Learning 3d shape completion under weak supervision. International Journal of Computer Vision (IJCV), pp. 1–20. Cited by: §2.
- [44] (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3d outputs. In Proc. of International Conference on Computer Vision (ICCV), pp. 2088–2096. Cited by: §2.
- [45] (2018) Pixel2mesh: generating 3d mesh models from single rgb images. In Proc. of European Conference on Computer Vision (ECCV), pp. 52–67. Cited by: §2.
- [46] (2019) Differentiable surface splatting for point-based geometry processing. Proc. of ACM SIGGRAPH Asia. Cited by: §2.
- [47] (2019) Conditional single-view shape generation for multi-view stereo reconstruction. In Proc. of Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
- [48] (2019) Pixel2Mesh++: multi-view 3d mesh generation via deformation. In Proc. of International Conference on Computer Vision (ICCV), Cited by: §2.
- [49] (2016) Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems (NeurIPS), pp. 82–90. Cited by: §2.
- [50] (2015) 3d shapenets: a deep representation for volumetric shapes. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 1912–1920. Cited by: §2.
- [51] (2018) Mvsnet: depth inference for unstructured multi-view stereo. In Proc. of European Conference on Computer Vision (ECCV), pp. 767–783. Cited by: §2.
- [52] (1999) Inverse global illumination: recovering reflectance models of real scenes from photographs. In SIGGRAPH, Vol. 99, pp. 215–224. Cited by: §1.
- [53] (2017) 3dmatch: learning local geometric descriptors from rgb-d reconstructions. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 1802–1811. Cited by: §2.