# Neural View-Interpolation for Sparse Lightfield Video

###### Abstract

We suggest representing light field (LF) videos as “one-off” neural networks (NN), *i.e*. a learned mapping from view-plus-time coordinates to high-resolution color values, trained on sparse views.
Initially, this sounds like a bad idea for three main reasons:
First, a NN LF will likely have lower quality than a same-sized pixel-basis representation.
Second, only little training data, *e.g*. 9 exemplars per frame, is available for sparse LF videos.
Third, there is no generalization across LFs, but across view and time instead.
Consequently, a network needs to be trained for each LF video.

Surprisingly, these problems can turn into substantial advantages:
Unlike the linear pixel basis, a NN has to come up with a compact, non-linear, *i.e*. more intelligent, explanation of color, conditioned on the sparse view and time coordinates.
As observed for many NNs, however, this representation is now *interpolatable*: if the image output for sparse view coordinates is plausible, it is so for all intermediate, continuous coordinates as well.
Our specific network architecture involves a differentiable occlusion-aware warping step, which leads to a compact set of trainable parameters and consequently fast learning and fast execution.


## 1 Introduction

Light field (LF) video provides a complete visual representation of a dynamic scene. Regrettably, this capability results in excessive storage, capture and processing requirements. The redundancy in such data appears to be high, but what is the right way of exploiting it? We will here demonstrate how a neural network (NN), involving the right differentiable rendering steps, becomes a compact and interpolatable representation of a LF video.

In particular, we investigate LF video that is sparse. Sparsity in the angular domain means capture from a practical rig of 3$\times $3 cameras instead of hundreds of observations in dense LF. This reduces the amount of data, but introduces the new challenge of interpolation. The same holds in the temporal domain: frame rate can be reduced, but only if additional temporal interpolation is applied. Ultimately, spatial and temporal sparsity can be combined, requiring even more advanced interpolation. Such high-quality, high-speed interpolation is the challenge addressed in this article.

The industry solution to interpolation is to stream sparse images, estimate depth and apply warping [29].
While NNs have been suggested to estimate depth or interpolate, we here, for the first time, suggest representing the entire LF as a NN.
This representation is a mapping from view angle and time (three dimensions) to pixel appearance in a high-resolution image (millions of dimensions).
We train our NN on very sparse data, *e.g*. 3$\times $3 images.
We find that a NN that has come up with a compact, geometrically meaningful non-linear explanation of all observations will also produce suitable non-observed, *i.e*. interpolated, views.
This “interpolating effect” has frequently been observed for NNs optimized for latent encoding-decoding, *e.g*. for faces.
For our task, a well-defined space (view and time) is readily available and the only requirement is to find the right non-linear decoding to benefit from interpolation.
Key to making this work is the right network structure, involving differentiable warping.

The resulting method can learn to represent a LF in a NN in little time and decode it at high frame rates (ca. 20 Hz) and high resolution (1024$\times $1024) for arbitrary continuous view and time coordinates. We compare the resulting quality to several other baselines (NN and classic) as well as to ablations of our approach.

## 2 Previous Work

Our work is rooted in LF and image-based rendering (IBR). It is inspired by general interpolation of 3D information and makes use of differentiable rendering, all of which we review now:

##### Lightfield view interpolation

Levoy and Hanrahan [24] and Gortler *et al*. [15] were first to formalize the concept of a light field and to devise hardware to capture it.
An important distinction is that LFs can either be dense or sparse.
This is less defined by the number of images, but more by the distance between the views.
In this work we focus on wider baselines, with typically $M\times N$ cameras spaced by 5–10 cm [9], respectively a large disparity ranging up to 250 pixels [4, 31], where $M$ and $N$ are single-digit numbers, *e.g*. 3$\times $3 or 5$\times $5.

| Method | View | Time | Sparse | Warp | Neural | One-off | Real-time | High-res |
|---|---|---|---|---|---|---|---|---|
| ULR [2] | ✓ | ✓ | ✓ | ✓ | ✕ | ✕ | ✓ | ✓ |
| Soft 3D [33] | ✓ | ✓ | ✓ | ✓ | ✕ | ✕ | ✓ | ✓ |
| Deep Blending [16] | ✓ | ✓ | ✓ | ✓ | ✓¹ | ✕ | ✓ | ✓ |
| Puppet Dubbing [11] | ✕ | ✓ | ✕ | ✕ | ✓² | ✓ | ✕ | ✓ |
| Video-to-video [43] | ✕ | ✓ | ✕ | ✕ | ✓³ | ✕ | ✕ | ✓ |
| Kalantari et al. [20] | ✓ | ✕ | ✕ | ✓ | ✓⁴ | ✕ | ✕ | ✕ |
| Local LF Fusion [31] | ✓ | ✕ | ✓ | ✓ | ✓⁵ | ✕ | ✓ | ✕ |
| DeepView [9] | ✓ | ✕ | ✓ | ✕ | ✓⁶ | ✕ | ✕ | ✕ |
| Appearance Flow [49] | ✓ | ✕ | ✕ | ✓ | ✓⁷ | ✕ | ✓ | ✕ |
| DeepVoxels [39] | ✓ | ✕ | ✕ | ✕ | ✓⁸ | ✓ | ✕ | ✕ |
| Neural Volumes [28] | ✓ | ✓ | ✓ | ✓¹¹ | ✓⁸ | ✓ | ✓ | ✕ |
| Ours | ✓ | ✓ | ✓ | ✓ | ✓⁹ | ✓ | ✓ | ✓ |

For such sparse data, synthesizing intermediate views (interpolation or super-resolution) is an important problem that has received much attention as summarized in Tbl. 1.
The columns “View” and “Time” specify whether novel views can be derived in spatial and temporal domains, respectively.
“Sparse” refers to the ability to handle LFs with wide baselines.
“Warp” determines whether some form of explicit warping between neighboring views is performed.
“Neural” and “One-off” specify which methods are based on neural networks and whether they need to be trained per scene, *i.e*. they cannot generalize to other scenes but are specific to a particular scene.
The columns “Real-time” and “High-res” indicate that novel view rendering can be performed in real time (at least at 20 Hz) and at high-resolution (we aim for HD, *i.e*. 1920$\times $1080).

A simple solution for interpolation is linear blending, but this leads to ghosting. Unstructured lumigraph rendering (ULR) [2, 3] creates proxy geometry to warp [29] multiple observations into a novel view and blend them with specific weights. Recent work has used per-view geometry [17] or NNs to compute the weights [16]. Our approach does not learn blending, but rather a deep representation of geometry itself that enables precise interpolation with occlusion handling. Originally developed for unstructured sets of images, ULR-style IBR is a workable choice for LF video as well, in particular if analysis and novel-view synthesis have to occur at real-time rates [4]. Avoiding the difficulty of reconstructing geometry has been addressed for LFs, without [7, 21] or with [49] NNs.

An attractive recent idea is to learn to synthesize novel LF views. One option is to explicitly compute a depth map [20, 40] that explains the light field. Our approach follows a similar route, but besides the extension to time, we represent geometry as a NN, such that it becomes interpolatable.

Another option is to decompose the input LF into multiple depth planes of the output view and construct a view-dependent plane sweep volume (PSV) [10, 33, 46]. By learning how neighboring input views contribute to the output view, the multi-plane image (MPI) representation [48] can be built that enables high-quality local LF fusion [31]. Inferring a good MPI representation can be facilitated with learned gradient descent [9], where the gradient components directly encode visibility and effectively inform the NN on the occlusion relations in the scene. All these techniques avoid the problem of explicit depth reconstruction and allow for softer, and more pleasant results. A drawback is the massive volumetric data, the difficulty to distribute occupancy in it, and finally volume rendering itself.

Other work has gone fully volumetric for arbitrary views. Deep Voxels [39] in particular takes a high number of images and learns how to find a deep 3D representation that can be reprojected into many views. Notably, this is a NN trained per scene (column “One-off” in Tbl. 1), but without exploiting the interpolation property. Also, frequently [32], the differentiable tomographic rendering step is learned, while in our approach, a differentiable warping with occlusion handling is used that does not require learning any parameters and can work with off-the-shelf warping. Recent work has extended this into the time domain [28], and is closest to our approach. They also use warping, but for a very different purpose: deforming a pixel-basis 3D representation over time in order to save storing individual frames (motion compensation). Both methods [39, 28] are limited by the spatial 3D resolution of volume texture and the need to process it, while we work in 2D depth and color maps only. Ultimately, we do not claim depth maps or volumes to be better or worse per-se, but would suggest that 3D volumes have their strength for seeing objects from all views (at the expense of resolution), whereas our work, using images, is more for observing scenes from a “funnel” of views, but at high 2D resolution. No work yet is able to combine high resolution and arbitrary views, not to speak of time.

##### Interpolation

Interpolation of sparse observations is an important computer graphics problem, ranging from a single pixel to a full LF, and extends to many domains. We have discussed LF interpolation above, but our work is also inspired by work in other domains.

Interpolating reflectance fields [12] is a related problem, where related solutions have been suggested: Ren and colleagues use a simple one-off neural network for representation [38].
Rainer et al. [36] have used more modern encoding-decoding to compress spatially-varying reflectance.
Maximov et al. [30] encode appearance (the combination of illumination and reflectance) as a NN.
In all these works, observations are spatially registered and generalization is across view or light with no challenges of space-time geometry.
In this work we deal with appearance that changes across space and time.
Videos, like LF videos, consist of discrete frames.
To get smooth results, *e.g*. for slow motion (individual in-between frames) or motion blur (averaging multiple frames), images need to be interpolated, potentially using NNs [42, 27, 41].
More exotic domains of video re-timing, which involve annotation of a fraction of frames and one-off NN training, include the space of visuals in sync with spoken language [11].
Even more extreme is temporally-consistent video content generation using conditional GANs [43].

A key inspiration for this work is the coord-conv trick [25].
Their didactic examples show how a NN, in conjunction with their trick, can make sense of a very limited set of images to a level at which it can fill the gaps faithfully, *i.e*. interpolate with high plausibility.
While their paper shows a single moving white square on a black background, controlled by a pair of coordinates, we apply it to real visual data as complex as LF videos, depending on angle and time.

##### Differentiable rendering

To learn geometric structure from observations with no direct supervision, differentiable rendering has become popular. The MonoDepth system [13] is an excellent example: here, a network learns to regress disparity for pairs of images such that each image in the pair can explain the other. This does not require supervision by depth. We follow the same idea, but extend this to occlusion handling, learning the combination and representing depth itself as a network for interpolation. MonoDepth, among others, uses the Spatial Transform Layer [19] to warp one image of a stereo pair to the other view.

Handling of occlusion is an important computer graphics problem and recently several methods were proposed to include it in a differentiable pipeline. The typical solution is to smooth binary occlusion [26, 34], which is also what we do. In the same vein, for synthesizing appearance from other views, it has been shown [49] that regressing the transformation is superior to regressing appearance itself. Our work extends this to space time and combines it with handling of occlusion.

## 3 Background

Two main observations motivate our approach: First, representing information using NNs leads to interpolation. Second, this property is retained, if the network contains more useful layers, such as a differentiable rendering step. Both will be discussed next:

##### Deep representations help interpolation.

It is well known that deep representations lend themselves to interpolation of 2D images [35, 37, 45], audio [8] or 3D shape [6] much better than the pixel basis.

Consider the blue and orange bumps in Fig. 1, a; these are observed. They represent flat-land functions of appearance (vertical axis), depending on some abstract domain (horizontal axis), that later will become space, or time, or both in our LF video problem. We wish to interpolate something close to the unobserved violet bump in the middle. Linear interpolation in the pixel basis (solid lines) will fade both in, resulting in two flat copies. Visually this would be unappealing and distracting ghosting. This difference is also seen in the continuous setting of Fig. 1, b that can be compared to the reference in Fig. 1, c. When representing the bumps as NNs that map coordinates to color (dotted lines), we note: they are slightly worse than the pixel basis and might not match the observations exactly. However, the interpolated, unobserved result is much closer to the reference, and this is what matters in LF video interpolation.
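The ghosting argument can be reproduced in a few lines. The sketch below is only an illustration under stated assumptions: the hand-written function `bump` stands in for a learned decoder, and the Gaussian shape, the positions 0.3 and 0.7, and the width are hypothetical choices, not values from the figure.

```python
import math

def bump(center, n=101, width=0.05):
    """Sample a unit-height Gaussian bump on n points covering [0, 1]."""
    return [math.exp(-((i / (n - 1) - center) ** 2) / (2 * width ** 2))
            for i in range(n)]

# Two observed signals: the same bump at two different positions.
left, right = bump(0.3), bump(0.7)

# Pixel-basis interpolation: blend the sample values half-way.
blended = [0.5 * a + 0.5 * b for a, b in zip(left, right)]

# Parametric interpolation: blend the parameter (the position) instead.
moved = bump(0.5 * 0.3 + 0.5 * 0.7)

mid = 50  # sample index of position 0.5
print(f"blend: centre {blended[mid]:.3f}, peak {max(blended):.3f}")
print(f"moved: centre {moved[mid]:.3f}, peak {max(moved):.3f}")
```

Blending leaves two half-height copies and almost nothing at the centre, whereas interpolating the parameter yields a single full-height bump in between.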

Typically, substantial effort is made to construct encoding into and decoding from these deep representations such as with auto-encoders [18], variational auto-encoders [22] or adversarial networks [14]. In our problem we already have the latent space given as beautifully laid-out space-time LF coordinates and only need to learn to decode these into images.

##### (Differentiable) rendering is just another non-linearity.

The second key insight is that the above property is not affected by inserting more advanced layers such as differentiable rendering (warping) into a learnable pipeline. Fig. 1 shows interpolation of colors over space. We find the interpolation property to be retained if the NN is made more fit for purpose than, say, vanilla regression of appearance using a multi-layer perceptron (MLP) or convolutional neural network (CNN).

CNNs without the coord-conv [25] trick are particularly bad at such spatially-conditioned generation. But even with coord-conv, this complex function is unnecessarily hard to find and slow to fit. In particular, consider comparing such a CNN/MLP to a design that is able to sample the observations such as proposed by spatial transformers [19, 49], a very primitive form of differentiable rendering. Learning all stripes in Fig. 2 is harder than learning how they move coherently when changing view.
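For readers unfamiliar with coord-conv, the idea is to append each pixel's own position as extra channels, so that later convolutions can condition on location. The following is a minimal sketch in plain Python; the function name `add_coord_channels`, the nested-list layout and the normalisation to [-1, 1] are assumptions made for illustration.

```python
def add_coord_channels(features):
    """Append normalised row and column coordinates to every pixel.

    features: list of rows; each row is a list of per-pixel channel lists.
    """
    h, w = len(features), len(features[0])
    out = []
    for i, row in enumerate(features):
        y = 2.0 * i / (h - 1) - 1.0 if h > 1 else 0.0   # row coordinate in [-1, 1]
        out_row = []
        for j, channels in enumerate(row):
            x = 2.0 * j / (w - 1) - 1.0 if w > 1 else 0.0   # column coordinate
            out_row.append(list(channels) + [y, x])
        out.append(out_row)
    return out

grid = [[[0.5], [0.5]], [[0.5], [0.5]]]   # 2x2 map with one feature channel
with_coords = add_coord_channels(grid)
```

All four pixels carry the same feature value here, yet after the call each one is distinguishable by its two coordinate channels.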

We will now detail our work, motivated by those observations.

## 4 Our Approach

We will first give a formal definition of the function we learn, followed by the network architecture we choose for implementing it.

### 4.1 Objective

We represent the light field video as a non-linear function ${f}_{\theta}(\mathbf{x})\in \mathcal{X}\to {\mathbb{R}}^{{n}_{\mathrm{p}}},$
where $\mathbf{x}=(u,v,t)\in \mathcal{X}$ is the *light field coordinate* (two spatial coordinates $u,v$ and a temporal dimension $t$) in the light field coordinate system $\mathcal{X}\subset {\mathbb{R}}^{3}$,
and ${n}_{\mathrm{p}}$ is the number of pixels (millions).
Please note the different coordinate systems (two-plane parametrization of a LF): $\mathbf{x}$ allocates different images in space and time, not horizontal or vertical coordinates inside an image.

We denote as $\mathcal{Y}\subset \mathcal{X}$ the subset of *observed* LF coordinates for which we know the *light field image* denoted as $L(\mathbf{y})$.
Typically $|\mathcal{Y}|$ is sparse, *i.e*. small, like 3$\times $3 or 5$\times $5.
We find this mapping $f$ by optimizing for

$$\theta =\underset{{\theta}^{\prime}}{\mathrm{arg}\,\mathrm{min}}\;{\mathbb{E}}_{\mathbf{y}\sim \mathcal{Y}}{||{f}_{{\theta}^{\prime}}(\mathbf{y})\ominus L(\mathbf{y})||}_{1}$$

where ${\mathbb{E}}_{\mathbf{y}\sim \mathcal{Y}}$ is the expected value across all the discrete and sparse LF coordinates $\mathcal{Y}$ and $\ominus $ is a perceptual image difference. In prose, we train a NN $f$ to map 3D LF coordinates $\mathbf{y}$ to images of observed samples $L(\mathbf{y})$ of the light field.

Note that training never evaluates any LF coordinate $\mathbf{x}$ that is not in $\mathcal{Y}$, as we would not know what the image $L(\mathbf{x})$ at that location is. But as $f$ is an “intelligent” non-linear explanation for a few $\mathbf{y}$, it generalizes from the discrete observed coordinates $\mathcal{Y}$ to the unobserved continuous $\mathcal{X}$. Note that, as we aim for interpolation, $\mathcal{X}$ consists of convex combinations of $\mathcal{Y}$ and does not extend beyond.
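As a rough sketch of this objective, the code below averages a reconstruction error over the observed coordinates only. It makes two simplifying assumptions: the perceptual difference is replaced by a plain mean absolute difference, and the two "models" `perfect` and `constant` are hypothetical stand-ins rather than trained networks.

```python
def l1_distance(a, b):
    """Mean absolute difference between two images given as flat lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def objective(f, observed):
    """Average reconstruction error of f over the observed coordinates.

    observed: dict mapping an LF coordinate (u, v, t) to its image L(y).
    """
    errors = [l1_distance(f(y), image) for y, image in observed.items()]
    return sum(errors) / len(errors)

# Toy data: two observed views of a three-pixel "image".
observed = {(0.0, 0.0, 0.0): [0.0, 0.5, 1.0],
            (1.0, 0.0, 0.0): [1.0, 0.5, 0.0]}

def perfect(y):    # hypothetical model that reproduces every observation
    return observed[y]

def constant(y):   # hypothetical model that ignores the coordinate
    return [0.5, 0.5, 0.5]

print(objective(perfect, observed), objective(constant, observed))
```

Only coordinates present in `observed` are ever passed to the model, mirroring the fact that no unobserved coordinate is evaluated during training.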

### 4.2 Architecture

Our pipeline $f$, depicted in Fig. 3, has three main steps summarized in Sec. 4.2.1: representing space-time geometry using a NN denoted $\mathtt{geom}$ (Sec. 4.2.2), warping $\mathtt{warp}$ according to that representation (Sec. 4.2.3) and resolving occlusion (Sec. 4.2.4) using a step $\mathtt{occ}$.

The system is implemented in TensorFlow and trained using the ADAM optimizer with a learning rate of 0.0001.

#### 4.2.1 Interpolation

We first resolve spatial interpolation (Eq. 1), followed by a temporal one (Eq. 2). This choice is arbitrary but results in subtle differences.

Spatial interpolation creates an intermediate LF defined as

$$\overline{L}(\mathbf{x})=\sum _{\mathbf{y}\in {\mathcal{N}}_{\mathrm{s}}(\mathbf{x})}\mathtt{occ}(\mathbf{x},\mathbf{y})\odot \mathtt{warp}(L(\mathbf{y}),\mathtt{geom}(\mathbf{y}),\mathbf{x}), \tag{1}$$

where ${\mathcal{N}}_{\mathrm{s}}(\mathbf{x})$ is the set of all spatial neighbors of $\mathbf{x}$ and $\odot $ is per-pixel (Hadamard) multiplication. Interpolation sums over all spatial neighbors ${\mathcal{N}}_{\mathrm{s}}(\mathbf{x})$ but excludes the observation at $\mathbf{x}$ itself. In Fig. 3, different inputs to $f$ with different coordinates are encoded as colors. Every observation (blue-green colors) has to explain itself using geometry from all others at training time. In this example the 4 observations regularly cover the unit space-time quad, with a single spatial dimension only. At test time, the pipeline is run with a continuous, in-between $\mathbf{x}$, denoted as an orange dot.

For each observation, three steps occur: $\mathtt{geom}$, $\mathtt{warp}$ and $\mathtt{occ}$. Geometry at that LF coordinate $\mathbf{y}$ is represented using a trainable unit $\mathtt{geom}$ (Sec. 4.2.2). Fig. 3 shows the output of that unit in its center, both for training (blue-greenish) and for testing (orange) coordinates. Using this geometry, the observation is warped to the desired unobserved light field coordinate $\mathbf{x}$ using $\mathtt{warp}$ (Sec. 4.2.3).

Finally, the warped observation is weighted using a soft occlusion term $\mathtt{occ}$ that will give lower weights if the value required at $\mathbf{x}$ was occluded in $\mathbf{y}$ (Sec. 4.2.4). Fig. 3 shows dense links between warping and occlusion as all warped observations are resolved jointly.
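How the three steps compose in Eq. 1 can be sketched as follows. This is a structural illustration only: images are flat lists, the callables `L`, `geom`, `warp` and `occ` are passed in as placeholders, and the stubs in the usage part (identity warp, constant weights, string coordinates) are hypothetical and exist solely to make the sketch runnable.

```python
def interpolate_spatial(x, neighbours, L, geom, warp, occ):
    """Eq. 1: occlusion-weighted sum of the neighbours warped to x."""
    n_pixels = len(L(neighbours[0]))
    result = [0.0] * n_pixels
    for y in neighbours:
        warped = warp(L(y), geom(y), x)   # neighbour image moved to x
        weight = occ(x, y)                # per-pixel weight of that neighbour
        for p in range(n_pixels):
            result[p] += weight[p] * warped[p]   # Hadamard product, then sum
    return result

# Usage with trivial stubs: two neighbours, identity warp, equal weights.
images = {"a": [0.0, 0.0], "b": [1.0, 1.0]}
result = interpolate_spatial(
    x="mid",
    neighbours=["a", "b"],
    L=images.get,
    geom=lambda y: None,
    warp=lambda image, geometry, x: image,
    occ=lambda x, y: [0.5, 0.5],
)
```

With these stubs the result is simply the average of the two neighbour images; the actual behaviour depends entirely on the geometry, warping and occlusion steps described in the following subsections.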

Similar to the spatial one, temporal interpolation is

$$f(\mathbf{x})=\sum _{{\mathbf{x}}^{\prime}\in {\mathcal{N}}_{\mathrm{t}}(\mathbf{x})}\mathtt{occ}(\mathbf{x},{\mathbf{x}}^{\prime})\odot \mathtt{warp}(\overline{L}({\mathbf{x}}^{\prime}),\mathtt{geom}({\mathbf{x}}^{\prime}),\mathbf{x}), \tag{2}$$

where ${\mathcal{N}}_{\mathrm{t}}(\mathbf{x})$ are the temporal neighbors of $\mathbf{x}$ and $\overline{L}({\mathbf{x}}^{\prime})$ is the spatially interpolated lightfield resulting from Eq. 1. At the temporal neighbors ${\mathbf{x}}^{\prime}$ (Fig. 4), the light field $L({\mathbf{x}}^{\prime})$ is not observed. Hence, the spatial interpolation $\overline{L}$ is used as a proxy. We denote coordinates into the already-interpolated LF as ${\mathbf{x}}^{\prime}$.

#### 4.2.2 Decoder

Input to the decoder is the LF coordinate $\mathbf{x}$ and output is a per-pixel space-time geometry map

$$\mathtt{geom}(\mathbf{x})\in \mathcal{X}\to \mathrm{\Omega},$$

where $\mathrm{\Omega}={(0,1)}^{{n}_{\mathrm{p}}\times 3}$. This map has three channels for all ${n}_{\mathrm{p}}$ pixels $\omega \in \mathrm{\Omega}$. The first one, ${\omega}_{\mathrm{z}}$, is related to space: a depth map. The second and third components, ${\omega}_{\mathrm{u}}$ and ${\omega}_{\mathrm{v}}$, are related to motion: a flow map. As the camera transformations between views are known, this is sufficient to map from one view to another. Temporal information is in units of 2D per-pixel motion and frames are regular in time. The decoder could also be considered a conditional generator. We detail the use of this space-time geometry information further when explaining the details of warping (Sec. 4.2.3).

Please also note that no pixel-basis observation $L(\mathbf{y})$ is ever input to the decoder and, hence, all geometric structure is encoded in the network. Recalling Sec. 3, we see that this is a burden, but also required to achieve the desired interpolation property: if the geometry NN can explain the observations at a few $\mathbf{y}$, it can explain their interpolation at all $\mathbf{x}$. This also justifies why $\mathtt{geom}$ is a NN and we do not directly learn a pixel-basis depth-motion map: it would not be interpolatable.

We found the particular details of $\mathtt{geom}$ to be less relevant. Our implementation starts with a fully connected operation that transforms the coordinate into a 2$\times $2 image with 128 channels. The coord-conv [25] information is added at that stage. This is followed by as many steps as it takes to arrive at the output resolution, reducing the number of channels to arrive at 3 output channels in the end. Note that skip connections are not applicable to our setting, as the decoder input is a mere three numbers without any spatial meaning.
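To get a feeling for how many steps such a decoder takes, the sketch below lists one possible sequence of stages. Only the 2$\times $2 start with 128 channels and the 3 output channels come from the text; doubling the resolution per step, halving the channels per step, the floor of 8 channels and the restriction to power-of-two output sizes are assumptions made for illustration.

```python
import math

def decoder_schedule(out_res, start_res=2, start_channels=128, out_channels=3):
    """List (resolution, channels) per stage, doubling resolution each step."""
    steps = math.ceil(math.log2(out_res / start_res))
    stages = [(start_res, start_channels)]
    res, ch = start_res, start_channels
    for step in range(steps):
        res *= 2
        is_last = step == steps - 1
        # Assumption: halve the channels each step, but never go below 8;
        # the final step emits the three geometry channels.
        ch = out_channels if is_last else max(ch // 2, 8)
        stages.append((res, ch))
    return stages

stages = decoder_schedule(1024)
for res, ch in stages:
    print(f"{res:>4} x {res:<4} with {ch:>3} channels")
```

Under these assumptions, nine doublings lead from 2$\times $2 to 1024$\times $1024.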

#### 4.2.3 Warping

Warping is defined as the mapping

$$\mathtt{warp}(I,\omega ,\mathbf{x})\in \mathcal{I}\times \mathrm{\Omega}\times \mathcal{X}\to \mathcal{I}$$

from the LF slice $I\in \mathcal{I}$ (an image), the space-time geometry $\omega $ and the unobserved LF coordinate $\mathbf{x}$ to an image.

Warping needs to interpret the geometry $\omega $ to get an idea where to sample [19] the observed image $I$.
This involves constructing an inverse warp, *i.e*. finding which pixel in the observed $I$ maps to each pixel in $L(\mathbf{x})$.
Constructing such inverse warps is possible [47, 5, 1, 23], but requires operations that are difficult to back-propagate.
Instead, we make the simplifying assumption that the inverse flow is the negation of the forward flow.
This assumption holds for planar geometry [1].
Different from warping where the depth is given (like a z-buffer from rasterization), our method optimizes over depth to please warping.
Learning will choose depth values such that, when inserted into the warping, they best explain the image.
This includes avoiding depth that causes difficulties for warping, *i.e*. deviations from the model assumptions.
It could be said that here, data is fit to the model.

Constructing the forward flow is done differently for space and time. In space, we use the known spatial arrangement (baseline, directions, etc.) to convert the first channel of the geometry into disparity. In time, flow is assumed to be symmetric at the current frame, so finding backward motion from forward motion is simple negation. These assumptions drastically reduce the degrees of freedom to one in space and two in time, and impose additional constraints that regularize the problem.

Note that $\mathtt{warp}$, while differentiable, does not have any learnable parameters and is very efficient at deployment: a single bi-linear texture lookup.
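A one-dimensional analogue of this lookup may help. The sketch below warps a single scanline and is deliberately simplified: it assumes that the geometry channel `z` is directly proportional to disparity (scaled by a hypothetical `max_disparity` in pixels) and that views differ by a scalar `offset` along the baseline. The inverse flow is taken as the negated forward flow, as described above, and the lookup is linear instead of bi-linear.

```python
def sample_linear(row, pos):
    """Linearly interpolate row at a fractional position, clamped to the border."""
    pos = min(max(pos, 0.0), len(row) - 1.0)
    i = min(int(pos), len(row) - 2)
    frac = pos - i
    return (1.0 - frac) * row[i] + frac * row[i + 1]

def warp_scanline(row, z, offset, max_disparity):
    """Warp one scanline to a view displaced by `offset` along the baseline."""
    out = []
    for p in range(len(row)):
        forward = z[p] * max_disparity * offset      # forward flow in pixels
        out.append(sample_linear(row, p - forward))  # inverse flow = -forward
    return out

row = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # one bright pixel at index 2
z = [0.5] * len(row)                   # constant geometry value
shifted = warp_scanline(row, z, offset=1.0, max_disparity=2.0)
half = warp_scanline(row, z, offset=0.5, max_disparity=2.0)
```

For the full offset the bright pixel moves by one pixel; for half the offset it is spread over two neighbouring pixels by the linear lookup, which is what keeps the operation differentiable with respect to `z`.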

#### 4.2.4 Soft occlusion implementation

To combine an interpolation from an input LF coordinate ${\mathbf{x}}^{\prime}$ with an output LF coordinate $\mathbf{x}$, we again make use of the learned geometry model: the model has to report the depth of $\mathbf{x}$ to be smaller than that of ${\mathbf{x}}^{\prime}$ for that point to be visible. Such a hard decision, however, is not differentiable and introduces visually distracting discontinuities. For those two reasons, we make use of a soft occlusion term, defined as

$$\mathtt{occ}(\mathbf{x},{\mathbf{x}}^{\prime})=\frac{\exp(-|\mathtt{geom}{(\mathbf{x})}_{\mathrm{z}}-\mathtt{geom}{({\mathbf{x}}^{\prime})}_{\mathrm{z}}|)}{\sum_{i}\exp(-|\mathtt{geom}{(\mathbf{x})}_{\mathrm{z}}-\mathtt{geom}{(\mathbf{x}^{\prime}_{i})}_{\mathrm{z}}|)}.$$

In other words, depth values from observed images are down-weighted when the position they are taken from indicates that they would not be similar to the pixel depth they will end up with. The denominator makes sure the positive weights form a partition of unity by iterating over all other $\mathbf{x}^{\prime}_{i}$ contributing to occlusion handling. Note that this weighting is a global construction: the weight of one observation depends on all others as well as on the output coordinate.
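For a single pixel, the soft occlusion term amounts to a normalised exponential of negative depth differences. The sketch below evaluates it in plain Python; the function name `soft_occlusion` and the example depth values are hypothetical.

```python
import math

def soft_occlusion(depth_target, depths_sources):
    """Per-source weights exp(-|z_x - z_i|), normalised to sum to one."""
    scores = [math.exp(-abs(depth_target - z)) for z in depths_sources]
    total = sum(scores)
    return [s / total for s in scores]

# One output pixel with depth 0.2, seen by two sources that report
# depth 0.2 (consistent) and depth 0.9 (likely a different surface).
weights = soft_occlusion(0.2, [0.2, 0.9])
print([round(w, 3) for w in weights])
```

The source whose depth agrees with the output pixel receives the larger weight, but the other source is only attenuated rather than cut off, which keeps the term differentiable.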

## 5 Results

Here we will provide comparison to other work (Sec. 5.1), evaluation of our scalability (Sec. 5.2) as well as applications (Sec. 5.3).

### 5.1 Comparison

We compare our approach to other *methods*, following a specific *protocol* and by different *metrics* to be explained now:

##### Methods

We compare six methods: Ours, Blending, Warping, NeuralVolumes, and the two ablations NoWarping and NoOcclusion.

Linear Blending is not a serious method, but documents the sparsity: as it is plagued by ghosting for small baselines, we see that our baseline and sparsity pose a difficult interpolation task, far from linear.

Warping first estimates the depth using a light field video depth estimation method [4] and later applies warping [29] with ULR-style weights [2]. These depth estimators include consistency voting, eliminate outliers, perform bilateral sampling etc. and can be considered an upper bound on what classic methodology is able to do today. Note how ULR weighting accounts for occlusion.

NeuralVolumes compares to recent NN-based view interpolation making use of volumes and ray-marching [28, 39]. We choose to compare to the more recent NeuralVolumes [28], as it also supports time. We assume that their method perfectly manages to create the ground-truth volume at the 3D resolution they used (${128}^{3}$) and that it is able to ray-march it without any error to produce a ${128}^{2}$ image. Hence, we simply down-sample the ground truth to ${128}^{2}$ and up-sample it again.

Finally, we compare two ablations of our method. The first, NoWarping uses direct regression of color values without warping. The second, NoOcclusion does not perform occlusion reasoning but averages directly.

##### Protocol

Success is quantified as the expected ability of a method to predict a set of held-out LF observed coordinates $\mathscr{H}$ when trained on $\mathcal{Y}-\mathscr{H}$, *i.e*. ${\mathbb{E}}_{\mathbf{h}\sim \mathscr{H}}f(\mathbf{h}){\ominus}_{\mathrm{m}}L(\mathbf{h})$, where ${\ominus}_{\mathrm{m}}$ is one of the metrics to be defined below.
The held-out set can be a single observation or multiple observations, and can span space, time or both.
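The protocol can be sketched as a small helper that trains on all observations except the held-out ones and scores only those. Everything below is illustrative: `fit_mean` is a hypothetical stand-in for training (it just averages the training images, similar in spirit to blending), `mean_abs` is a placeholder metric, and the one-pixel toy images are made up.

```python
def held_out_error(fit, observed, held_out, metric):
    """Train on observed minus held_out, then score predictions at held_out."""
    training = {y: img for y, img in observed.items() if y not in held_out}
    model = fit(training)
    errors = [metric(model(h), observed[h]) for h in held_out]
    return sum(errors) / len(errors)

def fit_mean(training):
    """Hypothetical 'training': predict the mean training image everywhere."""
    images = list(training.values())
    mean = [sum(img[p] for img in images) / len(images)
            for p in range(len(images[0]))]
    return lambda x: mean

def mean_abs(a, b):
    """Placeholder metric: mean absolute difference."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

# Three views along one spatial axis; the middle one is held out.
observed = {(0.0,): [0.0], (0.5,): [0.4], (1.0,): [1.0]}
error = held_out_error(fit_mean, observed, held_out=[(0.5,)], metric=mean_abs)
print(round(error, 3))
```

The held-out image never reaches `fit`, so the reported error reflects interpolation quality rather than memorisation.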

##### Metrics

For comparing the predicted to the held out view we use ${L}_{2}$, DSSIM and VGG. In all cases, less is better. We also measure the joint time of pre-processing, if required.
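Two of these metrics are simple enough to sketch. The code below computes the mean squared error and a DSSIM value; it assumes the common definition DSSIM = (1 - SSIM) / 2 and, as a further simplification, evaluates SSIM from global image statistics instead of local windows, so its numbers are not comparable to those reported here. The VGG metric requires a pre-trained network and is not sketched.

```python
def mse(a, b):
    """Mean squared error between two images given as flat lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def dssim_global(a, b, dynamic_range=1.0):
    """DSSIM = (1 - SSIM) / 2, with SSIM computed from global statistics."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((p - mu_a) ** 2 for p in a) / n
    var_b = sum((q - mu_b) ** 2 for q in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    # Standard SSIM stabilisation constants.
    c1, c2 = (0.01 * dynamic_range) ** 2, (0.03 * dynamic_range) ** 2
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    return (1.0 - ssim) / 2.0

image = [0.0, 1.0, 0.0, 1.0]
inverted = [1.0, 0.0, 1.0, 0.0]
print(mse(image, inverted), dssim_global(image, image), dssim_global(image, inverted))
```

Both functions return zero for identical images and grow with dissimilarity, in line with the "less is better" convention.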

##### Data

##### Results

Fig. 5 summarizes the outcome of the main comparison. We see that our method provides the best quality in all tasks according to all metrics on all domains.

For example images corresponding to the plots in Fig. 5, please see Fig. 6 for interpolation in space, Fig. 7 for time and Fig. 8 for space-time results.
In each figure we document the input view and multiple insets that show the results by all competing methods.
Linear blending does not work and shows that views are substantially different and have complex disparity.
NoWarping can regress plausible colors, but without details.
Warping produces crisp images, but with pixel-level outliers that are distracting in motion, *e.g*. for the bench in Fig. 6.
NoOcclusion results in crisper images, but when multiple objects overlap it results in ghosting.
NeuralVolumes cannot reproduce details, as seen in the shirt of the third line in Fig. 6.
Ours has details, plausible motion and is generally most similar to the ground truth.
The temporal interpolation comparison in Fig. 7 indicates similar conclusions: Blending is no option, and not handling occlusion, also in time, creates ghosting due to overlap.
We do not show NeuralVolumes for the results to follow as it is a smooth version of our ground truth images with a clear lack of details.
Ultimately, Ours is similar to the ground truth.
The motion smoothness is best seen in the slow-mo application of the supplemental video.
When interpolating across space and time as in Fig. 8, ghosting effects get stronger for others, as images get increasingly different.
Ours can have difficulties where deformations are not fully rigid, as seen for faces, but overall compensates for this to produce plausible images.
Space-time results are also shown in the supplemental video.

We conclude that, both numerically and visually, our approach can produce state-of-the-art interpolation in view and time in high spatial resolution and at high frame rates. We will next look into evaluating the dependency of this success on different factors.

### 5.2 Evaluation

Here we evaluate our approach in terms of scalability with training effort and observation sparsity, speed and detail reproduction.

##### Training effort

Our approach needs to be trained again for every LF. Typical training times are listed in Tbl. 2. The results shown for time and space-time interpolation were obtained after one hour of training for a 3$\times $5 camera array with a resolution of 960$\times $540 and 30 frames.

| Resolution | 512$\times $512 | 1024$\times $1024 | 1764$\times $1764 |
|---|---|---|---|
| Training time | 28 min. | 60 min. | 172 min. |
| Network parameters | 482,753 | 492,001 | 492,001 |

Fig. 9 shows the progression of interpolation quality over learning time. Even after little training, results can be acceptable, or at least better than those of all competitors after complete training.

Overall, learning the NN requires a workable amount of time compared to other networks, which require on the order of many hours or days on many GPUs.

##### Observation sparsity

We interpolate from extremely sparse data. In Tbl. 3 we evaluate the quality of our interpolation depending on the number of training exemplars. A visual representation of that progression is seen in Fig. 10.

| LF | VGG19 | L2 | SSIM |
|---|---|---|---|
| 3$\times $3 | 261 | .010 | .66 |
| 5$\times $5 | 205 | .005 | .82 |
| 9$\times $9 | 148 | .003 | .89 |

*Crystal Ball* scene with resolution 512$\times $512 using different metrics (columns) for different angular density (rows).
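
To make two of the reported metrics concrete, the following is a minimal sketch of an L2 error and a simplified SSIM. It rests on several assumptions that are not taken from the evaluation itself: images are flat lists of intensities in [0, 1], L2 is computed as the mean squared difference, and SSIM is evaluated once over the whole image with the commonly used constants 0.01 and 0.03, whereas standard SSIM averages over local windows. The VGG19 metric is omitted because it requires a pretrained network.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def l2_error(a, b):
    """Mean squared difference between two images given as flat lists."""
    return mean([(x - y) ** 2 for x, y in zip(a, b)])


def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM over the whole image (simplified, no local windows)."""
    c1 = (0.01 * data_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilizes the contrast/structure term
    mu_a, mu_b = mean(a), mean(b)
    var_a = mean([(x - mu_a) ** 2 for x in a])
    var_b = mean([(y - mu_b) ** 2 for y in b])
    cov = mean([(x - mu_a) * (y - mu_b) for x, y in zip(a, b)])
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


reference = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5]   # hypothetical ground truth
oversmoothed = [0.5] * 6                        # same mean, all detail lost

print("L2  :", l2_error(reference, oversmoothed))
print("SSIM:", global_ssim(reference, oversmoothed))
```

Lower is better for L2, while higher is better for SSIM. In this toy case the oversmoothed image keeps the correct mean brightness but scores a low SSIM because its structure is gone.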

##### Speed

At deployment, our method requires no more than taking three numbers and passing them through a decoder for each observation, followed by warping and weighting. The resulting speed is around 20 Hz (on average 46 ms per frame) at 1024$\times $1024 for a 5$\times $5 LF on an Nvidia 1080Ti with 12 GB RAM. Most of the time (31 ms) is consumed by the non-optimized warping step in TensorFlow, which could be made much faster with an OpenGL implementation [5, 1].
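
The timing figures above allow a quick back-of-the-envelope estimate of how much a faster warping step could help. The sketch below only uses the two reported numbers (46 ms per frame, of which 31 ms is warping); the 10x warping speed-up is a purely hypothetical value chosen for illustration, not a measurement, and the function name is made up.

```python
def frame_rate_hz(total_ms, warp_ms, warp_speedup=1.0):
    """Frame rate when only the warping share of the frame time is accelerated."""
    other_ms = total_ms - warp_ms              # decoder, weighting, etc.
    return 1000.0 / (other_ms + warp_ms / warp_speedup)


TOTAL_MS = 46.0   # reported average time per frame
WARP_MS = 31.0    # reported time spent in the warping step

current_hz = frame_rate_hz(TOTAL_MS, WARP_MS)
warp_share = WARP_MS / TOTAL_MS
upper_bound_hz = 1000.0 / (TOTAL_MS - WARP_MS)   # limit if warping cost nothing
hypothetical_hz = frame_rate_hz(TOTAL_MS, WARP_MS, warp_speedup=10.0)  # assumed 10x

print(f"current rate:        {current_hz:.1f} Hz")
print(f"warping share:       {warp_share:.0%}")
print(f"upper bound:         {upper_bound_hz:.1f} Hz")
print(f"with assumed 10x:    {hypothetical_hz:.1f} Hz")
```

Warping accounts for roughly two thirds of the frame time, so the remaining 15 ms bound the achievable rate at about 67 Hz no matter how fast warping becomes.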

##### Smoothness

The depth and flow maps we produce are smooth in space and time and may lack detail.

It would be easy to add skip connections to get the details from the appearance.
Regrettably, this would only work on the input image, which needs to be withheld at training and is unknown at test time.
An example of this smoothness is seen in Fig. 11.
This smoothness is a main source of artifacts.
Overcoming this, *e.g*., using an adversarial design, is left to future work.

### 5.3 Applications

Fig. 12 demonstrates motion blur (interpolation across time), depth-of-field (interpolation across angle), and both combined (interpolation across time and angle).
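
Both effects amount to averaging many interpolated images over a range of coordinates. The following is a minimal sketch under stated assumptions: `render` is a hypothetical stand-in for the trained network (here a 1D image with one bright pixel that moves with time and shifts with view), the sample counts and coordinate ranges are arbitrary, and the depth-of-field variant simply averages views without any refocusing shear.

```python
def render(view, time, width=12):
    """Hypothetical stand-in for the trained NN: a bright pixel that moves
    with time (motion) and shifts with view (disparity)."""
    position = round(4 + 4 * time + 2 * view)
    return [1.0 if x == position else 0.0 for x in range(width)]


def linspace(start, stop, count):
    """Evenly spaced samples from start to stop, both included (count >= 2)."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


def average(images):
    """Per-pixel mean over a list of equally sized images."""
    return [sum(pixels) / len(images) for pixels in zip(*images)]


def motion_blur(view, t0, t1, samples=5):
    """Integrate over the shutter interval [t0, t1] at a fixed view."""
    return average([render(view, t) for t in linspace(t0, t1, samples)])


def depth_of_field(time, v0, v1, samples=5):
    """Integrate over the aperture [v0, v1] at a fixed time."""
    return average([render(v, time) for v in linspace(v0, v1, samples)])


def motion_blur_and_dof(t0, t1, v0, v1, samples=5):
    """Integrate over both the shutter interval and the aperture."""
    return average([render(v, t)
                    for t in linspace(t0, t1, samples)
                    for v in linspace(v0, v1, samples)])


blurred = motion_blur(view=0.0, t0=0.0, t1=1.0)
defocused = depth_of_field(time=0.0, v0=-1.0, v1=1.0)
combined = motion_blur_and_dof(0.0, 1.0, -1.0, 1.0)
print("motion blur:   ", blurred)
print("depth of field:", defocused)
```

Because the interpolation is continuous in view and time, the number of samples can be chosen freely to trade quality for speed; the energy of the single bright pixel is spread over its trajectory rather than lost.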

## 6 Conclusion

We have demonstrated that representing a lightfield video as a NN that produces images conditioned on view and time leads to high-quality, high-performance interpolation in space and time. The particular structure of a network that combines a learnable space-time geometry model with warping and reasoning on occlusion has been shown to perform better than direct regression of color or warping without handling occlusion. In future work, we aim to further reduce training time (eventually using learned gradient descent [9]), explore interpolation along other domains such as illumination, wavelength or spatial audio [8], and reconstruct from even sparser observations.

## References

- [1] H. Bowles, K. Mitchell, R. W. Sumner, J. Moore, and M. Gross. Iterative image warping. Comp. Graph. Forum (Proc. Eurographics), 31(2), 2012.
- [2] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen. Unstructured lumigraph rendering. In Proc. SIGGRAPH, 2001.
- [3] G. Chaurasia, S. Duchene, O. Sorkine-Hornung, and G. Drettakis. Depth synthesis and local warps for plausible image-based navigation. ACM Trans. Graph., 32(3), 2013.
- [4] Ł. Dabała, M. Ziegler, P. Didyk, F. Zilly, J. Keinert, K. Myszkowski, H.-P. Seidel, P. Rokita, and T. Ritschel. Efficient Multi-image Correspondences for On-line Light Field Video Processing. Comp. Graph. Forum (Proc. Pacific Graphics), 2016.
- [5] P. Didyk, T. Ritschel, E. Eisemann, K. Myszkowski, and H.-P. Seidel. Adaptive image-space stereo view synthesis. In Proc. VMV, 2010.
- [6] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
- [7] S.-P. Du, P. Didyk, F. Durand, S.-M. Hu, and W. Matusik. Improving visual quality of view transitions in automultiscopic displays. ACM Trans. Graph. (Proc. SIGGRAPH), 33(6), 2014.
- [8] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan. Neural audio synthesis of musical notes with wavenet autoencoders. In JMLR, 2017.
- [9] J. Flynn, M. Broxton, P. Debevec, M. DuVall, G. Fyffe, R. Overbeck, N. Snavely, and R. Tucker. Deepview: View synthesis with learned gradient descent. In CVPR, 2019.
- [10] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deepstereo: Learning to predict new views from the world’s imagery. In CVPR, 2016.
- [11] O. Fried and M. Agrawala. Puppet dubbing. In Proc. EGSR, 2019.
- [12] M. Fuchs, V. Blanz, H. P. Lensch, and H.-P. Seidel. Adaptive sampling of reflectance fields. ACM Trans. Graph., 26(2), 2007.
- [13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
- [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. NIPS, 2014.
- [15] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen. The lumigraph. In Proc. SIGGRAPH, 1996.
- [16] P. Hedman, J. Philip, T. Price, J.-M. Frahm, G. Drettakis, and G. J. Brostow. Deep blending for free-viewpoint image-based rendering. ACM Trans. Graph. (Proc. SIGGRAPH), 37(6), 2018.
- [17] P. Hedman, T. Ritschel, G. Drettakis, and G. Brostow. Scalable inside-out image-based rendering. ACM Trans. Graph. (Proc. SIGGRAPH Asia), 35(6), 2016.
- [18] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786), 2006.
- [19] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Proc. NIPS, 2015.
- [20] N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi. Learning-based view synthesis for light field cameras. ACM Trans. Graph. (Proc. SIGGRAPH Asia), 35(6), 2016.
- [21] P. Kellnhofer, P. Didyk, S.-P. Wang, P. Sitthi-Amorn, W. Freeman, F. Durand, and W. Matusik. 3DTV at home: Eulerian-lagrangian stereo-to-multiview conversion. ACM Trans. Graph. (Proc. SIGGRAPH), 36(4), 2017.
- [22] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Proc. ICLR, 2013.
- [23] T. Leimkühler, H.-P. Seidel, and T. Ritschel. Minimal warping: Planning incremental novel-view synthesis. Comp. Graph. Forum (Proc. EGSR), 36(4), 2017.
- [24] M. Levoy and P. Hanrahan. Light field rendering. In Proc. SIGGRAPH, 1996.
- [25] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. In Proc. NIPS, 2018.
- [26] S. Liu, T. Li, W. Chen, and H. Li. Soft rasterizer: A differentiable renderer for image-based 3D reasoning. In Proc. ICCV, 2019.
- [27] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In Proc. ICCV, 2017.
- [28] S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, and Y. Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Trans. Graph. (Proc. SIGGRAPH), 38(4), 2019.
- [29] W. R. Mark, L. McMillan, and G. Bishop. Post-rendering 3D warping. In Proc. i3D, 1997.
- [30] M. Maximov, L. Leal-Taixé, M. Fritz, and T. Ritschel. Deep appearance maps. In Proc. ICCV, 2019.
- [31] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graph. (Proc. SIGGRAPH), 38(4), 2019.
- [32] T. Nguyen Phuoc, C. Li, S. Balaban, and Y. Yang. Rendernet: A deep convolutional network for differentiable rendering from 3d shapes. 2018.
- [33] E. Penner and L. Zhang. Soft 3D reconstruction for view synthesis. ACM Trans. Graph. (Proc. SIGGRAPH Asia), 36(6), 2017.
- [34] F. Petersen, A. H. Bermano, O. Deussen, and D. Cohen-Or. Pix2vex: Image-to-geometry reconstruction using a smooth differentiable renderer. Arxiv abs/1903.11149, 2019.
- [35] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. Arxiv abs/1511.06434, 2015.
- [36] G. Rainer, W. Jakob, A. Ghosh, and T. Weyrich. Neural btf compression and interpolation. Comp. Graph. Forum (Proc. Eurographics), 38(2), 2019.
- [37] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In Proc. NIPS, 2015.
- [38] P. Ren, Y. Dong, S. Lin, X. Tong, and B. Guo. Image based relighting using neural networks. ACM Trans. Graph. (Proc. SIGGRAPH), 34(4), 2015.
- [39] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhöfer. Deepvoxels: Learning persistent 3d feature embeddings. In CVPR, 2019.
- [40] P. P. Srinivasan, T. Wang, A. Sreelal, R. Ramamoorthi, and R. Ng. Learning to synthesize a 4D rgbd light field from a single image. In Proc. ICCV, 2017.
- [41] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
- [42] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Proc. NIPS, 2016.
- [43] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In NeurIPS, 2018.
- [44] S. Wanner and B. Goldluecke. Variational light field analysis for disparity estimation and super-resolution. PAMI, 36(3), 2014.
- [45] T. White. Sampling generative networks. Arxiv abs/1609.04468, 2016.
- [46] Z. Xu, S. Bi, K. Sunkavalli, S. Hadap, H. Su, and R. Ramamoorthi. Deep view synthesis from sparse photometric images. ACM Trans. Graph. (Proc. SIGGRAPH), 38(4), 2019.
- [47] L. Yang, D. Nehab, P. V. Sander, P. Sitthi-amorn, J. Lawrence, and H. Hoppe. Amortized supersampling. ACM Trans. Graph., 28(5), 2009.
- [48] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Trans. Graph. (Proc. SIGGRAPH), 37(4), 2018.
- [49] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In ECCV, 2016.