Experience-Embedded Visual Foresight
Abstract
Visual foresight gives an agent a window into the future, which it can use to anticipate events before they happen and plan strategic behavior. Although impressive results have been achieved on video prediction in constrained settings, these models fail to generalize when confronted with unfamiliar real-world objects. In this paper, we tackle the generalization problem via fast adaptation, where we train a prediction model to quickly adapt to the observed visual dynamics of a novel object. Our method, Experience-embedded Visual Foresight (EVF), jointly learns a fast adaptation module, which encodes observed trajectories of the new object into a vector embedding, and a visual prediction model, which conditions on this embedding to generate physically plausible predictions. For evaluation, we compare our method against baselines on video prediction and benchmark its utility on two real-world control tasks. We show that our method is able to quickly adapt to new visual dynamics and achieves lower error than the baselines when manipulating novel objects.
Lin Yen-Chen, Maria Bauza, Phillip Isola

Massachusetts Institute of Technology 
{yenchenl,bauza,phillipi}@mit.edu 
3rd Conference on Robot Learning (CoRL 2019), Osaka, Japan.
Keywords: Video Prediction, Model-based Reinforcement Learning, Deep Learning
1 Introduction
The ability to visualize possible futures allows an agent to consider the consequences of its actions before making costly decisions. When visual prediction is coupled with a planner, the resulting framework, visual model-predictive control (visual MPC), can perform a multitude of tasks – such as object relocation and the folding of clothes – all with the same unified model [12, 7].
Like any learned model, visual MPC works very well when predicting the dynamics of an object it has encountered many times in the past, e.g., during training. But when shown a new object, its predictions are often physically implausible and fail to be useful for planning.
This failure contrasts with the ability of humans to quickly adapt to new scenarios. When we encounter a novel object, we may initially have little idea how it will behave – is it soft or hard, heavy or light? However, just by playing with it a bit, we can quickly estimate its basic physical properties and begin to make accurate predictions about how it will react under our control.
Our goal in this paper is to develop an algorithm that learns to quickly adapt in a similar way. We consider the problem of few-shot visual dynamics adaptation, where a visual prediction model aims to better predict the future given a few videos of previous interactions with an object.
To that end, we propose Experience-embedded Visual Foresight (EVF), where adaptation is performed via an experience encoder that takes as input a set of previous experiences with the object, in the format of videos, and outputs a low-dimensional vector embedding, which we term a “context”. This embedding is then used as side information for a recurrent frame prediction module. Our method is centered on the idea that there is an embedding space of objects’ properties, where similar objects (in terms of visual dynamics) are close together while different objects are far apart. Having such a space allows for few-shot visual dynamics adaptation and opens the possibility of inferring information from new and unfamiliar objects in a zero-shot fashion.
Our approach can further be formalized as a hierarchical Bayes model in which object properties are latent variables that can be inferred from a few prior experiences with the object. The full system, as shown in Figure 1, is trained end-to-end and therefore can be understood as a form of meta-learning: under this view, the experience encoder is a fast learner, embodied in a feedforward network that modulates predictions based on just a few example videos. The weights of this fast learner are themselves also learned, but by the slower process of stochastic gradient descent over many meta-batches of examples and predictions.
EVF can be categorized as “learning from observations” [19, 28], rather than “learning from demonstrations”, as it does not require that observed trajectories contain actions. This means it can utilize experiences when actions are unknown or generated by an agent with a different action space.
To evaluate our approach, we compare EVF against baselines which apply gradient-based meta-learning [10] to video prediction, and validate that our method improves visual MPC’s performance on two real-world control tasks in which a robot manipulates novel objects.
Our contributions include:

1. A hierarchical Bayes model for few-shot visual dynamics adaptation.
2. Improved performance on both video prediction and model-based control.
3. A learned video representation that captures properties related to object dynamics.
2 Related Work
Embedding-based Meta-Learning.
Learning an embedding function to summarize information from a few samples has shown promising results for few-shot learning tasks. For image classification, Matching Networks [31] learns an embedding space to compare a few labeled examples with testing data, and builds a classifier based on weighted nearest neighbor. Similarly, Prototypical Networks learns an embedding space but instead represents each class by the mean of its examples (a prototype). Then, a classifier is built using squared Euclidean distances between testing data’s embedding and different classes’ prototypes. For visuomotor control, TecNets [16] learns an embedding space of demo trajectories. During test time, it feeds the trajectory embeddings into a control network to perform one-shot imitation learning. We adopt a similar approach but apply it to visual MPC rather than to imitation learning. For generative modelling, Neural Statistician [9] extends the variational autoencoder (VAE) [18] to model datasets instead of individual datapoints, where the learned embedding for each dataset is used for the decoder to reconstruct samples from that dataset. Our method extends Neural Statistician [9] to model visual dynamics of diverse objects. The learned model is also used for model-based control.
Model-based Reinforcement Learning.
Algorithms which learn a dynamics model to predict the future and use it to improve decision making belong to the category of model-based reinforcement learning (RL). Lately, visual model-based RL has been applied to real-world robotic manipulation problems [12, 7, 4, 35]. However, these methods do not focus on adaptation of the visual prediction model. A few recent works have explored combining meta-learning with model-based RL in order to adapt to different environments in low-dimensional state space [24, 25]. In this work, we tackle the problem of adapting high-dimensional visual dynamics by learning a latent space for object properties, inspired by previous works on metric-based meta-learning [31, 16].
Video Prediction.
The breakthrough of deep generative models [13, 18, 27] has led to impressive results on deterministic video prediction [23, 30, 33, 17, 11]. Specifically, action-conditioned video prediction has been shown to be useful in video games [26, 14] and robotic manipulation [12, 7, 8, 34]. Our work is most similar to approaches which perform deterministic next frame generation based on stochastic latent variables [2, 5, 21, 20]. The main difference between these works and ours is shown in Figure 2. Rather than learning a generative model for all the videos in a single dataset, we construct a hierarchical Bayes model to learn from many small yet related datasets [9, 15]. Our goal is to robustify visual MPC’s performance by adapting to unseen objects.
3 Preliminaries
In visual MPC (or visual foresight), the robot first collects a dataset $D=\{\tau^{(1)},\tau^{(2)},\dots,\tau^{(N)}\}$ which contains $N$ trajectories. Each trajectory $\tau$ can further be represented as a sequence of images and actions $\tau=\{I_{1},a_{1},I_{2},a_{2},\dots,I_{T}\}$. Then, the dataset is used to learn an action-conditioned video prediction model $g_{\theta}$ that approximates the stochastic visual dynamics $p(I_{t}\mid I_{1:t-1},a_{t-1})$. Once the model is learned, it can be coupled with a planner to find actions that minimize an objective specified by the user. In this work, we adopt the cross-entropy method (CEM) as our planner and assume the objective is specified as a goal image. At the planning stage, CEM aims to find a sequence of actions that minimizes the difference between the predicted future frames and the goal image. Then, the robot executes the first action in the planned action sequence and performs replanning.
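The planning step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `cem_plan` and `toy_cost`, the number of refinement iterations, and the unit-Gaussian initial sampling distribution are hypothetical. The one-dimensional toy cost stands in for "roll out $g_{\theta}$ and compare the predicted frames with the goal image". The default population size and elite count follow the planner settings reported in Section 5.3.

```python
import random


def toy_cost(actions, goal=2.0):
    # Hypothetical stand-in for the image-space objective: the "state" is the
    # sum of 1-D actions and the cost is the squared distance to a goal value.
    return (sum(actions) - goal) ** 2


def cem_plan(cost_fn, horizon, action_dim, iters=5, pop=200, n_elite=10, seed=0):
    """Cross-entropy method over flattened action sequences (sketch)."""
    rng = random.Random(seed)
    n = horizon * action_dim
    mean = [0.0] * n  # assumed initial sampling distribution: N(0, I)
    std = [1.0] * n
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)
        ]
        # Keep the lowest-cost candidates (the elites).
        elites = sorted(samples, key=cost_fn)[:n_elite]
        # Refit the diagonal Gaussian to the elites.
        mean = [sum(e[i] for e in elites) / n_elite for i in range(n)]
        std = [
            max(1e-3, (sum((e[i] - mean[i]) ** 2 for e in elites) / n_elite) ** 0.5)
            for i in range(n)
        ]
    return mean


plan = cem_plan(toy_cost, horizon=3, action_dim=1)
first_action = plan[0]  # in MPC, only this action is executed before replanning
```

In a full visual MPC loop, `cost_fn` would be replaced by a call to the learned prediction model followed by an image distance, and `cem_plan` would be called again after every executed action.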
4 Method
Our goal is to robustify visual MPC’s performance when manipulating unseen objects through rapid adaptation of visual dynamics. Inspired by previous works, we hypothesize that humans’ capability of rapid adaptation is powered by their ability to infer latent properties that can facilitate prediction. To endow our agent with the same capability, we consider a setting different from prior work where we are given $K$ datasets $D_{k}$ for $k\in[1,K]$. Each dataset $D_{k}=\{\tau^{(1)},\tau^{(2)},\dots,\tau^{(N)}\}$ contains $N$ trajectories of the robot interacting with a specific environment. We assume there is a common underlying generative process $p$ such that the stochastic visual dynamics $p_{k}(I_{t}\mid I_{1:t-1},a_{t-1})$ for dataset $D_{k}$ can be written as $p_{k}(\cdot)=p(\cdot\mid c_{k})$ for $c_{k}$ drawn from some distribution $p(c)$, where $c$ is a “context” embedding. Intuitively, the context $c$ shared by all trajectories in a dataset $D_{k}$ captures environmental information that generates the dataset, e.g., an object’s mass and shape.
Our task can thus be split into generation and inference components. The generation component consists of learning a model $g_{\theta}$ that approximates the underlying generative process $p$. To model stochasticity, we introduce latent variables $z_{t}$ at each time step to carry stochastic information about the next frame in a video. The inference component learns an approximate posterior over the context $c$. Since encoding the whole dataset $D_{k}$ is computationally prohibitive, we construct a support set $S$ which contains $M$ trajectories subsampled from $D_{k}$ and learn a function $E_{\varphi}(c\mid S)$ to estimate $c$ from $S$. In our experiments, $M$ is a small number, $M\le 5$. We denote a trajectory, image, and action in the support set as $\tau^{S}$, $I_{t}^{S}$ and $a_{t}^{S}$. The rightmost subfigure of Figure 2 shows the graphical model of our method. Overall, our learning approach is an extension of VAE, and we detail this formalism in Section 4.2.
4.1 Overview
Before detailing the training procedure, we explain how our method performs adaptation and generates predictions for unseen objects. To adapt the video prediction model, we first use the inference component, i.e., an experience encoder $E_{\varphi}(c\mid S)$ with parameters $\varphi$, to encode videos in the support set $S$ into an approximate posterior over context. Then, we sample a context from this posterior and feed it into our generation component to produce predictions. Our generation component is an action-conditioned video prediction model $g_{\theta}$, parameterized by $\theta$, that generates the next frame $\widehat{I}_{t}$ based on previous frames $I_{1:t-1}$, action $a_{t-1}$, stochastic latent variables $z_{1:t}$, and context $c$.
The two components are trained jointly end-to-end with the aid of a separate frame encoder $q_{\psi}(z_{t}\mid I_{1:t},c)$ (not used at test time). To train our model, we again first sample $c$ from the approximate posterior constructed by the experience encoder. Then, the frame encoder encodes $I_{t}$, i.e., the target of the prediction model, along with previous frames $I_{1:t-1}$ and $c$ to compute another distribution $q_{\psi}(z_{t}\mid I_{1:t},c)$ from which we sample $z_{t}$. By assuming the priors of the context $c$ and the latent variable $z$ to be normally distributed, we force $E_{\varphi}(c\mid S)$ and $q_{\psi}(z_{t}\mid I_{1:t},c)$ to be close to the corresponding prior distributions $p(c)$ and $p(z)$ using KL-divergence terms. These two terms constrain the information that $c$ and $z_{t}$ can carry, forcing them to capture useful information for prediction. In addition, we have another term in our loss to penalize the reconstruction error between $\widehat{I}_{t}$ and $I_{t}$.
4.2 Variational Autoencoders Over Sets
To further explain our model, we adopt the formalism of variational autoencoders over sets [9, 15]. For each trajectory $\tau$, the likelihood given a context $c$ can be written as:
$$p(\tau)=\prod_{t=1}^{T}\int g_{\theta}(I_{t}\mid I_{1:t-1},a_{t-1},z_{1:t},c)\,p(z_{1:t})\,dz_{1:t}$$  (1)
However, this likelihood cannot be directly optimized as it involves marginalizing over the latent variables $z_{1:t}$, which is generally intractable. Therefore, the frame encoder $q_{\psi}(z_{t}\mid I_{1:t},c)$ is used to approximate the posterior with a conditional Gaussian distribution $\mathcal{N}(\mu_{\psi}(I_{1:t},c),\sigma_{\psi}(I_{1:t},c))$. Then, the likelihood of each trajectory in a dataset can be approximated by the variational lower bound:
$$L_{\theta,\psi}(I_{1:T})=\sum_{t=1}^{T}\left[\mathbb{E}_{q_{\psi}(z_{1:t}\mid I_{1:t},c)}\log g_{\theta}(I_{t}\mid I_{1:t-1},a_{t-1},z_{1:t},c)-\beta D_{KL}(q_{\psi}(z_{t}\mid I_{1:t},c)\,\|\,p(z))\right]$$  (2)
Note that although Eq. (2) can be used to maximize the likelihood of trajectories in a single dataset [2, 5, 21], our goal is to approximate the commonly shared generative process $p$ through maximizing the likelihood of many datasets. Based on the graphical model plotted on the right in Figure 2, the likelihood of a single dataset can be written as:
$$p(D)=\int p(c)\left[\prod_{\tau\in D}\prod_{t=1}^{T}\int g_{\theta}(I_{t}\mid I_{1:t-1},a_{t-1},z_{1:t},c)\,p(z_{1:t}\mid c)\,dz_{1:t}\right]dc$$  (3)
In order to solve the intractability of marginalizing over the context $c$, we introduce an experience encoder $E_{\varphi}(c\mid D)$ which takes as input the whole dataset to construct an approximate posterior over context. By applying variational inference, we can lower bound the likelihood of a dataset with:
$$\begin{array}{rl}L_{\theta,\varphi}(D)&=\mathbb{E}_{E_{\varphi}(c\mid D)}\left[\sum_{\tau\in D}\sum_{t=1}^{T}[R_{D}-\beta Z_{D}]\right]-\gamma C_{D}\quad\text{where}\\ R_{D}&=\mathbb{E}_{q_{\psi}(z_{1:t}\mid I_{1:t},c)}\log g_{\theta}(I_{t}\mid I_{1:t-1},a_{t-1},z_{1:t},c)\\ Z_{D}&=D_{KL}(q_{\psi}(z_{t}\mid I_{1:t},c)\,\|\,p(z))\\ C_{D}&=D_{KL}(E_{\varphi}(c\mid D)\,\|\,p(c))\end{array}$$  (4)
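One way to see where the context term of this bound comes from is the standard variational argument, sketched here for the special case $\beta=\gamma=1$. The shorthand $p(D\mid c)$ for the bracketed product in Eq. (3) is introduced only for this sketch and does not appear in the original text. Multiplying and dividing by the approximate posterior and applying Jensen's inequality gives:

$$\begin{array}{rl}\log p(D)&=\log\int p(c)\,p(D\mid c)\,dc=\log\mathbb{E}_{E_{\varphi}(c\mid D)}\left[\frac{p(c)\,p(D\mid c)}{E_{\varphi}(c\mid D)}\right]\\ &\ge\mathbb{E}_{E_{\varphi}(c\mid D)}\left[\log p(D\mid c)\right]-D_{KL}(E_{\varphi}(c\mid D)\,\|\,p(c))\end{array}$$

Bounding each trajectory's contribution to $\log p(D\mid c)$ with the per-trajectory bound of Eq. (2) then yields the nested sums over $\tau$ and $t$. Weights $\beta$ and $\gamma$ other than 1 reweight the two KL terms and are best read as regularization strengths rather than as part of this derivation.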
The hyperparameters $\beta$ and $\gamma$ control the trade-off between minimizing frame prediction error and fitting the prior. Smaller values of $\beta$ and $\gamma$ increase the representational power of the frame encoder and the experience encoder. However, if $\beta$ or $\gamma$ is too small, both encoders may learn to simply copy the target frame $I_{t}$, which yields low training error but poor generalization at test time.
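To make the role of the two KL terms concrete, the following sketch computes the closed-form KL divergence between a diagonal Gaussian and a standard normal prior. The text only states that the priors are normally distributed, so the zero-mean, unit-variance prior and the log-variance parameterization are assumptions, and the function name is hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Assumes a standard normal prior; this is an illustrative choice.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


# A posterior equal to the prior carries no information and costs nothing.
no_info = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Moving the posterior mean away from the prior is penalized quadratically.
shifted = kl_to_standard_normal([2.0], [0.0])
```

Scaling this penalty by $\beta$ (for $z_{t}$) or $\gamma$ (for $c$) sets how much the encoders are charged for deviating from the prior, which is the trade-off described above.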
4.3 Stochastic Lower Bound
Note that in Eq. (4), calculating the variational lower bound for each dataset requires passing the whole dataset through both inference and generation components for each gradient update. This can become computationally prohibitive when a dataset contains hundreds of samples. In practice, since we only pass a support set $S$ subsampled from $D$, the loss is rescaled as follows:
$$\log p(D)\ge\mathbb{E}_{E_{\varphi}(c\mid S)}\left[\sum_{\tau\in D}\sum_{t=1}^{T}[R_{D}-\beta Z_{D}]\right]-\gamma C_{D}=\sum_{\tau\in D}\left[\mathbb{E}_{E_{\varphi}(c\mid S)}\sum_{t=1}^{T}[R_{D}-\beta Z_{D}]-\frac{1}{|D|}\gamma C_{D}\right]$$  (5)
In this work, we sample 5 trajectories without replacement to construct the support set $S$.
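The subsampling and the rescaled per-trajectory term can be sketched as follows. This is an illustration only: the function names are hypothetical, and the reconstruction and KL values are assumed to be precomputed scalars rather than outputs of the actual networks.

```python
import random


def sample_support_set(dataset, m=5, rng=None):
    """Draw m trajectories from the dataset without replacement."""
    rng = rng or random.Random()
    return rng.sample(dataset, m)


def per_trajectory_objective(recon_terms, kl_z_terms, kl_c, dataset_size,
                             beta=1.0, gamma=1.0):
    """Bracketed term of Eq. (5) for one trajectory.

    recon_terms: per-frame log-likelihood estimates (R_D)
    kl_z_terms:  per-frame KL terms for z_t (Z_D)
    kl_c:        KL term for the context (C_D), shared by the whole dataset
    """
    frame_part = sum(r - beta * z for r, z in zip(recon_terms, kl_z_terms))
    # The context KL is spread evenly over the |D| trajectories.
    return frame_part - gamma * kl_c / dataset_size


support = sample_support_set(list(range(20)), m=5, rng=random.Random(0))
value = per_trajectory_objective([-1.0, -2.0], [0.5, 0.5], kl_c=2.0, dataset_size=4)
```

Summing `per_trajectory_objective` over all trajectories of a dataset recovers the context penalty $\gamma C_{D}$ exactly once, which is what makes mini-batch training over trajectories consistent with the dataset-level bound.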
4.4 Model Architectures
We use SAVP [21] as our backbone architecture for the generation component. Our model $g_{\theta}$ is a convolutional LSTM which predicts the transformations of pixels between the current and next frame. Skip connections with the first frame are added as done in SNA [8]. Note that at every time step $t$, the generation component only receives $I_{t-1}$, $a_{t-1}$, $z_{t}$, and $c$ as input. The dependencies on all previous $I_{1:t-2}$ and $z_{1:t-1}$ stem from the recurrent nature of the model. To implement the conditioning on action, latent variables, and context, we concatenate them along the channel dimension to the inputs of all the convolutional layers of the LSTM. The experience encoder is a feedforward convolutional network which encodes the image at every time step and an LSTM which takes as input the encoded images sequentially. We apply the experience encoder to $M$ trajectories in the support set to collect $M$ LSTM final states. Sample mean is used as a pooling operation to collapse these states into the mean of our approximate posterior $E_{\varphi}(c\mid S)$. The frame encoder is a feedforward convolutional network which encodes $I_{t}$ and $I_{t-1}$ at every time step. We train the model using the reparameterization trick and estimate the expectation over $q_{\psi}(z_{1:t}\mid I_{1:t},c)$ with a single sample.
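The pooling and sampling steps at the end of the experience encoder can be illustrated with plain Python lists standing in for LSTM final states. This is a sketch under assumptions: the function names are hypothetical, the convolutional and recurrent layers are omitted, and the way the posterior variance is obtained is not specified in the text, so a log-variance vector is simply passed in.

```python
import math
import random


def pool_context_mean(final_states):
    """Average M per-trajectory encoder states into one posterior mean."""
    m = len(final_states)
    dim = len(final_states[0])
    return [sum(state[i] for state in final_states) / m for i in range(dim)]


def sample_context(mean, log_var, rng):
    """Reparameterization trick: c = mean + sigma * eps with eps ~ N(0, I)."""
    return [
        mu + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for mu, lv in zip(mean, log_var)
    ]


states = [[1.0, 2.0], [3.0, 4.0]]  # stand-ins for M = 2 LSTM final states
mean = pool_context_mean(states)
context = sample_context(mean, [-2.0, -2.0], random.Random(0))
```

Because the mean is symmetric in its arguments, the resulting context does not depend on the order in which support trajectories are presented, a convenient property for an encoder over sets.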
5 Experiments
In this section, we aim to answer the following questions:
(i) Can we learn a context embedding $c$ that facilitates visual dynamics adaptation?
(ii) Does the context embedding $c$ improve the performance of downstream visual MPC tasks?
(iii) Can the context embedding $c$ be used to infer similarity between objects?
First, we present video prediction results on Omnipush and KTH Action in Sections 5.1 and 5.2 to answer questions (i) and (iii). Then, we report results on two real-world control tasks in Section 5.3 to answer question (ii). In our experiments, we compare our method to the following baselines:
• SAVP-F. SAVP fine-tuned with 5 videos in the support set with learning rate 5e-4 for 50 steps.
• SAVP-MAML. SAVP trained with MAML [10] and fine-tuned the same way as in SAVP-F.
We also tried implementing an RNN-based meta-learner [6], where SAVP’s hidden states are not reset after encoding videos from the support set. However, we were not able to achieve improved performance using this method.
5.1 Omnipush Dataset
Our first experiment uses the Omnipush dataset [3], which contains RGB-D videos of an ABB IRB 120 industrial arm pushing 250 objects, each pushed 250 times. For each push, the robot selects a random direction and makes a 5 cm straight push for 1 s. Using modular meta-learning [1] on ground truth pose data (rather than raw RGB-D), the authors showed that Omnipush is suitable for evaluating adaptation because it provides systematic variability, making it easy to study, e.g., the effects of mass distribution vs. shape on the dynamics of pushing.
We use the first split of the dataset (i.e., 70 objects without extra weight) to conduct the experiments. We condition on the first 2 frames and train to predict the next 10. To benchmark the adaptability of our approach, we train on 50 objects and test on the remaining 20 novel objects. Figure 3 shows qualitative results of video prediction. Following previous works [5, 21], we provide quantitative comparisons by computing the learned perceptual image patch similarity (LPIPS) [36], the structural similarity (SSIM) [32], and the peak signal-to-noise ratio (PSNR) between ground truth and generated video sequences. Quantitative results are shown in Figure 4. Our method predicts more physically plausible videos compared to baselines, and especially captures the shape of the pushed object better. We further show that this improves the performance of visual MPC in Section 5.3.
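Of the three metrics, PSNR is simple enough to state in full. The sketch below computes it for two frames given as flat lists of pixel intensities. The function name and the assumption that pixels lie in $[0, 1]$ are illustrative and not taken from the paper; SSIM and LPIPS require dedicated implementations and are not shown.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(pred) != len(target):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] intensity scale gives about 20 dB.
score = psnr([0.0, 0.0], [0.1, 0.1])
```

Higher PSNR indicates a prediction closer to the ground truth frame; for video, the score is typically computed per frame and averaged over the predicted sequence.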
Visualization of context embedding.
In Figure 7, we use t-SNE [22] to visualize the context embedding of the 20 novel objects we tested. Despite never seeing these objects during training, our model is able to produce a discriminative embedding $c$ where objects with similar properties, such as mass and shape, are close together, and those with different properties are far apart.
5.2 KTH Action Dataset
The KTH Action dataset [29] consists of real-world videos of people performing different actions, where each person performs the same action in 4 different settings. We split the dataset into many small datasets according to person and action. Therefore, the support set consists of the other 3 videos belonging to the same small dataset. Following the setting of previous work [21], our models are trained to predict 10 future frames conditioned on 10 initial frames. At test time, they are evaluated on predicting 30 frames into the future. We show qualitative results in Figure 5 and quantitative results in Figure 6.
5.3 Real World Pushing
We perform two types of real-world robotic experiments to verify that predictions generated by our method are more physically plausible, and that this benefits visual MPC. In the first experiment, we ask the robot to perform relocation, where the robot should push an object to a designated pose. In the second experiment, the robot has to imitate the pushing trajectory presented in a 10-second video. In both tasks we use pixel-wise $\ell_{2}$ loss, with the robot arm masked, to measure the distance between the predicted image and the goal image. At each iteration of planning, our CEM planner samples 200 candidates and refits to the best 10 samples. The results of these experiments can be found in Table 1. Note that several more advanced loss functions have been recently proposed for visual MPC [34, 8, 7], but we opted for simple $\ell_{2}$ to isolate the contribution of our method in terms of improved visual predictions, rather than improved planning.
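The masked pixel-wise cost can be sketched as below for single-channel images stored as nested lists. The function name, the boolean mask convention (`True` marks arm pixels), and the choice to average rather than sum over unmasked pixels are assumptions made for illustration; the text does not specify how the arm mask is obtained.

```python
def masked_l2(pred, goal, arm_mask):
    """Mean squared pixel difference over pixels not covered by the arm."""
    total, count = 0.0, 0
    for p_row, g_row, m_row in zip(pred, goal, arm_mask):
        for p, g, is_arm in zip(p_row, g_row, m_row):
            if not is_arm:  # ignore pixels occupied by the robot arm
                total += (p - g) ** 2
                count += 1
    return total / count if count else 0.0


pred = [[0.0, 1.0], [1.0, 0.0]]
goal = [[0.0, 0.0], [0.0, 0.0]]
mask = [[False, True], [False, False]]  # top-right pixel belongs to the arm
cost = masked_l2(pred, goal, mask)
```

Masking keeps the planner from being rewarded or penalized for where the arm itself appears in the image, so the cost reflects only the object and background pixels.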
Repositioning

Method  Type  Seen  Unseen
No motion  mean  90.4  91.9
No motion  median  87.4  90.4
SAVP  mean  26.2  48.4
SAVP  median  27.1  48.5
EVF (ours)  mean  21.8  36.2
EVF (ours)  median  20.1  29.1

Trajectory-tracking

Method  Type  Traj 1  Traj 2  Traj 3  Traj 4
No motion  mean  112  93.6  91.1  97.7
No motion  final  154  127  137  141
SAVP  mean  33.5  42.7  29.5  41.2
SAVP  final  79.8  60.8  49  59.5
EVF (ours)  mean  31.6  25.2  21.6  27.5
EVF (ours)  final  59.9  47.7  48.7  48.6
6 Conclusion
In this work, we considered the problem of visual dynamics adaptation in order to robustify visual MPC’s performance when manipulating novel objects. We proposed a hierarchical Bayes model and showed improved results on both video prediction and model-based control. In the future, it would be interesting to investigate how to actively collect useful data for few-shot adaptation.
Acknowledgments
We thank Alberto Rodriguez, Shuran Song, and Wei-Chiu Ma for helpful discussions. This research was supported in part by the MIT Quest for Intelligence and by iFlytek.
References
[1] (2018) Modular meta-learning. arXiv preprint arXiv:1806.10166.
[2] (2017) Stochastic variational video prediction. arXiv preprint arXiv:1710.11252.
[3] (2019) Omnipush: accurate, diverse, real-world dataset of pushing dynamics with RGB-D video. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[4] (2017) SE3-Nets: learning rigid body motion using deep neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 173–180.
[5] (2018) Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687.
[6] (2016) RL$^2$: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
[7] (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568.
[8] (2017) Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268.
[9] (2016) Towards a neural statistician. arXiv preprint arXiv:1606.02185.
[10] (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1126–1135.
[11] (2016) Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pp. 64–72.
[12] (2017) Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793.
[13] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[14] (2018) World models. arXiv preprint arXiv:1803.10122.
[15] (2018) The variational homoencoder: learning to learn high capacity generative models from few examples. arXiv preprint arXiv:1807.08919.
[16] (2018) Task-embedded control networks for few-shot imitation learning. arXiv preprint arXiv:1810.03237.
[17] (2017) Video pixel networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1771–1779.
[18] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[19] (2019) A review of robot learning for manipulation: challenges, representations, and algorithms. CoRR abs/1907.03146.
[20] (2019) VideoFlow: a flow-based generative model for video. arXiv preprint arXiv:1903.01434.
[21] (2018) Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523.
[22] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
[23] (2015) Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
[24] (2018) Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347.
[25] (2018) Deep online learning via meta-learning: continual adaptation for model-based RL. arXiv preprint arXiv:1812.07671.
[26] (2015) Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pp. 2863–2871.
[27] (2016) Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759.
[28] (2018) Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2050–2053.
[29] (2004) Recognizing human actions: a local SVM approach. In Proc. Int. Conf. Pattern Recognition (ICPR’04), Cambridge, U.K.
[30] (2017) Learning to generate long-term future via hierarchical prediction. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3560–3569.
[31] (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638.
[32] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
[33] (2018) Hierarchical long-term video prediction without supervision. arXiv preprint arXiv:1806.04768.
[34] (2018) Few-shot goal inference for visuomotor learning and planning. In Conference on Robot Learning, pp. 40–52.
[35] (2018) SOLAR: deep structured representations for model-based reinforcement learning.
[36] (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.