Experience-Embedded Visual Foresight

  • 2019-11-12 18:58:30
  • Lin Yen-Chen, Maria Bauza, Phillip Isola


Visual foresight gives an agent a window into the future, which it can use to anticipate events before they happen and plan strategic behavior. Although impressive results have been achieved on video prediction in constrained settings, these models fail to generalize when confronted with unfamiliar real-world objects. In this paper, we tackle the generalization problem via fast adaptation, where we train a prediction model to quickly adapt to the observed visual dynamics of a novel object. Our method, Experience-embedded Visual Foresight (EVF), jointly learns a fast adaptation module, which encodes observed trajectories of the new object into a vector embedding, and a visual prediction model, which conditions on this embedding to generate physically plausible predictions. For evaluation, we compare our method against baselines on video prediction and benchmark its utility on two real-world control tasks. We show that our method is able to quickly adapt to new visual dynamics and achieves lower error than the baselines when manipulating novel objects.


Experience-Embedded Visual Foresight

Lin Yen-Chen   Maria Bauza   Phillip Isola
Massachusetts Institute of Technology

3rd Conference on Robot Learning (CoRL 2019), Osaka, Japan.

Keywords: Video Prediction, Model-based Reinforcement Learning, Deep Learning

1 Introduction

The ability to visualize possible futures allows an agent to consider the consequences of its actions before making costly decisions. When visual prediction is coupled with a planner, the resulting framework, visual model-predictive control (visual MPC), can perform a multitude of tasks – such as object relocation and the folding of clothes – all with the same unified model [12, 7].

Like any learned model, visual MPC works very well when predicting the dynamics of an object it has encountered many times in the past, e.g., during training. But when shown a new object, its predictions are often physically implausible and fail to be useful for planning.

This failure contrasts with the ability of humans to quickly adapt to new scenarios. When we encounter a novel object, we may initially have little idea how it will behave – is it soft or hard, heavy or light? However, just by playing with it a bit, we can quickly estimate its basic physical properties and begin to make accurate predictions about how it will react under our control.

Our goal in this paper is to develop an algorithm that learns to quickly adapt in a similar way. We consider the problem of few-shot visual dynamics adaptation, where a visual prediction model aims to better predict the future given a few videos of previous interactions with an object.

To that end, we propose Experience-embedded Visual Foresight (EVF), where adaptation is performed via an experience encoder that takes as input a set of previous experiences with the object, in the format of videos, and outputs a low-dimensional vector embedding, which we term a “context”. This embedding is then used as side information for a recurrent frame prediction module. Our method is centered on the idea that there is an embedding space of objects’ properties, where similar objects (in terms of visual dynamics) are close together while different objects are far apart. Having such a space allows for few-shot visual dynamics adaptation and opens the possibility of inferring information from new and unfamiliar objects in a zero-shot fashion.

Our approach can further be formalized as a hierarchical Bayes model in which object properties are latent variables that can be inferred from a few prior experiences with the object. The full system, as shown in Figure 1, is trained end-to-end and therefore can be understood as a form of meta-learning: under this view, the experience encoder is a fast learner, embodied in a feedforward network that modulates predictions based on just a few example videos. The weights of this fast learner are themselves also learned, but by the slower process of stochastic gradient descent over many meta-batches of examples and predictions.

EVF can be categorized as “learning from observations” [19, 28], rather than “learning from demonstrations”, as it does not require that observed trajectories contain actions. This means it can utilize experiences when actions are unknown or generated by an agent with a different action space.

To evaluate our approach, we compare EVF against baselines which apply gradient-based meta-learning [10] to video prediction, and validate that our method improves visual MPC’s performance on two real-world control tasks in which a robot manipulates novel objects.

Figure 1: Experience-embedded Visual Foresight (EVF). EVF allows a visual prediction model gθ to better adapt to the visual dynamics of a novel object given one or more videos of this new object, i.e., “experiences” with it. (a) Each experience is first passed through an encoder Eϕ to generate a compact representation ci of the new object, where i is the index of the experience. Then, these representations are combined to create a context embedding c which is used to sample latent variable z. Both c and z are concatenated along the channel dimension to form an input to the visual prediction model gθ. (b) The visual prediction model is a deterministic recurrent net that predicts the next frame given the current frame, the action taken, the net’s recurrent state, and the latent variables inferred and sampled in (a). The experience-embedding encoder Eϕ and video prediction model gθ are jointly optimized.

Our contributions include:

  1. A hierarchical Bayes model for few-shot visual dynamics adaptation.

  2. Improved performance on both video prediction and model-based control.

  3. A learned video representation that captures properties related to object dynamics.

2 Related Work

Embedding-based Meta Learning.

Learning an embedding function to summarize information from a few samples has shown promising results for few-shot learning tasks. For image classification, Matching Networks [31] learns an embedding space to compare a few labeled examples with testing data, and builds a classifier based on weighted nearest neighbor. Similarly, Prototypical Networks learns an embedding space but instead represents each class by the mean of its examples (a prototype). Then, a classifier is built using squared Euclidean distances between testing data’s embedding and different classes’ prototypes. For visuomotor control, TecNets [16] learns an embedding space of demo trajectories. During test time, it feeds the trajectory embeddings into a control network to perform one-shot imitation learning. We adopt a similar approach but apply it to visual MPC rather than to imitation learning. For generative modelling, Neural Statistician [9] extends the variational autoencoder (VAE) [18] to model datasets instead of individual datapoints, where the learned embedding for each dataset is used for the decoder to reconstruct samples from that dataset. Our method extends Neural Statistician [9] to model visual dynamics of diverse objects. The learned model is also used for model-based control.

Model-based Reinforcement Learning.

Algorithms which learn a dynamics model to predict the future and use it to improve decision making belong to the category of model-based reinforcement learning (RL). Lately, visual model-based RL has been applied to real world robotic manipulation problems [12, 7, 4, 35]. However, these methods do not focus on adaptation of the visual prediction model. A few recent works have explored combining meta-learning with model-based RL in order to adapt to different environments in low-dimensional state space [24, 25]. In this work, we tackle the problem of adapting high-dimensional visual dynamics by learning a latent space for object properties, inspired by previous works on metric-based meta learning [31, 16].

Video Prediction.

The breakthrough of deep generative models [13, 18, 27] has led to impressive results on deterministic video prediction [23, 30, 33, 17, 11]. Specifically, action-conditioned video prediction has been shown to be useful in video games [26, 14] and robotic manipulation [12, 7, 8, 34]. Our work is most similar to approaches which perform deterministic next-frame generation based on stochastic latent variables [2, 5, 21, 20]. The main difference between these works and ours is shown in Figure 2. Rather than learning a generative model for all the videos in a single dataset, we construct a hierarchical Bayes model to learn from many small yet related datasets [9, 15]. Our goal is to robustify visual MPC’s performance by adapting to unseen objects.

3 Preliminaries

In visual MPC (or visual foresight), the robot first collects a dataset $D=\{\tau^{(1)},\tau^{(2)},\dots,\tau^{(N)}\}$ which contains $N$ trajectories. Each trajectory $\tau$ can further be represented as a sequence of images and actions $\tau=\{I_1,a_1,I_2,a_2,\dots,I_T\}$. Then, the dataset is used to learn an action-conditioned video prediction model $g_\theta$ that approximates the stochastic visual dynamics $p(I_t \mid I_{1:t-1},a_{t-1})$. Once the model is learned, it can be coupled with a planner to find actions that minimize an objective specified by the user. In this work, we adopt the cross-entropy method (CEM) as our planner and assume the objective is specified as a goal image. At the planning stage, CEM aims to find a sequence of actions that minimizes the difference between the predicted future frames and the goal image. Then, the robot executes the first action in the planned action sequence and replans.
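The planning loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' planner: `cem_plan`, `toy_cost`, the number of iterations, and the unit-Gaussian initial sampling distribution are assumptions made for this example. The defaults of 200 candidates and 10 elites follow the values reported in Section 5.3. In visual MPC the cost would be computed by rolling out the learned video prediction model and comparing predicted frames to the goal image; here a toy quadratic cost stands in for that step.

```python
import random
import statistics


def cem_plan(cost_fn, horizon, action_dim, iterations=5,
             num_samples=200, num_elites=10, seed=0):
    """Search for a low-cost action sequence with the cross-entropy method.

    cost_fn maps a list of `horizon` action vectors to a scalar cost.
    """
    rng = random.Random(seed)
    dim = horizon * action_dim
    mean, std = [0.0] * dim, [1.0] * dim  # assumed initial sampling distribution

    def unflatten(flat):
        return [flat[i * action_dim:(i + 1) * action_dim] for i in range(horizon)]

    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian.
        candidates = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(num_samples)]
        # Keep the lowest-cost candidates (the "elites").
        elites = sorted(candidates,
                        key=lambda flat: cost_fn(unflatten(flat)))[:num_elites]
        # Refit the Gaussian to the elites, one dimension at a time.
        columns = list(zip(*elites))
        mean = [statistics.fmean(col) for col in columns]
        std = [statistics.pstdev(col) + 1e-6 for col in columns]
    return unflatten(mean)


def toy_cost(actions, target=(0.5, -0.5)):
    """Hypothetical stand-in for 'distance between predicted and goal image'."""
    return sum((a - t) ** 2 for step in actions for a, t in zip(step, target))


plan = cem_plan(toy_cost, horizon=3, action_dim=2)
first_action = plan[0]  # MPC executes only this action, then replans
```

Only `first_action` would be sent to the robot; the remainder of the plan is discarded and recomputed from the next observation.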

4 Method

Our goal is to robustify visual MPC’s performance when manipulating unseen objects through rapid adaptation of visual dynamics. Inspired by previous works, we hypothesize that humans’ capability of rapid adaptation is powered by their ability to infer latent properties that can facilitate prediction. To endow our agent with the same capability, we consider a setting different from prior work where we are given $K$ datasets $D_k$ for $k \in [1,K]$. Each dataset $D_k=\{\tau^{(1)},\tau^{(2)},\dots,\tau^{(N)}\}$ contains $N$ trajectories of the robot interacting with a specific environment. We assume there is a common underlying generative process $p$ such that the stochastic visual dynamics $p_k(I_t \mid I_{1:t-1},a_{t-1})$ for dataset $D_k$ can be written as $p_k(\cdot)=p(\cdot \mid c_k)$ for $c_k$ drawn from some distribution $p(c)$, where $c$ is a “context” embedding. Intuitively, the context $c$ shared by all trajectories in a dataset $D_k$ captures environmental information that generates the dataset, e.g., an object’s mass and shape.

Our task can thus be split into generation and inference components. The generation component consists of learning a model $g_\theta$ that approximates the underlying generative process $p$. To model stochasticity, we introduce latent variables $z_t$ at each time step to carry stochastic information about the next frame in a video. The inference component learns an approximate posterior over the context $c$. Since encoding the whole dataset $D_k$ is computationally prohibitive, we construct a support set $S$ which contains $M$ trajectories sub-sampled from $D_k$ and learn a function $E_\phi(c \mid S)$ to estimate $c$ from $S$. In our experiments, $M$ is a small number, $M \le 5$. We denote a trajectory, image, and action in the support set as $\tau^S$, $I_t^S$, and $a_t^S$. The rightmost subfigure of Figure 2 shows the graphical model of our method. Overall, our learning approach is an extension of the VAE, and we detail this formalism in Section 4.2.

Figure 2: Graphical models for video prediction. Actions are omitted for ease of notation. Left: time-invariant model proposed in Babaeizadeh et al. [2]. Middle Left: time variant model adopted in previous works [2, 5, 21]. Middle Right: learned prior model proposed in Denton et al. [5]. Right: our hierarchical generative process for video prediction. Different from previous works [2, 5, 21], we introduce a latent variable c that varies between videos of different datasets but is constant for videos of the same dataset. We refer to it as the context which captures environmental information used to generate the dataset.

4.1 Overview

Before detailing the training procedure, we explain how our method performs adaptation and generates predictions for unseen objects. To adapt the video prediction model, we first use the inference component, i.e., an experience encoder $E_\phi(c \mid S)$ with parameters $\phi$, to encode videos in the support set $S$ into an approximate posterior over context. Then, we sample a context from this posterior and feed it into our generation component to produce predictions. Our generation component is an action-conditioned video prediction model $g_\theta$, parameterized by $\theta$, that generates the next frame $\hat{I}_t$ based on previous frames $I_{1:t-1}$, action $a_{t-1}$, stochastic latent variables $z_{1:t}$, and context $c$.
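The adaptation procedure above amounts to "encode the support set, sample a context, roll the predictor forward". The sketch below shows that control flow only. The callables `encode_support` and `predict_step` are hypothetical placeholders for the experience encoder and one step of the prediction model, and drawing each latent from a standard normal at test time is an assumption of this example.

```python
import math
import random


def sample_gaussian(mean, log_var, rng):
    """Draw mean + std * eps with eps ~ N(0, 1) (re-parameterized sample)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def predict_rollout(encode_support, predict_step, support_set,
                    first_frame, actions, z_dim, rng):
    """Adapt to a new object and predict one frame per action.

    encode_support(support_set) -> (mean, log_var) of the context posterior.
    predict_step(frame, action, z, context, state) -> (next_frame, next_state).
    """
    c_mean, c_log_var = encode_support(support_set)
    context = sample_gaussian(c_mean, c_log_var, rng)  # sampled once per rollout
    frame, state, frames = first_frame, None, []
    for action in actions:
        # Assumption: latents are drawn from a standard normal at test time.
        z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]
        frame, state = predict_step(frame, action, z, context, state)
        frames.append(frame)
    return frames
```

The same context is reused at every time step, which mirrors the role of $c$ as a per-object quantity, while a fresh latent is drawn at each step.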

The two components are trained end-to-end jointly with the aid of a separate frame encoder $q_\psi(z_t \mid I_{1:t},c)$ (not used at test time). To train our model, we again first sample $c$ from the approximate posterior constructed by the experience encoder. Then, the frame encoder encodes $I_t$, i.e., the target of the prediction model, along with previous frames $I_{1:t-1}$ and $c$ to compute another distribution $q_\psi(z_t \mid I_{1:t},c)$ from which we sample $z_t$. By assuming the priors of the context $c$ and the latent variable $z$ to be normally distributed, we force $E_\phi(c \mid S)$ and $q_\psi(z_t \mid I_{1:t},c)$ to be close to the corresponding prior distributions $p(c)$ and $p(z)$ using KL-divergence terms. These two terms constrain the information that $c$ and $z_t$ can carry, forcing them to capture useful information for prediction. In addition, we have another term in our loss to penalize the reconstruction error between $\hat{I}_t$ and $I_t$.

4.2 Variational Auto-encoders Over Sets

To further explain our model, we adopt the formalism of variational auto-encoders over sets [9, 15]. For each trajectory $\tau$, the likelihood given a context $c$ can be written as:

$$
p(\tau) = \int \prod_{t=1}^{T} g_\theta(I_t \mid I_{1:t-1}, a_{t-1}, z_{1:t}, c)\, p(z_{1:t})\, dz_{1:t} \tag{1}
$$

However, this likelihood cannot be directly optimized as it involves marginalizing over the latent variables z1:t, which is generally intractable. Therefore, the frame encoder qψ(zt|I1:t,c) is used to approximate the posterior with a conditional Gaussian distribution 𝒩(μψ(I1:t,c),σψ(I1:t,c)). Then, the likelihood of each trajectory in a dataset can be approximated by the variational lower bound:

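$$
\mathcal{L}_{\theta,\psi}(I_{1:T}) = \sum_{t=1}^{T} \Big[ \mathbb{E}_{q_\psi(z_{1:t} \mid I_{1:t}, c)} \log g_\theta(I_t \mid I_{1:t-1}, a_{t-1}, z_{1:t}, c) - \beta\, D_{\mathrm{KL}}\big(q_\psi(z_t \mid I_{1:t}, c) \,\|\, p(z)\big) \Big] \tag{2}
$$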

Note that although Eq. (2) can be used to maximize the likelihood of trajectories in a single dataset [2, 5, 21], our goal is to approximate the commonly shared generative process p through maximizing the likelihood of many datasets. Based on the graphical model plotted on the right in Figure 2, the likelihood of a single dataset can be written as:

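$$
p(D) = \int p(c) \Big[ \prod_{\tau \in D} \int \prod_{t=1}^{T} g_\theta(I_t \mid I_{1:t-1}, a_{t-1}, z_{1:t}, c)\, p(z_{1:t} \mid c)\, dz_{1:t} \Big]\, dc \tag{3}
$$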

In order to solve the intractability of marginalizing over the context c, we introduce an experience encoder Eϕ(c|D) which takes as input the whole dataset to construct an approximate posterior over context. By applying variational inference, we can lower bound the likelihood of a dataset with:

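$$
\mathcal{L}_{\theta,\phi}(D) = \mathbb{E}_{E_\phi(c \mid D)} \Big[ \sum_{\tau \in D} \sum_{t=1}^{T} \big[ R_D - \beta Z_D \big] \Big] - \gamma C_D \tag{4}
$$

where

$$
\begin{aligned}
R_D &= \mathbb{E}_{q_\psi(z_{1:t} \mid I_{1:t}, c)} \log g_\theta(I_t \mid I_{1:t-1}, a_{t-1}, z_{1:t}, c) \\
Z_D &= D_{\mathrm{KL}}\big(q_\psi(z_t \mid I_{1:t}, c) \,\|\, p(z)\big) \\
C_D &= D_{\mathrm{KL}}\big(E_\phi(c \mid D) \,\|\, p(c)\big)
\end{aligned}
$$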

The hyper-parameters β and γ represent the trade-off between minimizing frame prediction error and fitting the prior. Taking smaller β and γ increases the representative power of the frame encoder and the experience encoder. Nonetheless, if β or γ is too small, both encoders may learn to simply copy the target frame It, which yields low training error but poor generalization at test time.
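To make the roles of β and γ concrete, the following sketch evaluates the bound of Eq. (4) for a single sampled context from precomputed per-frame terms. It assumes diagonal Gaussian posteriors and standard normal priors for both $z$ and $c$ (the text only states that the priors are normal), and the function and argument names are illustrative rather than taken from any released code.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mean, log_var))


def dataset_lower_bound(log_liks, z_posteriors, c_posterior, beta, gamma):
    """Evaluate Eq. (4) for one sampled context.

    log_liks[i][t]     : log-likelihood of frame t in trajectory i (the R term)
    z_posteriors[i][t] : (mean, log_var) of the frame encoder at that step
    c_posterior        : (mean, log_var) of the experience encoder
    """
    total = 0.0
    for traj_log_liks, traj_posteriors in zip(log_liks, z_posteriors):
        for log_lik, (z_mean, z_log_var) in zip(traj_log_liks, traj_posteriors):
            total += log_lik - beta * kl_to_standard_normal(z_mean, z_log_var)
    # The context KL is counted once per dataset, not once per frame.
    return total - gamma * kl_to_standard_normal(*c_posterior)
```

Raising β or γ increases the penalty on posteriors that stray from the prior, which is the mechanism that limits how much information $z_t$ and $c$ can carry.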

4.3 Stochastic Lower Bound

Note that in Eq. (4), calculating the variational lower bound for each dataset requires passing the whole dataset through both inference and generation components for each gradient update. This can become computationally prohibitive when a dataset contains hundreds of samples. In practice, since we only pass a support set S subsampled from D, the loss is re-scaled as follows:

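$$
\log p(D) \geq \mathbb{E}_{E_\phi(c \mid S)} \Big[ \sum_{\tau \in D} \sum_{t=1}^{T} \big[ R_D - \beta Z_D \big] \Big] - \gamma C_D = \sum_{\tau \in D} \Big[ \mathbb{E}_{E_\phi(c \mid S)} \sum_{t=1}^{T} \big[ R_D - \beta Z_D \big] - \frac{1}{|D|} \gamma C_D \Big] \tag{5}
$$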

In this work, we sample 5 trajectories without replacement to construct the support set S.
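A small sketch of the two ingredients in this subsection: drawing the support set without replacement, and spreading the context KL term evenly over the trajectories of a dataset so that per-trajectory terms add up to the full bound. The helper names are hypothetical, and the per-frame terms are passed in as plain numbers.

```python
import random


def sample_support_set(dataset, m=5, rng=random):
    """Draw up to m distinct trajectories from a dataset (no replacement)."""
    return rng.sample(dataset, min(m, len(dataset)))


def per_trajectory_bound(frame_terms, context_kl, dataset_size, gamma):
    """One summand of Eq. (5): sum_t [R - beta * Z] - gamma * C / |D|.

    frame_terms holds the already-combined per-frame values R - beta * Z.
    """
    return sum(frame_terms) - gamma * context_kl / dataset_size
```

Summing `per_trajectory_bound` over all trajectories of a dataset recovers the dataset-level expression, since the `1 / |D|` factor cancels against the number of summands.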

4.4 Model Architectures

We use SAVP [21] as our backbone architecture for the generation component. Our model gθ is a convolutional LSTM which predicts the transformations of pixels between the current and next frame. Skip connections with the first frame are added as done in SNA [8]. Note that at every time step t, the generation component only receives It-1, at-1, zt, and c as input. The dependencies on all previous I1:t-2 and z1:t-1 stem from the recurrent nature of the model. To implement the conditioning on action, latent variables, and context, we concatenate them along the channel dimension to the inputs of all the convolutional layers of the LSTM. The experience encoder is a feed-forward convolutional network which encodes the image at every time step and an LSTM which takes as input the encoded images sequentially. We apply the experience encoder to M trajectories in the support set to collect M LSTM final states. Sample mean is used as a pooling operation to collapse these states into the mean of our approximate posterior Eϕ(c|S). The frame encoder is a feed-forward convolutional network which encodes It and It-1 at every time step. We train the model using the re-parameterization trick and estimate the expectation over qψ(z1:t|I1:t,c) with a single sample.
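Two operations in this paragraph are easy to state in code: mean pooling over the M encoder states, and concatenating a vector to a convolutional feature map along the channel dimension. The sketch below uses nested Python lists in channel-height-width order. Broadcasting each vector entry to a constant plane before concatenation is an assumption of this example, as the text does not spell out how vectors are expanded spatially.

```python
def mean_pool(states):
    """Average M equal-length state vectors element-wise."""
    count = len(states)
    return [sum(values) / count for values in zip(*states)]


def concat_along_channels(feature_map, vector):
    """Append one constant H x W plane per vector entry to a C x H x W map."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    planes = [[[value] * width for _ in range(height)] for value in vector]
    return feature_map + planes


# Two hypothetical final states from a support set of M = 2 videos.
context_mean = mean_pool([[1.0, 2.0], [3.0, 6.0]])
conditioned = concat_along_channels([[[0.0, 0.0], [0.0, 0.0]]], context_mean)
```

Because the pooling is a plain average, the result does not depend on the order of the videos in the support set and is defined for any M of at least one.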

5 Experiments

In this section, we aim to answer the following questions:

  (i) Can we learn a context embedding c that facilitates visual dynamics adaptation?

  (ii) Does the context embedding c improve the performance of downstream visual MPC tasks?

  (iii) Can the context embedding c be used to infer similarity between objects?

First, we present video prediction results on Omnipush and KTH Action to answer questions (i) and (iii) in Sections 5.1 and 5.2. Then, we report results on two real-world control tasks to answer question (ii) in Section 5.3. In our experiments, we compare our method to the following baselines:

  • SAVP. Stochastic video prediction model trained with the VAE objective proposed in [21]. We only use VAE loss as it’s more intuitive to apply finetuning or gradient-based meta learning [10] to it. Also, we found no notable performance difference.

  • SAVP-F. SAVP finetuned with 5 videos in the support set with learning rate 5e-4 for 50 steps.

  • SAVP-MAML. SAVP trained with MAML [10] and finetuned the same way as in SAVP-F.

    We also tried implementing an RNN-based meta-learner [6], where SAVP’s hidden states are not reset after encoding videos from the support set. However, we were not able to achieve improved performance using this method.

5.1 Omnipush Dataset

Our first experiment uses the Omnipush dataset [3], which contains RGBD videos of an ABB IRB 120 industrial arm pushing 250 objects, each pushed 250 times. For each push, the robot selects a random direction and makes a 5cm straight push for 1s. Using modular meta-learning [1] on ground truth pose data (rather than raw RGBD), the authors showed that Omnipush is suitable for evaluating adaptation because it provides systematic variability, making it easy to study, e.g., the effects of mass distribution vs. shape on the dynamics of pushing.

We use the first split of the dataset (i.e., 70 objects without extra weight) to conduct the experiments. We condition on the first 2 frames and train to predict the next 10. To benchmark the adaptability of our approach, we train on 50 objects and test on the remaining 20 novel objects. Figure 3 shows qualitative results of video prediction. Following previous works [5, 21], we provide quantitative comparisons by computing the learned perceptual image patch similarity (LPIPS) [36], the structural similarity (SSIM) [32], and the peak signal-to-noise ratio (PSNR) between ground truth and generated video sequences. Quantitative results are shown in Figure 4. Our method predicts more physically plausible videos compared to baselines, and especially captures the shape of the pushed object better. We further show that this improves the performance of visual MPC in section 5.3.
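Of the three metrics, PSNR is simple enough to write out directly. The sketch below computes it for two images given as flat lists of pixel intensities in [0, 1], and adds a "best of several samples" helper in the spirit of evaluating the best sample from a stochastic model. The function names and the flat-list image representation are choices made for this illustration.

```python
import math


def psnr(reference, prediction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def best_sample_psnr(reference, samples):
    """Score a stochastic model by its best sample for this frame."""
    return max(psnr(reference, sample) for sample in samples)
```

Higher PSNR means lower mean squared error against the ground-truth frame; a uniform error of 0.1 on a [0, 1] intensity scale corresponds to about 20 dB.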

Figure 3: Qualitative Results for Omnipush. We show generated videos from EVF (ours), SAVP, SAVP-F, and SAVP-MAML. For clarity, patches that contain objects are cropped and shown on the right of predicted frames at each time step. Each method predicts 60 frames, with the time steps indicated at the top. Left: SAVP predictions deform the novel object into a circle. For both SAVP-F and SAVP-MAML, the novel object vanishes into the background. In comparison, EVF is able to predict more physically plausible futures. Right: the novel object completely deforms into an ellipse in all the predictions generated by baseline methods. Comparatively, EVF better preserves the orientation and the shape of the object. This can be useful for visual MPC, whose robustness heavily relies on the correctness of the predictions. Please refer to our supplementary material for video results.

Figure 4: Quantitative Results for Omnipush. We show the similarity (higher is better) between ground truth and the best sample as a function of prediction time step. Three metrics – LPIPS (left), PSNR (middle), and SSIM (right) – are used in our evaluation. The spikes that appear every 12 time steps are caused by the switch of pushing direction in the Omnipush dataset.

Visualization of context embedding.

In Figure 7, we use t-SNE [22] to visualize the context embedding of the 20 novel objects we tested. Despite never seeing these objects during training, our model is able to produce a discriminative embedding c where objects with similar properties, such as mass and shape, are close together, and those with different properties are far apart.

Figure 5: Qualitative Results for KTH Action. We show generated videos from EVF (ours), SAVP, SAVP-F, and SAVP-MAML. Each method predicts 30 frames, with the time steps indicated at the top. Left: SAVP’s predictions deform the human body. For SAVP-MAML and SAVP-F, the human body and head are compressed in the later time steps. In comparison, EVF is able to predict more realistic futures. Right: For both SAVP and SAVP-F, the person’s hand vanishes into the body. SAVP-MAML’s predictions preserve the person’s hand but slightly deform the head. In comparison, EVF better preserves the shape of the person. Results on KTH show that our method can be applied to domains beyond robotics.

Figure 6: Quantitative Results for KTH Action. We show the similarity (higher is better) between ground truth and the best sample as a function of prediction time step. Three metrics – LPIPS (left), PSNR (middle), and SSIM (right) – are used in our evaluation.

5.2 KTH Action Dataset

The KTH Action dataset [29] consists of real-world videos of people performing different actions, where each person performs the same action in 4 different settings. We split the dataset into many small datasets according to person and action. Therefore, the support set for a given video consists of the other 3 videos belonging to the same small dataset. Following the setting of previous work [21], our models are trained to predict 10 future frames conditioned on 10 initial frames. At test time, they are evaluated on predicting 30 frames into the future. We show qualitative results in Figure 5 and quantitative results in Figure 6.
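The split into small datasets can be expressed as a grouping by (person, action), with the support set for a video being the remaining members of its group. The record layout below (dictionaries with `person`, `action`, and `setting` keys) and the example labels are hypothetical and only serve to illustrate the bookkeeping.

```python
from collections import defaultdict


def group_videos(videos):
    """Group video records into small datasets keyed by (person, action)."""
    groups = defaultdict(list)
    for video in videos:
        groups[(video["person"], video["action"])].append(video)
    return dict(groups)


def support_set_for(target, groups):
    """Return every other video from the target's small dataset."""
    key = (target["person"], target["action"])
    return [video for video in groups[key] if video is not target]


# Toy index: 2 people x 2 actions x 4 settings.
videos = [{"person": p, "action": a, "setting": s}
          for p in (1, 2) for a in ("walking", "boxing") for s in range(4)]
groups = group_videos(videos)
support = support_set_for(videos[0], groups)
```

With 4 settings per (person, action) pair, each support set contains exactly 3 videos, matching the description above.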

5.3 Real World Pushing

We perform two types of real-world robotic experiments to verify that predictions generated by our method are more physically plausible, and that this benefits visual MPC. In the first experiment, we ask the robot to perform re-location, where the robot should push an object to a designated pose. In the second experiment, the robot has to imitate the pushing trajectory presented in a 10-second video. In both tasks we use a pixel-wise $\ell_2$ loss, with the robot arm masked, to measure the distance between the predicted image and the goal image. At each iteration of planning, our CEM planner samples 200 candidates and refits to the best 10 samples. The results of these experiments can be found in Table 1. Note that several more advanced loss functions have been recently proposed for visual MPC [34, 8, 7], but we opted for a simple $\ell_2$ loss to isolate the contribution of our method in terms of improved visual predictions, rather than improved planning.
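The planning cost described above can be written as a sum over pixels that skips those covered by the arm. This is a sketch under stated assumptions: images are flat lists of intensities, the mask is a parallel list of booleans (True where the arm is), and the squared difference is used without a square root, none of which is specified in the text.

```python
def masked_l2_cost(predicted, goal, arm_mask):
    """Sum of squared pixel differences over pixels not covered by the arm."""
    return sum((p - g) ** 2
               for p, g, masked in zip(predicted, goal, arm_mask)
               if not masked)
```

Masking matters because the arm occupies many pixels and moves with every action; without the mask, the planner could lower the cost by matching the arm's appearance in the goal image rather than the object's pose.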

Table 1: Errors in the re-positioning and trajectory-tracking tasks, in millimeters. Re-positioning (first table): results are calculated with 5 different random seeds. Trajectory tracking (second table): we report the mean error across time steps and the final offset.

Re-positioning:

Method       Type     Seen   Unseen
No motion    mean     90.4   91.9
No motion    median   87.4   90.4
SAVP         mean     26.2   48.4
SAVP         median   27.1   48.5
EVF (ours)   mean     21.8   36.2
EVF (ours)   median   20.1   29.1

Trajectory tracking:

Method       Type    Traj 1   Traj 2   Traj 3   Traj 4
No motion    mean    112      93.6     91.1     97.7
No motion    final   154      127      137      141
SAVP         mean    33.5     42.7     29.5     41.2
SAVP         final   79.8     60.8     49       59.5
EVF (ours)   mean    31.6     25.2     21.6     27.5
EVF (ours)   final   59.9     47.7     48.7     48.6
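To read the re-positioning numbers as relative changes, one can compute the fractional error reduction of EVF with respect to SAVP. The percentages produced this way are derived arithmetic on the Table 1 entries, not figures reported by the authors, and `relative_reduction` is a helper introduced only for this illustration.

```python
def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Re-positioning errors in millimeters, copied from Table 1 (SAVP vs. EVF).
seen_mean = relative_reduction(26.2, 21.8)
unseen_mean = relative_reduction(48.4, 36.2)
unseen_median = relative_reduction(48.5, 29.1)
```

On these entries the reduction is roughly 17% for the mean error on seen objects, 25% for the mean error on unseen objects, and 40% for the median error on unseen objects.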

6 Conclusion

In this work, we considered the problem of visual dynamics adaptation in order to robustify visual MPC’s performance when manipulating novel objects. We proposed a hierarchical Bayes model and showed improved results on both video prediction and model-based control. In the future, it would be interesting to investigate how to actively collect useful data for few-shot adaptation.

Figure 7: t-SNE [22] results for Omnipush. We visualize the context embedding of 20 novel objects through t-SNE. We found that embeddings are closer to each other when objects possess similar shapes and mass.


We thank Alberto Rodriguez, Shuran Song, and Wei-Chiu Ma for helpful discussions. This research was supported in part by the MIT Quest for Intelligence and by iFlytek.


  • [1] F. Alet, T. Lozano-Pérez, and L. P. Kaelbling (2018) Modular meta-learning. arXiv preprint arXiv:1806.10166.
  • [2] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine (2017) Stochastic variational video prediction. arXiv preprint arXiv:1710.11252.
  • [3] M. Bauza, P. Isola, and A. Rodriguez (2019) Omnipush: accurate, diverse, real-world dataset of pushing dynamics with rgb-d video. In 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS).
  • [4] A. Byravan and D. Fox (2017) Se3-nets: learning rigid body motion using deep neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 173–180.
  • [5] E. Denton and R. Fergus (2018) Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687.
  • [6] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel (2016) RL$^2$: fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
  • [7] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568.
  • [8] F. Ebert, C. Finn, A. X. Lee, and S. Levine (2017) Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268.
  • [9] H. Edwards and A. Storkey (2016) Towards a neural statistician. arXiv preprint arXiv:1606.02185.
  • [10] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
  • [11] C. Finn, I. Goodfellow, and S. Levine (2016) Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, pp. 64–72.
  • [12] C. Finn and S. Levine (2017) Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
  • [14] D. Ha and J. Schmidhuber (2018) World models. arXiv preprint arXiv:1803.10122.
  • [15] L. B. Hewitt, M. I. Nye, A. Gane, T. Jaakkola, and J. B. Tenenbaum (2018) The variational homoencoder: learning to learn high capacity generative models from few examples. arXiv preprint arXiv:1807.08919.
  • [16] S. James, M. Bloesch, and A. J. Davison (2018) Task-embedded control networks for few-shot imitation learning. arXiv preprint arXiv:1810.03237.
  • [17] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu (2017) Video pixel networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1771–1779.
  • [18] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • [19] O. Kroemer, S. Niekum, and G. Konidaris (2019) A review of robot learning for manipulation: challenges, representations, and algorithms. CoRR abs/1907.03146.
  • [20] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma (2019) VideoFlow: a flow-based generative model for video. arXiv preprint arXiv:1903.01434.
  • [21] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine (2018) Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523.
  • [22] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605.
  • [23] M. Mathieu, C. Couprie, and Y. LeCun (2015) Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
  • [24] A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn (2018) Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347.
  • [25] A. Nagabandi, C. Finn, and S. Levine (2018) Deep online learning via meta-learning: continual adaptation for model-based rl. arXiv preprint arXiv:1812.07671.
  • [26] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh (2015) Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pp. 2863–2871.
  • [27] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759.
  • [28] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell (2018) Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2050–2053.
  • [29] C. Schüldt, I. Laptev, and B. Caputo (2004) Recognizing human actions: a local SVM approach. In Proc. Int. Conf. Pattern Recognition (ICPR’04), Cambridge, U.K.
  • [30] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee (2017) Learning to generate long-term future via hierarchical prediction. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3560–3569.
  • [31] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638.
  • [32] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612.
  • [33] N. Wichers, R. Villegas, D. Erhan, and H. Lee (2018) Hierarchical long-term video prediction without supervision. arXiv preprint arXiv:1806.04768.
  • [34] A. Xie, A. Singh, S. Levine, and C. Finn (2018) Few-shot goal inference for visuomotor learning and planning. In Conference on Robot Learning, pp. 40–52.
  • [35] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. Johnson, and S. Levine (2018) SOLAR: deep structured representations for model-based reinforcement learning.
  • [36] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.