Stochastic video prediction is usually framed as an extrapolation problemwhere the goal is to sample a sequence of consecutive future image framesconditioned on a sequence of observed past frames. For the most part,algorithms for this task generate future video frames sequentially in anautoregressive fashion, which is slow and requires the input and output to beconsecutive. We introduce a model that overcomes these drawbacks -- it learnsto generate a global latent representation from an arbitrary set of frameswithin a video. This representation can then be used to simultaneously andefficiently sample any number of temporally consistent frames at arbitrarytime-points in the video. We apply our model to synthetic video predictiontasks and achieve results that are comparable to state-of-the-art videoprediction models. In addition, we demonstrate the flexibility of our model byapplying it to 3D scene reconstruction where we condition on location insteadof time. To the best of our knowledge, our model is the first to provideflexible and coherent prediction on stochastic video datasets, as well asconsistent 3D scene samples. Please check the project websitehttps://bit.ly/2jX7Vyu to view scene reconstructions and videos produced by ourmodel.