The video generation task can be formulated as the prediction of future video frames given some past frames. Recent generative models for video face the problem of high computational requirements: some models require up to 512 Tensor Processing Units for parallel training. In this work, we address this problem by modeling the dynamics in a latent space. After transforming the frames into the latent space, our model predicts the latent representations of the next frames in an autoregressive manner. We demonstrate the performance of our approach on the BAIR Robot Pushing and Kinetics-600 datasets. The approach reduces the training requirements to 8 Graphics Processing Units while maintaining comparable generation quality.
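
As a sketch of this formulation, under notation that is assumed here for illustration and not taken from the paper (frames $x_t$, an encoder $E$, a decoder $D$, latents $z_t$, $k$ given past frames, $T$ total frames, and model parameters $\theta$), latent autoregressive prediction can be written as:

\[
z_t = E(x_t), \qquad
p_\theta(z_{k+1:T} \mid z_{1:k}) = \prod_{t=k+1}^{T} p_\theta(z_t \mid z_{1:t-1}), \qquad
\hat{x}_t = D(\hat{z}_t), \quad \hat{z}_t \sim p_\theta(\,\cdot \mid z_{1:k}, \hat{z}_{k+1:t-1}).
\]

In this reading, the autoregressive model operates only on the latents $z_t$, and frames are recovered by decoding the predicted latents. The computational savings would then come from the latent space being smaller than the frame space, which is an assumption of this sketch.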