We present a VAE architecture for encoding and generating high-dimensional sequential data, such as video or audio. Our deep generative model learns a latent representation of the data which is split into a static and a dynamic part, allowing us to approximately disentangle latent time-dependent features (dynamics) from features which are preserved over time (content). This architecture gives us partial control over generating content and dynamics by conditioning on either one of these sets of features. In our experiments on artificially generated cartoon video clips and voice recordings, we show that we can convert the content of a given sequence into another one by such content swapping. For audio, this allows us to convert a male speaker into a female speaker and vice versa, while for video we can separately manipulate shapes and dynamics. Furthermore, we give empirical evidence for the hypothesis that stochastic RNNs as latent state models are more efficient at compressing and generating long sequences than deterministic ones, which may be relevant for applications in video compression.
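
To make the idea of content swapping concrete, the following is a minimal toy sketch, not the model described above. It assumes a hypothetical linear decoder in place of a learned network, and random latent codes in place of encoder outputs; the names `decode`, `W_f`, `W_z` and all dimensions are illustrative assumptions. It shows only the mechanism: one static code per sequence, one dynamic code per time step, and a swap of the static code between two sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: sequence length, static dim, dynamic dim, frame dim.
T, D_F, D_Z, D_X = 8, 4, 3, 16

# Assumption: a linear decoder stands in for the learned generative network.
W_f = rng.normal(size=(D_X, D_F))
W_z = rng.normal(size=(D_X, D_Z))


def decode(f, z):
    """Map a static code f (D_F,) and dynamic codes z (T, D_Z) to frames (T, D_X)."""
    # The static contribution is the same vector added to every time step.
    return z @ W_z.T + f @ W_f.T


# Latent codes for two sequences A and B. They are random here; in a trained
# model they would be inferred from data by the encoder.
f_a, z_a = rng.normal(size=D_F), rng.normal(size=(T, D_Z))
f_b, z_b = rng.normal(size=D_F), rng.normal(size=(T, D_Z))

x_a = decode(f_a, z_a)      # sequence A as generated from its own codes
x_swap = decode(f_b, z_a)   # content of B combined with the dynamics of A
```

In this toy setting the swapped sequence differs from the original by a time-constant offset, so its frame-to-frame changes are identical to those of sequence A. With a nonlinear decoder the separation would only be approximate, in line with the "approximately disentangle" wording above.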