Abstract
Autoregressive generative models of images tend to be biased towards capturing local structure, and as a result they often produce samples which are lacking in terms of large-scale coherence. To address this, we propose two methods to learn discrete representations of images which abstract away local detail. We show that autoregressive models conditioned on these representations can produce high-fidelity reconstructions of images, and that we can train autoregressive priors on these representations that produce samples with large-scale coherence. We can recursively apply the learning procedure, yielding a hierarchy of progressively more abstract image representations. We train hierarchical class-conditional autoregressive models on the ImageNet dataset and demonstrate that they are able to generate realistic images at resolutions of 128$\times$128 and 256$\times$256 pixels. We also perform a human evaluation study comparing our models with both adversarial and likelihood-based state-of-the-art generative models.