Abstract
Non-autoregressive generative transformers have recently demonstrated impressive image generation performance and orders-of-magnitude faster sampling than their autoregressive counterparts. However, optimal parallel sampling from the true joint distribution of visual tokens remains an open challenge. In this paper we introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer. Given a masked-and-reconstructed real image, the Token-Critic model is trained to distinguish which visual tokens belong to the original image and which were sampled by the generative transformer. During non-autoregressive iterative sampling, Token-Critic is used to select which tokens to accept and which to reject and resample. Coupled with Token-Critic, a state-of-the-art generative transformer significantly improves its performance and outperforms recent diffusion models and GANs in terms of the trade-off between generated image quality and diversity on the challenging task of class-conditional ImageNet generation.