### Abstract

This paper reveals that every image can be understood as a first-ordernorm+linear autoregressive process, referred to as FINOLA, where norm+lineardenotes the use of normalization before the linear model. We demonstrate thatimages of size 256$\times$256 can be reconstructed from a compressed vectorusing autoregression up to a 16$\times$16 feature map, followed by upsamplingand convolution. This discovery sheds light on the underlying partialdifferential equations (PDEs) governing the latent feature space. Additionally,we investigate the application of FINOLA for self-supervised learning through asimple masked prediction technique. By encoding a single unmasked quadrantblock, we can autoregressively predict the surrounding masked region.Remarkably, this pre-trained representation proves effective for imageclassification and object detection tasks, even in lightweight networks,without requiring fine-tuning. The code will be made publicly available.