Image inpainting is an underdetermined inverse problem that naturally allows diverse contents to fill the missing or corrupted regions reasonably and realistically. Prevalent approaches using convolutional neural networks (CNNs) can synthesize visually pleasing contents, but CNNs suffer from limited receptive fields for capturing global features. With image-level attention, transformers can model long-range dependencies and generate diverse contents through autoregressive modeling of pixel-sequence distributions. However, the unidirectional attention in transformers is suboptimal, as corrupted regions can have arbitrary shapes with contexts from arbitrary directions. We propose BAT-Fill, an image inpainting framework with a novel bidirectional autoregressive transformer (BAT) that models deep bidirectional contexts for autoregressive generation of diverse inpainting contents. BAT-Fill inherits the merits of transformers and CNNs in a two-stage manner, which allows it to generate high-resolution contents without being constrained by the quadratic complexity of attention in transformers. Specifically, it first generates pluralistic image structures at low resolution by adapting transformers and then synthesizes realistic texture details at high resolution with a CNN-based up-sampling network. Extensive experiments over multiple datasets show that BAT-Fill achieves superior diversity and fidelity in image inpainting, both qualitatively and quantitatively.
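The two limitations named above can be made concrete with a small sketch. It is not the BAT-Fill model: the helper names, the raster-scan flattening, the 4x4 mask, and the 32x32 and 256x256 resolutions are all assumptions chosen for illustration. The first function counts the pairwise scores in one full self-attention map, which grows quadratically with the number of pixels; the second counts how many known pixels a missing pixel can attend to under unidirectional versus bidirectional attention.

```python
def attention_entries(height, width):
    """Number of pairwise scores in one full self-attention map."""
    tokens = height * width
    return tokens * tokens


def visible_context(mask, query, bidirectional):
    """Count known pixels that a query pixel may attend to.

    mask: 2D list, 1 = missing pixel, 0 = known pixel.
    query: (row, col) of the pixel being predicted.
    Pixels are flattened in raster-scan order (an assumption of this sketch);
    unidirectional attention only sees positions earlier in that order.
    """
    width = len(mask[0])
    q = query[0] * width + query[1]
    count = 0
    for r, row in enumerate(mask):
        for c, missing in enumerate(row):
            if missing:
                continue  # missing pixels carry no known context
            idx = r * width + c
            if bidirectional or idx < q:
                count += 1
    return count


# Hypothetical 4x4 image with a 2x2 hole in the top-left corner.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

if __name__ == "__main__":
    low = attention_entries(32, 32)
    high = attention_entries(256, 256)
    print("attention scores at 32x32:  ", low)
    print("attention scores at 256x256:", high)
    print("growth factor:", high // low)
    for query in [(0, 0), (1, 1)]:
        uni = visible_context(mask, query, bidirectional=False)
        bi = visible_context(mask, query, bidirectional=True)
        print(query, "unidirectional:", uni, "bidirectional:", bi)
```

In this toy setting the top-left missing pixel sees no known context at all under unidirectional attention, because every known pixel comes later in the scan order, whereas bidirectional attention exposes all twelve known pixels. The quadratic growth in the first function is the reason a low-resolution transformer stage followed by a CNN up-sampling stage is attractive.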