Abstract
Most language models (LMs) are trained and applied in an autoregressive left-to-right fashion, assuming that the next token depends only on the preceding ones. However, this assumption ignores the potential benefits of using the full sequence information during training, and the possibility of having context from both sides during inference. In this paper, we propose a new pre-training paradigm with two techniques that jointly improve the training data efficiency and the capabilities of the LMs in the infilling task. The first is a training objective that aligns the predictions of a left-to-right LM with those of a right-to-left LM, trained on the same data but in reverse order. The second is a bidirectional inference procedure that enables both LMs to meet in the middle. We show the effectiveness of our pre-training paradigm with extensive experiments on both programming and natural language models, outperforming strong baselines.