Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.