In this work, we study the large-scale pretraining of BERT-Large with differentially private SGD (DP-SGD). We show that, combined with a careful implementation, scaling up the batch size to millions (i.e., mega-batches) improves the utility of the DP-SGD step for BERT; we also enhance its efficiency by using an increasing batch size schedule. Our implementation builds on the recent work of [SVK20], who demonstrated that the overhead of a DP-SGD step is minimized with effective use of JAX [BFH+18, FJL18] primitives in conjunction with the XLA compiler [XLA17]. Our implementation achieves a masked language model accuracy of 60.5% at a batch size of 2M, for $\epsilon = 5.36$. To put this number in perspective, non-private BERT models achieve an accuracy of $\sim$70%.
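To make the DP-SGD step concrete, the following is a minimal, framework-free sketch in plain Python; it is not the JAX implementation referred to above. It assumes the standard DP-SGD recipe: clip each per-example gradient to an L2 norm bound, sum the clipped gradients, add Gaussian noise, and average. The names `clip_by_l2_norm`, `dp_sgd_step`, and `batch_size_at`, the parameters `clip_norm` and `noise_multiplier`, and the linear form of the batch size ramp are all hypothetical choices made for illustration; the schedule actually used may differ.

```python
import math
import random


def clip_by_l2_norm(grad, clip_norm):
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update: clip per example, sum, add Gaussian noise, average."""
    batch_size = len(per_example_grads)
    clipped = [clip_by_l2_norm(g, clip_norm) for g in per_example_grads]
    # Sum the clipped gradients coordinate by coordinate.
    summed = [sum(column) for column in zip(*clipped)]
    # Noise is calibrated to the clipping bound, not to the batch size.
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


def batch_size_at(step, total_steps, start, end):
    """Hypothetical linear ramp of the batch size from start to end."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return int(round(start + frac * (end - start)))


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
    print(dp_sgd_step([0.0, 0.0], grads, clip_norm=1.0,
                      noise_multiplier=1.0, lr=0.1, rng=rng))
    print([batch_size_at(s, 5, 1000, 5000) for s in range(5)])
```

In this sketch the noise added to the summed gradient has standard deviation `noise_multiplier * clip_norm` regardless of the batch size, so the noise on the averaged gradient shrinks in proportion to one over the batch size. This is one way to see why very large batches can improve the utility of each step.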