Abstract
Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has recently been demonstrated across many NLP domains by a Transformer-based language model with over a billion parameters, verifying the benefits of model size. Model parallelism is required when a model is too large to fit on a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds up the training of Transformer-based language models. We also prove that the proposed algorithm is guaranteed to converge to critical points for non-convex problems. Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm obtains a much larger speedup beyond data parallelism, with comparable or better accuracy. Code to reproduce the experiments is available at \url{https://github.com/LaraQianYang/Ouroboros}.