Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

Abstract

We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training the parameters on data for the new domain, and then merging the resulting model back into the set for future use. Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs, when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes, but require expert domain specialization; LM ensembles with random data splits do not perform well. We also present a study of scaling BTM into a new corpus of 64 domains (192B whitespace-separated tokens in total); the resulting LM (22.4B total parameters) performs as well as a Transformer LM trained with 2.5 times more compute. These gains grow with the number of domains, suggesting more aggressive parallelism could be used to efficiently train larger models in future work.
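
The branch, train, and merge steps, together with the two ways of combining experts (parameter averaging and output ensembling), can be illustrated with a toy sketch. This is not the authors' implementation: parameters are plain lists of floats, the mixture weights are assumed to be given, and the names `average_params`, `ensemble_probs`, `branch_train_merge`, and `train_fn` are hypothetical, introduced only for illustration.

```python
from typing import Callable, Dict, List

# Toy stand-in for model parameters: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def average_params(experts: List[Params], weights: List[float]) -> Params:
    """Weighted average of expert parameters, name by name.

    Assumes every expert has the same parameter names and shapes.
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    merged: Params = {}
    for name in experts[0]:
        size = len(experts[0][name])
        merged[name] = [
            sum(w * e[name][i] for w, e in zip(norm, experts))
            for i in range(size)
        ]
    return merged


def ensemble_probs(expert_probs: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted mixture of the experts' next-token distributions."""
    total = sum(weights)
    vocab = len(expert_probs[0])
    return [
        sum((w / total) * p[i] for w, p in zip(weights, expert_probs))
        for i in range(vocab)
    ]


def branch_train_merge(
    expert_set: List[Params],
    weights: List[float],
    train_fn: Callable[[Params], Params],
) -> List[Params]:
    """One BTM-style step on the toy representation.

    Branch: initialize a new expert from a weighted mixture of current experts.
    Train:  train_fn is a hypothetical placeholder for training on new-domain data.
    Merge:  add the trained expert back to the set.
    """
    new_expert = average_params(expert_set, weights)  # branch
    new_expert = train_fn(new_expert)                 # train (independent of other experts)
    return expert_set + [new_expert]                  # merge


experts = [{"w": [0.0, 2.0]}, {"w": [2.0, 4.0]}]
averaged = average_params(experts, [1.0, 1.0])
mixed = ensemble_probs([[0.5, 0.5], [1.0, 0.0]], [1.0, 1.0])
grown = branch_train_merge(
    experts,
    [1.0, 1.0],
    lambda p: {k: [x + 1.0 for x in v] for k, v in p.items()},  # dummy "training"
)
```

In this sketch, averaging collapses the set into a single parameter dictionary (one model at inference time), while ensembling keeps all experts and mixes their output distributions. How the mixture weights are chosen is left outside the sketch.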
