Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Abstract

We introduce~\textsc{Domain2Vec}, a novel approach that decomposes any dataset into a linear combination of several \emph{meta-domains}, a new concept designed to capture the key underlying features of datasets. \textsc{Domain2Vec} maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the \emph{\textbf{D}istribution \textbf{A}lignment \textbf{A}ssumption} (DA$^{2}$), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, \textsc{Domain2Vec} can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that \textsc{Domain2Vec} helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, \textsc{Domain2Vec} achieves the same validation loss on Pile-CC using only $51.5\%$ of the computation required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, \textsc{Domain2Vec} improves downstream performance by an average of $2.83\%$.
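
To make the training-free idea concrete, the following is a minimal sketch of mixture selection under DA$^{2}$, not the authors' implementation. It assumes that each training dataset already has a domain vector (a distribution over meta-domains), that the domain vector of a mixture is the weighted sum of its datasets' vectors, and that alignment is measured by Euclidean distance. The abstract does not specify the distance measure or the search procedure, so the function name `best_mixture`, the random Dirichlet search, and the toy vectors are all illustrative assumptions.

```python
import numpy as np

def best_mixture(domain_vectors, val_vector, n_candidates=10000, seed=0):
    """Return the candidate mixture whose blended domain vector is closest
    to the validation set's domain vector (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_datasets = domain_vectors.shape[0]
    # Candidate mixture weights; each row is non-negative and sums to 1.
    candidates = rng.dirichlet(np.ones(n_datasets), size=n_candidates)
    # Assumed: a mixture's domain vector is the weighted sum of dataset vectors.
    blended = candidates @ domain_vectors
    # Assumed alignment measure: Euclidean distance to the validation vector.
    dists = np.linalg.norm(blended - val_vector, axis=1)
    best = int(np.argmin(dists))
    return candidates[best], float(dists[best])

# Toy data: three training datasets described over three meta-domains.
domain_vectors = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])
val_vector = np.array([0.6, 0.3, 0.1])

weights, dist = best_mixture(domain_vectors, val_vector)
print("mixture weights:", np.round(weights, 3))
print("distance to validation vector:", round(dist, 4))
```

No language model is trained at any point in this sketch: the search only compares vectors, which is what makes this style of mixture selection cheap relative to methods that train proxy models for each candidate mixture.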
