ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

Abstract

Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no learning-based approach has emerged as a reliable solution. In this work, we propose to view the selection of training data mixtures as a black-box hyperparameter optimization problem, for which Bayesian Optimization is a well-established class of appropriate algorithms. Firstly, we cast data mixture learning as a sequential decision-making problem, in which we aim to find a suitable trade-off between the computational cost of training exploratory (proxy-) models and final mixture performance. Secondly, we systematically explore the properties of transferring mixtures learned at a small scale to larger-scale experiments, providing insights and highlighting opportunities for research at a modest scale. By proposing Multi-fidelity Bayesian Optimization as a suitable method in this common scenario, we introduce a natural framework to balance experiment cost with model fit, avoiding the risks of overfitting to smaller scales while minimizing the number of experiments at high cost. We present results for pre-training and instruction finetuning across models ranging from 1 million to 7 billion parameters, varying from simple architectures to state-of-the-art models and benchmarks spanning dozens of datasets. We demonstrate consistently strong results relative to a wide range of benchmarks, showing speed-ups of over 500% in determining the best data mixture on our largest experiments relative to recent baselines. In addition, we broaden access to research by sharing ADMIRE IFT Runs, a dataset of 460 full training & evaluation runs across various model sizes worth over 13,000 GPU hours, greatly reducing the cost of conducting research in this area.
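
To make the black-box framing concrete, the following is a minimal single-fidelity sketch of Bayesian Optimization over mixture weights on the probability simplex. It is not the paper's implementation: the objective `proxy_loss` is a hypothetical stand-in for training and evaluating a proxy model, and the Gaussian-process surrogate (RBF kernel, fixed length scale), the Expected Improvement acquisition, and the Dirichlet candidate sampling are generic choices assumed for illustration. The multi-fidelity variant described above would additionally condition the surrogate on model scale and weigh each evaluation by its cost.

```python
import numpy as np
from math import erf, sqrt, pi


def proxy_loss(w):
    """Hypothetical objective: stands in for 'train a proxy model on mixture w,
    return validation loss'. Here it is a toy quadratic with a known optimum."""
    target = np.array([0.5, 0.3, 0.2])  # assumed for illustration only
    return float(np.sum((w - target) ** 2))


def rbf_kernel(A, B, length_scale=0.3):
    # Squared-exponential kernel between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def gp_posterior(X, y, Xq, jitter=1e-6):
    # Zero-mean GP on centred targets; returns posterior mean and std at Xq.
    y_mean = y.mean()
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))
    Ks = rbf_kernel(X, Xq)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - y_mean))
    mu = Ks.T @ alpha + y_mean
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - (v**2).sum(axis=0), 1e-12, None)
    return mu, np.sqrt(var)


def expected_improvement(mu, sigma, best):
    # Expected Improvement for minimisation.
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + np.array([erf(t) for t in z / sqrt(2.0)]))
    pdf = np.exp(-0.5 * z**2) / sqrt(2.0 * pi)
    return (best - mu) * cdf + sigma * pdf


def bayes_opt_mixture(objective, n_domains=3, n_init=5, n_iter=15,
                      n_candidates=512, seed=0):
    rng = np.random.default_rng(seed)
    # Initial design: random mixtures drawn uniformly from the simplex.
    X = rng.dirichlet(np.ones(n_domains), size=n_init)
    y = np.array([objective(w) for w in X])
    for _ in range(n_iter):
        # Score random candidate mixtures and evaluate the most promising one.
        cand = rng.dirichlet(np.ones(n_domains), size=n_candidates)
        mu, sigma = gp_posterior(X, y, cand)
        w_next = cand[np.argmax(expected_improvement(mu, sigma, y.min()))]
        X = np.vstack([X, w_next])
        y = np.append(y, objective(w_next))
    best = int(np.argmin(y))
    return X[best], float(y[best])


if __name__ == "__main__":
    weights, loss = bayes_opt_mixture(proxy_loss)
    print("best mixture:", weights, "loss:", loss)
```

Each call to the objective corresponds to one training run, so the loop spends its budget of `n_init + n_iter` runs sequentially, which is the cost-versus-performance trade-off the abstract refers to.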
