Abstract
The composition of pretraining data is a key determinant of foundation models' performance, but there is no standard guideline for allocating a limited computational budget across different data sources. Most current approaches either rely on extensive experiments with smaller models or dynamic data adjustments that also require proxy models, both of which significantly increase the workflow complexity and computational overhead. In this paper, we introduce Adaptive Data Optimization (ADO), an algorithm that optimizes data distributions in an online fashion, concurrent with model training. Unlike existing techniques, ADO does not require external knowledge, proxy models, or modifications to the model update. Instead, ADO uses per-domain scaling laws to estimate the learning potential of each domain during training and adjusts the data mixture accordingly, making it more scalable and easier to integrate. Experiments demonstrate that ADO can achieve comparable or better performance than prior methods while maintaining computational efficiency across different computation scales, offering a practical solution for dynamically adjusting data distribution without sacrificing flexibility or increasing costs. Beyond its practical benefits, ADO also provides a new perspective on data collection strategies via scaling laws.
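To make the idea of per-domain scaling laws driving the data mixture concrete, the following is a minimal sketch and not the algorithm as specified in the paper. It assumes each domain's loss follows a pure power law L(n) = beta * n^(-alpha) with no irreducible term, treats the magnitude of the fitted slope as the "learning potential", and reweights a fixed prior mixture by that quantity. The function names, the domain names `web` and `code`, and the synthetic loss curves are all hypothetical.

```python
import math

def fit_power_law(steps, losses):
    """Least-squares fit of log(loss) = log(beta) - alpha * log(step).

    Assumption: a pure power law with no irreducible-loss term.
    """
    xs = [math.log(n) for n in steps]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    alpha = -slope
    beta = math.exp(mean_y - slope * mean_x)
    return alpha, beta

def learning_potential(alpha, beta, step):
    """Magnitude of the fitted loss slope: -dL/dn = alpha * beta * n^(-alpha - 1)."""
    return alpha * beta * step ** (-alpha - 1)

def update_mixture(prior, histories, step):
    """Reweight the prior mixture by each domain's estimated learning potential."""
    scores = {}
    for domain, (steps, losses) in histories.items():
        alpha, beta = fit_power_law(steps, losses)
        potential = max(learning_potential(alpha, beta, step), 0.0)
        scores[domain] = prior[domain] * potential
    total = sum(scores.values())
    if total == 0.0:
        # No domain is estimated to be improving: fall back to the prior.
        return dict(prior)
    return {domain: score / total for domain, score in scores.items()}

# Synthetic loss curves for two hypothetical domains (illustrative only).
steps = [100, 200, 400, 800]
histories = {
    "web": (steps, [4.0 * n ** -0.1 for n in steps]),
    "code": (steps, [3.0 * n ** -0.3 for n in steps]),
}
prior = {"web": 0.5, "code": 0.5}

mixture = update_mixture(prior, histories, step=800)
print(mixture)
```

In this toy setup the domain whose fitted loss curve is still falling faster at the current step receives more sampling weight. A practical version would refit the curves periodically during training and would likely smooth the mixture over time, but those details are beyond what the abstract states.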