Abstract
Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain re-weighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source and target field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis shows that the agent's heuristics align well with human intuitions and that it achieves superior model performance with less source-field data.