Abstract
The GPT-4 technical report highlights the possibility of predicting model performance on downstream tasks using only pre-training signals, though detailed methodologies are absent. Such predictive capabilities are essential for resource-efficient pre-training and the construction of task-aligned datasets. In this paper, we aim to predict performance in closed-book question answering (QA), a vital downstream task indicative of a model's internal knowledge. We address three primary challenges: (1) limited access to and understanding of pre-training corpora, (2) limitations of current evaluation methods for pre-trained models, and (3) limitations of frequency-based metrics in predicting model performance. In response to these challenges, we conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models. Subsequently, we develop a multi-template QA evaluation framework incorporating paraphrased question variants. Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics, model size, and QA accuracy, without requiring any additional training. The experimental results demonstrate that SMI outperforms co-occurrence-based baselines, achieving $R^2 > 0.75$ on models with over one billion parameters. Theoretical analysis further reveals the marginal benefits of scaling model size and optimizing data, indicating that the upper limit of specific QA task accuracy is approximately 80%. Our project is available at https://github.com/yuhui1038/SMI.