Abstract
Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (e.g., through majority voting or by using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such a set of multiple samples. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, optimizing it for answer accuracy with reinforcement learning. Experiments on multiple reasoning datasets show that SSA outperforms other test-time scaling methods such as reward model-based re-ranking. Our approach also shows promising generalization across sample set sizes, base model families and scales, and tasks. By separating the LLMs that generate answers from the LLM that analyzes and aggregates the sampled answers, our approach can work with the outputs of premier black-box models easily and efficiently.
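
As a rough illustration of the two aggregation styles contrasted above, the following sketch shows a majority-voting baseline next to the construction of a single concatenated input for an aggregator model. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names `majority_vote` and `build_aggregator_input` and the prompt layout ("Question:", "Sample i:", "Final answer:") are hypothetical, and the aggregator model and its reinforcement learning training are omitted.

```python
from collections import Counter


def majority_vote(answers):
    """Heuristic baseline: return the most frequent final answer.

    Ties are broken in favor of the answer that was seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def build_aggregator_input(question, samples):
    """Concatenate the question and all sampled solutions into one sequence.

    The layout here is an assumption made for illustration; an aggregator
    model would read this sequence and generate the final answer.
    """
    parts = [f"Question: {question}"]
    for i, sample in enumerate(samples, start=1):
        parts.append(f"Sample {i}: {sample}")
    parts.append("Final answer:")
    return "\n\n".join(parts)
```

Majority voting only looks at the extracted final answers, whereas the concatenated sequence exposes the full text of every sample to the aggregating model, which is the property the proposed approach relies on.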