Weight Ensembling Improves Reasoning in Language Models

Abstract

We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and achieves superior results with less data when tuned further by reinforcement learning. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved only through diversity-inducing decoding strategies, like temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.
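
To make the two central ideas concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not the authors' implementation. It assumes checkpoints are represented as plain dictionaries mapping parameter names to flat lists of floats (a real model would use tensors), that the interpolation is linear with a hypothetical coefficient `alpha`, and that the k samples per problem are drawn independently, so a problem with Pass@1 rate p has Pass@k equal to 1 - (1 - p)^k. The function names `interpolate_weights` and `expected_pass_at_k` are invented for this example.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def interpolate_weights(early: Weights, late: Weights, alpha: float) -> Weights:
    """Linearly interpolate two checkpoints, parameter by parameter.

    alpha = 0 returns the early checkpoint, alpha = 1 the late one.
    The value of alpha is a free choice here (an assumption of this sketch).
    """
    if early.keys() != late.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [(1.0 - alpha) * e + alpha * l
               for e, l in zip(early[name], late[name])]
        for name in early
    }


def expected_pass_at_k(pass1_rates: List[float], k: int) -> float:
    """Average Pass@k over problems, assuming k independent samples each.

    For a problem with Pass@1 rate p, the chance that at least one of
    k independent samples is correct is 1 - (1 - p) ** k.
    """
    return sum(1.0 - (1.0 - p) ** k for p in pass1_rates) / len(pass1_rates)


# Toy checkpoints with a single hypothetical parameter "w".
merged = interpolate_weights({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.5)

# Two toy test sets with the same mean Pass@1 (0.5) but different variance.
uniform = [0.5, 0.5, 0.5, 0.5]      # every problem solved half the time
polarized = [1.0, 1.0, 0.0, 0.0]    # problems always or never solved

print(merged)
print(expected_pass_at_k(uniform, 4), expected_pass_at_k(polarized, 4))
```

In this toy setting both test sets have the same expected Pass@1, yet the polarized one has a much lower Pass@4, which illustrates why the variance of Pass@1 across the test distribution matters for Pass@k in addition to its mean.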
