Theoretical guarantees on the best-of-n alignment policy

Abstract

A simple and effective method for the inference-time alignment and scalingtest-time compute of generative models is best-of-$n$ sampling, where $n$samples are drawn from a reference policy, ranked based on a reward function,and the highest ranking one is selected. A commonly used analytical expressionin the literature claims that the KL divergence between the best-of-$n$ policyand the reference policy is equal to $\log (n) - (n-1)/n.$ We disprove thevalidity of this claim, and show that it is an upper bound on the actual KLdivergence. We also explore the tightness of this upper bound in differentregimes, and propose a new estimator for the KL divergence and empirically showthat it provides a tight approximation. We also show that the win rate of thebest-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$and derive bounds on the tightness of this characterization. We conclude withanalyzing the tradeoffs between win rate and KL divergence of the best-of-$n$alignment policy, which demonstrate that very good tradeoffs are achievablewith $n < 1000$.

Quick Read (beta)

loading the full paper ...