Abstract
The relationship between the quality of a string, as judged by a human reader, and its probability $p(\boldsymbol{y})$ under a language model undergirds the development of better language models. For example, many popular algorithms for sampling from a language model have been conceived with the goal of manipulating $p(\boldsymbol{y})$ to place higher probability on strings that humans deem of high quality. In this article, we examine the probability--quality relationship in language models explicitly aligned to human preferences, e.g., through reinforcement learning from human feedback. We show that, when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood under the prior language model, i.e., the same model before alignment with human preferences. We provide a formal treatment of this phenomenon and demonstrate how the choice of sampling adaptor allows for a selection of how much likelihood we exchange for reward.