Abstract
Generating diverse responses from large language models (LLMs) is crucial for applications such as planning/search and synthetic data generation, where diversity provides distinct answers across generations. Prior approaches rely on increasing temperature to increase diversity. However, contrary to popular belief, we show that this approach not only produces lower-quality individual generations as temperature increases, but also depends on the model's next-token probabilities being similar to the true distribution of answers. We propose SimpleStrat, an alternative approach that uses the language model itself to partition the space into strata. At inference, a random stratum is selected and a sample is drawn from within that stratum. To measure diversity, we introduce CoverageQA, a dataset of underspecified questions with multiple equally plausible answers, and assess diversity by measuring the KL divergence between the output distribution and the uniform distribution over valid ground-truth answers. As computing the probability per response/solution for proprietary models is infeasible, we measure recall on ground-truth solutions instead. Our evaluation shows that SimpleStrat achieves 0.05 higher recall compared to GPT-4o and an average reduction of 0.36 in KL divergence compared to Llama 3.
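The sampling step and the two diversity measures described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy strata, and the choice to estimate the output distribution from sampled responses are assumptions made for illustration. The abstract does not state the direction of the KL divergence, so the sketch assumes KL(P || U), with P the empirical distribution over valid answers and U the uniform distribution.

```python
import math
import random
from collections import Counter


def stratified_sample(strata, sample_within, rng):
    """Select one stratum uniformly at random, then draw a sample from it.

    `strata` and `sample_within` are hypothetical stand-ins for the
    partition and the within-stratum generation step.
    """
    stratum = rng.choice(strata)
    return sample_within(stratum, rng)


def kl_to_uniform(responses, valid_answers):
    """KL(P || U) in nats (assumed direction).

    P is the empirical distribution of `responses` restricted to
    `valid_answers`; U is uniform over `valid_answers`.
    """
    valid = set(valid_answers)
    counts = Counter(r for r in responses if r in valid)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid responses to estimate a distribution from")
    n = len(valid)
    # sum_x p(x) * log(p(x) / (1/n)); answers with p(x) = 0 contribute 0.
    return sum((c / total) * math.log((c / total) * n) for c in counts.values())


def recall(responses, valid_answers):
    """Fraction of valid answers that appear at least once in `responses`."""
    valid = set(valid_answers)
    return len(set(responses) & valid) / len(valid)


# Toy usage with made-up strata for an underspecified question.
toy_strata = [["Alabama", "Alaska"], ["Texas", "Utah"]]
toy_rng = random.Random(0)
toy_samples = [
    stratified_sample(toy_strata, lambda s, r: r.choice(s), toy_rng)
    for _ in range(200)
]
toy_valid = [answer for stratum in toy_strata for answer in stratum]
toy_kl = kl_to_uniform(toy_samples, toy_valid)
toy_recall = recall(toy_samples, toy_valid)
```

A KL value of 0 means the sampled answers are spread evenly over the valid answers, while recall only checks whether each valid answer appeared at least once, which is why it remains usable when per-response probabilities are unavailable.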