Abstract
Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines. In addition, we explore using MUSE as guided signals with chain-of-thought distillation to fine-tune LLMs for calibration. MUSE is available at: https://github.com/LARK-NLP-Lab/MUSE.
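To make the role of Jensen-Shannon Divergence concrete, the following is a minimal sketch for a single binary prediction, not the MUSE algorithm itself. It assumes each model reports a probability for the positive class. The selection rule (keep models whose mean divergence from the others is below a threshold, then average them) and the names `bernoulli_jsd`, `select_and_aggregate`, and `max_mean_jsd` are hypothetical choices made only for illustration. Note also that agreement between models is only a proxy here and does not by itself establish calibration.

```python
import math


def bernoulli_jsd(p, q):
    """Jensen-Shannon divergence, in bits, between Bernoulli(p) and Bernoulli(q).

    Uses the identity JSD(P, Q) = H(M) - (H(P) + H(Q)) / 2 with M = (P + Q) / 2.
    """
    def entropy(x):
        # Binary entropy with the convention 0 * log(0) = 0.
        return -sum(t * math.log2(t) for t in (x, 1.0 - x) if t > 0.0)

    m = 0.5 * (p + q)
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))


def select_and_aggregate(probs, max_mean_jsd=0.2):
    """Illustrative (assumed) rule: drop models that diverge from the rest.

    probs: positive-class probabilities, one per model, for the same input.
    Returns the indices of the kept models and the mean of their probabilities.
    """
    n = len(probs)
    if n < 2:
        raise ValueError("need at least two models to compare")

    # Mean divergence of each model from every other model.
    mean_div = [
        sum(bernoulli_jsd(probs[i], probs[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

    kept = [i for i in range(n) if mean_div[i] <= max_mean_jsd]
    if not kept:
        # Fall back to the single model closest to the others.
        kept = [min(range(n), key=mean_div.__getitem__)]

    return kept, sum(probs[i] for i in kept) / len(kept)


# Three models roughly agree; the fourth is an outlier.
kept, aggregated = select_and_aggregate([0.80, 0.75, 0.85, 0.10])
print(kept, round(aggregated, 3))
```

With base-2 logarithms the divergence is bounded between 0 (identical distributions) and 1 (fully opposed predictions), which makes a fixed threshold easy to interpret. The threshold value of 0.2 is arbitrary and would need tuning on held-out data.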