C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning

Abstract

Large language models (LLMs) have achieved impressive results on complex reasoning tasks, but their high inference cost remains a major barrier to real-world deployment. A promising solution is to use cascaded inference, where small, cheap models handle easy queries, and only the hardest examples are escalated to more powerful models. However, existing cascade methods typically rely on supervised training with labeled data, offer no theoretical generalization guarantees, and provide limited control over test-time computational cost. We introduce C3PO (Cost Controlled Cascaded Prediction Optimization), a self-supervised framework for optimizing LLM cascades under probabilistic cost constraints. By focusing on minimizing regret with respect to the most powerful model (MPM), C3PO avoids the need for labeled data by constructing a cascade using only unlabeled model outputs. It leverages conformal prediction to bound the probability that inference cost exceeds a user-specified budget. We provide theoretical guarantees on both cost control and generalization error, and show that our optimization procedure is effective even with small calibration sets. Empirically, C3PO achieves state-of-the-art performance across a diverse set of reasoning benchmarks including GSM8K, MATH-500, BigBench-Hard and AIME, outperforming strong LLM cascading baselines in both accuracy and cost-efficiency. Our results demonstrate that principled, label-free cascade optimization can enable scalable LLM deployment.
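
To make the cost-control idea concrete, the following is a minimal sketch of a generic split-conformal bound on per-query cascade cost. It is not the C3PO procedure, which the abstract does not specify in detail. The function names, the two-model cost values, and the confidence-threshold stopping rule are illustrative assumptions. Under exchangeability of calibration and test queries, the returned value is exceeded by a new query's cost with probability at most `alpha`.

```python
import math


def cascade_cost(confidences, model_costs, threshold):
    """Cost of one query under a hypothetical stopping rule:
    run models in order and stop at the first one whose
    confidence reaches the threshold (or at the last model)."""
    total = 0.0
    for conf, cost in zip(confidences, model_costs):
        total += cost
        if conf >= threshold:
            break
    return total


def conformal_cost_bound(calib_costs, alpha):
    """Standard split-conformal quantile: the ceil((n+1)(1-alpha))-th
    smallest calibration cost. Returns infinity when the calibration
    set is too small to certify the requested level."""
    n = len(calib_costs)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calib_costs)[k - 1]


# Illustrative numbers only (assumed, not from the paper):
# a small model costing 1 unit and a large model costing 10 units.
model_costs = [1.0, 10.0]
calib_confidences = [
    [0.9, 1.0], [0.8, 1.0], [0.2, 1.0], [0.7, 1.0], [0.95, 1.0],
    [0.1, 1.0], [0.6, 1.0], [0.85, 1.0], [0.75, 1.0],
]
calib_costs = [cascade_cost(c, model_costs, threshold=0.5)
               for c in calib_confidences]

budget = 5.0
bound = conformal_cost_bound(calib_costs, alpha=0.5)
print("cost bound:", bound, "within budget:", bound <= budget)
```

In a sketch like this, one would sweep the threshold and keep only settings whose conformal bound stays within the budget, then choose among them by agreement with the most powerful model's outputs.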
