Abstract
Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective that combines cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups. Reinforcement learning can optimize this objective directly, whereas supervised fine-tuning cannot. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.
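
As a rough illustration of how a composite cohort-level reward of this kind might be computed, the following Python sketch combines the three named terms additively. The abstract does not give the functional form or the weights, so the additive combination, the `CohortRollout` structure, and the `retrieval_bonus` and `rejection_penalty` values are all assumptions made for illustration, not the paper's definition.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CohortRollout:
    """Outcome of one cohort member's rollout (hypothetical structure)."""
    correct: bool             # final answer matched the reference
    valid_retrievals: int     # lookups judged useful for decomposition
    rejected_retrievals: int  # lookups judged trivial or invalid


def cohort_reward(
    rollouts: Sequence[CohortRollout],
    retrieval_bonus: float = 0.1,    # assumed weight, not from the paper
    rejection_penalty: float = 0.1,  # assumed weight, not from the paper
) -> float:
    """Return one scalar reward shared by every member of the cohort.

    Assumed form: mean accuracy + weighted mean count of valid lookups
    - weighted mean count of rejected lookups.
    """
    if not rollouts:
        raise ValueError("cohort must contain at least one rollout")
    n = len(rollouts)
    accuracy = sum(r.correct for r in rollouts) / n
    bonus = retrieval_bonus * sum(r.valid_retrievals for r in rollouts) / n
    penalty = rejection_penalty * sum(r.rejected_retrievals for r in rollouts) / n
    return accuracy + bonus - penalty


if __name__ == "__main__":
    cohort = [
        CohortRollout(correct=True, valid_retrievals=1, rejected_retrievals=0),
        CohortRollout(correct=False, valid_retrievals=0, rejected_retrievals=1),
    ]
    print(cohort_reward(cohort))
```

Because the reward is computed over the whole cohort rather than per question, a policy that answers only some variants correctly scores lower than one that applies the same reasoning pattern to every member, which is the kind of signal a per-example supervised loss does not express.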