Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

Abstract

Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional gain in the performance ceiling for both 7B and 32B models across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment
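
As an illustration of the parameter-space merging stage, the sketch below combines three separately aligned checkpoints by a weighted sum of their parameters. The abstract does not specify the merging rule, so the linear combination, the function name `merge_checkpoints`, the toy parameter dictionaries, and the weights are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

# A toy "checkpoint": parameter name -> flat list of values (hypothetical format).
Params = Dict[str, List[float]]


def merge_checkpoints(checkpoints: List[Params], weights: List[float]) -> Params:
    """Return the weighted sum of checkpoints that share parameter names and sizes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if len(checkpoints) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")

    merged: Params = {}
    for name, reference in checkpoints[0].items():
        # Combine the i-th entry of this parameter across all checkpoints.
        merged[name] = [
            sum(w * ckpt[name][i] for ckpt, w in zip(checkpoints, weights))
            for i in range(len(reference))
        ]
    return merged


# Hypothetical specialist checkpoints, one per meta-ability.
deduction = {"layer.weight": [1.0, 2.0]}
induction = {"layer.weight": [3.0, 4.0]}
abduction = {"layer.weight": [5.0, 6.0]}

# Illustrative weights only; the abstract does not report the values used.
merged = merge_checkpoints([deduction, induction, abduction], [0.5, 0.25, 0.25])
print(merged)
```

Under this assumption, the merged checkpoint would then serve as the starting point for the domain-specific RL stage.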
