Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Abstract

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via Transformer's residual stream. Our entropy analysis on internal policy reveals distinct patterns: (1) universally, policies evolve from high-entropy exploration in early layers to deterministic refinement in top layers; and (2) Qwen exhibits a progressive, human-like reasoning structure, contrasting with the abrupt final-layer convergence in Llama. Furthermore, we discover that optimizing internal layers induces feature refinement, forcing lower layers to capture high-level reasoning representations early. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that reconstructs the LLM's reasoning foundation from the bottom up by optimizing internal layers in early stages. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of BuPO. Our code is available at https://github.com/Trae1ounG/BuPO.

Quick Read (beta)

loading the full paper ...