Curriculum Design for Trajectory-Constrained Agent: Compressing Chain-of-Thought Tokens in LLMs

Abstract

Training agents to operate under strict constraints during deployment, such as limited resource budgets or stringent safety requirements, presents significant challenges, especially when these constraints render the task complex. In this work, we propose a curriculum learning strategy that gradually tightens constraints during training, enabling the agent to incrementally master the deployment requirements. Inspired by self-paced learning techniques in unconstrained reinforcement learning (RL), our approach facilitates a smoother transition to challenging environments by initially training on simplified versions of the constraints and progressively introducing the full deployment conditions. We provide a theoretical analysis using an RL agent in a binary-tree Markov Decision Process (MDP) to demonstrate that our curriculum strategy can accelerate training relative to a baseline approach that imposes the trajectory constraints from the outset. Moreover, we empirically validate the effectiveness and generality of our method across both RL and large language model (LLM) agents in diverse settings, including a binary-tree MDP, a multi-task navigation domain, and a math reasoning task with two benchmarks. These results highlight the potential of curriculum design in enhancing the efficiency and performance of agents operating under complex trajectory constraints during deployment. Moreover, when applied to LLMs, our strategy enables compression of output chain-of-thought tokens, achieving a substantial inference speedup on consumer hardware, demonstrating its effectiveness for resource-constrained deployment.
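
The idea of gradually tightening a trajectory constraint can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract describes a curriculum inspired by self-paced learning, whereas the sketch below swaps in a fixed linear schedule purely for illustration. The names `budget_schedule` and `within_budget`, and the budget values (for example, a cap on chain-of-thought tokens), are hypothetical.

```python
def budget_schedule(start_budget, target_budget, num_stages):
    """Return per-stage budgets that tighten linearly from start to target.

    Assumption: a fixed linear schedule stands in for the paper's
    self-paced-inspired curriculum, which is not specified in the abstract.
    """
    if num_stages < 2:
        # With a single stage there is no curriculum: train on the target.
        return [target_budget]
    step = (start_budget - target_budget) / (num_stages - 1)
    return [round(start_budget - i * step) for i in range(num_stages)]


def within_budget(trajectory_cost, budget):
    """Check whether a trajectory satisfies the current stage's constraint."""
    return trajectory_cost <= budget


# Hypothetical example: relax the deployment budget of 256 tokens to 1024
# at the start of training, then tighten it over four stages.
schedule = budget_schedule(start_budget=1024, target_budget=256, num_stages=4)
for stage, budget in enumerate(schedule):
    print(f"stage {stage}: budget = {budget}")
```

A baseline that imposes the constraint from the outset corresponds to `num_stages=1`, where every training stage uses the deployment budget directly.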
