Abstract
Bilevel reinforcement learning (BRL) has emerged as a powerful mathematicalframework for studying generative AI alignment and related problems. Whileseveral principled algorithmic frameworks have been proposed, key theoreticalfoundations, particularly those related to sample complexity, remainunderexplored. Understanding and deriving tight sample complexity bounds arecrucial for bridging the gap between theory and practice, guiding thedevelopment of more efficient algorithms. In this work, we present the firstsample complexity result for BRL, achieving a bound of $\epsilon^{-4}$. Thisresult extends to standard bilevel optimization problems, providing aninteresting theoretical contribution with practical implications. To addressthe computational challenges associated with hypergradient estimation inbilevel optimization, we develop a first-order Hessian-free algorithm that doesnot rely on costly hypergradient computations. By leveraging matrix-freetechniques and constrained optimization methods, our approach ensuresscalability and practicality. Our findings pave the way for improved methods inAI alignment and other fields reliant on bilevel optimization.