Abstract
Bilevel reinforcement learning (BRL) has emerged as a powerful framework foraligning generative models, yet its theoretical foundations, especially samplecomplexity bounds, remain underexplored. In this work, we present the firstsample complexity bound for BRL, establishing a rate of$\mathcal{O}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDPanalysis techniques do not extend to BRL due to its nested structure andnon-convex lower-level problems. We overcome these challenges by leveraging thePolyak-{\L}ojasiewicz (PL) condition and the MDP structure to obtainclosed-form gradients, enabling tight sample complexity analysis. Our analysisalso extends to general bi-level optimization settings with non-convex lowerlevels, where we achieve state-of-the-art sample complexity results of$\mathcal{O}(\epsilon^{-3})$ improving upon existing bounds of$\mathcal{O}(\epsilon^{-6})$. Additionally, we address the computationalbottleneck of hypergradient estimation by proposing a fully first-order,Hessian-free algorithm suitable for large-scale problems.