Functional Critics Are Essential for Actor-Critic: From Off-Policy Stability to Efficient Exploration

Abstract

The actor-critic (AC) framework has achieved strong empirical success in off-policy reinforcement learning but suffers from the "moving target" problem, where the evaluated policy changes continually. Functional critics, or policy-conditioned value functions, address this by explicitly including a representation of the policy as input. While conceptually appealing, previous efforts have struggled to remain competitive against standard AC. In this work, we revisit functional critics within the actor-critic framework and identify two critical aspects that render them a necessity rather than a luxury. First, we demonstrate their power in stabilizing the complex interplay between the "deadly triad" and the "moving target". We provide a convergent off-policy AC algorithm under linear functional approximation that dismantles several longstanding barriers between theory and practice: it utilizes target-based TD learning, accommodates dynamic behavior policies, and operates without the restrictive "full coverage" assumptions. By formalizing a dual trust-coverage mechanism, our framework provides principled guidelines for pursuing sample efficiency, rigorously governing behavior policy updates and critic re-evaluations to maximize off-policy data utility. Second, we uncover a foundational link between functional critics and efficient exploration. We demonstrate that existing model-free approximations of posterior sampling are limited in capturing policy-dependent uncertainty, a gap the functional critic formalism bridges. These results represent, to our knowledge, first-of-their-kind contributions to the RL literature. Practically, we propose a tailored neural network architecture and a minimalist AC algorithm. In preliminary experiments on the DeepMind Control Suite, this implementation achieves performance competitive with state-of-the-art methods without standard implementation heuristics.
