Is the Bellman residual a bad proxy?

Abstract

This paper theoretically and empirically compares two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. For that purpose, we place ourselves in the framework of policy search algorithms, which are usually designed to maximize the mean value, and derive a method that minimizes the residual $\|T_* v_\pi - v_\pi\|_{1,\nu}$ over policies. A theoretical analysis shows how good this proxy is for policy optimization, and notably that it is better than its value-based counterpart. We also propose experiments on randomly generated generic Markov decision processes, specifically designed for studying the influence of the involved concentrability coefficient. They show that the Bellman residual is generally a bad proxy for policy optimization and that directly maximizing the mean value is much better, despite the current lack of deep theoretical analysis. This might seem obvious, as directly addressing the problem of interest is usually better, but given the prevalence of (projected) Bellman residual minimization in value-based reinforcement learning, we believe that this question is worth considering.
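
To make the two criteria concrete, here is a minimal sketch that evaluates both for a fixed policy on a small tabular MDP. It is an illustration under stated assumptions, not the paper's method or experimental setup: the function names, the uniform random MDP generator, the discount `gamma`, and the definitions of the mean value as $\nu^\top v_\pi$ and of the weighted norm as $\|u\|_{1,\nu} = \sum_s \nu(s)\,|u(s)|$ are choices made for this example.

```python
import numpy as np


def random_mdp(n_states, n_actions, rng):
    """Hypothetical generator: dense random transitions P[a, s, s'] and rewards R[s, a]."""
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)  # normalize each row into a distribution
    R = rng.random((n_states, n_actions))
    return P, R


def policy_value(P, R, pi, gamma):
    """Solve v_pi = r_pi + gamma * P_pi v_pi exactly for a stochastic policy pi[s, a]."""
    n_states = R.shape[0]
    P_pi = np.einsum("sa,ast->st", pi, P)  # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)            # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


def mean_value(P, R, pi, gamma, nu):
    """Criterion i): expected value of pi under the state distribution nu."""
    return nu @ policy_value(P, R, pi, gamma)


def bellman_residual(P, R, pi, gamma, nu):
    """Criterion ii): nu-weighted L1 norm of T_* v_pi - v_pi."""
    v = policy_value(P, R, pi, gamma)
    q = R + gamma * np.einsum("ast,t->sa", P, v)  # one-step lookahead on v_pi
    t_star_v = q.max(axis=1)                      # optimal Bellman operator applied to v_pi
    return nu @ np.abs(t_star_v - v)


# Usage: score a uniform random policy on a 10-state, 3-action MDP.
rng = np.random.default_rng(0)
P, R = random_mdp(10, 3, rng)
nu = np.full(10, 1 / 10)            # uniform state distribution (assumption)
uniform_pi = np.full((10, 3), 1 / 3)
print("mean value:", mean_value(P, R, uniform_pi, 0.9, nu))
print("Bellman residual:", bellman_residual(P, R, uniform_pi, 0.9, nu))
```

A policy search method would maximize the first quantity or minimize the second over a parameterized family of policies. The residual is zero exactly when $v_\pi$ is a fixed point of $T_*$, that is, when the policy is optimal on the support of $\nu$.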
