When is Realizability Sufficient for Off-Policy Reinforcement Learning?

Abstract

Model-free algorithms for reinforcement learning typically require acondition called Bellman completeness in order to successfully operateoff-policy with function approximation, unless additional conditions are met.However, Bellman completeness is a requirement that is much stronger thanrealizability and that is deemed to be too strong to hold in practice. In thiswork, we relax this structural assumption and analyze the statisticalcomplexity of off-policy reinforcement learning when only realizability holdsfor the prescribed function class. We establish finite-sample guarantees for off-policy reinforcement learningthat are free of the approximation error term known as inherent Bellman error,and that depend on the interplay of three factors. The first two are well-know:they are the metric entropy of the function class and the concentrabilitycoefficient that represents the cost of learning off-policy. The third factoris new, and it measures the violation of Bellman completeness, namely themis-alignment between the chosen function class and its image through theBellman operator. In essence, these error bounds establish that off-policy reinforcementlearning remains statistically viable even in absence of Bellman completeness,and characterize the intermediate situation between the favorable Bellmancomplete setting and the worst-case scenario where exponential lower bounds arein force. Our analysis directly applies to the solution found by temporaldifference algorithms when they converge.

Quick Read (beta)

loading the full paper ...