Abstract
Many reinforcement learning algorithms rely on value estimation. However, the most widely used algorithms -- namely temporal difference algorithms -- can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation which are sound under linear function approximation, based on the linear mean-squared projected Bellman error (PBE). Extending these methods to the nonlinear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective, called the mean-squared Bellman error (BE), which naturally facilitates nonlinear approximation. In this work, we build on these insights and introduce a new generalized PBE that extends the linear PBE to the nonlinear setting. We show how this generalized objective unifies previous work, including previous theory, and obtain new bounds for the value error of the solutions of the generalized objective. We derive an easy-to-use but sound algorithm to minimize the generalized objective which is more stable across runs, is less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation.