Kalman meets Bellman: Improving Policy Evaluation through Value Tracking

Abstract

Policy evaluation is a key process in Reinforcement Learning (RL). It assesses a given policy by estimating the corresponding value function. When using parameterized value functions, common approaches minimize the sum of squared Bellman temporal-difference errors and receive a point estimate for the parameters. Kalman-based and Gaussian-process-based frameworks were suggested to evaluate the policy by treating the value as a random variable. These frameworks can learn uncertainties over the value parameters and exploit them for policy exploration. When adopting these frameworks to solve deep RL tasks, several limitations are revealed: excessive computations in each optimization step, difficulty with handling batches of samples, which slows training, and the effect of memory in stochastic environments, which prevents off-policy learning. In this work, we discuss these limitations and propose to overcome them by an alternative general framework, based on the extended Kalman filter. We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA), that can be incorporated as a policy evaluation component in policy optimization algorithms. KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties. We analyze the properties of KOVA and present its performance on deep RL control tasks.
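To make the idea of treating value parameters as random variables concrete, the following is a minimal sketch of one Kalman-style update for a linear value function V(s) = theta . phi(s). It is not the KOVA algorithm itself: it assumes a linear approximator (so the extended Kalman filter reduces to the standard Kalman filter), a scalar observation per step, and a symmetric covariance matrix. The function name `kalman_value_update` and the parameters `obs_noise` and `process_noise` are hypothetical, chosen only for illustration.

```python
def kalman_value_update(theta, P, phi, target, obs_noise=1.0, process_noise=0.0):
    """One Kalman update of value parameters from a single noisy target.

    Illustrative sketch only (hypothetical names, linear value function).
    theta  : list of floats, current parameter mean
    P      : list of lists, current parameter covariance (assumed symmetric)
    phi    : list of floats, features of the visited state (gradient of V)
    target : float, noisy observation of the value, e.g. r + gamma * V(s')
    """
    n = len(theta)
    # Predict step: a random-walk model on the parameters inflates the covariance.
    P_pred = [[P[i][j] + (process_noise if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    # P_pred @ phi, reused for the gain and for the covariance update.
    Ph = [sum(P_pred[i][j] * phi[j] for j in range(n)) for i in range(n)]
    # Innovation variance: parameter uncertainty projected on phi plus return noise.
    S = sum(phi[i] * Ph[i] for i in range(n)) + obs_noise
    # Kalman gain (a vector, since the observation is scalar).
    K = [Ph[i] / S for i in range(n)]
    # Innovation: the temporal-difference-like error between target and prediction.
    innovation = target - sum(theta[i] * phi[i] for i in range(n))
    theta_new = [theta[i] + K[i] * innovation for i in range(n)]
    # Covariance update P_pred - K (phi^T P_pred); uses symmetry of P_pred.
    P_new = [[P_pred[i][j] - K[i] * Ph[j] for j in range(n)] for i in range(n)]
    return theta_new, P_new


# Usage: two one-hot features, unit prior covariance, one observed target of 1.0.
theta, P = kalman_value_update([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                               phi=[1.0, 0.0], target=1.0)
print(theta)  # [0.5, 0.0]
```

Unlike a plain gradient step on the squared temporal-difference error, the step size here is set per parameter by the covariance `P`, which shrinks along directions that have already been observed; the unvisited feature keeps its prior uncertainty.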
