Abstract
While reinforcement learning algorithms provide automated acquisition of optimal policies, practical application of such methods requires a number of design decisions, such as manually designing reward functions that not only define the task, but also provide sufficient shaping to accomplish it. In this paper, we discuss a new perspective on reinforcement learning, recasting it as the problem of inferring actions that achieve desired outcomes, rather than a problem of maximizing rewards. To solve the resulting outcome-directed inference problem, we establish a novel variational inference formulation that allows us to derive a well-shaped reward function which can be learned directly from environment interactions. From the corresponding variational objective, we also derive a new probabilistic Bellman backup operator reminiscent of the standard Bellman backup operator and use it to develop an off-policy algorithm to solve goal-directed tasks. We empirically demonstrate that this method eliminates the need to design reward functions and leads to effective goal-directed behaviors.