### Abstract

Reinforcement learning (RL) combines a control problem with statisticalestimation: the system dynamics are not known to the agent, but can be learnedthrough experience. A recent line of research casts `RL as inference' andsuggests a particular framework to generalize the RL problem as probabilisticinference. Our paper surfaces a key shortcoming in that approach, and clarifiesthe sense in which RL can be coherently cast as an inference problem. Inparticular, an RL agent must consider the effects of its actions upon futurerewards and observations: the exploration-exploitation tradeoff. In all but themost simple settings, the resulting inference is computationally intractable sothat practical RL algorithms must resort to approximation. We demonstrate thatthe popular `RL as inference' approximation can perform poorly in even verybasic problems. However, we show that with a small modification the frameworkdoes yield algorithms that can provably perform well, and we show that theresulting algorithm is equivalent to the recently proposed K-learning, which wefurther connect with Thompson sampling.