Applying probabilistic models to reinforcement learning (RL) enables the application of powerful optimisation tools such as variational inference to RL. However, existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, e.g., the absence of mode-capturing behaviour in pseudo-likelihood methods and difficulties learning deterministic policies in approaches based on maximum entropy RL. We propose VIREL, a novel, theoretically grounded probabilistic inference framework for RL that utilises a parametrised action-value function to summarise future dynamics of the underlying MDP. This gives VIREL a mode-seeking form of KL divergence, the ability to learn deterministic optimal policies naturally from inference, and the ability to optimise value functions and policies in separate, iterative steps. In applying variational expectation-maximisation to VIREL, we thus show that the actor-critic algorithm can be reduced to expectation-maximisation, with policy improvement equivalent to an E-step and policy evaluation to an M-step. We then derive a family of actor-critic methods from VIREL, including a scheme for adaptive exploration. Finally, we demonstrate that actor-critic algorithms from this family outperform state-of-the-art methods based on soft value functions in several domains.
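The alternation between policy evaluation and policy improvement described above can be illustrated with a minimal tabular sketch. This is not the VIREL algorithm itself: the two-state MDP, the names `P`, `R`, `boltzmann` and `bellman_backup`, and the choice to use the mean squared Bellman residual as the softmax temperature are all assumptions made for illustration. The sketch only shows how a policy derived from an action-value function can become deterministic as the value estimate becomes consistent.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; every number here is illustrative.
P = np.array([            # P[s, a, s'] = transition probability
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.2, 0.8]],
])
R = np.array([[0.0, 1.0],  # R[s, a] = expected reward
              [0.5, 2.0]])
gamma = 0.9                # discount factor


def boltzmann(Q, temperature):
    """Softmax policy over actions; subtracts the row max for stability."""
    z = (Q - Q.max(axis=1, keepdims=True)) / temperature
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def bellman_backup(Q, pi):
    """One application of the Bellman operator for policy pi."""
    V = (pi * Q).sum(axis=1)   # V[s] = sum_a pi(a|s) Q(s, a)
    return R + gamma * (P @ V)  # shape (states, actions)


Q = np.zeros((2, 2))
pi = np.full((2, 2), 0.5)      # start from a uniform policy

for _ in range(500):
    # Policy evaluation (the step the abstract relates to the M-step):
    # move Q towards its Bellman target under the current policy.
    target = bellman_backup(Q, pi)
    residual = np.mean((target - Q) ** 2)
    Q = target

    # Policy improvement (the step the abstract relates to the E-step):
    # Boltzmann policy in Q. ASSUMPTION: the temperature is set to the
    # Bellman residual, floored to avoid division by zero.
    pi = boltzmann(Q, max(residual, 1e-8))

print("greedy actions:", np.argmax(pi, axis=1))
print("policy:\n", pi)
```

Printing `residual` inside the loop shows how the temperature evolves and how the policy sharpens towards a single action per state as the value estimate settles.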