### Abstract

Maximising a cumulative reward function that is Markov and stationary, i.e.,defined over state-action pairs and independent of time, is sufficient tocapture many kinds of goals in a Markov Decision Process (MDP) based on theReinforcement Learning (RL) problem formulation. However, not all goals can becaptured in this manner. Specifically, it is easy to see that Convex MDPs inwhich goals are expressed as convex functions of stationary distributionscannot, in general, be formulated in this manner. In this paper, we reformulatethe convex MDP problem as a min-max game between the policy and cost (negativereward) players using Fenchel duality and propose a meta-algorithm for solvingit. We show that the average of the policies produced by an RL agent thatmaximizes the non-stationary reward produced by the cost player converges to anoptimal solution to the convex MDP. Finally, we show that the meta-algorithmunifies several disparate branches of reinforcement learning algorithms in theliterature, such as apprenticeship learning, variational intrinsic control,constrained MDPs, and pure exploration into a single framework.