Abstract
Reinforcement learning with general utilities has recently gained attentionthanks to its ability to unify several problems, including imitation learning,pure exploration, and safe RL. However, prior work for solving this generalproblem in a unified way has mainly focused on the tabular setting. This isrestrictive when considering larger state-action spaces because of the need toestimate occupancy measures during policy optimization. In this work, weaddress this issue and propose to approximate occupancy measures within afunction approximation class using maximum likelihood estimation (MLE). Wepropose a simple policy gradient algorithm (PG-OMA) where an actor updates thepolicy parameters to maximize the general utility objective whereas a criticapproximates the occupancy measure using MLE. We provide a sample complexityanalysis of PG-OMA showing that our occupancy measure estimation error onlyscales with the dimension of our function approximation class rather than thesize of the state action space. Under suitable assumptions, we establish firstorder stationarity and global optimality performance bounds for the proposedPG-OMA algorithm for nonconcave and concave general utilities respectively. Wecomplement our methodological and theoretical findings with promising empiricalresults showing the scalability potential of our approach compared to existingtabular count-based approaches.