Abstract
Inverse reinforcement learning attempts to reconstruct the reward function in a Markov decision problem, using observations of agent actions. As already observed by Russell [1998], the problem is ill-posed, and the reward function is not identifiable, even in the presence of perfect information about optimal behavior. We provide a resolution to this non-identifiability for problems with entropy regularization. For a given environment, we fully characterize the reward functions leading to a given policy and show that, given demonstrations of actions for the same reward under two distinct discount factors, or under sufficiently different environments, the unobserved reward can be recovered up to a constant. We also give general necessary and sufficient conditions for reconstruction of time-homogeneous rewards on finite horizons, and for action-independent rewards, generalizing recent results of Kim et al. [2021] and Fu et al. [2018].