Abstract
Inverse reinforcement learning attempts to reconstruct the reward function in a Markov decision problem, using observations of agent actions. As already observed by Russell, the problem is ill-posed, and the reward function is not identifiable, even in the presence of perfect information about optimal behavior. We provide a resolution to this non-identifiability for problems with entropy regularization. For a given environment, we fully characterize the reward functions leading to a given policy and show that, given demonstrations of actions for the same reward under two distinct discount factors, or under sufficiently different environments, the unobserved reward can be recovered up to a constant. A simple numerical experiment demonstrates accurate reconstruction of the reward function using our proposed resolution.
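
To make the non-identifiability concrete, the following minimal sketch builds a small random entropy-regularized Markov decision problem and checks numerically that two different rewards can induce the same optimal policy. It uses potential-based reward shaping, a well-known family of policy-preserving transformations, as the illustration; it is not claimed to be the characterization or the experiment referred to above. The function names, the three-state, two-action environment, the temperature `lam`, and the discount factor are all assumptions chosen for illustration.

```python
import math
import random


def q_values(reward, transition, gamma, v):
    """Soft Q-values: Q(s, a) = r(s, a) + gamma * E[V(s') | s, a]."""
    n = len(reward)
    return [[reward[s][a] + gamma * sum(transition[s][a][t] * v[t] for t in range(n))
             for a in range(len(reward[s]))] for s in range(n)]


def soft_value(q_row, lam):
    """Numerically stable lam * log(sum_a exp(Q(s, a) / lam))."""
    m = max(q_row)
    return m + lam * math.log(sum(math.exp((x - m) / lam) for x in q_row))


def soft_optimal_policy(reward, transition, gamma, lam=1.0, iters=500):
    """Soft value iteration; returns pi(a | s) = exp((Q(s, a) - V(s)) / lam)."""
    v = [0.0] * len(reward)
    for _ in range(iters):
        v = [soft_value(row, lam) for row in q_values(reward, transition, gamma, v)]
    q = q_values(reward, transition, gamma, v)
    return [[math.exp((x - soft_value(row, lam)) / lam) for x in row] for row in q]


random.seed(0)
n_states, n_actions, gamma = 3, 2, 0.9  # hypothetical toy environment

# Random transition kernel: transition[s][a] is a distribution over next states.
transition = []
for s in range(n_states):
    rows = []
    for a in range(n_actions):
        w = [random.random() + 0.1 for _ in range(n_states)]
        rows.append([x / sum(w) for x in w])
    transition.append(rows)

reward = [[random.gauss(0, 1) for _ in range(n_actions)] for _ in range(n_states)]

# Shaped reward: r'(s, a) = r(s, a) + phi(s) - gamma * E[phi(s') | s, a],
# for an arbitrary state potential phi.
phi = [random.gauss(0, 1) for _ in range(n_states)]
shaped = [[reward[s][a] + phi[s]
           - gamma * sum(transition[s][a][t] * phi[t] for t in range(n_states))
           for a in range(n_actions)] for s in range(n_states)]

pi_original = soft_optimal_policy(reward, transition, gamma)
pi_shaped = soft_optimal_policy(shaped, transition, gamma)

# Largest entrywise difference between the two policies.
max_gap = max(abs(pi_original[s][a] - pi_shaped[s][a])
              for s in range(n_states) for a in range(n_actions))
print("largest policy difference:", max_gap)
```

Both rewards yield the same policy up to floating-point error, so observing that policy in this single environment cannot distinguish them. The shaping term depends on the discount factor and the transition kernel, which is at least consistent with the claim that a second discount factor or a sufficiently different environment supplies the missing information.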