Abstract
Reinforcement learning has traditionally been studied with exponential discounting or the average reward setup, mainly due to their mathematical tractability. However, such frameworks fall short of accurately capturing human behavior, which has a bias towards immediate gratification. Quasi-Hyperbolic (QH) discounting is a simple alternative for modeling this bias. Unlike in traditional discounting, though, the optimal QH-policy starting from some time $t_1$ can differ from the one starting from $t_2$. Hence, the future self of an agent, if it is naive or impatient, can deviate from the policy that is optimal at the start, leading to sub-optimal overall returns. To prevent this behavior, an alternative is to work with a policy anchored in a Markov Perfect Equilibrium (MPE). In this work, we propose the first model-free algorithm for finding an MPE. Using a two-timescale analysis, we show that, if our algorithm converges, then the limit must be an MPE. We also validate this claim numerically for the standard inventory system with stochastic demands. Our work significantly advances the practical application of reinforcement learning.
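To make the time-inconsistency claim concrete, the following is a minimal sketch, not the algorithm proposed in this work. It assumes the standard quasi-hyperbolic weighting, in which an immediate reward has weight 1 and a reward delayed by $k \geq 1$ steps has weight $\sigma \gamma^k$. The symbol names `sigma` and `gamma`, the helper functions, and all numerical values are hypothetical choices made for illustration only.

```python
def qh_weight(delay: int, sigma: float, gamma: float) -> float:
    """Quasi-hyperbolic weight: 1 for an immediate reward, sigma * gamma**delay otherwise."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    return 1.0 if delay == 0 else sigma * gamma ** delay


def preferred_option(now: int, options: dict, sigma: float, gamma: float) -> str:
    """Name of the option with the highest QH-discounted value as seen from time `now`."""
    def value(name: str) -> float:
        arrival, reward = options[name]
        return reward * qh_weight(arrival - now, sigma, gamma)

    return max(options, key=value)


# Hypothetical numbers, chosen only to exhibit a preference reversal.
SIGMA, GAMMA = 0.5, 0.95
OPTIONS = {"sooner": (5, 10.0), "later": (6, 12.0)}  # name -> (arrival time, reward)

choice_at_start = preferred_option(0, OPTIONS, SIGMA, GAMMA)  # both rewards lie in the future
choice_at_t5 = preferred_option(5, OPTIONS, SIGMA, GAMMA)     # "sooner" is now immediate
print(choice_at_start, choice_at_t5)
```

Under these assumed values, the agent planning at time 0 prefers the larger, later reward, while the same agent at time 5 prefers the smaller, immediate one. Setting `sigma` to 1 recovers exponential discounting, under which the ranking of the two options does not change with the evaluation time.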