### Abstract

Reinforcement learning (RL) agents have traditionally been tasked with maximizing the value function of a Markov decision process (MDP), either in continuous settings, with fixed discount factor $\gamma < 1$, or in episodic settings, with $\gamma = 1$. While this has proven effective for specific tasks with well-defined objectives (e.g., games), it has never been established that fixed discounting is suitable for general-purpose use (e.g., as a model of human preferences). This paper characterizes rationality in sequential decision making using a set of seven axioms and arrives at a form of discounting that generalizes traditional fixed discounting. In particular, our framework admits a state-action dependent "discount" factor that is not constrained to be less than 1, so long as there is eventual long-run discounting. Although this broadens the range of possible preference structures in continuous settings, we show that there exists a unique "optimizing MDP" with fixed $\gamma < 1$ whose optimal value function matches the true utility of the optimal policy, and we quantify the difference between value and utility for suboptimal policies. Our work can be seen as providing a normative justification for (a slight generalization of) Martha White's RL task formalism (2017) and other recent departures from traditional RL, and is relevant to task specification in RL, inverse RL and preference-based RL.
