Discounted Reinforcement Learning is Not an Optimization Problem

  • 2019-10-04 20:52:39
  • Abhishek Naik, Roshan Shariff, Niko Yasui, Richard S. Sutton
  • 0

Abstract

Discounted reinforcement learning is fundamentally incompatible with functionapproximation for control in continuing tasks. This is because it is not anoptimization problem --- it lacks an objective function. After substantiatingthese claims, we go on to address some misconceptions about discounting and itsconnection to the average reward formulation. We encourage researchers to adoptrigorous optimization approaches for reinforcement learning in continuingtasks, such as average reward.

 

Quick Read (beta)

Discounted Reinforcement Learning is Not
an Optimization Problem

Abhishek Naik, Roshan Shariff, Niko Yasui, Richard S. Sutton
Department of Computing Science
University of Alberta
{anaik1,rshariff,yasui,rsutton}@ualberta.ca
Abstract

Discounted reinforcement learning is fundamentally incompatible with function approximation for control in continuing tasks. This is because it is not an optimization problem — it lacks an objective function. After substantiating these claims, we go on to address some misconceptions about discounting and its connection to the average reward formulation. We encourage researchers to adopt rigorous optimization approaches for reinforcement learning in continuing tasks, such as average reward.

\usetikzlibrary

chains,arrows

 

Discounted Reinforcement Learning is Not
an Optimization Problem


  Abhishek Naik, Roshan Shariff, Niko Yasui, Richard S. Sutton Department of Computing Science University of Alberta {anaik1,rshariff,yasui,rsutton}@ualberta.ca

\@float

noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]

1 Introduction

Reinforcement learning (RL) is a paradigm in which an agent learns to interact with an environment in order to maximize reward [1]. Many interesting RL problems concern ‘continuing’ tasks — the agent lives and learns over a single lifetime, rather than experiencing a sequence of distinct ‘episodes’. Examples of domains for continuing tasks include routing internet packets, managing the inventory in a warehouse, and controlling the temperature in a data center.

In continuing tasks, we often resort to temporally “discounting” rewards — immediate rewards are valued more highly than rewards farther in the future. One reason is that the sum of future rewards (which is finite for episodic tasks) can grow unboundedly if the agent-environment interaction continues indefinitely. With geometric discounting (for example) the sum of future rewards remains bounded even when the length of the interaction does not.11 1 The notion of discounting discussed here is one that leads to a finite sum of future rewards. This occurs with geometric discounting, for instance, but not hyperbolic discounting [2], which is applicable only in finite-horizon episodic problems.

RL is often phrased as an optimization problem: find a policy that maximizes reward. A standard optimization problem is defined by a set of feasible solutions and an objective function that describes the quality of each feasible solution with a real number. The feasible solutions in RL are policies and, at least for episodic tasks, the natural objective function is the sum of rewards over an episode. For continuing tasks, we argue that discounted reinforcement learning is not an optimization problem — it does not maximize any objective function.

A policy is optimal in the discounted formulation if it achieves a higher discounted sum of future rewards in each state than any other policy. Mathematically, an optimal policy π* for a fixed discount rate γ satisfies

vπ*γ(s) vπγ(s), for all states s and policies π. (1)

Rather than an objective function, these inequalities produce a partial order on the set of policies. Importantly, some pairs of policies may be incomparable with each other, because each achieves a higher value in some states but a lower value in others. We will argue that the partial order fails to identify an optimal policy under function approximation.

There is no optimal representable policy when using function approximation with discounting

In many RL problems the set of all policies is too large to represent. For example, in domains with extremely large state spaces, it is common to resort to a compact state representation that cannot uniquely represent every possible state. Consequently, representable policies must behave identically in states with the same representation. We are willing to accept that the optimal policy given by equation (1) is not representable and the agent cannot hope to learn it; we would settle for the “best” representable policy.

However, even this goal is meaningless — there may no longer be any representable policy that is unambiguously “better” than all the other representable policies. In general, every representable policy will have higher value in some states and lower value in others, so the partial order will not be able to identify any policy as “optimal” — the notion of “best representable policy” is not well-defined (see [3] for examples). This is not an issue for episodic tasks because the sum of rewards is a natural objective function that can be used to compare any two policies.

An explicit objective function provides not just a partial order but a total order, capable of comparing any two policies. It is a simple matter (at least in principle) to restrict the feasible solutions to the set of representable policies but leave the objective function unchanged — there will be an optimal representable policy.

We are forced to conclude that discounting is fundamentally incompatible with function approximation for control in continuing tasks. We now argue that function approximation and continuing tasks are indispensable for reinforcement learning, and therefore discounting should be abandoned.

Continuing control with function approximation is the problem setting that matters

Reinforcement learning with tabular representations has served us well to build intuition, construct algorithms, and analyze their properties, but the world we live in is too large to be represented by tables of values or action probabilities. Furthermore, it is crucial to generalize across similar states and actions. Function approximation serves both these purposes and is necessary for agents to compactly represent complex worlds and make sense of their complexity.

Continuing tasks are those in which the agent-environment interaction does not naturally break down into episodes. Some of the most interesting RL tasks being studied today involve extremely long episodes and the challenges that come with them (including sparse rewards and credit assignment over long time horizons). Such tasks are episodic in name only, and meeting the challenges they present will require us to deeply understand continuing tasks.

Function approximation is necessary in large-scale reinforcement learning. Since the discounting formulation is fundamentally incompatible with function approximation, it is also incompatible with large-scale reinforcement learning. We believe that whole-heartedly adopting other optimization approaches for continuing task formulations (for example, average reward) is the most appropriate direction for our discipline.

Paper organization

After covering some background in Section 2, we describe an alternative continuing task formulation in Section 3 which has a well-defined objective — maximizing the average reward per step — making it suitable for use with function approximation. We discuss its equivalence with the time-average of the discounted value function, and the resulting misconception that common discounted algorithms maximize the average reward irrespective of the choice of the discount factor. We summarize the arguments in Section 4 and give pointers to the existing literature involving the average reward formulation.

2 Background

A Markov decision process (MDP) consists of a finite set of states 𝒮 and a finite set of actions 𝒜 that can be performed in those states. When the agent takes an action in a state, the agent transitions to another state and receives a scalar reinforcement signal called the reward. Formally, an MDP can be represented by a 4-tuple (𝒮,𝒜,P,r), where upon performing action a in state s the agent receives reward r(s,a) in expectation and transitions to state s with probability Pa(s,s). Note that the transition probabilities and rewards are independent of the agent’s history before reaching state s (the Markov assumption).

A policy π:𝒮×𝒜[0,1] gives, for every state in an MDP, a probability distribution over actions to be taken in that state. A policy π acting in an MDP (𝒮,𝒜,P,r) produces a Markov reward process (𝒮,Pπ,rπ), where Pπ:𝒮×𝒮[0,1] is the transition probability and rπ:𝒮 is the reward function:

Pπ(s,s) =a𝒜π(s,a)Pa(s,s),
rπ(s) =a𝒜π(s,a)r(s,a).

Average reward

The average reward of a policy is an intuitively straightforward quantity — it is the reward the policy receives on average every time step as it acts in an environment. We define the average reward more formally as follows. Under certain mild conditions, a Markov reward process has a so-called stationary distribution over states, dπ(s), that satisfies

dπ(s) =s𝒮dπ(s)Pπ(s,s) for every s𝒮.

In other words, if the agent’s state is distributed according to the stationary distribution and it acts according to its policy, then its next state will also follow the stationary distribution. Furthermore, under additional mild conditions, the state distribution converges to dπ(s) over the long run regardless of the starting state s0:

dπ(s) =limTPr(ST=s), where S0=s0 and St+1Pπ(St,) for all time steps t.

In continuing tasks, dπ(s) measures the fraction of time the agent spends in state s. Frequently visited states have higher values of dπ, and if dπ(s)=0 then s is called a transient state: it is only visited a finite number of times and never again.

The average reward of a policy is simply defined as the average one-step reward, weighted by the stationary distribution:

r¯(π) =s𝒮dπ(s)rπ(s). (2)

For the purposes of this paper, we consider policies that only depend on states. For more details on MDPs and average reward, refer to [4].

3 An Optimization Objective for Continuing Tasks

We would like to find an objective for continuing tasks that is analogous to the sum of rewards objective in episodic tasks. Such an objective function could be used to evaluate and compare any policy with any other, so that our goal of finding the best policy amongst any class of policies becomes well-defined.

The founding principle of reinforcement learning is that an agent maximizes the rewards it receives over its lifetime (which is infinite in continuing problems). We could still contemplate evaluating policies using their discounted value from the start state, but this gives undue importance to the early part of an agent’s lifetime, going against the essence of a continuing task. Indeed, the start state may never be revisited in a continuing task, so it does not deserve special treatment.

The only factors that can influence how the reward r(s,a) is valued are the state s, the action a, and the time t at which St=s and At=a. If we want to weight rewards independently of when they are received, we must weight them based on where they are received. In other words, the objective function must have the form s,aw(s,a)r(s,a), with a weighting w(s,a) that weights each reward r(St,At) based on the state St and action At but not the time t.

For policy optimization to be meaningful, w(s,a) must depend on the policy π. Setting w(s,a)=dπ(s)π(s,a) (say) has the effect of weighting each state-action pair by how frequently it is visited, which brings us to the optimization problem

argmaxπΠs𝒮dπ(s)rπ(s) argmaxπΠr¯(π),(using (2))

where Π is the set of representable policies. This objective function captures a core tenet of reinforcement learning — that the agent’s actions should cause it to visit states where it can earn large rewards. In continuing tasks, the agent should find policies that cause it to frequently visit these highly rewarding states.

The average reward objective in terms of discounted values

Interestingly, the average reward r¯(π) has several equivalent formulations. While we defined it as an average of the one-step reward weighted by the stationary distribution, it is equivalent to a weighted average of the discounted values:

s𝒮dπ(s)rπ(s) (1-γ)s𝒮dπ(s)vπγ(s), for all 0γ<1. (3)

Intuitively, if the initial state is sampled from dπ, then all future states will have the same distribution and the expected reward on every future time step will be r¯(π). Effectively, the stationary distribution allows any weighted average over time (in this case, the normalized value function (1-γ)vπγ(s)) to be replaced with a weighted average over states [3, 1].22 2 Since the stationary distribution allows time-averages to be converted to state-averages, the average reward objective does not imply that agents have to average infinitely many rewards. Indeed, the same equivalence holds for any state-independent temporal discounting scheme (not just geometric) as long the discount factors sum to a constant; the (1-γ) factor is replaced by the reciprocal of that constant. (This equivalence does not hold for hyperbolic discounting [2], for example, because the sum of discount factors diverges.)

Theoretically, equation (3) implies that the policy that maximizes the average of future discounted values (with any discount factor) also maximizes average reward. The discount rate thus becomes a hyper-parameter of the algorithm, rather than a parameter specifying the optimization objective [1].

Greedily maximizing discounted future value does not maximize average reward

At this point, it would be easy to conclude that any algorithm that maximizes discounted value must also maximize average reward regardless of the discount factor, which seems absurd on the face of it. We would be right to be skeptical — common algorithms like Q-learning or Sarsa estimate an action-value function Qγ(s,a) and behave (close to) greedily with respect to it. Such a local greedification will not in general correspond to maximizing the average discounted value over time.

As an example, consider the two-choice MDP in Figure 1. State 0 is the only state with meaningful actions: left and right. Going left leads to an immediate reward of +1, and going right leads to a delayed reward of +2. For small discount factors, the immediate reward from going left appears much more appealing as compared to the discounted delayed reward. When the discount factor is increased sufficiently, the right action becomes more appealing instead. For algorithms based on the local greedification operator, the discount factor is not just a hyper-parameter of the algorithm, it is still a part of the problem setting.

{tikzpicture}

[ roundnode/.style=circle, draw=green!60, fill=green!5, very thick, minimum size=7mm, ] \node[roundnode] at (0, 0) (state0) 0; \node[roundnode] at (-1.25, 0.75) (state1) 1; \node[roundnode] at (-2.75, 0.75) (state2) 2; \node[roundnode] at (-2.75, -0.75) (state3) 3; \node[roundnode] at (-1.25, -0.75) (state4) 4; \node[roundnode] at (1.25, 0.75) (state5) 5; \node[roundnode] at (2.75, 0.75) (state6) 6; \node[roundnode] at (2.75, -0.75) (state7) 7; \node[roundnode] at (1.25, -0.75) (state8) 8;

\draw

[->] (state0) [out=120,in=0] to node[pos=0.1,left]\colorblue+1 node[pos=0.7,right]\colorbrown  L (state1); \draw[->] (state1) [out=150,in=30] to (state2); \draw[->] (state2) [out=240,in=120] to (state3); \draw[->] (state3) [out=330,in=210] to (state4); \draw[->] (state4) [out=0,in=240] to (state0); \draw[->] (state0) [out=60,in=180] to node[pos=0.7,left]\colorbrownR   (state5); \draw[->] (state5) [out=30,in=150] to (state6); \draw[->] (state6) [out=300,in=60] to (state7); \draw[->] (state7) [out=210,in=330] to (state8); \draw[->] (state8) [out=180,in=300] to node[pos=0.95,right]\colorblue+2 (state0);

Figure 1: The two-choice MDP. This is a continuing MDP with only one action in every state except state 0. Action left ‘L’ in state 0 gives an immediate reward of +1 and action ‘R’ leads to a delayed reward of +2 after four timesteps.

Increasing γ1 is not enough

Inspired by the previous example, one might consider using a discounted algorithm but taking the limit as γ1.

Consider the optimization argmaxπlimγ1-(1-γ)vπγ(s0). This objective does indeed avoid the issues that come with a finite (discounted) time horizon. Furthermore, it is bounded, being the weighted average of bounded rewards. Finally, it does not depend on the initial state s0. Unsurprisingly, it turns out that this quantity is also equivalent to r¯(π).

In practice, unfortunately, it is hard to evaluate an objective function containing a limit. Instead, algorithms find optimal policies for increasing discount factors — they exchange the maximization with the limit — limγ1-1argmaxπ(1-γ)vπγ(s0) for some start state s0.

In theory, the limiting policy does indeed exist and is called the Blackwell-optimal policy π*. For every MDP there is a critical discount rate γ0, and π* maximizes vπ*γ(s) for every γγ0 and state s. Intuitively, in any given MDP, making a decision that is optimal over a sufficiently large horizon is equivalent to optimizing average reward. Crucially, this horizon depends on the MDP and how long it takes for actions to have consequences (in terms of reward), which is not known beforehand. The critical discount rate is related to the maximum mixing time of the MDP. [4]

There are two major obstacles to exploiting this theory to convert discounted-value algorithms into average-value algorithms for real-world RL problems. First, most algorithms that learn discounted value functions become increasingly unstable as γ increases to 1. Second, it is hard to estimate the critical discount factor, because it is intimately tied to the unknown dynamics of the environment. Thus, to ensure we find an optimal policy we cannot settle for any fixed γ<1, forcing us into the γ1 regime where our algorithms become unusable. Algorithms that optimize the average reward objective do not rely on a discount factor and automatically make decisions at the appropriate time horizon.

Maximizing average reward is algorithmically feasible

The average reward formulation has been long studied, and there are several dynamic programming algorithms for finding optimal average reward policies; see Puterman [4] or Bertsekas [5]. Mahadevan [6] has reviewed some variants of the DP-based RL algorithms maximizing the average reward. Of note is RVI Q-learning [7], a simple Q-learning-like algorithm with convergence guarantees. Furthermore, policy gradient methods are especially promising, since the gradient of the average reward objective has an elegant closed-form expression [8]. Konda and Borkar [9] have proposed an actor-critic method for average reward.

4 Conclusions

  • Discounted reinforcement learning is not an optimization problem. In continuing tasks, it does not correspond to the maximization of any objective function over a set of policies.

  • Function approximation is incompatible with discounting. Not being an optimization problem, the “best” representable policy is not well-defined.

  • Average reward RL is an optimization problem even with function approximation. Several relatively simple algorithms exist with convergence guarantees.

  • Greedily maximizing discounted future value does not maximize average reward. Common algorithms like Sarsa or Q-learning do not optimize the average reward, and they find different policies depending on the discount factor.

  • Increasing γ1 is not enough. The discount factor must be increased beyond a critical point which is problem-specific and cannot be determined a priori. Moreover, most algorithms become increasingly unstable as γ approaches unity.

There are many open problems in average reward RL. Further advances are needed in areas such as multi-step learning, off-policy learning, model-based learning, and options. Many of the recent advances in optimization can be applied to improve algorithms for average reward.

Acknowledgements

This work was partially done at Abhishek Naik’s internship at Huawei Technologies in Edmonton. Roshan Shariff was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Alberta Innovates. The authors also wish to thank Muhammad Zaheer for valuable feedback on a preliminary draft of the work.

References

  • [1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
  • [2] William Fedus, Carles Gelada, Yoshua Bengio, Marc G. Bellemare, and Hugo Larochelle. Hyperbolic discounting and learning over multiple horizons. CoRR, abs/1902.06865, 2019.
  • [3] Satinder P. Singh, Tommi S. Jaakkola, and Michael I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In ICML, 1994.
  • [4] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.
  • [5] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume II. Athena Scientific, 2005.
  • [6] Sridhar Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1-3):159–195, 1996.
  • [7] Jinane Abounadi, Dimitri P. Bertsekas, and Vivek S. Borkar. Learning algorithms for Markov decision processes with average cost. SIAM J. Control and Optimization, 40(3):681–698, 2001.
  • [8] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, 2000.
  • [9] Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In NeurIPS, 1999.