Improving reinforcement learning algorithms: towards optimal learning rate policies

  • 2020-03-12 21:12:12
  • Othmane Mounjid, Charles-Albert Lehalle
  • 0

Abstract

This paper investigates to what extent one can improve reinforcement learningalgorithms. Our study is split in three parts. First, our analysis shows thatthe classical asymptotic convergence rate $O(1/\sqrt{N})$ is pessimistic andcan be replaced by $O((\log(N)/N)^{\beta})$ with $\frac{1}{2}\leq \beta \leq 1$and $N$ the number of iterations. Second, we propose a dynamic optimal policyfor the choice of the learning rate $(\gamma_k)_{k\geq 0}$ used in stochasticapproximation (SA). We decompose our policy into two interacting levels: theinner and the outer level. In the inner level, we present the\nameref{Alg:v_4_s} algorithm (for "PAst Sign Search") which, based on apredefined sequence $(\gamma^o_k)_{k\geq 0}$, constructs a new sequence$(\gamma^i_k)_{k\geq 0}$ whose error decreases faster. In the outer level, wepropose an optimal methodology for the selection of the predefined sequence$(\gamma^o_k)_{k\geq 0}$. Third, we show empirically that our selectionmethodology of the learning rate outperforms significantly standard algorithmsused in reinforcement learning (RL) in the three following applications: theestimation of a drift, the optimal placement of limit orders and the optimalexecution of large number of shares.

 

Quick Read (beta)

Improving reinforcement learning algorithms:
towards optimal learning rate policies

Othmane Mounjid , Charles-Albert Lehalle École Polytechnique, CMAPCapital Fund Management, Paris and Imperial College, London
March 16, 2020
Abstract

This paper investigates to what extent one can improve reinforcement learning algorithms. Our study is split in three parts. First, our analysis shows that the classical asymptotic convergence rate O(1/N) is pessimistic and can be replaced by O((log(N)/N)β) with 12β1 and N the number of iterations. Second, we propose a dynamic optimal policy for the choice of the learning rate (γk)k0 used in stochastic approximation (SA). We decompose our policy into two interacting levels: the inner and the outer level. In the inner level, we present the Algorithm 3 (PASS). algorithm (for “PAst Sign Search”) which, based on a predefined sequence (γko)k0, constructs a new sequence (γki)k0 whose error decreases faster. In the outer level, we propose an optimal methodology for the selection of the predefined sequence (γko)k0. Third, we show empirically that our selection methodology of the learning rate outperforms significantly standard algorithms used in reinforcement learning (RL) in the three following applications: the estimation of a drift, the optimal placement of limit orders and the optimal execution of large number of shares.

\usetikzlibrary

fit,calc,positioning,trees \usetikzlibraryarrows \tikzstylelevel 1=[level distance=4cm, sibling distance=2.5cm] \tikzstylelevel 2=[level distance=8cm, sibling distance=0.6cm] \tikzstylebag = [text width=4em, text centered] \tikzstyleend = [circle, minimum width=3pt,fill, inner sep=0pt] \usetikzlibrarydecorations.pathmorphing \usetikzlibrarydecorations.pathreplacing \usetikzlibrarydecorations.shapes \usetikzlibrarydecorations.text \usetikzlibrarydecorations.markings \usetikzlibrarydecorations.fractals \usetikzlibrarydecorations.footprints \algnewcommand\Inputs[1]\StateInputs: \Statex #1 \algnewcommand\Initialize[1]\StateInitialize: \Statex #1

 

Improving reinforcement learning algorithms:
towards optimal learning rate policies


  Othmane Mounjidthanks: École Polytechnique, CMAP , Charles-Albert Lehalle thanks: Capital Fund Management, Paris and Imperial College, London

\@float

noticebox[b]\[email protected]

1 Introduction

We consider a discrete state space 𝒵= or 𝒵={1,,d} with d*. We are interested in finding q*𝒬𝒵 solution of

M(q,z)=𝔼[m(q,X(z),z)]=0,z𝒵, (1)

with X(z)𝒳 a random variable with an unknown distribution and m a function from 𝒬×𝒳×𝒵 to . Although the distribution of X(z) is unspecified, we assume that we can observe some variables (Zn)n0 valued in 𝒵 and (Xn(Zn-1))n1 drawn from the same distribution of X(Zn-1). Reinforcement learning (RL) addresses this problem through the following iterative procedure:

qn+1(Zn)=qn(Zn)-γn(Zn)m(qn,Xn+1(Zn),Zn),n0, (2)

where q0 is a given initial condition and each γn is a component-wise non-negative vector valued in 𝒵. The connection between RL, problem (1) and Algorithm (2) is detailed in Section 2. It is possible to recover the classical SARSA, Q-learning and double Q-learning algorithms used in RL by taking a specific expression for m and Xn+1. Note that Algorithm (2) is different from the standard Robbins-Monro (RM) algorithm used in stochastic approximation (SA)

qn+1=qn-γnm¯(qn,Xn+1), (3)

with m¯𝒵 whose z-th coordinate is defined such that m¯(q,x)z=m(q,x(z),z) for any z𝒵 and γn0, mainly because, as it is frequent in RL, we do not observe the entire variable (Xn+1(z))z𝒵) but only its value according to the coordinate Zn. Indeed, the way (Zn)n1 visits the set 𝒵 plays a key role in the convergence of Algorithm (2) which is not the case of Algorithm (3). RM algorithm was first introduced by Robbins and Monro in [robbins1951stochastic]. After that, it was studied by many authors who prove the convergence of qn towards q*, see [benveniste1987and, bertsekas1996neuro, blum1954approximation, kushner1978stochastic]. The asymptotic convergence rate has also been investigated in many papers, see [benveniste1987and, kushner2003stochastic, sacks1958asymptotic]. They show that this speed is in general proportional to 1/N with N the number of iterations.

In this work, we give a special focus to RL problems. Nowadays RL cover a very wide collection of recipes to solve control problems in an exploration-exploitation context. This literature started in the seventies, see [watkins1989learning, werbos1987building], and became famous mainly with the seminal paper of Sutton, see [sutton1998introduction]. It largely relied on the recent advances in the control theory developed in the late 1950s, see [bellman1959functional]. The key tool borrowed from this theory is the dynamic programming principle satisfied by the value function. This principle enables us to solve control problems numerically when the environment is known and the dimension is not too large. To tackle the curse of dimensionality, recent papers, see [sirignano2018dgm], use deep neural networks (DNN). For example, in [hutchinson1994nonparametric], authors use DNN to derive optimal hedging strategies for finance derivatives and in [manziukAlgorithmicTradingReinforcement2019] they use a similar method to solve a high dimensional optimal trading problem. To overcome the fact that environment is unspecified, it is common to use RM algorithm which estimates on-line quantities of interest. The combination of control theory and SA gave birth to numerous papers on RL.

Our contributions are as follows. We first conduct an error analysis to show that the classical asymptotic rate O(1/N) is pessimistic and can be enhanced in many situations. It is indeed possible to get a O((log(N)/N)β) asymptotic speed with 1/2β1 and N the number of iterations. Then, we present our main result. It consists in proposing a dynamic policy for the choice of the step size (γk)k0 used in (3). Our policy is decomposed into two interacting levels: the inner and the outer level. In the inner level, we propose the Algorithm 3 (PASS). algorithm, for “PAst Sign Search”. This algorithm builds a new sequence (γki)k0, using a predefined sequence (γko)k0 and the sign variations of m(qn,Xn+1(Zn),Zn). The error of (γki)k0 decreases faster than the one of (γko)k0. In the outer level, we propose an optimal dynamic policy for the choice of the predefined sequence (γko)k0. These two levels are interacting in the sense that Algorithm 3 (PASS). influences the construction of (γko)k0. Finally, we show that our selection methodology provides better convergence results than standard RL algorithms in three numerical examples: the drift estimation, the optimal placement of limit orders and the optimal execution of a large number of shares.

The structure of this paper is as follows: Section 2 describes the relation between RL and SA. Section 3 reformulates the problem (1) in terms of a minimisation issue and defines with accuracy the different sources of error. This enables us to exploit the most recent convergence results for each source of error to show that the slow convergence speed O(1/N) can be replaced by O((log(N)/N)β) with 1/2β1 and N the number of iterations. Section 4 contains our main contribution. We start by defining the Algorithm Algorithm 3 (PASS)., comparing it with two other schemes and proving its convergence. Then, we describe our outer level policy and discuss its optimality. Finally, we explain our selection methodology of the learning rate (γko)k0 which combines the Algorithm 3 (PASS). Algorithm with the outer level to enhance the convergence of (3). The last Section 5 provides numerical examples taken from the optimal trading literature: optimal placement of a limit order and the optimization of the trading speed of a liquidation algorithm. Proofs and additional results are relegated to an appendix.

2 Reinforcement learning

We detail in this section the relation between SA and RL since we are interested in solving RL problems. RL aims at estimating the Q-function which quantifies the value for the player to choose the action a when the system is at s. Let t be the current time, Ut𝒰 be a process defined on a filtered probability space (Ω,,t,) which represents the current state of the system and At𝒜 the agent action at time t. We assume that the process (Ut,At) is Markov. The agent aims at maximizing

𝔼[0Tρsf(s,Us,As)𝑑s+ρTg(UT)], (4)

with g the terminal constraint, f the instantaneous reward, ρ a discount factor and T the final time. Let us fix a time step Δ>0 and allow the agent to take actions only at times11 1 We recall the following classical result: when Δ goes to zero, the value function and the optimal control of this problem converges towards the one where decisions are taken at any time. kΔ with k. The Q-function satisfies

Q(t,u,a)=supA𝔼A[tTρ(s-t)f(s,Us,As)ds+ρ(T-t)g(UT)|Ut=u,At=a],(t,u,a)+×𝒰×𝒜,

with A={At,t<T} a possible control process for the agent. Note that the action of the agent depend on s=(t,u) with t the current time and u the current state. We view the agent control A as a feedback process (i.e adapted to the filtration t). The Q-function satisfies the classical dynamic programming principle (DPP)

Q(t,u,a)=𝔼[Rt+Δ+ρΔsupa𝒜Q(t+Δ,Ut+Δ,a)|Ut=u,At=a], (5)

with Rt+Δ=tt+Δρ(s-t)f(s,Us,As)𝑑s. Equation (5) reads that the optimal expected gain when the agent starts at s and chooses action a at time t is the sum of the next expected reward Rt+Δ plus the value of acting optimally starting from the new position Ut+Δ at time t+Δ. By reformulating (5), we obtain that Q solves equation

𝔼[m(q,Xt+Δz,z)]=0,z=(t,u,a)𝒵=[0,T]×𝒰×𝒜, (6)

where Xt+Δz=(Ut+Δz,Rt+Δz)𝒳=𝒰×, Usz and Rsz are respectively the conditional random variables Us and Rs given the initial condition (Utz,Atz)=(u,a) with z=(t,u,a)𝒵 and m is defined as follows:

m(q,x,z1)=H(q,x,z1)-q(z1),H(q,x,z1)=r+ρΔsupa𝒜q(t1+Δ,u,a),

for any x=(u,r)𝒳 and z=(t1,u1,a1)𝒵. Thus, one can use stochastic approximation tools to solve (6).

Note that Equation (6) shows that one can study Q only on the time grid22 2 Here, we take T=n*Δ with of n**. Such approximation is not restrictive. DT={nΔ,nT/Δ}. Thus, we define Ak and Uk such that Ak=AkΔ and Uk=UkΔ for any k. The key variable to study is not the agent decision Ak but Zk=(k,Uk,Ak). Thus, the rest of the paper formulates the results in terms of Zk only.

Actions of the agent.

It is important in practice to visit the space DT×𝒰×𝒜 sufficiently enough. Thus, to learn Q, it is common to not choose the maximising action33 3 The maximising action a* for a state u is defined such that a*=argmaxa𝒜q(u,a)., but to encourage exploration by visiting the states where the error is large. We give in Appendix A examples of policies that promote exploration and others that maximize the Q-function.

3 Improvement of the asymptotic convergence rate

In [benveniste2012adaptive, Part 2, Section 4], [kushner2003stochastic, Section 10] and [kushner1978stochastic, Section 7], the authors show a central limit theorem for the procedure (3) which ensures a convergence rate of O(1/N) where N is the number of iterations. In this section, we extend such convergence rate to Algorithm (2) and aim at understanding how one can improve it. For this, we decompose our total error into two components: estimation error and optimization error.

3.1 Error decomposition

In this section, the space 𝒵={1,,d} is finite with d*. In such case, we view q and M(q) as vectors of 𝒵. Moreover, the process (Zn)n1 is Markov. We consider the following assumption.

Assumption 1 (Existence of a solution).

There exists a solution q* of Equation (1).

Under Assumption 1, the function q* is solution of the minimization problem

minq𝒬g(q). (7)

with g(q)=M(q)2.

Remark 1.

Note that, in the special case where M is the gradient of a given function f (i.e. f=M), the quantity q* minimises a convex and differentiable cost

g~=z𝒵𝔼[L(q,X(z),z)],

and we can derive the following explicit expression for L

L(q,x,z)=01(q(z)-q*(z))×m(q*+r(q-q*),x,z)𝑑r,q𝒬,z𝒵,x𝒳.

with g~=M. Thus, one can replace g by g~ and all the results of this section hold. In the rest of this section, we use g instead of g~.

In our context we do not have a direct access to the distribution of X(z). Nevertheless, we assume that at time n we keep memory of a training sample of n(z) independent variables (Xiz)i=1n(z) drawn from the distribution X(z) where n(z) is the number of times the Markov chain Zn visited z. We define qn as a solution of

minq𝒬gn(q), (8)

with gn(q)=Mn(q)2 and

Mn(q,z)=𝔼n[m(q,X(z),z)]=(j=1n(z)m(q,Xj(z),z))/n(z),

expected value under the empirical measure μ=(j=1n(z)δXj(z))/n(z) (i.e. empirical risk). We finally define qkn as an approximate solution of the problem (8) returned by an optimization algorithm after k iterations. Thus, we can bound the error g(qkn) by

0𝔼[(g(qkn)-g(q*))(z)]𝔼[(g(qn)-g(q*))(z)]estimation error+𝔼[|g(qn)-g(qkn)|(z)]optimization error,

since q* minimizes g.

3.2 Convergence rate of the estimation error

3.2.1 Slow convergence rate

We have the following result.

Proposition 1.

We assume that the Markov chain Zn is irreducible. There exists c1>0 such that

𝔼[supq𝒬|g(q)-gn(q)|(z)]c11n,z𝒵.

For sake of completeness, we give the proof of this result in Appendix D. Proposition 1 allows us to derive the following bound for the estimation error

𝔼[(g(qn)-g(q*))(z)] =𝔼[(g(qn)-gn(qn))(z)]+𝔼[(gn(qn)-gn(q*))(z)]0+𝔼[(gn(q*)-g(q*))(z)]
2𝔼[supq|g(q)-gn(q)|(z)]2c11n. (9)

This bound is known to be pessimistic.

3.2.2 Fast convergence rate

We obtain the following fast statistical convergence rate.

Proposition 2.

Assume that the Markov chain Zn is irreducible and

𝔼[supq𝒬|g(q)-gn(q)|(z)|n(z)]c(log(n¯(z))n¯(z))β, (10)

with 12β1, c>0 and n¯(z)=n(z)1. Then, there exists c2>0 such that

𝔼[|g(qn)-g(q*)|(z)]c2(log(n)n)β,z𝒵.

The proof of this proposition is given in Appendix E. The condition (10) is established when

  • The loss function g satisfies regularity conditions, of which the most important are : Lipschitz continuity and convexity, see [bartlett2006convexity, Section 4]. Moreover under the strong convexity assumption, the constant β is equal to 1.

  • The data distribution satisfies some noise conditions, see for instance [bartlett2006convexity, Section 3] and [tsybakov2004optimal].

  • The function m has a bounded moment α (i.e. 𝔼[|m|α(q)]<) with α>1, see [cortes2019relative, Section 4].

It is also possible to get rid of the log(n) factor in (10), see the end of Section 5 in [bousquet2003introduction].

3.3 Convergence rate of the optimization error

We turn now to the optimization error. This means that expected value in (7) is replaced by the empirical risk which is known. In such case, one can use many algorithms to find qn. We introduce the following assumption.

Assumption 2 (Pseudo strong convexity).

There exists L>0 such that

M(q)-M(q),q-qLq-q2,q𝒵,q𝒵.

The above assumption is natural since the gradient of any real valued strongly convex function f satisfies Assumption 2.

We present in the table below the most important properties of some gradient methods. Note that the results of Table 1 remain valid when strong convexity is replaced by Assumption 2.

Algorithm Cost of one iteration Iterations to achieve an ϵ precision Time to reach an ϵ precision
Convex Strongly convex Convex Strongly convex
GD O(d2) O(1ϵ) O(log(1ϵ)) O(d2ϵ) O(d2log(1ϵ))
SGD O(d) O(1ϵ2) O(1ϵ) O(dϵ2) O(dϵ)
Proximal O(d) O(1ϵ) O(log(1ϵ)) O(dϵ) O(dlog(1ϵ))
Acc. prox. O(d) O(1ϵ) O(log(1ϵ)) O(dϵ) O(dlog(1ϵ))
SAGA O(d) O(1ϵ) O(log(1ϵ)) O(dϵ) O(dlog(1ϵ))
SVRG O(d) O(log(1ϵ)) O(dlog(1ϵ))
Table 1: Asymptotic properties of some gradient methods. Note that d is the dimension of the state space 𝒵 and ϵ is a desired level of accuracy. Here ϵ corresponds to 1/n. GD stands for Gradient Descent, SDG for Stochastic Gradient Descent, Proximal for Stochastic proximal gradient descent [combettes2011proximal, schmidt2011convergence], Acc. prox. for accelerated proximal stochastic gradient descent [nitanda2014stochastic, schmidt2011convergence], SAGA for Stochastic accelerated gradient approximation [defazio2014saga], and SVRG for stochastic variance reduced gradient [johnson2013accelerating].

3.4 Conclusion

Following the formalism of [bottou2008tradeoffs], we have decomposed our initial error into

  • Estimation error: its convergence is O(1/N) in pessimistic cases with N the number of iterations. In the other situations, the convergence is faster (i.e. O((log(N)/N)β)) with 1/2β1.

  • Optimization error: the convergence is exponential under suitable conditions. In unfavourable cases, the convergence rate is O(1/N).

The comparison of these error sources shows that the estimation error is the dominant component. Thus, one can overcome the O(1/N) asymptotic speed, in some situations, by improving the estimation error.

4 Optimal policy for the learning rate γ when is countable

In this section, we take 𝒵= and consider the following type of algorithms:

qn+1(Zn)=qn(Zn)-γn(Zn)m(qn,Xn+1(Zn),Zn),n.

One can recover the classical SARSA, Q-learning and double Q-learning algorithms used in RL by considering a specific expression for m and Xn+1. In such algorithms the choice of γn is a crucial point. One can find in the literature general conditions that guarantee the convergence such that

k0γk(z)=,a.s,k0γk2(z)<,a.s,z𝒵. (11)

However, since the set of processes (γn)n0 satisfying these conditions is large in general and may even be empty when (Zn)n0 is not recurrent, many authors suggest to take γn proportional to 1/nα for stochastic approximation (SA) algorithms to be more specific. The exponent α may vary from 0 to 1 depending on the algorithm used, see [gadat2017optimal, moulines2011non]. Nonetheless, such a choice may be sub optimal. For example, Figure 1.a shows that the blue curve is a way higher than the orange one. Here, the blue curve represents the variation of the logarithm of the L2-error when γn=η/n, whereas the orange curve stands for a constant learning rate (i.e. γn=γ). We choose the constant η that ensures the fastest convergence for the blue curve.

In this paper, we propose to use a stochastic learning rate (γk)k0; our learning policy is decomposed into two interacting levels: the inner and the outer level. In the inner level, we use the Algorithm 3 (PASS). algorithm, for “PAst Sign Search”. This algorithm builds a new sequence (γki)k0, based on a predefined sequence (γko)k0 and the sign variations of m(qn,Xn+1(Zn),Zn), whose error decreases faster than the predefined one. In the outer level, we propose an optimal methodology for the selection of the predefined sequence (γko)k0. These two levels are interacting in the sense that the Algorithm 3 (PASS). algorithm influences the construction of the sequence (γko)k0.


Figure 1: L2-error for the estimation of the drift when γk is constant in orange and when γk1k in blue.

4.1 The inner level

4.1.1 The algorithms

In this part, we introduce three algorithms. We start with our benchmark which is the standard algorithm used in RL. Then, we present a second algorithm inspired from SAGA [defazio2014saga], which is a method used to accelerate the convergence of the stochastic gradient descent. SAGA reduces the optimization error exponentially fast. Finally, we describe the Algorithm 3 (PASS). algorithm that modifies the learning rate (γk)k based on the sign variations of m(qn,Xn+1(Zn),Zn). The main idea is to increase γn as long as the sign of m(qn,Xn+1(Zn),Zn) remains unchanged. Then, we reinitialize or lower γn using a predefined sequence (γko)k when the sign of m(qn,Xn+1(Zn),Zn) switches. This algorithm can be seen as an adaptation of the line search strategy, which determines the maximum distance to move along a given search direction, to SA methods. Actually, the line search method requires a complete knowledge of the cost function because it demands to evaluate several times the difference g(qk+γM(qk))-g(qk) for different values of γ with g and M defined in Section 3.1. However, stochastic iterative models have neither access to g nor M. They can only compute m(qn,Xn+1(Zn),Zn) when the state z=Zk is visited. Moreover, to get a new observation they need to wait44 4 This waiting time may be very long depending on the dimension of the state space 𝒵 and the properties of the process (Zk)k0. for the next visit of the state z=Zk. Nevertheless, they have instantaneous access to the previously observed values. Thus, the main idea here is to use these past observations although it adds a small memory cost. Some theoretical properties of these algorithms are investigated in Section 4.1.3.

Algorithm 1 (RL).

We start with an arbitrary q0Q and define by induction qk

qk+1(Zk)=qk(Zk)-γk(Zk)m(qk,Xk+1(Zk),Zk).
Algorithm 2 (SAGA).

We start with an arbitrary q0Q, M0=055 5 Here M0 is the zero function in the sense that M0[z,i]=0 for any zZ and i{1,,M}. and define by induction qk and Mk

qk+1(Zk)=qk(Zk)-γk(Zk)[m(qk,Xk+1(Zk),Zk)-Mk[Zk,i]+(j=1MMk[Zk,j])M],Mk+1[Zk,i]=m(qk,Xk+1(Zk),Zk),

with i picked from the distribution p=(i=1Mδi)/M.

For the next algorithm, we give ourselves a predefined learning rate that we call (γk)k0, a non-decreasing function h:++ and a non-increasing function l:++. The function h is used to increase the learning rate and hence to accelerate the descent, while the function l is used to go back to a slower pace.

Algorithm 3 (PASS).

We start with an arbitrary q0 and define by induction qk and γ^k

  • If m(qn,Xn+1(Zn),Zn)×m(qr1n,Xr1n+1(,Zr1n),Zr1n)0, then do

    qn+1(Zn)=qn(Zn)-h(γ^n(Zn),γn(Zn))m(qn,Xn+1(Zn),Zn),γ^n+1(Zn)=h(γ^n(Zn),γn(Zn)),

    with r1n is the index of the last observation when the process X visits the state Xn.

  • Else, do

    qn+1(Xn)=qn(Xn)-l(γ^n(Xn),γn(Xn))m(qn,Xn,Xn+1),γ^n+1(Xn)=l(γ^n(Xn),γn(Xn)).

4.1.2 Assumptions

In this section, we present the assumptions needed to prove our main result about the convergence of Algorithms Algorithm 1 (RL)., Algorithm 2 (SAGA). and Algorithm 3 (PASS).. We assume that Assumption is in force. Hence, we denote by q* a solution of (1) and write m* for the vector m*=m(q*)𝒳×𝒵. Let us consider the following assumptions:

Assumption 3 (Pseudo strong convexity 2).

There exists a constant L>0 such that

(𝔼k[m(qk,Xk+1(Zk),Zk)-m*(Xk+1(Zk),Zk)])(qk(Zk)-q*(Zk))L(qk(Zk)-q*(Zk))2,

with q* of (1) and Ek[X]=E[X|Fk] for any random variable X.

Recall that Assumption 3 is natural in the deterministic framework. For instance, if we take a strongly convex function f and call m its gradient (i.e m=f). Then, m satisfies Assumption 3. Additionally, the pseudo-gradient property (PG) considered in [bertsekas1996neuro, Section 4.2] is close to Assumption 3. However, Assumption 3 is slightly less restrictive than PG since it involves only the norm of the component (qk-q*)(Zk) instead of the norm of the vector (qk-q*). To get tighter approximations, we will also need the quantity Lk defined as follows:

Lk={𝔼k[m(qk,Xk+1(Zk),Zk)-m*(Xk+1(Zk),Zk)]qk(Zk)-q*(Zk), If qk(Zk)-q*(Zk)0,0, otherwise.

Note that Lk0 under Assumption 3. It is also the biggest constant that satisfies Assumption 3 for a fixed k. In particular, this means that LkL.

Assumption 4 (Lipschitz continuity of m).

There exists a positive constant B>0 such that for any random variables X and X valued in X we have

𝔼k[(m(qk,X,Zk)-m*(X,Zk))2]B{1+(qk(Zk)-q*(Zk))2+𝔼k[(X-X)2]},

with Ek[X]=E[X|Fk] for any random variable X.

Assumption 4 guarantees that m is Lipschitz. Authors in [bertsekas1996neuro, Section 4.2] use a similar condition. To get better bounds, we introduce Bk

Bk=𝔼k[(m(qk,X,Zk)-m*(X,Zk))2]1+(qk(Zk)-q*(Zk))2+𝔼k[(X-X)2].

We have BkB since Bk is the smallest constant satisfying Assumption 4 for a fixed k. We finally need an assumption on the learning (γk)k0 that describes indirectly how the process Z communicates with its different states.

Assumption 5 (Learning rate explosion).

For any zZ, there exists a non-negative deterministic sequence (γkd(z))kN such that

γk(z)γkd(z),a.s and k1γkd(z)=,z𝒵.

When the process Z is Markov and γk(z) is bounded, Assumption 5 ensures that Z is recurrent. To see this, we first remark that the boundedness assumption of γk(z) is not restrictive and can always be fulfilled. After that, we use the boundedness of γk(z) to exhibit a positive constant A such that γk(z)A,a.s for all k1. Thus, we get k1𝔼[γk(z)𝟏Zk=z]Ak1[Zk=z]. Since the left hand side of the previous inequality diverges under Assumption 5, we have

k1[Zk=z]=,

which proves that Z is recurrent.

4.1.3 Main results

In this section, we compare the algorithms Algorithm 1 (RL)., Algorithm 2 (SAGA). and Algorithm 3 (PASS). and prove the convergence of Algorithm 3 (PASS).. Let c be a positive constant and k. We define the error function for the different algorithms as follows:

ek(z)={(qk(z)-q*(z))2, for algorithms Algorithm 1 (RL). and Algorithm 3 (PASS).,j=1M(Mk[z,j]-m*(z))2M+c(qk(z)-q*(z))2, for algorithm Algorithm 2 (SAGA).,

for all z𝒵 and j{1,,M}. We write Ek for the total error Ek=ek. We also use the following notations:

p(x)=2Lx-Bx2,pk(x)=2Lkx-Bkx2,γ¯k=argsuplpk(l)=LkBk,x.
Proposition 3.

Let zZ. Under Assumptions 1, 3 and 4 and when there exists r11 such that

γ2l(γ1,γ2), and   h(γ1,γ2)r1γ2,(γ1,γ2)2,

we have

𝟏A𝔼k[ek+1(z)] 𝟏A[αkek(z)+Mk], (12)

with A={Zk=z}. The values of the constant αk and Mk vary from an algorithm to another as follows:

αk(z1)={[1-p(γk(z))], for algorithm Algorithm 1 (RL).,max(1-p(γk(z))+BMc,1-(1M-6γk2(z)c)), for algorithm Algorithm 2 (SAGA).,{1-pk(γ¯k*)+d1γk2(z)𝟏ck1}, for algorithm Algorithm 3 (PASS)., (13)

and

Mk={Bγk2(z)(2+vk), for algorithm Algorithm 1 (RL).,3Bγk2(z)(2+vk), for algorithm Algorithm 2 (SAGA).,B(ckγ¯k)2(2+vk), for algorithm Algorithm 3 (PASS)., (14)

with ck=Ek[γ^k(z)m(qk,Xk+1(z),z)]γ¯k(z)Ek[m(qk,Xk+1(z),z)], γ¯k*=ck(z)γ¯kγ¯k, d1=(r1-1)2Bk and vk=Var(Zk).

Equation (12) reveals that the performances of algorithms Algorithm 1 (RL)., Algorithm 2 (SAGA). and Algorithm 3 (PASS). depend on the interaction between two competing terms:

  • On the one hand the slope αk controls the decrease of the error from one step to the other,

  • On the other hand the quantity Mk gathers two sources of imprecision: the estimation error and the optimization error. Both sources of imprecision have a term in the variance vn (because the distribution of Z is unknown) and a positive constant (coming from the noise generated by the noisy nature of observations).

There is a competition between these two terms: to decrease Mk we need to send γk towards zero while the reduction of αk requires a relatively small but still non-zero value of γk. Thus, γk should satisfy a trade-off in order to ensure the convergence of the algorithms. The RM conditions (11) are a way to address this trade-off. Now, in order to analyse the properties of each algorithm, we compare for a given γk their respective values for αk and Mk in Table 2. For sake of clarity, we choose to present the variable (1-αk) instead of αk in this table; note that a large value of 1-αk means that αk is small and thus induces a fast convergence.

Algorithms (1-αk) Mk
Value Comparison with Algo 1 Value Comparison with Algo 1
RL (Algo 1) 2γk(z1)L - Bγk2(z1) Bγk2(2+vk)
SAGA (Algo 2) (2γk(z1)L-Bγk2(z1)-BMc)(1M-6γk2(z1)c) smaller 3Bγk2(2+vk) larger
PASS (Algo 3) 2γ¯k*Lk - Bk(γ¯k*)2+d1γk2𝟏ck1 larger B(ckγ¯k)2(2+vk) larger
Table 2: Comparison of the algorithms Algorithm 1 (RL)., Algorithm 2 (SAGA). and Algorithm 3 (PASS)..

The result below holds only for Algorithms 1 and 3.

Theorem 1.

Let the Assumptions 1, 3, 4 and 5 be in force. Algorithms 1 and 3 verify

  • When k0𝔼[γk2](z)< for all z, we have

    𝔼0[En]n0. (15)
  • When γn(z)=γ𝟏{Zn=z} for all (n,z)×𝒵 with γ]0,1[ a constant, there exist two constants β[0,1) and M¯>0 which satisfy

    𝔼0[En]c(1+r)(1-α~n)+α~nM¯E0,α~<max(β,α),n, (16)

    with c=γ2M1-α, α=1-γ and M=supnMn.

The first part of the above result ensures the convergence of the error En towards zero and the second part describes the behaviour of the error when γk(z) is fixed constant. In such case, the error En decreases exponentially fast towards the constant c=(γ2M)/(1-α).

4.2 The upper level

In practice, to apply Algorithm 3 (PASS). we need an appropriate predefined sequence (γk)k. It is possible to take γk proportional to 1/kα with α(0,1] as proposed in [moulines2011non, robbins1951stochastic]. However, in this section, we present an optimal dynamic policy for the choice of the learning rate (γk)k. To do so, we assume that

𝟏Aen+1(z)=𝟏A(αnen(z)+Mn)+Sn, (17)

with Sn does not depend on the learning rate and verifies 𝔼[Sn]0. Equation (17) is consistent with Proposition 3. Since αn and Mn are both functions of the learning rate, the idea is to choose the learning rate γn such that

γn=argminγ(αn(γ)en(z)+Mn(γ)), (18)

which gives

γn={Len(z)Ben(z)+[B(2+vn)], for algorithm Algorithm 1 (RL).,Lnen(z)Bnen(z)+d1𝟏cn1+[(Bncn2)(2+vn)], for algorithm Algorithm 3 (PASS).,

The constants L, Ln, B, Bn, d1 and cn are defined in Proposition 3. We write Γ for the set of processes γ~=(γ~n)n0 adapted to the filtration generated by the observed errors (en)n0. We have the following result.

Proposition 4.

The sequence γ=(γn)n0 defined in (18) satisfies

eγneγ~n,a.s,n.

We write eγn to point out the dependence of the error en on the chosen control γ.

The proof of the above result is given in Appendix H. Proposition 4 shows that γ ensures that fastest convergence speed of the average error and in particular

𝔼[eγn]=infγ~Γ𝔼[eγ~n],n.

This guarantees its optimality.

Since the constants L and B are unknown in practice, a first solution consists in starting with arbitrary values for B and L and generating a sequence of learning rates. If the error m(qn,Xn+1(Zn),Zn) increases, one can take a larger value for B and a smaller one for L otherwise he keeps B and L unchanged. A second solution consists in taking a piece-wise constant (PC) policy for the learning (γk)k0. To do so, we average the error m(qn,Xn+1(Zn),Zn) over the lasts p visit times with p fixed by the controller. If this average error does not decrease, we reduce the step size by a factor α.

4.3 Extension

The results of this section still hold when the descent sequence (2) is replaced by

qn+1(z)=qn(z)-γn(z)m(qn,Xn+1(z),z),z𝒵,n. (19)

When γn(z)=0 if zZn, we recover (2). Thus, Equation (19) is slightly more general. Moreover, Algorithm (19) appears in many contexts: stochastic iterative algorithms, gradient methods, fixed point iterative techniques, etc. The results of this section can be directly transposed to (19). We present in Appendix B an adaptation of the Algorithm Algorithm 3 (PASS). for (19).

5 Some examples

5.1 Methodology

The code and numerical results presented in this section can be found in https://github.com/othmaneM/RL_adap_stepsize. Here, we compare four algorithms. The two first ones are two different versions of Algorithm 1 (RL).. In the first version, the learning rate γk(z) is taken such that γk=η/nk(z) with η>0 selected to provide the best convergence results and nk(z) the number of visits to the state z. In the second version, the step size follows the piece-wise constant policy (PC) described at the end of Section 4.2. The third algorithm is Algorithm 2 (SAGA). where the step size is derived from PC policy. Finally, we use the Algorithm 3 (PASS). algorithm presented in the previous sections with a predefined learning rate following the PC policy. We consider three numerical examples to compare the convergence speed of these algorithms: drift estimation, optimal placement of limit orders and the optimal liquidation of shares.

5.2 Drift estimation

Formulation of the problem.

We observe a process (Sn)n0 which satisfies

Sn+1=Sn+fn+1+Wn, (20)

with Wn a centred noise with finite variance. We want to estimate the quantities fi with i{1,,nmax}. Using (20) and 𝔼[Wt]=0, we get

𝔼[Si+1-Si-fi+1]=0,i{0,,nmax-1}.

Thus, we can estimate fi using stochastic iterative algorithms. The pseudo-code of our implementation of Algorithm 3 (PASS). for this problem can be found in the Appendix C under the name Implementation C.

Numerical results.

Figure 2 shows the variation of the L2-error when the number of iterations increases. We can see that the algorithm Algorithm 3 (PASS). outperforms standard stochastic approximation algorithms. Moreover, other algorithms behave as expected: the standard RL decreases very slowly (but we know it will drive the asymptotic error to zero), the constant learning rate and SAGA provides better results than RL, while Algorithm 3 (PASS). seems to have captured the best of the two worlds for this application: very fast acceleration at the beginning and the asymptotic error goes to zero.

L2-error against the number of iterations  

Figure 2: The L2-error between fk and f for different numerical methods averaged over 1000 simulated paths.

5.3 Optimal placement of a limit order

Formalisation of the problem.

We consider an agent who aims at buying a unit quantity using limit orders and market orders (see [lehalleLimitOrderStrategic2017] for detailed explanations). In such case, the agents wonder how to find the right balance between fast execution and avoiding trading costs associated to the bid-ask spread. The agent state at time t is modelled by Xt=(QBefore,QAfter,P) with QBefore the number of shares placed before the agent’s order, QAfter the queue size after the agent’s order and Pt the mid price, see Figure 3. The agents wants to minimise the quantity

𝔼[F(XτTexec)+0τTexecc𝑑s]

where

  • Texec=inf{t0,Pt=0} the first time when the limit order gets a transaction.

  • τ the first time when a market order is sent.

  • X=(QBefore,QAfter,P) the state of the order book.

  • F(u) is the price of the transaction (i.e. F(u)=p+ψ when the agents crosses the spread and F(u)=p otherwise).

We show in Section 2 that the Q-function is solution of (6), see details in https://github.com/othmaneM/RL_adap_stepsize. Thus, we can use Algorithms Algorithm 1 (RL)., Algorithm 2 (SAGA). and Algorithm 3 (PASS). to estimate it. The pseudo-code of our implementation of Algorithm 3 (PASS). is available as Implementation 25 in Appendix C.

{tikzpicture}\draw

(3,-1) rectangle (4,-5); \draw[black,fill=gray!60] (3,-3) rectangle (4,-3.3); \draw(6,-2) rectangle (7,-5); \draw[dotted][->](0,-5) – ++(10,0); \draw(5,-5) node[sloped]|; \draw(3.5,-5) node[below]Same ; \draw(6.5,-5) node[below]Opp ; \draw(3,-4) node[left]QBefore ; \draw(3,-3.15) node[left]q ; \draw(3,-2) node[left]QAfter ; \draw(7,-4) node[right]QOpp ; \draw(5,-5) node[below]P(t) ; \draw(10,-5) node[right]Price ;

Figure 3: The state space of our limit order control problem.
Numerical results.

Figure 4 shows three control maps: the x-axis reads the quantity on “same side” (i.e. Qsame=QBefore+QAfter) and the y-axis reads the position of the limit order in the queue, i.e. QBefore. The color and numbers gives the control associated to a pair (Qsame,QBefore): 1 (blue) means “stay in the book”, while 0 (red) means “cross the spread” to obtain a transaction. The panel (at the left) gives the reference optimal controls obtained with a finite difference scheme, the middle panel the optimal controls obtained for a Algorithm 1 (RL). algorithm where the step-size (γk)k0 is derived from the upper level policy, and the right panel the optimal control obtained with our optimal policy (i.e. upper level and inner level combined). It shows that after few iterations our optimal policy already found the optimal controls. Figure 5 compares the log of the L2 error, averaged over 100 trajectories, between the different algorithms. We see clearly that our methodology improves basic stochastic approximation algorithms. Again, the other algorithms behave as expected: SAGA is better than a constant learning rate that is better than the standard RL (at the beginning, since we know that asymptotically RL will drive the error to zeros whereas a constant learning rate does not).

a) Theoretical optimal control b) step_cste optimal control c) PASS optimal control  

Figure 4: Comparison optimal control after 300 iteration for different methods: left is the optimal control, middle is Algorithm 1 (RL). with a step size derived from the upper level and right is our optimal policy for the step size (i.e. upper level and inner level combined).

L2-error against the number of iterations  

Figure 5: The log L2-error against the number of iterations averaged over 1000 simulated paths.

5.4 Optimal execution

Formalisation of the problem.

An investor wants to buy a given quantity q0 of a tradable instrument (see [cartea15book] and [lehalleMarketMicrostructurePractice2018] for details about this application). The price St of this instrument satisfies the following dynamic:

dSt=αdt+σdBt,

where α is the drift and σ is the price volatility. The state of the investor is described by two variables its inventory Qt and its wealth Xt at time t. The evolution of these two variables reads

{dQt=νtdt,Q0=q0,dWt=-νt(St+κνt)dt,W0=0, (21)

with νt the trading speed of the agent and κ>0. The term κνt corresponds to the temporary price impact. The investor wants to maximize the following quantity

WT+QT(ST-AQT)-ϕtTQs2𝑑s,

it represents its final wealth XT at time T, plus the value of liquidating its inventory minus a running quadratic cost. The value function V is defined such that

V(t,w,q,s)=supν𝔼[WT+QT(ST-AQT)-ϕtTQs2ds|Wt=w,Qt=q,St=s].

We remark that v(t,w,q,s)=V(t,w,q,s)-w-qs verifies

v(t,w,q,s)=supν𝔼[(WT-Wt)+(QTST-QtSt)=MTt-AQT2-ϕtTQs2ds|Wt=w,Qt=q,St=s].

Using (21), we can see that the variable MTt is independent of the initial values Wt, St and Qt. This means that v is a function of only two variables the time t and the inventory q. The dynamic programming principle ensures that v satisfies

v(t,q)=supν𝔼[Mt+Δt-ϕtt+ΔQs2ds+v(t+Δ,Qt+Δ)|Qt=q]. (22)

We fix a maximum inventory q¯. Let k=(kT,kq)(*)2, Δ=T/kT, DT={tikT;ikT} and Dq={qikq;ikq} with tikT=iΔ and qikq=-q¯+2iq¯/kq. To estimate v we use the numerical scheme (vnk)n1,k(*)2 defined such that

vn+1k(Zn)=vnk(Zn)+γn(Zn)[supνA(Zn){Mn+1ν-ϕΔQn2+vnk(Zn+1ν)-vnk(Zn)}],

with Zn=(nΔ,QnΔ) and A(Zn)Dq is the set of admissible actions66 6 We do not allow controls that lead to states where the inventory exceeds q¯.. When the final time T is reached (i.e. n=kT), we pick a new initial inventory from the set Dq and start again its liquidation. At a first sight, it is not clear that vnk approximates v. However, we show in Appendix I that vnk converges point-wise to v on DT×Dq when n and k. see Appendix C for a detailed implementation of the algorithm with the corresponding pseudo-code (as Implementation C).

Numerical results.

Figure 6 shows the value function v for different values of the elapsed time t and the remaining inventory Qt. The panel (at the left) gives the reference value function. It is computed by following the same approach of [MFGTradeCrow2016]. The middle panel the value function obtained obtained after 120 000 iterations for Algorithm 1 (RL). algorithm where the step-size (γk)k0 is derived from the upper level of our optimal policy, and the right panel the value function obtained with our optimal policy (i.e upper level and inner level combined). It shows that our optimal strategy leads to better performance results. We also plot, in Figure 7, a simulated path for the variations of the log L2 error for different algorithms. Here again, we notice that our methodology improves the basic RL algorithm and that the ordering of other approaches is similar to the one of the “drift estimation” approximation (i.e. SAGA and the constant learning rate are very similar).

a) Theoretical value function b) step_ cste value function c) PASS value function  

Figure 6: Comparison value function between methods.

L2-error against the number of iterations  

Figure 7: The log L2-error against the number of iterations averaged over 1000 simulated paths.

Appendix A Actions of the agent

We present here policies that encourage exploration and others that maximize the agent’s decisions.

Exploration policies:

Since it is important to visit the state space sufficiently enough, we propose to set the conditional distribution of the random variable Ak such that

[Ak=a|k]=eϵ¯k(Zka)aeϵ¯k(Zka),a𝒜, (23)

with Zka=(k,Uk,a), ϵ¯k(Zka)=β(Zka)aϵk(Uk+1Zka,a) and

ϵk(z)={|m(qr1(z),Xr1(z)+1(z),z)|when the state z already visited,botherwise,

where b>0 encourages the exploration, qk satisfies (2) and r1(z) is the last observation time of the state z.

Optimal policies.

To give more importance to the maximizing action, one may consider the following policy:

[Ak=a|k]=eβk(Uk)qk(Zka)aeβk(Uk)qk(Zka),a𝒜. (24)

Any mixture of these two procedures can an also be considered.

Appendix B Extension of the Algorithm 3 (PASS). Algorithm

First, we can adapt Algorithm Algorithm 3 (PASS). to (19) by considering a component-wise version which consists in applying Algorithm 3 introduced in Section 4.1.1 to each visited coordinate of qn. Next, we also propose a more direct approach with the algorithm below.

Algorithm 4 (PASS for (19)).

We start with an arbitrary q0 and define by induction qk and γ^k

  • If m(qn,Xn+1),m(qn-1,Xn)0, then do

    qn+1=qn-h(γ^n,γn)m(qn,Xn+1),γ^n+1=h(γ^n,γn).
  • Else, do

    qn+1=qn-l(γ^n,γn)m(qn,Xn+1),γ^n+1=l(γ^n,γn).

Appendix C Implementations

We give here the pseudo code used for each one of the three numerical examples considered in Section 5.

Drift estimation.

We consider the following expression for the functions h and l:

h(γ,γbase)=min(γ+γbase,3γbase),l(γ,γbase)=max(γ-γbase,γbase),(γ,γbase)+.

We use Implementation C for the numerical experiments. {algorithm} PAst Sign Search (PASS) for (RL) drift estimation problem \[email protected]@algorithmic[1] \StateAlgorithm parameters: step size (γo)n0(0,1], number of episodes n \Statexinitial guess q0, past error value Epast \StatexInitialise γ^0=γ0o \Forepisode in 1:n \For t{0,,nmax-1} \StateObserve ΔXnext=St+1-St \Ifthe first visit time to t \Stateq0(t)q0(t)-γ^0(t)m(q0,ΔXnext,t) \ElsIfm(q0,t,ΔXnext)×Epast(t)0 \Stateγ^0(t)h(γ^0(t),γo(t)) \Stateq0(t)q0(t)-γ^0(t)m(q0,ΔXnext,t) \ElsIfm(q0,t,ΔXnext)×Epast(t)<0 \Stateγ^0(t)l(γ^0(t),γo(t)) \Stateq0(t)q0(t)-γ^0(t)m(q0,ΔXnext,t) \EndIf\StateEpast(t)m(q0,t,ΔXnext) \EndFor\StateSave the norm E of the vector Epast(t). \Ifthe average value of E over the last w=5 episodes is not reduced by p=1% \Stateγo(t)max(γo(t)/2,0.01) (this is done each w episodes) \EndIf\EndFor

Optimal placement of limit orders.

We consider the following expression for the functions h and l:

h(γ,γbase)=max(min(γ+2/3γbase,3γbase),γbase),l(γ,γbase)=max(γ-2/3γbase,γbase),(γ,γbase)+. (25)

We use Algorithm 25 for the numerical experiments. Note that we do not need to send a market order to know our expected future gain. {algorithm} PAst Sign Search (PASS) for (RL) optimal placement problem \[email protected]@algorithmic[1] \StateAlgorithm parameters: step size (γo)n0(0,1], number of episodes n \Statexinitial guess q0, past error value Epast \StatexInitialise γ^0=γ0o \Forepisode in n \StateSelect initial state X0 \Foreach step within episode \StateTake the action stay in the order book \StateObserve the new order book state Xnext \Fora {0,1} \Ifthe first visit time to Xnext \Stateq0(X0,a)q0(X0,a)-γ^0(X0,a)ma(q0,Xnext,X0) \ElsIfma(q0,X0,Xnext)×Epast(X0,a)0 \Stateγ^0(X0,a)h(γ^0(X0,a),γo(X0,a)) \Stateq0(X0,a)q0(X0,a)-γ^0(X0,a)ma(q0,Xnext,X0) \ElsIfma(q0,X0,Xnext)×Epast(X0,a)<0 \Stateγ^0(X0,a)l(γ^0(X0,a),γo(X0,a)) \Stateq0(X0,a)q0(X0,a)-γ^0(X0,a)ma(q0,Xnext,X0) \EndIf\StateEpast(X0,a)ma(q0,Xnext,X0) \EndFor\StateX0Xnext \EndFor\StateSave the norm E of the vector Epast(t). \Ifthe average value of E over the last w=40 episodes is not reduced by p=5% \Stateγo(t)max(γo(t)/2,0.01) (this is done each w episodes) \EndIf\EndFor

Optimal execution of a large number of shares.

To solve this problem we use the same functions h and l considered in the previous problem, see (25). Then, we apply Algorithm C. In this problem, it is crucial to select actions according to the policy (23) in order to encourage exploration. The coefficient β¯ used by the agent to select its actions is taken constant equal to β¯=5. We consider the same policy for all the tested algorithms.
{algorithm} PAst Sign Search (PASS) for (RL) optimal execution problem \[email protected]@algorithmic[1] \StateAlgorithm parameters: step size (γo)n0(0,1], number of episodes n \Statexinitial guess q0, past error value Epast \StatexInitialise γ^0=γ0o \Forepisode in n \StateSelect the initial inventory Q0 \Fort{0,,nT-1} \StateObserve the new price state Snext and set X0=(t,Q0) \StateObserve the new price state Snext \Ifthe first visit time to X0 \Stateq0(X0)q0(X0)-γ^0(X0)m(q0,Snext,X0) \ElsIfm(q0,Snext,X0)×Epast(X0)0 \Stateγ^0(X0)h(γ^0(X0),γo(X0)) \Stateq0(X0)q0(X0)-γ^0(X0)m(q0,Snext,X0) \ElsIfm(q0,Snext,X0)×Epast(X0)<0 \Stateγ^0(X0)l(γ^0(X0),γo(X0)) \Stateq0(X0)q0(X0)-γ^0(X0)m(q0,Snext,X0) \EndIf\StateEpast(X0)m(q0,Snext,X0) \StateSelect an action A and observe Qnext \StateQ0Qnext \EndFor\StateSave the norm E of the vector Epast(t). \Ifthe average value of E over the last w=300 episodes is not reduced by p=1% \Stateγo(t)max(γo(t)-0.01,0.01) (this is done each w episodes) \EndIf\EndFor

Appendix D Proof of Proposition 1

Proof of Proposition 1.

Let z𝒵. Standard uniform convergence results ensure that

𝔼[supq|g(q)-gn(q)|(z)|n(z)]c1n(z)1,a.s.

with c>0 a positive constant. Since the Markov chain (Zn)n1 is irreducible and the set 𝒵 is finite, the sequence (Zn)n1 is positive recurrent and we have

n(z)n=k=1n𝟏Zk=znnμ[Zn=z]>0=p(z),a.s,

with μ the unique invariant distribution of (Zn)n1. Thus, we have

un(z)=𝔼[nn(z)1]n1p(z)>0.

This shows that un(z) is bounded by a constant called u(z) and ensures that

𝔼[supq|g(q)-gn(q)|(z)]c1(z)1n,

with c1(z)=cu(z). The constant c1 can be taken independent of z since 𝒵 is finite. ∎

Appendix E Proof of Proposition 2

Proof of Proposition 2.

Let z𝒵. We follow the same approach used in the proof of Proposition 1 to get

vn(z)=𝔼[(log(n¯(z))/log(n)n¯(z)/n)β]𝔼[(1n¯(z)/n)β]n1p(z)β>0.

This shows that vn(z) is bounded by a constant v(z) and ensures that

𝔼[supq|g(q)-gn(q)|(z)]c2(z)1n,

with c2(z)=cv(z). The constant c2 can be taken independent of z since 𝒵 is finite. Using the same manipulations of (9), we complete the proof. ∎

Appendix F Proof of Proposition 3

Proof of Proposition 3.

Let k0, A the set A={Zk=z} and m(qk)𝒵 such that m(qk)(z)=m(qk,Xk+1(z),z) for any z𝒵. We split the proof in three steps. In each one of these steps, we prove (12) for a given algorithm.

Step (i):

In this step, we prove (12) for Algorithm Algorithm 1 (RL).. We have

𝟏A𝔼k[(qk+1-q*)2(z)] =𝟏A𝔼k[(qk-γkm(qk)-q*)2(z)]
=𝟏A{(qk-q*)2(z)-2γk𝔼k[m(qk)](qk-q*)(z)=(i)+γk2𝔼k[m(qk)2(z)]=(ii)}.

Using Assumption 3 and 𝔼k[m*]=0, we get (i)-2Lγk|qk-q*|2(z). Since 𝔼k[m*]=0, Assumption 4 gives

(ii) =𝔼k[(m(qk)-𝔼k[m(qk)])2(z)]+(𝔼k[m(qk)-m*])2(z)
B(1+vk)+B(1+(qk-q*)2(z)).

Thus, we deduce that

𝟏A𝔼k[(qk+1-q*)2(z)] 𝟏A{(1-2γkL+Bγk2=-p(γk))(qk-q*)2(z)+γk2(z)B(2+vk)=Mk},

which shows (12) for Algorithm Algorithm 1 (RL)..

Step (ii):

Here we show (12) for Algorithm Algorithm 2 (SAGA).. Let M¯k(z)=(j=1MMk[z,j])/M. Using 𝔼k[m*]=0 and 𝔼k[Mk[z,i]]=M¯k(z), we have

𝟏A𝔼k[(qk+1-q*)2(z)]   =𝟏A{(qk-q*)2(z)+2(𝔼k[qk+1]-qk)(qk-q*)(z)+𝔼k[(qk+1-qk)2(z)]}
  =𝟏A{(qk-q*)2(z)-2(γk𝔼k[m(qk)-m*])(qk-q*)(z)
    +γk2𝔼k[(m(qk)(z)-Mk[z,i]+M¯k(z))2]}
 Assumption 3 𝟏A{(1-2Lγk)(qk-q*)2(z)+γk2𝔼k[(m(qk)(z)-Mk[z,i]+M¯k(z))2]=(1)}. (26)

We first dominate the term (1). Since 𝔼k[m*](z)=0 and

𝔼k[(Mk[z,i]-m*)2(z)]=1/Mj𝔼k[(Mk[z,j]-m*(z))2],

we have

(1) =𝔼k|(m(qk)-𝔼k[m*])(z)-(Mk[z,i]-m*(z))+(1/Mj(Mk[z,j]-m*(z)))|2
3[𝔼k[m(qk)-m*]2(z)+𝔼k[(Mk[z,i]-m*(z))2]+𝔼k[(1/Mj(Mk[z,j]-m*(z))]2]
Jensen’s inequality3[𝔼k[(m(qk)-m*)2(z)]+𝔼k[(Mk[z,i]-m*)2(z)]+1/Mj𝔼k[(Mk[z,j]-m*(z))2]]
=3[𝔼k[(m(qk)-𝔼k[m(qk)])2(z)]+(𝔼k[m(qk)-m*](z))2+2/Mj𝔼k[(Mk[z,j]-m*(z))2]]
 Assumption 43[B(2+vk+(qk-q*)2(z))+2/Mj𝔼k[(Mk[z,j]-m*(z))2]]
=3B(2+vk)+3B(qk-q*)2(z)+6/Mj𝔼k[(Mk[z,j]-m*(z))2]]. (27)

By combining (26) and (27), we get

𝟏A𝔼k[|qk+1-q*|2(z)] 𝟏A{(1-2γk(z)L+3Bγk2(z))(qk-q*)2(z)
+6/Mj𝔼k[(Mk[z,j]-m*(z))2]+3Bγk2(vk+2)(z)}. (28)

Moreover, we have

𝟏A1/M𝔼k[j=1M(Mk+1[z,j]-m*(z))2] =𝟏A{1M𝔼k[(m(qk)-m*)2(z)]
  +(1-1M)1Mj=1M𝔼k[(Mk[z,j]-m*(z))2]}
𝟏A{BM(1+(qk-q*)2(z))
  +(1-1M)1Mj=1M𝔼k[(Mk[z,j]-m*(z))2]}. (29)

Thus using (28) and (29), we conclude

𝟏A𝔼k[ek+1(z)] 𝟏A{(1-2γkL+3Bγk2+BMc)=α1c(qk-q*)2(z)
+(1-1M+6γk2(z)c)α21Mj𝔼k[(Mk[z,j]-m*(z))2]}+3Bγ2k(vk+2)(z)
𝟏Aαek(z)+3Bγk2(vk+2)(z),

with α=max(α1,α2)[0,1).

Step (iii):

In this final step, we show (12) for Algorithm Algorithm 3 (PASS).. We have

𝟏A𝔼k[(qk+1-q*)2(z)] =𝟏A𝔼k[(qk-γ^km(qk)-q*)2(z)]
=𝟏A{(qk-q*)2(z)-2𝔼k[γ^km(qk)](qk-q*)(z)=(i)+𝔼k[γ^k2(m(qk))2(z)]=(ii)}.

For the term (i), using Assumption 3 and 𝔼k[m*]=0, we have (i)-2ckγ¯k(qk-q*)2(z) with ck=𝔼k[γ^km(qk)(z)]γ¯k𝔼k[m(qk)(z)]. Using Assumption 4 and 𝔼k[m*]=0, we get

(ii)=ck2γ¯k2𝔼k[m(qk)2(z)] =ck2γ¯k2(𝔼k[(m(qk)-𝔼k[m(qk)])2(z)]+(𝔼k[m(qk)-m*])2(z))
ck2γ¯k2(Bk(1+vk)+Bk(1+(qk-q*)2(z))).

Thus, we deduce that

𝟏A𝔼k[(qk+1-q*)2(z)] 𝟏A{(1-2ckγ¯kLk+Bk(ckγ¯k)2=-pk(ckγ¯k))(qk-q*)2(z)+ck2γ¯k2(z)Bk(2+vk)=Mk}. (30)

We write γ¯k for the quantity γ¯k=ckγ¯kγ¯k. Since γ^k[γk,r1γk], we have ckγ¯k[γk,r1γk]. When ckγ¯k]γk,γ¯k], we have pk(γ¯k)=pk(ckγ¯k)>pk(γk)0. When ckγ¯k]γ¯k,r1γk] (i.e ck1), we use the following canonical decomposition of the function pk:

pk(x)=Bk(x-γ¯k)2-(LkBk-1),

and γ¯k=γ¯k to get

|pk(ckγ¯k)-pk(γ¯k)|=|pk(ckγ¯k)-pk(γ¯k)|Bk(ckγ¯k-γ¯k)2Bk(r1γk-γk)2=γk2d1,

with d1=(r1-1)2Bk. Thus, using (30), we conclude

𝟏A𝔼k[(qk+1-q*)2(z)] 𝟏A{(1-pk(γ¯k)+γk2d1𝟏ck1)(qk-q*)2(z)+ck2γ¯k2(z)Bk(2+vk)=Mk}.

This completes the proof. ∎

Appendix G Proof of Theorem 1

G.1 Preparation for the proof of Theorem 1

We introduce the following notations. Let j and (μn)n1 a real sequence, we write (μnj)n1 for the delayed sequence μnj=μj+n for any n1. Additionally, we define recursively the sequence (anμ)n1 as follows:

a1μ=1, and   an+1μ=μnl=1nan+1-lalμ,n1. (31)
Lemma 1.

By convention, an empty sum is equal to zero. Let (vn)n1 be the sequence defined as follows:

vn=ϵn+μn(j=1n-1an-jvj),n1,

where (ϵn)n0 and (μn)n0 are two real sequences and j0aj=1. Then, we have

vn=j=1nan+1-jμjϵj,n1. (32)
Proof of Lemma 1.

Let us prove the result by induction on n1. By definition, Equation (32) is satisfied for n=1. By applying the induction hypothesis (32) to all jn, we get

vn+1=1×ϵn+1+μn+1(j=1nan+1-jvj) =a1μn+1ϵn+1+μn+1(j=1nan+1-jl=1jaj+1-lμlϵl)
=a1μn+1ϵn+1+[l=1nμn+1(j=lnan+1-jaj+1-lμl)ϵl]
=a1μn+1ϵn+1+[l=1nμn+1(j=1n-l+1an-l+2-jajμl)ϵl]
=a1μn+1ϵn+1+l=1nan+2-lμlϵl=j=1n+1an+2-jμlϵj.

Lemma 2.

Let nN, (μn)n0 be a positive sequence, (anμ)n0 be the sequence defined in (31) and rn=1-μn.

  • When n0rn=+, we have

    anμn0. (33)
  • When n0rn=+ and there exists β[0,1) such that n0βnan<, we have

    βnanμn0. (34)
Proof of Lemma 2.

We need to introduce the following notations. Let n* and (un)n1 be a real sequence. We define respectively the translation operator T1 and the aggregation operator T2 as follows T1(u)n=un-1𝟏n2 and T2(u)n=(l1alul)𝟏n=1. We denote by (𝐰k)k1 the sequence

𝐰n1=ϵ1𝟏n=1,𝐰k+1=(T1+μk+1T2)(𝐰k),k1,n1. (35)

We will assume that |ϵ1|1 since it is not restrictive. Our construction ensures that vk+1=𝐰1k+1=μk+1T2(𝐰k)1. In the sequel, we first prove (33) then show (34).

Step (i):

Here, we demonstrate (33). We first handle the case where a1=1. In such case, we have

vn=μnvn-1=(i=2nμi)v1=(i=2n(1-ri))v1e-l=2nrlv1n0,

since l2rl=+. Now we place ourselves in the case where a1<1. Using (35), we have 𝐰n=[l=1n-1(T1+μl+1T2)]𝐰1 which gives

vn =μnT2[l=1n-1(T1+μl+1T2)](𝐰1)1
=μnT2[𝐤=(k1,,km)𝒫n-1𝐢=(i1,,im){1,2}mμ¯𝐢𝐤TimkmTi1k1](𝐰1)1,

with 𝒫n-1={𝐤=(k1,,km)(*)m;s.tk1++km=n-1} the set containing all the partitions of (n-1) and μ¯𝐢𝐤=l𝒜𝐢𝐤μl. To define 𝒜𝐢𝐤 we first note that each index l{1,,n-1} belongs to a group pl (i.e. r=1plkpl<r=1pl+1kr), then we write 𝒜𝐢𝐤={l{1,,n-1};s.t.ipl=2} for the indexes where we select the operator T2 instead of T1.

Let us take ϵ>0 and m*. We introduce here the integers n0, n1 and n2. Using that 𝐰1=supl1|𝐰l1| is finite, we have T2T1m(𝐰1)1=lm+1al𝐰l1(lm+1al)𝐰1. Since l1al<, there exists n0 such that T2T1m(𝐰1)1ϵ2 when mn0. We write a¯ for the quantity a¯=maxln0al. Given that a1<1 and l1al=1, we have a¯<1. Thus there exists n1 such that a¯mϵ2 for all mn1. We recall that l0rl=+. Hence, there exists n2 such that e-(l=0mrl)+n0(n1+1)ϵ2 when mn2.

We take now nn3=max(n0(n1+1),n2) and denote by kT1=#{l,l𝒜𝐢𝐤} the integer that counts the number of times the operator T1 is chosen. When kT1n0(n1+1), we have

μ¯𝐢𝐤=l𝒜𝐢𝐤μl=l𝒜𝐢𝐤(1-rl)e-(l𝒜𝐢𝐤rl)=e-(l=0nrl)+(l𝒜𝐢𝐤rl)e-(l=0nrl)+n0(n1+1), (36)

since rl1. Using that nn2, we get μ¯𝐢𝐤ϵ2.

When kT1>n0(n1+1), we differentiate two cases. First, if there exists r such that krn0 and ir=1. In such case, since T2(u)T(u) with T the operator T(u)n=u𝟏n=1, we have

T2(TimkmT2kr+1T1krTi1k1)(𝐰1)1 =T2(TimkmT2kr+1T1n0T1kr-n0Ti1k1)(𝐰1)1
ϵ2T2(TimkmTkr+1T1kr-n0Ti1k1)(𝐰1)1. (37)

Second, when we cannot find such r, then necessarily

kT2nn0-1n1, (38)

with kT2=(n-1)-kT1 represents the number of times the operator T2 is selected. To get Inequality (38) one may consider the worst case where

kr={n0-1, when ir=1,1, when ir=2,r1.

This case provides the minimum possible value for kT2 under the constraint kr<n0 whenever ir=1. Since the initial value 𝐰1 is 𝐰n1=ϵ1𝟏n=1 and kl<n0 whenever il=1, the coefficients al with l>k0 never appear in the product

(TimkmTi1k1)𝐰1.

Thus, using basic inequalities, we get

T2(TijkjTi1k1𝐰1)a¯T(TijkjTi1k1𝐰1),jm, (39)

with TijkjTi1k1𝐰1 a sub-product of TimkmTi1k1𝐰1. The inequalities (39) and a¯kT2a¯n1ϵ2 ensure

T2(TimkmTi1k1)(𝐰1)1a¯kT2T2(Ti¯mkmTi¯1k1)(𝐰1)1ϵ2T2(Ti¯mkmTi¯1k1)(𝐰1)1, (40)

with i¯m=im𝟏im=1+𝟏im=2. By combining the inequalities (37), (40), (LABEL:Eq:step_1_proof_eq_3) and T2(u)T(u), we conclude

vn ϵ2[kT1n0(n1+1)(TimkmTi1k1)(𝐰1)1]+ϵ2[kT1>n0(n1+1)(Ti¯mkmTi¯1k1)(𝐰1)1]
ϵ2[l=1n-1(T1+μl+1T2)](𝐰1)1+ϵ2[l=1n-1(T1+μl+1T)](𝐰1)1ϵ2+ϵ2ϵ.

which proves (33) since |ϵ1|1.

Step (ii):

Let us prove (34). Note that (hn)n0=(βnanμ)n0 verifies

h0=0,hn+1=μn+1l=0n(βn+1-lan+1-l)=gn+1-lhl,n0,

with (gn)n0 the sequence defined such that gn=βnan. Thus a straightforward application of (33) to the sequence (gn)n0 gives (34). This completes the proof. ∎

G.2 Propagation of the error

As a first step, we consider the case where the process (Zk)k1 is Markov. For any z1𝒵, we write τ𝒵10=inf{l>0,Zl=z1} for the first visit time of (Zk)k1 to the state z1. Moreover, we define the sequence (ak)k0 such that ak(z1)=[τ𝒵10k|0]. We have the following result.

Proposition 5.

Let (Zk)k1 be a Makov process, z1Z and nN*. Under Assumptions 1, 3, 4 and 5, we have

𝔼0[en(z1)]ϵn+j=1n-1an-jα¯j𝔼0[ej(z1)],

with ϵn=e1(z1)an(z1)+j=1n-1an-jMj and α¯j=E0[αjej(z1)]E0[ej(z1)]1E0[ej(z1)]0. The variables αj and Mj are given by (13) and (14).

Proof of Proposition 5.

Let (z,z1)𝒵2 and n*. Using the last-exit decomposition, see Section 8.2.1 in [meyn2012markov], we have

𝔼z[en(z1)] =𝔼z[en(z1)𝟏τ𝒵10n]+j=1n-1𝔼z[en(z1)𝟏{τ𝒵10n-j,Zj=z1}]
=𝔼z[e1(z1)𝟏τ𝒵10n]+j=1n-1𝔼z[ej+1(z1)𝟏{τ𝒵10n-j,Zj=z1}]
Proposition 3𝔼z[e1(z1)𝟏τ𝒵10n]+j=1n-1𝔼z[(αjej(z1)+Mj)𝟏{τ𝒵10n-j,Zj=z1}]
=𝔼z[e1(z1)𝟏τ𝒵10n]+j=1n-1𝔼z[(αjej(z1)+Mj)𝟏{Zj=z1}𝔼[𝟏τ{𝒵10n-j}|j]]
e1(z1)𝔼z[𝟏τ𝒵10n]+j=1n-1z1[τ𝒵1n-j0]𝔼z[(αjej(z1)+Mj)]
=ϵn+j=1n-1an-jα¯j𝔼z[ej(z1)],

with aj(z1)=z1[τ𝒵10j], τ𝒵10=inf{l>0,Zl=z1} and αn and Mn defined in Proposition 3. In the second equality, we use that en(z1) does not change as long as the state z1 is not reached. This completes the proof. ∎

We turn now to the general case. However, we use the same kind of arguments needed in the Markov setting. Let n* and (z1,,zn)𝒵n. For any i{1,,n}, we denote by Ai the set Ai={Z1=z1,,Zi=zi}. Moreover, we write τ𝒵1n=inf{l>0,Zl+n=z1} and akn=[τ𝒵1n-kk|n-k]. We have the following result.

Proposition 6.

Under Assumptions 1, 3, 4 and 5, we have

𝔼0[en(z1)𝟏An]ϵn+j=1n-1an-jnα¯j𝔼0[ej(z1)𝟏An],

with ϵn=e1(z1)an0(z1)+j=1n-1an-jnMj and α¯j=E0[αjej(z1)1An]E0[ej(z1)1An]1E0[ej(z1)1An]0. The variables αj and Mj are given by (13) and (14).

We do not provide the proof of Proposition 6 since it is similar to the one of Proposition 5.

G.3 Proof of Theorem 1

Proof of Theorem 1.

We split the proof in two steps. In Step (i), we show (15) and then in Step (ii) we show (16).

Step (i):

In this part, we first prove (15) when the space 𝒵 is finite and (Zk)k0 is Markov. Then, we show (15) when (Zk)k0 is no more Markov but 𝒵 is still finite. Finally, we explain how to extend the proof to the general case.

Sub-step (i-1):

Here we demonstrate (15) when the space 𝒵 is finite and (Zk)k0 is Markov. Let z1𝒵 and n*. Using Proposition 5, the sequence vn(z1)=α¯n(z1)𝔼[en(z1)] verifies

vn(z1)ϵ¯n(z1)+α¯n(z1)j=1n-1an-j(z1)vj(z1),

with ϵ¯n(z1)=α¯n(z1)ϵn(z1). For clarity and since there is no ambiguity, we forget the dependencies to z1 in the rest of the proof. Thus, using Lemma 1, we get

vn=j=0nan-jμjϵ¯j,

with (anμ)n0 defined in (31). We can assimilate the sequence ϵ¯ to the measure μ=k0ϵ¯kδk with δk the Dirac measure at k. Since ϵk=e1(z1)ak+k=1n-1an-k𝔼[Mk], kak<, k0𝔼[γk2]< and 𝔼[Mk]=O(𝔼[γk2]) for all the algorithms, we get k0ϵk< using the properties of the Cauchy product between two sequences. This property ensures that the measure μ has a finite mass. Additionally, using Lemma 2, we have anμjn0 for any j0. Thus, the dominated convergence theorem ensures that vnn0.

Sub-step (i-2):

Let us prove (15) when the space 𝒵 is finite and (Zk)k0 is not Markov. In such case, one can show using the same methodology above and Proposition 6 that for any sequence 𝐳=(zk)k*𝒵 and set A𝐳n={Z1=𝐳1,,Zn=𝐳n}, the quantity u~n=𝔼[en(z1)𝟏A𝐳n] converges towards 0. Since 𝔼[en𝟏A𝐳n]𝔼[e0𝟏A𝐳n], we use the dominate convergence theorem to get

limn𝔼[en]=limn𝔼[𝐳Sen𝟏A𝐳n]=Fubinilimn𝐳S𝔼[en𝟏A𝐳n]=dominate convergence𝐳Slimn𝔼[en𝟏A𝐳n]=0,

where S is the set of sequences valued in 𝒵 and A𝐳n={Z1=𝐳1,,Zn=𝐳n} for any 𝐳𝒮.

Sub-step(i-3):

In this third step, we show (15) when the space 𝒵 is countable and (Zk)k0 is not Markov. Let ϵ>0. Since E1=k1𝔼[e1(zk)]<, there exists k0 such that

kk0𝔼[e1(zk)]<ϵ2.

We write Ak0 for the set Ak0={zk,kk0}. Since Ak0 is finite, we use Sub-step(i-2) to show the existence of k1 such that

kk0𝔼[ek1(zk)]<ϵ2,kk1,zkAk0.

We take now kk1. Using (𝔼[el(z)])l1 is non-increasing for any z𝒵, we get Ek=kk0𝔼[ek(zk)]+k<k0𝔼[ek(zk)]ϵ2+ϵ2=ϵ.

Step (ii):

In this step, we show (16). Since γk(z)=γ𝟏{Zk=z} for all (k,z)×𝒵, the quantity αk is constant (i.e αk=α). Let M=max(supnMn,e1(z1)). Using Proposition 5 and a direct induction, we get un=𝔼[en(z1)]c with c solution of the equation c=M+αc (i.e. c=M1-α). We define (mn)n0 as follows:

mn=M+αj=1nan+1-jmj,n0.

By direct induction, we have unmn for all n0. Using a direct induction again, we check that

mnM1-αn1-α=(1-αn)c,n0.

This completes the proof. ∎

Appendix H Proof of Proposition 4

Proof of of Propostion 4.

For simplicity, we prove the result for Algorithm 1 however the same arguments hold for Algorithm 3. Let us prove by induction on n that

eγneγ~n,a.s,γ~Γ. (41)

For n=0, Inequality (41) is directly satisfied since the initial error e0 does not depend on the choice of the learning rate. Let γ~Γ. We assume now that eγneγ~n, a.s. Using Equation (17) and the definition of (γn)n0, we have

eγn+1=g(eγn)+Sn,

with g(x)=x-L2x22B(x+(2+vn)). The previous expression of g is obtained by minimising the function y:(1-2Ly+By2)eγn+B(2+vn)×y2. This means that

g(x)(1-2Ly+By2)x+Mny2,y,x+. (42)

Since the function g is non-decreasing on +, eγneγ~n almost surely and g verifies Equation (41), we get

eγn+1=g(eγn)+Sn g(eγ~n)+Sn(1-2Lγ~n+Bγ~n2)eγn+Mnγ~n2+Sn=eγ~n+1,a.s,

which completes the proof. ∎

Appendix I Proof of the convergence of (vnk)n1,k(*)2

In this section we prove the following result.

Proposition 7.

The sequence (vnk)n1,k1 converges point-wise towards v on DT×Dq when n and k.

Proof of Proposition 7.

We prove this result in three steps. First, we show that v can be approximated by a numerical scheme v¯k. Then, we replace v¯k by another scheme vk that also converges towards v. Finally, we show that vnk tends to vk when n.

Step (i):

We start with our initial control problem where the agents may choose its trading speed at any time. It was studied by many authors, see for example [MFGTradeCrow2016], who show that the optimal trading speed verifies

ν(t,q)=h1(t)-qh2(t)2κ, (43)

with h1:[0,T] and h2:[0,T]+ a positive function. This means that the optimal inventory follows:

Q(t)=h1(t)-Qh2(t)2κ,t[0,T],

with Q(0)=q0 given. Thus Q verifies

Q(t)=q0e-0th2(s)2κ𝑑s+e-0th2(s)2κ𝑑s0th1(t)2κe-0sh2(u)2κ𝑑u𝑑s. (44)

Using Equation (44) and basic inequalities, one can show that if we take a large enough q¯+ and place ourselves in Sq=[-q¯,q¯] then

Q(t)Sq,

when Q(0)Sq. Hence, we rewrite the dynamic programming principle as follows:

v(t,q)=supνA¯(t,q)𝔼[Mt+Δt-ϕtt+ΔQs2ds+v(t+Δ,Qt+Δ)|Qt=q]. (45)

where A¯(t,q)Sq is the set of admissible actions77 7 We only allow controls that lead to states where the inventory stays in Sq.. We can focus on the set A¯(t,q) instead of since other controls are not optimal. Now, we approximate this problem in a classical way using the numerical scheme v¯k defined such that

v¯k(nt,nq)=supνDa𝔼[Mnt+1nt-ϕQntΔt2Δ+v¯k(nt+1,nq+1ν)|QntΔt=nqΔq],(nt,nq)DT×Dq,

with Mnt+1nt=MΔt(nt+1)ntΔ, nq+1ν the index such that Q(nt+1)Δtν=nq+1νΔq and A¯=A{iΔq,i}. The convergence of (v¯k)k(*)2 towards v on the set DT×Dq when k is standard.

Step (ii):

We denote by vk the numerical scheme

vk(nt,nq)=𝔼[supνDa{Mnt+1nt-ϕQntΔt2Δ+vk(nt+1,nq+1ν)}|QntΔt=nqΔq],(nt,nq)DT×Dq,.

Let us show that v¯k and vk have the same limit. For this, we use a backward recurrence. For the moment, we assume that |supν𝔼[Mnt+1nt]-𝔼[supνMnt+1nt]|KΔt2 and we will prove it at the end of Step (ii). We want to show that |v¯k(nt,nq)-vk(nt,nq)|K(T-t)Δt for all (nt,nq)DT×Dq. At the terminal time v¯k and vk coincide. We move now to the induction part. We have

|v¯k(nt,nq)-vk(nt,nq)| =|supν𝔼[Mnt+1nt-ϕq2Δt+v¯k(nt+1,nqt+1ν)]
  -𝔼[supνMnt+1nt-ϕq2Δt+vk(nt+1,nqt+1ν)]|
v¯ and v are not random|supν𝔼[Mnt+1nt]-𝔼[supνMnt+1nt]|+|supν{v¯k(nt+1,nqt+1ν)-vk(nt+1,nqt+1ν)}|
𝔼[supν{KΔt2+K(T-t-Δt)Δt}]
=K(T-t)Δt.

In the third inequality, we use the induction assumption to complete the proof. Now, let us show that |supν𝔼[Mnt+1nt]-𝔼[supνMnt+1nt]|KΔt2. We write t=ntΔt, ΔQ=Qt+Δt-Qt=νΔt, ΔS=St+Δt-St and ΔS¯=tt+Δt(Ss-St)𝑑s. Thus, we have

Mnt+1nt =(Wt+Δ-Wt)+(Qt+ΔSt+Δ-QtSt)
=-νΔS¯-ΔQSt-κν2Δt+QtΔS+ΔQSt+ΔQΔS
=-νΔS¯-κν2Δt+QtΔS+νΔtΔS. (46)

The above equation shows supνMnt+1nt=(ΔtΔS-ΔS¯)24κΔt+QtΔS. Using 𝔼[ΔS]=αΔt and

𝔼[ΔS¯]=tt+Δα(s-t)𝑑s=αΔt2/2,

we get

supν𝔼[Mnt+1nt]=α2Δt316κ+αQtΔt, and 𝔼[supνMnt+1nt]=α2Δt316κ+σ2Δt212κ+αQtΔt.

Thus we deduce that |supν𝔼[Mnt+1nt]-𝔼[supνMnt+1nt]|KΔt2 with K=σ212κ.

Step (iii):

Theorem 1, proves that vnk converges towards vk. Thus by composition we have vnk converges point-wise towards v when n and k which completes the proof. ∎

References