Improving reinforcement learning algorithms: towards optimal learning rate policies

  • 2019-11-06 11:17:53
  • Othmane Mounjid, Charles-Albert Lehalle
  • 1

Abstract

This paper investigates to what extent we can improve reinforcement learningalgorithms. Our study is split in three parts. First, our analysis shows thatthe classical asymptotic convergence rate $O(1/\sqrt{N})$ is pessimistic andcan be replaced by $O((\log(N)/N)^{\beta})$ with $\frac{1}{2}\leq \beta \leq 1$and $N$ the number of iterations. Second, we propose a dynamic optimal policyfor the choice of the learning rate $(\gamma_k)_{k\geq 0}$ used in stochasticalgorithms. We decompose our policy into two interacting levels: the inner andthe outer level. In the inner level, we present the PASS algorithm (for "PAstSign Search") which, based on a predefined sequence $(\gamma^o_k)_{k\geq 0}$,constructs a new sequence $(\gamma^i_k)_{k\geq 0}$ whose error decreasesfaster. In the outer level, we propose an optimal methodology for the selectionof the predefined sequence $(\gamma^o_k)_{k\geq 0}$. Third, we show empiricallythat our selection methodology of the learning rate outperforms significantlystandard algorithms used in reinforcement learning (RL) in the three followingapplications: the estimation of a drift, the optimal placement of limit ordersand the optimal execution of large number of shares.

 

Quick Read (beta)

Improving reinforcement learning algorithms:
towards optimal learning rate policies

Othmane Mounjid , Charles-Albert Lehalle École Polytechnique, CMAPCapital Fund Management, Paris and Imperial College, London
November 30, 2019
Abstract

This paper investigates to what extent we can improve reinforcement learning algorithms. Our study is split in three parts. First, our analysis shows that the classical asymptotic convergence rate O(1/N) is pessimistic and can be replaced by O((log(N)/N)β) with 12β1 and N the number of iterations. Second, we propose a dynamic optimal policy for the choice of the learning rate (γk)k0 used in stochastic algorithms. We decompose our policy into two interacting levels: the inner and the outer level. In the inner level, we present the Algorithm 3 (PASS). algorithm (for “PAst Sign Search”) which, based on a predefined sequence (γko)k0, constructs a new sequence (γki)k0 whose error decreases faster. In the outer level, we propose an optimal methodology for the selection of the predefined sequence (γko)k0. Third, we show empirically that our selection methodology of the learning rate outperforms significantly standard algorithms used in reinforcement learning (RL) in the three following applications: the estimation of a drift, the optimal placement of limit orders and the optimal execution of large number of shares.

\usetikzlibrary

fit,calc,positioning,trees \usetikzlibraryarrows \tikzstylelevel 1=[level distance=4cm, sibling distance=2.5cm] \tikzstylelevel 2=[level distance=8cm, sibling distance=0.6cm] \tikzstylebag = [text width=4em, text centered] \tikzstyleend = [circle, minimum width=3pt,fill, inner sep=0pt] \usetikzlibrarydecorations.pathmorphing \usetikzlibrarydecorations.pathreplacing \usetikzlibrarydecorations.shapes \usetikzlibrarydecorations.text \usetikzlibrarydecorations.markings \usetikzlibrarydecorations.fractals \usetikzlibrarydecorations.footprints \algnewcommand\Inputs[1]\StateInputs: \Statex #1 \algnewcommand\Initialize[1]\StateInitialize: \Statex #1

 

Improving reinforcement learning algorithms:
towards optimal learning rate policies


  Othmane Mounjidthanks: École Polytechnique, CMAP , Charles-Albert Lehalle thanks: Capital Fund Management, Paris and Imperial College, London

\@float

noticebox[b]\[email protected]

1 Introduction

We consider a discrete state space 𝒵= or 𝒵={1,,d} with d*. We are interested in finding q*𝒬=𝒵 solution of

M(q,z)=𝔼[m(q,X(z),z)]=0,z𝒵, (1)

with X(z)𝒳 a random variable with an unknown distribution and m a function from 𝒬×𝒳×𝒵 to 𝒬. Although the distribution of X(z) is unspecified, we assume that we can observe some variables (Zn)n1 valued in 𝒵 and (Xn(Zn))n1 drawn from the same distribution of X(Zn). Reinforcement learning (RL) addresses this problem through the following iterative procedure:

qn+1(Zn)=qn(Zn)-γn(Zn)m(qn,Xn+1(Zn),Zn), (2)

where q0 is a given initial condition and each γn is a component-wise non-negative vector valued in 𝒵. The connection between RL, problem (1) and Algorithm (2) is detailed in Section 2. It is possible to recover the classical SARSA, Q-learning and double Q-learning algorithms used in RL by taking a specific expression for m and Xn+1. Note that Algorithm (2) is different from the standard Robbins-Monro (RM) algorithm used in stochastic approximation (SA)

qn+1=qn-γnm¯(qn,Xn+1), (3)

with m¯(q,x)(z)=m(q,x(z),z) and γn0, mainly because, as it is frequent in RL, we do not observe the entire variable (Xn+1(z))z𝒵) but only its value according to the coordinate Zn. Indeed, the way (Zn)n1 visits the set 𝒵 plays a key role in the convergence of Algorithm (2) which is not the case of Algorithm (3). RM algorithm was first introduced by Robbins and Monro in [robbins1951stochastic]. After that, it was studied by many authors who prove the convergence of qn towards q*, see [benveniste1987and, bertsekas1996neuro, blum1954approximation, kushner1978stochastic]. The asymptotic convergence rate has also been investigated in many papers, see [benveniste1987and, kushner2003stochastic, sacks1958asymptotic]. They show that this speed is in general proportional to 1/N with N the number of iterations.

In this work, we give a special focus to RL problems. Nowadays RL cover a very wide collection of recipes to solve control problems in an exploration-exploitation context. This literature started in the seventies, see [watkins1989learning, werbos1987building], and became famous mainly with the seminal paper of Sutton, see [sutton1998introduction]. It largely relied on the recent advances in the control theory developed in the late 1950s, see [bellman1959functional]. The key tool borrowed from this theory is the dynamic programming principle satisfied by the value function. This principle enables us to solve control problems numerically when the environment is known and the dimension is not too large. To tackle the curse of dimensionality, recent papers, see [sirignano2018dgm], use deep neural networks (DNN). For example, in [hutchinson1994nonparametric], authors use DNN to derive optimal hedging strategies for finance derivatives and in [manziukAlgorithmicTradingReinforcement2019] they use a similar method to solve a high dimensional optimal trading problem. To overcome the fact that environment is unspecified, it is common to use RM algorithm which estimates on-line quantities of interest. The combination of control theory and SA gave birth to numerous papers on RL.

Our contributions are as follows. We first conduct an error analysis to show that the classical asymptotic rate O(1/N) is pessimistic and can be enhanced in many situations. It is indeed possible to get a O((log(N)/N)β) asymptotic speed with 1/2β1 and N the number of iterations. Then, we present our main result. It consists in proposing a dynamic policy for the choice of the step size (γk)k0 used in (3). Our policy is decomposed into two interacting levels: the inner and the outer level. In the inner level, we propose the Algorithm 3 (PASS). algorithm, for “PAst Sign Search”. This algorithm builds a new sequence (γki)k0, using a predefined sequence (γko)k0 and the sign variations of m(qn,Xn+1(Zn),Zn). The error of (γki)k0 decreases faster than the one of (γko)k0. In the outer level, we propose an optimal methodology for the construction of a piece-wise constant predefined sequence (γko)k0. These two levels are interacting in the sense that Algorithm 3 (PASS). influences the construction of (γko)k0. Finally, we show that our selection methodology provides better convergence results than standard RL algorithms in three numerical examples: the drift estimation, the optimal placement of limit orders and the optimal execution of a large number of shares.

The structure of this paper is as follows: Section 2 describes the relation between RL and SA. Section 3 reformulates the problem (1) in terms of a minimisation issue and defines with accuracy the different sources of error. This enables us to exploit the most recent convergence results for each source of error to show that the slow convergence speed O(1/N) can be replaced by O((log(N)/N)β) with 1/2β1 and N the number of iterations. Section 4 contains our main contribution. We start by defining the Algorithm Algorithm 3 (PASS)., comparing it with two other schemes and proving its convergence. Then, we describe our outer level policy and discuss its speed of convergence. Finally, we explain our selection methodology of the learning rate (γko)k0 which combines the Algorithm 3 (PASS). Algorithm with the outer level to enhance the convergence of (3). The last Section 5 provides numerical examples taken from the optimal trading literature: optimal placement of a limit order and the optimization of the trading speed of a liquidation algorithm. Proofs and additional results are relegated to an appendix.

2 Reinforcement learning

We detail in this section the relation between SA and RL since we are interested in solving RL problems. RL aims at estimating the Q-function which quantifies the value for the player to choose the action a when the system is at z. Let t be the current time, Ut𝒰 be a process defined on a filtered probability space (Ω,,t,) which represents the state of the system and At𝒜 the agent action at time t. We assume that the process (Ut,At) is Markov. The agent aims at maximizing

𝔼[0Tρsf(s,Us,As)𝑑s+ρTg(UT)], (4)

with g the terminal constraint, f the instantaneous reward, ρ a discount factor and T the final time. Let us fix a time step Δ>0 and allow the agent to take actions only at times11 1 We recall the following classical result: when Δ goes to zero, the value function and the optimal control of this problem converges towards the one where decisions are taken at any time. kΔ with k. The Q-function satisfies

Q(t,u,a)=supA𝔼A[tTρ(s-t)f(s,Us,As)ds+ρ(T-t)g(UT)|Ut=u,At=a],(t,u,a)+×𝒰×𝒜,

with A={At,t<T} a possible control process for the agent. We view the agent control A as a feedback process (i.e adapted to the filtration t). The Q-function satisfies the classical dynamic programming principle (DPP)

Q(t,u,a)=𝔼[Rt+Δ+ρΔsupa𝒜Q(t+Δ,Ut+Δ,a)|Ut=u,At=a], (5)

with Rt+Δ=tt+Δρ(s-t)f(s,Us,As)𝑑s. Equation (5) reads that the optimal expected gain when the agent starts at z and chooses action a at time t is the sum of the next expected reward Rt+Δ plus the value of acting optimally starting from the new position Ut+Δ at time t+Δ. By reformulating (5), we obtain that Q solves equation

𝔼[m(q,Xt+Δz,z)]=0,z=(t,u,a)𝒵=[0,T]×𝒰×𝒜, (6)

where Xt+Δz=(Ut+Δz,Rt+Δz)𝒳=𝒰×, Usz and Rsz are respectively the conditional random variables Us and Rs given the initial condition (Utz,Atz)=(u,a) with z=(t,u,a)𝒵 and m is defined as follows:

m(q,x,z1)=H(q,x,z1)-q(z1),H(q,x,z1)=r+ρΔsupa𝒜q(t1+Δ,u,a),

for any x=(u,r)𝒳 and z=(t1,u1,a1)𝒵. Thus, one can use stochastic approximation tools to solve (6).

Actions of the agent.

Note that Equation (6) shows that one can study Q only on the time grid22 2 Here, we take T=n*Δ with of n**. Such approximation is not restrictive. DT={nΔ,nT/Δ}. Thus, we define Ak and Uk such that Ak=AkΔ and Uk=UkΔ for any k. The key variable to study is not the agent decision Ak but Zk=(k,Uk,Ak). Thus, the rest of the paper formulates the results in terms of Zk only.

Moreover, it is important in practice to visit the space DT×𝒰×𝒜 sufficiently enough. Thus, to learn Q, it is common to not choose the maximising action33 3 The maximising action a* for a state u is defined such that a*=argmaxa𝒜q(u,a)., but to set the conditional distribution of the random variable Ak such that

[Ak=a|k]=eϵ¯k(Zka)aeϵ¯k(Zka),a𝒜, (7)

with Zka=(k,Uk,a), ϵ¯k(Zka)=β(Zka)aϵk(Uk+1Zka,a) and

ϵk(z)={|m(qrk(z),Xrk(z)+1(z),z)|when the state z already visited,botherwise,

where b>0 encourages the exploration, qk satisfies (2) and rk(z) is the last observation time of the state z. Beside, to give more importance to the maximizing action, one may consider the following policy:

[Ak=a|k]=eβk(Uk)qk(Zka)aeβk(Uk)qk(Zka),a𝒜. (8)

Any mixture of these two procedures can an also be considered.

3 Improvement of the asymptotic convergence rate

In [benveniste2012adaptive, Part 2, Section 4], [kushner2003stochastic, Section 10] and [kushner1978stochastic, Section 7], the authors show a central limit theorem for the procedure (3) which ensures a convergence rate of O(1/N) where N is the number of iterations. In this section, we extend such convergence rate to Algorithm (2) and aim at understanding how one can improve it. For this, we decompose our total error into two components: estimation error and optimization error.

3.1 Error decomposition

In this section, the space 𝒵={1,,d} is finite with d*. In such case, we view q and M(q) as vectors of 𝒵. Moreover, the process (Zn)n1 is Markov. We consider the following assumptions.

Assumption 1 (Existence of a solution).

There exists a solution q* of Equation (1).

Assumption 2 (Pseudo convexity).

The function M satisfies

M(q)-M(q),q-q0,q𝒵,q𝒵.

The above assumption is natural since the gradient of any real valued convex function f satisfies Assumption 2.

Assumption 3 (Pseudo strong convexity).

There exists L>0 such that

M(q)-M(q),q-qLq-q2,q𝒵,q𝒵.

Assumption 3 replaces the strong convexity condition. Under Assumption 1, the function q* is solution of the minimization problem

minq𝒬g(q). (9)

with g(q)=M(q)2.

Remark 1.

Note that, in the special case where M is the gradient of a given function f (i.e. f=M), the quantity q* minimises a convex and differentiable cost g~=zZE[L(q,X(z),z)] with

L(q,x,z)=01(q(z)-q*(z))×m(q*+r(q-q*),x,z)𝑑r,q𝒬,z𝒵,x𝒳.

and g~=M. Thus, we can replace g by g~ and all the results of this section hold. In the rest of this section, we use g instead of g~.

In our context we do not have a direct access to the distribution of X(z) because we are in the case of “on-line” RL. Nevertheless, we assume that at time n we keep memory of a training sample of n(z) independent variables (Xiz)i=1n(z) drawn from the distribution X(z) where n(z) is the number of times the Markov chain Zn visited z. We define qn as a solution of

minq𝒬gn(q), (10)

with gn(q)=Mn(q)2 and Mn(q)=𝔼n[m(q,X(z),z)]=(j=1n(z)m(q,Xj(z),z))/n(z) the expected value under the empirical measure μ=(j=1n(z)δXj(z))/n(z). We finally define qkn as an approximate solution of the problem (10) returned by an optimization algorithm after k iterations. Thus, we can bound the error g(qkn) by

0𝔼[(g(qkn)-g(q*))(z)]𝔼[(g(qn)-g(q*))(z)]estimation error+𝔼[|g(qn)-g(qkn)|(z)]optimization error,

since q* minimizes g.

3.2 Convergence rate of the estimation error

3.2.1 Slow convergence rate

We have the following result.

Proposition 1.

We assume that the Markov chain Zn is irreducible. There exists c1>0 such that

𝔼[supq(g(q)-gn(q))(z)]c11n.

The proof of this result is given in Appendix C. This result allows us to derive the following bound for the estimation error

𝔼[(g(qn)-g(q*))(z)] =𝔼[(g(qn)-gn(qn))(z)]+𝔼[(gn(qn)-gn(q*))(z)]0+𝔼[(g(q*)-gn(q*))(z)]
2𝔼[supq(g(q)-gn(q))(z)]2c11n. (11)

This bound is known to be pessimistic.

3.2.2 Fast convergence rate

We obtain the following fast statistical convergence rate.

Proposition 2.

Assume that the Markov chain Zn is irreducible and

𝔼[supq(g(q)-gn(q))(z)|n¯(z)]c(log(n¯(z))n¯(z))β, (12)

with 12β1, c>0 and n¯(z)=n(z)1. Then, there exists c2>0 such that

𝔼[(g(qn)-g(q*))(z)]c2(log(n)n)β.

The proof of this proposition is given in Appendix D. The condition (12) is established when

  • The loss function g satisfies regularity conditions, of which the most important are : Lipschitz continuity and convexity, see [bartlett2006convexity]. Moreover under the strong convexity assumption, the constant β is equal to 1.

  • The data distribution satisfies some noise conditions, see for instance [bartlett2006convexity, tsybakov2004optimal] in the pattern recognition problem.

  • The function m has a bounded moment α (i.e. 𝔼[mα(q)]<) with α>1, see [cortes2019relative].

3.3 Convergence rate of the optimization error

We turn now to the optimization error. This means that expected value in (9) is replaced by the empirical mean which is known. In such case, one can use many algorithms to find qn. We present in the table below the most important properties of some gradient methods. Note that the results of Table 1 remain valid when strong convexity is replaced by Assumption 3 and in general when convexity is replaced by Assumption 244 4 Assumption 3 is the analogous of the strong convexity condition and convexity can be replaced by Assumption 2 in many cases..

Algorithm Cost of one iteration Iterations to achieve an ϵ precision Time to reach an ϵ precision
Convex Strongly convex Convex Strongly convex
GD O(d2) O(1ϵ) O(log(1ϵ)) O(d2ϵ) O(d2log(1ϵ))
SGD O(d) O(1ϵ2) O(1ϵ) O(dϵ2) O(dϵ)
Proximal O(d) O(1ϵ) O(log(1ϵ)) O(dϵ) O(dlog(1ϵ))
Acc. prox. O(d) O(1ϵ) O(log(1ϵ)) O(dϵ) O(dlog(1ϵ))
SAGA O(d) O(1ϵ) O(log(1ϵ)) O(dϵ) O(dlog(1ϵ))
SVRG O(d) O(log(1ϵ)) O(dlog(1ϵ))
Table 1: Asymptotic properties of some gradient methods. Note that d is the dimension of the state space 𝒵 and ϵ is a desired level of accuracy. Here ϵ corresponds to 1/n. GD stands for Gradient Descent, SDG for Stochastic Gradient Descent, Proximal for Stochastic proximal gradient descent [combettes2011proximal, schmidt2011convergence], Acc. prox. for accelerated proximal stochastic gradient descent [nitanda2014stochastic, schmidt2011convergence], SAGA for Stochastic accelerated gradient approximation [defazio2014saga], and SVRG for stochastic variance reduced gradient [johnson2013accelerating].

3.4 Conclusion

Following the formalism of [bottou2008tradeoffs], we have decomposed our initial error into

  • Estimation error: its convergence is O(1/N) in pessimistic cases. In the other situations, the convergence is faster (i.e. O((log(N)/N)β)) with 1/2β1.

  • Optimization error: the convergence is exponential under suitable conditions. In unfavourable cases, the convergence rate is O(1/N).

The comparison of these error sources shows that the estimation error is the dominant component. Thus, we can overcome the O(1/N) asymptotic speed, in some situations, by improving the estimation error.

4 Optimal policy for the learning rate γ when is countable

In this section, we take 𝒵= and consider the following type of algorithms:

qn+1(Zn)=qn(Zn)-γn(Zn)m(qn,Xn+1(Zn),Zn),n.

One can recover the classical SARSA, Q-learning and double Q-learning algorithms used in RL by considering a specific expression for m and Xn+1. In such algorithms the choice of γn is a crucial point. One can find in the literature general conditions that guarantee the convergence such that

k0γk(z)=,a.s,k0γk2(z)<,a.s,z𝒵. (13)

However, since the set of processes (γn)n0 satisfying these conditions is large in general and may even be empty when (Zn)n0 is not recurrent. Thus, to be more specific, many authors suggest to take γn proportional to 1/nα for stochastic approximation algorithms. The exponent α may vary from 0 to 1 depending on the algorithm used, see to cite [gadat2017optimal, moulines2011non]. Nonetheless, such a choice may be sub optimal. For example, Figure 1.a shows that the blue curve is a way higher than the orange one. Here, the blue curve represents the variation of the logarithm of the L2-error when γn=η/n, whereas the orange curve stands for a constant learning rate (i.e. γn=γ). We choose the constant η that ensures the fastest convergence for the blue curve.

In this paper, we propose to use a stochastic learning rate (γk)k0; our learning policy is decomposed into two interacting levels: the inner and the outer level. In the inner level, we use the Algorithm 3 (PASS). algorithm, for “PAst Sign Search”. This algorithm builds a new sequence (γki)k0, based on a predefined sequence (γko)k0 and the sign variations of m(qn,Xn+1(Zn),Zn), whose error decreases faster than the predefined one. In the outer level, we propose an optimal methodology for the selection of a piecewise constant predefined sequence (γko)k0. These two levels are interacting in the sense that the Algorithm 3 (PASS). algorithm influences the construction of the sequence (γko)k0.


Figure 1: L2-error for the estimation of the drift when γk is constant in orange and when γk1k in blue.

4.1 The inner level

4.1.1 The algorithms

In this part, we introduce three algorithms. We start with our benchmark which is the standard algorithm used in RL. Then, we present a second algorithm inspired from SAGA [defazio2014saga], which is a method used to accelerate the convergence of the stochastic gradient descent. SAGA reduces the optimization error exponentially fast. Finally, we describe the Algorithm 3 (PASS). algorithm that modifies the learning rate (γk)k based on the sign variations of m(qn,Xn+1(Zn),Zn). The main idea is to increase (γk)k as long as the sign of m(qn,Xn+1(Zn),Zn) remains unchanged. Then, we reinitialize or lower γk using a predefined sequence (γko)k when the sign of m(qn,Xn+1(Zn),Zn) switches. This algorithm can be seen as an adaptation of the line search strategy, which determines the maximum distance to move along a given search direction, to stochastic iterative methods. Actually, the line search method requires a complete knowledge of the cost function because it demands to evaluate several times the difference g(qk+γM(qk))-g(qk) for different values of γ with g and M defined in Section 3.1. However, stochastic iterative models have neither access to g nor M. They can only compute m(qn,Xn+1(Zn),Zn) when the state z=Zk is visited. Moreover, to get a new observation they need to wait55 5 This waiting time may be very long depending on the dimension of the state space 𝒵 and the properties of the process (Zk)k0. for the next visit of the state z=Zk. Nevertheless, they have instantaneous access to the previously observed values. Thus, the main idea here is to use these past observations although it adds a small memory cost. Some theoretical properties of these algorithms are investigated in Section 4.1.3.

Algorithm 1 (RL).

We start with an arbitrary q0Q and define by induction qk

qk+1(Zk)=qk(Zk)-γk(Zk)m(qk,Xk+1(Zk),Zk).
Algorithm 2 (SAGA).

We start with an arbitrary q0Q, M0=066 6 Here M0 is the zero function in the sense that M0[z,i]=0 for any zZ and i{1,,M}., n0=0 and define by induction qk, Mk and nk

qk+1(Zk)=qk(Zk)-γk(Zk)[m(qk,Xk+1(Zk),Zk)-Mk[Zk,i]+(j=1Mnk[Zk,j]Mk[Zk,j])(j=1Mnk[Zk,j])],Mk+1[Zk,i]=m(qk,Xk+1(Zk),Zk),nk+1[k,i]=1,

with i picked from the distribution p=(i=1Mnk[k,i]δi)/(j=1Mnk[k,j]). The quantity nk[k,j] records the j-th time when the process Z visits the state Zk before the k-th iteration.

For the next algorithm, we give ourselves a predefined sequence (γk)k0, an increasing function h:++ and a non-increasing function l:++. The function h is used to increase the learning rate and hence to accelerate the descent, while the function l is used to go back to a slower pace. A typical way to do it is to use a predefined learning rate.

Algorithm 3 (PASS).

We start with an arbitrary q0 and define by induction qk and γ^k

  • If m(qn,Xn+1(Zn),Zn)×m(qr1n,Xr1n+1(,Zr1n),Zr1n)0, then do

    qn+1(Zn)=qn(Zn)-h(γ^n(Zn),γn(Zn))m(qn,Xn+1(Zn),Zn),γ^n+1(Zn)=h(γ^n(Zn),γn(Zn)),

    with r1n is the index of the last observation when the process X visits the state Xn.

  • Else, do

    qn+1(Xn)=qn(Xn)-l(γ^n(Xn),γn(Xn))m(qn,Xn,Xn+1),γ^n+1(Xn)=l(γ^n(Xn),γn(Xn)).

4.1.2 Assumptions

In this section, we present the assumptions needed to prove our main result about the convergence of Algorithms Algorithm 1 (RL)., Algorithm 2 (SAGA). and Algorithm 3 (PASS).. We consider the following assumptions:

Assumption 4 (Existence of a solution).

There exists a solution q* of (1).

We write m* for the vector m*=m(q*)𝒳×𝒵 with q* solution of (1).

Assumption 5 (Pseudo strong convexity 2).

There exists a constant L>0 such that

(𝔼k[m(qk,Xk+1(Zk),Zk)-m*(Xk+1(Zk),Zk)])(qk(Zk)-q*(Zk))L(qk(Zk)-q*(Zk))2, (14)

with q* of (1) and Ek[X]=E[X|Fk] for any random variable X.

Recall that Assumption 5 is natural in the deterministic framework. For instance, if we take a strongly convex function f and call m its gradient (i.e m=f). Then, m satisfies Assumption 5. Additionally, the pseudo-gradient property (PG) considered in [bertsekas1996neuro, Section 4.2] is close to Assumption 2. However, Assumption 2 is slightly less restrictive than PG since it involves only the norm of the component (qk-q*)(Zk) instead of the norm of the vector (qk-q*). To get tighter approximations, we will also need the quantity Lk defined as follows:

Lk={𝔼k[m(qk,Xk+1(Zk),Zk)-m*(Xk+1(Zk),Zk)]qk(Zk)-q*(Zk), If qk(Zk)-q*(Zk)0,0, otherwise.

Note that Lk0 under Assumption 5. It is also the biggest constant that satisfies (14) for a fixed k. In particular, this means that LkL.

Assumption 6 (Lipschitz continuity of m).

There exists a positive constant B>0 such that for any random variables X and X valued in X we have

𝔼k[(m(qk,X,Zk)-m*(X,Zk))2]B{1+(qk(Zk)-q*(Zk))2+𝔼k[(X-X)2]}, (15)

with Ek[X]=E[X|Fk] for any random variable X.

Assumption 6 guarantees that m is Lipschitz. Authors in [bertsekas1996neuro, Section 4.2] use a similar condition. To get better bounds, we introduce Bk

Bk=𝔼k[(m(qk,X,Zk)-m*(X,Zk))2]1+(qk(Zk)-q*(Zk))2+𝔼k[(X-X)2].

We have BkB since Bk is the smallest constant satisfying (15) for a fixed k. We finally need an assumption on the learning (γk)k0 that describes indirectly how the process Z communicates with its different states.

Assumption 7 (Learning rate explosion).

For any zZ, there exists a non-negative deterministic sequence (γkd(z))kN such that

γk(z)γkd(z) and k1γkd(z)=,z𝒵.

When the process Z is Markov and γk(z) is bounded, Assumption 7 ensures that Z is recurrent. The boundedness assumption of γk(z) is not restrictive and can always be fulfilled. Indeed, since γk(z) is bounded, there exists a positive constant A such that γk(z)A,a.s for all k1. Thus, we get k1𝔼[γk(z)𝟏Zk=z]Ak1[Zk=z]. Since the left hand side of the previous inequality diverges under Assumption 7, we have k1[Zk=z]= which proves that Z is recurrent.

4.1.3 Main results

In this section, we compare the algorithms Algorithm 1 (RL)., Algorithm 2 (SAGA). and Algorithm 3 (PASS). and prove the convergence of Algorithm 3 (PASS).. Let c be a positive constant and k. We define the error function for the different algorithms as follows:

ek(z)={(qk(z)-q*(z))2, for algorithms Algorithm 1 (RL). and Algorithm 3 (PASS).,j=1Mn¯k[z,j](Mk[z,j]-m*(z))2+c(qk(z)-q*(z))2, for algorithm Algorithm 2 (SAGA).,

for all z𝒵 with n¯k[z,j]=(nk[z,j])/(i=1Mnk[z,j]) for any j{1,,M} and the total error Ek such that Ek=ek. We also use the following notations:

p(x)=2Lx-Bx2,pk(x)=2Lkx-Bkx2,γ¯k=argsuplpk(l)=LkBk,x.
Proposition 3.

Let zZ. Under Assumptions 4, 5 and 6, we have

𝟏A𝔼k[ek+1(z)] 𝟏A[αkek(z)+Mk], (16)

with A={Zk=z}. The values of the constant αk and Mk vary from an algorithm to another as follows:

αk(z1)={[1-p(γk(z))], for algorithm Algorithm 1 (RL).,max(1-p(γk(z))+BMc,1-(1M-6γk2(z)c)), for algorithm Algorithm 2 (SAGA).,{1-pk(γ¯k*)+d1γk2(z)𝟏ck1}, for algorithm Algorithm 3 (PASS)., (17)

with ck=Ek[γ^k(z)m(qk,Xk+1(z),z)]γ¯k(z)Ek[m(qk,Xk+1(z),z)], γ¯k*=ck(z)γ¯kγ¯k and

Mk={Bγk2(z)(2+vk), for algorithm Algorithm 1 (RL).,3Bγk2(z)(2+vk), for algorithm Algorithm 2 (SAGA).,B(ckγ¯k)2(2+vk), for algorithm Algorithm 3 (PASS)., (18)

where vk is the variance of Zk.

Equation (16) reveals that the performances of algorithms Algorithm 1 (RL)., Algorithm 2 (SAGA). and Algorithm 3 (PASS). depend on the interaction between two competing terms:

  • On the one hand the slope αk controls the decrease of the error from one step to the other,

  • On the other hand the quantity Mk gathers two sources of imprecision: the estimation error and the optimization error. Both sources of imprecision have a term in the variance vn (because the distribution of Z is unknown) and a positive constant (coming from the noise generated by the noisy nature of observations).

There is a competition between these two terms: to decrease Mk we need to send γk towards zero while the reduction of αk requires a relatively small but still non-zero value of γk. Thus, γk should satisfy a trade-off in order to ensure the convergence of the algorithms. The RM conditions (13) are a way to address this trade-off. Now, in order to analyse the properties of each algorithm, we compare for a given γk their respective values for αk and Mk in Table 2. For sake of clarity, we choose to present the variable (1-αk)/2 instead of αk in this table; note that a large value of 1-αk means that αk is small and thus induces a fast convergence.

Algorithms (1-αk)/2 Mk
Value Comparison with Algo 1 Value Comparison with Algo 1
RL (Algo 1) 2γk(z1)L - Bγk2(z1) Bγk2(2+vk)
SAGA (Algo 2) (2γk(z1)L-Bγk2(z1)-BMc)(1M-6γk2(z1)c) smaller 3Bγk2(2+vk) larger
PASS (Algo 4) 2γ¯k*Lk - Bk(γ¯k*)2+d1γk2𝟏ck1 larger B(ckγ¯k)2(2+vk) larger
Table 2: Comparison of the algorithms Algorithm 1 (RL)., Algorithm 2 (SAGA). and Algorithm 3 (PASS)..

The result below holds only for Algorithms 1 and 3.

Theorem 1.

Let the Assumptions 4, 5, 6 and 7 be in force. Algorithms 1 and 3 verify

  • When k0𝔼[γk2](z)< for all z, we have

    𝔼0[En]n0. (19)
  • When γk(z)=γ𝟏{Zk=z} for all (k,z)×𝒵 with γ>0 a positive constant, for any positive number r there exist two constants β[0,1) and M¯>0 which satisfy

    𝔼0[En]c(1+r)(1-α~n)+α~nM¯E0,α~<max(β,α),n, (20)

    with c=γ2M1-α, α=1-γ and M=supnMn.

4.2 The upper level

In practice, to apply Algorithm 3 (PASS). we need an appropriate predefined sequence (γk)k. It is possible to take γk proportional to 1/kα with α(0,1] as proposed in [moulines2011non, robbins1951stochastic]. However, in this section, we present a piece-wise constant policy for the selection of the learning rate (γk)k. To do so, we construct a sequence of integers (ki)i such that γk is constant within each interval Ii=[ki,ki+1[ with i. Additionally, we choose the “best" constant γki such that the error 𝔼[En] decreases the fastest possible by at least a factor α during the time period Ii in the sense that 𝔼[Eki+1]α𝔼[Eki]. Here the factor α is a fixed constant.

First, note that the function f(x)=2Lx-B2 with L defined in Assumption 5 and B introduced in Assumption 6 reaches its maximum value at the point γ¯1=LB0. Since we need f(γ0) to be the largest possible for a fast convergence, we take γ0 as close as possible to γ¯1. Let r>0 be a precision factor and γ¯2=2E0LM(1+r)+BE0 with E0 the initial total error. To guarantee a decrease of the error by a factor α during the first iterations, we also need γ0γ¯2. Thus, we define recursively the sequence (ki)i and (γki)i as follows:

{ki+1=ki+log(α2(1+r)M¯)log(αki),γki+1=αγki,i0,k0=0,γk0=γ¯1γ¯2, (21)

with αki=1-γki and M¯ defined in Theorem 1. The proposition below provides a non-asymptotic approximation of the number of steps ki needed to reduce the error by the factor αi with i.

Proposition 4.

Under Assumptions 4, 5 and 6, we have

kiL1i+L2(i-i*)++L3(α-(i-i*+1)+-1).

with i*=sup{i0,αkiβ}, β defined in Theorem 1, L1=-Llog(β), L2=Lb¯2qL(1-b¯γ¯1), L3=L2qLγ0(1-α-1), L=|log(αr2(1+r)M¯)|, b¯=B2L and q=12-β.

The proof of the above result is given in Appendix G. Proposition 4 shows that the convergence speed due to (21) is exponentially fast as long as ii* (i.e kiL1i). In such case, the leading term is L1i. This means that we need around L1i steps to reduce the error by a factor αi/277 7 Recall that αi is the square of the error.. However, when i>i*, the dominating term becomes L3(α-(i-i*+1)+-1). This means that after a number k of iterations proportional to kα-i we reduce the error by 1/k=αi/2. Then, we recover the slow convergence rate of O(1/k).

Since the constants L and B are unknown in practice, a first solution consists in starting with arbitrary values for B and L and generating a sequence of learning rates. If the error m(qn,Xn+1(Zn),Zn) increases, one can take a larger value for B and a smaller one for L otherwise he keeps B and L unchanged. A second solution consists in directly averaging the error m(qn,Xn+1(Zn),Zn) over the lasts p visit times with p fixed by the controller. If this average error does not decrease, Equation (21) reduces the value of the step size by a factor α.

4.3 Extension

The results of this section still hold when the descent sequence (2) is replaced by

qn+1(z)=qn(z)-γn(z)m(qn,Xn+1(z),z),z𝒵,n. (22)

When γn(z)=0 if zZn, we recover (2). Thus, Equation (22) is slightly more general. Moreover, Algorithm (22) appears in many contexts: stochastic iterative algorithms, gradient methods, fixed point iterative techniques, etc. The results of this section can be directly transposed to (22). We present in Appendix A an adaptation of the Algorithm Algorithm 3 (PASS). for (22).

5 Some examples

5.1 Methodology

In this section, we compare four algorithms. The two first ones are two different versions of Algorithm 1 (RL).. In the first version, the learning rate γk is taken such that γk=ηk with η>0 selected to provide the best convergence results. In the second version, the step size follows the upper level policy, described in Section 4.2. The third algorithm is Algorithm 2 (SAGA). where the step size is derived from the upper level policy. Finally, we use the Algorithm 3 (PASS). algorithm (for “PAst Sign Search”) presented and studied theoretically in the previous sections. We consider three numerical examples to compare the convergence speed of these algorithms: drift estimation, optimal placement of limit orders and the optimal liquidation of shares.

5.2 Drift estimation

Formulation of the problem.

We observe a process (Sn)n0 which satisfies

Sn+1=Sn+fn+1+Wn, (23)

with Wn a centred noise with finite variance. We want to estimate the quantities fi with i{1,,nmax}. Using (23) and 𝔼[Wt]=0, we get

𝔼[Si+1-Si-fi+1]=0,i{0,,nmax-1}.

Thus, we can estimate fi using stochastic iterative algorithms. The pseudo-code of our implementation of Algorithm 3 (PASS). for this problem can be found in the Appendix B under the name Implementation B.

Numerical results.

Figure 2 shows the variation of the L2-error when the number of iterations increases. We can see that the algorithm Algorithm 3 (PASS). outperforms standard stochastic approximation algorithms. Moreover, other algorithms behave as expected: the standard RL decreases very slowly (but we know it will drive the asymptotic error to zero), the constant learning rate and SAGA provides better results than RL, while Algorithm 3 (PASS). seems to have captured the best of the two worlds for this application: very fast acceleration at the beginning and the asymptotic error goes to zero.

L2-error against the number of iterations  

Figure 2: The L2-error between fk and f for different numerical methods averaged over 1000 simulated paths.

5.3 Optimal placement of a limit order

Formalisation of the problem.

We consider an agent who aims at buying a unit quantity using limit orders and market orders (see [lehalleLimitOrderStrategic2017] for detailed explanations). In such case, the agents wonder how to find the right balance between fast execution and avoiding trading costs associated to the bid-ask spread. The agent state at time t is modelled by Xt=(QBefore,QAfter,P) with QBefore the number of shares placed before the agent’s order, QAfter the queue size after the agent’s order and Pt the mid price, see Figure 3. The agents wants to minimise the quantity

𝔼[F(XτTexec)+0τTexecc𝑑s]

where

  • Texec=inf{t0,Pt=0} the first time when the limit order gets a transaction.

  • τ the first time when a market order is sent.

  • X=(QBefore,QAfter,P) the state of the order book.

  • F(u) is the price of the transaction (i.e. F(u)=p+ψ when the agents crosses the spread and F(u)=p otherwise).

We show in Section 2 that the Q-function is solution of (6). Thus, we can use Algorithms Algorithm 1 (RL)., Algorithm 2 (SAGA). and Algorithm 3 (PASS). to estimate it. The pseudo-code of our implementation of Algorithm 3 (PASS). is available as Implementation 27 in Appendix B.

{tikzpicture}\draw

(3,-1) rectangle (4,-5); \draw[black,fill=gray!60] (3,-3) rectangle (4,-3.3); \draw(6,-2) rectangle (7,-5); \draw[dotted][->](0,-5) – ++(10,0); \draw(5,-5) node[sloped]|; \draw(3.5,-5) node[below]Same ; \draw(6.5,-5) node[below]Opp ; \draw(3,-4) node[left]QBefore ; \draw(3,-3.15) node[left]q ; \draw(3,-2) node[left]QAfter ; \draw(7,-4) node[right]QOpp ; \draw(5,-5) node[below]P(t) ; \draw(10,-5) node[right]Price ;

Figure 3: The state space of our limit order control problem.
Numerical results.

Figure 4 shows three control maps: the x-axis reads the quantity on “same side” (i.e. Qsame=QBefore+QAfter) and the y-axis reads the position of the limit order in the queue, i.e. QBefore. The color and numbers gives the control associated to a pair (Qsame,QBefore): 1 (blue) means “stay in the book”, while 0 (red) means “cross the spread” to obtain a transaction. The panel (at the left) gives the reference optimal controls obtained with a finite difference scheme, the middle panel the optimal controls obtained for a Algorithm 1 (RL). algorithm where the step-size (γk)k0 is derived from the upper level policy, and the right panel the optimal control obtained with our optimal policy (i.e. upper level and inner level combined). It shows that after few iterations our optimal policy already found the optimal controls. Figure 5 compares the log of the L2 error, averaged over 100 trajectories, between the different algorithms. We see clearly that our methodology improves basic stochastic approximation algorithms. Again, the other algorithms behave as expected: SAGA is better than a constant learning rate that is better than the standard RL (at the beginning, since we know that asymptotically RL will drive the error to zeros whereas a constant learning rate does not).

a) Theoretical optimal control b) step_cste optimal control c) PASS optimal control  

Figure 4: Comparison optimal control after 300 iteration for different methods: left is the optimal control, middle is Algorithm 1 (RL). with a step size derived from the upper level and right is our optimal policy for the step size (i.e. upper level and inner level combined).

L2-error against the number of iterations  

Figure 5: The log L2-error against the number of iterations averaged over 1000 simulated paths.

5.4 Optimal execution

Formalisation of the problem.

An investor wants to buy a given quantity q0 of a tradable instrument (see [cartea15book] and [lehalleMarketMicrostructurePractice2018] for details about this application). The price St of this instrument satisfies the following dynamic:

dSt=αdt+σdBt,

where α is the drift and σ is the price volatility. The state of the investor is described by two variables its inventory Qt and its wealth Xt at time t. The evolution of these two variables reads

{dQt=νtdt,Q0=q0,dWt=-νt(St+κνt)dt,W0=0, (24)

with νt the trading speed of the agent and κ>0. The term κνt corresponds to the temporary price impact. The investor wants to maximize the following quantity

WT+QT(ST-AQT)-ϕtTQs2𝑑s,

it represents its final wealth XT at time T, plus the value of liquidating its inventory minus a running quadratic cost. The value function V is defined such that

V(t,w,q,s)=supν𝔼[WT+QT(ST-AQT)-ϕtTQs2ds|Wt=w,Qt=q,St=s].

We remark that v(t,w,q,s)=V(t,w,q,s)-w-qs verifies

v(t,w,q,s)=supν𝔼[(WT-Wt)+(QTST-QtSt)=MTt-AQT2-ϕtTQs2ds|Wt=w,Qt=q,St=s]. (25)

Using (24), we can see that the variable MTt is independent of the initial values Wt, St and Qt. This means that v is a function of only two variables the time t and the inventory q. The dynamic programming principle ensures that v satisfies

v(t,q)=supν𝔼[Mt+Δt-ϕtt+ΔQs2ds+v(t+Δ,Qt+Δ)|Qt=q]. (26)

We fix a maximum inventory q¯. Let k=(kT,kq)(*)2, Δ=T/kT, DT={tikT;ikT} and Dq={qikq;ikq} with tikT=iΔ and qikq=-q¯+2iq¯/kq. To estimate v we use the numerical scheme (vnk)n1,k(*)2 defined such that

vn+1k(Zn)=vnk(Zn)+γn(Zn)[supνA(Zn){Mn+1ν-ϕΔQn2+vnk(Zn+1ν)-vnk(Zn)}],

with Zn=(nΔ,QnΔ) and A(Zn)Dq is the set of admissible actions88 8 We do not allow controls that leads to states where the inventory exceeds q¯.. When the final time T is reached (i.e. n=kT), we pick a new initial inventory from the set Dq and start again its liquidation. At a first sight, it is not clear that vnk approximates v. However, we show in Appendix H that vnk converges point-wise to v on DT×Dq when n and k. see Appendix B for a detailed implementation of the algorithm with the corresponding pseudo-code (as Implementation B).

Numerical results.

Figure 6 shows the value function v for different values of the elapsed time t and the remaining inventory Qt. The panel (at the left) gives the reference value function. It is computed by following the same approach of [MFGTradeCrow2016]. The middle panel the value function obtained obtained after 120 000 iterations for Algorithm 1 (RL). algorithm where the step-size (γk)k0 is derived from the upper level of our optimal policy, and the right panel the value function obtained with our optimal policy (i.e upper level and inner level combined). It shows that our optimal strategy leads to better performance results. We also plot, in Figure 7, a simulated path for the variations of the log L2 error for different algorithms. Here again, we notice that our methodology improves the basic RL algorithm and that the ordering of other approaches is similar to the one of the “drift estimation” approximation (i.e. SAGA and the constant learning rate are very similar).

a) Theoretical value function b) step_ cste value function c) PASS value function  

Figure 6: Comparison value function between methods.

L2-error against the number of iterations  

Figure 7: The log L2-error against the number of iterations averaged over 1000 simulated paths.

Appendix A Extension of the Algorithm 3 (PASS). Algorithm

First, we can adapt Algorithm Algorithm 3 (PASS). to (22) by considering component-wise version which consists in applying Algorithm 3 introduced in Section 4.1.1 to all the visited coordinates of qn. Next, we also propose a more direct approach with the algorithm below.

Algorithm 4 (PASS for (22)).

We start with an arbitrary q0 and define by induction qk and γ^k

  • If m(qn,Xn+1),m(qn-1,Xn)0, then do

    qn+1=qn-h(γ^n,γn)m(qn,Xn+1),γ^n+1=h(γ^n,γn).
  • Else, do

    qn+1=qn-l(γ^n,γn)m(qn,Xn+1),γ^n+1=l(γ^n,γn).

Appendix B Implementations

We give here the pseudo code used for each one of the three numerical examples considered in Section 5.

Drift estimation.

We consider the following expression for the functions h and l:

h(γ,γbase)=min(γ+γbase,3γbase),l(γ,γbase)=max(γ-γbase,γbase),(γ,γbase)+.

We use Implementation B for the numerical experiments. {algorithm} PAst Sign Search (PASS) for (RL) drift estimation problem \[email protected]@algorithmic[1] \StateAlgorithm parameters: step size (γo)n0(0,1], number of episodes n \Statexinitial guess q0, past error value Epast \StatexInitialise γ^0=γ0o \Forepisode in 1:n \For t{0,,nmax-1} \StateObserve ΔXnext=St+1-St \Ifthe first visit time to t \Stateq0(t)q0(t)-γ^0(t)m(q0,ΔXnext,t) \ElsIfm(q0,t,ΔXnext)×Epast(t)0 \Stateγ^0(t)h(γ^0(t),γo(t)) \Stateq0(t)q0(t)-γ^0(t)m(q0,ΔXnext,t) \ElsIfm(q0,t,ΔXnext)×Epast(t)<0 \Stateγ^0(t)l(γ^0(t),γo(t)) \Stateq0(t)q0(t)-γ^0(t)m(q0,ΔXnext,t) \EndIf\StateEpast(t)m(q0,t,ΔXnext) \EndFor\StateSave the norm E of the vector Epast(t). \Ifthe average value of E over the last w=5 episodes is not reduced by p=1% \Stateγo(t)max(γo(t)/2,0.01) (this is done each w episodes) \EndIf\EndFor

Optimal placement of limit orders.

We consider the following expression for the functions h and l:

h(γ,γbase)=max(min(γ+2/3γbase,3γbase),γbase),l(γ,γbase)=max(γ-2/3γbase,γbase),(γ,γbase)+. (27)

We use Algorithm 27 for the numerical experiments. Note that we do not need to send a market order to know our expected future gain. {algorithm} PAst Sign Search (PASS) for (RL) optimal placement problem \[email protected]@algorithmic[1] \StateAlgorithm parameters: step size (γo)n0(0,1], number of episodes n \Statexinitial guess q0, past error value Epast \StatexInitialise γ^0=γ0o \Forepisode in n \StateSelect initial state X0 \Foreach step within episode \StateTake the action stay in the order book \StateObserve the new order book state Xnext \Fora {0,1} \Ifthe first visit time to Xnext \Stateq0(X0,a)q0(X0,a)-γ^0(X0,a)ma(q0,X0,Xnext) \ElsIfma(q0,X0,Xnext)×Epast(X0,a)0 \Stateγ^0(X0,a)h(γ^0(X0,a),γo(X0,a)) \Stateq0(X0,a)q0(X0,a)-γ^0(X0,a)ma(q0,X0,Xnext) \ElsIfma(q0,X0,Xnext)×Epast(X0,a)<0 \Stateγ^0(X0,a)l(γ^0(X0,a),γo(X0,a)) \Stateq0(X0,a)q0(X0,a)-γ^0(X0,a)ma(q0,X0,Xnext) \EndIf\StateEpast(X0,a)ma(q0,X0,Xnext) \EndFor\StateX0Xnext \EndFor\StateSave the norm E of the vector Epast(t). \Ifthe average value of E over the last w=40 episodes is not reduced by p=5% \Stateγo(t)max(γo(t)/2,0.01) (this is done each w episodes) \EndIf\EndFor

Optimal execution of a large number of shares.

To solve this problem we use the same functions h and l considered in the previous problem, see (27). Then, we apply Algorithm B. In this problem, it is crucial to select actions according to the policy (7) in order to encourage exploration. The coefficient β¯ used by the agent to select its actions is taken constant equal to β¯=5. We consider the same policy for all the tested algorithms.
{algorithm} PAst Sign Search (PASS) for (RL) optimal execution problem \[email protected]@algorithmic[1] \StateAlgorithm parameters: step size (γo)n0(0,1], number of episodes n \Statexinitial guess q0, past error value Epast \StatexInitialise γ^0=γ0o \Forepisode in n \StateSelect the initial inventory Q0 \Fort{0,,nT-1} \StateObserve the new price state Snext and set X0=(t,Q0) \StateObserve the new price state Snext \Ifthe first visit time to X0 \Stateq0(X0)q0(X0)-γ^0(X0)m(q0,Snext,X0) \ElsIfm(q0,Snext,X0)×Epast(X0)0 \Stateγ^0(X0)h(γ^0(X0),γo(X0)) \Stateq0(X0)q0(X0)-γ^0(X0)m(q0,Snext,X0) \ElsIfm(q0,Snext,X0)×Epast(X0)<0 \Stateγ^0(X0)l(γ^0(X0),γo(X0)) \Stateq0(X0)q0(X0)-γ^0(X0)m(q0,Snext,X0) \EndIf\StateEpast(X0)m(q0,Snext,X0) \StateSelect an action A and observe Qnext \StateQ0Qnext \EndFor\StateSave the norm E of the vector Epast(t). \Ifthe average value of E over the last w=300 episodes is not reduced by p=1% \Stateγo(t)max(γo(t)-0.01,0.01) (this is done each w episodes) \EndIf\EndFor

Appendix C Proof of Proposition 1

Proof of Proposition 1.

Let z𝒵. Standard uniform convergence results ensure that

𝔼[supq(g(q)-gn(q))(z)|n(z)]c1n(z)1,a.s.

with c>0 a positive constant. Since the Markov chain (Zn)n1 is irreducible and the set 𝒵 is finite, the sequence (Zn)n1 is positive recurrent and we have

n(z)n=k=1n𝟏Zk=znnμ[Zn=z]>0=p(z),

with μ the unique invariant distribution of (Zn)n1. Thus, we have

un(z)=𝔼[nn(z)1]n1p(z)>0.

This shows that un(z) is bounded by u(z) and ensures that

𝔼[supq(g(q)-gn(q))(z)]c1(z)1n,

with c1(z)=cu(z). The constant c1 can be taken independent of z since 𝒵 is finite. ∎

Appendix D Proof of Proposition 2

Proof of Proposition 2.

Let z𝒵. We follow the same approach used in the proof of Proposition 1 to get

vn(z)=𝔼[(log(n¯(z))/log(n)n¯(z)/n)β]𝔼[(1n¯(z)/n)β]n1p(z)β>0.

This shows that vn(z) is bounded by v(z) and ensures that

𝔼[supq(g(q)-gn(q))(z)]c2(z)1n,

with c2(z)=cv(z). The constant c2 can be taken independent of z since 𝒵 is finite. Using the same manipulations of (11), we complete the proof. ∎

Appendix E Proof of Proposition 3

Proof of Proposition 3.

Let k0, A the set A={Zk=z} and m(qk)𝒵 such that m(qk)(z)=m(qk,Xk+1(z),z) for any z𝒵. We split the proof in three steps. In each one of these steps, we prove (16) for a given algorithm.

Step (i):

In this step, we prove (16) for Algorithm Algorithm 1 (RL).. We have

𝟏A𝔼k[(qk+1-q*)2(z)] =𝟏A𝔼k[(qk-γkm(qk)-q*)2(z)]
=𝟏A{(qk-q*)2(z)-2γk𝔼k[m(qk)](qk-q*)(z)=(i)+γk2𝔼k[m(qk)2(z)]=(ii)}.

Using Assumption 5 and 𝔼k[m*]=0, we get (i)-2Lγk|qk-q*|2(z). Since 𝔼k[m*]=0, Assumption 6 gives

(ii) =𝔼k[(m(qk)-𝔼k[m(qk)])2(z)]+(𝔼k[m(qk)-m*])2(z)
B(1+vk)+B(1+(qk-q*)2(z)).

Thus, we deduce that

𝟏A𝔼k[(qk+1-q*)2(z)] 𝟏A{(1-2γkL+Bγk2=-p(γk))(qk-q*)2(z)+γk2(z)B(2+vk)=Mk},

which shows (16) for Algorithm Algorithm 1 (RL)..

Step (ii):

Here we show (16) for Algorithm Algorithm 2 (SAGA).. Let M¯k(z)=j=1Mn¯k[z,j]Mk[z,j] and n¯k[z,j]=nk[z,j]/j=1Mnk[z,j]. Using 𝔼k[m*]=0 and 𝔼k[Mk[z,i]]=M¯k(z), we have

𝟏A𝔼k[(qk+1-q*)2(z)]   =𝟏A{(qk-q*)2(z)+2(𝔼k[qk+1]-qk)(qk-q*)(z)+𝔼k[(qk+1-qk)2(z)]}
  =𝟏A{(qk-q*)2(z)-2(γk𝔼k[m(qk)-m*])(qk-q*)(z)
    +γk2𝔼k[(m(qk)(z)-Mk[z,i]+M¯k(z))2]}
 Assumption 5 𝟏A{(1-2Lγk)(qk-q*)2(z)+γk2𝔼k[(m(qk)(z)-Mk[z,i]+M¯k(z))2]=(1)}. (28)

We first dominate the term (1). Since 𝔼k[m*](z)=0 and

𝔼k[(Mk[z,i]-m*)2(z)]=jn¯k[z,j]𝔼k[(Mk[z,j]-m*(z))2],

we have

(1) =𝔼k|(m(qk)-𝔼k[m*])(z)-(Mk[z,i]-m*(z))+(jn¯k[z,j](Mk[z,j]-m*(z)))|2
3[𝔼k[m(qk)-m*]2(z)+𝔼k[(Mk[z,i]-m*(z))2]+𝔼k[(jn¯k[z,j](Mk[z,j]-m*(z))]2]
Jensen’s inequality3[𝔼k[(m(qk)-m*)2(z)]+𝔼k[(Mk[z,i]-m*)2(z)]+jn¯k[z,j]𝔼k[(Mk[z,j]-m*(z))2]]
=3[𝔼k[(m(qk)-𝔼k[m(qk)])2(z)]+(𝔼k[m(qk)-m*](z))2+2jn¯k[z,j]𝔼k[(Mk[z,j]-m*(z))2]]
 Assumption 63[B(2+vk+(qk-q*)2(z))+2jn¯k[z,j]𝔼k[(Mk[z,j]-m*(z))2]]
=3B(2+vk)+3B(qk-q*)2(z)+6jn¯k[z,j]𝔼k[(Mk[z,j]-m*(z))2]]. (29)

By combining (28) and (29), we get

𝟏A𝔼k[|qk+1-q*|2(z)] 𝟏A{(1-2γk(z)L+3Bγk2(z))(qk-q*)2(z)
+6jn¯k[z,j]𝔼k[(Mk[z,j]-m*(z))2]+3Bγk2(vk+2)(z)}. (30)

Moreover, we have

𝟏A𝔼k[j=1Mn¯k[z,j](Mk+1[z,j]-m*(z))2] =𝟏A{1M𝔼k[(m(qk)-m*)2(z)]
  +(1-1M)j=1Mn¯k[z,j]𝔼k[(Mk[z,j]-m*(z))2]}
𝟏A{BM(1+(qk-q*)2(z))
  +(1-1M)j=1Mn¯k[z,j]𝔼k[(Mk[z,j]-m*(z))2]}. (31)

Thus using (30) and (31), we conclude

𝟏A𝔼k[ek+1(z)] 𝟏A{(1-2γkL+3Bγk2+BMc)=α1c(qk-q*)2(z)
+(1-1M+6γk2(z)c)α2jn¯k[z,j]𝔼k[(Mk[z,j]-m*(z))2]}+3Bγ2k(vk+2)(z)
𝟏Aαek(z)+3Bγk2(vk+2)(z),

with α=max(α1,α2)[0,1).

Step (iii):

In this final step, we show (16) for Algorithm Algorithm 3 (PASS).. We have

𝟏A𝔼k[(qk+1-q*)2(z)] =𝟏A𝔼k[(qk-γ^km(qk)-q*)2(z)]
=𝟏A{(qk-q*)2(z)-2𝔼k[γ^km(qk)](qk-q*)(z)=(i)+𝔼k[γ^k2(m(qk))2(z)]=(ii)}.

For the term (i), using Assumption 5 and 𝔼k[m*]=0, we have (i)=-2ckγ¯k(qk-q*)2(z) with ck=𝔼k[γ^km(qk)(z)]γ¯k𝔼k[m(qk)(z)]. Using Assumption 6 and 𝔼k[m*]=0, we get

(ii)=ck2γ¯k2𝔼k[m(qk)2(z)] =ck2γ¯k2(𝔼k[(m(qk)-𝔼k[m(qk)])2(z)]+(𝔼k[m(qk)-m*])2(z))
ck2γ¯k2(Bk(1+vk)+Bk(1+(qk-q*)2(z))).

Thus, we deduce that

𝟏A𝔼k[(qk+1-q*)2(z)] 𝟏A{(1-2ckγ¯kLk+Bk(ckγ¯k)2=-pk(ckγ¯k))(qk-q*)2(z)+ck2γ¯k2(z)Bk(2+vk)=Mk}. (32)

We write γ¯k for the quantity γ¯k=ckγ¯kγ¯k. Since γ^k[γk,r1γk], we have ckγ¯k[γk,r1γk]. When ckγ¯k]γk,γ¯k], we have pk(γ¯k)=pk(ckγ¯k)>pk(γk)0. When ckγ¯k]γ¯k,r1γk] (i.e ck1), we use that

|pk(ckγ¯k)-pk(γ¯k)|=|pk(ckγ¯k)-pk(γ¯k)|Bk(ckγ¯k-γ¯k)2Bk(r1γk-γk)2=γk2d1,

with d1=(r1-1)2Bk. Thus, using (32), we conclude

𝟏A𝔼k[(qk+1-q*)2(z)] 𝟏A{(1-pk(γ¯k)+γk2d1𝟏ck1)(qk-q*)2(z)+ck2γ¯k2(z)Bk(2+vk)=Mk}.

This completes the proof. ∎

Appendix F Proof of Theorem 1

F.1 Preparation for the proof of Theorem 1

We introduce the following notations. Let j and (μn)n1 a real sequence, we write (μnj)n1 for the delayed sequence μnj=μj+n for any n1. Additionally, we define recursively the sequence (anμ)n1 as follows:

a1μ=1, and   an+1μ=μnl=1nan+1-lalμ,n1. (33)
Lemma 1.

By convention, an empty sum is equal to zero. Let (vn)n1 be the sequence defined as follows:

vn=ϵn+μn(j=1n-1an-jvj),n1,

where (ϵn)n0 and (μn)n0 are two real sequences and j0aj=1. Then, we have

vn=j=1nan+1-jμjϵj,n1. (34)
Proof of Lemma 1.

Let us prove the result by induction on n1. By definition, Equation (34) is satisfied for n=1. By applying the induction hypothesis (34) to all jn, we get

vn+1=1×ϵn+1+μn+1(j=1nan+1-jvj) =a1μn+1ϵn+1+μn+1(j=1nan+1-jl=1jaj+1-lμlϵl)
=a1μn+1ϵn+1+[l=1nμn+1(j=lnan+1-jaj+1-lμl)ϵl]
=a1μn+1ϵn+1+[l=1nμn+1(j=1n-l+1an-l+2-jajμl)ϵl]
=a1μn+1ϵn+1+l=1nan+2-lμlϵl=j=1n+1an+2-jμlϵj.

Lemma 2.

Let nN, (μn)n0 be a positive sequence, (anμ)n0 be the sequence defined in (33) and rn=1-μn.

  • When n0rn=, we have

    anμn0. (35)
  • When n0rn= and there exists β[0,1) such that n0βnan<, we have

    βnanμn0. (36)
Proof of Lemma 2.

We need to introduce the following notations. Let n* and (un)n1 be a real sequence. We define respectively the translation operator T1 and the aggregation operator T2 as follows T1(u)n=un-1𝟏n2 and T2(u)n=(l1alul)𝟏n=1. We denote by (𝐰k)k1 the sequence

𝐰n1=ϵ1𝟏n=1,𝐰k+1=(T1+μk+1T2)(𝐰k),k1,n1. (37)

Our construction ensures that vk+1=𝐰1k+1=μk+1T2(𝐰k)1. In the sequel, we first prove (35) then show (36).

Step (i):

Here, we demonstrate (35). We first handle the case where a1=1. In such case, we have

vn=μnvn-1=(i=2nμi)v1=(i=2n(1-ri))v1e-l=2nrlv1n0,

since l2rl=. Now we place ourselves in the case where a1<1. Using (37), we have 𝐰n=[l=1n-1(T1+μl+1T2)]𝐰1 which gives

vn=μnT2[l=1n-1(T1+μl+1T2)](𝐰1)1=μnT2[𝐤=(k1,,km)𝒫n-1𝐢=(i1,,im){1,2}mμ¯𝐢𝐤TimkmTi1k1](𝐰1)1,

with 𝒫n-1={𝐤=(k1,,km)(*)m;s.tk1++km=n-1} the set containing all the partitions of (n-1) and μ¯𝐢𝐤=l𝒜𝐢𝐤μl. Note that each integer l{1,,n-1} belongs to a group pl (i.e pl=r=1plkpl<r=1pl+1kr) and the set 𝒜𝐢𝐤={l{1,,n-1};s.tipl=2} refers to the groups where we select the operator T2 instead of T1.

Let ϵ>0 and m*. Using that 𝐰1=supl1|𝐰l1| is finite, we have T2T1m(𝐰1)1=lm+1al𝐰l1(lm+1al)𝐰1. Since l1al<, there exists n0 such that when mn0, we have T2T1m(𝐰1)1ϵ2. We write a¯ for the quantity a¯=maxln0al. Since a1<1 and l1al=1, we have a¯<1. Thus there exists n1 such that for all mn1, we have a¯n1ϵ2.

We denote by kT1=r1,s.tir=1kr the integer that counts the number of times the operator T1 is chosen. When kT1n0n1, we have #{l,l𝒜𝐢𝐤}n0n1, which gives

μ¯𝐢𝐤=l𝒜𝐢𝐤μl=l𝒜𝐢𝐤(1-rl)e-(l𝒜𝐢𝐤rl)=e-(l=0nrl)+(l𝒜𝐢𝐤rl)e-(l=0nrl)+n0n1, (38)

since rl1. Thus, there exists n2 such that for all mn2, we have μ¯𝐢𝐤ϵ2.

We take now nn3=(n0n1)n2. When kT1>n0n1, we differentiate two cases. First, if there exists r such that krn0 and ir=1. In such case, since T2(u)T(u) with T(u)n=u𝟏n=1, we have

T2(TimkmTi1k1)(𝐰1)1=T2(TimkmT2kr+1T1n0T1kr-n0Ti1k1)(𝐰1)1ϵ2T2(TimkmTkr+1T1krTi1k1)(𝐰1)1. (39)

Second, when we cannot find such r, then necessarily kT2nn0n1 with kT2=r1,s.tir=2kr represents the number of times the operator T2 is selected. Since 𝐰n1=ϵ1𝟏n=1 and klk0 whenever il=2, the coefficients al with l>k0 never appear. In such situation, we get T2(u)a¯T(u) using basic inequalities. The inequalities T2(u)a¯T(u) and a¯kT2a¯n1ϵ2 ensure

T2(TimkmTi1k1)(𝐰1)1a¯kT2T2(Ti¯mkmTi¯1k1)(𝐰1)1ϵ2T2(Ti¯mkmTi¯1k1)(𝐰1)1, (40)

with i¯m=im𝟏im=1+𝟏im=2. When kT1n0n1, we use (38) and nn2 to get

μ¯𝐢𝐤T2(TimkmTi1k1)(𝐰1)1ϵ2(TimkmTi1k1)(𝐰1)1,. (41)

By combining the inequalities (39), (40), (41) and T2(u)T(u), we conclude

vn ϵ2[kT1n0n1(TimkmTi1k1)(𝐰1)1]+ϵ2[kT1>n0n1(Ti¯mkmTi¯1k1)(𝐰1)1]
ϵ2[l=1n-1(T1+μl+1T2)](𝐰1)1+ϵ2[l=1n-1(T1+μl+1T)](𝐰1)1ϵ2+ϵ2ϵ.

which proves (35).

Step (ii):

Let us prove (36). Note that (hn)n0=(βnanμ)n0 verifies

h0=0,hn+1=μn+1l=0n(βn+1-lan+1-lμ)=gn+1-lhl,n0,

with (gn)n0 the sequence defined such that gn=βnanμ. This means that hn=gnμ. Thus a straightforward application of (35) to the sequence (gnμ)n0 gives (36). This completes the proof. ∎

Lemma 3.

Let Z be a finite space and (z,z1)Z2. Under Assumption 7, there exist two constants β[0,1) and d>0 such that

an(z1)=z[τ𝒵1n]βnd,n. (42)
Proof of Lemma 3.

Let (z,z1)𝒵2 and Az1={z𝒵;[Z1=z1|Z0=z]=1} the states that leads to z1 with probability 1. To avoid the states z𝒵 where the quantity [Z1z1|Z0=z]=0, we first show (42) for the Markov chain Z~n=Zn𝟏ZnAz1+z𝟏ZnAz1. Then, we prove (42) in the general case.

Step (i):

We prove (42) by induction on n* for the Markov chain Z~. We denote by τ𝒵~11 the first visit time of the Markov chain Z~ to the state z1. For n=1, we have

z[τ𝒵~11=1]=z[Z~1=z1]maxz𝒵[max(z[Z~1=z1],z[Z~1z1])]=β.

Note that β[0,1) since z[Z~1=z1]<1. Let us take now n>1. We have

z[τ𝒵~11n]=z[τ𝒵~11=1]+zz1P(z,z)z[τ𝒵~11n-1],

with P(z,z)=[Z~1=z|Z~0=z]. The induction assumption gives z[τ𝒵~11n-1]1-βn-1 which ensures

z[τ𝒵~11n](1-Pz,z1c)+(1-βn-1)Pz,z1c1-βn,

with Pz,z1c=[Z~1z1|Z~0=z]. Thus, we get z[τ𝒵~11>n]βn.

Step (ii):

Let us now prove (42) for Z. Since Z reaches z1 at most one iteration after Z~, we have

z[τ𝒵11n]z[τ𝒵~11n-1]βn-1=1β=dβn,n1.

Recall that τ𝒵11 is the first visit time of the Markov chain Z to the state z1. This completes the proof. ∎

F.2 Propagation of the error

As a first step, we consider the case where the process (Zk)k1 is Markov. For any z1𝒵, we write τ𝒵11=inf{l>0,Zl=z1} for the first visit time of (Zk)k1 to the state z1. Moreover, we define the sequence (ak)k0 such that ak(z1)=[τ𝒵11k|0]. We have the following result.

Proposition 5.

Let (Zk)k1 be a Makov process, z1Z and nN*. Under Assumptions 4, 5, 6 and 7, we have

𝔼0[en(z1)]ϵn+j=1n-1an-jα¯j𝔼0[ej(z1)],

with ϵn=e1(z1)an(z1)+j=1n-1an-jMj and α¯j=E0[αjej(z1)]E0[ej(z1)]1E0[ej(z1)]0. The variables αj and Mj are given by (17) and (18).

Proof of Proposition 5.

Let (z,z1)𝒵2 and n*. Using the last-exit decomposition, see Section 8.2.1 in [meyn2012markov], we have

𝔼z[en(z1)] =𝔼z[en(z1)𝟏τ𝒵11n]+j=1n-1𝔼z[en(z1)𝟏{τ𝒵11n-j,Zj=z1}]
=𝔼z[e1(z1)𝟏τ𝒵11n]+j=1n-1𝔼z[ej+1(z1)𝟏{τ𝒵11n-j,Zj=z1}]
Proposition 3𝔼z[e1(z1)𝟏τ𝒵11n]+j=1n-1𝔼z[(αjej(z1)+Mj)𝟏{τ𝒵11n-j,Zj=z1}]
=𝔼z[e1(z1)𝟏τ𝒵11n]+j=1n-1𝔼z[(αjej(z1)+Mj)𝟏{Zj=z1}𝔼[𝟏τ{𝒵11n-j}|j]]
e1(z1)𝔼z[𝟏τ𝒵11n]+j=1n-1z1[τ𝒵1n-j1]𝔼z[(αjej(z1)+Mj)]
=ϵn+j=1n-1an-jα¯j𝔼z[ej(z1)],

with aj(z1)=z1[τ𝒵11j], τ𝒵11=inf{l>0,Zl=z1} and αn and Mn defined in Proposition 3. In the second equality, we use that en(z1) does not change as long as the state z1 is not reached. This completes the proof. ∎

We turn now to the general case. However, we use the same kind of arguments needed in the Markov setting. Let n* and (z1,,zn)𝒵n. For any i{1,,n}, we denote by Ai the set Ai={Z1=z1,,Zi=zi}. Moreover, we write τ𝒵n=inf{l>0,Zl+n=z1} and akn=[τ𝒵nn-kk|n]. We have the following result.

Proposition 6.

Under Assumptions 4, 5, 6 and 7, we have

𝔼0[en(z1)𝟏An]ϵn+j=1n-1an-jnα¯j𝔼0[ej(z1)𝟏An],

with ϵn=e1(z1)an0(z1)+j=1n-1an-jnMj and α¯j=E0[αjej(z1)1An]E0[ej(z1)1An]1E0[ej(z1)1An]0. The variables αj and Mj are given by (17) and (18).

Proof of Proposition 6.

We have

𝔼z[en(z1)𝟏An] =𝔼z[en(z1)𝟏τ𝒵11n,An]+j=1n-1𝔼z[en(z1)𝟏{τ𝒵jjn-j,Zj=z1,An}]
=𝔼z[e1(z1)𝟏τ𝒵11n,An]+j=1n-1𝔼z[ej+1(z1)𝟏{τ𝒵jjn-j,Zj=z1,An}]
Proposition 3𝔼z[e1(z1)𝟏τ𝒵11n,An]+j=1n-1𝔼z[(αjej(z1)+Mj)𝟏{τ𝒵jjn-j,Zj=z1,An}]
𝔼z[e1(z1)𝟏τ𝒵11n,An]+j=1n-1𝔼z[(αjej(z1)+Mj)𝟏{Zj=z1,An}𝔼[𝟏{τ𝒵jjn-j}|n]]
e1(z1)𝔼z[𝟏τ𝒵11n]+j=1n-1[τ𝒵jn-jj|n]𝔼z[(αjej(z1)+Mj)𝟏An]
ϵn+j=1n-1an-jnα¯j𝔼z[ej(z1)𝟏An],

In the second equality, we use that en(z1) does not change as long as the state z1 is not reached. This completes the proof.∎

F.3 Proof of Theorem 1

Proof of Theorem 1.

We split the proof in two steps. We first show (19) and (20) when 𝒵 is finite then when 𝒵 is countable.

Step (i):

In this part, we prove (19) and (20) when the space 𝒵 is finite. For clarity, we first proof the results when (Zk)k0 is Markov then we explain how to show it for the general case.

Sub-step (i-1):

Here we demonstrate (19) when the space 𝒵 is finite. Let z1𝒵 and n*. Using Proposition 5, the sequence vn(z1)=α¯n(z1)𝔼[en(z1)] verifies

vn(z1)ϵ¯n(z1)+α¯n(z1)j=1n-1an-j(z1)vj(z1),

with ϵ¯n(z1)=α¯n(z1)ϵn(z1). For clarity and since there is no ambiguity, we forget the dependencies to z1 in the rest of the proof. Thus, using Lemma 1, we get

vn=j=0nan-jμjϵ¯j,

with (anμ)n0 defined in (33). We can assimilate the sequence ϵ¯ to the measure μ=k0ϵ¯kδk with δk the Dirac measure at k. Since ϵk=e1(z1)ak+k=1n-1an-k𝔼[Mk], kak<, k0𝔼[γk2]< and 𝔼[Mk]=O(𝔼[γk2]) for all the algorithms, we get k0ϵk< using the properties of the Cauchy product between two sequences. This property ensures that the measure μ has a finite mass. Additionally, using Lemma 2, we have anμjn0 for any j0. Thus, the dominated convergence theorem ensures that vnn0.

When (Zk)k0 is not Markov, one can show using the same methodology above and Proposition 6 that for any sequence 𝐳=(zk)k*𝒵 the quantity u~n=𝔼[en(z1)𝟏A𝐳n] converges towards 0 with A𝐳n={Z1=𝐳1,,Zn=𝐳n}. Since en𝟏A𝐳ne0𝟏A𝐳n, we use the dominate convergence theorem to get

limn𝔼[en]=𝐳Slimn𝔼[en𝟏A𝐳n]=0,

where S is the set of sequences valued in 𝒵 and A𝐳n={Z1=𝐳1,,Zn=𝐳n} for any 𝐳𝒮.

Sub-step (i-2):

In this step, we show (20) when the space 𝒵 is finite. Since γk(z)=γ𝟏{Zk=z} for all (k,z)×𝒵, the quantity αk is constant (i.e αk=α). Let M=supnMn. Using a direct induction and Proposition 5, we get un=𝔼[en(z1)]c with c solution of the equation c=γ2M+αc (i.e c=γ2M1-α). We define (mn)n0 as follows:

mn=γ2M+(e1(z1))an(z1)=a~n=bn+αj=1nan+1-jvj,n0.

By direct induction, we have unmn for all n0. Lemma 1 gives

mn=l=0nan-lμlbl=γ2M(l=0nan-lμl)+(l=0nan-lμla~l),n0,

Using a direct induction and (33), we check that an-lμl=αn-lan-l*n-l. The sequence a*l is defined recursively such that an*1=an and an*l+1=(a*l*a)n=j=1n-1an+1-j*laj for any n. Using this expression of an-lμl, we get

γ2M(l=0nan-lμl)γ2M(l=0nαl)=γ2M1-αn1-α=(1-αn)c.

Thanks to Lemma 3, we know that an(z1)βnd with β[0,1) and d>0 which gives l0rna~n(z1)< for any r<1β. Let r<min(1/β,1/α). Using the properties of the Cauchy product between sequences, we have

n1rn(l=0nan-lμla~l)n1(l=0nrn-lαn-lrla~l)(n1rnαn)(n1rna~n)<. (43)

Inequality (43) ensures the existence of M¯>0 such that (l=0nan-lμla~l)α~nM¯ for any α~<max(β,α). This shows that mn(1-α~n)c+α~nM¯e1(z1) which completes the proof when 𝒵 is finite.

To handle the case where (Zk)k0 is not Markov, we first show using Proposition 6 and the methodology above that for any sequence 𝐳=(zk)k*𝒵 we have 𝔼[en𝟏An](1-α~n)c+α~nM¯e1(z1)𝔼[𝟏An] with A𝐳n={Z1=𝐳1,,Zn=𝐳n}. Then, we use the same approach of Section Sub-step (i-1) to deduce that un(1-α~n)c+α~nM¯e1(z1).

Step (ii):

Now we assume that 𝒵 is countable. We prove the result when (Zk)k0 is Markov then we explain how to extend it in the general case.

Sub-step(ii-1):

Let ϵ>0. Since E1=k1e1(zk)<, there exists k0 such that for all kk0, we have kk0e1(zk)<ϵ2. We write Ak0 for the set Ak0={zk,kk0}. Since Ak0 is finite, we use Sub-step (i-1) to show the existence of k1 such that for all kk1, we have kk0ek1(zk)<ϵ2. Let us now take kk1. Using (el(z))l1 is non-increasing for any z𝒵, we get Ek=kk0e1(zk)+k<k0e1(zk)ϵ2+ϵ2=ϵ.

Sub-step(ii-2):

Let M=supnMn and ϵ=(1-α)γ2Mr>0. Since E1=k1e1(zk)<, there exists k0 such that for all kk0, we have kk0e1(zk)<ϵ. We write Ak0 for the set Ak0={zk,kk0}. Since Ak0 is finite, we use Sub-step (i-2) to show that for all n, we have en(zk)<(1-α~n)c+α~nM¯e1(z1) with zkAk0. Using ϵ=(1-α)γ2Mr(1-α)γ2M1-αr=(1-α)cr and (el(z))l1 is non-increasing for any z𝒵, we deduce that en=max(supk>k0e1(zk),supkk0e1(zk))max((1-α~n)c+α~nM¯e1(z1),ϵ)(1-α~n)c(1+r)+α~nM¯e1(z1). ∎

Appendix G Proof of Proposition 4

Lemma 4.

Under Assumptions 4, 5 and 6, we have

𝔼[Eki+1]αiM¯γ0(1+r)u-Bγ0,i. (44)
Proof of Lemma 4.

Let us show the result by induction on i. Since the proof for the initialisation step and the induction step are similar, we only show the induction step. We assume that (44) holds for a given i-1. We consider the delayed error sequence unki=𝔼[en+ki|ki]. By following the same methodology used in the proof of Proposition 5, we get

unkiϵnki+αkij=1k-1aki+k-jujki,k*.

This allows us to use Theorem 1 in order to exhibit a dominating sequence M¯ which verifies

𝔼zi[unki](1-αkin)(1+r2)ci+αkinM¯u1ki,n,

with ci=γkiM¯(1-αki)αi+1M¯γ0u-Bγ0 and u1kiαiM¯(1+r)γ0u-Bγ0 under the induction hypothesis. Thus, by taking ki+1=ki+log(αr2(1+r)M¯)log(αki) and ni+1=ki+1-ki, we check that αkini+1M¯αr2(1+r) which gives

𝔼zi[eki+1]=𝔼zi[uni+1ki](1+r2)ci+r2u1kiαi+1M¯(1+r)γ0u-Bγ0.

This completes the proof. ∎

Proof of Proposition 4.

Let L=|log(αr2(1+r)M¯)| and i*=sup{i0,αkiβ}. When αkiβ (i.e f(γki)(1-β)), we use the inequality log(1-x)-qx, for all x[0,1-β] with q=12-β, to get

ki Lγkjγ¯-1log(β)+Lγkj<γ¯-1log(1-f(γki))
Lii*-1log(β)+Li>i*1qf(γki)=Lii*-1log(β)+Lqui>i*1γki(1-b¯γki),

with b¯=bu. Then, we use that 1x(1-b¯x)=1x+b¯1-b¯x to deduce

ki Li-1log(β)+(i-i*)+Lb¯qu(1-b¯γ¯1)+Lqui>i*1γki.

Since γki=αiγ0, we conclude that

kiL1i+L2(i-i*)++L3(α-(i-i*+1)+-1).

with L1=-Llog(β), L2=Lb¯qu(1-b¯γ¯1) and L3=Lquγ0(1-α-1). This completes the proof. ∎

Appendix H Proof of the convergence of (vnk)n1,k(*)2

In this section we prove the following result.

Proposition 7.

The sequence (vnk)n1,k1 converges point-wise towards v on DT×Dq when n and k.

Proof of Proposition 7.

We prove this result in three steps. First, we show that v can be approximated by a numerical scheme v¯k. Then, we replace v¯k by another scheme vk that also converges towards v. Finally, we show that vnk tends to vk when n.

Step (i):

We start with our initial control problem where the agents may choose its trading speed at any time. It was studied by many authors, see for example [MFGTradeCrow2016], who show that the optimal trading speed verifies

ν(t,q)=h1(t)-qh2(t)2κ, (45)

with h1:[0,T] and h2:[0,T]+ a positive function. The above equation shows that when q is large enough ν has the opposite sign of q. Let us take such a large enough q~ and place ourselves in Sq=[-q¯,q¯] with q¯=2q~. In such case, Equation (45) ensures that ν is bounded by ν¯ on Sq. This show that we can replace the initial set of controls by the subset D[-ν¯,ν¯] where the sign of ν positive is (resp. negative) when the inventory is -q~ (resp. q~). Now, we approximate this problem in a classical way using the numerical scheme v¯k defined such that

v¯k(nt,nq)=supνDa𝔼[Mnt+1nt-ϕQntΔt2Δ+v¯k(nt+1,nq+1ν)|QntΔt=nqΔq],(nt,nq)DT×Dq,

with Mnt+1nt=MΔt(nt+1)ntΔ, nq+1ν the index such that Q(nt+1)Δtν=nq+1νΔq and Da=D{iΔq,i}. Note that the set Da is contained in our admissible set of controls A(ntΔt,nqΔq) when Δ is small enough. This shows that we can substitute A(ntΔt,nqΔq) to Da without any restriction. The convergence of (v¯k)k(*)2 towards v on the set DT×Dq when k is standard.

Step (ii):

We denote by vk the numerical scheme

vk(nt,nq)=𝔼[supνDa{Mnt+1nt-ϕQntΔt2Δ+vk(nt+1,nq+1ν)}|QntΔt=nqΔq],(nt,nq)DT×Dq,.

Let us show that v¯k and vk have the same limit. For this, we use a backward recurrence. For the moment, we assume that |supν𝔼[Mnt+1nt]-𝔼[supνMnt+1nt]|KΔt2 and we will prove it at the end of Step (ii). We want to show that |v¯k(nt,nq)-vk(nt,nq)|K(T-t)Δt for all (nt,nq)DT×Dq. At the terminal time v¯k and vk coincide. We move now to the induction part. We have

|v¯k(nt,nq)-vk(nt,nq)| =|supν𝔼[Mnt+1nt-ϕq2Δt+v¯k(nt+1,nqt+1)]
  -𝔼[supνMnt+1nt-ϕq2Δt+vk(nt+1,nqt+1)]|
|supν𝔼[Mnt+1nt]-𝔼[supνMnt+1nt]|+|𝔼[supν{v¯k(nt+1,nqt+1)-vk(nt+1,nqt+1)}]|
𝔼[supν{KΔt2+K(T-t-Δt)Δt}]
=K(T-t)Δt.

In the third inequality, we use the induction assumption to complete the proof. Now, let us show that |supν𝔼[Mnt+1nt]-𝔼[supνMnt+1nt]|KΔt2. We write t=ntΔt, ΔQ=Qt+Δt-Qt=νΔt, ΔS=St+Δt-St and ΔS¯=tt+Δt(Ss-St)𝑑s. Thus, we have

Mnt+1nt =(Wt+Δ-Wt)+(Qt+ΔSt+Δ-QtSt)
=-νΔS¯-ΔQSt-κν2Δt+QtΔS+ΔQSt+ΔQΔS
=-νΔS¯-κν2Δt+QtΔS+νΔtΔS. (46)

The above equation shows supνMnt+1nt=(ΔtΔS-ΔS¯)24κΔt+QtΔS. Using 𝔼[ΔS]=αΔt and

𝔼[ΔS¯]=tt+Δα(s-t)𝑑s=αΔt2/2,

we get

supν𝔼[Mnt+1nt]=α2Δt316κ+αQtΔt, and 𝔼[supνMnt+1nt]=α2Δt316κ+σ2Δt212κ+αQtΔt.

Thus we deduce that |supν𝔼[Mnt+1nt]-𝔼[supνMnt+1nt]|KΔt2 with K=σ212κ.

Step (iii):

Theorem 1, proves that vnk converges towards vk. Thus by composition we have vnk converges point-wise towards v when n and k which completes the proof. ∎

References