On stochastic gradient Langevin dynamics with dependent data streams: the fully non-convex case

  • 2019-05-30 16:09:26
  • Ngoc Huy Chau, Éric Moulines, Miklos Rásonyi, Sotirios Sabanis, Ying Zhang
  • 19


We consider the problem of sampling from a target distribution which is\emph{not necessarily logconcave}. Non-asymptotic analysis results areestablished in a suitable Wasserstein-type distance of the Stochastic GradientLangevin Dynamics (SGLD) algorithm, when the gradient is driven by even\emph{dependent} data streams. Our estimates are sharper and \emph{uniform} inthe number of iterations, in contrast to those in previous studies.


Quick Read (beta)

On stochastic gradient Langevin dynamics with dependent
data streams: the fully non-convex case thanks: All the authors were supported by The Alan Turing Institute, London under the EPSRC grant EP/N510129/1. N. H. C. and M. R. also enjoyed the support of the NKFIH (National Research, Development and Innovation Office, Hungary) grant KH 126505 and the “Lendület” grant LP 2015-6 of the Hungarian Academy of Sciences. Y. Z. was supported by The Maxwell Institute Graduate School in Analysis and its Applications, a Centre for Doctoral Training funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016508/01), the Scottish Funding Council, Heriot-Watt University and the University of Edinburgh. We thank the Alan Turing Institute, London, UK; the Rényi Institute, Budapest, Hungary and the École Polytechnique, Palaiseau, France for hosting research meetings of the authors.

N. H. Chau Alfréd Rényi Institute of Mathematics, Hungarian Academy of Sciences, Hungary. É. Moulines Centre de Mathématiques Appliquées, UMR 7641, Ecole Polytechnique, France. M. Rásonyi Alfréd Rényi Institute of Mathematics, Hungarian Academy of Sciences, Hungary. S. Sabanis Y. Zhang School of Mathematics, The University of Edinburgh, UK.
June 19, 2019

We consider the problem of sampling from a target distribution which is not necessarily logconcave. Non-asymptotic analysis results are established in a suitable Wasserstein-type distance of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, when the gradient is driven by even dependent data streams. Our estimates are sharper and uniform in the number of iterations, in contrast to those in previous studies.

1 Introduction

In this paper, the problem of sampling from a target distribution


is investigated, where θd and the function U:d satisfies Lipschitz continuity and a certain dissipativity condition. We establish non-asymptotic convergence rates for the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, based on the stochastic differential equation

dxt=-βU(xt)dt+2β-1dBt (1)

where B is the standard Brownian motion in d and β+ is the inverse temperature parameter.

Non-asymptotic convergence rates of Langevin dynamics based algorithms for approximate sampling of log-concave distributions have been intensively studied in recent years, starting with [7]. This was followed by [9], [13], [12], [5] amongst others.

Relaxing log-concavity is a more challenging problem. In [21], the log-concavity assumption is replaced by a logconcavity at infinity condition and L1 and L2-Wasserstein distances convergence rates are obtained. In a similar setting, [6] analyzes sampling errors in the L1-Wasserstein distance for both overdamped and underdamped Langevin MCMC. In [23], only a dissipativity condition is assumed and convergence rates are obtained in the L2-Wasserstein distance. Moreover, a clear and strong link between sampling via SGLD algorithms and non-convex optimization is highlighted. One can further consult [27], [8] and references therein.

In the present paper, we impose the dissipativity condition as in [23]. Using a different Wasserstein-type metric, we obtain shaper estimates and allow for possibly dependent data sequences. The key new idea is that we compare the SGLD algorithm to a suitable auxiliary continuous time processes inspired by (1) and we rely on contraction results developed in [15] for (1).

2 Main results

Let (Ω,,P) be a probability space. We denote by 𝔼[X] the expectation of a random variable X. For 1p<, Lp is used to denote the usual space of p-integrable real-valued random variables. Fix an integer d1. For an d-valued random variable X, its law on (d) (the Borel sigma-algebra of d) is denoted by (X). Scalar product is denoted by ,, with || standing for the corresponding norm (where the dimension of the space may vary depending on the context). We fix a discrete-time filtration 𝒢n:=σ(εk,kn,k), n where (εn)n is an i.i.d. sequence with values in some Polish space. This represents the flow of past information. The notation 𝒢 is self-explanatory. We also define the decreasing sequence of sigma-algebras 𝒢n+:=σ(εk,k>n), n, representing future information at the respective time instants.

Fix an d-valued random variable θ0, representing the initial value of the procedure we consider. For each β,λ>0, define the d-valued random process θnλ, n by recursion:

θ0λ:=θ0,θn+1λ:=θnλ-λH(θnλ,Xn+1)+2λβξn+1,n, (2)

where H:d×md is a measurable function, Xn, n is an m-valued, (𝒢n)n-adapted process and ξn, n is an independent sequence of standard d-dimensional Gaussian random variables.

We interpret Xn, n as a stream of data and ξn, n as an artificially generated noise sequence. We assume throughout the paper that θ0, 𝒢 and (ξn)n are independent.

Let U:d+ be continuously differentiable with derivative h:=U. Let us define the probability


It is implicitly assumed that de-βU(θ)dθ< and this is indeed the case under Assumption 2.5 below, as easily seen. Our objective is to (approximately) sample from the distribution πβ using the scheme (2).

We now present our assumptions. First, the moments of the initial condition need to be controlled.

Assumption 2.1.

Next, we require joint Lipschitz-continuity of H.

Assumption 2.2.

There is K1< and K2< such that for all θ,θRd and x,xRm,


We set

H*:=|H(0,0)|. (3)

The data sequence Xn, n need not be i.i.d., we require only a mixing property, defined in Section 3.1 below.

Assumption 2.3.

The process Xn, nN is conditionally L-mixing with respect to (Gn,Gn+)nN. It satisfies

𝔼[H(θ,Xn)]=h(θ),θd,n. (4)
Remark 2.4.

Stationarity of the process Xn, n would also be natural to assume but we need only the weaker property (4).

Finally, we present a dissipativity condition on H.

Assumption 2.5.

There exist a, b>0 such that, for all θRd and xRm,

H(θ,x),θa|θ|2-b. (5)

When Xn=c for all n for some cm (i.e. when H(θ,Xt) is replaced by h(θ) in (2)) then we arrive at the well-known unadjusted Langevin algorithm whose convergence properties have been amply analyzed, see e.g. [7, 12, 6, 21]. The case of i.i.d. Xn, n has also been investigated in great detail, see e.g. [23, 27, 21].

In the present article, better estimates are obtained for the distance between (θnλ) and πβ than those of [23] and [27]. Such rates have already been obtained in [2] for strongly convex U and in [21] for U that is convex outside a compact set. Here we make no convexity assumptions at all. This comes at the price of using the metric W1 defined in (6) below while [23, 27, 21, 2] use Wasserstein distances with respect to the standard Euclidean metric, see (10) below.

Another novelty of our paper is that, just like in [2], we allow the data sample Xn, n to be dependent. As observed data have no reason to be i.i.d., we believe that such a result is fundamental to assure the robustness of the sampling method based on the stochastic gradient Langevin dynamics (2).

For any integer q1, let 𝒫(q) denote the set of probability measures on (q). For μ,ν𝒫(d), let 𝒞(μ,ν) denote the set of probability measures ζ on (2d) such that its respective marginals are μ,ν. Define

W1(μ,ν):=infζ𝒞(μ,ν)dd[|x-y|1]ζ(dx,dy), (6)

which is the Wasserstein-1 distance associated to the bounded metric |x-y|1, x,yd.

Remark 2.6.

In this work, the constants appearing are often denoted by Cj for some natural number j. Without further mention, these constants depend on θ0, K1, K2, a, b, H*, β, d and on the process Xn, n through the quantities (13) below and, unless otherwise stated, they do not depend on anything else. In case of further dependencies (e.g. dependence on p, which is due to the drift condition, coming from Lemma 3.6 below), we signal these in parentheses, e.g. C6(p).

Our main contribution is summarized in the following result. Define

λmax=min{a/2K12,1/a}. (7)
Theorem 2.7.

Let Assumptions 2.1, 2.2, 2.3 and 2.5 be valid. Then there are finite constants C0>0, C1, C2 such that, for every 0<λλmax.

W1((θnλ),π)C1e-C0λn+C2λ,n (8)

Example 3.4 of [2] suggests that the best rate we can hope to get in (8) is λ, even in the convex case. The above theorem achieves this rate. We remark that, although the statement of Theorem 2.7 concerns the discrete-time recursive scheme (2), its proof is carried out entirely in a continuous-time setting, in Section 3. It relies on techniques from [2] and [15]. The principal new idea is the introduction of the auxiliary process Y~tλ(𝐱), t+, see (25) below.

Consider now a strengthening of Assumption 2.5 by imposing convexity outside a compact set.

Assumption 2.8.

There exist b,a>0 such that, for each θ, θ satisfying |θ-θ|>b,

H(θ,x)-H(θ,x),θ-θa|θ-θ|2,xm. (9)

Then, we can recover analogous results to Theorem 2.7 by considering the L1-Wasserstein distance. At this point, let us recall the definition of the familiar, “usual” Wasserstein-p (also know as Lp-Wasserstein) distance, for p1 :

W~p(μ,ν):=infζ𝒞(μ,ν)(dd|x-y|pζ(dx,dy))1/p,μ,ν𝒫(d). (10)
Theorem 2.9.

Let Assumptions 2.1, 2.2, 2.3 and 2.8 be valid. Then there are constants C3,C4,C5>0 such that, for every 0<λλmax,

W~1((θnλ),π)C4e-C3λn+C5λ,n. (11)

Strengthening the monotonicity condition (9) even guarantees convergence in W~2.

Assumption 2.10.

There exists a>0 such that, for each θ, θRd,

Theorem 2.11.

Let Assumptions 2.1, 2.2, 2.3 and 2.10 be valid. Then there are constants C3,C4,C5>0 such that

W~2((θnλ),π)C4e-C3λn+C5λ,n (12)

holds for every 0<λλmax.

2.1 Related work and our contributions

In the remarkable paper [23], a non-convex optimization problem is considered in the context of empirical risk minimization, which plays a central role in ML algorithms. The excess risk is decomposed into a sampling error resulting from the application of SGLD, a generalization error and a suboptimality error. Our aim is to improve the sampling error in the non-convex setting and provide sharper convergence estimates under more relaxed conditions. To this end, we focus on the comparison of our results with Proposition 3.3 of [23].

Condition (A.5) of [23] is (much) stronger than Assumption 2.1 above. Assumption 2.5 is identical to (A.3) in [23]. Condition (A.2) in [23] corresponds to Lipschitz-continuity of H in its first variable with a Lipschitz-constant independent from its second variable and (A.1) there means that H(0,), u(0,) are bounded where U(θ)=𝔼[u(θ,X0)] and H(,)=θu(,). Hence Assumption 2.2 here is neither stronger nor weaker than (A.2) of [23], they are incomparable conditions. In any case, Assumption 2.2 does not seem to be restrictive for practical purposes. Condition (A.4) in [23] is implied by Assumptions 2.2 and 2.3.

We obtain stronger rates (which we believe to be optimal) than those of [23]. More precisely, we obtain a rate λ1/2 in (8) for the W1 distance while [23] only obtains λ5/4n (which depends on n) but in the W~2 distance. Furthermore, we allow a possibly dependent data sequence. In other words, [23] is applicable only if Xn,n is i.i.d. while Assumption 2.3 suffices for the derivation of our results.

Now let us turn to [21]. The comparison is made only in the presence of convexity (outside a compact set) for U as it is a requirement for the results in [21]. Their Assumption 1.1 is precisely Assumptions 2.2 and 2.8 combined, however this is stipulated for h in [21] while we need it for H(,x), for all x, as we allow dependent data streams. Furthermore, Assumption 1.3 in [21] requires that the variance of H(θ,X0) is controlled by a power of the step size λ while we do not need such an assumption. The second conclusion of their Theorem 1.4 (with α=1, using their notation α) is the same as our Theorem 2.9.

Remark 2.12.

In the particular case where Xn, n are i.i.d., one can replace Theorem 3.2 below by Doob’s inequality in the arguments for proving Theorem 2.7. The full power of Assumption 2.2 is used only in Lemma 3.14. When Xn are i.i.d. then Lemma 3.14 is trivial and it is enough to assume only (A.2) of [23] instead of Assumption 2.2.

3 Proofs

3.1 Conditional L-mixing

L-mixing processes and random fields were introduced in [16]. In [4], the closely related concept of conditional L-mixing was created. We define this concept below and recall some related results. This section is an almost exact replica of Section 2 in [2].

We assume that the probability space is equipped with a discrete-time filtration n, n as well as with a decreasing sequence of sigma-fields n+, n such that n is independent of n+, for all n.

Fix an integer q1 and let Dq be a set of parameters. A measurable function U:×D×Ωm is called a random field. We drop the dependence on ωΩ in the notation henceforth and write (Un(θ))n,θD. A random process (Un)n corresponds to a random field where D is a singleton. A random field is Lr-bounded for some r1 if


Now we define conditional L-mixing. Recall that, for any family Zi, iI of real-valued random variables, ess.supiIZi denotes a random variable that is an almost sure upper bound for each Zi and it is, almost surely, smaller than or equal to any other such bound.

Let (Un(θ))n,θD be Lr-bounded for each r1. Define, for each n, and for τ0,

Mrn(U) :=esssupθDsupm𝔼1/r[|Un+m(θ)|r|n]
γrn(τ,U) :=esssupθDsupmτ𝔼1/r[|Un+m(θ)-𝔼[Un+m(θ)|n+m-τ+n]|r|n],
Γrn(U) :=τ=0γrn(τ,U). (13)

When necessary, Mrn(U,D), γrn(τ,U,D) and Γrn(U,D) are used to emphasize dependence of these quantities on the domain D which may vary.

Definition 3.1 (Conditional L-mixing).

We call (Un(θ))nN,θD uniformly conditionally L-mixing (UCLM) with respect to (Rn,Rn+)nN if (Un(θ))nN is adapted to (Rn)nN for all θD; for all r1, it is Lr-bounded; and the sequences (Mrn(U))nN, (Γrn(U))nN are also Lr-bounded for all r1. In the case of stochastic processes (when D is a singleton) the terminology “conditionally L-mixing process” is used.

Conditionally L-mixing encompasses a broad class of processes (linear processes, functionals of Markov processes, etc.), see Example 2.1 in [2]. The following maximal inequality is pivotal for our arguments.

Theorem 3.2.

Assume that Rk:=σ(ϵj,jk) for some Polish space-valued independent random variables ϵj, jk, jZ. Fix r>2 and kN. Let Wn, nN be a conditionally L-mixing process w.r.t. (Rn,Rn+), satisfying E[Wn|Rk]=0 a.s. for all nk. Let m>k and let bj, k<jm be deterministic numbers. Then we have

𝔼1/r[maxk<jm|i=k+1jbiWi|r|k]C(r)(i=k+1mbi2)1/2Mrk(W)Γrk(W), (14)

almost surely, where C(r) is a deterministic constant depending only on r but independent of k,m.


See Theorem 2.6 of [4] (there, ϵj, j, are assumed to be i.i.d.; the proof, though, trivially works for a merely independent sequence, too). ∎

Remark 3.3.

We will apply Theorem 3.2 with the choice r=3. In that case it is known that C(3)20, see Theorem A.1 of [2].

Lemma 3.4.

Let Xt, tN be conditionally L-mixing. Let Assumption 2.2 hold true. Then, for each iN, the random field H(θ,Xt), tN, θB(i), the closed ball of radius i centered at 0, is uniformly conditionally L-mixing with

Mrn(H(θ,X),B(i))K1i+K2Mrn(X)+H* (15)


Γrn(H(θ,X),B(i))2K2Γrn(X). (16)

See Lemma 6.4 and Example 2.4 of [2].

Lemma 3.5.

Let DRd be bounded. Fix nN and let ψt,tn be a sequence of Rn measurable random variables. Let Xt(θ) be conditionally L-mixing and Lipschitz in θD. Define the process Yt=Xt(ψt),tn. Then


The proof is identical to that of Lemma 6.3 of [4], noting the Lipschitz continuity. ∎

3.2 Further notation and introduction of auxiliary processses

Throughout this section we assume that the hypotheses of Theorem 2.7 are valid. Note that Assumption 2.2 implies

|h(θ)-h(θ)|K1|θ-θ|,θ,θd, (17)

Assumption 2.5 implies

h(θ),θa|θ|2-b,θd. (18)

Also, Assumption 2.2 implies

|H(θ,x)|K1|θ|+K2|x|+H*, (19)

with the constant H* defined in (3). We will employ a family of Lyapunov-functions in the sequel. For this purpose, let us define, for each p2, vp(x):=(1+x2)p/2, for any real x0, and similarly


Notice that these functions are twice continuously differentiable and

lim|θ|Vp(θ)Vp(θ)=0. (20)

Let 𝒫Vp denote the set of μ𝒫(d) satisfying


For μ𝒫(d) and for a non-negative measurable f:d, the notation


is used. The following functional is pivotal in our arguments as it is used to measure the distance between probability measures. We define, for any p1 and μ,ν𝒫Vp,

w1,p(μ,ν):=infζ𝒞(μ,ν)dd[1|θ-θ|](1+Vp(θ)+Vp(θ))ζ(dθdθ), (21)

Though w1,p is not a metric, it satisfies trivially

W1(μ,ν)w1,p(μ,ν). (22)

In the sequel we will need the case p=2, that is, w1,2.

Our estimations are carried out in a continuous-time setting, so we define and discuss a number of auxiliary continuous-time processes below. First, consider Lt, t+ defined by the stochastic differential equation (SDE)

dLt=-h(Lt)dt+2βdBt,L0:=θ0, (23)

where B is standard Brownian motion on (Ω,,P), independent of 𝒢σ(θ0). Its natural filtration is denoted by t, t+ henceforth. The meaning of is clear. Equation (23) has a unique solution on + adapted to (t)t+ since h is Lipschitz-continuous by (17). We proceed by defining, for each λ>0,


Notice that B~tλ:=Bλt/λ, t+ is also a Brownian motion and

dLtλ=-λh(Ltλ)dt+2λβdB~tλ,L0λ=θ0. (24)

Define tλ:=λt, t+, the natural filtration of B~tλ, t+.

Let us also introduce, for each λ>0 and for each 𝐱=(x0,x1,)(m), the process Y~λ(𝐱), t+ satisfying

dY~tλ(𝐱)=-λH(Y~tλ(𝐱),xt)dt+2λβdB~tλ, (25)

with initial condition Y~0λ(𝐱)=θ0. Due to Assumption 2.2, there is a unique solution to (25) which is adapted to (tλ)t+. Moreover, for any given s0 and ts, consider the following auxiliary process, which plays an important role in the derivation of our results,

dζ~λ(t,s;𝐱,θ)=-λH(ζ~λ(t,s;𝐱,θ),xt)dt+2λβdB~tλ,for t>s, (26)

with initial condition ζ~λ(s,s;𝐱,θ)=θ, and notice that ζ~λ(t,s;𝐱,Y~sλ(𝐱))=Y~tλ(𝐱).

Let us now define the continuously interpolated Euler-Maruyama approximation of Y~tλ(𝐱), t+ via

dYtλ(𝐱)=-λH(Ytλ(𝐱),xt)dt+2λβdB~tλ, (27)

with initial condition Y0λ(𝐱)=θ0. Notice at this point that (27), can be solved by a simple recursion. In addition, if one considers Ytλ(𝐗), t+, where 𝐗 is a random element in (m) defined by 𝐗i:=Xi, i, then for each integer n,

(Ynλ(𝐗))=(θnλ). (28)

3.3 Layout of the proof

In view of (28), the main objective is to bound W1((Ytλ(𝐗)),πβ), which is decomposed as follows


The last term is controlled using the drift condition (29) below, due to the dissipativity Assumption 2.5, and Lipschitzness of the mean field h, see (17). The second term is controlled uniformly in t by a quantity which is proportional to λ. For that purpose, we use novel results by [15], which give us a contraction in w1,2, see Proposition 3.12 and, in particular, (49). To obtain this result, the mixing condition also plays a crucial role, see Lemma 3.19. Finally, the first term is controlled uniformly in t by a quantity which is also proportional to λ, see Corollary 3.25. This is based on Kullback-Leibler distance estimates which go back to [10].

3.4 Crucial estimates

The next lemma shows that the SDEs (25) and (23) satisfy standard drift conditions involving the functions Vp. Note that, on the left-hand side of (29) below, the infinitesimal generator of the diffusion process L appears which is applied to the function Vp.

Lemma 3.6.

Let Assumption 2.5 hold. For each p2,

ΔVp(θ)β-h(θ),Vp(θ)-C6(p)Vp(θ)+C7(p),θd, (29)

and, for all xRm,

ΔVp(θ)β-H(θ,x),Vp(θ)-C6(p)Vp(θ)+C7(p),θd, (30)

where C6(p)=ap/4, C7(p)=(3/4)apvp(M¯(p)) with

M¯(p)=1/3+4b/(3a)+4d/(3aβ)+4(p-2)/(3aβ). (31)

By direct calculation, the left-hand side of (29) equals

dp(|θ|2+1)(p-2)/2β+p(p-2)(|θ|2+1)(p-4)/2|θ|2β-ph(θ),(|θ|2+1)(p-2)/2θ. (32)

By Assumption 2.5, see also (18), the third term of (32) is dominated by

-pa|θ|2(|θ|2+1)(p-2)/2+pb(|θ|2+1)(p-2)/2. (33)

Then, for |θ|>M¯(p), one observes that


As for |θ|M¯(p), one obtains


Take into consideration of the two cases, we have for all θd,


The statement (30) follows in an identical way, noting that the constants which appear do not depend on x.

Now, we proceed with the required moment estimates which play a crucial role in the derivation of the main results as given in Theorems 2.7, 2.9 and 2.11.

Lemma 3.7.

Let Assumptions 2.1, 2.2 and 2.5 hold. Let p2. For 0st, let ζ~λ(t,s;x,θ~) be the solution of (26) with an initial condition θ~L2p-2. Then, for any t>s0,

sup𝐱(m)supts𝔼[Vp(ζ~λ(t,s;𝐱,θ~))]𝔼[Vp(θ~)]+3vp(M¯(p)) (34)

where M¯(p) is defined in (31).


We note that 2p-2p for p2, hence 𝔼[Vp(θ~)]<. For any fixed sequence 𝐱(m) and t>s0, by Itô’s formula, one obtains almost surely,


which implies


where the expectation of the stochastic integral disappears since sup0stE(Vp)2(ζ~λ(t,s;𝐱,θ~))< by θ~L2p-2. Differentiating both sides and using Lemma 3.6, one obtains

ddt𝔼[Vp(ζ~λ(t,s;𝐱,θ~))] =𝔼[λΔVp(ζ~λ(t,s;𝐱,θ~))β-λH(ζ~λ(t,s;𝐱,θ~),xt),Vp(ζ~λ(t,s;𝐱,θ~))]

which yields

𝔼[Vp(ζ~λ(t,s;𝐱,θ~))]e-λC6(p)(t-s)𝔼[Vp(θ~)]+C7(p)C6(p)(1-e-λC6(p)(t-s))𝔼[Vp(θ~)]+C7(p)C6(p), (35)

where C6(p), C7(p) and M¯(p) are defined in Lemma 3.6. Taking supremum over t and 𝐱 on both sides yield (34).

Corollary 3.8.

Let Assumptions 2.1, 2.2 and 2.5 hold. For any integer p2,

sup𝐱(m)supt+𝔼[Vp(Y~tλ(𝐱))]𝔼[Vp(θ0)]+3vp(M¯(p)), (36)

where M¯(p) is defined in (31).


By noting that Y~tλ(𝐱)=ζ~λ(t,0;𝐱,θ0), one immediately recovers the desired result from Lemma 3.7. ∎

Corollary 3.9.

Let Assumptions 2.1, 2.2 and 2.5 hold. For any integer p2,

supt+𝔼[Vp(Y~tλ(𝐗))]𝔼[Vp(θ0)]+3vp(M¯(p)), (37)

where the constant M¯(p) is given explicitly in the proof of Lemma 3.6.


Due to the fact that the dissipativity condition 2.5 is uniform in x, all estimates are independent of x and therefore the result follows immediately from Corollary 3.8. ∎

Lemma 3.10.

Let Assumptions 2.1, 2.2 and 2.5 hold. For any λ<λmax (see (7)), nN and t(n,n+1] and any sequence x(Rm)N,

𝔼[|Ytλ(𝐱)|2p]𝔼[|θ0|2p]+λaM(p,d)|xn|2p+λaM(p,d)j=0n-1(1-aλ)j|xn-j|2p+M^(p,d), (38)

where the constants M(p,d) and M^(p,d) are given by

M(p,d)=(2λmax+4/a)p-1[1/a+dM~2(p)]c0pM^(p,d)=(2λmax+4/a)p-1[1/a+dM~2(p)]c2p+M~2(p)(λmax+2/a)p-1(d+(1/β)p-1(2dp(2p-1))p) (39)


M~(p):=2pp(2p-1)/(aβ). (40)

In particular,

𝔼|Ytλ(𝐱)|2𝔼|θ0|2+λc0|xn|2+λc0j=0n-1(1-aλ)j|xn-j|2+c1, (41)

where c0 and c1 are defined by

c0=8K22λmax,c1=2a-1(b+4λmax(H*)2+β-1)𝑎𝑛𝑑c2=2b+8λmax(H*)2. (42)

For any n and t(n,n+1], define


Consider initially only the calculations around the square of the norm of Ytλ(𝐱), i.e.

𝔼[|Ytλ(𝐱)|2|Ynλ(𝐱)] =𝔼[|Δn(𝐱,t)|2+(2λ/β)|B~tλ-B~nλ|2+2Δn(𝐱,t),(2λ/β)(B~tλ-B~nλ)|Ynλ(𝐱)]

Using Assumptions 2.2 and 2.5, one obtains for all λλmax,

|Δn(𝐱,t)|2 =|Ynλ(𝐱)|2-2λ(t-n)Ynλ(𝐱),H(Ynλ(𝐱),xn)+λ2|H(Ynλ(𝐱),xn)(t-n)|2
(1-aλ(t-n))|Ynλ(𝐱)|2+λ(t-n)(c0|xn|2+2b+8λmax(H*)2). (43)

The desired result (41) follows from an easy induction. For higher moments, the calculation is somewhat more involved. To this end, one calculates

𝔼[|Ytλ(𝐱)|2p|Ynλ(𝐱)] =𝔼[(|Δn(𝐱,t)|2+2λβ|B~tλ-B~nλ|2+2Δn(𝐱,t),2λβ(B~tλ-B~nλ))p|Ynλ(𝐱)]

where the last inequality is due to Lemma 4.4. The following inequality is used in the subsequent analysis


where p2, r,s0 and ϵ>0. We continue as follows

𝔼[|Ytλ(𝐱)|2p|Ynλ(𝐱)] |Δn(𝐱,t)|2p+𝔼[(l=02(p-1)(2pl+2)|Δn(𝐱,t)|2(p-1)-l|2λβ(B~tλ-B~nλ)|l)

which yields

𝔼[|Ytλ(𝐱)|2p|Ynλ(𝐱)] |Δn(𝐱,t)|2p+λ(t-n)22p-2p(2p-1)dβ|Δn(𝐱,t)|2p-2
+23p-3(λ(t-n))p(p(2p-1))p+1(dβ)p (44)

where the moment estimates of stochastic integrals are given in Theorem 7.1 in Chapter 1 of [22]. Using the notations in (42) and the inequality (43), one calculates

|Δn(𝐱,t)|2p {(1-aλ(t-n))|Ynλ(𝐱)|2+λ(t-n)(c0|xn|2+c2)}p
(1-aλ(t-n)2)p-1(1-aλ(t-n))|Ynλ(𝐱)|2p+(λ(t-n)+2a)p-1λ(t-n)(c0|xn|2+c2)p. (45)

Substituting (3.4) into (3.4) yields

𝔼[|Ytλ(𝐱)|2p|Ynλ(𝐱)] (1-aλ(t-n)2)p-1(1-aλ(t-n))|Ynλ(𝐱)|2p+(λ(t-n)+2a)p-1λ(t-n)(c0|xn|2+c2)p
+23p-3(λ(t-n))p(p(2p-1))p+1(dβ)p. (46)

Define M~(p) as in (40) and observe that for |Ynλ(𝐱)|dM~(p)


Consequently, on {|Ynλ(𝐱)|dM~(p)} the inequality (3.4) yields

𝔼[|Ytλ(𝐱)|2p|Ynλ(𝐱)] (1-aλ(t-n)4)(1-aλ(t-n)2)p-2(1-aλ(t-n))|Ynλ(𝐱)|2p
(1-aλ(t-n))|Ynλ(𝐱)|2p+λ(t-n)a(M(p,d)|xn|2p+M^(p,d)), (47)

where the constants M(p,d) and M^(p,d) are defined in (39). Moreover, on {|Ynλ(𝐱)|<dM~(p)} the inequality (3.4) yields again

𝔼[|Ytλ(𝐱)|2p|Ynλ(𝐱)](1-aλ(t-n))|Ynλ(𝐱)|2p+λ(t-n)a(M(p,d)|xn|2p+M^(p,d)) (48)

Due to (48) and (3.4), one obtains by induction


The desired result (38) follows immediately. ∎

Remark 3.11.

One notes here that (𝔼[|Ytλ(𝐱)|2p])1/(2p) is of order d, where d denotes the dimension of the problem.

A crucial contraction property is formulated in the next theorem.

Proposition 3.12.

Let Lt, tR+ be the solution of (23) with initial condition L0=θ0 which is independent of F and satisfies |θ0|L2. Then,

w1,2((Lt),(Lt))C9e-C8tw1,2((θ0),(θ0)),t+, (49)

where the constants C8 and C9 are given explicitly in Lemma 3.26. Fix a positive integer m. Suppose, for any t>m, ζ~λ(t,m;x,θ~) and ζ~λ(t,m;x,θ~) are the solutions of (26) with initial conditions θ~,θ~L2, which are independent of F. Then,

w1,2((ζ~λ(t,m;𝐱,θ~),(ζ~λ(t,m;𝐱,θ~))C9e-C8λ(t-m)w1,2((θ~),(θ~)),for any t>m. (50)

We first treat Lt, Lt. Assumption 2.1 of [15] holds with κ constant (and equal to K1) due to Assumption 2.2. Assumption 2.5 of the same paper is valid due to (20) and, moreover, Assumption 2.2 of [15] holds with V=V2 due to Lemma 3.6 (note that in that paper the diffusion coefficient is assumed to be 1 while in our case it is 2/β but this does not affect the validity of the arguments, only the values of the constants). Thus, in view of Corollary 2.3 of [15],


where C8 is given in Lemma 3.26 and the functional 𝒲ρ2 comes from [15] with the choice V:=V2, i.e.

𝒲ρ2(μ,ν):=infζ𝒞(μ,ν)ddf(|θ-θ|)(1+ϵV2(θ)+ϵV2(θ))ζ(dθ,dθ), where μ,ν𝒫V2,

where f is given in Lemma 3.26. Note that f is a concave, bounded and non-decreasing continuous function and ϵ is a positive constant, for more details see Theorem 2.2, Section 5 of [15]. Consequently, by using the definition of 𝒲ρ2, one obtains


where C10,C11 can be found in Lemma 3.26. Statement (49) follows with C9=C11/C10.

The same approach is used for ζ~λ(t,m;𝐱,θ~) and ζ~λ(t,m;𝐱,θ~), with the only difference being that we derive first the contraction on an interval of length at most one, since the contribution from the data sequence, through xt, remains constant and thus, the drift coefficient remains autonomous for such an interval. More concretely, Assumption 2.1 of [15] holds in this case too with κ constant and equal to K1 due to Assumption 2.2. Assumption 2.2 of [15] is true with V=V2 due to Lemma 3.6. Note that the statements in these Assumptions are uniform in x (and thus identical for different values of xt). Finally, Assumption 2.5 of [15] is also true due to (20). Thus, the results of Corollary 2.3 of [15] apply in this case, too, and one concludes that

𝒲ρ2((ζ~λ(t,m;𝐱,θ~)),(ζ~λ(t,m;𝐱,θ~)))= 𝒲ρ2((ζ~λ(t,t;𝐱,ζ~λ(t,m;𝐱,θ~))),(ζ~λ(t,t;𝐱,ζ~λ(t,m;𝐱,θ~))))
e-C8λ(t-m)𝒲ρ2((θ~),(θ~)). (51)

Observing as above that 𝒲ρ2 is controlled from above and below by multiples of w1,2, (3.4) yields the result. ∎

Define a continuous-time filtration t:=𝒢t, t+ and the corresponding decreasing family of sigma-algebras t+:=𝒢t+, t+. Moreover, let T:=1/λ, which is used for the creation of a suitable set of grid points.

Lemma 3.13.

For each nN, there exists a measurable function


such that, for each tnT and θRd, ht,nT(θ)(ω) is a version of E[H(θ,Xt)|HnTλ] and, for almost every ωΩ, θht,nT(θ)(ω) is continuous.


As ht,nT, t[k,k+1) can be assumed constant for each k, it suffices to prove the existence of a measurable hk,nT:Ω×dd which is continuous in its second variable, for each fixed k. This follows from Lemma 8.5 of [2]. ∎

Lemma 3.14.

There exist random variables Ξn, nN such that, for all θRd,


and for each p1 there exist C18(p) such that


Notice that, for any integer knT,


since nT+ is independent of nT. This implies

|hk,nT(θ)-h(θ)| |𝔼[H(θ,Xk)|nT]-𝔼[H(θ,𝔼[Xk|nT+])|nT]|+|𝔼[H(θ,𝔼[Xk|nT+])]-𝔼[H(θ,Xk)]|

Hence, noting that h,nT(θ) is constant on each interval [k,k+1), k, knT,


Since X is conditionally L-mixing, supn𝔼[(Γ1nT(X))p]< for every p1. The statement follows. ∎

We recall that 𝐗 refers to the (m)-valued random variable that has coordinates 𝐗i=Xi, i. Let Zλ(t,s,ϑ), ts denote the solution of the SDE


with initial condition Zλ(s,s,ϑ):=ϑ for some sλ-measurable random variable ϑ. Note that Ltλ=Zλ(t,0,θ0).

Definition 3.15.

Fix nN and define

Z¯tλ,n=Zλ(t,nT,Y~nTλ(𝐗)),for any t[nT,).

It should be emphasized that for different n, the process Z¯λ,n is redefined accordingly and Z¯tλ,n is nT-measurable for all tnT.

Lemma 3.16.

Let Assumptions 2.1, 2.2 and 2.5 hold. For any integers p2 and n,

suptnT𝔼[Vp(Z¯tλ,n)]𝔼[Vp(θ0)]+6vp(M¯(p)), (52)

where the constant M¯(p) is given explicitly in the proof of Lemma 3.6.


By application of Ito’s formula, one obtains almost surely, for any t[nT,),


which implies, noting that the expectation of the stochastic integral vanishes by standard arguments,


Differentiating both sides and using Lemma 3.6, one obtains for any tnT,

ddt𝔼[Vp(Z¯tλ,n)] =𝔼[λΔVp(Z¯tλ,n)β-λh(Z¯tλ,n),Vp(Z¯tλ,n)]-λC6(p)𝔼[Vp(Z¯tλ,n)]+λC7(p),

which yields


and thus, in view of (37),


Finally, since C6(p)=ap/4, C7(p)=(3/4)apvp(M¯(p)) according to the proof of Lemma 3.6, the desired result (52) follows.

Corollary 3.17.

Assume 2.1, 2.2 and 2.5 hold. For any integer p2, and T=1/λ,

𝔼[supnTt(n+1)TVp(Z¯tλ,n)]3𝔼[V2p(θ0)]+3(1+ap)v2p(M¯(2p)) (53)

where the constant M¯(2p) is given explicitly in the proof of Lemma 3.6.


For any n, q2 and any bounded stopping time τnnT (a.s.), one obtains by application of Ito’s formula that, almost surely,


Due to Lemmata 3.6 and 3.16,

𝔼[Vq(Z¯τnλ,n)] 𝔼[Vq(Y~nTλ(𝐗))]+𝔼[nTτn(-λC6(p)Vq(Z¯sλ,n)+λC7(p))ds]

Then, according to Lenglart’s domination inequality [18], see also Proposition 4.7 of [24], for any k(0,1)


Consequently, for k=1/2 and q=2p and in view of Corollary 3.9, the desired result holds.

Lemma 3.18.

Let Xk,kN be conditionally L-mixing. Recall T=1/λ and choose n,NN. We define the filtrations Rj:=HnT+j/N, Rj+:=HnT+j/N+, jN and Gj:=XnT+j/N,jN. Then, it holds that

Mp0(G)=MpnT(X) and Γp0(G)2NΓpnT(X). (54)

Note that Mp0(G), Γp0(G) are calculated with respect to (j,j+)j while the quantities MpnT(X) and ΓpnT(G) are calculated with respect to (𝒢j,𝒢j+)j. The equality in (54) is trivial.

Notice that, if τj, mNτ<(m+1)N for some m then


using Lemma A.1 of [4], whence


for each mNτ<(m+1)N. The inequality in (54) follows. ∎

Lemma 3.19.

Assume 2.1, 2.2 and 2.5 hold. There is C19 such that, for each 0<λλmax,


Fix t[nT,(n+1)T]. Let us estimate, using Assumption 2.2,

|Y~tλ(𝐗)-Z¯tλ,n| λ|nTt[H(Y~sλ(𝐗),Xs)-h(Z¯sλ,n)]ds|

where hs,nT is defined in Lemma 3.13. Now let us apply Grönwall’s lemma and take the square of both sides. Using the elementary (x+y)22(x2+y2), x,y0, we arrive at

|Y~tλ(𝐗)-Z¯tλ,n|22λ2e2K1λT[ supu[nT,(n+1)T]|nTu[H(Z¯sλ,n,Xs)-hs,nT(Z¯sλ,n)]ds|2
+(nT|hs,nT(Z¯sλ,n)-h(Z¯sλ,n)|ds)2]. (55)

As shs,nT(Z¯sλ,n) and sH(Z¯sλ,n,Xs) are a.s. piecewise continuous (see Lemma 3.13), Lemma 4.1 guarantees that

supu[nT,(n+1)T]|nTu[H(Z¯sλ,n,Xs)-hs,nT(Z¯sλ,n)]ds|=limN1Nmax1kTN|j=0k[H(Z¯nT+j/Nλ,n,XnT+j/N)-hnT+j/N,nT(Z¯nT+j/Nλ,n)]|, (56)

a.s., with N ranging over integers. Defining j and j+ as in Lemma 3.18, the process



Mp0(G)=MpnT(𝐗) and Γp0(G)2NΓpnT(𝐗), (57)

for p1. Introduce the events Fi:={isups[nT,(n+1)T]|Z¯sλ,n|<i+1}, i, which are nT measurable, and fix i for the moment. Next, we apply Theorem 3.2 to the process


Clearly, 𝔼[Wj|0]=𝔼[Wj|nT]=0. From Lemma 3.4, we know that the estimates (15) (resp. (16)) hold for MpnT(H(θ,G),B(i)) (resp. ΓpnT(H(θ,G),B(i))). Noting that Z¯nT+j/Nλ,n is nT-measurable for j and H is Lipschitz-continuous, Lemma 3.5 implies that the process

W¯j:=H(Z¯nT+j/Nλ,n,XnT+j/N)𝟙Fi,forj (58)


Mp0(W¯)C*[MpnT(𝐗)+i+1], and  Γp0(W¯)4K2NΓpnT(𝐗).

where C*:=max{K1,K2,H*}. Then, by Remark 6.4 of [4],


Apply Theorem 3.2 with r:=3 at k=0. We obtain


whence, by the conditional Fatou lemma and (56),


Fix p11, to be chosen later. We can then estimate, using Cauchy’s inequality (twice) and Lemma 3.17,

C201λ (59)



Hence we set p1:=5 (any p1>4 would do) and obtain from (3.4) finally, for any t[nT,(n+1)T], due to (3.4) and Lemma 3.14,

W~2((Y~tλ(𝐗)),(Z¯tλ,n)) 𝔼1/2|Y~tλ(𝐗)-Z¯tλ,n|2 (60)



We now turn to estimating W1((Z¯tλ,n),(Ltλ)). Recall that n and t[nT,(n+1)T) are fixed. Now we may write, using the triangle inequality, the definition of Z¯λ,n, (22), Lemma 3.16, (60), and Proposition 3.12

= k=1nW1((Zλ(t,kT,Y~kTλ(𝐗))),(Zλ(t,(k-1)T,Y~(k-1)Tλ(𝐗)))
= C9k=1nexp(-C8(n-k))w1,2((Y~kTλ(𝐗)),(Z¯kTλ,k-1))

where the penultimate inequality is due to Lemma 3.28 and

C22=C91-exp(-C82)2eK1 C20+C18(2)

by Corollaries 3.8, 3.16. Now, putting together our estimations, we arrive at

W1((Y~tλ(𝐗)),(Ltλ)) W1((Y~tλ(𝐗)),(Z¯tλ,n))+W1((Z¯tλ,n),(Ltλ))

where C19=C21+C22, which finishes the proof. ∎

Remark 3.20.

Our assumptions can be somewhat weakened. Indeed, the above arguments go through if we assume only that the sequences M3n(X),Γ3n(X),n are bounded in L4 and |θ0|L10. The former propriety is called “conditional L-mixing of order (3,4)”, see [4].

Corollary 3.21.

Let Assumptions 2.1, 2.2 and 2.5 hold. For each 0<λλmax and 0st let ζ~λ(t,s;x,θ~) be the solution of (26). with an initial condition θ~. Then for each k1,

2(𝔼[|θ0|4]+aλM(2,d)j=0(k-1)T-1(1-aλ)j|x(k-1)T-j|4+M^(2,d))+3v4(M¯(4))+2, (62)


𝔼[V4(Y(k-1)Tλ)]2+2(𝔼[|θ0|4]+aλM(2,d)j=0(k-1)T-1(1-aλ)j|x(k-1)T-j|4+M^(2,d)), (63)

where the constants M(2,d) and M^(2,d) are given by (39) and (40) with p=2.


A direct consequence of Lemma 3.7 and (38). ∎

Lemma 3.22.

Let Assumptions 2.1, 2.2 and 2.5 hold. For each 0<λλmax and nN we have, for all t(nT,(n+1)T],


where C23(x,n) is given in (69).


Recall (26) and observes that Y~tλ(𝐱)=ζ~λ(t,0;𝐱,θ0). Then, one calculates for t(nT,(n+1)T],

W1((Y~tλ(𝐱)),(Ytλ(𝐱)))= W1((ζ~λ(t,0;𝐱,θ0)),(Ytλ(𝐱)))

and thus, due to the domination of W1 by w1,2, see (22), and Proposition 3.12, see (50), one obtains

W1((Y~tλ(𝐱)),(Ytλ(𝐱))) k=1nw1,2((ζ~λ(t,kT;𝐱,YkTλ(𝐱))),(ζ~λ(t,kT;𝐱,ζ~λ(kT,(k-1)T;𝐱,Y(k-1)Tλ(𝐱))))
+w1,2((ζ~λ(t,nT;𝐱,YnTλ(𝐱))),(Ytλ(𝐱))). (64)

At this point, one notes that due to Lemma 4.3, for any two probability measures μ and ν on d,


where KL(μ,ν) denotes the Kullback-Leibler distance of the two measures. Thus

w1,2((YkTλ(𝐱)),(ζ~λ(kT,(k-1)T;𝐱,Y(k-1)Tλ(𝐱))) (65)