A neural network based policy iteration algorithm with global $H^2$-superlinear convergence for stochastic games on domains

  • 2020-02-13 17:21:48
  • Kazufumi Ito, Christoph Reisinger, Yufei Zhang
  • 0

Abstract

In this work, we propose a class of numerical schemes for solving semilinearHamilton-Jacobi-Bellman-Isaacs (HJBI) boundary value problems which arisenaturally from exit time problems of diffusion processes with controlled drift.We exploit policy iteration to reduce the semilinear problem into a sequence oflinear Dirichlet problems, which are subsequently approximated by a multilayerfeedforward neural network ansatz. We establish that the numerical solutionsconverge globally in the $H^2$-norm, and further demonstrate that thisconvergence is superlinear, by interpreting the algorithm as an inexact Newtoniteration for the HJBI equation. Moreover, we construct the optimal feedbackcontrols from the numerical value functions and deduce convergence. Thenumerical schemes and convergence results are then extended to HJBI boundaryvalue problems corresponding to controlled diffusion processes with obliqueboundary reflection. Numerical experiments on the stochastic Zermelo navigationproblem are presented to illustrate the theoretical results and to demonstratethe effectiveness of the method.

 

Quick Read (beta)

A neural network based policy iteration algorithm with global H2-superlinear convergence for stochastic games on domains

Kazufumi Ito Department of Mathematics, North Carolina State University, Raleigh, NC 27607, United States of America, [email protected]    Christoph Reisinger Mathematical Institute, University of Oxford, United Kingdom ([email protected], [email protected])    Yufei Zhang22 2 Strictly speaking, the squared residual (6.6) is not differentiable (with respect to the network parameters) at the samples where one of the first partial derivatives of the current iterate u is zero, due to the nonsmooth functions 1,2:2[0,) in the HJBI operator F (see (6)). In practice, PyTorch will assign 0 as partial derivatives of 1 and 2 functions at their nondifferentiable points, and use it in the backward propagation.
\usetikzlibrary

arrows,positioning,shapes.geometric

Abstract. In this work, we propose a class of numerical schemes for solving semilinear Hamilton-Jacobi-Bellman-Isaacs (HJBI) boundary value problems which arise naturally from exit time problems of diffusion processes with controlled drift. We exploit policy iteration to reduce the semilinear problem into a sequence of linear Dirichlet problems, which are subsequently approximated by a multilayer feedforward neural network ansatz. We establish that the numerical solutions converge globally in the H2-norm, and further demonstrate that this convergence is superlinear, by interpreting the algorithm as an inexact Newton iteration for the HJBI equation. Moreover, we construct the optimal feedback controls from the numerical value functions and deduce convergence. The numerical schemes and convergence results are then extended to oblique derivative boundary conditions. Numerical experiments on the stochastic Zermelo navigation problem are presented to illustrate the theoretical results and to demonstrate the effectiveness of the method.

Key words. Hamilton-Jacobi-Bellman-Isaacs equations, neural networks, policy iteration, inexact semismooth Newton method, global convergence, q-superlinear convergence.

AMS subject classifications. 82C32, 91A15, 65M12

1 Introduction

In this article, we propose a class of numerical schemes for solving Hamilton-Jacobi-Bellman-Isaacs (HJBI) boundary value problems of the following form:

-aij(x)iju+G(x,u,u)=0,in Ωn;Bu=g,on Ω, (1.1)

where Ω is an open bounded domain, G is the (nonconvex) Hamiltonian defined as

G(x,u,u)=maxα𝐀minβ𝐁(bi(x,α,β)iu(x)+c(x,α,β)u(x)-f(x,α,β)), (1.2)

with given nonempty compact sets 𝐀,𝐁, and B is a boundary operator, i.e., if B is the identity operator, (1.1) is an HJBI Dirichlet problem, while if Bu=γiiu+γ0u with some functions {γi}i=0n, (1.1) is an HJBI oblique derivative problem. Above and hereafter, when there is no ambiguity, we shall adopt the summation convention as in [20], i.e., repeated equal dummy indices indicate summation from 1 to n.

It is well-known that the value function of zero-sum stochastic differential games in domains satisfies the HJBI equation (1.1), and the optimal feedback controls can be constructed from the derivatives of the solutions (see e.g. [32] and references within; see also Section 6 for a concrete example). In particular, the HJBI Dirichlet problem corresponds to exit time problems of diffusion processes with controlled drift (see e.g. [32, 10, 37]), while the HJBI oblique derivative problem corresponds to state constraints (see e.g. [36, 34]). A nonconvex HJBI equation as above also arises from a penalty approximation of hybrid control problems involving continous controls, optimal stopping and impulse controls, where the HJB (quasi-)variational inequality can be reduced to an HJBI equation by penalizing the difference between the value function and the obstacles (see e.g. [27, 47, 39, 40]). As (1.1) in general cannot be solved analytically, it is important to construct effective numerical schemes to find the solution of (1.1) and its derivatives.

The standard approach to solving (1.1) is to first discretize the operators in (1.1) by finite difference or finite element methods, and then solve the resulting nonlinear discretized equations by using policy iteration, also known as Howard’s algorithm, or generally (finite-dimensional) semismooth Newton methods (see e.g. [18, 8, 45, 39]). However, this approach has the following drawbacks, as do most mesh-based methods: (1) it can be difficult to generate meshes and to construct consistent numerical schemes for problems in domains with complicated geometries; (2) the number of unknowns in general grows exponentially with the dimension n, i.e., it suffers from Bellman’s curse of dimensionality, and hence this approach is infeasible for solving high-dimensional control problems. Moreover, since policy iteration is applied to a fixed finite-dimensional equation resulting from a particular discretization, it is difficult to infer whether the same convergence rate of policy iteration remains valid as the mesh size tends to zero ([42, 8]). We further remark that, for a given discrete HJBI equation, it can be difficult to determine a good initialization of policy iteration to ensure fast convergence of the algorithm; see [2] and references therein on possible accelerated methods.

Recently, numerical methods based on deep neural networks have been designed to solve high-dimensional partial differential equations (PDEs) (see e.g. [35, 15, 6, 16, 26, 44]). Most of these methods reformulate (1.1) into a nonlinear least-squares problem:

infu-aijiju+G(,u,u)L2(Ω)2+Bu-gL2(Ω)2, (1.3)

where is a collection of neural networks with a smooth activation function. Based on collocation points chosen randomly from the domain, (1.3) is then reduced into an empirical risk minimization problem, which is subsequently solved by using stochastic optimization algorithms, in particular the Stochastic Gradient Descent (SGD) algorithm or its variants. Since these methods avoid mesh generation, they can be adapted to solve PDEs in high-dimensional domains with complex geometries. Moreover, the choice of smooth activation functions leads to smooth numerical solutions, whose values can be evaluated everywhere without interpolations. In the following, we shall refer to these methods as the Direct Method, due to the fact that there is no policy iteration involved.

We observe, however, that the Direct Method also has several serious drawbacks, especially for solving nonlinear nonsmooth equations including (1.1). Firstly, the nonconvexity of both the deep neural networks and the Hamiltonian G leads to a nonconvex empirical minimization problem, for which there is no theoretical guarantee on the convergence of SGD to a minimizer (see e.g. [43]). In practice, training a network with a desired accuracy could take hours or days (with hundreds of thousands of iterations) due to the slow convergence of SGD. Secondly, each SGD iteration requires the evaluation of G (with respect to u and u) on sample points, but G is not necessarily defined everywhere due to the nonsmoothness of G. Moreover, evaluating the function G (again on a large set of sample points) can be expensive, especially when the sets A and B are of high dimensions, as we do not require more regularity than continuity of the coefficients with respect to the controls, so that approximate optimization may only be achieved by exhaustive search over a discrete coverage of the compact control set. Finally, as we shall see in Remark 4.2, merely including an L2(Ω)-norm of the boundary data in the loss function (1.3) does not generally lead to convergence of the derivatives of numerical solutions or the corresponding feedback control laws.

In this work, we propose an efficient neural network based policy iteration algorithm for solving (1.1). At the (k+1)th iteration, k0, we shall update the control laws (αk,βk) by performing pointwise maximization/minimization of the Hamiltonian G based on the previous iterate uk, and obtain the next iterate uk+1 by solving a linear boundary value problem, whose coefficients involve the control laws (αk,βk). This reduces the (nonconvex) semilinear problem into a sequence of linear boundary value problems, which are subsequently approximated by a multilayer neural network ansatz. Note that compared to Algorithm Ho-3 in [8] for discrete HJBI equations, which requires to solve a nonlinear HJB subproblem (involving minimization over the set B) for each iteration, our algorithm only requires to solve a linear subproblem for each iteration, hence it is in general more efficient, especially when the dimension of B is high.

Policy iteration (or Successive Galerkin Approximation) was employed in [4, 5, 28, 30] to solve convex HJB equations on the whole space n. Specifically, [4, 5, 28] approximate the solution to each linear equation via a separable polynomial ansatz (without concluding any convergence rate), while [30] assumes each linear equation is solved sufficiently accurately (without specifying a numerical method), and deduces pointwise linear convergence. The continuous policy iteration in [28] has also been applied to solve HJBI equations on n in [29], which is a direct extension of Algorithm Ho-3 in [8] and still requires to solve a nonlinear HJB subproblem at each iteration. In this paper, we propose an easily implementable accuracy criterion for the numerical solutions of the linear PDEs which ensures the numerical solutions converge superlinearly in a suitable function space for nonconvex HJBI equations from an arbitrary initial guess.

Our algorithm enjoys the main advantage of the Direct Method, i.e., it is a mesh-free method and can be applied to solve high-dimensional stochastic games. Moreover, by utilizing the superlinear convergence of policy iteration, our algorithm effectively reduces the number of pointwise maximization/minimization over the sets A and B, and significantly reduces the computational cost of the Direct Method, especially for high dimensional control sets. The superlinear convergence of policy iteration also helps eliminate the oscillation caused by SGD, which leads to smoother and more rapidly decaying loss curves in both the training and validation processes (see Figure 7). Our algorithm further allows training of the feedback controls on a separate network architecture from that representing the value function, or adaptively adjusting the architecture of networks for each policy iteration.

A major theoretical contribution of this work is the proof of global superlinear convergence of the policy iteration algorithm for the HJBI equation (1.1) in H2(Ω), which is novel even for HJB equations (i.e., one of the sets A and B is singleton). Although the (local) superlinear convergence of policy iteration for discrete equations has been proved in various works (e.g. [38, 42, 18, 8, 47, 45, 39]), to the best of our knowledge, there is no published work on the superlinear convergence of policy iteration for HJB PDEs in a function space, nor on the global convergence of policy iteration for solving nonconvex HJBI equations.

Moreover, this is the first paper which demonstrates the convergence of neural network based methods for the solutions and their (first and second order) derivatives of nonlinear PDEs with merely measurable coefficients (cf. [22, 23, 26, 44]). We will also prove the pointwise convergence of the numerical solutions and their derivatives, which subsequently enables us to construct the optimal feedback controls from the numerical value functions and deduce convergence.

Let us briefly comment on the main difficulties encountered in studying the convergence of policy iteration for HJBI equations. Recall that at the (k+1)th iteration, we need to solve a linear boundary value problem, whose coefficients involve the control laws (αk,βk), obtained by performing pointwise maximization/minimization of the Hamiltonian G. The uncountability of the state space Ω and the nonconvexity of the Hamiltonian require us to exploit several technical measurable selection arguments to ensure the measurability of the controls (αk,βk), which is essential for the well-definedness of the linear boundary value problems and the algorithm.

Moreover, the nonconvexity of the Hamiltonian prevents us from following the arguments in [42, 18, 8] for discrete HJB equations to establish the global convergence of our inexact policy iteration algorithm for HJBI equations. In fact, a crucial step in the arguments for discrete HJB equations is to use the discrete maximum principle and show the iterates generated by policy iteration converge monotonically with an arbitrary initial guess, which subsequently implies the global convergence of the iterates. However, this monotone convergence is in general false for the iterates generated by the inexact policy iteration algorithm, due to the nonconvexity of the Hamiltonian and the fact that each linear equation is only solved approximately. We shall present a novel analysis technique for establishing the global convergence of our inexact policy iteration algorithm, by interpreting it as a fixed point iteration in H2(Ω).

Finally, we remark that the proof of superlinear convergence of our algorithm is significantly different from the arguments for discrete equations. Instead of working with the sup-norm for (finite-dimensional) discrete equations as in [42, 18, 8, 47, 39], we employ a two-norm framework to establish the generalized differentiability of HJBI operators, where the norm gap is essential as has already been pointed out in [24, 46, 45]. Moreover, by taking advantage of the fact that the Hamiltonian only involves low order terms, we further demonstrate that the inverse of the generalized derivative is uniformly bounded. Furthermore, we include a suitable fractional Sobolev norm of the boundary data in the loss functions used in the training process, which is crucial for the H2(Ω)-superlinear convergence of the neural network based policy iteration algorithm.

We organize this paper as follows. Section 2 states the main assumptions and recalls basic results for HJBI Dirichlet problems. In Section 3 we propose a policy iteration scheme for HJBI Dirichlet problems and establish its global superlinear convergence. Then in Section 4, we shall introduce the neural network based policy iteration algorithm, establish its various convergence properties, and construct convergent approximations to optimal feedback controls. We extend the algorithm and convergence results to HJBI oblique derivative problems in Section 5. Numerical examples for two-dimensional stochastic Zermelo navigation problems are presented in Section 6 to confirm the theoretical findings and to illustrate the effectiveness of our algorithms. The Appendix collects some basic results which are used in this article, and gives a proof for the main result on the HJBI oblique derivative problem.

2 HJBI Dirichlet problems

In this section, we introduce the HJBI Dirichlet boundary value problems of our interest, recall the appropriate notion of solutions, and state the main assumptions on its coefficients. We start with several important spaces used frequently throughout this work.

Let n and Ω be a bounded C1,1 domain in n, i.e., a bounded open connected subset of n with a C1,1 boundary. For each integer k0 and real p with 1p<, we denote by Wk,p(Ω) the standard Sobolev space of real functions with their weak derivatives of order up to k in the Lebesgue space Lp(Ω). When p=2, we use Hk(Ω) to denote Wk,2(Ω). We further denote by H1/2(Ω) and H3/2(Ω) the spaces of traces from H1(Ω) and H2(Ω), respectively (see [21, Proposition 1.1.17]), which can be equivalently defined by using the surface measure σ on the boundaries Ω as follows (see e.g. [19]):

gH1/2(Ω) =[Ω|g|2𝑑σ+Ω×Ω|g(x)-g(y)|2|x-y|n𝑑σ(x)𝑑σ(y)]1/2, (2.1)
gH3/2(Ω) =[Ω(|g|2+i=1n|ig|2)𝑑σ+i=1nΩ×Ω|ig(x)-ig(y)|2|x-y|n𝑑σ(x)𝑑σ(y)]1/2.

We shall consider the following HJBI equation with nonhomogeneous Dirichlet boundary data:

F(u) -aij(x)iju+G(x,u,u)=0,a.e. Ω, (2.2a)
τu =g,on Ω. (2.2b)

where the nonlinear Hamiltonian is given as in (1.1):

G(x,u,u)=maxα𝐀minβ𝐁(bi(x,α,β)iu(x)+c(x,α,β)u(x)-f(x,α,β)). (2.3)

Throughout this paper, we shall focus on the strong solution to (2.2), i.e., a twice weakly differentiable function uH2(Ω) satisfying the HJBI equation (2.2a) almost everywhere in Ω, and the boundary values on Ω will be interpreted as traces of the corresponding Sobolev space. For instance, τu=g on Ω in (2.2b) means that the trace of u is equal to g in H3/2(Ω), where τ(H2(Ω),H3/2(Ω)) denotes the trace operator (see [19, Proposition 1.1.17]). See Section 5 for boundary conditions involving the derivatives of solutions.

We now list the main assumptions on the coefficients of (2.2).

H. 1.

Let nN, ΩRn be a bounded C1,1 domain, A be a nonempty finite set, and B be a nonempty compact metric space. Let gH3/2(Ω), {aij}i,j=1nC(Ω¯) satisfy the following ellipticity condition with a constant λ>0:

i,j=1naij(x)ξiξjλi=1nξi2,for all ξn and xΩ,

and {bi}i=1n,c,fL(Ω×𝐀×𝐁) satisfy that c0 on Ω×𝐀×𝐁, and that ϕ(x,α,):𝐁R is continuous, for all ϕ=bi,c,f and (x,α)Ω×𝐀.

As we shall see in Theorem 3.3 and Corollary 3.5, the finiteness of the set A enables us to establish the semismoothness of the HJBI operator (2.2a), whose coefficients involve a general nonlinear dependence on the parameters α and β. If all coefficients of (2.2a) are in a separable form, i.e., it holds for all ϕ=bi,c,f that ϕ(x,α,β)=ϕ1(x,α)+ϕ2(x,β) for some functions ϕ1,ϕ2 (e.g. the penalized equation for variational inequalities with bilateral obstacles in [27]), then we can relax the finiteness of A to the same conditions on B.

Finally, in this work we focus on boundary value problems in a C1,1 domain to simplify the presentation, but the numerical schemes and their convergence analysis can be extended to problems in nonsmooth convex domains with sufficiently regular coefficients (see e.g. [21, 45]).

We end this section by proving the uniqueness of solutions to the Dirichlet problem (2.2) in H2(Ω). The existence of strong solutions shall be established constructively via policy iteration below (see Theorem 3.7).

Proposition 2.1.

Suppose (H.1) holds. Then the Dirichlet problem (2.2) admits at most one strong solution u*H2(Ω).

Proof.

Let u,vH2(Ω) be two strong solutions to (2.2), we consider the following linear homogeneous Dirichlet problem:

-aij(x)ijw+b~i(x)iw+c~(x)w=0,a.e. in Ω;τw=0,on Ω, (2.4)

where we define the following measurable functions: for each i=1,,n,

b~i(x) ={G(x,v,((jv)1j<i,iu,(ju)i<jn))-G(x,v,((jv)1j<i,iv,(ju)i<jn))(iu-iv)(x),on {xΩi(u-v)(x)0},0,otherwise,
c~(x) ={G(x,u,u)-G(x,v,u)(u-v)(x),on {xΩ(u-v)(x)0},0,otherwise,

with the Hamiltonian G defined as in (2.3). Note that the boundedness of coefficients implies that {b~i}i=1nL(Ω), and c~L(Ω). Moreover, one can directly verify that the following inequality holds for all parametrized functions (fα,β,gα,β)α𝐀,β𝐁: for all xn,

inf(α,β)𝐀×𝐁fα,β(x)-gα,β(x)infα𝐀supβ𝐁fα,β(x)-infα𝐀supβ𝐁gα,β(x)sup(α,β)𝐀×𝐁fα,β(x)-gα,β(x),

which together with (H.1) leads to the estimate that c~(x)inf(α,β)𝐀×𝐁c(x,α,β)0 on the set {xΩ(u-v)(x)0}, and hence we have c~0 a.e. Ω. Then, we can deduce from Theorem A.1 that the Dirichlet problem (2.4) admits a unique strong solution w*H2(Ω) and w*=0. Since w=u-vH2(Ω) satisfies (2.4) a.e. in Ω and τw=0, we see that w=u-v is a strong solution to (2.4) and hence u-v=w*=0, which subsequently implies the uniqueness of strong solutions to the Dirichlet problem (2.2). ∎

3 Policy iteration for HJBI Dirichlet problems

In this section, we propose a policy iteration algorithm for solving the Dirichlet problem (2.2). We shall also establish the global superlinear convergence of the algorithm, which subsequently gives a constructive proof for the existence of a strong solution to the Dirichlet problem (2.2).

We start by presenting the policy iteration scheme for the HJBI equations in Algorithm 3, which extends the policy iteration algorithm (or Howard’s algorithm) for discrete HJB equations (see e.g. [18, 8, 39]) to the continuous setting.

{algorithm}

[!h] Policy iteration algorithm for Dirichlet problems.

  1. 1.

    Choose an initial guess u0 in H2(Ω), and set k=0.

  2. 2.

    Given the iterate ukH2(Ω), update the following control laws: for all α𝐀,xΩ,

    αk(x) argmaxα𝐀[minβ𝐁(bi(x,α,β)iuk(x)+c(x,α,β)uk(x)-f(x,α,β))], (3.1)
    βk(x) argminβ𝐁(bi(x,αk(x),β)iuk(x)+c(x,αk(x),β)uk(x)-f(x,αk(x),β)). (3.2)
  3. 3.

    Solve the linear Dirichlet problem for uk+1H2(Ω):

    -aijiju+bkiiu+cku-fk=0,in Ω;τu=g,on Ω, (3.3)

    where ϕk(x)ϕ(x,αk(x),βk(x)) for ϕ=bi,c,f.

  4. 4.

    If uk+1-ukH2(Ω)=0, then terminate with outputs uk+1,αk and βk, otherwise increment k by one and go to step 2.

The remaining part of this section is devoted to the convergence analysis of Algorithm 3. For notational simplicity, we first introduce two auxiliary functions: for each (x,𝐮,α,β)Ω×n+1×𝐀×𝐁 with 𝐮=(z,p)×n, we shall define the following functions

(x,𝐮,α,β) bi(x,α,β)pi+c(x,α,β)z-f(x,α,β), (3.4)
h(x,𝐮,α) minβ𝐁(x,𝐮,α,β). (3.5)

Note that for all k0 and xΩ, by setting 𝐮k(x)=(uk(x),uk(x)), we can see from (3.1) and (3.2) that

(x,𝐮k(x),αk(x),βk(x))=minβ𝐁(x,𝐮k(x),αk(x),β)=maxα𝐀minβ𝐁(x,𝐮k(x),α,β). (3.6)

We then recall several important concepts, which play a pivotal role in our subsequent analysis. The first concept ensures the existence of measurable feedback controls and the well-posedness of Algorithm 3.

Definition 3.1.

Let (S,Σ) be a measurable space, and let X and Y be topological spaces. A function ψ:S×XY is a Carathéodory function if:

  1. 1.

    for each xX, the function ψx=ψ(,x):SY is (Σ,Y)-measurable, where Y is the Borel σ-algebra of the topological space Y; and

  2. 2.

    for each sS, the function ψs=ψ(s,):XY is continuous.

Remark 3.1.

It is well-known that if X,Y are two complete separable metric spaces, and ψ:S×XY is a Carathéodory function, then for any given measurable function f:SX, the composition function sψ(s,f(s)) is measurable (see e.g. [3, Lemma 8.2.3]). Since any compact metric space is complete and separable, it is clear that (H.1) implies that the coefficients bi,c,f are Carathéodory functions (with S=Ω and X=𝐀×𝐁). Moreover, one can easily check that both and h are Carathéodory functions, i.e., (resp. h) is continuous in (𝐮,α,β) (resp. (𝐮,α)) and measurable in x (see Theorem A.3 for the measurability of h in x).

We now recall a generalized differentiability concept for nonsmooth operators between Banach spaces, which is referred as semismoothness in [46] and slant differentiability in [12, 24]. It is well-known (see e.g. [8, 45, 39]) that the HJBI operator in (2.2a) is in general non-Fréchet-differentiable, and this generalized differentiability is essential for showing the superlinear convergence of policy iteration applied to HJBI equations.

Definition 3.2.

Let F:VYZ be defined on a open subset V of the Banach space Y with images in the Banach space Z. In addition, let *F:V(Y,Z) be a given a set-valued mapping with nonempty images, i.e., *F(y) for all yV. We say F is *F-semismooth in V if for any given yV, we have that F is continuous near y, and

supM*F(y+s)F(y+s)-F(y)-MsZ=o(sY),as sY0.

The set-valued mapping *F is called a generalized differential of F in V.

Remark 3.2.

As in [46], we always require that *F has a nonempty image, and hence the *F-semismooth of F in V shall automatically imply that the image of *F is nonempty on V.

Now we are ready to analyze Algorithm 3. We first prove the semismoothness of the Hamiltonian G defined as in (2.3), by viewing it as the composition of a pointwise maximum operator and a family of HJB operators parameterized by the control α. Moreover, we shall simultaneously establish that, for each iteration, one can select measurable control laws αk,βk to ensure the measurability of the controlled coefficients bki,ck,fk in the linear problem (3.3), which is essential for the well-posedness of strong solutions to (3.3), and the well-definedness of Algorithm 3.

The following proposition establishes the semismoothness of a parameterized family of first-order HJB operators, which extends the result for scalar-valued HJB operators in [45]. Moreover, by taking advantage of the fact that the operators involve only first-order derivatives, we are able to establish that they are semismooth from H2(Ω) to Lp(Ω) for some p>2 (cf. [45, Theorem 13]), which is essential for the superlinear convergence of Algorithm 3.

Proposition 3.1.

Suppose (H.1) holds. Let p be a given constant satisfying p1 if n2 and p[1,2nn-2) if n>2, and let F1:H2(Ω)(Lp(Ω))|𝐀| be the HJB operator defined by

F1(u)(minβ𝑩(bi(x,α,β)iu+c(x,α,β)u-f(x,α,β)))α𝑨,uH2(Ω).

Then F1 is Lipschitz continuous and *F1-semismooth in H2(Ω) with a generalized differential

*F1:H2(Ω) (H2(Ω),(Lp(Ω))|𝑨|)

defined as follows: for any uH2(Ω), we have

*F1(u)(bi(,α,βu(,α))i+c(,α,βu(,α)))α𝑨, (3.7)

where βu:Ω×𝐀𝐁 is any jointly measurable function such that for all α𝐀 and xΩ,

βu(x,α)argminβ𝑩(bi(x,α,β)iu(x)+c(x,α,β)u(x)-f(x,α,β)). (3.8)
Proof.

Since A is a finite set, we shall assume without loss of generality that, the Banach space (Lp(Ω))|𝐀| is endowed with the usual product norm p,𝐀, i.e., for all u(Lp(Ω))|𝐀|, up,𝐀=α𝐀u(,α)Lp(Ω). Note that the Sobolev embedding theorem shows that the following injections are continuous: H2(Ω)W1,q(Ω), for all q2,n2, and H2(Ω)W1,2n/(n-2)(Ω), for all n>2. Thus for any given p satisfying the conditions in Proposition 3.1, we can find r(p,) such that the injection H2(Ω)W1,r(Ω) is continuous. Then, the boundedness of bi,c,f implies that the mappings F1 and *F1 are well-defined, and F1:H2(Ω)(Lp(Ω))|𝐀| is Lipschitz continuous.

Now we show that the mapping *F1 has a nonempty image from W1,r(Ω) to (Lp(Ω))|𝐀|, where we choose r(p,) such that the injection H2(Ω)W1,r(Ω) is continuous, and naturally extend the operators F1 and *F1 from H2(Ω) to W1,r(Ω). For each uW1,r(Ω), we consider the Carathéodory function g:Ω×𝐀×𝐁 such that g(x,α,β)(x,(u,u)(x),α,β) for all (x,α,β)Ω×𝐀×𝐁, where is defined by (3.5). Theorem A.3 shows there exists a function βu:Ω×𝐀𝐁 satisfying (3.8), i.e.,

βu(x,α)argminβ𝐁(x,(u(x),u(x)),α,β),(x,α)Ω×𝐀,

and βu is jointly measurable with respect to the product σ-algebra on Ω×𝐀. Hence *F1(u) is nonempty for all uW1,r(Ω).

We proceed to show that the operator F1 is in fact *F1-semismooth from W1,r(Ω) to (Lp(Ω))|𝐀|, which implies the desired conclusion due to the continuous embedding H2(Ω)W1,r(Ω). For each α𝐀, we denote by F1,α:W1,r(Ω)Lp(Ω) the α-th component of F1, and by *F1,α the α-th component of *F1. Theorem A.4 and the continuity of in 𝐮 show that for each (x,α)Ω×𝐀, the set-valued mapping

𝐮n+1argminβ𝐁(x,𝐮,α,β)𝐁,

is upper hemicontinuous, from which, by following precisely the steps in the arguments for [45, Theorem 13], we can prove that F1,α:W1,r(Ω)Lp(Ω) is *F1,α-semismooth. Then, by using the fact that a direct product of semismooth operators is again semismooth with respect to the direct product of the generalized differentials of the components (see [46, Proposition 3.6]), we can deduce that F1:W1,r(Ω)(Lp(Ω))|𝐀| is semismooth with respect to the generalized differential *F1 and finishes the proof. ∎

We then establish the semismoothness of a general pointwise maximum operator, by extending the result in [24] for the max-function f:xmax(x,0).

Proposition 3.2.

Let p(2,) be a given constant, A be a finite set, and Ω be a bounded subset of Rn. Let F2:(Lp(Ω))|𝐀|L2(Ω) be the pointwise maximum operator such that for each u=(u(,α))α𝐀(Lp(Ω))|𝐀|,

F2(u)(x)maxα𝑨u(x,α),for a.e. xΩ. (3.9)

Then F2 is *F2-semismooth in (Lp(Ω))|𝐀| with a generalized differential

*F2:(Lp(Ω))|𝑨| ((Lp(Ω))|𝑨|,L2(Ω))

defined as follows: for any u=(u(,α))α𝐀,v=(v(,α))α𝐀(Lp(Ω))|𝐀|, we have

(*F2(u)v)(x)v(x,αu(x)),for xΩ,

where αu:Ω𝐀 is any measurable function such that

αu(x)argmaxα𝑨(u(x,α)),for xΩ. (3.10)

Moreover, *F2(u) is uniformly bounded (in the operator norm) for all u(Lp(Ω))|𝐀|.

Proof.

Let the Banach space (Lp(Ω))|𝐀| be endowed with the product norm p,𝐀 defined as in the proof of Proposition 3.1. We first show the mappings F2 and *F2 are well-defined, *F2 has nonempty images, and *F2(u) is uniformly bounded for u(Lp(Ω))|𝐀|.

The finiteness of A implies that any u(Lp(Ω))|𝐀| can also be viewed as a Carathéodory function u:Ω×𝐀. Hence for any given u(Lp(Ω))|𝐀|, we can deduce from Theorem A.3 the existence of a measurable function αu:Ω𝐀 satisfying (3.10). Moreover, for any given measurable function αu:Ω𝐀 and v(Lp(Ω))|𝐀|, the function *F2(u)v remains Lebesgue measurable (see Remark 3.1). Then, for any given u(Lp(Ω))|𝐀| with p>2, one can easily check that F2(u)L2(Ω), and *F2(u)((Lp(Ω))|𝐀|,L2(Ω)), which subsequently implies that F2 and *F2 are well-defined, and the image of *F2 is nonempty on (Lp(Ω))|𝐀|. Moreover, for any u,v(Lp(Ω))|𝐀|, Hölder’s inequality leads to the following estimate:

Ω|v(x,αu(x))|2𝑑xΩα𝐀|v(x,α)|2dxα𝐀|Ω|(p-2)/pv(,α)Lp(Ω)2,

which shows that *F2(u)((Lp(Ω))|𝐀|,L2(Ω))|Ω|(p-2)/(2p) for all u(Lp(Ω))|𝐀|.

Now we prove by contradiction that the operator F2 is *F2-semismooth. Suppose there exists a constant δ>0 and functions u,{vk}k=1(Lp(Ω))|𝐀| such that vkp,𝐀0 as k, and

F2(u+vk)-F2(u)-*F2(u+vk)vkL2(Ω)/vkp,𝐀δ>0,k, (3.11)

where for each k, *F2(u+vk) is defined with some measurable function αu+vk:Ω𝐀. Then, by passing to a subsequence, we may assume that for all α𝐀, the sequence {vk(,α)}k converges to zero pointwise a.e. in Ω, as k.

For notational simplicity, we define Σ(x,u)argmaxα𝐀(u(x,α)) for all u(Lp(Ω))|𝐀| and xΩ. Then for a.e. xΩ, we have limkvk(x,α)=0 for all α𝐀, αu+vk(x)Σ(x,u+vk) for all k. By using the finiteness of A and the convergence of {vk(,α)}k, it is straightforward to prove by contradiction that for all such xΩ, αu+vk(x)Σ(x,u) for all large enough k.

We now derive an upper bound of the left-hand side of (3.11). For a.e. xΩ, we have

F2(u +vk)(x)-F2(u)(x)-(*F2(u+vk)vk)(x)
(u+vk)(x,αu+vk(x))-u(x,αu+vk(x))-vk(x,αu+vk(x))=0,
F2(u +vk)(x)-F2(u)(x)-(*F2(u+vk)vk)(x)
(u+vk)(x,αu(x))-u(x,αu(x))-vk(x,αu+vk(x))=vk(x,αu(x))-vk(x,αu+vk(x)),

from any αu(x)Σ(x,u). Thus, for each k, we have for a.e. xΩ that,

|F2(u+vk)(x)-F2(u)(x)-(*F2(u+vk)vk)(x)|ϕk(x)infαuΣ(x,u)|vk(x,αu)-vk(x,αu+vk(x))|,

where, by applying Theorem A.3 twice, we can see that both the set-valued mapping xΣ(x,u) and the function ϕk are measurable.

We then introduce the set Ωk={xΩαu+vk(x)Σ(x,u)} for each k. The measurability of the set-valued mapping xΣ(x,u) implies the associated distance function ρ(x,α)dist(α,Σ(x,u)) is a Carathéodory function (see [1, Theorem 18.5]), which subsequently leads to the measurability of Ωk for all k. Hence we can deduce that

F2(u+vk)-F2(u)-*F2(u+vk)vkL2(Ω)2ΩkinfαuΣ(x,u)|vk(x,αu)-vk(x,αu+vk(x))|2dx
2Ωkα𝐀|vk(x,α)|2dx2α𝐀|Ωk|(p-2)/pvk(,α)Lp(Ω)2,

which leads to the following estimate:

F2(u+vk)-F2(u)-*F2(u+vk)vkL2(Ω)/vkp,𝐀2|Ωk|(p-2)/(2p)0,as k,

where we have used the bounded convergence theorem and the fact that for a.e. xΩ, 1Ωk(x)=0 for all large enough k. This contradicts to the hypothesis (3.11), and hence finishes our proof. ∎

Now we are ready to conclude the semismoothness of the HJBI operator. Note that the argument in [45] does not apply directly to the HJBI operator, due to the nonconvexity of the Hamiltonian G defined as in (2.3).

Theorem 3.3.

Suppose (H.1) holds, and let F:H2(Ω)L2(Ω) be the HJBI operator defined as in (2.2a). Then F is semismooth in H2(Ω), with a generalized differential *F:H2(Ω)L(H2(Ω),L2(Ω)) defined as follows: for any uH2(Ω),

*F(u)-aij()ij+bi(,α(),βu(,α()))i+c(,α(),βu(,α())), (3.12)

where βu:Ω×𝐀𝐁 is any jointly measurable function satisfying (3.8), and α:Ω𝐀 is any measurable function such that

α(x)argmaxα𝐀[minβ𝑩(bi(x,α,β)iu(x)+c(x,α,β)u(x)-f(x,α,β))],for a.e. xΩ. (3.13)
Proof.

Note that we can decompose the HJBI operator F:H2(Ω)L2(Ω) into F=F0+F2F1, where F0:H2(Ω)L2(Ω) is the linear operator u-aijiju, F1:H2(Ω)(Lp(Ω))|𝐀| is the HJB operator defined in Proposition 3.1, F2:(Lp(Ω))|𝐀|L2(Ω) is the pointwise maximum operator defined in Proposition 3.2, and p is a constant satisfying p>2 if n2, and p(2,2n/(n-2)) if n>2.

Proposition 3.1 shows that F1 is Lipschitz continuous and semismooth with respect to the generalized differential *F1 defined by (3.7), while Proposition 3.2 shows that F2 is semismooth with respect to the uniformly bounded generalized differential *F2 defined by (3.9). Hence, we know the composed operator F2F1 is semismooth with respect to the composition of the generalized differentials (see [46, Proposition 3.8]), i.e., *(F2F1)(u)=*F2(F1(u))*F1(u) for all uH2(Ω). Consequently, by using the fact that F0 is Fréchet differentiable with the derivative -aijij(H2(Ω),L2(Ω)), we can conclude from Propositions 3.1 and 3.2 that F:H2(Ω)L2(Ω) is semismooth on H2(Ω), and that (3.12) is a desired generalized differential of F at u. ∎

Note that the above characterization of the generalized differential of the HJBI operator involves a jointly measurable function βu:Ω×𝐀𝐁, satisfying (3.8) for all (x,α)Ω×𝐀. We now present a technical lemma, which allows us to view the control law βk in (3.2) as such a feedback control on xΩ and α𝐀.

Lemma 3.4.

Suppose (H.1) holds. Let h,{hi}i=1n:ΩR, αh:Ω𝐀 be given measurable functions, and βh:Ω𝐁 be a measurable function such that for all xΩ,

βh(x)argminβ𝑩(bi(x,αh(x),β)hi(x)+c(x,αh(x),β)h(x)-f(x,αh(x),β)). (3.14)

Then there exists a jointly measurable function β~h:Ω×𝐀𝐁 such that βh(x)=β~h(x,αh(x)) for all xΩ, and it holds for all xΩ and α𝐀 that

β~h(x,α)argminβ𝑩(bi(x,α,β)hi(x)+c(x,α,β)h(x)-f(x,α,β)). (3.15)
Proof.

Let βh:Ω𝐁 be a given measurable function satisfying (3.14) for all xΩ (see Remark 3.1 and Theorem A.3 for the existence of such a measurable function). As shown in the proof of Proposition 3.1, there exists a jointly measurable function β¯:Ω×𝐀𝐁 satisfying the property (3.15) for all (x,α)Ω×𝐀. Now suppose that 𝐀={αi}i=1|𝐀| with |𝐀|< (see (H.1)), we shall define the function β~h(x,α):Ω×𝐀𝐁, such that for all (x,α)Ω×𝐀,

β~h(x,α)={βh(x),(x,α)𝒞i=1|𝐀|({xΩαh(x)=αi}×{αi}),β¯(x,α),otherwise.

The measurability of αh and the finiteness of A imply that the set 𝒞 is measurable in the product σ-algebra on Ω×𝐀, which along with the joint measurability of β¯ leads to the joint measurability of the function β~h.

For any given xΩ, we have (x,αh(x)){yΩαh(y)=αh(x)}×{αh(x)}, from which we can deduce from the definition of β~h that β~h(x,αh(x))=βh(x) for all xΩ. Finally, for any given αi𝐀, we shall verify (3.15) for all xΩ and α=αi. Let xΩ be fixed. If αh(x)=αi, then the fact that (x,αi)𝒞 and the definition of β~h imply that β~h(x,αi)=βh(x), which along with (3.14) and αh(x)=αi shows that (3.15) holds for the point (x,αi). On the other hand, if αh(x)αi, then (x,αi)𝒞 and β~h(x,αi)=β¯(x,αi) satisfies the condition (3.15) due to the selection of β¯. ∎

As a direct consequence of the above extension result, we now present an equivalent characterization of the generalized differential of the HJBI operator.

Corollary 3.5.

Suppose (H.1) holds, and let F:H2(Ω)L2(Ω) be the HJBI operator defined as in (2.2a). Then F is semismooth in H2(Ω), with a generalized differential *F:H2(Ω)L(H2(Ω),L2(Ω)) defined as follows: for any uH2(Ω),

*F(u)-aij()ij+bi(,αu(),βu())i+c(,αu(),βu()), (3.16)

where αu:Ω𝐀 and βu:Ω𝐁 are any measurable functions satisfying for all xΩ that

αu(x)argmaxα𝐀[minβ𝑩(bi(x,α,β)iu(x)+c(x,α,β)u(x)-f(x,α,β))],βu(x)argminβ𝑩(bi(x,αu(x),β)iu(x)+c(x,αu(x),β)u(x)-f(x,αu(x),β)). (3.17)
Proof.

Let uH2(Ω), and let αu and βu be given measurable functions satisfying (3.17) (see Remark 3.1 and Theorem A.3 for the existence of such measurable functions). Then by using Lemma 3.4, we know there exists a jointly measurable function β~u:Ω×𝐀𝐁 such that β~u satisfies (3.8) for all (x,α)Ω×𝐀, and β~u(x,αu(x))=βu(x) for all xΩ. Hence we see the linear operator defined in (3.16) is equal to the following operator

-aij()ij+bi(,αu(),β~u(,αu()))i+c(,αu(),β~u(,αu()))(H2(Ω),L2(Ω)),

which is a generalized differential of the HJBI operator F at u due to Theorem 3.3. ∎

The above characterization of the generalized differential of the HJBI operator enables us to demonstrate the superlinear convergence of Algorithm 3 by reformulating it as a semismooth Newton method for an operator equation.

Theorem 3.6.

Suppose (H.1) holds and let u*H2(Ω) be a strong solution to the Dirichlet problem (2.2). Then there exists a neighborhood N of u*, such that for all u0N, Algorithm 3 either terminates with uk=u* for some kN, or generates a sequence {uk}kN that converges q-superlinearly to u* in H2(Ω), i.e., limkuk+1-u*H2(Ω)/uk-u*H2(Ω)=0.

Proof.

Note that the Dirichlet problem (2.2) can be written as an operator equation F~(u)=0 with the following operator

F~:uH2(Ω)(F(u),τu-g)L2(Ω)×H3/2(Ω),

where F is the HJBI operator defined as in (2.2a), and τ:H2(Ω)H3/2(Ω) is the trace operator. Moreover, one can directly check that given an iterate ukH2(Ω), k0, the next iterate uk+1 solves the following Dirichlet problem:

Lk(uk+1-uk)=-F(uk),in Ω;τ(uk+1-uk)=-(τuk-g),on Ω.

with the differential operator Lk*F(uk) defined as in (3.16). Since F:H2(Ω)L2(Ω) is *F-semismooth (see Corollary 3.5) and τ(H2(Ω),H3/2(Ω)), we can conclude that Algorithm 3 is in fact a semismooth Newton method for solving the operator equation F~(u)=0.

Note that the boundedness of coefficients and the classical theory of elliptic regularity (see Theorem A.1) imply that under condition (H.1), there exists a constant C>0, such that for any uH2(Ω) and any L*F(u), the inverse operator (L,τ)-1:L2(Ω)×H3/2(Ω)H2(Ω) is well-defined, and the operator norm (L,τ)-1 is bounded by C, uniformly in uH2(Ω). Hence one can conclude from [46, Theorem 3.13] (see also Theorem A.5) that the iterates {uk}k converges superlinearly to u* in a neighborhood 𝒩 of u*.

The next theorem strengthens Theorem 3.6, and establishes a novel global convergence result of Algorithm 3 applied to the Dirichlet problem (2.2), which subsequently provides a constructive proof for the existence of solutions to (2.2). The following additional condition is essential for our proof of the global convergence of Algorithm 3:

H. 2.

Let the function c in (H.1) be given as: c(x,α,β)=c¯(x,α,β)+c¯0, for all (x,α,β)Ω×𝐀×𝐁, where c¯0 is a sufficiently large constant, depending on Ω, {aij}i,j=1n, {bi}i=1n and c¯L(Ω×𝐀×𝐁).

In practice, (H.2) can be satisfied if (2.2) arises from an infinite-horizon stochastic game with a large discount factor (see e.g. [10]), or if (2.2) stems from an implicit (time-)discretization of parabolic HJBI equations with a small time stepsize.

Theorem 3.7.

Suppose (H.1) and (H.2) hold, then the Dirichlet problem (2.2) admits a unique strong solution u*H2(Ω). Moreover, for any initial guess u0H2(Ω), Algorithm 3 either terminates with uk=u* for some kN, or generates a sequence {uk}kN that converges q-superlinearly to u* in H2(Ω), i.e., limkuk+1-u*H2(Ω)/uk-u*H2(Ω)=0.

Proof.

If Algorithm 3 terminates in iteration k, we have F(uk)=Lkuk-fk=0 and τuk=g, from which, we obtain from the uniqueness of strong solutions to (2.2) (Proposition 2.1) that uk=u* is the strong solution to the Dirichlet problem (2.2). Hence we shall assume without loss of generality that Algorithm 3 runs infinitely.

We now establish the global convergence of Algorithm 3 by first showing the iterates {uk}k form a Cauchy sequence in H2(Ω). For each k0, we deduce from (3.6) and (H.2) that τuk+1=g on Ω and

-aijijuk+1+bkiiuk+1+ckuk+1-fk=-aijijuk+1+bkiiuk+1+(c¯k+c¯0)uk+1-fk
=-aijijuk+1+c¯0uk+1+bkii(uk+1-uk)+c¯k(uk+1-uk)+G¯(,uk,uk)=0, (3.18)

for a.e. xΩ, where the function c¯k(x)c¯(x,αk(x),βk(x)) for all xΩ, and the modified Hamiltonian is defined as:

G¯(x,u,u)=maxα𝐀minβ𝐁(bi(x,α,β)iu(x)+c¯(x,α,β)u(x)-f(x,α,β)). (3.19)

Hence, by taking the difference of equations corresponding to the indices k-1 and k, one can obtain that

- aijij(uk+1-uk)+c¯0(uk+1-uk)=-bkii(uk+1-uk)-c¯k(uk+1-uk)
+bk-1ii(uk-uk-1)+c¯k-1(uk-uk-1)-[G¯(,uk,uk)-G¯(,uk-1,uk-1)], (3.20)

for xΩ, and τ(uk+1-uk)=0 on Ω.

It has been proved in Theorem 9.14 of [20] that, there exist positive constants C and γ0, depending only on {aij}i,j=1n and Ω, such that it holds for all uH2(Ω) with τu=0, and for all γγ0 that

uH2(Ω)C-aijiju+γuL2(Ω),

which, together with the identity that γu=(-aijiju+γu)+aijiju and the boundedness of {aij}ij, implies that the same estimate also holds for uH2(Ω)+γuL2(Ω):

uH2(Ω)+γuL2(Ω)C-aijiju+γuL2(Ω).

Thus, by assuming c¯0γ0 and using the boundedness of the coefficients, we can deduce from (3.20) that

uk+1-uk H2(Ω)+c¯0uk+1-ukL2(Ω)C(-bkii(uk+1-uk)-c¯k(uk+1-uk)
+bk-1ii(uk-uk-1)+c¯k-1(uk-uk-1)-[G¯(,uk,uk)-G¯(,uk-1,uk-1)]L2(Ω))
C(uk+1-ukH1(Ω)+uk-uk-1H1(Ω)), (3.21)

for some constant C independent of c¯0 and the index k.

Now we apply the following interpolation inequality (see [20, Theorem 7.28]): there exists a constant C, such that for all uH2(Ω) and ε>0, we have uH1(Ω)εuH2(Ω)+Cε-1uL2(Ω). Hence, for any given ε1(0,1),ε2>0, we have

(1-ε1)uk+1-ukH2(Ω)+c¯0uk+1-ukL2(Ω) ε2uk-uk-1H2(Ω)+Cε1-1uk+1-ukL2(Ω)
+Cε2-1uk-uk-1L2(Ω).

Then, by taking ε1(0,1), ε2<1-ε1, and assuming that c¯0 satisfies (c¯0-C/ε1)/(1-ε1)C/ε22, we can obtain for c=C/ε22 that

uk+1-ukH2(Ω)+cuk+1-ukL2(Ω)ε21-ε1(uk-uk-1H2(Ω)+cuk-uk-1L2(Ω)),

which implies that {uk}k is a Cauchy sequence with the norm cH2(Ω)+cL2(Ω).

Since c is equivalent to H2(Ω) on H2(Ω), we can deduce that {uk}k converges to some u¯ in H2(Ω). By passing k in (3) and using Proposition 2.1, we can deduce that u¯=u* is the unique strong solution of (2.2). Finally, for a sufficiently large K0, we can conclude the superlinear convergence of {uk}kK0 from Theorem 3.6. ∎

We end this section with an important remark that, if one of the sets A and B is a singleton, and aijC0,1(Ω¯) for all i,j, then Algorithm 3 applied to the Dirichlet problem (2.2) is in fact monotonically convergent with an arbitrary initial guess. Suppose, for instance, that A is a singleton, then for each k{0}, we have that

0=Lkuk+1-fkF(uk+1)=-Lk+1(uk+2-uk+1),for a.e. xΩ.

Hence we can deduce that wk+1uk+1-uk+2 is a weak subsolution to Lk+1w=0, i.e.,

Ω[aijjwk+1iϕ+((iaij+bk+1i)iwk+1+ck+1wk+1)ϕ]𝑑x0,ϕ0,ϕC01(Ω).

Thus, the weak maximal principle (see [19, Theorem 1.3.7]) and the fact that uk+1-uk+2=0 a.e. xΩ (with respect to the surface measure), leads to the estimate esssupΩuk+1-uk+20, which consequently implies that ukuk+1 for all k and a.e. xΩ.

4 Inexact policy iteration for HJBI Dirichlet problems

Note that at each policy iteration, Algorithm 3 requires us to obtain an exact solution to a linear Dirichlet boundary value problem, which is generally infeasible. Moreover, an accurate computation of numerical solutions to linear Dirichlet boundary value problems could be expensive, especially in a high-dimensional setting. In this section, we shall propose an inexact policy iteration algorithm for (2.2), where we compute an approximate solution to (3.3) by solving an optimization problem over a family of trial functions, while maintaining the superlinear convergence of policy iteration.

We shall make the following assumption on the trial functions of the optimization problem.

H. 3.

The collections of trial functions {FM}MN satisfies the following properties: FMFM+1 for all MN, and F={FM}MN is dense in H2(Ω).

It is clear that (H.3) is satisfied by any reasonable H2-conforming finite element spaces (see e.g. [9]) and high-order polynomial spaces or kernel-function spaces used in global spectral methods (see e.g. [4, 5, 13, 28, 29]). We now demonstrate that (H.3) can also be easily satisfied by the sets of multi-layer feedforward neural networks, which provides effective trial functions for high-dimensional problems. Let us first recall the definition of a feedforward neural network.

Definition 4.1 (Artificial neural networks).

Let L,N0,N1,,NL be given constants, and ϱ: be a given function. For each l=1,,L, let Tl:Nl-1Nl be an affine function given as Tl(x)=Wlx+bl for some WlNl×Nl-1 and blNl. A function F:N0NL defined as

F(x)=TL(ϱTL-1)(ϱT1),xN0,

is called a feedforward neural network. Here the activation function ϱ is applied componentwise. We shall refer the quantity L as the depth of F, N1,NL-1 as the dimensions of the hidden layers, and N0,NL as the dimensions of the input and output layers, respectively. We also refer to the number of entries of {Wl,bl}l=1N as the complexity of F.

Let {L(M)}M, {N1(M)}M,,{NL(M)-1(M)}M be some nondecreasing sequences of natural numbers, we define for each M the set M of all neural networks with depth L(M), input dimension being equal to n, output dimension being equal to 1, and dimensions of hidden layers being equal to {N1(M),,NL(M)-1(M)}M. It is clear that if L(M)L for all M, then we have MM+1. The following proposition is proved in [25, Corollary 3.8], which shows neural networks with one hidden layer are dense in H2(Ω).

Proposition 4.1.

Let ΩRn be an open bounded starshaped domain, and ϱC2(R) satisfying 0<|Dlϱ|L1(Ω)< for all l=1,2. Then the family of all neural networks with depth L=2 is dense in H2(Ω).

Now we discuss how to approximate the strong solutions of Dirichlet problems by reformulating the equations into optimization problems over trial functions. The idea is similar to least squares finite-element methods (see e.g. [7]), and has been employed previously to develop numerical methods for PDEs based on neural networks (see e.g. [35, 6, 44]). However, compared to [6, 35], we do not impose additional constraints on the trial functions by requiring that the networks exactly agree with the boundary conditions, due to the lack of theoretical support that the constrained neural networks are still dense in the solution space. Moreover, to ensure the convergence of solutions in the H2(Ω)-norm, we include the H3/2(Ω)-norm of the boundary data in the cost function, instead of the L2(Ω)-norm used in [44] (see Remark 4.2 for more details).

For each k{0}, let uk+1H2(Ω) be the unique solution to the Dirichlet problem (3.3):

Lku-fk=0,in Ω;τu=g,on Ω,

where Lk and fk denote the linear elliptic operator and the source term in (3.3), respectively. For each M, we shall consider the following optimization problems:

Jk,MinfuMJk(u),with Jk(u)=Lku-fkL2(Ω)2+τu-gH3/2(Ω)2. (4.1)

The following result shows that the cost function Jk provides a computable indicator of the error.

Proposition 4.2.

Suppose (H.1) and (H.3) hold. For each kN{0} and MN, let uk+1H2(Ω) be the unique solution to (3.3), and Jk,Jk,M be defined as in (4.1). Then there exist positive constants C1 and C2, such that we have for each uH2(Ω) and kN{0} that

C1Jk(u)u-uk+1H2(Ω)2C2Jk(u).

Consequently, it holds for each kN{0} that limMJk,M=0.

Proof.

Let k{0} and uH2(Ω). The definition of Jk(u) implies that Lku-fk=feL2(Ω), τu-g=geH3/2(Ω) and J(u)=feL2(Ω)2+geH3/2(Ω)2. Then, by using the assumption that uk+1 solves (3.3), we deduce that the residual term satisfies the following Dirichlet problem:

Lk(u-uk+1)=fe,in Ω;τ(u-uk+1)=ge,on Ω.

Hence the boundedness of coefficients and the regularity theory of elliptic operators (see Theorem A.1) lead to the estimate that

C1(feL2(Ω)2+geH3/2(Ω)2)u-uk+1H2(Ω)2C2(feL2(Ω)2+geH3/2(Ω)2),

where the constants C1,C2>0 depend only on the L(Ω)-norms of aij,bki,ck,fk, which are independent of k. The above estimate, together with the facts that {M}M is dense in H2(Ω) and MM+1, leads to the desired result that limMJk,M=0. ∎

We now present the inexact policy iteration algorithm for the HJBI problem (2.2), where at each policy iteration, we solve the linear Dirichlet problem within a given accuracy.

{algorithm}

[H] Inexact policy iteration algorithm for Dirichlet problems.

  1. 1.

    Choose a family of trial functions ={M}MH2(Ω), an initial guess u0 in , a sequence {ηk}k{0} of positive scalars, and set k=0.

  2. 2.

    Given the iterate uk, update the control laws αk and βk by (3.1) and (3.2), respectively.

  3. 3.

    Find uk+1 such that11 1 With a slight abuse of notation, we denote by uk+1 the inexact solution to the Dirichlet problem (3.3).

    Jk(uk+1)=Lkuk+1-fkL2(Ω)2+τuk+1-gH3/2(Ω)2ηk+1min(uk+1-ukH2(Ω)2,η0), (4.2)

    where Lk and fk denote the linear operator and the source term in (3.3), respectively.

  4. 4.

    If uk+1-ukH2(Ω)=0, then terminate with outputs uk+1,αk and βk, otherwise increment k by one and go to step 2.

Remark 4.1.

In practice, the evaluation of the squared residuals Jk in (4.2) depends on the choice of trial functions. For trial functions with linear architecture, e.g. if {M}M are finite element spaces, high-order polynomial spaces, and kernel-function spaces (see [9, 13, 28, 29]), one may evaluate the norms by applying high-order quadrature rules to the basis functions involved.

For trial functions with nonlinear architecture, such as feedforward neural networks, we can replace the integrations in Jk by the empirical mean over suitable collocation points in Ω and on Ω, such as pseudorandom points or quasi-Monte Carlo points (see Section 6; see also [35, 6, 44]). In particular, due to the existence of local coordinate charts of the boundaries, we can evaluate the double integral in the definition of the H3/2(Ω)-norm (see (2.1)) by first generating points in 2(n-1) and then mapping the samples onto Ω×Ω. The resulting empirical least-squares problem for the k+1-th policy iteration step (cf. (4.1)) can then be solved by stochastic gradient descent (SGD) algorithms; see Section 6. We remark that, instead of pre-generating all the collocation points in advance, one can perform gradient descent based on a sequence of mini-batches of points generated at each SGD iteration. This is particularly useful in higher dimensions, where many collocation points may be needed to cover the boundary, and using mini-batches avoids having to evaluate functions at all collocation points in each iteration.

It is well-known (see e.g. [46, 14]) that the residual term uk+1-ukH2(Ω) is crucial for the superlinear convergence of inexact Newton methods. This next theorem establishes the global superlinear convergence of Algorithm 4.2.

Theorem 4.3.

Suppose (H.1), (H.2) and (H.3) hold, and limkηk=0 in Algorithm 4.2. Let u*H2(Ω) be the solution to the Dirichlet problem (2.2). Then for any initial guess u0F, Algorithm 4.2 either terminates with uk=u* for some kN, or generates a sequence {uk}kN that converges q-superlinearly to u* in H2(Ω), i.e., limkuk+1-u*H2(Ω)/uk-u*H2(Ω)=0. Consequently, we have limk(uk,iuk,ijuk)(x)=(u*,iu*,iju*)(x) for a.e. xΩ, and for all i,j=1,,n.

Proof.

Let u0 be an arbitrary initial guess. We first show that Algorithm 4.2 is always well-defined. For each k{0}, if uk is the strong solution to (3.3), then we can choose uk+1=uk, which satisfies (4.2) and terminates the algorithm. If uk does not solve (3.3), the fact that is dense in H2(Ω) enables us to find uk+1 satisfying the criterion (4.2).

Moreover, one can clearly see from (4.2) that if Algorithm 4.2 terminates at iteration k, then uk is the exact solution to the Dirichlet problem (2.2). Hence in the sequel we shall assume without loss of generality that Algorithm 4.2 runs infinitely, i.e., uk+1-ukH2(Ω)>0 and uku* for all k{0}.

We next show the iterates converge to u* in H2(Ω) by following similar arguments as those for Theorem 3.7. For each k0, we can deduce from (4.2) that there exists fkeL2(Ω) and gkeH3/2(Ω) such that

Lkuk+1-fk=fke,in Ω;τuk+1-g=gke,on Ω, (4.3)

and Jk(uk+1)=fkeL2(Ω)2+gkeH3/2(Ω)2ηk+1(uk+1-ukH2(Ω)2) with limkηk=0. Then, by taking the difference between (4.3) and (2.2), we obtain that

-aijij(uk+1-u*)+c¯0(uk+1-u*) =-bkii(uk+1-uk)-c¯k(uk+1-uk)
-[G¯(,uk,uk)-G¯(,u*,u*)]+fke,in Ω,

and τ(uk+1-u*)=gke on Ω, where G¯ is the modified Hamiltonian defined as in (3.19). Then, by proceeding along the lines of Theorem 3.7, we can obtain a positive constant C, independent of c¯0 and the index k, such that

uk+1-u*H2(Ω)+c¯0uk+1-u*L2(Ω)
C(uk+1-u*H1(Ω)+uk+1-ukH1(Ω)+uk-u*H1(Ω))+o(uk+1-ukH2(Ω))
C(uk+1-u*H1(Ω)+uk-u*H1(Ω))+o(uk+1-u*H2(Ω)+uk-u*H2(Ω))

as k, where the additional high-order terms are due to the residuals fke and gke. Then, by using the interpolation inequality and assuming c¯0 is sufficiently large, we can deduce that {uk}k converge linearly to u* in H2(Ω).

We then reformulate Algorithm 4.2 into a quasi-Newton method for the operator equation F~(u)=0, with the operator F~:uH2(Ω)(F(u),τu-g)L2(Ω)×H3/2(Ω) defined in the proof of Theorem 3.6. Let H2(Ω)* denote the strong dual space of H2(Ω), and , denote the dual product on H2(Ω)*×H2(Ω). For each k{0}, by using the fact that uk+1-ukH2(Ω)>0, we can choose wkH2(Ω)* satisfying wk,uk+1-uk=-1, and introduce the following linear operators δLk(H2(Ω),L2(Ω)) and δτk(H2(Ω),H3/2(Ω)):

δLk:vH2(Ω)wk,vfkeL2(Ω)andδτk:vH2(Ω)wk,vgkeH3/2(Ω).

Then, we can apply the identity F(uk)=Lkuk-fk and rewrite (4.3) as:

(Lk+δLk)(uk+1-uk)=-F(uk),in Ω;(τ+δτk)(uk+1-uk)=-(τuk-g),on Ω,

with (Lk,τ)*F~(uk) as shown in Theorem 3.6. Hence one can clearly see that (4.3) is precisely a Newton step with a perturbed operator for the equation F~(u)=0.

Now we are ready to establish the superlinear convergence of {uk}k. For notational simplicity, in the subsequent analysis we shall denote by ZL2(Ω)×H3/2(Ω) the Banach space with the usual product norm zZz1L2(Ω)+z2H3/2(Ω) for each z=(z1,z2)Z. By using the semismoothess of F~:H2(Ω)Z (see Theorem 3.6) and the strong convergence of {uk}k in H2(Ω), we can directly infer from Theorem A.5 that it remains to show that there exists a open neighborhood V of u*, and a constant L>0, such that

v-u*H2(Ω)/LF~(v)-F~(u*)ZLv-u*H2(Ω),vV, (4.4)

and also

limk(δLksk,δτksk)Z/skH2(Ω)=0,with sk=uk+1-uk for all k. (4.5)

The criterion (4.2) and the definitions of δLk and δτk imply that (4.5) holds:

((δLksk,δτksk)ZskH2(Ω))2=(fkeL2(Ω)+gkeH3/2(Ω)skH2(Ω))22Jk(uk+1)skH2(Ω)22η0ηk+10,

as k. Moreover, the boundedness of the coefficients aij,bi,c,f shows that F~ is Lipschitz continuous. Finally, the characterization of the generalized differential of F~ in Theorem 3.6 and the regularity theory of elliptic operators (see Theorem A.1) show that for each vH2(Ω), we can choose an invertible operator Mv=(Lv,τ)*F~(v) such that Mv-1(Z,H2(Ω))C<, uniformly in v. Thus we can conclude from the semismoothness of F~ at u* that

F~(v)-F~(u*)Z =Mv(v-u*)+o(v-u*H2(Ω))ZMv(v-u*)Z-o(v-u*H2(Ω))
v-u*H2(Ω)/C-o(v-u*H2(Ω))v-u*H2(Ω)/(2C),

for all v in some neighborhood V of u*, which completes our proof for q-superlinear convergence of {uk}k.

Finally, we establish the pointwise convergence of {uk}k=1 and their derivatives. For any given γ(0,1), the superlinear convergence of {uk}k=1 implies that there exists a constant C>0, depending on γ, such that uk-u*H2(Ω)2Cγ2k for all k. Taking the summation over the index k, we have

Ωk=1(|uk-u*|2+i,j=1n[|iuk-iu*|2+|ijuk-iju*|2])dx=k=1uk-u*H2(Ω)2Cγ21-γ2<,

where we used the monotone convergence theorem in the first equality. Thus, we have

k=1(|uk-u*|2+i,j=1n[|iuk-iu*|2+|ijuk-iju*|2])(x)<,for a.e. xΩ,

which leads us to the pointwise convergence of uk and its partial derivatives with respect to k. ∎

Remark 4.2.

We reiterate that merely including the L2(Ω)-norm of the boundary data in the cost functional (4.2) in general cannot guarantee the convergence of the derivatives of the numerical solutions {uk}k=1, which can be seen from the following simple example. Let {gk}k=1H3/2(Ω) be a sequence such that gk0 in L2(Ω) but not in H1/2(Ω), and for each k, let hkH2(Ω) be the strong solution to -Δhk=0 in Ω and hk=gk on Ω.

The fact that gk↛0 in H1/2(Ω) implies that hk↛0 in H1(Ω) as k. We now show limkhk=0 in L2(Ω). Let wH2(Ω) be the solution to -Δw=hk in Ω and w=0 on Ω, we can deduce from the integration by parts and the a priori estimate wH2(Ω)ChkL2(Ω) that

hkL2(Ω)2 =Ω(-Δw)hk𝑑x=Ωw(-Δhk)𝑑x+Ωwnhkdσ-Ωhknwdσ
CgkL2(Ω)wH2(Ω)CgkL2(Ω)hkL2(Ω),

which shows that hkL2(Ω)CgkL2(Ω)0 as k. Now let be a given family of trial functions, which is dense in H2(Ω). One can find {uk}k=1 satisfying limkuk-hkH2(Ω)=0, and consequently uk↛0 in H1(Ω) as k. However, we have

-ΔukL2(Ω)2+ukL2(Ω)2=-Δ(uk-hk)L2(Ω)2+uk-hk+gkL2(Ω)20,as k.

Similarly, one can construct functions {uk}k=1 such that -ΔukL2(Ω)2+ukH1/2(Ω)20 as k, but {uk}k=1 does not converge to 0 in H2(Ω).

We end this section with a convergent approximation of the optimal control strategies based on the iterates {uk}k=1 generated by Algorithm 4.2. For any given uH2(Ω), we denote by 𝐀u(x) and 𝐁u(x,α) the set of optimal control strategies for all α𝐀 and for a.e. xΩ, such that

𝐁u(x,α) =argminβ𝐁(bi(x,α,β)iu(x)+c(x,α,β)u(x)-f(x,α,β)),
𝐀u(x) =argmaxα𝐀minβ𝐁(bi(x,α,β)iu(x)+c(x,α,β)u(x)-f(x,α,β)).

As an important consequence of the superlinear convergence of Algorithm 4.2, we now conclude that the feedback control strategies {αk}k=1 and {βk}k=1 generated by Algorithm 4.2 are convergent to the optimal control strategies.

Corollary 4.4.

Suppose the assumptions of Theorem 4.3 hold, and let u*H2(Ω) be the solution to the Dirichlet problem (2.2). Assume further that there exist functions α*:Ω𝐀 and β*:Ω𝐁 such that 𝐀u*(x)={α*(x)} and 𝐁u*(x,α*(x))={β*(x)} for a.e. xΩ. Then the measurable functions αk:Ω𝐀 and βk:Ω𝐁, kN, generated by Algorithm 4.2 converge to the optimal feedback control (α*,β*) pointwise almost everywhere.

Proof.

Let and h be the Carathéodory functions defined by (3.4) and (3.5), respectively, and we consider the following set-valued mappings:

Γ1:(x,𝐮)Ω×n+1Γ1(x,𝐮)argmaxα𝐀h(x,𝐮,α),Γ2:(x,𝐮,α)Ω×n+1×𝐀Γ2(x,𝐮,α)argminβ𝐁(x,𝐮,α,β). (4.6)

Theorem A.4 implies that the set-valued mappings Γ1(x,):n+1𝐀 and Γ2(x,,):n+1×𝐀𝐁 are upper hemicontinuous. Then the result follows directly from the pointwise convergence of (uk,uk)k=1 in Theorem 4.3, and the fact that 𝐀u*(x)={α*(x)} and 𝐁u*(x,α*(x))={β*(x)} are singleton for a.e. xΩ. ∎

Remark 4.3.

If we assume in addition that 𝐀XA and 𝐁YB for some Banach spaces XA and YB, then by using the compactness of A and B (see (H.1)), we can conclude from the dominated convergence theorem that αkα* in Lp(Ω;XA) and βkβ* in Lp(Ω;YB), for any p[1,).

5 Inexact policy iteration for HJBI oblique derivative problems

In this section, we extend the algorithms introduced in previous sections to more general boundary value problems. In particular, we shall propose a neural network based policy iteration algorithm with global H2-superlinear convergence for solving HJBI boundary value problems with oblique derivative boundary conditions. Similar arguments can be adapted to design superlinear convergent schemes for mixed boundary value problems with both Dirichlet and oblique derivative boundary conditions.

We consider the following HJBI oblique derivative problem:

F(u) -aij(x)iju+G(x,u,u)=0,a.e. xΩ, (5.1a)
Bu γiτ(iu)+γ0τu-g=0,on Ω. (5.1b)

where (5.1a) is the HJBI equation given in (2.2a), and (5.1b) is an oblique boundary condition. Note that the boundary condition Bu on Ω involves the traces of u and its first partial derivatives, which exist almost everywhere on Ω (with respect to the surface measure).

The following conditions are imposed on the coefficients of (5.1):

H. 4.

Assume Ω, A, B, (aij)i,j=1n, (bi)i=1n,c,f satisfy the same conditions as those in (H.1). Let gH1/2(Ω), {γi}i=0nC0,1(Ω), γ00 on Ω, and assume there exists a constant μ>0, such that cμ on Ω×𝐀×𝐁, and i=1nγiνiμ on Ω, where {νi}i=1n are the components of the unit outer normal vector field on Ω.

The next proposition establishes the well-posedness of the oblique derivative problem.

Proposition 5.1.

Suppose (H.4) holds. Then the oblique derivative problem (2.2) admits a unique strong solution u*H2(Ω).

Proof.

We shall establish the uniqueness of strong solutions to (5.1) in this proof, and then explicitly construct the solution in Theorem 5.2 with the help of policy iteration; see also Theorem 3.7. Suppose that u,vH2(Ω) are two strong solutions to (5.1), then we can see w=u-v is a strong solution to the following linear oblique derivative problem:

-aijijw+b~iiw+c~w=0,a.e. in Ω;γiτ(iw)+γ0τw=0,on Ω, (5.2)

where b~i is defined as in Proposition 2.1, and

c~(x) ={G(x,u,u)-G(x,v,u)(u-v)(x),on {xΩ(u-v)(x)0},μ,otherwise.

By following the same arguments as the proof of Proposition 2.1, we can show that b~i,c~L(Ω), and c~μ>0 a.e. in Ω, which, along with Theorem A.2, implies that w*=0 is the unique strong solution to (5.2), and consequently u=v in H2(Ω). ∎

Now we present the neural network based policy iteration algorithm for solving the oblique derivative problem and establish its rate of convergence.

{algorithm}

[H] Inexact policy iteration algorithm for oblique derivative problems.

  1. 1.

    Choose a family of trial functions ={M}MH2(Ω), an initial guess u0 in , a sequence {ηk}k{0} of positive scalars, and set k=0.

  2. 2.

    Given the iterate uk, update the control laws αk and βk by (3.1) and (3.2), respectively.

  3. 3.

    Find uk+1 such that

    Jk(uk+1)=Lkuk+1-fkL2(Ω)2+Buk+1H1/2(Ω)2ηk+1min(uk+1-ukH2(Ω)2,η0), (5.3)

    where Lk, fk, and B denote the linear operator in (3.3), the source term in (3.3) and the boundary operator in (5.1b), respectively.

  4. 4.

    If uk+1-ukH2(Ω)=0, then terminate with outputs uk+1,αk and βk, otherwise increment k by one and go to step 2.

Note that the H1/2(Ω)-norm of the boundary term is included in the cost function Jk, instead of the H3/2(Ω)-norm as in Algorithm 4.2. It is straightforward to see that Algorithm 5.1 is well-defined under (H.3) and (H.4). In fact, for each k{0}, given the iterate ukH2(Ω), Corollary 3.5 shows that one can select measurable control laws (αk,βk) such that the following linear oblique boundary value problem has measurable coefficients:

Lku-fk=0,in Ω;Bu=0,on Ω,

and hence admits a unique strong solution u¯k in H2(Ω) (see Theorem A.2). If uk=u¯k, then uk solve the HJBI oblique derivative problem (5.1), and we can select uk+1=uk and terminate the algorithm. Otherwise, the facts that Jk(uk+1)Cu¯k-uk+1H2(Ω)2 and is dense in H2(Ω) allows us to choose uk+1 sufficiently closed to u¯ such that the criterion (5.3) is satisfied, and proceed to the next iteration.

The following result is analogue to Theorem 4.3, and shows the global superlinear convergence of Algorithm 5.1 for solving the oblique derivative problem (5.1). The proof follows precisely the lines given in Theorem 4.3, hence we shall only present the main steps in Appendix B for the reader’s convenience. The convergence of feedback control laws can be concluded similarly to Corollary 4.4 and Remark 4.3.

Theorem 5.2.

Suppose (H.2), (H.3) and (H.4) hold, and limkηk=0 in Algorithm 5.1. Let u*H2(Ω) be the solution to the oblique derivative problem (5.1). Then for any initial guess u0F, Algorithm 5.1 either terminates with uk=u* for some kN, or generates a sequence {uk}kN that converges q-superlinearly to u* in H2(Ω), i.e., limkuk+1-u*H2(Ω)/uk-u*H2(Ω)=0. Consequently, we have limk(uk,iuk,ijuk)(x)=(u*,iu*,iju*)(x) for a.e. xΩ, and for all i,j=1,,n.

6 Numerical experiments: Zermelo’s Navigation Problem

In this section, we illustrate the theoretical findings and demonstrate the effectiveness of the schemes through numerical experiments. We present a two-dimensional convection-dominated HJBI Dirichlet boundary value problem in an annulus, which is related to stochastic minimum time problems.

In particular, we consider the stochastic Zermelo navigation problem (see e.g. [37]), which is a time-optimal control problem where the objective is to find the optimal trajectories of a ship/aircraft navigating a region of strong winds, modelled by a random vector field. Given a bounded open set Ωn and an adaptive control strategy {αt}t0 taking values in A, we assume the dynamics Xx,α of the ship is governed by the following controlled dynamics:

dXt=b(Xt,αt)dt+σdWt,t[0,);X0=xΩ,

where the drift coefficient b:Ω×𝐀n is the sum of the velocity of the wind and the relative velocity of the ship, the nondegenerate diffusion coefficient σ:Ωn×n describes a random perturbation of the velocity field of the wind, and W is an n-dimensional Brownian motion defined on a probability space (Ω~,{t}t0,).

The aim of the controller is to minimize the expected exit time of the region Ω, taking model ambiguity into account in the spirit of [41]. More generally, we consider the following value function:

u(x)infα𝒜[0τx,αf(Xtx,α)𝑑t+g(Xτx,αx,α)]=infα𝒜sup𝔼[0τx,αf(Xtx,α)𝑑t+g(Xτx,αx,α)] (6.1)

over all admissible choices of α𝒜, where τx,αinf{t0Xtx,αΩ} denotes the first exit time of the controlled dynamics Xx,α, the functions f and g denote the running cost and the exit cost, respectively, which indicate the desired destinations, and is a family of absolutely continuous probability measures with respect to with density Mt=exp(0tβt𝑑Wt-120tβt2𝑑t), where {βt}t0 is a predictable process satisfying βt=maxi|βt,i|κ for all t and a given parameter κ0. In other words, we would like to minimize a functional of the trajectory up to the exit time under the worst-case scenario, with uncertainty arising from the unknown law of the random perturbation.

By using the dual representation of [] and the dynamic programming principle (see e.g. [41, 10]), we can characterize the value function u as the unique viscosity solution to an HJBI Dirichlet boundary value problem of the form (2.2). Moreover, under suitable assumptions, one can further show that u is the strong (Sobolev) solution to this Dirichlet problem (see e.g. [33]).

For our numerical experiments, we assume that the domain Ω is an annulus, i.e., Ω={(x,y)2r2<x2+y2<R2}, the wind blows along the positive x-axis with a magnitude vc:

vc(x,y)=1-asin(πx2+y2-r2R2-r2),for some constant a[0,1),

which decreases in terms of the distance from the bank, and the random perturbation of the wind is given by the constant diffusion coefficient σ=diag(σx,σy). We also assume that the ship moves with a constant velocity vs, and the captain can control the boat’s direction instantaneously, which leads to the following dynamics of the boat in the region:

(dXtx,αdYtx,α)=(vc(Xtx,α,Ytx,α)+vscos(αt)vssin(αt))dt+(σx00σy)dWt,t0;(X0x,αY0x,α)=x,

where αt𝐀=[0,2π] represents the angle (measured counter-clockwise) between the positive x-axis and the direction of the boat. Finally, we assume the exit cost g0 on Br(0) and g1 on BR(0), which represents that the controller prefers to exit the domain through the inner boundary instead of the outer one (see Figure 6). Then the corresponding Dirichlet problem for the value function u in (6.1) is given by: u0 on Br(0), u1 on BR(0), and

F(u) =-12(σx2uxx+σy2uyy)-vcux-vsinfα𝐀[(cos(α),sin(α))Tu]-supβκ[βT(σu)]-f
=-12(σx2uxx+σy2uyy)-vcux+vsu2-κσu1-f=0,in Ω, (6.2)

where 1 and 2 denote the 1-norm and 2-norm on 2, respectively. The optimal feedback control laws can be further computed as

α*=π+θ,β*=κ(sgn(σxux),sgn(σyuy))T,a.e. in Ω, (6.3)

where uH2(Ω) is the strong solution to (6), and θ(-π,π] is the angle between u and the positive x direction. Note that the equation (6) is neither convex nor concave in u.


Figure 1: Zermelo navigation problem in an annulus.

6.1 Implementation details

In this section, we discuss the implementation details of Algorithm 4.2 for solving (6) with multi-layer neural networks (see Definition 4.1) as the trial functions. We shall now introduce the architecture of the neural networks, the involved hyper-parameters, and various computational aspects of the training process.

For simplicity, we shall adopt a fixed set of trial functions M for all policy iterations, which contains fully-connected networks with the activation function ϱ(y)=tanh(y), the depth L, and the dimension of each hidden layer H. The hyper-parameters L and H will be chosen depending on the complexity of the problem, which ensures that M admits sufficient flexibility to approximate the solutions within the desired accuracy. More complicated architectures of neural networks with shortcut connections can be adopted to further improve the performance of the algorithm (see e.g. [16, 44]).

We then proceed to discuss the computation of the cost functional Jk in (4.2) for each policy iteration. It is well-known that Sobolev norms of functions on sufficiently smooth boundaries can be explicitly computed via local coordinate charts of the boundaries (see e.g. [21]). In particular, due to the annulus shaped domain and the constant boundary conditions used in our experiment, we can express the cost functional Jk as follows: for all k and uM,

Jk(u)=Lku-fkL2(Ω)2+l=r,R[u-gL2(Bl(0))2+γ(-ππ|Dθ(uΦl)|2dθ+(-π,π)2|Dθ(uΦl)(θ1)-Dθ(uΦl)(θ2)|2|θ1-θ2|2dθ1dθ2)], (6.4)

where we define the map Φl:θ(-π,π)(lcos(θ),lsin(θ))Bl(0) for l=r,R. Note that we introduce an extra weighting parameter γ>0 in (6.4), which helps achieve the optimal balance between the residual of the PDE and the residuals of the boundary data. We set the parameter γ=0.1 for all the computations.

The cost functional (6.4) is further approximated by an empirical cost via the collocation method (see [35, 6]), where we discretize Ω and Θ=(-π,π)2 by sets of collocation points Ωd={xiΩ1iNd} and Θd={θ=(θ1,i,θ2,i)Θ1iNb}, respectively, and write the discrete form of (6.4) as follows: for all k and uM,

Jk,d(u)= |Ω|NdxiΩd|Lku(xi)-fk(xi)|2+l=r,R[|Bl(0)|NbθΘd|(u-g)Φl(θ1,i)|2 (6.5)
+γ(2πNbθΘd|Dθ(uΦl)|2(θ1,i)+(2π)2NbθΘd|Dθ(uΦl)(θ1,i)-Dθ(uΦl)(θ2,i)|2|θ1,i-θ2,i|2)],

where |Ω|=π(R2-r2), and |Bl(0)|=2πl for l=r,R are, respectively, the Lebesgue measures of the domain and boundaries. Note that the choice of the smooth activation function ϱ(y)=tanh(y) implies that every trial function uM is smooth, hence all its derivatives are well-defined at any given point. For simplicity, we take the same number of collocation points in the domain and on the boundaries, i.e., Nd=Nb=N.

It is clear that the choice of collocation points is crucial for the accuracy and efficiency of the algorithm. Since the total number of points in a regular grid grows exponentially with respect to the dimension, such a construction is infeasible for high-dimensional problems. Moreover, it is well-known that uniformly distributed pseudorandom points in high dimensions tend to cluster on hyperplanes and lead to a suboptimal distribution by relevant measures of uniformity (see e.g. [11, 6]). Therefore, we shall generate collocation points by a quasi-Monte Carlo (QMC) method based on low-discrepancy sequences. In particular, we first define points in [0,1]2 from the generalized Halton sequence (see [17]), and then map those points into the annulus via the polar map (x,y)(lcos(ψ),lsin(ψ)), where l=(R2-r2)x+r2 and ψ=2πy for all (x,y)[0,1]2. The above transformation preserves fractiona