Learning to Discretize: Solving 1D Scalar Conservation Laws via Deep Reinforcement Learning

  • 2019-06-11 15:06:32
  • Yufei Wang, Ziju Shen, Zichao Long, Bin Dong
  • 0

Abstract

Conservation laws are considered to be fundamental laws of nature. It hasbroad application in many fields including physics, chemistry, biology,geology, and engineering. Solving the differential equations associated withconservation laws is a major branch in computational mathematics. Recentsuccess of machine learning, especially deep learning, in areas such ascomputer vision and natural language processing, has attracted a lot ofattention from the community of computational mathematics and inspired manyintriguing works in combining machine learning with traditional methods. Inthis paper, we are the first to explore the possibility and benefit of solvingnonlinear conservation laws using deep reinforcement learning. As a proof ofconcept, we focus on 1-dimensional scalar conservation laws. We deploy themachinery of deep reinforcement learning to train a policy network that candecide on how the numerical solutions should be approximated in a sequentialand spatial-temporal adaptive manner. We will show that the problem of solvingconservation laws can be naturally viewed as a sequential decision makingprocess and the numerical schemes learned in such a way can easily enforcelong-term accuracy. Furthermore, the learned policy network can determine agood local discrete approximation based on the current state of the solution,which essentially makes the proposed method a meta-learning approach. In otherwords, the proposed method is capable of learning how to discretize for a givensituation mimicking human experts. Finally, we will provide details on how thepolicy network is trained, how well it performs compared with somestate-of-the-art numerical solvers such as WENO schemes, and how well itgeneralizes.

 

Quick Read (beta)

Learning to Discretize: Solving 1D Scalar Conservation Laws via Deep Reinforcement Learning

Yufei Wang*
Yuanpei College
Peking University
[email protected]
\ANDZiju Shen
Academy for Advanced Interdisciplinary Studies
Peking University
[email protected]
\ANDZichao Long
School of Mathematical Sciences
Peking University
[email protected]
&Bin Dong
Beijing International Center for Mathematical Research, Peking University
Center for Data Science, Peking University
Beijing Institute of Big Data Research
[email protected]
Equal Contribution
Abstract

Conservation laws are considered to be fundamental laws of nature. It has broad application in many fields including physics, chemistry, biology, geology, and engineering. Solving the differential equations associated with conservation laws is a major branch in computational mathematics. Recent success of machine learning, especially deep learning, in areas such as computer vision and natural language processing, has attracted a lot of attention from the community of computational mathematics and inspired many intriguing works in combining machine learning with traditional methods. In this paper, we are the first to explore the possibility and benefit of solving nonlinear conservation laws using deep reinforcement learning. As a proof of concept, we focus on 1-dimensional scalar conservation laws. We deploy the machinery of deep reinforcement learning to train a policy network that can decide on how the numerical solutions should be approximated in a sequential and spatial-temporal adaptive manner. We will show that the problem of solving conservation laws can be naturally viewed as a sequential decision making process and the numerical schemes learned in such a way can easily enforce long-term accuracy. Furthermore, the learned policy network can determine a good local discrete approximation based on the current state of the solution, which essentially makes the proposed method a meta-learning approach. In other words, the proposed method is capable of learning how to discretize for a given situation mimicking human experts. Finally, we will provide details on how the policy network is trained, how well it performs compared with some state-of-the-art numerical solvers such as WENO schemes, and how well it generalizes.

 

Learning to Discretize: Solving 1D Scalar Conservation Laws via Deep Reinforcement Learning


  Yufei Wang* Yuanpei College Peking University [email protected] Ziju Shenthanks: Equal Contribution Academy for Advanced Interdisciplinary Studies Peking University [email protected] Zichao Long School of Mathematical Sciences Peking University [email protected] Bin Dong Beijing International Center for Mathematical Research, Peking University Center for Data Science, Peking University Beijing Institute of Big Data Research [email protected]

\@float

noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

Conservation laws are considered to be one of the fundamental laws of nature, and has broad applications in multiple fields such as physics, chemistry, biology, geology, and engineering. For example, Burger’s equation, a very classic partial differential equation (PDE) in conservation laws, has important applications in fluid mechanics, nonlinear acoustics, gas dynamics, and traffic flow.

Solving the differential equations associated with conservation laws has been a major branch of computational mathematics LeVeque (1992, 2002), and a lot of effective methods have been proposed, from classic methods such as the upwind scheme, the Lax-Friedrichs scheme, to the advanced ones such as the ENO/WENO schemes Liu et al. (1994); Shu (1998), the flux-limiter methods Jerez Galiano and Uh Zapata (2010), and etc. In the past few decades, these traditional methods have been proven successful in solving conservation laws. Nonetheless, the design of some of the high-end methods heavily relies on expert knowledge and the coding of these methods can be a laborious process. To ease the usage and potentially improve these traditional algorithms, machine learning, especially deep learning, has been recently incorporated into this field. For example, the ENO scheme requires lots of ‘if/else’ logical judgments when used to solve complicated system of equations or high-dimensional equations. This very much resembles the old-fashioned expert systems. The recent trend in artificial intelligence (AI) is to replace the expert systems by the so-called ‘connectionism’, e.g., deep neural networks, which leads to the recent bloom of AI. Therefore, it is natural and potentially beneficial to introduce deep learning in traditional numerical solvers of conservation laws.

1.1 Related works

In the last few years, neural networks (NNs) have been applied to solving ODEs/PDEs or the associated inverse problems. These works can be roughly classified into three categories according to the way that the NN is used.

The first type of works propose to harness the representation power of NNs, and are irrelevant to the numerical discretization based methods. For example, Raissi et al. (2017a, b) treated the NNs as new ansatz to approximate solutions of PDEs. More recent works along this line include Magiera et al. (2019); Michoski et al. (2019); Both et al. (2019). Besides, several works have focused on using NNs to establish direct mappings between the parameters of the PDEs (e.g. the coefficient field or the ground state energy) and their associated solutions Khoo et al. (2017); Khoo and Ying (2018); Li et al. (2019); Fan et al. (2018b). Furthermore, Han et al. (2018); Beck et al. (2017) proposed a method to solve very high-dimensional PDEs by converting the PDE to a stochastic control problem and use NNs to approximate the gradient of the solution.

The second type of works focus on the connection between deep neural networks (DNNs) and dynamic systems Weinan (2017); Chang et al. (2017); Lu et al. (2018); Long et al. (2018b); Chen et al. (2018). These works observed that there are connections between DNNs and dynamic systems (e.g. differential equations or unrolled optimization algorithms) so that we can combine deep learning with traditional tools from applied and computational mathematics to handle challenging tasks in inverse problems Long et al. (2018b, a); Qin et al. (2018).The main focus of these works, however, is to solve inverse problems, instead of learning numerical discretizations of differential equations. Nonetheless, these methods are closely related to numerical differential equations since learning a proper discretization is often an important auxiliary task for these methods to accurately recover the form of the differential equations.

The third type of works, which target at using NNs to learn new numerical schemes, are closely related to our work. However, we note that these works mainly fall in the setting of supervised learning (SL). For example, Discacciati et al. (2019) proposed to integrate NNs into high-order numerical solvers to predict artificial viscosity; Ray and Hesthaven (2018) trained a multilayer perceptron to replace traditional indicators for identifying troubled-cells in high-resolution schemes for conservation laws. These works greatly advanced the development in machine learning based design of numerical schemes for conservation laws. Note that in Discacciati et al. (2019), the authors only utilized the one-step error to train the artificial viscosity networks without taking into account the long-term accuracy of the learned numerical scheme. Ray and Hesthaven (2018) first constructed several functions with known regularities and then used them to train a neural network to predict the location of discontinuity, which was later used to choose a proper slope limiter. Therefore, the training of the NNs is separated from the numerical scheme. Then, a natural question is whether we can learn discretization of differential equations in an end-to-end fashion and the learned discrete scheme also takes long-term accuracy into account. This motivates us to employ reinforcement learning to learn good solvers for conservation laws.

1.2 Our Approach

The main objective of this paper is to design new numerical schemes in an autonomous way. We propose to use reinforcement learning (RL) to aid the process of solving the conservation laws. To our best knowledge, we are the first to propose using (deep) RL to solve evolution PDEs. We carefully design the proposed RL-based method so that the learned policy can generate high accuracy numerical schemes and can well generalize in varied situations. Details will be given in section 3.

Here, we first provide a brief discussion on the benefits of using RL instead of SL to solve conservation laws (the arguments apply to general evolution PDEs as well):

  • Most of the numerical solvers of conservation law can be interpreted naturally as a sequential decision making process (e.g., the approximated grid values at the current time instance definitely affects all the future approximations). Thus, it can be easily formulated as a Markov Decision Process (MDP) and solved by RL.

  • In almost all the RL algorithms, the policy π (which is the AI agent who decides on how the solution should be approximated locally) is optimized with regards to the values Qπ(s0,a0)=r(s0,a0)+t=1γtr(st,at), which by definition considers the long-term accumulated reward (or, error of the learned numerical scheme), thus could naturally guarantee the long-term accuracy of the learned schemes. Furthermore, it can gracefully handle the cases when the action space is discrete, which is in fact one of the major strength of RL.

  • Non-smooth norms such as the infinity norm of the error is often used to evaluate the performance of the learned numerical schemes. As the norm of the error serves as the loss function for the learning algorithms, computing the gradient of the infinity norm can be problematic for supervised learning, while RL does not have such problem since it does not explicitly take gradients of the loss function (i.e. the reward function for RL).

  • Learning the policy π within the RL framework makes the algorithm meta-learning-like Schmidhuber (1987); Bengio et al. (1992); Andrychowicz et al. (2016); Li and Malik (2016); Finn et al. (2017). The learned policy π can decide on which local numerical approximation to use by judging from the current state of the solution (e.g. local smoothness, oscillatory patterns, dissipation, etc). This is vastly different from regular (non-meta) learning where the algorithms directly make inference on the numerical schemes without the aid of an additional network such as π. As subtle the difference as it may seem, meta-learning-like methods have been proven effective in various applications such as in image restoration Jin et al. (2017); Fan et al. (2018a); Zhang et al. (2019).

Our paper is organized as follows. In section 2 we briefly review 1-dimensional conservation laws and the WENO schemes. In section 3, we discuss how to formulate the process of numerically solving conservation laws into a Markov Decision Process. Then, we present details on how to train a policy network to mimic human expert in choosing discrete schemes in a spatial-temporary adaptive manner by learning from WENO. In section 4, we conduct numerical experiments on 1-D conservation laws to demonstrate the performance of our trained policy network. Our experimental results show that the trained policy network indeed learned to adaptively choose good discrete schemes that offer better results than the state-of-the-art WENO scheme which is 5th order accurate in space and 4th order accurate in time. This serves as an evidence that the proposed RL framework has the potential to design high-performance numerical schemes for conservation laws in a data-driven fashion. Furthermore, the learned policy network generalizes well to other situations such as different initial conditions, mesh sizes, temporal discrete schemes, etc. The paper ends with a conclusion in section 5, where possible future research directions are also discussed.

2 Preliminaries

2.1 Notations

In this paper, we consider solving the following 1-D conservation laws:

ut(x,t)+fx(u(x,t))=0,axb,t[0,T],u(x,0)=u0(x). (1)

For example, f=u22 corresponds to the famous Burger’s Equation. We discretize the (x,t)-plane by choosing a mesh with spatial size Δx and temporal step size Δt, and define the discrete mesh points (xj,tn) by

xj=a+jΔx,tn=nΔtwith j=0,1,,J=b-aΔx,n=0,1,,N=TΔt.

We denote xj+12=xj+Δx/2=a+(j+12)Δx. The finite difference methods will produce approximations Ujn to the solution u(xj,tn) on the given discrete mesh points. We denote point-wise values of the true solution to be ujn=u(xj,tn), and the true point-wise flux values to be fjn=f(u(xj,tn)).

2.2 WENO – Weighted Essentially Non-Oscillatory Schemes

WENO (Weighted Essentially Non-Oscillatory) Liu et al. (1994) is a family of high order accurate finite difference schemes for solving hyperbolic conservation laws, and has been quite successful for a lot of practical problems. The key idea of WENO is a nonlinear adaptive procedure that automatically chooses the smoothest local stencil to reconstruct the numerical flux. Generally speaking, a finite difference method solves Eq.(1) by using a conservative approximation to the spatial derivative of the flux:

duj(t)dt=-1Δx(f^j+12-f^j-12), (2)

where uj(t) is the numerical approximation to the point value u(xj,t) and f^j+12 is the numerical flux generated by a numerical flux policy

f^j+12=πf(uj-r,,uj+s),

which is manually designed. Note that the term “numerical flux policy" is a new terminology that we introduce in this paper, which is exactly the policy we shall learn using RL. In WENO, πf works as follows. Using the physical flux values {fj-2,fj-1,fj}, we could obtain a 3th order accurate polynomial interpolation f^j+12-2, where the indices {j-2,j-1,j} is called a ‘stencil’. We could also use the stencil {j-1,j,j+1}, {j,j+1,j+2} or {j+1,j+2,j+3} to obtain another three interpolants f^j+12-1, f^j+120 and f^j+121. The key idea of WENO is to average (with properly designed weights) all these interpolants to obtain the final reconstruction:

f^j+12=r=-21wrf^j+1/2r,r=-21wr=1. (3)

The weight wi depends on the smoothness of the stencil. A general principal is: the smoother is the stencil, the more accurate is the interpolant and hence the larger is the weight. To ensure convergence, we need the numerical scheme to be consistent and stable LeVeque (1992). It is known that WENO schemes as described above are consistent. For stability, upwinding is required in constructing the flux. The most easy way is to use the sign of the Roe speed a¯j+12=fj+12-fj-12uj+12-uj-12 to determine the upwind direction: if a¯j+120, we only average among the three interpolants f^j+12-2, f^j+12-1 and f^j+120; if a¯j+12<0, we use f^j+12-1, f^j+120 and f^j+121.

Some further thoughts.

The key of the WENO method lies in how to compute the weight vector (w1,w2,w3,w4), which primarily depends on the smoothness of the solution at local stencils. In WENO, such smoothness is characterized by handcrafted formula, and was proven to be successful in many practical problems when coupled with high-order temporal discretization. However, it remains unknown whether there are better ways to combine the stencils so that higher accuracy can be achieved. Furthermore, estimating the upwind directions is another key component of WENO, which can get quite complicated in high-dimensional situations and requires lots of logical judgments (i.e. “if/else"). Can we ease the (some time painful) coding and improve the estimation at the aid of machine learning?

3 Methods

In this section we present how to employ reinforcement learning to solve the conservation laws given by Eq.(1). To better illustrate our idea, we first show in general how to formulate the process of numerically solving a conservation law into an MDP. We then discuss how to incorporate a policy network with the WENO scheme. Our policy network targets at the following two key aspects of WENO:

  • Can we learn to choose better weights to combine the constructed fluxes?

  • Can we learn to automatically judge the upwind direction, without complicated logical judgments?

3.1 MDP Formulation

\SetKwProgFnFunction: \SetKwDataInputInput \SetKwDataOutputOutput \DontPrintSemicolon\Input: initial values u00,u10,,uJ0, flux f(u), Δx, Δt, evolve time N, left shift r and right shift s.
\Output: {Ujn|j=0,,J,n=1,,N}
Uj0=uj0,j=0,,J
\Forn=1 \KwToN \Forj=0 \KwToJ Compute the numerical flux f^j-12n=πf(Uj-r-1n-1, Uj-rn-1, …, Uj+s-1n-1) and f^j+12n=πf(Uj-rn-1, Uj-r+1n-1, …, Uj+sn-1), e.g., using the WENO scheme
Compute duj(t)dt=-1Δx(f^j+12n-f^j-12n)
Compute Ujn = πt(Ujn-1,duj(t)dt), e.g., using the Euler scheme Ujn=Ujn-1+Δtduj(t)dt Return {Ujn|j=0,,J,n=1,,N}
\algorithmcfname 1 A Conservation Law Solving Procedure

As shown in Algorithm 1, the procedure of numerically solving a conservation law is naturally a sequential decision making problem. The key of the procedure is the numerical flux policy πf and the temporal scheme πt as shown in line 6 and 8 in Algorithm 1. Both policies could be learned using RL. However, in this paper, we mainly focus on using RL to learn the numerical flux policy πf, while leaving the temporal scheme πt with traditional numerical schemes such as the Euler scheme or the Runge–Kutta methods. A quick review of RL is given in the appendix.

Now, we demonstrate how to formulate the above procedure as an MDP and the construction of the state S, action A, reward r and transition dynamics P. Algorithm 2 shows in general how RL is incorporated into the procedure. In Algorithm 2, we use a single RL agent. Specifically, when computing Ujn:

  • The state for the RL agent is sjn=gs(Uj-r-1n-1,,Uj+sn-1), where gs is a pre-processing function.

  • In general, the action of the agent is used to determine how the numerical fluxes f^j+12n and f^j-12n is computed. In the next subsection, we detail how we incorporate ajn to be the linear weights of the fluxes computed using different stencils in the WENO scheme.

  • The reward should encourage the agent to generate a scheme that minimizes the error between its approximated value and the true value. Therefore, we define the reward generating function as rjn=gr(Uj-r-1n-uj-r-1n,,Uj+sn-uj+sn), where gr is a another pre-processing function, e.g., a simplest choice is gr=-||||2.

  • The transition dynamics P is fully deterministic, and depends on the choice of the temporal scheme at line 10 in Algorithm 2. Note that the next state can only be constructed when we have obtained all the point values in the next time step, i.e., sjn+1=gs(Uj-r-1n,,Uj+sn) does not only depends on action ajn, but also on actions aj-r-1n,,aj+sn (action ajn can only determine the value Ujn). This subtlety can be resolved by viewing the process under the framework of multi-agent RL, in which at each mesh point j we use a distinct agent AjRL, and the next state sjn+1=gs(Uj-r-1n,,Uj+sn) depends on these agents’ joint action 𝐚𝐣𝐧=(aj-r-1n,,aj+sn). However, it is impractical to train J different agents as J is usually very large, therefore we enforce the agents at different mesh point j to share the same weight, which reduces to case of using just a single agent. The single agent can be viewed as a counterpart of a human designer who decides on the choice of a local scheme based on the current state in traditional numerical methods.

\SetKwProgFnFunction: \SetKwDataInputInput \SetKwDataOutputOutput \DontPrintSemicolon\Input: initial values u00,,uJ0, flux f(u), Δx, Δt, evolve time N, left shift r, right shift s and RL policy πRL
\Output: {Ujn|j=0,,J,n=1,,N}
Uj0=uj0,j=0,,J
\ForMany iterations Construct initial states sj0=gs(Uj-r-10,,Uj+s0) for j=0,,J
\Forn=1 \KwToN \Forj=0 \KwToJ Compute the action ajn=πRL(sjn) that determines how f^j+12n and f^j-12n is computed
Compute duj(t)dt=-1Δx(f^j+12n-f^j-12n)
Compute Ujn = πt(Ujn-1,duj(t)dt), e.g., the Euler scheme Ujn=Ujn-1+Δtduj(t)dt
Compute the reward rjn=gr(Uj-r-1n-uj-r-1n,,Uj+sn-uj+sn). Construct the next states sjn+1=gs(uj-r-1n,,uj+sn) for j=0,,J
Use any RL algorithm to train the RL policy πRL with the transitions {(sjn,ajn,rjn,sjn+1)}j=0J.
Return the well-trained RL policy πRL.
\algorithmcfname 2 General RL Running Procedure

3.2 RL Empowered WENO

In this subsection, we present how to transfer the actions of the RL policy network to the weights for WENO fluxes. Instead of directly using πRL to generate the numerical flux, we use it to produce the weights of numerical fluxes computed using different stencils in WENO.

Specifically, at point xj (here we drop the time superscript n for simplicity), to compute the numerical flux f^j-12 and f^j+12, we first construct four fluxes {f^j-12i}i=-21 and {f^j+12i}i=-21 using four different stencils just as in WENO, and then use the RL policy πRL to generate the weights of these fluxes:

πRL(sj)=(wj-12-2,wj-12-1,wj-120,wj-121,wj+12-2,wj+12-1,wj+120,wj+121).

The numerical flux is then constructed by averaging these fluxes:

f^j-12=i=-21wj-12if^j-12i,f^j+12=i=-21wj+12if^j+12i.

Note that the determination of upwind direction is automatically embedded in the RL policy since it generates four weights at once. For instance, when the roe speed a¯j+120, we expect the 4th weight wj+1210 and when a¯j+12<0, we expect wj+12-20. Note that the upwind direction can be very complicated in a system of equations or in the high-dimensional situations, and using the policy network to automatically embed such a process could save lots of efforts in algorithm design and implementation. Our numerical experiments show that πRL can indeed automatically determine upwind directions for 1D scalar cases. Although this does not mean that it works for systems and/or in high-dimensions, it shows the potential of the proposed framework and value for further studies.

4 Experiments

4.1 Setup

In this subsection, we explain the general training setup. We train the RL policy network on the Burger’s equation, whose flux is computed as f(u)=12u2. In all the experiments, we set the left-shift r=2 and the right shift s=3. The state pre-processing function gs(sj)=gs(Uj-r-1,,Uj+s) will generate two vectors: sl=(fj-r-1,,fj+s-1,a¯j-12), and sr=(fj-r,,fj+s,a¯j+12) for computing f^j-12 and f^j+12 respectively. sl and sr will be passed into the same policy neural network πθRL to produce the desired actions, as described in section 3.2. The reward pre-processing function gr simply computes the infinity norm, i.e.,

gr(Uj-r-1-uj-r-1,,Uj+s-uj+s)=-||(Uj-r-1-uj-r-1,,Uj+s-uj+s)||.

The policy network πθRL is a feed-forward Multi-layer Perceptron with 6 hidden layers, each has 64 neurons and use Relu Goodfellow et al. (2016) as the activation function. We use the Deep Deterministic Policy Gradient Algorithm Lillicrap et al. (2015) to train the RL policy.

To guarantee the generalization power of the trained RL agent, we randomly sampled 20 initial conditions in the form u0(x)=a+bfunc(cπx), where |a|+|b|3.5, func{sin,cos} and c{2,4,6}. The goal of generating such kind of initial conditions is to ensure they have similar degree of smoothness and thus similar level of difficulty in learning. The computation domain is -1x1 and 0t0.8 with Δx=0.02, Δt=0.004, and evolve steps N=200 (which ensures the appearance of shocks). When training the RL agent, we use the Euler scheme for temporal discretization. The true solution needed for reward computing is generated using WENO on the same computation domain with Δx=0.001, Δt=0.0002 and the 4th order Runge-Kutta (RK4).

In the following, we denote the policy network that generates the weights of the WENO fluxes (as described in section 3.2) as RL-WENO. We randomly generated another different 10 initial conditions in the same form as training for testing.

\diagboxΔtΔx 0.02 0.04 0.05
RL-WENO WENO RL-WENO WENO RL-WENO WENO
0.002 5.66 (1.59) 5.89 (1.74) 8.76 (2.50) 9.09 (2.62) 9.71 (2.42) 10.24 (2.84)
0.003 5.64 (1.54) 5.86 (1.67) 8.73 (2.46) 9.06 (2.58) 9.75 (2.41) 10.28 (2.81)
0.004 5.63 (1.55) 5.81 (1.66) 8.72 (2.46) 9.05 (2.55) 9.61 (2.42) 10.13 (2.84)
0.005 5.08 (1.46) 5.19 (1.58) 8.29 (2.34) 8.58 (2.47) 9.30 (2.26) 9.78 (2.69)
0.006 - - 8.71 (2.49) 9.02 (2.61) 9.72 (2.38) 10.24 (2.80)
0.007 - - 8.56 (2.49) 8.84 (2.62) 9.59 (2.41) 10.12 (2.80)
0.008 - - 8.68 (2.55) 8.93 (2.66) 9.57 (2.49) 10.06 (2.92)
Table 1: Comparison of relative errors (×10-2) of RL-WENO and WENO with standard deviations of the errors among 10 trials in the parenthesis. Temporal discretization: RK4; flux function: 12u2.
\diagboxΔtΔx 0.02 0.04 0.05
RL-WENO WENO RL-WENO WENO RL-WENO WENO
0.002 4.85 (1.15) 5.17 (1.26) 7.77 (1.95) 8.05 (2.02) 8.16 (1.93) 8.56 (2.19)
0.003 - - 7.79 (1.96) 8.06 (2.03) 8.18 (1.92) 8.59 (2.18)
0.004 - - 7.72 (1.93) 7.98 (2.01) 8.15 (1.95) 8.54 (2.20)
0.005 - - - - 8.18 (1.94) 8.55 (2.15)
Table 2: Comparison of relative errors (×10-2) of RL-WENO and WENO with standard deviations of the errors among 10 trials in the parenthesis. Temporal discretization: RK4; flux function: 116u4.

4.2 Results

We compare the performance of RL-WENO and WENO. We also test whether the trained RL policy can generalize to different temporal discretization schemes, mesh sizes and flux functions that are not included in training. Table 1 and Table 2 present the comparison results, where the number shows the relative error (computed as ||U-u||2||u||2 with the 2-norm taking over all x) between the approximated solution U and the true solution u, averaged over 250 evolving steps (T=1.0) and 10 random initial values. Numbers in the bracket shows the standard deviation over the 10 initial conditions. Several entries in the table are marked as ‘-’ because the corresponding CFL number (dt/dx) is not small enough to guarantee convergence. Recall that training of the RL-WENO was conducted with Euler time discretization, (Δx,Δt)=(0.02,0.004), T=0.8 and f(u)=12u2.

Our experimental results show that, compared with the high order accurate WENO (5th order accurate in space and 4th order accurate in time), the linear weights learned by RL not only achieves smaller errors, but also generalizes well to: 1) longer evolving time (T=0.8 for training and T=1.0 for testing); 2) new time discretization schemes (trained on Euler, tested on RK4); 3) new mesh sizes (see Table 1 and Table 2 for results of varied Δx and Δt); and 4) a new flux function (trained on f(u)=12u2 shown in Table 1, tested on 116u4 Table 2).

Figure 1 shows some examples of the solutions. As one can see that, the solutions generated by RL-WENO have clear advantage over WENO near singularities which is particularly challenging for numerical PDE solvers and important in applications. Figure 2 shows that the learned numerical flux policy can indeed correctly determine upwind directions and generate local numerical schemes in an adaptive fashion. More interestingly, comparing to WENO, RL-WENO seems to be able to select stencils in a different way from WENO, and eventually leads to a more accurate solution. This shows that the proposed RL framework has the potential to surpass human experts in designing numerical schemes for conservation laws.


(a) f(u)=12u2
(b) f(u)=12u2 (c) f(u)=116u4 (d) f(u)=116u4


(e) f(u)=12u2
(f) f(u)=12u2 (g) f(u)=116u4 (h) f(u)=116u4


Figure 1: First row: solutions of RL-WENO (red), WENO (blue) and exact solutions (green). Second row: zoom-in views corresponding to the first row.

(a) Weights of f^j-12
(b) Weights of f^j+12

Figure 2: This figure compares the weights generated by the learned numerical flux policy πRL and those of WENO. The weights shown in (a) are {wj-12r}r=-21; while those in (b) are {wj+12r}r=-21. In each of the two plots, the 4 numbers in the upper bracket of each location are the weights of RL-WENO and those in the lower bracket are the weights of WENO. The relative errors of RL-WENO and WENO are 8.0×10-3 and 2.5×10-2 respectively.

5 Conclusion

In this paper, we proposed a general framework to learn how to solve 1-dimensional conservation laws via deep reinforcement learning. We first discussed how the procedure of numerically solving conservation laws can be naturally cast in the form of Markov Decision Process. We then elaborated how to relate notions in numerical schemes of PDEs with those of reinforcement learning. In particular, we introduced a numerical flux policy which was able to decide on how numerical flux should be designed locally based on the current state of the solution. In other words, the numerical flux policy mimics human experts in designing numerical schemes, which made the proposed method meta-learning-like. Our numerical experiments showed that the proposed RL based solver was able to outperform high order WENO and was well generalized in various cases.

As part of the future works, we would like to consider using the numerical flux policy to inference more complicated numerical fluxes with guaranteed consistency and stability. Furthermore, we can use the proposed framework to learn a policy that can generate adaptive grids and the associated numerical schemes. Lastly, we would like consider system of conservation laws in 2nd and 3rd dimensional space.

References

  • (1)
  • Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems. 3981–3989.
  • Beck et al. (2017) Christian Beck, E Weinan, and Arnulf Jentzen. 2017. Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. Journal of Nonlinear Science (2017), 1–57.
  • Bengio et al. (1992) Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. 1992. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks. Univ. of Texas, 6–8.
  • Both et al. (2019) Gert-Jan Both, Subham Choudhury, Pierre Sens, and Remy Kusters. 2019. DeepMoD: Deep learning for Model Discovery in noisy data. arXiv preprint arXiv:1904.09406 (2019).
  • Chang et al. (2017) Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. 2017. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348 (2017).
  • Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. 2018. Neural ordinary differential equations. In Advances in Neural Information Processing Systems. 6571–6583.
  • Discacciati et al. (2019) Niccolo’ Discacciati, Jan S Hesthaven, and Deep Ray. 2019. Controlling oscillations in high-order Discontinuous Galerkin schemes using artificial viscosity tuned by neural networks. Technical Report.
  • Fan et al. (2018a) Qingnan Fan, Dongdong Chen, Lu Yuan, Gang Hua, Nenghai Yu, and Baoquan Chen. 2018a. Decouple learning for parameterized image operators. In Proceedings of the European Conference on Computer Vision (ECCV). 442–458.
  • Fan et al. (2018b) Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. 2018b. A multiscale neural network based on hierarchical matrices. arXiv preprint arXiv:1807.01883 (2018).
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 1126–1135.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT press.
  • Han et al. (2018) Jiequn Han, Arnulf Jentzen, and E Weinan. 2018. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115, 34 (2018), 8505–8510.
  • Jerez Galiano and Uh Zapata (2010) Silvia Jerez Galiano and Miguel Uh Zapata. 2010. A new TVD flux-limiter method for solving nonlinear hyperbolic equations. J. Comput. Appl. Math. 234, 5 (2010), 1395–1403.
  • Jin et al. (2017) Meiguang Jin, Stefan Roth, and Paolo Favaro. 2017. Noise-blind image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3510–3518.
  • Khoo et al. (2017) Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. 2017. Solving parametric PDE problems with artificial neural networks. arXiv preprint arXiv:1707.03351 (2017).
  • Khoo and Ying (2018) Yuehaw Khoo and Lexing Ying. 2018. SwitchNet: a neural network model for forward and inverse scattering problems. arXiv preprint arXiv:1810.09675 (2018).
  • LeVeque (1992) Randall J LeVeque. 1992. Numerical methods for conservation laws. Vol. 132. Springer.
  • LeVeque (2002) Randall J LeVeque. 2002. Finite volume methods for hyperbolic problems. Vol. 31. Cambridge university press.
  • Li and Malik (2016) Ke Li and Jitendra Malik. 2016. Learning to optimize. arXiv preprint arXiv:1606.01885 (2016).
  • Li et al. (2019) Yingzhou Li, Jianfeng Lu, and Anqi Mao. 2019. Variational training of neural network approximations of solution maps for physical models. arXiv preprint arXiv:1905.02789 (2019).
  • Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
  • Liu et al. (1994) Xu-Dong Liu, Stanley Osher, and Tony Chan. 1994. Weighted essentially non-oscillatory schemes. Journal of computational physics 115, 1 (1994), 200–212.
  • Long et al. (2018a) Zichao Long, Yiping Lu, and Bin Dong. 2018a. PDE-Net 2.0: Learning PDEs from Data with A Numeric-Symbolic Hybrid Deep Network. arXiv preprint arXiv:1812.04426 (2018).
  • Long et al. (2018b) Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. 2018b. PDE-Net: Learning PDEs from Data. In ICML.
  • Lu et al. (2018) Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. 2018. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations. In ICML.
  • Magiera et al. (2019) Jim Magiera, Deep Ray, Jan S Hesthaven, and Christian Rohde. 2019. Constraint-Aware Neural Networks for Riemann Problems. arXiv preprint arXiv:1904.12794 (2019).
  • Michoski et al. (2019) Craig Michoski, Milos Milosavljevic, Todd Oliver, and David Hatch. 2019. Solving Irregular and Data-enriched Differential Equations using Deep Neural Networks. arXiv preprint arXiv:1905.04351 (2019).
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
  • Qin et al. (2018) Tong Qin, Kailiang Wu, and Dongbin Xiu. 2018. Data driven governing equations approximation using deep neural networks. arXiv preprint arXiv:1811.05537 (2018).
  • Raissi et al. (2017a) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. 2017a. Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561 (2017).
  • Raissi et al. (2017b) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. 2017b. Physics informed deep learning (part ii): Data-driven discovery of nonlinear partial differential equations. arXiv preprint arXiv:1711.10566 (2017).
  • Ray and Hesthaven (2018) Deep Ray and Jan S Hesthaven. 2018. An artificial neural network as a troubled-cell indicator. J. Comput. Phys. 367 (2018), 166–191.
  • Schmidhuber (1987) Jurgen Schmidhuber. 1987. Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-… hook.) Diploma thesis, Institut f. Informatik, Tech. Univ. Munich (1987).
  • Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 1889–1897.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • Shu (1998) Chi-Wang Shu. 1998. Essentially non-oscillatory and weighted essentially non-oscillatory schemes for hyperbolic conservation laws. In Advanced numerical approximation of nonlinear hyperbolic equations. Springer, 325–432.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
  • Weinan (2017) E Weinan. 2017. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics 5, 1 (2017), 1–11.
  • Zhang et al. (2019) Xiaoshuai Zhang, Yiping Lu, Jiaying Liu, and Bin Dong. 2019. Dynamically unfolding recurrent restorer: A moving endpoint control method for image restoration. ICLR (2019).

Appendix A A Review of Reinforcement Learning

A.1 Reinforcement Learning

Reinforcement Learning (RL) is a general framework for solving sequential decision making problems. Recently, combined with deep neural networks, RL has achieved great success in various tasks such as playing video games from raw screen inputs Mnih et al. (2015), playing Go Silver et al. (2016), and robotics control Schulman et al. (2017). The sequential decision making problem RL tackles is usually formulated as a Markov Decision Process (MDP), which comprises five elements: the state space S, the action space A, the reward r:S×A, the transition probability of the environment P:S×A×S[0,1], and the discounting factor γ. The interactions between an RL agent and the environment forms a trajectory τ=(s0,a0,r0,,sT,aT,rT,). The return of τ is the discounted sum of all its future rewards:

G(τ)=t=0γtrt

Similarly, the return of a state-action pair (st,at) is:

G(st,at)=l=tγl-trl

A policy π in RL is a probability distribution on the action A given a state S: π:S×A[0,1]. We say a trajectory τ is generated under policy π if all the actions along the trajectory is chosen following π, i.e., τπ means atπ(|st) and st+1P(|st,at). Given a policy π, the value of a state s is defined as the expected return of all the trajectories when the agent starts at s and then follows π:

Vπ(s)=Eτ[G(τ)|τ(s0)=s,τπ]

Similarly, the value of a state-action pair is defined as the expected return of all trajectories when the agent starts at s, takes action a, and then follows π:

Qπ(s,a)=Eτ[G(τ)|τ(s0)=s,τ(a0)=a,τπ]

As aforementioned in introduction, in most RL algorithms the policy π is optimized with regards to the values Qπ(s,a), thus naturally guarantees the long-term accumulated rewards (in our setting, the long-term accuracy of the learned schemes). Bellman Equation, one of the most important equations in RL, connects the value of a state and the value of its successor state:

Qπ(s,a)=r(s,a)+γEsP(|s,a),aπ(|s)[Qπ(s,a)]Vπ(s)=Eaπ(|s),sP(|s,a)[r(s,a)+γVπ(s)]

The goal of RL is to find a policy π to maximize the expected discounted sum of rewards starting from the initial state s0, J(π)=Es0ρ[Vπ(s0)], where ρ is the initial state distribution. If we parameterize π using θ, then we can optimize it using the famous policy gradient theorem:

dJ(πθ)dθ=Esρπθ,aπθ[θlogπθ(a|s)Qπθ(s,a)]

where ρπθ is the state distribution deduced by the policy πθ. In this paper we focus on the case where the action space A is continuous, and a lot of mature algorithms has been proposed for such a case, e.g., the Deep Deterministic Policy Gradient (DDPG) Lillicrap et al. (2015), the Trust Region Policy Optimization algorithm Schulman et al. (2015), and etc.