Abstract
Conservation laws are considered to be fundamental laws of nature. It hasbroad application in many fields including physics, chemistry, biology,geology, and engineering. Solving the differential equations associated withconservation laws is a major branch in computational mathematics. Recentsuccess of machine learning, especially deep learning, in areas such ascomputer vision and natural language processing, has attracted a lot ofattention from the community of computational mathematics and inspired manyintriguing works in combining machine learning with traditional methods. Inthis paper, we are the first to explore the possibility and benefit of solvingnonlinear conservation laws using deep reinforcement learning. As a proof ofconcept, we focus on 1dimensional scalar conservation laws. We deploy themachinery of deep reinforcement learning to train a policy network that candecide on how the numerical solutions should be approximated in a sequentialand spatialtemporal adaptive manner. We will show that the problem of solvingconservation laws can be naturally viewed as a sequential decision makingprocess and the numerical schemes learned in such a way can easily enforcelongterm accuracy. Furthermore, the learned policy network can determine agood local discrete approximation based on the current state of the solution,which essentially makes the proposed method a metalearning approach. In otherwords, the proposed method is capable of learning how to discretize for a givensituation mimicking human experts. Finally, we will provide details on how thepolicy network is trained, how well it performs compared with somestateoftheart numerical solvers such as WENO schemes, and how well itgeneralizes.
Quick Read (beta)
Learning to Discretize: Solving 1D Scalar Conservation Laws via Deep Reinforcement Learning
Abstract
Conservation laws are considered to be fundamental laws of nature. It has broad application in many fields including physics, chemistry, biology, geology, and engineering. Solving the differential equations associated with conservation laws is a major branch in computational mathematics. Recent success of machine learning, especially deep learning, in areas such as computer vision and natural language processing, has attracted a lot of attention from the community of computational mathematics and inspired many intriguing works in combining machine learning with traditional methods. In this paper, we are the first to explore the possibility and benefit of solving nonlinear conservation laws using deep reinforcement learning. As a proof of concept, we focus on 1dimensional scalar conservation laws. We deploy the machinery of deep reinforcement learning to train a policy network that can decide on how the numerical solutions should be approximated in a sequential and spatialtemporal adaptive manner. We will show that the problem of solving conservation laws can be naturally viewed as a sequential decision making process and the numerical schemes learned in such a way can easily enforce longterm accuracy. Furthermore, the learned policy network can determine a good local discrete approximation based on the current state of the solution, which essentially makes the proposed method a metalearning approach. In other words, the proposed method is capable of learning how to discretize for a given situation mimicking human experts. Finally, we will provide details on how the policy network is trained, how well it performs compared with some stateoftheart numerical solvers such as WENO schemes, and how well it generalizes.
Learning to Discretize: Solving 1D Scalar Conservation Laws via Deep Reinforcement Learning
Yufei Wang${}^{\mathrm{*}}$ Yuanpei College Peking University [email protected] Ziju Shen^{†}^{†}thanks: Equal Contribution Academy for Advanced Interdisciplinary Studies Peking University [email protected] Zichao Long School of Mathematical Sciences Peking University [email protected] Bin Dong Beijing International Center for Mathematical Research, Peking University Center for Data Science, Peking University Beijing Institute of Big Data Research [email protected]
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
Conservation laws are considered to be one of the fundamental laws of nature, and has broad applications in multiple fields such as physics, chemistry, biology, geology, and engineering. For example, Burger’s equation, a very classic partial differential equation (PDE) in conservation laws, has important applications in fluid mechanics, nonlinear acoustics, gas dynamics, and traffic flow.
Solving the differential equations associated with conservation laws has been a major branch of computational mathematics LeVeque (1992, 2002), and a lot of effective methods have been proposed, from classic methods such as the upwind scheme, the LaxFriedrichs scheme, to the advanced ones such as the ENO/WENO schemes Liu et al. (1994); Shu (1998), the fluxlimiter methods Jerez Galiano and Uh Zapata (2010), and etc. In the past few decades, these traditional methods have been proven successful in solving conservation laws. Nonetheless, the design of some of the highend methods heavily relies on expert knowledge and the coding of these methods can be a laborious process. To ease the usage and potentially improve these traditional algorithms, machine learning, especially deep learning, has been recently incorporated into this field. For example, the ENO scheme requires lots of ‘if/else’ logical judgments when used to solve complicated system of equations or highdimensional equations. This very much resembles the oldfashioned expert systems. The recent trend in artificial intelligence (AI) is to replace the expert systems by the socalled ‘connectionism’, e.g., deep neural networks, which leads to the recent bloom of AI. Therefore, it is natural and potentially beneficial to introduce deep learning in traditional numerical solvers of conservation laws.
1.1 Related works
In the last few years, neural networks (NNs) have been applied to solving ODEs/PDEs or the associated inverse problems. These works can be roughly classified into three categories according to the way that the NN is used.
The first type of works propose to harness the representation power of NNs, and are irrelevant to the numerical discretization based methods. For example, Raissi et al. (2017a, b) treated the NNs as new ansatz to approximate solutions of PDEs. More recent works along this line include Magiera et al. (2019); Michoski et al. (2019); Both et al. (2019). Besides, several works have focused on using NNs to establish direct mappings between the parameters of the PDEs (e.g. the coefficient field or the ground state energy) and their associated solutions Khoo et al. (2017); Khoo and Ying (2018); Li et al. (2019); Fan et al. (2018b). Furthermore, Han et al. (2018); Beck et al. (2017) proposed a method to solve very highdimensional PDEs by converting the PDE to a stochastic control problem and use NNs to approximate the gradient of the solution.
The second type of works focus on the connection between deep neural networks (DNNs) and dynamic systems Weinan (2017); Chang et al. (2017); Lu et al. (2018); Long et al. (2018b); Chen et al. (2018). These works observed that there are connections between DNNs and dynamic systems (e.g. differential equations or unrolled optimization algorithms) so that we can combine deep learning with traditional tools from applied and computational mathematics to handle challenging tasks in inverse problems Long et al. (2018b, a); Qin et al. (2018).The main focus of these works, however, is to solve inverse problems, instead of learning numerical discretizations of differential equations. Nonetheless, these methods are closely related to numerical differential equations since learning a proper discretization is often an important auxiliary task for these methods to accurately recover the form of the differential equations.
The third type of works, which target at using NNs to learn new numerical schemes, are closely related to our work. However, we note that these works mainly fall in the setting of supervised learning (SL). For example, Discacciati et al. (2019) proposed to integrate NNs into highorder numerical solvers to predict artificial viscosity; Ray and Hesthaven (2018) trained a multilayer perceptron to replace traditional indicators for identifying troubledcells in highresolution schemes for conservation laws. These works greatly advanced the development in machine learning based design of numerical schemes for conservation laws. Note that in Discacciati et al. (2019), the authors only utilized the onestep error to train the artificial viscosity networks without taking into account the longterm accuracy of the learned numerical scheme. Ray and Hesthaven (2018) first constructed several functions with known regularities and then used them to train a neural network to predict the location of discontinuity, which was later used to choose a proper slope limiter. Therefore, the training of the NNs is separated from the numerical scheme. Then, a natural question is whether we can learn discretization of differential equations in an endtoend fashion and the learned discrete scheme also takes longterm accuracy into account. This motivates us to employ reinforcement learning to learn good solvers for conservation laws.
1.2 Our Approach
The main objective of this paper is to design new numerical schemes in an autonomous way. We propose to use reinforcement learning (RL) to aid the process of solving the conservation laws. To our best knowledge, we are the first to propose using (deep) RL to solve evolution PDEs. We carefully design the proposed RLbased method so that the learned policy can generate high accuracy numerical schemes and can well generalize in varied situations. Details will be given in section 3.
Here, we first provide a brief discussion on the benefits of using RL instead of SL to solve conservation laws (the arguments apply to general evolution PDEs as well):

•
Most of the numerical solvers of conservation law can be interpreted naturally as a sequential decision making process (e.g., the approximated grid values at the current time instance definitely affects all the future approximations). Thus, it can be easily formulated as a Markov Decision Process (MDP) and solved by RL.

•
In almost all the RL algorithms, the policy $\pi $ (which is the AI agent who decides on how the solution should be approximated locally) is optimized with regards to the values ${Q}^{\pi}({s}_{0},{a}_{0})=r({s}_{0},{a}_{0})+{\sum}_{t=1}^{\mathrm{\infty}}{\gamma}^{t}r({s}_{t},{a}_{t})$, which by definition considers the longterm accumulated reward (or, error of the learned numerical scheme), thus could naturally guarantee the longterm accuracy of the learned schemes. Furthermore, it can gracefully handle the cases when the action space is discrete, which is in fact one of the major strength of RL.

•
Nonsmooth norms such as the infinity norm of the error is often used to evaluate the performance of the learned numerical schemes. As the norm of the error serves as the loss function for the learning algorithms, computing the gradient of the infinity norm can be problematic for supervised learning, while RL does not have such problem since it does not explicitly take gradients of the loss function (i.e. the reward function for RL).

•
Learning the policy $\pi $ within the RL framework makes the algorithm metalearninglike Schmidhuber (1987); Bengio et al. (1992); Andrychowicz et al. (2016); Li and Malik (2016); Finn et al. (2017). The learned policy $\pi $ can decide on which local numerical approximation to use by judging from the current state of the solution (e.g. local smoothness, oscillatory patterns, dissipation, etc). This is vastly different from regular (nonmeta) learning where the algorithms directly make inference on the numerical schemes without the aid of an additional network such as $\pi $. As subtle the difference as it may seem, metalearninglike methods have been proven effective in various applications such as in image restoration Jin et al. (2017); Fan et al. (2018a); Zhang et al. (2019).
Our paper is organized as follows. In section 2 we briefly review 1dimensional conservation laws and the WENO schemes. In section 3, we discuss how to formulate the process of numerically solving conservation laws into a Markov Decision Process. Then, we present details on how to train a policy network to mimic human expert in choosing discrete schemes in a spatialtemporary adaptive manner by learning from WENO. In section 4, we conduct numerical experiments on 1D conservation laws to demonstrate the performance of our trained policy network. Our experimental results show that the trained policy network indeed learned to adaptively choose good discrete schemes that offer better results than the stateoftheart WENO scheme which is 5th order accurate in space and 4th order accurate in time. This serves as an evidence that the proposed RL framework has the potential to design highperformance numerical schemes for conservation laws in a datadriven fashion. Furthermore, the learned policy network generalizes well to other situations such as different initial conditions, mesh sizes, temporal discrete schemes, etc. The paper ends with a conclusion in section 5, where possible future research directions are also discussed.
2 Preliminaries
2.1 Notations
In this paper, we consider solving the following 1D conservation laws:
$${u}_{t}(x,t)+{f}_{x}(u(x,t))=0,a\le x\le b,t\in [0,T],u(x,0)={u}_{0}(x).$$  (1) 
For example, $f=\frac{{u}^{2}}{2}$ corresponds to the famous Burger’s Equation. We discretize the $(x,t)$plane by choosing a mesh with spatial size $\mathrm{\Delta}x$ and temporal step size $\mathrm{\Delta}t$, and define the discrete mesh points $({x}_{j},{t}_{n})$ by
$${x}_{j}=a+j\mathrm{\Delta}x,{t}_{n}=n\mathrm{\Delta}t\mathit{\hspace{1em}}\text{with}j=0,1,\mathrm{\dots},J=\frac{ba}{\mathrm{\Delta}x},n=0,1,\mathrm{\dots},N=\frac{T}{\mathrm{\Delta}t}.$$ 
We denote ${x}_{j+\frac{1}{2}}={x}_{j}+\mathrm{\Delta}x/2=a+(j+\frac{1}{2})\mathrm{\Delta}x.$ The finite difference methods will produce approximations ${U}_{j}^{n}$ to the solution $u({x}_{j},{t}_{n})$ on the given discrete mesh points. We denote pointwise values of the true solution to be ${u}_{j}^{n}=u({x}_{j},{t}_{n})$, and the true pointwise flux values to be ${f}_{j}^{n}=f(u({x}_{j},{t}_{n}))$.
2.2 WENO – Weighted Essentially NonOscillatory Schemes
WENO (Weighted Essentially NonOscillatory) Liu et al. (1994) is a family of high order accurate finite difference schemes for solving hyperbolic conservation laws, and has been quite successful for a lot of practical problems. The key idea of WENO is a nonlinear adaptive procedure that automatically chooses the smoothest local stencil to reconstruct the numerical flux. Generally speaking, a finite difference method solves Eq.(1) by using a conservative approximation to the spatial derivative of the flux:
$$\frac{d{u}_{j}(t)}{dt}=\frac{1}{\mathrm{\Delta}x}\left({\widehat{f}}_{j+\frac{1}{2}}{\widehat{f}}_{j\frac{1}{2}}\right),$$  (2) 
where ${u}_{j}(t)$ is the numerical approximation to the point value $u({x}_{j},t)$ and ${\widehat{f}}_{j+\frac{1}{2}}$ is the numerical flux generated by a numerical flux policy
$${\widehat{f}}_{j+\frac{1}{2}}={\pi}^{f}({u}_{jr},\mathrm{\dots},{u}_{j+s}),$$ 
which is manually designed. Note that the term “numerical flux policy" is a new terminology that we introduce in this paper, which is exactly the policy we shall learn using RL. In WENO, ${\pi}^{f}$ works as follows. Using the physical flux values $\{{f}_{j2},{f}_{j1},{f}_{j}\}$, we could obtain a ${3}^{th}$ order accurate polynomial interpolation ${\widehat{f}}_{j+\frac{1}{2}}^{2}$, where the indices $\{j2,j1,j\}$ is called a ‘stencil’. We could also use the stencil $\{j1,j,j+1\}$, $\{j,j+1,j+2\}$ or $\{j+1,j+2,j+3\}$ to obtain another three interpolants ${\widehat{f}}_{j+\frac{1}{2}}^{1}$, ${\widehat{f}}_{j+\frac{1}{2}}^{0}$ and ${\widehat{f}}_{j+\frac{1}{2}}^{1}$. The key idea of WENO is to average (with properly designed weights) all these interpolants to obtain the final reconstruction:
$${\widehat{f}}_{j+\frac{1}{2}}=\sum _{r=2}^{1}{w}_{r}{\widehat{f}}_{j+1/2}^{r},\sum _{r=2}^{1}{w}_{r}=1.$$  (3) 
The weight ${w}_{i}$ depends on the smoothness of the stencil. A general principal is: the smoother is the stencil, the more accurate is the interpolant and hence the larger is the weight. To ensure convergence, we need the numerical scheme to be consistent and stable LeVeque (1992). It is known that WENO schemes as described above are consistent. For stability, upwinding is required in constructing the flux. The most easy way is to use the sign of the Roe speed ${\overline{a}}_{j+\frac{1}{2}}=\frac{{f}_{j+\frac{1}{2}}{f}_{j\frac{1}{2}}}{{u}_{j+\frac{1}{2}}{u}_{j\frac{1}{2}}}$ to determine the upwind direction: if ${\overline{a}}_{j+\frac{1}{2}}\ge 0$, we only average among the three interpolants ${\widehat{f}}_{j+\frac{1}{2}}^{2}$, ${\widehat{f}}_{j+\frac{1}{2}}^{1}$ and ${\widehat{f}}_{j+\frac{1}{2}}^{0}$; if $$, we use ${\widehat{f}}_{j+\frac{1}{2}}^{1}$, ${\widehat{f}}_{j+\frac{1}{2}}^{0}$ and ${\widehat{f}}_{j+\frac{1}{2}}^{1}$.
Some further thoughts.
The key of the WENO method lies in how to compute the weight vector $({w}_{1},{w}_{2},{w}_{3},{w}_{4})$, which primarily depends on the smoothness of the solution at local stencils. In WENO, such smoothness is characterized by handcrafted formula, and was proven to be successful in many practical problems when coupled with highorder temporal discretization. However, it remains unknown whether there are better ways to combine the stencils so that higher accuracy can be achieved. Furthermore, estimating the upwind directions is another key component of WENO, which can get quite complicated in highdimensional situations and requires lots of logical judgments (i.e. “if/else"). Can we ease the (some time painful) coding and improve the estimation at the aid of machine learning?
3 Methods
In this section we present how to employ reinforcement learning to solve the conservation laws given by Eq.(1). To better illustrate our idea, we first show in general how to formulate the process of numerically solving a conservation law into an MDP. We then discuss how to incorporate a policy network with the WENO scheme. Our policy network targets at the following two key aspects of WENO:

•
Can we learn to choose better weights to combine the constructed fluxes?

•
Can we learn to automatically judge the upwind direction, without complicated logical judgments?
3.1 MDP Formulation
As shown in Algorithm 1, the procedure of numerically solving a conservation law is naturally a sequential decision making problem. The key of the procedure is the numerical flux policy ${\pi}^{f}$ and the temporal scheme ${\pi}^{t}$ as shown in line 6 and 8 in Algorithm 1. Both policies could be learned using RL. However, in this paper, we mainly focus on using RL to learn the numerical flux policy ${\pi}^{f}$, while leaving the temporal scheme ${\pi}^{t}$ with traditional numerical schemes such as the Euler scheme or the Runge–Kutta methods. A quick review of RL is given in the appendix.
Now, we demonstrate how to formulate the above procedure as an MDP and the construction of the state $S$, action $A$, reward $r$ and transition dynamics $P$. Algorithm 2 shows in general how RL is incorporated into the procedure. In Algorithm 2, we use a single RL agent. Specifically, when computing ${U}_{j}^{n}$:

•
The state for the RL agent is ${s}_{j}^{n}={g}_{s}({U}_{jr1}^{n1},\mathrm{\dots},{U}_{j+s}^{n1})$, where ${g}_{s}$ is a preprocessing function.

•
In general, the action of the agent is used to determine how the numerical fluxes ${\widehat{f}}_{j+\frac{1}{2}}^{n}$ and ${\widehat{f}}_{j\frac{1}{2}}^{n}$ is computed. In the next subsection, we detail how we incorporate ${a}_{j}^{n}$ to be the linear weights of the fluxes computed using different stencils in the WENO scheme.

•
The reward should encourage the agent to generate a scheme that minimizes the error between its approximated value and the true value. Therefore, we define the reward generating function as ${r}_{j}^{n}={g}_{r}({U}_{jr1}^{n}{u}_{jr1}^{n},\mathrm{\cdots},{U}_{j+s}^{n}{u}_{j+s}^{n})$, where ${g}_{r}$ is a another preprocessing function, e.g., a simplest choice is ${g}_{r}=\cdot {}_{2}$.

•
The transition dynamics $P$ is fully deterministic, and depends on the choice of the temporal scheme at line 10 in Algorithm 2. Note that the next state can only be constructed when we have obtained all the point values in the next time step, i.e., ${s}_{j}^{n+1}={g}_{s}({U}_{jr1}^{n},\mathrm{\dots},{U}_{j+s}^{n})$ does not only depends on action ${a}_{j}^{n}$, but also on actions ${a}_{jr1}^{n},\mathrm{\dots},{a}_{j+s}^{n}$ (action ${a}_{j}^{n}$ can only determine the value ${U}_{j}^{n}$). This subtlety can be resolved by viewing the process under the framework of multiagent RL, in which at each mesh point $j$ we use a distinct agent ${A}_{j}^{RL}$, and the next state ${s}_{j}^{n+1}={g}_{s}({U}_{jr1}^{n},\mathrm{\dots},{U}_{j+s}^{n})$ depends on these agents’ joint action ${\mathbf{a}}_{\mathbf{j}}^{\mathbf{n}}=({a}_{jr1}^{n},\mathrm{\dots},{a}_{j+s}^{n})$. However, it is impractical to train $J$ different agents as $J$ is usually very large, therefore we enforce the agents at different mesh point $j$ to share the same weight, which reduces to case of using just a single agent. The single agent can be viewed as a counterpart of a human designer who decides on the choice of a local scheme based on the current state in traditional numerical methods.
3.2 RL Empowered WENO
In this subsection, we present how to transfer the actions of the RL policy network to the weights for WENO fluxes. Instead of directly using ${\pi}^{RL}$ to generate the numerical flux, we use it to produce the weights of numerical fluxes computed using different stencils in WENO.
Specifically, at point ${x}_{j}$ (here we drop the time superscript $n$ for simplicity), to compute the numerical flux ${\widehat{f}}_{j\frac{1}{2}}$ and ${\widehat{f}}_{j+\frac{1}{2}}$, we first construct four fluxes ${\{{\widehat{f}}_{j\frac{1}{2}}^{i}\}}_{i=2}^{1}$ and ${\{{\widehat{f}}_{j+\frac{1}{2}}^{i}\}}_{i=2}^{1}$ using four different stencils just as in WENO, and then use the RL policy ${\pi}^{RL}$ to generate the weights of these fluxes:
$${\pi}^{RL}({s}_{j})=({w}_{j\frac{1}{2}}^{2},{w}_{j\frac{1}{2}}^{1},{w}_{j\frac{1}{2}}^{0},{w}_{j\frac{1}{2}}^{1},{w}_{j+\frac{1}{2}}^{2},{w}_{j+\frac{1}{2}}^{1},{w}_{j+\frac{1}{2}}^{0},{w}_{j+\frac{1}{2}}^{1}).$$ 
The numerical flux is then constructed by averaging these fluxes:
$${\widehat{f}}_{j\frac{1}{2}}=\sum _{i=2}^{1}{w}_{j\frac{1}{2}}^{i}{\widehat{f}}_{j\frac{1}{2}}^{i},{\widehat{f}}_{j+\frac{1}{2}}=\sum _{i=2}^{1}{w}_{j+\frac{1}{2}}^{i}{\widehat{f}}_{j+\frac{1}{2}}^{i}.$$ 
Note that the determination of upwind direction is automatically embedded in the RL policy since it generates four weights at once. For instance, when the roe speed ${\overline{a}}_{j+\frac{1}{2}}\ge 0$, we expect the ${4}^{th}$ weight ${w}_{j+\frac{1}{2}}^{1}\approx 0$ and when $$, we expect ${w}_{j+\frac{1}{2}}^{2}\approx 0$. Note that the upwind direction can be very complicated in a system of equations or in the highdimensional situations, and using the policy network to automatically embed such a process could save lots of efforts in algorithm design and implementation. Our numerical experiments show that ${\pi}^{RL}$ can indeed automatically determine upwind directions for 1D scalar cases. Although this does not mean that it works for systems and/or in highdimensions, it shows the potential of the proposed framework and value for further studies.
4 Experiments
4.1 Setup
In this subsection, we explain the general training setup. We train the RL policy network on the Burger’s equation, whose flux is computed as $f(u)=\frac{1}{2}{u}^{2}$. In all the experiments, we set the leftshift $r=2$ and the right shift $s=3$. The state preprocessing function ${g}_{s}({s}_{j})={g}_{s}({U}_{jr1},\mathrm{\dots},{U}_{j+s})$ will generate two vectors: ${s}^{l}=({f}_{jr1},\mathrm{\dots},{f}_{j+s1},{\overline{a}}_{j\frac{1}{2}})$, and ${s}^{r}=({f}_{jr},\mathrm{\dots},{f}_{j+s},{\overline{a}}_{j+\frac{1}{2}})$ for computing ${\widehat{f}}_{j\frac{1}{2}}$ and ${\widehat{f}}_{j+\frac{1}{2}}$ respectively. ${s}_{l}$ and ${s}_{r}$ will be passed into the same policy neural network ${\pi}_{\theta}^{RL}$ to produce the desired actions, as described in section 3.2. The reward preprocessing function ${g}_{r}$ simply computes the infinity norm, i.e.,
$${g}_{r}({U}_{jr1}{u}_{jr1},\mathrm{\dots},{U}_{j+s}{u}_{j+s})={({U}_{jr1}{u}_{jr1},\mathrm{\dots},{U}_{j+s}{u}_{j+s})}_{\mathrm{\infty}}.$$ 
The policy network ${\pi}_{\theta}^{RL}$ is a feedforward Multilayer Perceptron with $6$ hidden layers, each has 64 neurons and use Relu Goodfellow et al. (2016) as the activation function. We use the Deep Deterministic Policy Gradient Algorithm Lillicrap et al. (2015) to train the RL policy.
To guarantee the generalization power of the trained RL agent, we randomly sampled 20 initial conditions in the form ${u}_{0}(x)=a+b\cdot \text{func}(c\pi x)$, where $a+b\le 3.5$, $\text{func}\in \{sin,cos\}$ and $c\in \{2,4,6\}$. The goal of generating such kind of initial conditions is to ensure they have similar degree of smoothness and thus similar level of difficulty in learning. The computation domain is $1\le x\le 1$ and $0\le t\le 0.8$ with $\mathrm{\Delta}x=0.02$, $\mathrm{\Delta}t=0.004$, and evolve steps $N=200$ (which ensures the appearance of shocks). When training the RL agent, we use the Euler scheme for temporal discretization. The true solution needed for reward computing is generated using WENO on the same computation domain with $\mathrm{\Delta}x=0.001$, $\mathrm{\Delta}t=0.0002$ and the 4th order RungeKutta (RK4).
In the following, we denote the policy network that generates the weights of the WENO fluxes (as described in section 3.2) as RLWENO. We randomly generated another different 10 initial conditions in the same form as training for testing.
\diagbox$\mathrm{\Delta}t$$\mathrm{\Delta}x$  0.02  0.04  0.05  
RLWENO  WENO  RLWENO  WENO  RLWENO  WENO  
0.002  5.66 (1.59)  5.89 (1.74)  8.76 (2.50)  9.09 (2.62)  9.71 (2.42)  10.24 (2.84) 
0.003  5.64 (1.54)  5.86 (1.67)  8.73 (2.46)  9.06 (2.58)  9.75 (2.41)  10.28 (2.81) 
0.004  5.63 (1.55)  5.81 (1.66)  8.72 (2.46)  9.05 (2.55)  9.61 (2.42)  10.13 (2.84) 
0.005  5.08 (1.46)  5.19 (1.58)  8.29 (2.34)  8.58 (2.47)  9.30 (2.26)  9.78 (2.69) 
0.006      8.71 (2.49)  9.02 (2.61)  9.72 (2.38)  10.24 (2.80) 
0.007      8.56 (2.49)  8.84 (2.62)  9.59 (2.41)  10.12 (2.80) 
0.008      8.68 (2.55)  8.93 (2.66)  9.57 (2.49)  10.06 (2.92) 
\diagbox$\mathrm{\Delta}t$$\mathrm{\Delta}x$  0.02  0.04  0.05  
RLWENO  WENO  RLWENO  WENO  RLWENO  WENO  
0.002  4.85 (1.15)  5.17 (1.26)  7.77 (1.95)  8.05 (2.02)  8.16 (1.93)  8.56 (2.19) 
0.003      7.79 (1.96)  8.06 (2.03)  8.18 (1.92)  8.59 (2.18) 
0.004      7.72 (1.93)  7.98 (2.01)  8.15 (1.95)  8.54 (2.20) 
0.005          8.18 (1.94)  8.55 (2.15) 
4.2 Results
We compare the performance of RLWENO and WENO. We also test whether the trained RL policy can generalize to different temporal discretization schemes, mesh sizes and flux functions that are not included in training. Table 1 and Table 2 present the comparison results, where the number shows the relative error (computed as $\frac{{Uu}_{2}}{{u}_{2}}$ with the 2norm taking over all $x$) between the approximated solution $U$ and the true solution $u$, averaged over 250 evolving steps ($T=1.0$) and 10 random initial values. Numbers in the bracket shows the standard deviation over the 10 initial conditions. Several entries in the table are marked as ‘’ because the corresponding CFL number ($dt/dx$) is not small enough to guarantee convergence. Recall that training of the RLWENO was conducted with Euler time discretization, $(\mathrm{\Delta}x,\mathrm{\Delta}t)=(0.02,0.004)$, $T=0.8$ and $f(u)=\frac{1}{2}{u}^{2}$.
Our experimental results show that, compared with the high order accurate WENO (5th order accurate in space and 4th order accurate in time), the linear weights learned by RL not only achieves smaller errors, but also generalizes well to: 1) longer evolving time ($T=0.8$ for training and $T=1.0$ for testing); 2) new time discretization schemes (trained on Euler, tested on RK4); 3) new mesh sizes (see Table 1 and Table 2 for results of varied $\mathrm{\Delta}x$ and $\mathrm{\Delta}t$); and 4) a new flux function (trained on $f(u)=\frac{1}{2}{u}^{2}$ shown in Table 1, tested on $\frac{1}{16}{u}^{4}$ Table 2).
Figure 1 shows some examples of the solutions. As one can see that, the solutions generated by RLWENO have clear advantage over WENO near singularities which is particularly challenging for numerical PDE solvers and important in applications. Figure 2 shows that the learned numerical flux policy can indeed correctly determine upwind directions and generate local numerical schemes in an adaptive fashion. More interestingly, comparing to WENO, RLWENO seems to be able to select stencils in a different way from WENO, and eventually leads to a more accurate solution. This shows that the proposed RL framework has the potential to surpass human experts in designing numerical schemes for conservation laws.
(a) $f(u)=\frac{1}{2}{u}^{2}$ 
(b) $f(u)=\frac{1}{2}{u}^{2}$  (c) $f(u)=\frac{1}{16}{u}^{4}$  (d) $f(u)=\frac{1}{16}{u}^{4}$ 


(e) $f(u)=\frac{1}{2}{u}^{2}$ 
(f) $f(u)=\frac{1}{2}{u}^{2}$  (g) $f(u)=\frac{1}{16}{u}^{4}$  (h) $f(u)=\frac{1}{16}{u}^{4}$ 

(a) Weights of ${\widehat{f}}_{j\frac{1}{2}}$ 
(b) Weights of ${\widehat{f}}_{j+\frac{1}{2}}$ 

5 Conclusion
In this paper, we proposed a general framework to learn how to solve 1dimensional conservation laws via deep reinforcement learning. We first discussed how the procedure of numerically solving conservation laws can be naturally cast in the form of Markov Decision Process. We then elaborated how to relate notions in numerical schemes of PDEs with those of reinforcement learning. In particular, we introduced a numerical flux policy which was able to decide on how numerical flux should be designed locally based on the current state of the solution. In other words, the numerical flux policy mimics human experts in designing numerical schemes, which made the proposed method metalearninglike. Our numerical experiments showed that the proposed RL based solver was able to outperform high order WENO and was well generalized in various cases.
As part of the future works, we would like to consider using the numerical flux policy to inference more complicated numerical fluxes with guaranteed consistency and stability. Furthermore, we can use the proposed framework to learn a policy that can generate adaptive grids and the associated numerical schemes. Lastly, we would like consider system of conservation laws in 2nd and 3rd dimensional space.
References
 (1)
 Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems. 3981–3989.
 Beck et al. (2017) Christian Beck, E Weinan, and Arnulf Jentzen. 2017. Machine learning approximation algorithms for highdimensional fully nonlinear partial differential equations and secondorder backward stochastic differential equations. Journal of Nonlinear Science (2017), 1–57.
 Bengio et al. (1992) Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. 1992. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks. Univ. of Texas, 6–8.
 Both et al. (2019) GertJan Both, Subham Choudhury, Pierre Sens, and Remy Kusters. 2019. DeepMoD: Deep learning for Model Discovery in noisy data. arXiv preprint arXiv:1904.09406 (2019).
 Chang et al. (2017) Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. 2017. Multilevel residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348 (2017).
 Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. 2018. Neural ordinary differential equations. In Advances in Neural Information Processing Systems. 6571–6583.
 Discacciati et al. (2019) Niccolo’ Discacciati, Jan S Hesthaven, and Deep Ray. 2019. Controlling oscillations in highorder Discontinuous Galerkin schemes using artificial viscosity tuned by neural networks. Technical Report.
 Fan et al. (2018a) Qingnan Fan, Dongdong Chen, Lu Yuan, Gang Hua, Nenghai Yu, and Baoquan Chen. 2018a. Decouple learning for parameterized image operators. In Proceedings of the European Conference on Computer Vision (ECCV). 442–458.
 Fan et al. (2018b) Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo ZepedaNúnez. 2018b. A multiscale neural network based on hierarchical matrices. arXiv preprint arXiv:1807.01883 (2018).
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Modelagnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70. JMLR. org, 1126–1135.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT press.
 Han et al. (2018) Jiequn Han, Arnulf Jentzen, and E Weinan. 2018. Solving highdimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115, 34 (2018), 8505–8510.
 Jerez Galiano and Uh Zapata (2010) Silvia Jerez Galiano and Miguel Uh Zapata. 2010. A new TVD fluxlimiter method for solving nonlinear hyperbolic equations. J. Comput. Appl. Math. 234, 5 (2010), 1395–1403.
 Jin et al. (2017) Meiguang Jin, Stefan Roth, and Paolo Favaro. 2017. Noiseblind image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3510–3518.
 Khoo et al. (2017) Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. 2017. Solving parametric PDE problems with artificial neural networks. arXiv preprint arXiv:1707.03351 (2017).
 Khoo and Ying (2018) Yuehaw Khoo and Lexing Ying. 2018. SwitchNet: a neural network model for forward and inverse scattering problems. arXiv preprint arXiv:1810.09675 (2018).
 LeVeque (1992) Randall J LeVeque. 1992. Numerical methods for conservation laws. Vol. 132. Springer.
 LeVeque (2002) Randall J LeVeque. 2002. Finite volume methods for hyperbolic problems. Vol. 31. Cambridge university press.
 Li and Malik (2016) Ke Li and Jitendra Malik. 2016. Learning to optimize. arXiv preprint arXiv:1606.01885 (2016).
 Li et al. (2019) Yingzhou Li, Jianfeng Lu, and Anqi Mao. 2019. Variational training of neural network approximations of solution maps for physical models. arXiv preprint arXiv:1905.02789 (2019).
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
 Liu et al. (1994) XuDong Liu, Stanley Osher, and Tony Chan. 1994. Weighted essentially nonoscillatory schemes. Journal of computational physics 115, 1 (1994), 200–212.
 Long et al. (2018a) Zichao Long, Yiping Lu, and Bin Dong. 2018a. PDENet 2.0: Learning PDEs from Data with A NumericSymbolic Hybrid Deep Network. arXiv preprint arXiv:1812.04426 (2018).
 Long et al. (2018b) Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. 2018b. PDENet: Learning PDEs from Data. In ICML.
 Lu et al. (2018) Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. 2018. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations. In ICML.
 Magiera et al. (2019) Jim Magiera, Deep Ray, Jan S Hesthaven, and Christian Rohde. 2019. ConstraintAware Neural Networks for Riemann Problems. arXiv preprint arXiv:1904.12794 (2019).
 Michoski et al. (2019) Craig Michoski, Milos Milosavljevic, Todd Oliver, and David Hatch. 2019. Solving Irregular and Dataenriched Differential Equations using Deep Neural Networks. arXiv preprint arXiv:1905.04351 (2019).
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Humanlevel control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
 Qin et al. (2018) Tong Qin, Kailiang Wu, and Dongbin Xiu. 2018. Data driven governing equations approximation using deep neural networks. arXiv preprint arXiv:1811.05537 (2018).
 Raissi et al. (2017a) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. 2017a. Physics informed deep learning (part i): Datadriven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561 (2017).
 Raissi et al. (2017b) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. 2017b. Physics informed deep learning (part ii): Datadriven discovery of nonlinear partial differential equations. arXiv preprint arXiv:1711.10566 (2017).
 Ray and Hesthaven (2018) Deep Ray and Jan S Hesthaven. 2018. An artificial neural network as a troubledcell indicator. J. Comput. Phys. 367 (2018), 166–191.
 Schmidhuber (1987) Jurgen Schmidhuber. 1987. Evolutionary principles in selfreferential learning. On learning how to learn: The metameta… hook.) Diploma thesis, Institut f. Informatik, Tech. Univ. Munich (1987).
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML15). 1889–1897.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
 Shu (1998) ChiWang Shu. 1998. Essentially nonoscillatory and weighted essentially nonoscillatory schemes for hyperbolic conservation laws. In Advanced numerical approximation of nonlinear hyperbolic equations. Springer, 325–432.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
 Weinan (2017) E Weinan. 2017. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics 5, 1 (2017), 1–11.
 Zhang et al. (2019) Xiaoshuai Zhang, Yiping Lu, Jiaying Liu, and Bin Dong. 2019. Dynamically unfolding recurrent restorer: A moving endpoint control method for image restoration. ICLR (2019).
Appendix A A Review of Reinforcement Learning
A.1 Reinforcement Learning
Reinforcement Learning (RL) is a general framework for solving sequential decision making problems. Recently, combined with deep neural networks, RL has achieved great success in various tasks such as playing video games from raw screen inputs Mnih et al. (2015), playing Go Silver et al. (2016), and robotics control Schulman et al. (2017). The sequential decision making problem RL tackles is usually formulated as a Markov Decision Process (MDP), which comprises five elements: the state space $S$, the action space $A$, the reward $r:S\times A\to \mathcal{R}$, the transition probability of the environment $P:S\times A\times S\to [0,1]$, and the discounting factor $\gamma $. The interactions between an RL agent and the environment forms a trajectory $\tau =({s}_{0},{a}_{0},{r}_{0},\mathrm{\dots},{s}_{T},{a}_{T},{r}_{T},\mathrm{\dots})$. The return of $\tau $ is the discounted sum of all its future rewards:
$$G(\tau )=\sum _{t=0}^{\mathrm{\infty}}{\gamma}^{t}{r}_{t}$$ 
Similarly, the return of a stateaction pair $({s}_{t},{a}_{t})$ is:
$$G({s}_{t},{a}_{t})=\sum _{l=t}^{\mathrm{\infty}}{\gamma}^{lt}{r}_{l}$$ 
A policy $\pi $ in RL is a probability distribution on the action $A$ given a state $S$: $\pi :S\times A\to [0,1]$. We say a trajectory $\tau $ is generated under policy $\pi $ if all the actions along the trajectory is chosen following $\pi $, i.e., $\tau \sim \pi $ means ${a}_{t}\sim \pi (\cdot {s}_{t})$ and ${s}_{t+1}\sim P(\cdot {s}_{t},{a}_{t})$. Given a policy $\pi $, the value of a state $s$ is defined as the expected return of all the trajectories when the agent starts at $s$ and then follows $\pi $:
$${V}^{\pi}(s)={E}_{\tau}[G(\tau )\tau ({s}_{0})=s,\tau \sim \pi ]$$ 
Similarly, the value of a stateaction pair is defined as the expected return of all trajectories when the agent starts at $s$, takes action $a$, and then follows $\pi $:
$${Q}^{\pi}(s,a)={E}_{\tau}[G(\tau )\tau ({s}_{0})=s,\tau ({a}_{0})=a,\tau \sim \pi ]$$ 
As aforementioned in introduction, in most RL algorithms the policy $\pi $ is optimized with regards to the values ${Q}^{\pi}(s,a)$, thus naturally guarantees the longterm accumulated rewards (in our setting, the longterm accuracy of the learned schemes). Bellman Equation, one of the most important equations in RL, connects the value of a state and the value of its successor state:
$$\begin{array}{cc}\hfill {Q}^{\pi}(s,a)& =r(s,a)+\gamma {E}_{{s}^{\prime}\sim P(\cdot s,a),{a}^{\prime}\sim \pi (\cdot {s}^{\prime})}[{Q}^{\pi}({s}^{\prime},{a}^{\prime})]\hfill \\ \hfill {V}^{\pi}(s)& ={E}_{a\sim \pi (\cdot s),{s}^{\prime}\sim P(\cdot {s}^{\prime},a)}[r(s,a)+\gamma {V}^{\pi}({s}^{\prime})]\hfill \end{array}$$ 
The goal of RL is to find a policy $\pi $ to maximize the expected discounted sum of rewards starting from the initial state ${s}_{0}$, $J(\pi )={E}_{{s}_{0}\sim \rho}[{V}^{\pi}({s}_{0})]$, where $\rho $ is the initial state distribution. If we parameterize $\pi $ using $\theta $, then we can optimize it using the famous policy gradient theorem:
$$\frac{dJ({\pi}_{\theta})}{d\theta}={E}_{s\sim {\rho}^{{\pi}_{\theta}},a\sim {\pi}_{\theta}}[{\nabla}_{\theta}\text{log}{\pi}_{\theta}(as){Q}^{{\pi}_{\theta}}(s,a)]$$ 
where ${\rho}^{{\pi}_{\theta}}$ is the state distribution deduced by the policy ${\pi}_{\theta}$. In this paper we focus on the case where the action space $A$ is continuous, and a lot of mature algorithms has been proposed for such a case, e.g., the Deep Deterministic Policy Gradient (DDPG) Lillicrap et al. (2015), the Trust Region Policy Optimization algorithm Schulman et al. (2015), and etc.