Abstract
We present a reinforcement learning framework, called ProgrammaticallyInterpretable Reinforcement Learning (PIRL), that is designed to generateinterpretable and verifiable agent policies. Unlike the popular DeepReinforcement Learning (DRL) paradigm, which represents policies by neuralnetworks, PIRL represents policies using a highlevel, domainspecificprogramming language. Such programmatic policies have the benefits of beingmore easily interpreted than neural networks, and being amenable toverification by symbolic methods. We propose a new method, called NeurallyDirected Program Search (NDPS), for solving the challenging nonsmoothoptimization problem of finding a programmatic policy with maximal reward. NDPSworks by first learning a neural policy network using DRL, and then performinga local search over programmatic policies that seeks to minimize a distancefrom this neural "oracle". We evaluate NDPS on the task of learning to drive asimulated car in the TORCS carracing environment. We demonstrate that NDPS isable to discover humanreadable policies that pass some significant performancebars. We also show that PIRL policies can have smoother trajectories, and canbe more easily transferred to environments not encountered during training,than corresponding policies discovered by DRL.
Quick Read (beta)
Programmatically Interpretable Reinforcement Learning
Appendix: Programmatically Interpretable Reinforcement Learning
Abstract
We present a reinforcement learning framework, called Programmatically Interpretable Reinforcement Learning (Pirl), that is designed to generate interpretable and verifiable agent policies. Unlike the popular Deep Reinforcement Learning (Drl) paradigm, which represents policies by neural networks, Pirl represents policies using a highlevel, domainspecific programming language. Such programmatic policies have the benefits of being more easily interpreted than neural networks, and being amenable to verification by symbolic methods. We propose a new method, called Neurally Directed Program Search (Ndps), for solving the challenging nonsmooth optimization problem of finding a programmatic policy with maximal reward. Ndps works by first learning a neural policy network using Drl, and then performing a local search over programmatic policies that seeks to minimize a distance from this neural “oracle”. We evaluate Ndps on the task of learning to drive a simulated car in the Torcs carracing environment. We demonstrate that Ndps is able to discover humanreadable policies that pass some significant performance bars. We also show that Pirl policies can have smoother trajectories, and can be more easily transferred to environments not encountered during training, than corresponding policies discovered by Drl.
1 Introduction
Deep reinforcement learning (Drl) has had a massive impact on the field of machine learning and has led to remarkable successes in the solution of many challenging tasks (Mnih et al., 2015; Silver et al., 2016, 2017). While neural networks have been shown to be very effective in learning good policies, the expressivity of these models makes them difficult to interpret or to be checked for consistency for some desired properties, and casts a cloud over the use of such representations in safetycritical applications.
Motivated to overcome this problem, we propose a learning framework, called Programmatically Interpretable Reinforcement Learning (Pirl)^{1}^{1} 1 Pirl is pronounced PiRL (as in $\pi $Rl), that is based on the idea of learning policies that are represented in a humanreadable language. The Pirl framework is parameterized on a highlevel programming language for policies. A problem instance in Pirl is similar to a one in traditional Rl, but also includes a (policy) sketch that syntactically defines a set of programmatic policies in this language. The objective is to find a program in this set with maximal longterm reward.
Intuitively, the policy programming language and the sketch characterize what we consider “interpretable”. In addition to interpretability, the syntactic restriction on policies has three key benefits. First, the language can be used to implicitly encode the learner’s inductive bias that will be used for generalization. Second, the language can allow effective pruning of undesired policies to make the search for a good policy more efficient. Finally, it allows us to use symbolic program verification techniques to formally reason about the learned policies and check consistency with correctness properties. At the same time, policies in Pirl can have rich semantics, for example allowing actions to depend on events far back in history.
A key technical challenge in Pirl is that the space of policies permitted in an instance can be vast and nonsmooth, making optimization extremely challenging. To address this, we propose a new algorithm called Neurally Directed Program Synthesis (Ndps). The algorithm first uses Drl to compute a neural policy network that has high performance, but may not be expressible in the policy language. This network is then used to direct a local search over programmatic policies. In each iteration of this search, we maintain a set of “interesting” inputs, and update the program so as to minimize the distance between its outputs and the outputs of the neural policy (an “oracle”) on these inputs. The set of interesting inputs is updated as the search progresses. This strategy, inspired by imitation learning (Ross et al., 2011; Schaal, 1999), allows us to perform direct policy search in a highly nonsmooth policy space.
We evaluate our approach in the task of learning to drive a simulated car in the Torcs carracing environment (Wymann et al., 2014), as well as three classic control games (we discuss the former in the main paper, and the latter in the Appendix). Experiments demonstrate that Ndps is able to find interpretable policies that, while not as performant as the policies computed by Drl, pass some significant performance bars. Specifically, in Torcs, our policy sketch allows an unbounded set of programs with branches guarded by unknown conditions, each branch representing a ProportionalIntegralDerivative (PID) controller (Åström & Hägglund, 1995) with unknown parameters. The policy we obtain can successfully complete a lap of the race, and the use of the neural oracle is key to doing so. Our results also suggest that a welldesigned sketch can serve as a regularizer. Due to constraints imposed by the sketch, the policies for Torcs that Ndps learns lead to smoother trajectories than the corresponding neural policies, and can tolerate greater noise. The policies are also more easily transferred to new domains, in particular race tracks not seen during training. Finally, we show, using several properties, that the programmatic policies that we discover are amenable to verification using offtheshelf symbolic techniques.
2 Programmatically Interpretable Reinforcement Learning
In this section, we formalize the problem of programmatically interpretable reinforcement learning (Pirl).
We model a reinforcement learning setting as a Partially Observable Markov Decision Process (Pomdp) $M=(\mathcal{S},\mathcal{A},\mathcal{O},T(\cdot s,a),Z(\cdot s),r,\mathrm{\mathit{I}\mathit{n}\mathit{i}\mathit{t}},\gamma )$. Here, $\mathcal{S}$ is the set of (environment) states. $\mathcal{A}$ is the set of actions that the learning agent can perform, and $\mathcal{O}$ is the set of observations about the current state that the agent can make. An agent action $a$ at the state $s$ causes the environment state to change probabilistically, and the destination state follows the distribution $T(\cdot s,a)$. The probability that the agent makes an observation $o$ at state $s$ is $Z(os)$. The reward that the agent receives on performing action $a$ in state $s$ is given by $r(s,a)$. $\mathrm{\mathit{I}\mathit{n}\mathit{i}\mathit{t}}$ is the initial distribution over environment states. Finally, $$ is a real constant that is used to define the agent’s aggregate reward over time.
A history of $M$ is a sequence $h={o}_{0},{a}_{0},\mathrm{\dots},{a}_{k1},{o}_{k}$, where ${o}_{i}$ and ${a}_{i}$ are, respectively, the agent’s observation and action at the $i$th time step. Let ${\mathscr{H}}_{M}$ be the set of histories in $M$. A policy is a function $\pi :{\mathscr{H}}_{M}\to \mathcal{A}$ that maps each history as above to an action ${a}_{k}$. For each policy, we can define a set of histories that are possible when the agent follows $\pi $. We assume a mechanism to simulate the POMDP and sample histories that are possible under a policy. The policy also induces a distribution over possible rewards ${R}_{i}$ that the agent receives at the $i$th time step. The agent’s expected aggregate reward under $\pi $ is given by $R(\pi )=\mathbf{E}[{\sum}_{i=0}^{\mathrm{\infty}}{\gamma}^{i}{R}_{i}]$. The goal in reinforcement learning is to discover a policy ${\pi}^{*}$ that maximizes $R(\pi )$.
A Programming Language for Policies. The distinctive feature of Pirl is that policies here are expressed in a highlevel, domainspecific programming language. Such a language can be defined in many ways. However, to facilitate search through the space of programs expressible in the language, it is desirable for the language to express computations as compactly and canonically as possible. Because of this, we propose to express parameterized policies using a functional language based on a small number of sideeffectfree combinators. It is known from prior work on program synthesis (Feser et al., 2015) that such languages offer natural advantages in program synthesis.
We collectively refer to observations and actions, as well as auxiliary integers and reals generated during computation, as atoms. Our language considers two kinds of data: atoms and sequences of atoms (including histories). We assume a finite set of basic operators over atoms that is rich enough to capture all common operations on observations and actions.
Figure 1 shows the syntax of this language. The nonterminals $E$ and $x$ represent expressions that evaluate to atoms and histories, respectively. We sketch the semantics of the various language constructs below.

•
$c$ ranges over a universe of numerical constants, and $\oplus $ is a basic operator;

•
$\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}(x,i)$ returns the observation from the $i$th time step from a history $x$, and $\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}(x,1)$ is used as shorthand for the most recent observation;

•
$\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}$ is a standard higherorder combinators over sequences with the semantics:
$\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}(f,[{e}_{1},\mathrm{\dots},{e}_{k}],e)=f({e}_{k},f({e}_{k1},\mathrm{\dots}f({e}_{1},e)))$ 
•
$x,{x}_{1},{x}_{2}$ are variables. As usual, unbound variables are assumed to be inputs.
The language comes with a type system that distinguishes between different types of atoms, and ensures that language constructs are used consistently. The type system can catch common errors, such as applying $\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}(x,k+1)$ to a history of size $k$. This type system identifies a set of expressions whose inputs are histories and outputs are actions. These expressions are known as programmatic policies, or simply programs.
The combinator over sequences, $\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}$, will be operating over histories in our programs. Since histories can be of variable lengths, we restrict these combinators to operate on only the last $n$ elements of a sequence, for some fixed $n$. This restriction provides us the ability to use these combinators in a consistent manner.
$E$  $::=$  $c\mid x\mid \oplus ({E}_{1},\mathrm{\dots},{E}_{k})\mid \mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}(x,i)\mid $  
$\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}((\lambda {x}_{1},{x}_{2}.{E}_{1}),\alpha )$ 
Sketches.
Discovering an optimal programmatic policy from the vast space of legitimate programs is typically impractical without some prior on the shape of target policies. Pirl allows the specification of such priors through instancespecific syntactic models called sketches.
We define a sketch as a grammar of expressions over atoms and sequences of atoms. The sketch places restrictions on the kinds of basic operators that will be considered during policy search. Formally, the sketch is obtained by restricting the grammar in Figure 1. The set of programs permitted by a sketch $\mathcal{S}$ is denoted by $[[\mathcal{S}]]$.
Pirl.
The Pirl problem can now be stated as follows. Suppose we are given a Pomdp $M$ and a sketch $\mathcal{S}$. Our goal is to find a program ${e}^{*}\in [[\mathcal{S}]]$ with optimal reward:
${e}^{*}$  $=$  $\underset{e\in [[\mathcal{S}]]}{\mathrm{arg}\mathrm{max}}R(e).$  (1) 
Example.
Now we consider a concrete example of Pirl, considered in more detail in Section 5.
Suppose our goal is to make a (simulated) car complete laps on a track. We want to do so by learning policies for tasks like steering and acceleration. Suppose we know that we could get wellbehaved policies by using PID control — specifically, by switching back and forth between a set of PID controllers. However, we do not know the parameters of these controllers, and neither do we know the conditions under which we should switch from one controller to another. We can express this knowledge using the following sketch:
$P$  $::=$  $\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}((\u03f5{h}_{i}),1)$  
$I$  $::=$  $\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}(+,\u03f5{h}_{i})$  
$D$  $::=$  $\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{i},2)\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{i},1)$  
$C$  $::=$  ${c}_{1}*P+{c}_{2}*I+{c}_{3}*D$  
$B$  $::=$  ${c}_{0}+{c}_{1}*\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{1},1)+\mathrm{\dots}$  
$\mathrm{\dots}+{c}_{k}*\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{m},1)>0\mid $  
$\mathrm{\hspace{1em}\hspace{1em}}{B}_{1}\mathrm{\mathbf{o}\mathbf{r}}{B}_{2}\mid {B}_{1}\mathrm{\mathbf{a}\mathbf{n}\mathbf{d}}{B}_{2}$  
$E$  $::=$  $C\mid \mathrm{\mathbf{i}\mathbf{f}}B\mathrm{\mathbf{t}\mathbf{h}\mathbf{e}\mathbf{n}}{E}_{1}\mathrm{\mathbf{e}\mathbf{l}\mathbf{s}\mathbf{e}}{E}_{2}.$ 
Here, $E$ represents programs permitted by the sketch. The program’s input is a history $h$. We assume that this sequence is split into a set of sequences $\{{h}_{1},\mathrm{\dots},{h}_{m}\}$, where ${h}_{i}$ is the sequence of observations from the $i$th of $m$ sensors. The sensor’s most recent reading is given by $\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{k},1)$, and its second most recent reading is $\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{k},2)$. The operators $+$, $$, $*$, $>$, and ifthenelse are as usual. The program (optionally) evaluates a set of boolean conditions ($B$) over the current sensor readings, then chooses among a set of discretized PID controllers, represented by $C$. In the definition of $C$, $P$ is the proportional term, $I$ is the discretized integral term (calculated via a $\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}$), and $D$ is a finitedifference approximation of the derivative term, $\u03f5$ is a known constant and represents a fixed target for the controller, $(\u03f5{h}_{i})$ performs an elementwise operation on the sequence ${h}_{i}$. The symbols ${c}_{i}$ are realvalued parameters. Recall that the $\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}$ acts over a fixedsized window on the history, and hence can be used as a discrete approximation of the integral term in a PID controller.
The program in Figure 2 shows the body of a policy for acceleration that the Ndps algorithm finds given this sketch in the Torcs car racing environment. The program’s input consists of histories for 29 sensors; however, only two of them, $\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}$ and $\mathrm{\U0001d681\U0001d67f\U0001d67c}$, are actually used in the program. While the sensor $\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}$ (for the position of the car relative to the track axis) is used to decide which controller to use, only the $\mathrm{\U0001d681\U0001d67f\U0001d67c}$ sensor is needed to calculate the acceleration.
$$\begin{array}{c}\mathrm{\mathbf{i}\mathbf{f}}(0.001\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}},1)>0)\mathrm{\mathbf{a}\mathbf{n}\mathbf{d}}(0.001+\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}},1)>0)\hfill \\ \mathrm{\mathbf{t}\mathbf{h}\mathbf{e}\mathbf{n}}3.97*\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}((0.44{h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}}),1)+0.01*\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}(+,(0.44{h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}}))+48.79*(\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}},2)\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}},1))\hfill \\ \mathrm{\mathbf{e}\mathbf{l}\mathbf{s}\mathbf{e}}\mathrm{\hspace{0.33em}\hspace{0.17em}3.97}*\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}((0.40{h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}}),1)+0.01*\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}(+,(0.40{h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}}))+48.79*(\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}},2)\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}},1))\hfill \end{array}$$ 
3 Neurally Directed Program Search
Imitating a Neural Policy Oracle.
The Ndps algorithm is a direct policy search that is guided by a neural “oracle”. Searching over policies is a standard approach in reinforcement learning. However, the nonsmoothness of the space of programmatic policies poses a fundamental challenge to the use of such an approach in Pirl. For example, a conceivable way of solving the search problem would be to define a neighborhood relation over programs and perform local search. However, in practice, the objective $R(e)$ of such a search can vary irregularly, leading to poor performance (see Section 5 for experimental results on this).
In contrast, Ndps starts by using Drl to compute a neural policy oracle ${e}_{\mathcal{N}}$ for the given environment. This policy is an approximation of the programmatic policy that we seek to find. To a first approximation, Ndps is a local search over programmatic policies that seeks to find a program ${e}^{*}$ that closely imitates the behavior of ${e}_{\mathcal{N}}$. The main intuition here is that distance from ${e}_{\mathcal{N}}$ is a simpler objective than the reward function $R(e)$, which aggregates rewards over a lengthy time horizon. This approach can be seen to be a form of imitation learning (Schaal, 1999).
The distance between ${e}_{\mathcal{N}}$ and the estimate $e$ of ${e}^{*}$ in a search iteration is defined as $d({e}_{\mathcal{N}},e)={\sum}_{h\in \mathscr{H}}\parallel e(h){e}_{\mathcal{N}}(h)\parallel ,$ where $\mathscr{H}$ is a set of “interesting” inputs (histories) and $\parallel \cdot \parallel $ is a suitable norm. During the iteration, we search the neighborhood of $e$ for a program ${e}^{\prime}$ that minimizes this distance. At the end of the iteration, ${e}^{\prime}$ becomes the new estimate for ${e}^{*}$.
Input Augmentation. One challenge in the algorithm is that under the policy $e$, the agent may encounter histories that are not possible under ${e}_{\mathcal{N}}$, or any of the programs encountered in previous iterations of the search. For example, while searching for a steering controller, we may arrive at a program that, under certain conditions, steers the car into a wall, an illegal behavior that the neural policy does not exhibit. Such histories would be irrelevant to the distance between ${e}_{\mathcal{N}}$ and $e$ if the set $\mathscr{H}$ were constructed ahead of time by simulating ${e}_{\mathcal{N}}$, and never updated. This would be unfortunate as these are precisely the inputs on which the programmatic policy needs guidance.
Our solution to this problem is input augmentation, or periodic updates to the set $\mathscr{H}$. More precisely, after a certain number of search steps for a fixed set $\mathscr{H}$, and after choosing the best available synthesized program for this set, we sample a set of additional histories by simulating the current programmatic policy, and add these samples to $\mathscr{H}$.
3.1 Algorithm Details
{algorithm}[h] \[email protected]@algorithmic \STATEInput: POMDP $M$, neural policy ${e}_{\mathcal{N}}$, sketch $\mathcal{S}$ \STATE$\mathscr{H}\leftarrow \mathrm{\U0001d68c\U0001d69b\U0001d68e\U0001d68a\U0001d69d\U0001d68e}\mathrm{\_}\mathrm{\U0001d691\U0001d692\U0001d69c\U0001d69d\U0001d698\U0001d69b\U0001d692\U0001d68e\U0001d69c}({e}_{\mathcal{N}},M)$ \STATE$e\leftarrow \mathrm{\U0001d692\U0001d697\U0001d692\U0001d69d\U0001d692\U0001d68a\U0001d695\U0001d692\U0001d6a3\U0001d68e}({e}_{\mathcal{N}},\mathscr{H},M,\mathcal{S})$ \STATE$R\leftarrow \mathrm{\U0001d68c\U0001d698\U0001d695\U0001d695\U0001d68e\U0001d68c\U0001d69d}\mathrm{\_}\mathrm{\U0001d69b\U0001d68e\U0001d6a0\U0001d68a\U0001d69b\U0001d68d}(e,M)$ \REPEAT\STATE$({e}^{\prime},{R}^{\prime})\leftarrow (e,R)$ \STATE$\mathscr{H}\leftarrow \mathrm{\U0001d69e\U0001d699\U0001d68d\U0001d68a\U0001d69d\U0001d68e}\mathrm{\_}\mathrm{\U0001d691\U0001d692\U0001d69c\U0001d69d\U0001d698\U0001d69b\U0001d692\U0001d68e\U0001d69c}(e,{e}_{\mathcal{N}},M,\mathscr{H})$ \STATE$\mathcal{E}\leftarrow \mathrm{\U0001d697\U0001d68e\U0001d692\U0001d690\U0001d691\U0001d68b\U0001d698\U0001d69b\U0001d691\U0001d698\U0001d698\U0001d68d}\mathrm{\_}\mathrm{\U0001d699\U0001d698\U0001d698\U0001d695}(e)$ \STATE$e\leftarrow {\mathrm{arg}\mathrm{min}}_{{e}^{\prime}\in \mathcal{E}}{\sum}_{h\in \mathscr{H}}\parallel {e}^{\prime}(h){e}_{\mathcal{N}}(h)\parallel $ \STATE$R\leftarrow \mathrm{\U0001d68c\U0001d698\U0001d695\U0001d695\U0001d68e\U0001d68c\U0001d69d}\mathrm{\_}\mathrm{\U0001d69b\U0001d68e\U0001d6a0\U0001d68a\U0001d69b\U0001d68d}(e,M)$ \UNTIL${R}^{\prime}\ge R$ \STATEOutput: ${e}^{\prime}$
We show pseudocode for Ndps in Algorithm 3.1. The inputs to the algorithm are a Pomdp $M$, a neural policy ${e}_{\mathcal{N}}$ for $M$ that serves as an oracle, and a sketch $\mathcal{S}$. The algorithm first samples a set of histories of ${e}_{\mathcal{N}}$ using the procedure $\mathrm{\U0001d68c\U0001d69b\U0001d68e\U0001d68a\U0001d69d\U0001d68e}\mathrm{\_}\mathrm{\U0001d691\U0001d692\U0001d69c\U0001d69d\U0001d698\U0001d69b\U0001d692\U0001d68e\U0001d69c}$. Next it uses the routine $\mathrm{\U0001d692\U0001d697\U0001d692\U0001d69d\U0001d692\U0001d68a\U0001d695\U0001d692\U0001d6a3\U0001d68e}$ to generate the program that is the starting point of the policy search. Then the procedure $\mathrm{\U0001d68c\U0001d698\U0001d695\U0001d695\U0001d68e\U0001d68c\U0001d69d}\mathrm{\_}\mathrm{\U0001d69b\U0001d68e\U0001d6a0\U0001d68a\U0001d69b\U0001d68d}$ calculates the expected aggregate reward $R(e)$ (described in Section 2), by simulating the program in the Pomdp.
From this point on, Ndps iteratively updates its estimate $e$ of the target program, as well as its estimate $\mathscr{H}$ of the set of interesting inputs used for distance computation. To do the former, Ndps uses the procedure $\mathrm{\U0001d697\U0001d68e\U0001d692\U0001d690\U0001d691\U0001d68b\U0001d698\U0001d69b\U0001d691\U0001d698\U0001d698\U0001d68d}\mathrm{\_}\mathrm{\U0001d699\U0001d698\U0001d698\U0001d695}$ to generate a space of programs that are structurally similar to $e$, then finds the program in this space that minimizes distance from ${e}_{\mathcal{N}}$. The latter task is done by the routine $\mathrm{\U0001d69e\U0001d699\U0001d68d\U0001d68a\U0001d69d\U0001d68e}\mathrm{\_}\mathrm{\U0001d691\U0001d692\U0001d69c\U0001d69d\U0001d698\U0001d69b\U0001d692\U0001d68e\U0001d69c}$, which heuristically picks interesting inputs in the trajectory of the learned program and then obtains the corresponding actions from the oracle for those inputs. This process goes on until the iterative search fails to improve the estimated reward $R$ of $e$.
The subroutines used in the above description can be implemented in many ways. Now we elaborate on our implementation of the important subroutines of Ndps.
The optimization step.
The search for a program ${e}^{\prime}$ at minimal distance from the neural oracle can be implemented in many ways. The approach we use has two steps. First, we enumerate a set of program templates — numerically parameterized programs — that are structurally similar to $e$ and are permitted by the sketch $\mathcal{S}$, giving priority to shorter templates. Next, we find optimal parameters for the enumerated templates. Our primary tool for the second step is Bayesian optimization (Snoek et al., 2012), though we also explored a symbolic optimization technique based on Satisfiability Modulo Theories (SMT) solving (Appendix B).
The initialization step.
The performance of Ndps turns out to be quite sensitive to the choice of the program that is the starting point of the search. Our initialization routine $\mathrm{\U0001d692\U0001d697\U0001d692\U0001d69d\U0001d692\U0001d68a\U0001d695\U0001d692\U0001d6a3\U0001d68e}$ is broadly similar to the optimization step, in that it attempts to find programs that closely imitate the oracle through a combination of template enumeration and parameter optimization. However, rather than settling on a single program, $\mathrm{\U0001d692\U0001d697\U0001d692\U0001d69d\U0001d692\U0001d68a\U0001d695\U0001d692\U0001d6a3\U0001d68e}$ generates a pool of programs that are close in behavior to the oracle. After this, it simulates the programs in the POMDP and returns the program that achieves the highest reward.
4 Environments for Experiments
In this section, we describe the environments (modeled by POMDPs) on which we evaluated the Ndps algorithm.
Torcs.
We use Ndps to generate controllers for cars in The Open Racing Car Simulator (Torcs) (Wymann et al., 2014). Torcs has been used extensively in AI research, for example in (Salem et al., 2017), (Koutník et al., 2013), and (Loiacono et al., 2010) among others. (Lillicrap et al., 2015a) has shown that a Deep Deterministic Policy Gradient (DDPG) network can be used in Rl environments with continuous action spaces. The Drl agents for Torcs in this paper implement this approach.
In its full generality Torcs provides a rich environment with input from up to $89$ sensors, and optionally the 3D graphic from a chosen camera angle in the race. The controllers have to decide the values of 5 parameters during game play, which correspond to the acceleration, brake, clutch, gear and steering of the car. Apart from the immediate challenge of driving the car on the track, controllers also have to make racelevel strategy decisions, like making pitstops for fuel. A lower level of complexity is provided in the Practice Mode setting of TORCS. In this mode all racelevel strategies are removed. Currently, so far as we know, stateoftheart Drl models are capable of racing only in Practice Mode, and this is also the environment that we use. Here we consider the input from $29$ sensors, and decide values for the acceleration and steering actions.
The sketches used in our experiments are as in the example in Section 2, and provide the basic structure of a proportionalintegralderivative (PID) program, with appropriate holes for parameter and observation values. To obtain a practical implementation, we constrain the fold calculation to the five latest observations of the history. This constraint corresponds to the standard strategy of automatic (integral) error reset in discretized PID controllers (Astrom & Hagglund, 1984).
Each track in Torcs can we viewed as a distinct Pomdp. In our implementation of Ndps for Torcs we choose one track and synthesize a program for it. Whenever the algorithm needs to interact with the Pomdp, we use the program or Drl agent to race on the track. For example, in the procedure $\mathrm{\U0001d68c\U0001d698\U0001d695\U0001d695\U0001d68e\U0001d68c\U0001d69d}\mathrm{\_}\mathrm{\U0001d69b\U0001d68e\U0001d6a0\U0001d68a\U0001d69b\U0001d68d}$ we use the synthesized program to race one lap, and the reward is a function of the speed, angle and position of the car at each time step.
For the $\mathrm{\U0001d68c\U0001d69b\U0001d68e\U0001d68a\U0001d69d\U0001d68e}\mathrm{\_}\mathrm{\U0001d691\U0001d692\U0001d69c\U0001d69d\U0001d698\U0001d69b\U0001d692\U0001d68e\U0001d69c}$ procedure we use the Drl agent to complete one lap of the track (an episode), recording the sensor values and environment state at each time step. The $\mathrm{\U0001d69e\U0001d699\U0001d68d\U0001d68a\U0001d69d\U0001d68e}\mathrm{\_}\mathrm{\U0001d691\U0001d692\U0001d69c\U0001d69d\U0001d698\U0001d69b\U0001d692\U0001d68e\U0001d69c}$ procedure uses a two step process. First, the synthesized program is used to race one lap and we store the sequence of observations (given by sensor values) ${o}_{1},{o}_{2},\mathrm{\dots}$ provided by Torcs during this lap. Then, we use the Drl agent to generate the corresponding action ${a}_{i}$ for each observation ${o}_{i}$. Each tuple $({o}_{i},{a}_{i})$ is then added to the set of histories.
Classic Control Games.
In addition to Torcs, we evaluated our approach in three classic control games, Acrobot, CartPole, and MountainCar. These games provide simpler Rl environments, with fewer input sensors than Torcs and only a single discrete action at each time step, compared to two continuous actions in Torcs. These results appear in Appendix A.
5 Experimental Analysis
Now we present an empirical evaluation of the effectiveness of our algorithm in solving the Pirl problem. We synthesize programs for two Torcs tracks, CGSpeedway1 and Aalborg. These tracks provide varying levels of difficulty, with Aalborg being the more difficult track of the two.
5.1 Evaluating Performance
A controller’s performance is measured according to two metrics, lap time and reward. To calculate the lap time, the programs are allowed to complete a three lap race, and we report the average time taken to complete a lap during this race. The reward function is calculated using the car’s velocity, angle with the track axis, and distance from the track axis. The same function is used to train the DRL agent initially. In the experiments we compare the average reward per time step, obtained by the various programs.
We compare among the following Rl agents:

1.
DRL. An agent which uses Drl to find a policy represented as a deep neural network. The specific Drl algorithm we use is Deep Deterministic Policy Gradients (Lillicrap et al., 2015b), which has previously been used on Torcs.

2.
Naive. Program synthesized without access to a policy oracle.

3.
NoAug. Program synthesized without input augmentation.

4.
NoSketch. Program synthesized in our policy language without sketch guidance.

5.
NoIF. Programs permitted by a restriction of our sketch that does not permit conditional branching.

6.
Ndps. The Program generated by the Ndps algorithm.
In Table 1 we present the performance results of the above list. The lap times in that table are given in minutes and seconds. The Timeout entries indicate that the synthesis process did not return a program that could complete the race, within the specified timeout of twelve hours.
These results justify the various choices that we made in our Ndps algorithm architecture, as discussed in Section 3. In many cases those choices were necessary to be able to synthesize a program that could successfully complete a race. As a consequence of these results, we only consider the DRL agent and the Ndps program for subsequent comparisons.
The NoAug and NoSketch agents are unable to generate programs that complete a single lap on either track. In the case of NoSketch this is because the syntax of the policy language (Figure 1), defines a very large program space. If we randomly sample from this space without any constraints (like those provided by the sketch), then the probability of getting a good program is extremely low and hence we are unable to reliably generate a program that can complete a lap. The NoAug agent performs poorly because without input augmentation, the synthesizer has no guidance from the oracle regarding the “correct” behavior once the program deviates even slightly from the oracle’s trajectory.
The Ndps algorithm is biased towards generating simpler programs to aid in interpretability. In the Ndps algorithm experiments we allow the synthesizer to produce policies with up to five nested $\mathrm{\mathbf{i}\mathbf{f}}$ statements. However, if two policies have Lap Times within one second of each other, then the algorithm chooses the one with fewer $\mathrm{\mathbf{i}\mathbf{f}}$ statements as the output. This is a reasonable choice because a difference of less than one second in Lap Times can be the result of different starting positions in the Torcs simulator, and hence the performance of such policies is essentially equivalent.
Model  CGSpeedway1  Aalborg  

Lap Time  Reward  Lap Time  Reward  
Drl  54.27  118.39  1:49.66  71.23 
Naive  2:07.09  58.72  Timeout  $$ 
NoAug  Timeout  $$  Timeout  $$ 
NoSketch  Timeout  $$  Timeout  $$ 
NoIF  1:01.60  115.25  2:45.13  52.81 
Ndps  1:01.56  115.32  2:38.87  54.91 
5.2 Qualitative Analysis of the Programmatic Policy
We provide qualitative analysis of the inferred programmatic policy through the lens of interpretability, and its behavior in acting in the environment.
Interpretability.
Interpretability is a qualitative metric, and cannot be easily demonstrated via experiments. The Drl policies are considered uninterpretable because their policies are encoded in black box neural networks. In contrast, the Pirl policies are compact and humanreadable by construction, as exemplified by the acceleration policy in Figure 2. More examples of our synthesized policies are given in Appendix C.
Behavior of Policy.
Our experimental validation showed that the programmatic policy was less aggressive in terms of its use of actions and resulting in smoother steering actions. Numerically, we measure smoothness in Table 2 by comparing the population standard deviation of the set of steering actions taken by the program during the entire race. In Figure 3 we present a scatter plot of the steering actions taken by the DRL agent and the Ndps program during a slice of the CGSpeedway1 race. As we can see, the Ndps program takes much more conservative actions.
Model  CGSpeedway1  Aalborg 

Drl  0.5981  0.9008 
Ndps  0.1312  0.2483 
5.3 Robustness to Missing/Noisy Features
To evaluate the robustness of the agents with respect to defective sensors we introduce a Partial Observability variant of Torcs. In this variant, a random sample of $j$ sensors are declared defective. During the race, one or more of these defective sensors are blocked with some fixed probability. Hence, during gameplay, the sensor either returns the correct reading or a null reading. For sufficiently high block probabilities, both agents will fail to complete the race. In Table 3 we show the distances raced for two values of the block probability, and in Figure 4 we plot the distance raced as we increase the block probability on the Aalborg track. In both these experiments, the set of defective sensors was taken to be $\{\mathrm{\U0001d681\U0001d67f\U0001d67c},\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}\}$ because we know that the synthesized programs crucially depend on these sensors.
Model  CGSpeedway1  Aalborg  

50%  90%  50%  90%  
Drl  21  17  71  20 
Ndps  1976  200  1477  287 
5.4 Evaluating Generalization to New Instances
To compare the ability of the agents to perform on unseen tracks, we executed the learned policies on tracks of comparable difficulty. For agents trained on the CGSpeedway1 track, we chose CG track 2 and ERoad as the transfer tracks, and for Aalborg trained tracks we chose Alpine 2 and Ruudskogen. As can be seen in Tables 5 and 5, the Ndps programmatically synthesized program far outperforms the DRL agent on unseen tracks. The DRL agent is unable to complete the race on any of these transfer tracks. This demonstrates the transferability of the policies Ndps finds.
Model  CG track 2  ERoad  

Lap Time  Reward  Lap Time  Reward  
DRL  Cr 1608m  $$  Cr 1902m  $$ 
Ndps  1:40.57  110.18  1:51.59  98.21 
Model  Alpine 2  Ruudskogen  

Lap Time  Reward  Lap Time  Reward  
DRL  Cr 1688m  $$  Cr 3232m  $$ 
Ndps  3:16.68  67.49  3:19.77  57.69 
5.5 Verifiability of Policies
Now we use established symbolic verification techniques to automatically prove two properties of policies generated by Ndps. So far as we know, the current state of the art neural network verifiers cannot verify the DRL network we are using in a reasonable amount of time, due to the size and complexity of the network used to implement the DDPG algorithm. For example, the Reluplex (Katz et al., 2017) algorithm was tested on networks at most 300 nodes wide, whereas our network has three layers with 600 nodes each, and other smaller layers.
Smoothness Property
For the program given in Figure 2 we proved, we have $$. Intuitively, the above logical implication means that if the sum of the consecutive differences of the last six $\mathrm{\U0001d681\U0001d67f\U0001d67c}$ sensor values is less than $0.006$, then the acceleration actions calculated at the last and penultimate step will not differ by more than $0.49$. Similarly, for a policy given in Appendix C, we prove $$. This proof gives us a guarantee of the type of smooth steering behavior that we empirically examined earlier in this section.
Universal Bounds
We can prove that the program in Figure 2 satisfies the property $$. Intuitively, this means that we have proved global bounds for the action values in this environment, assuming reasonable bounds on some of the input values. In the Torcs environment these bounds are not very useful, since the simulator clips these actions to certain prespecified ranges. However, this experiment demonstrates that our framework allows us to prove universal bounds on the actions, and this could be a critical property for other environments.
6 Related Work
SyntaxGuided Synthesis.
The original formulation of inductive program synthesis is to search for a program in a hypothesis space (programming language) that is consistent with a specification (such as IO examples). However, this search is often intractable because of the large (potentially infinite) hypothesis space. One of the key ideas to make this search tractable is to provide the synthesizer a sketch of the desired program in addition to the examples, for example in (SolarLezama, 2009) and (Feser et al., 2015). The program sketch in addition to providing structure to the search space also allows users to provide additional insights. This approach has been generalized in a framework called SyntaxGuided Synthesis (Sygus) (Alur et al., 2015). Our Pirl approach is inspired by Sygus in the sense that we also use a highlevel grammar to constrain the shape of the possible learnt policies in a policy language grammar. However, unlike Sygus and previous sketchbased synthesis approaches that use logical constraints as specification, Pirl searches for policies with quantitative objectives.
Imitation Learning.
Imitation learning (Schaal, 1999) has been a successful paradigm for reducing the sample complexity of reinforcement learning algorithms by allowing the agent to leverage the additional supervision provided in terms of expert demonstrations for the desired behaviors. The DAgger (Dataset Aggregation) algorithm (Ross et al., 2011) is an iterative algorithm for imitation learning that learns stationary deterministic policies, where in each iteration $i$ it uses the current learnt policy ${\pi}_{i}$ to collect new trajectories and adds them to the dataset $D$ of all previously found trajectories. The policy for the next iteration ${\pi}_{i+1}$ is a policy that best mimics the expert policy ${\pi}^{*}$ on the whole dataset $D$. Our Neurally Directed Program Search (Ndps) is inspired by the DAgger algorithm, where we use the trained DeepRL agent as the expert (oracle), and iteratively perform IO augmentation for unseen input states explored by our synthesized policy with the current best reward. However, one key difference is that Ndps uses the expert trajectories to only guide the local program search in our policy language grammar to find a policy with highest rewards, unlike the imitation learning setting where the goal is to match the expert demonstrations perfectly.
Neural Program Synthesis and Induction.
Many recent efforts use neural networks for learning programs. These efforts have two flavors. In neural program induction, the goal is to learn a network that encodes the program semantics using internal weights. These architectures typically augment neural networks with differentiable computational substrates such as memory (Neural Turing Machines (Graves et al., 2014)), modules (Neural RAM (Kurach et al., 2015)) or datastructures such as stacks (Joulin & Mikolov, 2015), and formulate the program learning problem in an endtoend differentiable manner. In neural program synthesis, the architectures generate programs directly as outputs using multitask transfer learning (e.g. RobustFill (Devlin et al., 2017), DeepCoder (Balog et al., 2016), Bayou (Murali et al., 2018)), where the network weights are used to guide the program search in a DSL. There have also been some recent approaches to use Rl for learning to search programs in DSLs (Bunel et al., 2018; Abolafia et al., 2018). Our approach falls in the category of program synthesis approaches where we synthesize policies in a policy language. However, we learn richer policy programs with continuous parameters using the Ndps algorithm.
Interpretable Machine Learning.
Many recent efforts in deep learning aim to make deep networks more interpretable (Montavon et al., 2017; Lipton, 2016; Garnelo et al., 2016; Zahavy et al., 2016; Shanahan, 2005; Lake et al., 2016). There are three key approaches explored for interpreting DNNs: i) generate input prototypes in the input domain that are representatives of the learned concept in the abstract domain of the toplevel of a DNN, ii) explaining DNN decisions by relevance propagation and computing corresponding representative concepts in the input domain, and iii) Using symbolic techniques to explain and interpret a DNN. Our work differs from these approaches in that we are replacing the DRL model with human readable source code, that is programmatically synthesized to mimic the policy found by the neural network. Working at this level of abstraction provides a method to apply existing synthesis techniques to the problem of making DRL models interpretable.
Verification of Deep Neural Networks.
Reluplex (Katz et al., 2017) is an SMT solver that supports linear real arithmetic with ReLU constraints, and has been used to verify several properties of DNNbased airborne collision avoidance systems, such as not producing erroneous alerts and uniformity of alert regions. Unlike Reluplex, our framework generates interpretable program source code as output, where we can use traditional symbolic program verification techniques (King, 1976) to prove program properties.
7 Conclusion
We have introduced a framework for interpretable reinforcement learning, called Pirl. Here, policies are represented in a highlevel language. The goal is to find a policy that fits a syntactic “sketch” and also has optimal longterm reward. We have given an algorithm inspired by imitation learning, called Ndps, to achieve this goal. Our results show that the method is able to generate interpretable policies that clear reasonable performance goals, are amenable to symbolic verification, and, assuming a welldesigned sketch, are robust and easily transferred to unseen environments.
The experiments in this paper only considered environments with symbolic inputs. Handling perceptual inputs may raise additional algorithmic challenges, and is a natural next step. Also, in this paper, we only considered deterministic (if memoryful) policies. Extending our framework to stochastic policies is a goal for future work. Finally, while we explored policies in the context of reinforcement learning, one could define similar frameworks for other learning settings.
Acknowledgements
This research was partially supported by NSF Award CCF1162076 and DARPA MUSE Award #FA87501420270.
References
 Abolafia et al. (2018) Abolafia, D. A., Norouzi, M., and Le, Q. V. Neural program synthesis with priority queue training. CoRR, abs/1801.03526, 2018.
 Alur et al. (2015) Alur, R., Bodík, R., Dallal, E., Fisman, D., Garg, P., Juniwal, G., KressGazit, H., Madhusudan, P., Martin, M. M. K., Raghothaman, M., Saha, S., Seshia, S. A., Singh, R., SolarLezama, A., Torlak, E., and Udupa, A. Syntaxguided synthesis. In Dependable Software Systems Engineering, pp. 1–25. 2015. .
 Astrom & Hagglund (1984) Astrom, K. and Hagglund, T. Automatic tuning of simple regulators with specifications on phase and amplitude margins. Automatica, 20(5):645 – 651, 1984. ISSN 00051098. .
 Åström & Hägglund (1995) Åström, K. J. and Hägglund, T. PID controllers: Theory, Design, and Tuning. Instrument society of America, 1995.
 Balog et al. (2016) Balog, M., Gaunt, A. L., Brockschmidt, M., Nowozin, S., and Tarlow, D. Deepcoder: Learning to write programs. CoRR, abs/1611.01989, 2016.
 Barto et al. (1983) Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC13(5):834–846, Sept 1983. ISSN 00189472. .
 Bunel et al. (2018) Bunel, R., Hausknecht, M., Devlin, J., Singh, R., and Kohli, P. Leveraging grammar and reinforcement learning for neural program synthesis. In ICLR, 2018.
 Cadar & Sen (2013) Cadar, C. and Sen, K. Symbolic execution for software testing: three decades later. Communications of the ACM, 56(2):82–90, 2013.
 Devlin et al. (2017) Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A., and Kohli, P. Robustfill: Neural program learning under noisy I/O. CoRR, abs/1703.07469, 2017.
 Feser et al. (2015) Feser, J. K., Chaudhuri, S., and Dillig, I. Synthesizing data structure transformations from inputoutput examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, Portland, OR, USA, June 1517, 2015, pp. 229–239, 2015. .
 Garnelo et al. (2016) Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards deep symbolic reinforcement learning. CoRR, abs/1609.05518, 2016.
 Geramifard et al. (2015) Geramifard, A., Dann, C., Klein, R. H., Dabney, W., and How, J. P. Rlpy: A valuefunctionbased reinforcement learning framework for education and research. Journal of Machine Learning Research, 16:1573–1578, 2015.
 Graves et al. (2014) Graves, A., Wayne, G., and Danihelka, I. Neural turing machines. CoRR, abs/1410.5401, 2014.
 Joulin & Mikolov (2015) Joulin, A. and Mikolov, T. Inferring algorithmic patterns with stackaugmented recurrent nets. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 712, 2015, Montreal, Quebec, Canada, pp. 190–198, 2015.
 Katz et al. (2017) Katz, G., Barrett, C. W., Dill, D. L., Julian, K., and Kochenderfer, M. J. Reluplex: An efficient smt solver for verifying deep neural networks. In Computer Aided Verification, pp. 97–117, 2017.
 King (1976) King, J. C. Symbolic execution and program testing. Communications of the ACM, 19(7):385–394, 1976.
 Koutník et al. (2013) Koutník, J., Cuccu, G., Schmidhuber, J., and Gomez, F. J. Evolving largescale neural networks for visionbased TORCS. In Proceedings of the 8th International Conference on the Foundations of Digital Games, FDG 2013, Chania, Crete, Greece, May 1417, 2013., pp. 206–212, 2013.
 Kurach et al. (2015) Kurach, K., Andrychowicz, M., and Sutskever, I. Neural randomaccess machines. CoRR, abs/1511.06392, 2015.
 Lake et al. (2016) Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. CoRR, abs/1604.00289, 2016.
 Lillicrap et al. (2015a) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015a.
 Lillicrap et al. (2015b) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015b.
 Lipton (2016) Lipton, Z. C. The mythos of model interpretability. CoRR, abs/1606.03490, 2016.
 Loiacono et al. (2010) Loiacono, D., Prete, A., Lanzi, P. L., and Cardamone, L. Learning to overtake in TORCS using simple reinforcement learning. In Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2010, Barcelona, Spain, 1823 July 2010, pp. 1–8, 2010. .
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Montavon et al. (2017) Montavon, G., Samek, W., and Müller, K. Methods for interpreting and understanding deep neural networks. CoRR, abs/1706.07979, 2017.
 Moore (1991) Moore, A. Variable resolution dynamic programming: Efficiently learning action maps in multivariate realvalued statespaces. In Proceedings of the Eighth International Conference on Machine Learning. Morgan Kaufmann, January 1991.
 Murali et al. (2018) Murali, V., Chaudhuri, S., and Jermaine, C. Neural sketch learning for conditional program generation. In ICLR, 2018.
 Ross et al. (2011) Ross, S., Gordon, G. J., and Bagnell, D. A reduction of imitation learning and structured prediction to noregret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 1113, 2011, pp. 627–635, 2011.
 Salem et al. (2017) Salem, M., Mora, A. M., Merelo, J. J., and GarcíaSánchez, P. Driving in TORCS using modular fuzzy controllers. In Applications of Evolutionary Computation  20th European Conference, EvoApplications 2017, Amsterdam, The Netherlands, April 1921, 2017, Proceedings, Part I, pp. 361–376, 2017.
 Schaal (1999) Schaal, S. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.
 Shanahan (2005) Shanahan, M. Perception as abduction: Turning sensor data into meaningful representation. Cognitive Science, 29(1):103–134, 2005. ISSN 15516709. .
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Silver et al. (2017) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and Shogi by selfplay with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
 Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems  Volume 2, NIPS’12, pp. 2951–2959, USA, 2012. Curran Associates Inc.
 SolarLezama (2009) SolarLezama, A. The sketching approach to program synthesis. In Programming Languages and Systems, 7th Asian Symposium, APLAS 2009, Seoul, Korea, December 1416, 2009. Proceedings, pp. 4–13, 2009. .
 Wang et al. (2015) Wang, Z., de Freitas, N., and Lanctot, M. Dueling network architectures for deep reinforcement learning. CoRR, abs/1511.06581, 2015. URL http://arxiv.org/abs/1511.06581.
 Wymann et al. (2014) Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2014.
 Zahavy et al. (2016) Zahavy, T., BenZrihem, N., and Mannor, S. Graying the black box: Understanding dqns. CoRR, abs/1602.02658, 2016.
Appendix A Evaluation on Classic Control Games
In this section, we provide results of additional experimental evaluation on some classic control games. We use the OpenAI Gym environment implementation of these games. A brief description of these games is given below.
We used the DuelDDQN algorithm (Wang et al., 2015) to obtain our neural policy oracle for these games, rather than Ddpg, as an implementation of DuelDDQN already appears on the OpenAI Gym leaderboard.
Acrobot  CartPole  MountainCar  

Solved  $$  195  110 
Drl  63.17  197.53  84.73 
NdpsSMT  84.16  183.15  108.06 
NdpsBOPT  127.21  143.21  143.86 
Minimum  200  8  200 
Acrobot.
This environment consists of a two link, two joint robot. The joint between the links is actuated. At the start of the episode, the links are hanging downwards. At every timestep the agent chooses an action that correspond to applying a force to move the actuated link to the right, to the left, or to not applying a force. The episode is over once the end of the lower link swings above a certain height. The goal is to end the episode in the fewest possible timesteps.
We use the OpenAI Gym ‘Acrobotv1’ environment. This implementation is based on the system presented in (Geramifard et al., 2015). Each observation is a set consisting of readings from six sensors, corresponding to the rotational joint angles and velocities of joints and links. The action space is discrete with three elements, and at each timestep the environment returns the observation and a reward of $1$. An episode is terminated after 200 time steps irrespective of the state of the robot. This is an unsolved environment, which means it does not have a specified reward threshold at which it’s considered solved.
CartPole.
This environment consists of a pole attached by an unactuated joint to a cart that moves along a frictionless track. At the beginning, the pole is balanced vertically on the cart. The episode ends when the pole is more than ${15}^{\circ}$ from vertical, or the cart moves more than 2.4 units from the center. At every timestep the agent chooses to apply a force to move the cart to the right or to the left, and the goal is to prevent an episode from ending for the maximum possible timesteps.
We use the OpenAI Gym ‘CartPolev0’ environment, based on the system presented in (Barto et al., 1983). The sensor values correspond to the cart position, cart velocity, pole angle and pole velocity. The action space is discrete with two elements, and at each timestep the environment returns the observation and a reward of $+1$. An episode is terminated after 200 time steps irrespective of the state of the cart. CartPolev0 defines “solving” as getting an average reward of at least 195.0 over 100 consecutive trials.
MountainCar.
This environment consists of an underpowered car on a onedimensional track. At the beginning, the car is placed between two ‘hills’. The episode ends when the car reaches the top of the hill in front of it. Since the car is underpowered, the agent needs to drive it back and forth to build momentum. At every timestep the agent chooses to apply a force to move the car to the right, to the left, or to not apply a force. The goal is to end the episode in the fewest possible timesteps.
We use the OpenAI Gym ‘MountainCarv0’ environment. This implementation is based on the system presented in (Moore, 1991). The sensors provide the position and velocity of the car. The action space is discrete with three elements, and at each timestep the environment returns the observation and a reward of $1$. An episode is terminated after 200 time steps irrespective of the state of the robot. MountainCarv0 is considered “solved” if the average reward over 100 consecutive trials is not less than 110.0.
$$\begin{array}{c}0.97*\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}((0.0{h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}}),1)+0.05*\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}(+,(0.0{h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}}))+49.98*(\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}},2)\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}},1))\hfill \end{array}$$ 
$$\begin{array}{c}\mathrm{\mathbf{i}\mathbf{f}}(0.0001\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}},1)>0)\mathrm{\mathbf{a}\mathbf{n}\mathbf{d}}(0.0001+\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}},1)>0)\hfill \\ \mathrm{\mathbf{t}\mathbf{h}\mathbf{e}\mathbf{n}}0.95*\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}((0.64{h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}}),1)+5.02*\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}(+,(0.64{h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}}))+43.89*(\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}},2)\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}},1))\hfill \\ \mathrm{\mathbf{e}\mathbf{l}\mathbf{s}\mathbf{e}}\mathrm{\hspace{0.33em}\hspace{0.17em}0.95}*\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}((0.60{h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}}),1)+5.02*\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}(+,(0.60{h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}}))+43.89*(\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}},2)\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d681\U0001d67f\U0001d67c}},1))\hfill \end{array}$$ 
$$\begin{array}{c}0.86*\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}((0.0{h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}}),1)+0.09*\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}(+,(0.0{h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}}))+46.51*(\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}},2)\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}},1))\hfill \end{array}$$ 
Results.
Table 6 shows rewards obtained by optimal policies found using various methods in these environments. The first row gives numbers for the Drl method. The rows NdpsSMT and NdpsBOPT for versions of the Ndps algorithm that respectively use SMTbased optimization and Bayesian optimization to find template parameters (more on this below).
Appendix B Additional Details on Algorithm
Now we elaborate on the optimization techniques we used in the distance computation step ${\mathrm{arg}\mathrm{min}}_{{e}^{\prime}}{\sum}_{h\in \mathscr{H}}\parallel {e}^{\prime}(h){e}_{\mathcal{N}}(h)\parallel $, to find a program similar to a given program $e$, in Algorithm 1.
As mentioned in the main paper, we start by enumerating a list of program templates, or programs with numericalvalued parameters $\theta $. This is done by first replacing the numerical constants in $e$ by parameters, eliding some subexpressions from the resulting parameterized program, and then regenerating the subexpressions using the rules of $\mathcal{S}$ (without instantiating the parameters), giving priority to shorter expressions. The resulting program template ${e}_{\theta}$ follows the sketch $\mathcal{S}$ and is also structurally close to $e$. Now we search for values for parameters $\theta $ that optimally imitate the neural oracle.
Bayesian optimization.
We use Bayesian optimization as our primary tool when searching for such optimal parameter values. This method applies to problems in which actions (program outputs) can be represented as vectors of real numbers. All problems considered in our experiments fall in this category. The distance of individual pairs of outputs of the synthesized program and the policy oracle is then simply the Euclidean distance between them. The sum of these distances is used to define the aggregate cost across all inputs in $\mathscr{H}$. We then use Bayesian optimization to find parameters that minimize this cost.
$$ 
$$\begin{array}{c}\mathrm{\mathbf{i}\mathbf{f}}(\mathrm{\mathbf{f}\mathbf{o}\mathbf{l}\mathbf{d}}(+,{h}_{0})\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}({h}_{3},1))>0\hfill \\ \mathrm{\mathbf{t}\mathbf{h}\mathbf{e}\mathbf{n}}0\hfill \\ \mathrm{\mathbf{e}\mathbf{l}\mathbf{s}\mathbf{e}}1\hfill \end{array}$$ 
$$ 
SMTbased Optimization.
We also use a second parameter search technique based on SMT (Satisfiability Modulo Theories) solving. Here, we generate a constraint that stipulates that for each $h\in \mathscr{H}$, the output ${e}_{\theta}(h)$ must match ${e}_{\mathcal{N}}(h)$ up to a constant error. Here, ${e}_{\mathcal{N}}(h)$ is a constant value obtained by executing ${e}_{\mathcal{N}}$. The output ${e}_{\theta}(h)$ depends on unknown parameters $\theta $; however, constraints over ${e}_{\theta}(h)$ can be represented as constraints over $\theta $ using techniques for symbolic execution of programs (Cadar & Sen, 2013). Because the oracle is only an approximation to the optimal policy in our setting, we do not insist that the generated constraint is satisfied entirely. Instead, we set up a MaxSat problem which assigns a weight to the constraint for each input $h$, and then solve this problem with a MaxSat solver.
Unfortunately, SMTbased optimization does not scale well in environments with continuous actions. Consequently, we exclusively use Bayesian optimization for all Torcs based experiments. SMTbased optimization can be used in the classic control games, however, and Table 6 shows results generated using this technique (in row NdpsSMT).
The results in Table 6 show that for the classic control games, SMTbased optimization gives better results. This is because the small number of legal actions in these games, limited to at most three values $\{0,1,2\}$, are well suited for the SMT setting. The SMT solver is able to efficiently perform parameter optimization, with a small set of histories. Whereas, the limited variability in actions forces the Bayesian optimization method to use a larger set of histories, and makes it harder for the method to avoid getting trapped in local minimas.
Appendix C Policy Examples
In this section we present more examples of the policies found by the Ndps algorithm.
The program in Figure 7 shows the body of a policy for steering, which together with the acceleration policy given in the paper (Figure 2), was found by the Ndps algorithm by training on the Aalborg track. Figures 7 & 7 likewise show the policies for acceleration and steering respectively, when trained on the CGSpeedway1 track. Similarly, Figures 10, 10 & 10 show policies found for Acrobot, CartPole, and MountainCar respectively. Here ${h}_{i}$ is the sequence of observations from the $i$th of $k$ sensors, for example ${h}_{0}$ is the $0$th sensor. The sensor order is determined by the OpenAI simulator.
Appendix D Torcs Video
We provide a video at the following link, which depicts clips of the Drl agent and the Ndps algorithm synthesized program, on the training track and one of the transfer (unseen) tracks, in that order:
https://goo.gl/Z2X5x6
On the training track, we can see that the steering actions taken by the Drl agent are very irregular, especially when compared to the smooth steering actions of the Ndps agent in the following clip. For the transfer track, we show the agents driving on the ERoad track. We can see that the Drl agent crashes before completing a full lap, while the Ndps agent does not crash. We have provided only small clips of the car during a race, to keep the video length and size small, but the behavior is representative of the agent for the entire race.