Programmatically Interpretable Reinforcement Learning

  • 2019-04-10 09:09:46
  • Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, Swarat Chaudhuri
  • 0


We present a reinforcement learning framework, called ProgrammaticallyInterpretable Reinforcement Learning (PIRL), that is designed to generateinterpretable and verifiable agent policies. Unlike the popular DeepReinforcement Learning (DRL) paradigm, which represents policies by neuralnetworks, PIRL represents policies using a high-level, domain-specificprogramming language. Such programmatic policies have the benefits of beingmore easily interpreted than neural networks, and being amenable toverification by symbolic methods. We propose a new method, called NeurallyDirected Program Search (NDPS), for solving the challenging nonsmoothoptimization problem of finding a programmatic policy with maximal reward. NDPSworks by first learning a neural policy network using DRL, and then performinga local search over programmatic policies that seeks to minimize a distancefrom this neural "oracle". We evaluate NDPS on the task of learning to drive asimulated car in the TORCS car-racing environment. We demonstrate that NDPS isable to discover human-readable policies that pass some significant performancebars. We also show that PIRL policies can have smoother trajectories, and canbe more easily transferred to environments not encountered during training,than corresponding policies discovered by DRL.


Quick Read (beta)

Programmatically Interpretable Reinforcement Learning

Abhinav Verma    Vijayaraghavan Murali    Rishabh Singh    Pushmeet Kohli    Swarat Chaudhuri

Appendix: Programmatically Interpretable Reinforcement Learning

Abhinav Verma    Vijayaraghavan Murali    Rishabh Singh    Pushmeet Kohli    Swarat Chaudhuri

We present a reinforcement learning framework, called Programmatically Interpretable Reinforcement Learning (Pirl), that is designed to generate interpretable and verifiable agent policies. Unlike the popular Deep Reinforcement Learning (Drl) paradigm, which represents policies by neural networks, Pirl represents policies using a high-level, domain-specific programming language. Such programmatic policies have the benefits of being more easily interpreted than neural networks, and being amenable to verification by symbolic methods. We propose a new method, called Neurally Directed Program Search (Ndps), for solving the challenging nonsmooth optimization problem of finding a programmatic policy with maximal reward. Ndps works by first learning a neural policy network using Drl, and then performing a local search over programmatic policies that seeks to minimize a distance from this neural “oracle”. We evaluate Ndps on the task of learning to drive a simulated car in the Torcs car-racing environment. We demonstrate that Ndps is able to discover human-readable policies that pass some significant performance bars. We also show that Pirl policies can have smoother trajectories, and can be more easily transferred to environments not encountered during training, than corresponding policies discovered by Drl.

Reinforcement Learning, Program Synthesis

1 Introduction

Deep reinforcement learning (Drl) has had a massive impact on the field of machine learning and has led to remarkable successes in the solution of many challenging tasks (Mnih et al., 2015; Silver et al., 2016, 2017). While neural networks have been shown to be very effective in learning good policies, the expressivity of these models makes them difficult to interpret or to be checked for consistency for some desired properties, and casts a cloud over the use of such representations in safety-critical applications.

Motivated to overcome this problem, we propose a learning framework, called Programmatically Interpretable Reinforcement Learning (Pirl)11 1 Pirl is pronounced Pi-R-L (as in π-Rl), that is based on the idea of learning policies that are represented in a human-readable language. The Pirl framework is parameterized on a high-level programming language for policies. A problem instance in Pirl is similar to a one in traditional Rl, but also includes a (policy) sketch that syntactically defines a set of programmatic policies in this language. The objective is to find a program in this set with maximal long-term reward.

Intuitively, the policy programming language and the sketch characterize what we consider “interpretable”. In addition to interpretability, the syntactic restriction on policies has three key benefits. First, the language can be used to implicitly encode the learner’s inductive bias that will be used for generalization. Second, the language can allow effective pruning of undesired policies to make the search for a good policy more efficient. Finally, it allows us to use symbolic program verification techniques to formally reason about the learned policies and check consistency with correctness properties. At the same time, policies in Pirl can have rich semantics, for example allowing actions to depend on events far back in history.

A key technical challenge in Pirl is that the space of policies permitted in an instance can be vast and nonsmooth, making optimization extremely challenging. To address this, we propose a new algorithm called Neurally Directed Program Synthesis (Ndps). The algorithm first uses Drl to compute a neural policy network that has high performance, but may not be expressible in the policy language. This network is then used to direct a local search over programmatic policies. In each iteration of this search, we maintain a set of “interesting” inputs, and update the program so as to minimize the distance between its outputs and the outputs of the neural policy (an “oracle”) on these inputs. The set of interesting inputs is updated as the search progresses. This strategy, inspired by imitation learning (Ross et al., 2011; Schaal, 1999), allows us to perform direct policy search in a highly nonsmooth policy space.

We evaluate our approach in the task of learning to drive a simulated car in the Torcs car-racing environment (Wymann et al., 2014), as well as three classic control games (we discuss the former in the main paper, and the latter in the Appendix). Experiments demonstrate that Ndps is able to find interpretable policies that, while not as performant as the policies computed by Drl, pass some significant performance bars. Specifically, in Torcs, our policy sketch allows an unbounded set of programs with branches guarded by unknown conditions, each branch representing a Proportional-Integral-Derivative (PID) controller (Åström & Hägglund, 1995) with unknown parameters. The policy we obtain can successfully complete a lap of the race, and the use of the neural oracle is key to doing so. Our results also suggest that a well-designed sketch can serve as a regularizer. Due to constraints imposed by the sketch, the policies for Torcs that Ndps learns lead to smoother trajectories than the corresponding neural policies, and can tolerate greater noise. The policies are also more easily transferred to new domains, in particular race tracks not seen during training. Finally, we show, using several properties, that the programmatic policies that we discover are amenable to verification using off-the-shelf symbolic techniques.

2 Programmatically Interpretable Reinforcement Learning

In this section, we formalize the problem of programmatically interpretable reinforcement learning (Pirl).

We model a reinforcement learning setting as a Partially Observable Markov Decision Process (Pomdp) M=(𝒮,𝒜,𝒪,T(|s,a),Z(|s),r,𝐼𝑛𝑖𝑡,γ). Here, 𝒮 is the set of (environment) states. 𝒜 is the set of actions that the learning agent can perform, and 𝒪 is the set of observations about the current state that the agent can make. An agent action a at the state s causes the environment state to change probabilistically, and the destination state follows the distribution T(|s,a). The probability that the agent makes an observation o at state s is Z(o|s). The reward that the agent receives on performing action a in state s is given by r(s,a). 𝐼𝑛𝑖𝑡 is the initial distribution over environment states. Finally, 0<γ<1 is a real constant that is used to define the agent’s aggregate reward over time.

A history of M is a sequence h=o0,a0,,ak-1,ok, where oi and ai are, respectively, the agent’s observation and action at the i-th time step. Let M be the set of histories in M. A policy is a function π:M𝒜 that maps each history as above to an action ak. For each policy, we can define a set of histories that are possible when the agent follows π. We assume a mechanism to simulate the POMDP and sample histories that are possible under a policy. The policy also induces a distribution over possible rewards Ri that the agent receives at the i-th time step. The agent’s expected aggregate reward under π is given by R(π)=𝐄[i=0γiRi]. The goal in reinforcement learning is to discover a policy π* that maximizes R(π).

A Programming Language for Policies. The distinctive feature of Pirl is that policies here are expressed in a high-level, domain-specific programming language. Such a language can be defined in many ways. However, to facilitate search through the space of programs expressible in the language, it is desirable for the language to express computations as compactly and canonically as possible. Because of this, we propose to express parameterized policies using a functional language based on a small number of side-effect-free combinators. It is known from prior work on program synthesis (Feser et al., 2015) that such languages offer natural advantages in program synthesis.

We collectively refer to observations and actions, as well as auxiliary integers and reals generated during computation, as atoms. Our language considers two kinds of data: atoms and sequences of atoms (including histories). We assume a finite set of basic operators over atoms that is rich enough to capture all common operations on observations and actions.

Figure 1 shows the syntax of this language. The nonterminals E and x represent expressions that evaluate to atoms and histories, respectively. We sketch the semantics of the various language constructs below.

  • c ranges over a universe of numerical constants, and is a basic operator;

  • 𝐩𝐞𝐞𝐤(x,i) returns the observation from the i-th time step from a history x, and 𝐩𝐞𝐞𝐤(x,-1) is used as shorthand for the most recent observation;

  • 𝐟𝐨𝐥𝐝 is a standard higher-order combinators over sequences with the semantics:

  • x,x1,x2 are variables. As usual, unbound variables are assumed to be inputs.

The language comes with a type system that distinguishes between different types of atoms, and ensures that language constructs are used consistently. The type system can catch common errors, such as applying 𝐩𝐞𝐞𝐤(x,k+1) to a history of size k. This type system identifies a set of expressions whose inputs are histories and outputs are actions. These expressions are known as programmatic policies, or simply programs.

The combinator over sequences, 𝐟𝐨𝐥𝐝, will be operating over histories in our programs. Since histories can be of variable lengths, we restrict these combinators to operate on only the last n elements of a sequence, for some fixed n. This restriction provides us the ability to use these combinators in a consistent manner.

E ::= cx(E1,,Ek)𝐩𝐞𝐞𝐤(x,i)
Figure 1: Syntax of the policy language.


Discovering an optimal programmatic policy from the vast space of legitimate programs is typically impractical without some prior on the shape of target policies. Pirl allows the specification of such priors through instance-specific syntactic models called sketches.

We define a sketch as a grammar of expressions over atoms and sequences of atoms. The sketch places restrictions on the kinds of basic operators that will be considered during policy search. Formally, the sketch is obtained by restricting the grammar in Figure 1. The set of programs permitted by a sketch 𝒮 is denoted by [[𝒮]].


The Pirl problem can now be stated as follows. Suppose we are given a Pomdp M and a sketch 𝒮. Our goal is to find a program e*[[𝒮]] with optimal reward:

e* = argmaxe[[𝒮]]R(e). (1)


Now we consider a concrete example of Pirl, considered in more detail in Section 5.

Suppose our goal is to make a (simulated) car complete laps on a track. We want to do so by learning policies for tasks like steering and acceleration. Suppose we know that we could get well-behaved policies by using PID control — specifically, by switching back and forth between a set of PID controllers. However, we do not know the parameters of these controllers, and neither do we know the conditions under which we should switch from one controller to another. We can express this knowledge using the following sketch:

P ::= 𝐩𝐞𝐞𝐤((ϵ-hi),-1)
I ::= 𝐟𝐨𝐥𝐝(+,ϵ-hi)
D ::= 𝐩𝐞𝐞𝐤(hi,-2)-𝐩𝐞𝐞𝐤(hi,-1)
C ::= c1*P+c2*I+c3*D
B ::= c0+c1*𝐩𝐞𝐞𝐤(h1,-1)+
E ::= C𝐢𝐟B𝐭𝐡𝐞𝐧E1𝐞𝐥𝐬𝐞E2.

Here, E represents programs permitted by the sketch. The program’s input is a history h. We assume that this sequence is split into a set of sequences {h1,,hm}, where hi is the sequence of observations from the i-th of m sensors. The sensor’s most recent reading is given by 𝐩𝐞𝐞𝐤(hk,-1), and its second most recent reading is 𝐩𝐞𝐞𝐤(hk,-2). The operators +, -, *, >, and if-then-else are as usual. The program (optionally) evaluates a set of boolean conditions (B) over the current sensor readings, then chooses among a set of discretized PID controllers, represented by C. In the definition of C, P is the proportional term, I is the discretized integral term (calculated via a 𝐟𝐨𝐥𝐝), and D is a finite-difference approximation of the derivative term, ϵ is a known constant and represents a fixed target for the controller, (ϵ-hi) performs an element-wise operation on the sequence hi. The symbols ci are real-valued parameters. Recall that the 𝐟𝐨𝐥𝐝 acts over a fixed-sized window on the history, and hence can be used as a discrete approximation of the integral term in a PID controller.

The program in Figure 2 shows the body of a policy for acceleration that the Ndps algorithm finds given this sketch in the Torcs car racing environment. The program’s input consists of histories for 29 sensors; however, only two of them, 𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜 and 𝚁𝙿𝙼, are actually used in the program. While the sensor 𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜 (for the position of the car relative to the track axis) is used to decide which controller to use, only the 𝚁𝙿𝙼 sensor is needed to calculate the acceleration.

𝐢𝐟(0.001-𝐩𝐞𝐞𝐤(h𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜,-1)>0)𝐚𝐧𝐝(0.001+𝐩𝐞𝐞𝐤(h𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜,-1)>0)𝐭𝐡𝐞𝐧3.97*𝐩𝐞𝐞𝐤((0.44-h𝚁𝙿𝙼),-1)+0.01*𝐟𝐨𝐥𝐝(+,(0.44-h𝚁𝙿𝙼))+48.79*(𝐩𝐞𝐞𝐤(h𝚁𝙿𝙼,-2)-𝐩𝐞𝐞𝐤(h𝚁𝙿𝙼,-1))𝐞𝐥𝐬𝐞  3.97*𝐩𝐞𝐞𝐤((0.40-h𝚁𝙿𝙼),-1)+0.01*𝐟𝐨𝐥𝐝(+,(0.40-h𝚁𝙿𝙼))+48.79*(𝐩𝐞𝐞𝐤(h𝚁𝙿𝙼,-2)-𝐩𝐞𝐞𝐤(h𝚁𝙿𝙼,-1))
Figure 2: A programmatic policy for acceleration, automatically discovered by the Ndps algorithm. h𝚁𝙿𝙼 and h𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜 represent histories for the 𝚁𝙿𝙼 and 𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜 sensors, respectively.

3 Neurally Directed Program Search

Imitating a Neural Policy Oracle.

The Ndps algorithm is a direct policy search that is guided by a neural “oracle”. Searching over policies is a standard approach in reinforcement learning. However, the nonsmoothness of the space of programmatic policies poses a fundamental challenge to the use of such an approach in Pirl. For example, a conceivable way of solving the search problem would be to define a neighborhood relation over programs and perform local search. However, in practice, the objective R(e) of such a search can vary irregularly, leading to poor performance (see Section 5 for experimental results on this).

In contrast, Ndps starts by using Drl to compute a neural policy oracle e𝒩 for the given environment. This policy is an approximation of the programmatic policy that we seek to find. To a first approximation, Ndps is a local search over programmatic policies that seeks to find a program e* that closely imitates the behavior of e𝒩. The main intuition here is that distance from e𝒩 is a simpler objective than the reward function R(e), which aggregates rewards over a lengthy time horizon. This approach can be seen to be a form of imitation learning (Schaal, 1999).

The distance between e𝒩 and the estimate e of e* in a search iteration is defined as d(e𝒩,e)=he(h)-e𝒩(h), where is a set of “interesting” inputs (histories) and is a suitable norm. During the iteration, we search the neighborhood of e for a program e that minimizes this distance. At the end of the iteration, e becomes the new estimate for e*.

Input Augmentation. One challenge in the algorithm is that under the policy e, the agent may encounter histories that are not possible under e𝒩, or any of the programs encountered in previous iterations of the search. For example, while searching for a steering controller, we may arrive at a program that, under certain conditions, steers the car into a wall, an illegal behavior that the neural policy does not exhibit. Such histories would be irrelevant to the distance between e𝒩 and e if the set were constructed ahead of time by simulating e𝒩, and never updated. This would be unfortunate as these are precisely the inputs on which the programmatic policy needs guidance.

Our solution to this problem is input augmentation, or periodic updates to the set . More precisely, after a certain number of search steps for a fixed set , and after choosing the best available synthesized program for this set, we sample a set of additional histories by simulating the current programmatic policy, and add these samples to .

3.1 Algorithm Details


[h] Neurally Directed Program Search \[email protected]@algorithmic \STATEInput: POMDP M, neural policy e𝒩, sketch 𝒮 \STATE𝚌𝚛𝚎𝚊𝚝𝚎_𝚑𝚒𝚜𝚝𝚘𝚛𝚒𝚎𝚜(e𝒩,M) \STATEe𝚒𝚗𝚒𝚝𝚒𝚊𝚕𝚒𝚣𝚎(e𝒩,,M,𝒮) \STATER𝚌𝚘𝚕𝚕𝚎𝚌𝚝_𝚛𝚎𝚠𝚊𝚛𝚍(e,M) \REPEAT\STATE(e,R)(e,R) \STATE𝚞𝚙𝚍𝚊𝚝𝚎_𝚑𝚒𝚜𝚝𝚘𝚛𝚒𝚎𝚜(e,e𝒩,M,) \STATE𝚗𝚎𝚒𝚐𝚑𝚋𝚘𝚛𝚑𝚘𝚘𝚍_𝚙𝚘𝚘𝚕(e) \STATEeargminehe(h)-e𝒩(h) \STATER𝚌𝚘𝚕𝚕𝚎𝚌𝚝_𝚛𝚎𝚠𝚊𝚛𝚍(e,M) \UNTILRR \STATEOutput: e

We show pseudocode for Ndps in Algorithm 3.1. The inputs to the algorithm are a Pomdp M, a neural policy e𝒩 for M that serves as an oracle, and a sketch 𝒮. The algorithm first samples a set of histories of e𝒩 using the procedure 𝚌𝚛𝚎𝚊𝚝𝚎_𝚑𝚒𝚜𝚝𝚘𝚛𝚒𝚎𝚜. Next it uses the routine 𝚒𝚗𝚒𝚝𝚒𝚊𝚕𝚒𝚣𝚎 to generate the program that is the starting point of the policy search. Then the procedure 𝚌𝚘𝚕𝚕𝚎𝚌𝚝_𝚛𝚎𝚠𝚊𝚛𝚍 calculates the expected aggregate reward R(e) (described in Section 2), by simulating the program in the Pomdp.

From this point on, Ndps iteratively updates its estimate e of the target program, as well as its estimate of the set of interesting inputs used for distance computation. To do the former, Ndps uses the procedure 𝚗𝚎𝚒𝚐𝚑𝚋𝚘𝚛𝚑𝚘𝚘𝚍_𝚙𝚘𝚘𝚕 to generate a space of programs that are structurally similar to e, then finds the program in this space that minimizes distance from e𝒩. The latter task is done by the routine 𝚞𝚙𝚍𝚊𝚝𝚎_𝚑𝚒𝚜𝚝𝚘𝚛𝚒𝚎𝚜, which heuristically picks interesting inputs in the trajectory of the learned program and then obtains the corresponding actions from the oracle for those inputs. This process goes on until the iterative search fails to improve the estimated reward R of e.

The subroutines used in the above description can be implemented in many ways. Now we elaborate on our implementation of the important subroutines of Ndps.

The optimization step.

The search for a program e at minimal distance from the neural oracle can be implemented in many ways. The approach we use has two steps. First, we enumerate a set of program templates — numerically parameterized programs — that are structurally similar to e and are permitted by the sketch 𝒮, giving priority to shorter templates. Next, we find optimal parameters for the enumerated templates. Our primary tool for the second step is Bayesian optimization (Snoek et al., 2012), though we also explored a symbolic optimization technique based on Satisfiability Modulo Theories (SMT) solving (Appendix B).

The initialization step.

The performance of Ndps turns out to be quite sensitive to the choice of the program that is the starting point of the search. Our initialization routine 𝚒𝚗𝚒𝚝𝚒𝚊𝚕𝚒𝚣𝚎 is broadly similar to the optimization step, in that it attempts to find programs that closely imitate the oracle through a combination of template enumeration and parameter optimization. However, rather than settling on a single program, 𝚒𝚗𝚒𝚝𝚒𝚊𝚕𝚒𝚣𝚎 generates a pool of programs that are close in behavior to the oracle. After this, it simulates the programs in the POMDP and returns the program that achieves the highest reward.

4 Environments for Experiments

In this section, we describe the environments (modeled by POMDPs) on which we evaluated the Ndps algorithm.


We use Ndps to generate controllers for cars in The Open Racing Car Simulator (Torcs) (Wymann et al., 2014). Torcs has been used extensively in AI research, for example in (Salem et al., 2017), (Koutník et al., 2013), and (Loiacono et al., 2010) among others. (Lillicrap et al., 2015a) has shown that a Deep Deterministic Policy Gradient (DDPG) network can be used in Rl environments with continuous action spaces. The Drl agents for Torcs in this paper implement this approach.

In its full generality Torcs provides a rich environment with input from up to 89 sensors, and optionally the 3D graphic from a chosen camera angle in the race. The controllers have to decide the values of 5 parameters during game play, which correspond to the acceleration, brake, clutch, gear and steering of the car. Apart from the immediate challenge of driving the car on the track, controllers also have to make race-level strategy decisions, like making pit-stops for fuel. A lower level of complexity is provided in the Practice Mode setting of TORCS. In this mode all race-level strategies are removed. Currently, so far as we know, state-of-the-art Drl models are capable of racing only in Practice Mode, and this is also the environment that we use. Here we consider the input from 29 sensors, and decide values for the acceleration and steering actions.

The sketches used in our experiments are as in the example in Section 2, and provide the basic structure of a proportional-integral-derivative (PID) program, with appropriate holes for parameter and observation values. To obtain a practical implementation, we constrain the fold calculation to the five latest observations of the history. This constraint corresponds to the standard strategy of automatic (integral) error reset in discretized PID controllers (Astrom & Hagglund, 1984).

Each track in Torcs can we viewed as a distinct Pomdp. In our implementation of Ndps for Torcs we choose one track and synthesize a program for it. Whenever the algorithm needs to interact with the Pomdp, we use the program or Drl agent to race on the track. For example, in the procedure 𝚌𝚘𝚕𝚕𝚎𝚌𝚝_𝚛𝚎𝚠𝚊𝚛𝚍 we use the synthesized program to race one lap, and the reward is a function of the speed, angle and position of the car at each time step.

For the 𝚌𝚛𝚎𝚊𝚝𝚎_𝚑𝚒𝚜𝚝𝚘𝚛𝚒𝚎𝚜 procedure we use the Drl agent to complete one lap of the track (an episode), recording the sensor values and environment state at each time step. The 𝚞𝚙𝚍𝚊𝚝𝚎_𝚑𝚒𝚜𝚝𝚘𝚛𝚒𝚎𝚜 procedure uses a two step process. First, the synthesized program is used to race one lap and we store the sequence of observations (given by sensor values) o1,o2, provided by Torcs during this lap. Then, we use the Drl agent to generate the corresponding action ai for each observation oi. Each tuple (oi,ai) is then added to the set of histories.

Classic Control Games.

In addition to Torcs, we evaluated our approach in three classic control games, Acrobot, CartPole, and MountainCar. These games provide simpler Rl environments, with fewer input sensors than Torcs and only a single discrete action at each time step, compared to two continuous actions in Torcs. These results appear in Appendix A.

5 Experimental Analysis

Now we present an empirical evaluation of the effectiveness of our algorithm in solving the Pirl problem. We synthesize programs for two Torcs tracks, CG-Speedway-1 and Aalborg. These tracks provide varying levels of difficulty, with Aalborg being the more difficult track of the two.

5.1 Evaluating Performance

A controller’s performance is measured according to two metrics, lap time and reward. To calculate the lap time, the programs are allowed to complete a three lap race, and we report the average time taken to complete a lap during this race. The reward function is calculated using the car’s velocity, angle with the track axis, and distance from the track axis. The same function is used to train the DRL agent initially. In the experiments we compare the average reward per time step, obtained by the various programs.

We compare among the following Rl agents:

  1. 1.

    DRL. An agent which uses Drl to find a policy represented as a deep neural network. The specific Drl algorithm we use is Deep Deterministic Policy Gradients (Lillicrap et al., 2015b), which has previously been used on Torcs.

  2. 2.

    Naive. Program synthesized without access to a policy oracle.

  3. 3.

    NoAug. Program synthesized without input augmentation.

  4. 4.

    NoSketch. Program synthesized in our policy language without sketch guidance.

  5. 5.

    NoIF. Programs permitted by a restriction of our sketch that does not permit conditional branching.

  6. 6.

    Ndps. The Program generated by the Ndps algorithm.

In Table 1 we present the performance results of the above list. The lap times in that table are given in minutes and seconds. The Timeout entries indicate that the synthesis process did not return a program that could complete the race, within the specified timeout of twelve hours.

These results justify the various choices that we made in our Ndps algorithm architecture, as discussed in Section 3. In many cases those choices were necessary to be able to synthesize a program that could successfully complete a race. As a consequence of these results, we only consider the DRL agent and the Ndps program for subsequent comparisons.

The NoAug and NoSketch agents are unable to generate programs that complete a single lap on either track. In the case of NoSketch this is because the syntax of the policy language (Figure 1), defines a very large program space. If we randomly sample from this space without any constraints (like those provided by the sketch), then the probability of getting a good program is extremely low and hence we are unable to reliably generate a program that can complete a lap. The NoAug agent performs poorly because without input augmentation, the synthesizer has no guidance from the oracle regarding the “correct” behavior once the program deviates even slightly from the oracle’s trajectory.

The Ndps algorithm is biased towards generating simpler programs to aid in interpretability. In the Ndps algorithm experiments we allow the synthesizer to produce policies with up to five nested 𝐢𝐟 statements. However, if two policies have Lap Times within one second of each other, then the algorithm chooses the one with fewer 𝐢𝐟 statements as the output. This is a reasonable choice because a difference of less than one second in Lap Times can be the result of different starting positions in the Torcs simulator, and hence the performance of such policies is essentially equivalent.

Table 1: Performance results in Torcs. Lap time is given in Minutes:Seconds. Timeout indicates that the synthesizer did not return a program that completed the race within the specified timeout.
Model CG-Speedway-1 Aalborg
Lap Time Reward Lap Time Reward
Drl 54.27 118.39 1:49.66 71.23
Naive 2:07.09 58.72 Timeout -
NoAug Timeout - Timeout -
NoSketch Timeout - Timeout -
NoIF 1:01.60 115.25 2:45.13 52.81
Ndps 1:01.56 115.32 2:38.87 54.91

5.2 Qualitative Analysis of the Programmatic Policy

We provide qualitative analysis of the inferred programmatic policy through the lens of interpretability, and its behavior in acting in the environment.


Interpretability is a qualitative metric, and cannot be easily demonstrated via experiments. The Drl policies are considered uninterpretable because their policies are encoded in black box neural networks. In contrast, the Pirl policies are compact and human-readable by construction, as exemplified by the acceleration policy in Figure 2. More examples of our synthesized policies are given in Appendix C.

Behavior of Policy.

Our experimental validation showed that the programmatic policy was less aggressive in terms of its use of actions and resulting in smoother steering actions. Numerically, we measure smoothness in Table 2 by comparing the population standard deviation of the set of steering actions taken by the program during the entire race. In Figure 3 we present a scatter plot of the steering actions taken by the DRL agent and the Ndps program during a slice of the CG-Speedway-1 race. As we can see, the Ndps program takes much more conservative actions.

Figure 3: Slice of steering actions taken by the DRL and Ndps agents, during the CG-Speedway-1 race. This figure demonstrates that the Ndps agent drives more smoothly.
Table 2: Smoothness measure of agents in Torcs, given by the standard deviation of the steering actions during a complete race. Lower values indicate smoother steering.
Model CG-Speedway-1 Aalborg
Drl 0.5981 0.9008
Ndps 0.1312 0.2483

5.3 Robustness to Missing/Noisy Features

To evaluate the robustness of the agents with respect to defective sensors we introduce a Partial Observability variant of Torcs. In this variant, a random sample of j sensors are declared defective. During the race, one or more of these defective sensors are blocked with some fixed probability. Hence, during game-play, the sensor either returns the correct reading or a null reading. For sufficiently high block probabilities, both agents will fail to complete the race. In Table 3 we show the distances raced for two values of the block probability, and in Figure 4 we plot the distance raced as we increase the block probability on the Aalborg track. In both these experiments, the set of defective sensors was taken to be {𝚁𝙿𝙼,𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜} because we know that the synthesized programs crucially depend on these sensors.

Table 3: Partial observability results in Torcs blocking sensors {𝚁𝙿𝙼,𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜} . For each track and block probability we give the distance, in meters, raced by the program before crashing.
Model CG-Speedway-1 Aalborg
50% 90% 50% 90%
Drl 21 17 71 20
Ndps 1976 200 1477 287

Figure 4: Distance raced by the agents as the block probability increases for a particular sensor(s) on Aalborg. The Ndps agent is more robust to blocked sensors.

5.4 Evaluating Generalization to New Instances

To compare the ability of the agents to perform on unseen tracks, we executed the learned policies on tracks of comparable difficulty. For agents trained on the CG-Speedway-1 track, we chose CG track 2 and E-Road as the transfer tracks, and for Aalborg trained tracks we chose Alpine 2 and Ruudskogen. As can be seen in Tables 5 and 5, the Ndps programmatically synthesized program far outperforms the DRL agent on unseen tracks. The DRL agent is unable to complete the race on any of these transfer tracks. This demonstrates the transferability of the policies Ndps finds.

Table 4: Transfer results with training on CG-Speedway-1. ‘Cr’ indicates that the agent crashed after racing the specified distance.
Model CG track 2 E-Road
Lap Time Reward Lap Time Reward
DRL Cr 1608m - Cr 1902m -
Ndps 1:40.57 110.18 1:51.59 98.21
Table 5: Transfer results with training on Aalborg. ‘Cr’ denotes the agent crashed, after racing the specified distance.
Model Alpine 2 Ruudskogen
Lap Time Reward Lap Time Reward
DRL Cr 1688m - Cr 3232m -
Ndps 3:16.68 67.49 3:19.77 57.69

5.5 Verifiability of Policies

Now we use established symbolic verification techniques to automatically prove two properties of policies generated by Ndps. So far as we know, the current state of the art neural network verifiers cannot verify the DRL network we are using in a reasonable amount of time, due to the size and complexity of the network used to implement the DDPG algorithm. For example, the Reluplex (Katz et al., 2017) algorithm was tested on networks at most 300 nodes wide, whereas our network has three layers with 600 nodes each, and other smaller layers.

Smoothness Property

For the program given in Figure 2 we proved, we have k,i=kk+5𝐩𝐞𝐞𝐤(h𝚁𝙿𝙼,i+1)-𝐩𝐞𝐞𝐤(h𝚁𝙿𝙼,i)<0.006𝐩𝐞𝐞𝐤(h𝙰𝚌𝚌𝚎𝚕,k+1)-𝐩𝐞𝐞𝐤(h𝙰𝚌𝚌𝚎𝚕,k)<0.49. Intuitively, the above logical implication means that if the sum of the consecutive differences of the last six 𝚁𝙿𝙼 sensor values is less than 0.006, then the acceleration actions calculated at the last and penultimate step will not differ by more than 0.49. Similarly, for a policy given in Appendix C, we prove k,i=kk+5𝐩𝐞𝐞𝐤(h𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜,i+1)-𝐩𝐞𝐞𝐤(h𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜,i)<0.006𝐩𝐞𝐞𝐤(h𝚂𝚝𝚎𝚎𝚛,k+1)-𝐩𝐞𝐞𝐤(h𝚂𝚝𝚎𝚎𝚛,k)<0.11. This proof gives us a guarantee of the type of smooth steering behavior that we empirically examined earlier in this section.

Universal Bounds

We can prove that the program in Figure 2 satisfies the property i(0𝐩𝐞𝐞𝐤(h𝚁𝙿𝙼,i)1-1𝐩𝐞𝐞𝐤(h𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜,i)1)(𝐩𝐞𝐞𝐤(h𝚂𝚝𝚎𝚎𝚛,i)<101.08-54.53<𝐩𝐞𝐞𝐤(h𝙰𝚌𝚌𝚎𝚕,i)<53.03). Intuitively, this means that we have proved global bounds for the action values in this environment, assuming reasonable bounds on some of the input values. In the Torcs environment these bounds are not very useful, since the simulator clips these actions to certain pre-specified ranges. However, this experiment demonstrates that our framework allows us to prove universal bounds on the actions, and this could be a critical property for other environments.

6 Related Work

Syntax-Guided Synthesis.

The original formulation of inductive program synthesis is to search for a program in a hypothesis space (programming language) that is consistent with a specification (such as IO examples). However, this search is often intractable because of the large (potentially infinite) hypothesis space. One of the key ideas to make this search tractable is to provide the synthesizer a sketch of the desired program in addition to the examples, for example in (Solar-Lezama, 2009) and (Feser et al., 2015). The program sketch in addition to providing structure to the search space also allows users to provide additional insights. This approach has been generalized in a framework called Syntax-Guided Synthesis (Sygus(Alur et al., 2015). Our Pirl approach is inspired by Sygus in the sense that we also use a high-level grammar to constrain the shape of the possible learnt policies in a policy language grammar. However, unlike Sygus and previous sketch-based synthesis approaches that use logical constraints as specification, Pirl searches for policies with quantitative objectives.

Imitation Learning.

Imitation learning (Schaal, 1999) has been a successful paradigm for reducing the sample complexity of reinforcement learning algorithms by allowing the agent to leverage the additional supervision provided in terms of expert demonstrations for the desired behaviors. The DAgger (Dataset Aggregation) algorithm (Ross et al., 2011) is an iterative algorithm for imitation learning that learns stationary deterministic policies, where in each iteration i it uses the current learnt policy πi to collect new trajectories and adds them to the dataset D of all previously found trajectories. The policy for the next iteration πi+1 is a policy that best mimics the expert policy π* on the whole dataset D. Our Neurally Directed Program Search (Ndps) is inspired by the DAgger algorithm, where we use the trained DeepRL agent as the expert (oracle), and iteratively perform IO augmentation for unseen input states explored by our synthesized policy with the current best reward. However, one key difference is that Ndps uses the expert trajectories to only guide the local program search in our policy language grammar to find a policy with highest rewards, unlike the imitation learning setting where the goal is to match the expert demonstrations perfectly.

Neural Program Synthesis and Induction.

Many recent efforts use neural networks for learning programs. These efforts have two flavors. In neural program induction, the goal is to learn a network that encodes the program semantics using internal weights. These architectures typically augment neural networks with differentiable computational substrates such as memory (Neural Turing Machines (Graves et al., 2014)), modules (Neural RAM (Kurach et al., 2015)) or data-structures such as stacks (Joulin & Mikolov, 2015), and formulate the program learning problem in an end-to-end differentiable manner. In neural program synthesis, the architectures generate programs directly as outputs using multi-task transfer learning (e.g. RobustFill (Devlin et al., 2017), DeepCoder (Balog et al., 2016), Bayou (Murali et al., 2018)), where the network weights are used to guide the program search in a DSL. There have also been some recent approaches to use Rl for learning to search programs in DSLs (Bunel et al., 2018; Abolafia et al., 2018). Our approach falls in the category of program synthesis approaches where we synthesize policies in a policy language. However, we learn richer policy programs with continuous parameters using the Ndps algorithm.

Interpretable Machine Learning.

Many recent efforts in deep learning aim to make deep networks more interpretable (Montavon et al., 2017; Lipton, 2016; Garnelo et al., 2016; Zahavy et al., 2016; Shanahan, 2005; Lake et al., 2016). There are three key approaches explored for interpreting DNNs: i) generate input prototypes in the input domain that are representatives of the learned concept in the abstract domain of the top-level of a DNN, ii) explaining DNN decisions by relevance propagation and computing corresponding representative concepts in the input domain, and iii) Using symbolic techniques to explain and interpret a DNN. Our work differs from these approaches in that we are replacing the DRL model with human readable source code, that is programmatically synthesized to mimic the policy found by the neural network. Working at this level of abstraction provides a method to apply existing synthesis techniques to the problem of making DRL models interpretable.

Verification of Deep Neural Networks.

Reluplex (Katz et al., 2017) is an SMT solver that supports linear real arithmetic with ReLU constraints, and has been used to verify several properties of DNN-based airborne collision avoidance systems, such as not producing erroneous alerts and uniformity of alert regions. Unlike Reluplex, our framework generates interpretable program source code as output, where we can use traditional symbolic program verification techniques (King, 1976) to prove program properties.

7 Conclusion

We have introduced a framework for interpretable reinforcement learning, called Pirl. Here, policies are represented in a high-level language. The goal is to find a policy that fits a syntactic “sketch” and also has optimal long-term reward. We have given an algorithm inspired by imitation learning, called Ndps, to achieve this goal. Our results show that the method is able to generate interpretable policies that clear reasonable performance goals, are amenable to symbolic verification, and, assuming a well-designed sketch, are robust and easily transferred to unseen environments.

The experiments in this paper only considered environments with symbolic inputs. Handling perceptual inputs may raise additional algorithmic challenges, and is a natural next step. Also, in this paper, we only considered deterministic (if memoryful) policies. Extending our framework to stochastic policies is a goal for future work. Finally, while we explored policies in the context of reinforcement learning, one could define similar frameworks for other learning settings.


This research was partially supported by NSF Award CCF-1162076 and DARPA MUSE Award #FA8750-14-2-0270.


  • Abolafia et al. (2018) Abolafia, D. A., Norouzi, M., and Le, Q. V. Neural program synthesis with priority queue training. CoRR, abs/1801.03526, 2018.
  • Alur et al. (2015) Alur, R., Bodík, R., Dallal, E., Fisman, D., Garg, P., Juniwal, G., Kress-Gazit, H., Madhusudan, P., Martin, M. M. K., Raghothaman, M., Saha, S., Seshia, S. A., Singh, R., Solar-Lezama, A., Torlak, E., and Udupa, A. Syntax-guided synthesis. In Dependable Software Systems Engineering, pp. 1–25. 2015. .
  • Astrom & Hagglund (1984) Astrom, K. and Hagglund, T. Automatic tuning of simple regulators with specifications on phase and amplitude margins. Automatica, 20(5):645 – 651, 1984. ISSN 0005-1098. .
  • Åström & Hägglund (1995) Åström, K. J. and Hägglund, T. PID controllers: Theory, Design, and Tuning. Instrument society of America, 1995.
  • Balog et al. (2016) Balog, M., Gaunt, A. L., Brockschmidt, M., Nowozin, S., and Tarlow, D. Deepcoder: Learning to write programs. CoRR, abs/1611.01989, 2016.
  • Barto et al. (1983) Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, Sept 1983. ISSN 0018-9472. .
  • Bunel et al. (2018) Bunel, R., Hausknecht, M., Devlin, J., Singh, R., and Kohli, P. Leveraging grammar and reinforcement learning for neural program synthesis. In ICLR, 2018.
  • Cadar & Sen (2013) Cadar, C. and Sen, K. Symbolic execution for software testing: three decades later. Communications of the ACM, 56(2):82–90, 2013.
  • Devlin et al. (2017) Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A., and Kohli, P. Robustfill: Neural program learning under noisy I/O. CoRR, abs/1703.07469, 2017.
  • Feser et al. (2015) Feser, J. K., Chaudhuri, S., and Dillig, I. Synthesizing data structure transformations from input-output examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, Portland, OR, USA, June 15-17, 2015, pp. 229–239, 2015. .
  • Garnelo et al. (2016) Garnelo, M., Arulkumaran, K., and Shanahan, M. Towards deep symbolic reinforcement learning. CoRR, abs/1609.05518, 2016.
  • Geramifard et al. (2015) Geramifard, A., Dann, C., Klein, R. H., Dabney, W., and How, J. P. Rlpy: A value-function-based reinforcement learning framework for education and research. Journal of Machine Learning Research, 16:1573–1578, 2015.
  • Graves et al. (2014) Graves, A., Wayne, G., and Danihelka, I. Neural turing machines. CoRR, abs/1410.5401, 2014.
  • Joulin & Mikolov (2015) Joulin, A. and Mikolov, T. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 190–198, 2015.
  • Katz et al. (2017) Katz, G., Barrett, C. W., Dill, D. L., Julian, K., and Kochenderfer, M. J. Reluplex: An efficient smt solver for verifying deep neural networks. In Computer Aided Verification, pp. 97–117, 2017.
  • King (1976) King, J. C. Symbolic execution and program testing. Communications of the ACM, 19(7):385–394, 1976.
  • Koutník et al. (2013) Koutník, J., Cuccu, G., Schmidhuber, J., and Gomez, F. J. Evolving large-scale neural networks for vision-based TORCS. In Proceedings of the 8th International Conference on the Foundations of Digital Games, FDG 2013, Chania, Crete, Greece, May 14-17, 2013., pp. 206–212, 2013.
  • Kurach et al. (2015) Kurach, K., Andrychowicz, M., and Sutskever, I. Neural random-access machines. CoRR, abs/1511.06392, 2015.
  • Lake et al. (2016) Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. CoRR, abs/1604.00289, 2016.
  • Lillicrap et al. (2015a) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015a.
  • Lillicrap et al. (2015b) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015b.
  • Lipton (2016) Lipton, Z. C. The mythos of model interpretability. CoRR, abs/1606.03490, 2016.
  • Loiacono et al. (2010) Loiacono, D., Prete, A., Lanzi, P. L., and Cardamone, L. Learning to overtake in TORCS using simple reinforcement learning. In Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2010, Barcelona, Spain, 18-23 July 2010, pp. 1–8, 2010. .
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Montavon et al. (2017) Montavon, G., Samek, W., and Müller, K. Methods for interpreting and understanding deep neural networks. CoRR, abs/1706.07979, 2017.
  • Moore (1991) Moore, A. Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces. In Proceedings of the Eighth International Conference on Machine Learning. Morgan Kaufmann, January 1991.
  • Murali et al. (2018) Murali, V., Chaudhuri, S., and Jermaine, C. Neural sketch learning for conditional program generation. In ICLR, 2018.
  • Ross et al. (2011) Ross, S., Gordon, G. J., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pp. 627–635, 2011.
  • Salem et al. (2017) Salem, M., Mora, A. M., Merelo, J. J., and García-Sánchez, P. Driving in TORCS using modular fuzzy controllers. In Applications of Evolutionary Computation - 20th European Conference, EvoApplications 2017, Amsterdam, The Netherlands, April 19-21, 2017, Proceedings, Part I, pp. 361–376, 2017.
  • Schaal (1999) Schaal, S. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233–242, 1999.
  • Shanahan (2005) Shanahan, M. Perception as abduction: Turning sensor data into meaningful representation. Cognitive Science, 29(1):103–134, 2005. ISSN 1551-6709. .
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Silver et al. (2017) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and Shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
  • Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, NIPS’12, pp. 2951–2959, USA, 2012. Curran Associates Inc.
  • Solar-Lezama (2009) Solar-Lezama, A. The sketching approach to program synthesis. In Programming Languages and Systems, 7th Asian Symposium, APLAS 2009, Seoul, Korea, December 14-16, 2009. Proceedings, pp. 4–13, 2009. .
  • Wang et al. (2015) Wang, Z., de Freitas, N., and Lanctot, M. Dueling network architectures for deep reinforcement learning. CoRR, abs/1511.06581, 2015. URL
  • Wymann et al. (2014) Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. TORCS, The Open Racing Car Simulator., 2014.
  • Zahavy et al. (2016) Zahavy, T., Ben-Zrihem, N., and Mannor, S. Graying the black box: Understanding dqns. CoRR, abs/1602.02658, 2016.

Appendix A Evaluation on Classic Control Games

In this section, we provide results of additional experimental evaluation on some classic control games. We use the OpenAI Gym environment implementation of these games. A brief description of these games is given below.

We used the Duel-DDQN algorithm (Wang et al., 2015) to obtain our neural policy oracle for these games, rather than Ddpg, as an implementation of Duel-DDQN already appears on the OpenAI Gym leader-board.

Table 6: Rewards achieved in Classic Control Games. Acrobot does not have threshold at which it is considered solved.
Acrobot CartPole MountainCar
Solved - 195 -110
Drl -63.17 197.53 -84.73
Ndps-SMT -84.16 183.15 -108.06
Ndps-BOPT -127.21 143.21 -143.86
Minimum -200 8 -200


This environment consists of a two link, two joint robot. The joint between the links is actuated. At the start of the episode, the links are hanging downwards. At every timestep the agent chooses an action that correspond to applying a force to move the actuated link to the right, to the left, or to not applying a force. The episode is over once the end of the lower link swings above a certain height. The goal is to end the episode in the fewest possible timesteps.

We use the OpenAI Gym ‘Acrobot-v1’ environment. This implementation is based on the system presented in (Geramifard et al., 2015). Each observation is a set consisting of readings from six sensors, corresponding to the rotational joint angles and velocities of joints and links. The action space is discrete with three elements, and at each timestep the environment returns the observation and a reward of -1. An episode is terminated after 200 time steps irrespective of the state of the robot. This is an unsolved environment, which means it does not have a specified reward threshold at which it’s considered solved.


This environment consists of a pole attached by an un-actuated joint to a cart that moves along a frictionless track. At the beginning, the pole is balanced vertically on the cart. The episode ends when the pole is more than 15 from vertical, or the cart moves more than 2.4 units from the center. At every timestep the agent chooses to apply a force to move the cart to the right or to the left, and the goal is to prevent an episode from ending for the maximum possible timesteps.

We use the OpenAI Gym ‘CartPole-v0’ environment, based on the system presented in (Barto et al., 1983). The sensor values correspond to the cart position, cart velocity, pole angle and pole velocity. The action space is discrete with two elements, and at each timestep the environment returns the observation and a reward of +1. An episode is terminated after 200 time steps irrespective of the state of the cart. CartPole-v0 defines “solving” as getting an average reward of at least 195.0 over 100 consecutive trials.


This environment consists of an underpowered car on a one-dimensional track. At the beginning, the car is placed between two ‘hills’. The episode ends when the car reaches the top of the hill in front of it. Since the car is underpowered, the agent needs to drive it back and forth to build momentum. At every timestep the agent chooses to apply a force to move the car to the right, to the left, or to not apply a force. The goal is to end the episode in the fewest possible timesteps.

We use the OpenAI Gym ‘MountainCar-v0’ environment. This implementation is based on the system presented in (Moore, 1991). The sensors provide the position and velocity of the car. The action space is discrete with three elements, and at each timestep the environment returns the observation and a reward of -1. An episode is terminated after 200 time steps irrespective of the state of the robot. MountainCar-v0 is considered “solved” if the average reward over 100 consecutive trials is not less than -110.0.

Figure 5: A programmatic policy for steering, automatically discovered by the Ndps algorithm with training on Aalborg.
𝐢𝐟(0.0001-𝐩𝐞𝐞𝐤(h𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜,-1)>0)𝐚𝐧𝐝(0.0001+𝐩𝐞𝐞𝐤(h𝚃𝚛𝚊𝚌𝚔𝙿𝚘𝚜,-1)>0)𝐭𝐡𝐞𝐧0.95*𝐩𝐞𝐞𝐤((0.64-h𝚁𝙿𝙼),-1)+5.02*𝐟𝐨𝐥𝐝(+,(0.64-h𝚁𝙿𝙼))+43.89*(𝐩𝐞𝐞𝐤(h𝚁𝙿𝙼,-2)-𝐩𝐞𝐞𝐤(h𝚁𝙿𝙼,-1))𝐞𝐥𝐬𝐞  0.95*𝐩𝐞𝐞𝐤((0.60-h𝚁𝙿𝙼),-1)+5.02*𝐟𝐨𝐥𝐝(+,(0.60-h𝚁𝙿𝙼))+43.89*(𝐩𝐞𝐞𝐤(h𝚁𝙿𝙼,-2)-𝐩𝐞𝐞𝐤(h𝚁𝙿𝙼,-1))
Figure 6: A programmatic policy for acceleration, automatically discovered by the Ndps algorithm with training on CG-Speedway-1.
Figure 7: A programmatic policy for steering, automatically discovered by the Ndps algorithm with training on CG-Speedway-1.


Table 6 shows rewards obtained by optimal policies found using various methods in these environments. The first row gives numbers for the Drl method. The rows Ndps-SMT and Ndps-BOPT for versions of the Ndps algorithm that respectively use SMT-based optimization and Bayesian optimization to find template parameters (more on this below).

Appendix B Additional Details on Algorithm

Now we elaborate on the optimization techniques we used in the distance computation step argminehe(h)-e𝒩(h), to find a program similar to a given program e, in Algorithm 1.

As mentioned in the main paper, we start by enumerating a list of program templates, or programs with numerical-valued parameters θ. This is done by first replacing the numerical constants in e by parameters, eliding some subexpressions from the resulting parameterized program, and then regenerating the subexpressions using the rules of 𝒮 (without instantiating the parameters), giving priority to shorter expressions. The resulting program template eθ follows the sketch 𝒮 and is also structurally close to e. Now we search for values for parameters θ that optimally imitate the neural oracle.

Bayesian optimization.

We use Bayesian optimization as our primary tool when searching for such optimal parameter values. This method applies to problems in which actions (program outputs) can be represented as vectors of real numbers. All problems considered in our experiments fall in this category. The distance of individual pairs of outputs of the synthesized program and the policy oracle is then simply the Euclidean distance between them. The sum of these distances is used to define the aggregate cost across all inputs in . We then use Bayesian optimization to find parameters that minimize this cost.

Figure 8: A programmatic policy for Acrobot, automatically discovered by the Ndps algorithm.
Figure 9: A programmatic policy for CartPole, automatically discovered by the Ndps algorithm.
Figure 10: A programmatic policy for MountainCar, automatically discovered by the Ndps algorithm.

SMT-based Optimization.

We also use a second parameter search technique based on SMT (Satisfiability Modulo Theories) solving. Here, we generate a constraint that stipulates that for each h, the output eθ(h) must match e𝒩(h) up to a constant error. Here, e𝒩(h) is a constant value obtained by executing e𝒩. The output eθ(h) depends on unknown parameters θ; however, constraints over eθ(h) can be represented as constraints over θ using techniques for symbolic execution of programs (Cadar & Sen, 2013). Because the oracle is only an approximation to the optimal policy in our setting, we do not insist that the generated constraint is satisfied entirely. Instead, we set up a Max-Sat problem which assigns a weight to the constraint for each input h, and then solve this problem with a Max-Sat solver.

Unfortunately, SMT-based optimization does not scale well in environments with continuous actions. Consequently, we exclusively use Bayesian optimization for all Torcs based experiments. SMT-based optimization can be used in the classic control games, however, and Table 6 shows results generated using this technique (in row Ndps-SMT).

The results in Table 6 show that for the classic control games, SMT-based optimization gives better results. This is because the small number of legal actions in these games, limited to at most three values {0,1,2}, are well suited for the SMT setting. The SMT solver is able to efficiently perform parameter optimization, with a small set of histories. Whereas, the limited variability in actions forces the Bayesian optimization method to use a larger set of histories, and makes it harder for the method to avoid getting trapped in local minimas.

Appendix C Policy Examples

In this section we present more examples of the policies found by the Ndps algorithm.

The program in Figure 7 shows the body of a policy for steering, which together with the acceleration policy given in the paper (Figure 2), was found by the Ndps algorithm by training on the Aalborg track. Figures 7 & 7 likewise show the policies for acceleration and steering respectively, when trained on the CG-Speedway-1 track. Similarly, Figures 10, 10 & 10 show policies found for Acrobot, CartPole, and MountainCar respectively. Here hi is the sequence of observations from the i-th of k sensors, for example h0 is the 0-th sensor. The sensor order is determined by the OpenAI simulator.

Appendix D Torcs Video

We provide a video at the following link, which depicts clips of the Drl agent and the Ndps algorithm synthesized program, on the training track and one of the transfer (unseen) tracks, in that order:

On the training track, we can see that the steering actions taken by the Drl agent are very irregular, especially when compared to the smooth steering actions of the Ndps agent in the following clip. For the transfer track, we show the agents driving on the E-Road track. We can see that the Drl agent crashes before completing a full lap, while the Ndps agent does not crash. We have provided only small clips of the car during a race, to keep the video length and size small, but the behavior is representative of the agent for the entire race.