Abstract
We study the problem of programmatic reinforcement learning, in whichpolicies are represented as short programs in a symbolic language. Programmaticpolicies can be more interpretable, generalizable, and amenable to formalverification than neural policies; however, designing rigorous learningapproaches for such policies remains a challenge. Our approach to thischallenge  a metaalgorithm called PROPEL  is based on three insights.First, we view our learning task as optimization in policy space, modulo theconstraint that the desired policy has a programmatic representation, and solvethis optimization problem using a form of mirror descent that takes a gradientstep into the unconstrained policy space and then projects back onto theconstrained space. Second, we view the unconstrained policy space as mixingneural and programmatic representations, which enables employingstateoftheart deep policy gradient approaches. Third, we cast the projectionstep as program synthesis via imitation learning, and exploit contemporarycombinatorial methods for this task. We present theoretical convergence resultsfor PROPEL and empirically evaluate the approach in three continuous controldomains. The experiments show that PROPEL can significantly outperformstateoftheart approaches for learning programmatic policies.
Quick Read (beta)
ImitationProjected Programmatic Reinforcement Learning
Appendix: ImitationProjected Programmatic Reinforcement Learning
Abstract
We study the problem of programmatic reinforcement learning, in which policies are represented as short programs in a symbolic language. Programmatic policies can be more interpretable, generalizable, and amenable to formal verification than neural policies; however, designing rigorous learning approaches for such policies remains a challenge. Our approach to this challenge — a metaalgorithm called Propel— is based on three insights. First, we view our learning task as optimization in policy space, modulo the constraint that the desired policy has a programmatic representation, and solve this optimization problem using a form of mirror descent that takes a gradient step into the unconstrained policy space and then projects back onto the constrained space. Second, we view the unconstrained policy space as mixing neural and programmatic representations, which enables employing stateoftheart deep policy gradient approaches. Third, we cast the projection step as program synthesis via imitation learning, and exploit contemporary combinatorial methods for this task. We present theoretical convergence results for Propel and empirically evaluate the approach in three continuous control domains. The experiments show that Propel can significantly outperform stateoftheart approaches for learning programmatic policies.
ImitationProjected Programmatic Reinforcement Learning
Abhinav Verma^{†}^{†}thanks: Equal contribution Rice University [email protected] Hoang M. Le${}^{\mathrm{*}}$ Caltech [email protected] Yisong Yue Caltech [email protected] Swarat Chaudhuri Rice University [email protected]
noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]
1 Introduction
A growing body of work [58, 8, 60] investigates reinforcement learning (RL) approaches that represent policies as programs in a symbolic language, e.g., a domainspecific language for composing control modules such as PID controllers [5]. Short programmatic policies offer many advantages over neural policies discovered through deep RL, including greater interpretability, better generalization to unseen environments, and greater amenability to formal verification. These benefits motivate developing effective approaches for learning such programmatic policies.
However, programmatic reinforcement learning (PRl) remains a challenging problem, owing to the highly structured nature of the policy space. Recent stateoftheart approaches employ program synthesis methods to imitate or distill a pretrained neural policy into short programs [58, 8]. However, such a distillation process can yield a highly suboptimal programmatic policy — i.e., a large distillation gap — and the issue of direct policy search for programmatic policies also remains open.
In this paper, we develop Propel (ImitationProjected Programmatic Reinforcement Learning), a new learning metaalgorithm for PRl, as a response to this challenge. The design of Propel is based on three insights that enables integrating and building upon stateoftheart approaches for policy gradients and program synthesis. First, we view programmatic policy learning as a constrained policy optimization problem, in which the desired policies are constrained to be those that have a programmatic representation. This insight motivates utilizing constrained mirror descent approaches, which take a gradient step into the unconstrained policy space and then project back onto the constrained space. Second, by allowing the unconstrained policy space to have a mix of neural and programmatic representations, we can employ welldeveloped deep policy gradient approaches [55, 36, 47, 48, 19] to compute the unconstrained gradient step. Third, we define the projection operator using program synthesis via imitation learning [58, 8], in order to recover a programmatic policy from the unconstrained policy space. Our contributions can be summarized as:

•
We present Propel, a novel metaalgorithm that is based on mirror descent, program synthesis, and imitation learning, for PRl.

•
On the theoretical side, we show how to cast Propel as a form of constrained mirror descent. We provide a thorough theoretical analysis characterizing the impact of approximate gradients and projections. Further, we prove results that provide expected regret bounds and finitesample guarantees under reasonable assumptions.

•
On the practical side, we provide a concrete instantiation of Propel and evaluate it in three continuous control domains, including the challenging carracing domain Torcs [59]. The experiments show significant improvements over stateoftheart approaches for learning programmatic policies.
2 Problem Statement
The problem of programmatic reinforcement learning (PRl) consists of a Markov Decision Process (Mdp) $\mathcal{M}$ and a programmatic policy class $\mathrm{\Pi}$. The definition of $\mathcal{M}=(\mathcal{S},\mathcal{A},P,c,{p}_{0},\gamma )$ is standard [54], with $\mathcal{S}$ being the state space, $\mathcal{A}$ the action space, $P({s}^{\prime}s,a)$ the probability density function of transitioning from a stateaction pair to a new state, $c(s,a)$ the stateaction cost function, ${p}_{0}(s)$ a distribution over starting states, and $\gamma \in (0,1)$ the discount factor. A policy $\pi :\mathcal{S}\to \mathcal{A}$ (stochastically) maps states to actions. We focus on continuous control problems, so $\mathcal{S}$ and $\mathcal{A}$ are assumed to be continuous spaces. The goal is to find a programmatic policy ${\pi}^{*}\in \mathrm{\Pi}$ such that:
${\pi}^{*}=\underset{\pi \in \mathrm{\Pi}}{argmin}J(\pi ),\text{where:}J(\pi )=\mathbf{E}\left[{\displaystyle \sum _{i=0}^{\mathrm{\infty}}}{\gamma}^{i}c({s}_{i},{a}_{i}\equiv \pi ({s}_{i}))\right],$  (1) 
with the expectation taken over the initial state distribution ${s}_{0}\sim {p}_{0}$, the policy decisions, and the transition dynamics $P$. One can also use rewards, in which case (1) becomes a maximization problem.
Programmatic Policy Class. A programmatic policy class $\mathrm{\Pi}$ consists of policies that can be represented parsimoniously by a (domainspecific) programming language. Recent work [58, 8, 60] indicates that such policies can be easier to interpret and formally verify than neural policies, and can also be more robust to changes in the environment.
In this paper, we consider two concrete classes of programmatic policies. The first, a simplification of the class considered in Verma et al. [58], is defined by the modular, highlevel language in Figure 1. This language assumes a library of parameterized functions ${\oplus}_{\theta}$ representing standard controllers, for instance ProportionalIntegralDerivative (PID) [6] or bangbang controllers [11]. Programs in the language take states $s$ as inputs and produce actions $a$ as output, and can invoke fully instantiated library controllers along with predefined arithmetic, boolean and relational operators. The second, “lowerlevel" class, from Bastani et al. [8], consists of decision trees that map states to actions.
$\pi (s)$  $::=$  $a\mid \mathrm{\mathit{O}\mathit{p}}({\pi}_{1}(s),\mathrm{\dots},{\pi}_{k}(s))\mid \mathrm{\mathbf{i}\mathbf{f}}b\mathrm{\mathbf{t}\mathbf{h}\mathbf{e}\mathbf{n}}{\pi}_{1}(s)\mathrm{\mathbf{e}\mathbf{l}\mathbf{s}\mathbf{e}}{\pi}_{2}(s)\mid {\oplus}_{\theta}({\pi}_{1}(s),\mathrm{\dots},{\pi}_{k}(s))$  
$b$  $::=$  $\varphi (s)\mid \mathrm{\mathit{B}\mathit{O}\mathit{p}}({b}_{1},\mathrm{\dots},{b}_{k})$ 
$$ 
Example. Consider the problem of learning a programmatic policy, in the language of Figure 1, that controls a car’s accelerator in the Torcs carracing environment [59]. Figure 2 shows a program in our language for this task. The program invokes PID controllers ${\mathrm{\U0001d67f\U0001d678\U0001d673}}_{\u27e8j,{\theta}_{P},{\theta}_{I},{\theta}_{D}\u27e9}$, where $j$ identifies the sensor (out of 29, in our experiments) that provides inputs to the controller, and ${\theta}_{P}$, ${\theta}_{I}$, and ${\theta}_{D}$ are respectively the realvalued coefficients of the proportional, integral, and derivative terms in the controller. We note that the program only uses the sensors $\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}$ and $\mathrm{\U0001d681\U0001d67f\U0001d67c}$. While $\mathrm{\U0001d683\U0001d69b\U0001d68a\U0001d68c\U0001d694\U0001d67f\U0001d698\U0001d69c}$ (for the position of the car relative to the track axis) is used to decide which controller to use, only the $\mathrm{\U0001d681\U0001d67f\U0001d67c}$ sensor is needed to calculate the acceleration.
Learning Challenges. Learning programmatic policies in the continuous RL setting is challenging, as the best performing methods utilize policy gradient approaches [55, 36, 47, 48, 19], but policy gradients are hard to compute in programmatic representations. In many cases, $\mathrm{\Pi}$ may not even be differentiable. For our approach, we only assume access to program synthesis methods that can select a programmatic policy $\pi \in \mathrm{\Pi}$ that minimizes imitation disagreement with demonstrations provided by a teaching oracle. Because imitation learning tends to be easier than general RL in longhorizon tasks [53], the task of imitating a neural policy with a program is, intuitively, significantly simpler than the full programmatic RL problem. This intuition is corroborated by past work on programmatic RL [58], which shows that direct search over programs often fails to meet basic performance objectives.
[t] \[email protected]@algorithmic[1] \STATEInput: Programmatic & Neural Policy Classes: $\mathrm{\Pi}$ & $\mathcal{F}$. \STATEInput: Either initial ${\pi}_{0}$ or initial ${f}_{0}$ \STATEDefine joint policy class: $\mathscr{H}\equiv \mathrm{\Pi}\oplus \mathcal{F}$ //${h}{\mathrm{\equiv}}{\pi}{\mathrm{+}}{f}$ defined as ${h}{\mathit{}}{\mathrm{(}}{s}{\mathrm{)}}{\mathrm{=}}{\pi}{\mathit{}}{\mathrm{(}}{s}{\mathrm{)}}{\mathrm{+}}{f}{\mathit{}}{\mathrm{(}}{s}{\mathrm{)}}$ \IFgiven initial ${f}_{0}$ \STATE${\pi}_{0}\leftarrow \mathrm{\text{Project}}({f}_{0})$ //program synthesis via imitation learning \ENDIF\FOR$t=1,\mathrm{\dots},T$ \STATE${h}_{t}\leftarrow {\mathrm{\text{Update}}}_{\mathcal{F}}({\pi}_{t1},\eta )$ //policy gradient in neural policy space with learning rate ${\eta}$ \STATE${\pi}_{t}\leftarrow {\mathrm{\text{Project}}}_{\mathrm{\Pi}}({h}_{t})$ //program synthesis via imitation learning \ENDFOR\STATEReturn: Policy ${\pi}_{T}$
3 Learning Algorithm
To develop our approach, we take the viewpoint of (1) being a constrained optimization problem, where $\mathrm{\Pi}\subset \mathscr{H}$ resides within a larger space of policies $\mathscr{H}$. In particular, we will represent $\mathscr{H}\equiv \mathrm{\Pi}\oplus \mathcal{F}$ using a mixing of programmatic policies $\mathrm{\Pi}$ and neural polices $\mathcal{F}$. Any mixed policy $h\equiv \pi +f$ can be invoked as $h(s)=\pi (s)+f(s)$. In general, we assume that $\mathcal{F}$ is a good approximation of $\mathrm{\Pi}$ (i.e., for each $\pi \in \mathrm{\Pi}$ there is some $f\in \mathcal{F}$ that approximates it well), which we formalize in Section 4.
We can now frame our constrained learning problem as minimizing (1) over $\mathrm{\Pi}\subset \mathscr{H}$, that alternate between taking a gradient step in the general space $\mathscr{H}$ and projecting back down onto $\mathrm{\Pi}$. This “liftandproject” perspective motivates viewing our problem via the lens of mirror descent [40]. In standard mirror descent, the unconstrained gradient step can be written as $h\leftarrow {h}_{prev}\eta {\nabla}_{\mathscr{H}}J({h}_{prev})$ for step size $\eta $, and the projection can be written as $\pi \leftarrow {argmin}_{{\pi}^{\prime}\in \mathrm{\Pi}}D({\pi}^{\prime},h)$ for divergence measure $D$.
Our approach, ImitationProjected Programmatic Reinforcement Learning (Propel), is outlined in Algorithm 2 (also see Figure 3). Propel is a metaalgorithm that requires instantiating two subroutines, Update and Project, which correspond to the standard update and projection steps, respectively. Propel can be viewed as a form of functional mirror descent with some notable deviations from vanilla mirror descent.
${\mathrm{\text{Update}}}_{\mathcal{F}}$. Since policy gradient methods are welldeveloped for neural policy classes $\mathcal{F}$ (e.g., [36, 47, 48, 30, 24, 19]) and nonexistent for programmatic policy classes $\mathrm{\Pi}$, Propel is designed to leverage policy gradients in $\mathcal{F}$ and avoid policy gradients in $\mathrm{\Pi}$. Algorithm 3 shows one instantiation of ${\mathrm{\text{Update}}}_{\mathcal{F}}$. Note that standard mirror descent takes unconstrained gradient steps in $\mathscr{H}$ rather than $\mathcal{F}$, and we discuss this discrepancy between ${\mathrm{\text{Update}}}_{\mathscr{H}}$ and ${\mathrm{\text{Update}}}_{\mathcal{F}}$ in Section 4.
${\mathrm{\text{Project}}}_{\mathrm{\Pi}}$. Projecting onto $\mathrm{\Pi}$ can be implemented using program synthesis via imitation learning, i.e., by synthesizing a $\pi \in \mathrm{\Pi}$ to best imitate demonstrations provided by a teaching oracle $h\in \mathscr{H}$. Recent work [58, 8, 60] has given practical heuristics for this task for various programmatic policy classes. Algorithm 3 shows one instantiation of ${\mathrm{\text{Project}}}_{\mathrm{\Pi}}$ (based on DAgger [46]). One complication that arises is that finitesample runs of such imitation learning approaches only return approximate solutions and so the projection is not exact. We characterize the impact of approximate projections in Section 4.
Practical Considerations. In practice, we often employ multiple gradient steps before taking a projection step (as also described in Algorithm 3), because the step size of individual (stochastic) gradient updates can be quite small. Another issue that arises in virtually all policy gradient approaches is that the gradient estimates can have very high variance [55, 33, 30]. We utilize lowvariance policy gradient updates by using the reference $\pi $ as a proximal regularizer in function space [19].
For the projection step (Algorithm 3), in practice we often retain all previous rollouts $\tau $ from all previous projection steps. It is straightforward to query the current oracle $h$ to provide demonstrations on the states $s\in \tau $ from previous rollouts, which can lead to substantial savings in sample complexity with regards to executing rollouts on the environment, while not harming convergence.
[t] \[email protected]@algorithmic[1] \STATEInput: Neural Policy Class $\mathcal{F}$. Input: Reference programmatic policy: $\pi $ \STATEInput: Step size: $\eta $. Input: Regularization parameter: $\lambda $ \STATEInitialize neural policy: ${f}_{0}$ //any standard randomized initialization \FOR$j=1,\mathrm{\dots},m$ \STATE${f}_{j}\leftarrow {f}_{j1}\eta \lambda {\nabla}_{\mathcal{F}}J(\pi +\lambda {f}_{j1})$ //using DDPG [36], TRPO [47], etc., holding ${\pi}$ fixed \ENDFOR\STATEReturn: $h\equiv \pi +\lambda {f}_{m}$
[t] \[email protected]@algorithmic[1] \STATEInput: Programmatic Policy Class: $\mathrm{\Pi}$. Input: Oracle policy: $h$ \STATERollout $h$ on environment, get trajectory: ${\tau}_{0}=({s}^{0},h({s}^{0}),{s}^{1},h({s}^{1}),\mathrm{\dots})$ \STATECreate supervised demonstration set: ${\mathrm{\Gamma}}_{0}=\{(s,h(s))\}$ from ${\tau}_{0}$ \STATEDerive ${\pi}_{0}$ from ${\mathrm{\Gamma}}_{0}$ via program synthesis //e.g., using methods in [58, 8] \FOR$k=1,\mathrm{\dots},M$ \STATERollout ${\pi}_{k1}$, creating trajectory: ${\tau}_{k}$ \STATECollect demonstration data: ${\mathrm{\Gamma}}^{\prime}=\{(s,h(s))s\in {\tau}_{k}\}$ \STATE${\mathrm{\Gamma}}_{k}\leftarrow {\mathrm{\Gamma}}^{\prime}\cup {\mathrm{\Gamma}}_{k1}$ //DAggerstyle imitation learning [46] \STATEDerive ${\pi}_{k}$ from ${\mathrm{\Gamma}}_{k}$ via program synthesis //e.g., using methods in [58, 8] \ENDFOR\STATEReturn: ${\pi}_{M}$
4 Theoretical Analysis
We start by viewing Propel through the lens of online learning in function space, independent of the specific parametric representation. This start point yields a convergence analysis of Alg. 2 in Section 4.1 under generic approximation errors. We then analyze the issues of policy class representation in Sections 4.2 and 4.3, and connect Algorithms 3 and 3 with the overall performance, under some simplifying conditions. In particular, Section 4.3 characterizes the update error in a possibly nondifferentiable setting; to our knowledge, this is the first such analysis of its kind for reinforcement learning.
Preliminaries. We consider $\mathrm{\Pi}$ and $\mathcal{F}$ to be subspaces of an ambient policy space $\mathcal{U}$, which is a vector space equipped with inner product $\u27e8\cdot ,\cdot \u27e9$, induced norm $\parallel u\parallel =\sqrt{\u27e8u,u\u27e9}$, dual norm ${\parallel v\parallel}_{*}=sup\{\u27e8v,u\u27e9\parallel u\parallel \le 1\}$, and standard scaling & addition: $(au+bv)(s)=au(s)+bv(s)$ for $a,b\in \mathbb{R}$ and $u,v\in \mathcal{U}$. The cost functional of a policy $u$ is $J(u)={\int}_{\mathcal{S}}c(s,u(s))\mathit{d}{\mu}^{u}(s)$, where ${\mu}^{u}$ is the distribution of states induced by $u$. The joint policy class is $\mathscr{H}=\mathrm{\Pi}\oplus \mathcal{F}$, by $\mathscr{H}=\{\pi +f\forall \pi \in \mathrm{\Pi},f\in \mathcal{F}\}$.^{1}^{1} 1 The operator $\oplus $ is not a direct sum, since $\mathrm{\Pi}$ and $\mathcal{F}$ are not orthogonal. Note that $\mathscr{H}$ is a subspace of $\mathcal{U}$, and inherits its vector space properties. Without affecting the analysis, we simply equate $\mathcal{U}\equiv \mathscr{H}$ for the remainder of the paper.
We assume that $J$ is convex in $\mathscr{H}$, which implies that subgradient $\partial J(h)$ exists (with respect to $\mathscr{H}$) [9]. Where $J$ is differentiable, we utilize the notion of a Fréchet gradient. Recall that a bounded linear operator $\nabla :\mathscr{H}\mapsto \mathscr{H}$ is called a Fréchet functional gradient of $J$ at $h\in \mathscr{H}$ if $\underset{\parallel g\parallel \to 0}{lim}\frac{J(h+g)J(h)\u27e8\nabla J(h),g\u27e9}{\parallel g\parallel}=0$. By default, $\nabla $ (or ${\nabla}_{\mathscr{H}}$ for emphasis) denotes the gradient with respect to $\mathscr{H}$, whereas ${\nabla}_{\mathcal{F}}$ defines the gradient in the restricted subspace $\mathcal{F}$.
4.1 Propel as (Approximate) Functional Mirror Descent
For our analysis, Propel can be viewed as approximating mirror descent in (infinitedimensional) function space over a convex set $\mathrm{\Pi}\subset \mathscr{H}$.^{2}^{2} 2 $\mathrm{\Pi}$ can be convexified by considering randomized policies, as stochastic combinations of $\pi \in \mathrm{\Pi}$ (cf. [35]). Similar to the finitedimensional setting [40], we choose a strongly convex and smooth functional regularizer $R$ to be the mirror map. From the approximate mirror descent perspective, for each iteration $t$:

1.
Obtain a noisy gradient estimate: ${\widehat{\nabla}}_{t1}\approx \nabla J({\pi}_{t1})$

2.
${\mathrm{\text{Update}}}_{\mathscr{H}}(\pi )$ in $\mathscr{H}$ space: $\nabla R({h}_{t})=\nabla R({\pi}_{t1})\eta {\widehat{\nabla}}_{t1}$ (Note ${\mathrm{\text{Update}}}_{\mathrm{H}}\mathrm{\ne}{\mathrm{\text{Update}}}_{\mathrm{F}}$)

3.
Obtain approximate projection: ${\pi}_{t}={\mathrm{\text{Project}}}_{\mathrm{\Pi}}^{R}({h}_{t})\approx {argmin}_{\pi \in \mathrm{\Pi}}{D}_{R}(\pi ,{h}_{t})$
${D}_{R}(u,v)=R(u)R(v)\u27e8\nabla R(u),uv\u27e9$ is a Bregman divergence. Taking $R(h)=\frac{1}{2}{\parallel h\parallel}^{2}$ will recover projected functional gradient descent in $L2$space. Here Update becomes ${h}_{t}={\pi}_{t1}\eta \widehat{\nabla}J({\pi}_{t1})$, and Project solves for ${argmin}_{\pi \in \mathrm{\Pi}}{\parallel \pi {h}_{t}\parallel}^{2}$. While we mainly focus on this choice of $R$ in our experiments, note that other selections of $R$ lead to different Update and Project operators (e.g., minimizing KL divergence if $R$ is negative entropy).
The functional mirror descent scheme above may encounter two additional sources of error compared to standard mirror descent [40]. First, in the stochastic setting (also called bandit feedback [28]), the gradient estimate ${\widehat{\nabla}}_{t}$ may be biased, in addition to having high variance. One potential source of bias is the gap between ${\mathrm{\text{Update}}}_{\mathscr{H}}$ and ${\mathrm{\text{Update}}}_{\mathcal{F}}$. Second, the Project step may be inexact. We start by analyzing the behavior of Propel under generic bias, variance, and projection errors, before discussing the implications of approximating ${\mathrm{\text{Update}}}_{\mathscr{H}}$ and ${\mathrm{\text{Project}}}_{\mathrm{\Pi}}$ by Algs. 3 & 3, respectively. Let the bias be bounded by $\beta $, i.e., $\parallel \mathbb{E}[{\widehat{\nabla}}_{t}{\pi}_{t}]\nabla J({\pi}_{t}){\parallel}_{*}\le \beta $ almost surely. Similarly let the variance of the gradient estimate be bounded by ${\sigma}^{2}$, and the projection error norm $\parallel {\pi}_{t}{\pi}_{t}^{*}\parallel \le \u03f5$. We state the expected regret bound below; more details and a proof appear in Appendix A.2.
Theorem 4.1 (Expected regret bound under gradient estimation and projection errors).
Let ${\pi}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\pi}_{T}$ be a sequence of programmatic policies returned by Algorithm 2, and ${\pi}^{\mathrm{*}}$ be the optimal programmatic policy. Choosing learning rate $\eta \mathrm{=}\sqrt{\frac{\mathrm{1}}{{\sigma}^{\mathrm{2}}}\mathrm{(}\frac{\mathrm{1}}{T}\mathrm{+}\u03f5}\mathrm{)}$, we have the expected regret over $T$ iterations:
$$\mathbb{E}\left[\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})\right]J({\pi}^{*})=O\left(\sigma \sqrt{\frac{1}{T}+\u03f5}+\beta \right).$$  (2) 
The result shows that error $\u03f5$ from Project and the bias $\beta $ do not accumulate and simply contribute an additive term on the expected regret.^{3}^{3} 3 Other mirror descentstyle analyses, such as in [52], lead to accumulation of errors over the rounds of learning $T$. One key difference is that we are leveraging the assumption of convexity of $J$ in the (infinitedimensional) function space representation. The effect of variance of gradient estimate decreases at a $\sqrt{1/T}$ rate. Note that this regret bound is agnostic to the specific Update and Project operations, and can be applied more generically beyond the specific algorithmic choices used in our paper.
4.2 FiniteSample Analysis under Vanilla Policy Gradient Update and DAgger Projection
Next, we show how certain instantiations of Update and Project affect the magnitude of errors and influence endtoend learning performance from finite samples, under some simplifying assumptions on the Update step. For this analysis, we simplify Alg. 3 into the case ${\mathrm{\text{Update}}}_{\mathcal{F}}\equiv {\mathrm{\text{Update}}}_{\mathscr{H}}$. In particular, we assume programmatic policies in $\mathrm{\Pi}$ to be parameterized by a vector $\theta \in {\mathbb{R}}^{k}$, and $\pi $ is differentiable in $\theta $ (e.g., we can view $\mathrm{\Pi}\subset \mathcal{F}$ where $\mathcal{F}$ is parameterized in ${\mathbb{R}}^{k}$). We further assume the trajectory rollout is performed in an exploratory manner, where action is taken uniformly random over finite set of $A$ actions, thus enabling the bound on the bias of gradient estimates via Bernstein’s inequality. The Project step is consistent with Alg. 3, i.e., using DAgger [45] under convex imitation loss, such as ${\mathrm{\ell}}_{2}$ loss. We have the following highprobability guarantee:
Theorem 4.2 (Finitesample guarantee).
At each iteration, we perform vanilla policy gradient estimate of $\pi $ (over $\mathrm{H}$) using $m$ trajectories and, use DAgger algorithm to collect $M$ rollouts for the imitation learning projection. Setting the learning rate $\eta \mathrm{=}\sqrt{\frac{\mathrm{1}}{{\sigma}^{\mathrm{2}}}\mathit{}\mathrm{\left(}\frac{\mathrm{1}}{T}\mathrm{+}\frac{H}{M}\mathrm{+}\sqrt{\frac{\mathrm{log}\mathit{}\mathrm{(}T\mathrm{/}\delta \mathrm{)}}{M}}\mathrm{\right)}}$, after $T$ rounds of the algorithm, we have that:
$$\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})J({\pi}^{*})\le O\left(\sigma \sqrt{\frac{1}{T}+\frac{H}{M}+\sqrt{\frac{\mathrm{log}(T/\delta )}{M}}}\right)+O\left(\sigma \sqrt{\frac{\mathrm{log}(Tk/\delta )}{m}}+\frac{AH\mathrm{log}(Tk/\delta )}{m}\right)$$ 
holds with probability at least $\mathrm{1}\mathrm{}\delta $, with $H$ being the task horizon, $A$ the cardinality of action space, ${\sigma}^{\mathrm{2}}$ the variance of policy gradient estimates, and $k$ the dimension $\mathrm{\Pi}$’s parameterization.
The expanded result and proof are included in Appendix A.3. The proof leverages previous analysis from DAgger [46] and the finite sample analysis of vanilla policy gradient algorithm [32]. The finitesample regret bound scales linearly with the standard deviation $\sigma $ of the gradient estimate, while the bias, which is the very last component of the RHS, scales linearly with the task horizon $H$. Note that the standard deviation $\sigma $ can be exponential in task horizon $H$ in the worst case [32], and so it is important to have practical implementation strategies to reduce the variance of the Update operation. While conducted in a stylized setting, this analysis provides insight in the relative tradeoffs of spending effort in obtaining more accurate projections versus more reliable gradient estimates.
4.3 Closing the gap between ${\mathrm{\text{Update}}}_{\mathscr{H}}$ and ${\mathrm{\text{Update}}}_{\mathcal{F}}$
Our functional mirror descent analysis rests on taking gradients in $\mathscr{H}$: ${\mathrm{\text{Update}}}_{\mathscr{H}}(\pi )$ involves estimating ${\nabla}_{\mathscr{H}}J(\pi )$ in the $\mathscr{H}$ space. On the other hand, Algorithm 3 performs ${\mathrm{\text{Update}}}_{\mathcal{F}}(\pi )$ only in the neural policy space $\mathcal{F}$. In either case, although $J(\pi )$ may be differentiable in the nonparametric ambient policy space, it may not be possible to obtain a differentiable parametric programmatic representation in $\mathrm{\Pi}$. In this section, we discuss theoretical motivations to addressing a practical issue: How do we define and approximate the gradient ${\mathrm{\nabla}}_{\mathrm{H}}\mathit{}J\mathit{}\mathrm{(}\pi \mathrm{)}$ under a parametric representation? To our knowledge, we are the first to consider such a theoretical question for reinforcement learning.
Defining a consistent approximation of ${\mathrm{\nabla}}_{\mathrm{H}}\mathbf{}J\mathbf{}\mathrm{(}\pi \mathrm{)}$. The idea in ${\mathrm{\text{Update}}}_{\mathcal{F}}(\pi )$ (Line 8 of Alg. 2) is to approximate ${\nabla}_{\mathscr{H}}J(\pi )$ by ${\nabla}_{\mathcal{F}}J(f)$, which has a differentiable representation, at some $f$ close to $\pi $ (under the norm). Under appropriate conditions on $\mathcal{F}$, we show that this approximation is valid.
Proposition 4.3.
Assume that (i) $J$ is Fréchet differentiable on $\mathrm{H}$, (ii) $J$ is also differentiable on the restricted subspace $\mathrm{F}$, and (iii) $\mathrm{F}$ is dense in $\mathrm{H}$ (i.e., the closure $\text{[emailprotected]}\mathit{}\mathrm{\Delta}\mathit{}\text{[emailprotected]}\mathit{}\text{[emailprotected]}\mathit{}\text{[emailprotected]@skewchar}\mathit{}\text{[emailprotected]@a}\mathit{}\mathrm{111}\mathit{}\mathrm{F}\mathrm{=}\mathrm{H}$). Then for any fixed policy $\pi \mathrm{\in}\mathrm{\Pi}$, define a sequence of policies ${f}_{k}\mathrm{\in}\mathrm{F}$, $k\mathrm{=}\mathrm{1}\mathrm{,}\mathrm{2}\mathrm{,}\mathrm{\dots}$), that converges to $\pi $: ${\mathrm{lim}}_{k\mathrm{\to}\mathrm{\infty}}\mathit{}\mathrm{\parallel}{f}_{k}\mathrm{}\pi \mathrm{\parallel}\mathrm{=}\mathrm{0}$. We then have ${\mathrm{lim}}_{k\mathrm{\to}\mathrm{\infty}}\mathit{}{\mathrm{\parallel}{\mathrm{\nabla}}_{\mathrm{F}}\mathit{}J\mathit{}\mathrm{(}{f}_{k}\mathrm{)}\mathrm{}{\mathrm{\nabla}}_{\mathrm{H}}\mathit{}J\mathit{}\mathrm{(}\pi \mathrm{)}\mathrm{\parallel}}_{\mathrm{*}}\mathrm{=}\mathrm{0}$.
Since the Fréchet gradient is unique in the ambient space $\mathscr{H}$, $\forall k$ we have ${\nabla}_{\mathscr{H}}J({f}_{k})={\nabla}_{\mathcal{F}}J({f}_{k})\to {\nabla}_{\mathscr{H}}J(\pi )$ as $k\to \mathrm{\infty}$ (by Proposition 4.3). We thus have an asymptotically unbiased approximation of ${\nabla}_{\mathscr{H}}J(\pi )$ via differentiable space $\mathcal{F}$ as: ${\nabla}_{\mathcal{F}}J(\pi )\triangleq {\nabla}_{\mathscr{H}}J(\pi )\triangleq {lim}_{k\to \mathrm{\infty}}{\nabla}_{\mathcal{F}}J({f}_{k})$.^{4}^{4} 4 We do not assume $J(\pi )$ to be differentiable when restricting to the policy subspace $\mathrm{\Pi}$, i.e., ${\nabla}_{\mathrm{\Pi}}J(\pi )$ may not exist under policy parameterization of $\mathrm{\Pi}$. Connecting to the result from Theorem 4.1, let ${\sigma}^{2}$ be an upper bound on the policy gradient estimates in the neural policy class $\mathrm{F}$, under an asymptotically unbiased approximation of ${\nabla}_{\mathscr{H}}J(\pi )$, the expected regret bound becomes $\mathbb{E}\left[\frac{1}{T}{\sum}_{t=1}^{T}J({\pi}_{t})\right]J({\pi}^{*})=O\left(\sigma \sqrt{\frac{1}{T}+\u03f5}\right)$.
Biasvariance considerations of ${\mathrm{\text{Update}}}_{\mathrm{F}}\mathbf{}\mathrm{(}\pi \mathrm{)}$ To further theoretically motivate a practical strategy for ${\mathrm{\text{Update}}}_{\mathcal{F}}(\pi )$ in Algorithm 3, we utilize an equivalent proximal perspective of mirror descent [10], where ${\mathrm{\text{Update}}}_{\mathscr{H}}(\pi )$ is equivalent to solving for ${h}^{\prime}={argmin}_{h\in \mathscr{H}}\eta \u27e8{\nabla}_{\mathscr{H}}J(\pi ),h\u27e9+{D}_{R}(h,\pi )$.
Proposition 4.4 (Minimizing a relaxed objective).
For a fixed programmatic policy $\pi $, with sufficiently small constant $\lambda \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$, we have that
$$\underset{h\in \mathscr{H}}{\mathrm{min}}\eta \u27e8{\nabla}_{\mathscr{H}}J(\pi ),h)\u27e9+D{}_{R}(h,\pi )\le \mathrm{min}{}_{f\in \mathcal{F}}J(\pi +\lambda f)J(\pi )+\u27e8\nabla J(\pi ),\pi \u27e9$$  (3) 
Thus, a relaxed ${\mathrm{\text{Update}}}_{\mathscr{H}}$ step is obtained by minimizing the RHS of (3), i.e., minimizing $J(\pi +\lambda f)$ over $f\in \mathcal{F}$. Each gradient descent update step is now ${f}^{\prime}=f\eta \lambda {\nabla}_{\mathcal{F}}J({\pi}_{t}+\lambda f)$, corresponding to Line 5 of Algorithm 3. For fixed $\pi $ and small $\lambda $, this relaxed optimization problem becomes regularized policy optimization over $\mathcal{F}$, which is significantly easier. Functional regularization in policy space around a fixed prior controller $\pi $ has demonstrated significant reduction in the variance of gradient estimate [19], at the expense of some bias. The below expected regret bound summarizes the impact of this increased bias and reduced variance, with details included in Appendix A.5.
Proposition 4.5 (Biasvariance characterization of ${\mathrm{\text{Update}}}_{\mathcal{F}}$).
Assuming $J\mathit{}\mathrm{(}h\mathrm{)}$ is $L$strongly smooth over $\mathrm{H}$, i.e., ${\mathrm{\nabla}}_{\mathrm{H}}\mathit{}J\mathit{}\mathrm{(}h\mathrm{)}$ is $L$Lipschitz continuous, approximating ${\mathrm{\text{Update}}}_{\mathrm{H}}$ by ${\mathrm{\text{Update}}}_{F}$ per Alg. 3 leads to the expected regret bound: $\mathrm{E}\mathit{}\mathrm{\left[}\frac{\mathrm{1}}{T}\mathit{}{\mathrm{\sum}}_{t\mathrm{=}\mathrm{1}}^{T}J\mathit{}\mathrm{(}{\pi}_{t}\mathrm{)}\mathrm{\right]}\mathrm{}J\mathit{}\mathrm{(}{\pi}^{\mathrm{*}}\mathrm{)}\mathrm{=}O\mathit{}\mathrm{\left(}\lambda \mathit{}\sigma \mathit{}\sqrt{\frac{\mathrm{1}}{T}\mathrm{+}\u03f5}\mathrm{+}{\lambda}^{\mathrm{2}}\mathit{}{L}^{\mathrm{2}}\mathrm{\right)}$.
Compared to the idealized unbiased approximation in Proposition 4.3, the introduced bias here is related to the inherent smoothness property of cost functional $J(h)$ over the joint policy class $\mathscr{H}$, i.e., how close $J(\pi +\lambda f)$ is to its linear underapproximation $J(\pi )+\u27e8{\nabla}_{\mathscr{H}}J(\pi ),\lambda f\u27e9$ around $\pi $.
5 Experiments
We demonstrate the effectiveness of Propel in synthesizing programmatic controllers in three continuous control environments. For brevity and focus, this section primarily focuses on Torcs ^{5}^{5} 5 The code for the Torcs experiments can be found at: https://bitbucket.org/averma8053/propel, a challenging race car simulator environment [59]. Empirical results on two additional classic control tasks, MountainCar and Pendulum, are provided in Appendix B; those results follow similar trends as the ones described for Torcs below, and further validate the convergence analysis of Propel.
Experimental Setup. We evaluate over five distinct tracks in the Torcs simulator. The difficulty of a track can be characterized by three properties; track length, track width, and number of turns. Our suite of tracks provides environments with varying levels of difficulty for the learning algorithm. The performance of a policy in the Torcs simulator is measured by the lap time achieved on the track. To calculate the lap time, the policies are allowed to complete a threelap race, and we record the best lap time during this race. We perform the experiments with twentyfive random seeds and report the median lap time over these twentyfive trials. Some of the policies crash the car before completing a lap on certain tracks, even after training for $600$ episodes. Such crashes are recorded as a lap time of infinity while calculating the median. If the policy crashes for more than half the seeds, this is reported as Cr in Tables 1 & 2. We choose to report the median because taking the crash timing as infinity, or an arbitrarily large constant, heavily skews other common measures such as the mean.
Baselines. Among recent stateoftheart approaches to learning programmatic policies are Ndps [58] for highlevel language policies, and Viper [8] for learning treebased policies. Both Ndps and Viper rely on imitating a fixed (pretrained) neural policy oracle, and can be viewed as degenerate versions of Propel that only run Lines 46 in Algorithm 2. We present two Propel analogues to Ndps and Viper: (i) PropelProg: Propel using the highlevel language of Figure 1 as the class of programmatic policies, similar to Ndps. (ii) PropelTree: Propel using regression trees, similar to Viper. We also report results for Prior, which is a (suboptimal) PID controller that is also used as the initial policy in Propel. In addition, to study generalization ability as well as safety behavior during training, we also include Ddpg, a neural policy learned using the Deep Deterministic Policy Gradients [36] algorithm, with $600$ episodes of training for each track. In principle, Propel and its analysis can accommodate different policy gradient subroutines. However, in the Torcs domain, other policy gradient algorithms such as PPO and TRPO failed to learn policies that are able to complete the considered tracks. We thus focus on Ddpg as our main policy gradient component.
GTrack  ERoad  Aalborg  Ruudskogen  Alpine2  
Length  3186m  3260m  2588m  3274m  3774m 
Prior  312.92 / 0.0  322.59 / 0.0  244.19 / 0.0  340.29 / 0.0  402.89 / 0.0 
Ddpg  78.82 / 0.24  89.71 / 0.28  101.06 / 0.40  Cr / 0.68  Cr / 0.92 
Ndps  108.25 / 0.24  126.80 / 0.28  163.25 / 0.40  Cr / 0.68  Cr / 0.92 
Viper  83.60 / 0.24  87.53 / 0.28  110.57 / 0.40  Cr / 0.68  Cr / 0.92 
PropelProg  93.67 / 0.04  119.17 / 0.04  147.28 / 0.12  124.58 / 0.16  256.59 / 0.16 
PropelTree  78.33 / 0.04  79.39 / 0.04  109.83 / 0.16  118.80 / 0.24  236.01 / 0.36 

Evaluating Performance. Table 1 shows the performance on the considered Torcs tracks. We see that PropelProg and PropelTree consistently outperform the Ndps [58] and Viper [8] baselines, respectively. While Ddpg outperforms Propel on some tracks, its volatility causes it to be unable to learn in some environments, and hence to crash the majority of the time. Figure 4 shows the consistent improvements made over the prior by PropelProg, over the iterations of the Propel algorithm. Appendix B contains similar results achieved on the two classic control tasks, MountainCar and Pendulum. Figure 5 shows that, compared to Ddpg, our approach suffers far fewer crashes while training in Torcs.
Evaluating Generalization. To compare the ability of the controllers to perform on tracks not seen during training, we executed the learned policies on all the other tracks (Table 2). We observe that Ddpg crashes significantly more often than PropelProg. This demonstrates the generalizability of the policies returned by Propel. Generalization results for the PropelTree policy are given in the appendix. In general, PropelTree policies are more generalizable than Ddpg but less than PropelProg. On an absolute level, the generalization ability of Propel still leaves much room for improvement, which is an interesting direction for future work.
Verifiability of Policies. As shown in prior work [8, 58], parsimonious programmatic policies are more amenable to formal verification than neural policies. Unsurprisingly, the policies generated by PropelTree and PropelProg are easier to verify than Ddpg policies. As a concrete example, we verified a smoothness property of the PropelProg policy using the Z3 SMTsolver [21] (more details in Appendix B). The verification terminated in $0.49$ seconds.
Initialization. In principle, Propel can be initialized with a random program, or a random policy trained using Ddpg. In practice, the performance of Propel depends to a certain degree on the stability of the policy gradient procedure, which is Ddpg in our experiments. Unfortunately, Ddpg often exhibits high variance across trials and fares poorly in challenging RL domains. Specifically, in our Torcs experiments, Ddpg fails on a number of tracks (similar phenomena have been reported in previous work that experiments on similar continuous control domains [30, 19, 58]). Agents obtained by initializing Propel with neural policies obtained via Ddpg also fail on multiple tracks. Their performance over the five tracks is reported in Appendix B. In contrast, Propel can often finish the challenging tracks when initialized with a very simple handcrafted programmatic prior.
GTrack  ERoad  Aalborg  Ruudskogen  Alpine2  

GTrack    124 / Cr  Cr / Cr  Cr / Cr  Cr / Cr 
ERoad  102 / 92    Cr / Cr  Cr / Cr  Cr / Cr 
Aalborg  201 / 91  228 / Cr    217 / Cr  Cr / Cr 
Ruudskogen  131 / Cr  135 / Cr  Cr / Cr    Cr / Cr 
Alpine2  222 / Cr  231 / Cr  184 / Cr  Cr / Cr   
6 Related Work
Program Synthesis. Program synthesis is the problem of automatically searching for a program within a language that fits a given specification [29]. Recent approaches to the problem have leveraged symbolic knowledge about program structure [27], satisfiability solvers [50, 31], and metalearning techniques [39, 41, 22, 7] to generate interesting programs in many domains [3, 42, 4]. In most prior work, the specification is a logical constraint on the input/output behavior of the target program. However, there is also a growing body of work that considers program synthesis modulo optimality objectives [13, 15, 43], often motivated by machine learning tasks [39, 57, 26, 23, 58, 8, 60]. Synthesis of programs that imitates an oracle has been considered in both the logical [31] and the optimization [58, 8, 60] settings. The projection step in Propel builds on this prior work. While our current implementation of this step is entirely symbolic, in principle, the operation can also utilize contemporary techniques for learning policies that guide the synthesis process [39, 7, 49].
Constrained Policy Learning. Constrained policy learning has seen increased interest in recent years, largely due to the desire to impose side guarantees such as stability and safety on the policy’s behavior. Broadly, there are two approaches to imposing constraints: specifying constraints as an additional cost function [1, 35], and explicitly encoding constraints into the policy class [2, 34, 19, 20, 12]. In some cases, these two approaches can be viewed as duals of each other. For instance, recent work that uses controltheoretic policies as a functional regularizer [34, 19] can be viewed from the perspective of both regularization (additional cost) and an explicitly constrained policy class (a specific mix of neural and controltheoretic policies). We build upon this perspective to develop the gradient update step in our approach.
RL using Imitation Learning. There are two ways to utilize imitation learning subroutines within RL. First, one can leverage limitedaccess or suboptimal experts to speed up learning [44, 18, 14, 51]. Second, one can learn over two policy classes (or one policy and one model class) to achieve accelerated learning compared to using only one policy class [38, 17, 52, 16]. Our approach has some stylistic similarities to previous efforts [38, 52] that use a richer policy space to search for improvements before retraining the primary policy to imitate the richer policy. One key difference is that our primary policy is programmatic and potentially nondifferentiable. A second key difference is that our theoretical framework takes a functional gradient descent perspective — it would be interesting to carefully compare with previous analysis techniques to find a unifying framework.
RL with Mirror Descent. The mirror descent framework has previously used to analyze and design RL algorithms. For example, Thomas et al. [56] and Mahadevan and Liu [37] use composite objective mirror descent, or Comid [25], which allows incorporating adaptive regularizers into gradient updates, thus offering connections to either natural gradient RL [56] or sparsity inducing RL algorithms [37]. Unlike in our work, these prior approaches perform projection into the same native, differentiable representation. Also, the analyses in these papers do not consider errors introduced by hybrid representations and approximate projection operators. However, one can potentially extend our approach with versions of mirror descent, e.g., Comid, that were considered in these efforts.
7 Conclusion and Future Work
We have presented Propel, a metaalgorithm based on mirror descent, program synthesis, and imitation learning, for programmatic reinforcement learning (PRl). We have presented theoretical convergence results for Propel, developing novel analyses to characterize approximate projections and biased gradients within the mirror descent framework. We also validated Propel empirically, and show that it can discover interpretable, verifiable, generalizable, performant policies and significantly outperform the state of the art in PRl.
The central idea of Propel is the use of imitation learning and combinatorial methods in implementing a projection operation for mirror descent, with the goal of optimization in a functional space that lacks gradients. While we have developed Propel in an RL setting, this idea is not restricted to RL or even sequential decision making. Future work will seek to exploit this insight in other machine learning and program synthesis settings.
Acknowledgements. This work was supported in part by United States Air Force Contract # FA875019C0092, NSF Award # 1645832, NSF Award # CCF1704883, the Okawa Foundation, Raytheon, PIMCO, and Intel.
References
 [1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 22–31. JMLR. org, 2017.
 [2] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 [3] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas KressGazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando SolarLezama, Emina Torlak, and Abhishek Udupa. Syntaxguided synthesis. In Dependable Software Systems Engineering, pages 1–25. 2015.
 [4] Rajeev Alur, Arjun Radhakrishna, and Abhishek Udupa. Scaling enumerative program synthesis via divide and conquer. In Tools and Algorithms for the Construction and Analysis of Systems  23rd International Conference, TACAS 2017, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2017, Uppsala, Sweden, April 2229, 2017, Proceedings, Part I, pages 319–336, 2017.
 [5] Kiam Heong Ang, Gregory Chong, and Yun Li. Pid control system analysis, design, and technology. IEEE transactions on control systems technology, 13(4):559–576, 2005.
 [6] Karl Johan Åström and Tore Hägglund. Automatic tuning of simple regulators with specifications on phase and amplitude margins. Automatica, 20(5):645–651, 1984.
 [7] Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. Deepcoder: Learning to write programs. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 2426, 2017, Conference Track Proceedings, 2017.
 [8] Osbert Bastani, Yewen Pu, and Armando SolarLezama. Verifiable reinforcement learning via policy extraction. In Advances in Neural Information Processing Systems, pages 2494–2504, 2018.
 [9] Heinz H Bauschke, Patrick L Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011.
 [10] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
 [11] Richard Bellman, Irving Glicksberg, and Oliver Gross. On the “bangbang” control problem. Quarterly of Applied Mathematics, 14(1):11–18, 1956.
 [12] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe modelbased reinforcement learning with stability guarantees. In Advances in neural information processing systems, pages 908–918, 2017.
 [13] Roderick Bloem, Krishnendu Chatterjee, Thomas A. Henzinger, and Barbara Jobstmann. Better quality in synthesis through quantitative objectives. In Computer Aided Verification, 21st International Conference, CAV 2009, Grenoble, France, June 26  July 2, 2009. Proceedings, pages 140–156, 2009.
 [14] KaiWei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to search better than your teacher. In International Conference on Machine Learning (ICML), 2015.
 [15] Swarat Chaudhuri, Martin Clochard, and Armando SolarLezama. Bridging boolean and quantitative synthesis using smoothed proof search. In POPL, pages 207–220, 2014.
 [16] ChingAn Cheng, Xinyan Yan, Nathan Ratliff, and Byron Boots. Predictorcorrector policy optimization. In International Conference on Machine Learning (ICML), 2019.
 [17] ChingAn Cheng, Xinyan Yan, Evangelos Theodorou, and Byron Boots. Accelerating imitation learning with predictive models. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
 [18] ChingAn Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning through imitation and reinforcement. In Uncertainty in artificial intelligence, 2019.
 [19] Richard Cheng, Abhinav Verma, Gabor Orosz, Swarat Chaudhuri, Yisong Yue, and Joel Burdick. Control regularization for reduced variance reinforcement learning. In International Conference on Machine Learning (ICML), 2019.
 [20] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.
 [21] Leonardo Mendonça de Moura and Nikolaj Bjørner. Z3: An Efficient SMT Solver. In TACAS, pages 337–340, 2008.
 [22] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdelrahman Mohamed, and Pushmeet Kohli. Robustfill: Neural program learning under noisy i/o. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 990–998. JMLR. org, 2017.
 [23] Tao Du, Jeevana Priya Inala, Yewen Pu, Andrew Spielberg, Adriana Schulz, Daniela Rus, Armando SolarLezama, and Wojciech Matusik. Inversecsg: automatic conversion of 3d models to CSG trees. ACM Trans. Graph., 37(6):213:1–213:16, 2018.
 [24] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
 [25] John C Duchi, Shai ShalevShwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, pages 14–26, 2010.
 [26] Kevin Ellis, Daniel Ritchie, Armando SolarLezama, and Josh Tenenbaum. Learning to infer graphics programs from handdrawn images. In Advances in Neural Information Processing Systems, pages 6059–6068, 2018.
 [27] John K. Feser, Swarat Chaudhuri, and Isil Dillig. Synthesizing data structure transformations from inputoutput examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, Portland, OR, USA, June 1517, 2015, pages 229–239, 2015.
 [28] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACMSIAM symposium on Discrete algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.
 [29] Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. Program synthesis. Foundations and Trends in Programming Languages, 4(12):1–119, 2017.
 [30] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 [31] Susmit Jha, Sumit Gulwani, Sanjit A Seshia, and Ashish Tiwari. Oracleguided componentbased program synthesis. In Proceedings of the 32nd ACM/IEEE International Conference on Software EngineeringVolume 1, pages 215–224. ACM, 2010.
 [32] Sham Machandranath Kakade et al. On the sample complexity of reinforcement learning. PhD thesis, University of London London, England, 2003.
 [33] Vijay R Konda and John N Tsitsiklis. Actorcritic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.
 [34] Hoang M. Le, Andrew Kang, Yisong Yue, and Peter Carr. Smooth imitation learning for online sequence prediction. In International Conference on Machine Learning (ICML), 2016.
 [35] Hoang M Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning (ICML), 2019.
 [36] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [37] Sridhar Mahadevan and Bo Liu. Sparse qlearning with mirror descent. In Proceedings of the TwentyEighth Conference on Uncertainty in Artificial Intelligence, pages 564–573. AUAI Press, 2012.
 [38] William H Montgomery and Sergey Levine. Guided policy search via approximate mirror descent. In Advances in Neural Information Processing Systems, pages 4008–4016, 2016.
 [39] Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. Neural sketch learning for conditional program generation. In ICLR, 2018.
 [40] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
 [41] Emilio Parisotto, Abdelrahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neurosymbolic program synthesis. arXiv preprint arXiv:1611.01855, 2016.
 [42] Oleksandr Polozov and Sumit Gulwani. Flashmeta: a framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on ObjectOriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SPLASH 2015, Pittsburgh, PA, USA, October 2530, 2015, pages 107–126, 2015.
 [43] Veselin Raychev, Pavol Bielik, Martin T. Vechev, and Andreas Krause. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20  22, 2016, pages 761–774, 2016.
 [44] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive noregret learning. arXiv preprint arXiv:1406.5979, 2014.
 [45] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to noregret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.
 [46] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to noregret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 1113, 2011, pages 627–635, 2011.
 [47] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 [48] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 [49] Xujie Si, Yuan Yang, Hanjun Dai, Mayur Naik, and Le Song. Learning a metasolver for syntaxguided program synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 69, 2019, 2019.
 [50] Armando SolarLezama, Liviu Tancau, Rastislav Bodík, Sanjit A. Seshia, and Vijay A. Saraswat. Combinatorial sketching for finite programs. In ASPLOS, pages 404–415, 2006.
 [51] Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning & imitation learning. In International Conference on Learning Representations (ICLR), 2018.
 [52] Wen Sun, Geoffrey J Gordon, Byron Boots, and J Bagnell. Dual policy iteration. In Advances in Neural Information Processing Systems, pages 7059–7069, 2018.
 [53] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In International Conference on Machine Learning (ICML), 2017.
 [54] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [55] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 [56] Philip S Thomas, William C Dabney, Stephen Giguere, and Sridhar Mahadevan. Projected natural actorcritic. In Advances in neural information processing systems, pages 2337–2345, 2013.
 [57] Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles Sutton, and Swarat Chaudhuri. Houdini: Lifelong learning as program synthesis. In Advances in Neural Information Processing Systems, pages 8687–8698, 2018.
 [58] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, pages 5052–5061, 2018.
 [59] Bernhard Wymann, Eric Espié, Christophe Guionneau, Christos Dimitrakakis, Rémi Coulom, and Andrew Sumner. TORCS, The Open Racing Car Simulator. http://www.torcs.org, 2014.
 [60] He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. An inductive synthesis framework for verifiable reinforcement learning. In ACM Conference on Programming Language Design and Implementation (SIGPLAN), 2019.
Appendix A Theoretical Analysis
A.1 Preliminaries and Notations
We formally define an ambient control policy space $\mathcal{U}$ to be a vector space equipped with inner product $\u27e8\cdot ,\cdot \u27e9:\mathcal{U}\times \mathcal{U}\mapsto \mathbb{R}$, which induces a norm $\parallel u\parallel =\sqrt{\u27e8u,u\u27e9}$, and its dual norm defined as ${\parallel v\parallel}_{*}=sup\{\u27e8v,u\u27e9\parallel u\parallel \le 1\}$. While multiple ways to define the inner product exist, for concreteness we can think of the example of squareintegrable stationary policies with $\u27e8u,v\u27e9={\int}_{\mathcal{S}}u(s)v(s)\mathit{d}s$. The addition operator $+$ between two policies $u,v\in \mathcal{U}$ is defined as $(u+v)(s)=u(s)+v(s)$ for all state $s\in \mathcal{S}$. Scaling $\lambda u+\kappa v$ is defined similarly for scalar $\lambda ,\kappa $.
The cost functional of a control policy $u$ is defined as $J(u)={\int}_{0}^{\mathrm{\infty}}c(s(\tau ),u(\tau ))\mathit{d}\tau $, or $J(u)={\int}_{\mathcal{S}}c(s,u(s))\mathit{d}{\mu}^{u}(s)$, where ${\mu}^{u}$ is the distribution of states induced by policy $u$. This latter example is equivalent to the standard notion of value function in reinforcement learning.
Separate from the parametric representation issues, both programmatic policy class $\mathrm{\Pi}$ and neural policy class $\mathcal{F}$, and by extension  the joint policy class $\mathscr{H}$, are considered to live in the ambient vector space $\mathcal{U}$. We thus have a common and welldefined notion of distance between policies from different classes.
We make an important distinction between differentiability of $J(h)$ in the ambient policy space (nonparametric), versus differentiability in parameterization (parametric). For example, if $\mathrm{\Pi}$ is a class of decisiontree based policy, policies in $\mathrm{\Pi}$ may not be differentiable under representation. However, policies $\pi \in \mathrm{\Pi}$ might still be differentiable when considered as points in the ambient vector space $\mathcal{U}$.
We will use the following standard notion of gradient and differentiability from functional analysis:
Definition A.1 (Subgradients).
The subgradient of $J$ at $h$, denoted $\partial J(h)$, is the nonempty set $\{g\in \mathscr{H}\forall j\in \mathscr{H}:\u27e8jh,g\u27e9+J(h)\le J(j)\}$
Definition A.2 (Fréchet gradient).
A bounded linear operator $\nabla :\mathscr{H}\mapsto \mathscr{H}$ is called Fréchet functional gradient of $J$ at $h\in \mathscr{H}$ if $\underset{\parallel g\parallel \to 0}{lim}\frac{J(h+g)J(h)\u27e8\nabla J(h),g\u27e9}{\parallel g\parallel}=0$
The notions of convexity, smoothness and Bregman divergence are analogous to finitedimensional setting:
Definition A.3 (Strong convexity).
A differentiable function $R$ is $\alpha $strongly convex w.r.t norm $\parallel \cdot \parallel $ if $R(y)\ge R(x)+\u27e8\nabla R(x),yx\u27e9+\frac{\alpha}{2}{\parallel yx\parallel}^{2}$
Definition A.4 (Lipschitz continuous gradient smoothness).
A differentiable function $R$ is ${L}_{R}$strongly smooth w.r.t norm $\parallel \cdot \parallel $ if ${\parallel \nabla R(x)\nabla R(y)\parallel}_{*}\le {L}_{R}\parallel xy\parallel $
Definition A.5 (Bregman Divergence).
For a strongly convex regularizer $R$, ${D}_{R}(x,y)=R(x)R(y)\u27e8\nabla R(y),xy\u27e9$ is the Bregman divergence between $x$ and $y$ (not necessarily symmetric)
The following standard result for Bregman divergence will be useful:
Lemma A.1.
[10] For all $x\mathrm{,}y\mathrm{,}z$ we have the identity $\mathrm{\u27e8}\mathrm{\nabla}\mathit{}R\mathit{}\mathrm{(}x\mathrm{)}\mathrm{}\mathrm{\nabla}\mathit{}R\mathit{}\mathrm{(}y\mathrm{)}\mathrm{,}x\mathrm{}z\mathrm{\u27e9}\mathrm{=}{D}_{R}\mathit{}\mathrm{(}x\mathrm{,}y\mathrm{)}\mathrm{+}{D}_{R}\mathit{}\mathrm{(}z\mathrm{,}x\mathrm{)}\mathrm{}{D}_{R}\mathit{}\mathrm{(}z\mathrm{,}y\mathrm{)}$. Since Bregman divergence is nonnegative, a consequence of this identity is that ${D}_{R}\mathit{}\mathrm{(}z\mathrm{,}x\mathrm{)}\mathrm{}{D}_{R}\mathit{}\mathrm{(}z\mathrm{,}y\mathrm{)}\mathrm{\le}\mathrm{\u27e8}\mathrm{\nabla}\mathit{}R\mathit{}\mathrm{(}x\mathrm{)}\mathrm{}\mathrm{\nabla}\mathit{}R\mathit{}\mathrm{(}y\mathrm{)}\mathrm{,}z\mathrm{}x\mathrm{\u27e9}$
A.2 Expected Regret Bound under Noisy Policy Gradient Estimates and Projection Errors
In this section, we show regret bound for the performance of the sequence of returned programs ${\pi}_{1},\mathrm{\dots},{\pi}_{T}$ of the algorithm. The analysis here is agnostic to the particular implementation of algorithm 3 and algorithm 3.
Let $R$ be a $\alpha $strongly convex and ${L}_{R}$smooth functional with respect to norm $\parallel \cdot \parallel $ on $\mathscr{H}$. The steps from algorithm 2 can be described as follows.

•
Initialize ${\pi}_{0}\in \mathrm{\Pi}$. For each iteration $t$:

1.
Obtain a noisy estimate of the gradient $\widehat{\nabla}J({\pi}_{t1})\approx \nabla J({\pi}_{t1})$

2.
Update in the $\mathscr{H}$ space: $\nabla R({h}_{t})=\nabla R({\pi}_{t1})\eta \widehat{\nabla}J({\pi}_{t1})$

3.
Obtain approximate projection ${\pi}_{t}={\mathrm{\text{Project}}}_{\pi}^{R}({h}_{t})\approx {argmin}_{pi\in \mathrm{\Pi}}{D}_{R}(\pi ,{h}_{t})$

1.
This procedure is an approximate functional mirror descent scheme under bandit feedback. We will develop the following result, which is a more detailed version of 4.1 in the main paper.
In the statement below, $D$ is the diameter on $\mathrm{\Pi}$ with respect to defined norm $\parallel \cdot \parallel $ (i.e., $D=sup\parallel \pi {\pi}^{\prime}\parallel $). ${L}_{J}$ is the Lipschitz constant of the functional $J$ on $\mathscr{H}$. $\beta ,{\sigma}^{2}$ are the bound on the bias and variance of the gradient estimate at each iteration, respectively. $\alpha $ and ${\u0141}_{R}$ are the strongly convex and smooth coefficients of the functional regularizer $R$. Finally, $\u03f5$ is the bound on the projection error with respect to the same norm $\parallel \cdot \parallel $.
Theorem A.2 (Regret bound of returned policies).
Let ${\pi}_{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\pi}_{T}$ be a sequence of programmatic policies returned by algorithm 2 and ${\pi}^{\mathrm{*}}$ be the optimal programmatic policy. We have the expected regret bound:
$$\mathbb{E}\left[\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})\right]J({\pi}^{*})\le \frac{{L}_{R}{D}^{2}}{\eta T}+\frac{\u03f5{L}_{R}D}{\eta}+\frac{\eta ({\sigma}^{2}+{L}_{J}^{2})}{\alpha}+\beta D$$ 
In particular, choosing the learning rate $\eta \mathrm{=}\sqrt{\frac{\frac{\mathrm{1}}{T}\mathrm{+}\u03f5}{{\sigma}^{\mathrm{2}}}}$, the expected regret is simplified into:
(4) 
$$\mathbb{E}\left[\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})\right]J({\pi}^{*})=O\left(\sigma \sqrt{\frac{1}{T}+\u03f5}+\beta \right)$$ 
Proof.
At each round $t$, let $\text{[emailprotected]}\mathrm{\Delta}\text{[emailprotected]}\text{[emailprotected]}\text{[emailprotected]@skewchar}\text{[emailprotected]@a}111{\nabla}_{t}=\text{\mathbf{E}}[{\widehat{\nabla}}_{t}{\pi}_{t}]$ be the conditional expectation of the gradient estimate. We will use the shorthand notation ${\nabla}_{t}=\nabla J({\pi}_{t})$. Denote the upperbound on the bias of the estimate by ${\beta}_{t}$, i.e., ${\parallel \text{[emailprotected]}\mathrm{\Delta}\text{[emailprotected]}\text{[emailprotected]}\text{[emailprotected]@skewchar}\text{[emailprotected]@a}111{\nabla}_{t}{\nabla}_{t}\parallel}_{*}\le {\beta}_{t}$ almost surely. Denote the noise of the gradient estimate by ${\xi}_{t}=\text{[emailprotected]}\mathrm{\Delta}\text{[emailprotected]}\text{[emailprotected]}\text{[emailprotected]@skewchar}\text{[emailprotected]@a}111{\nabla}_{t}{\widehat{\nabla}}_{t}$, and ${\sigma}_{t}^{2}=\text{\mathbf{E}}\left[{\parallel {\widehat{\nabla}}_{t}\text{[emailprotected]}\mathrm{\Delta}\text{[emailprotected]}\text{[emailprotected]}\text{[emailprotected]@skewchar}\text{[emailprotected]@a}111{\nabla}_{t}\parallel}_{*}^{2}\right]$ is the variance of gradient estimate ${\widehat{\nabla}}_{t}$.
The projection operator is $\u03f5$approximate in the sense that $\parallel {\pi}_{t}{\mathrm{\text{Project}}}_{\mathrm{\Pi}}^{R}({f}_{t})\parallel =\parallel {\widehat{\mathrm{\text{Project}}}}_{\mathrm{\Pi}}^{R}({h}_{t}){\mathrm{\text{Project}}}_{\mathrm{\Pi}}^{R}({h}_{t})\parallel \le \u03f5$ with some constant $\u03f5$, which reflects the statistical error of the imitation learning procedure. This projection error in general is independent of the choice of function classes $\mathrm{\Pi}$ and $\mathcal{F}$.We will use the shorthand notation ${\pi}_{t}^{*}={\mathrm{\text{Project}}}_{\mathrm{\Pi}}^{R}({f}_{t})$ for the true Bregman projection of ${h}_{t}$ onto $\mathrm{\Pi}$.
Due to convexity of $J$ over the space $\mathscr{H}$ (which includes $\mathrm{\Pi}$), we have for all $\pi \in \mathrm{\Pi}$:
$$J({\pi}_{t})J(\pi )\le \u27e8{\nabla}_{t},{\pi}_{t}\pi \u27e9$$ 
We proceed to bound the RHS, starting with bounding the inner product where the actual gradient is replaced by the estimated gradient.
$\u27e8{\widehat{\nabla}}_{t},{\pi}_{t}\pi \u27e9={\displaystyle \frac{1}{{\eta}_{t}}}\u27e8\nabla R({\pi}_{t})\nabla R({h}_{t+1}),{\pi}_{t}\pi \u27e9$  (5)  
$={\displaystyle \frac{1}{{\eta}_{t}}}\left({D}_{R}(\pi ,{\pi}_{t}){D}_{R}(\pi ,{h}_{t+1})+{D}_{R}({\pi}_{t},{h}_{t+1})\right)$  (6)  
$\le {\displaystyle \frac{1}{{\eta}_{t}}}\left({D}_{R}(\pi ,{\pi}_{t}){D}_{R}(\pi ,{\pi}_{t+1}^{*}){D}_{R}({\pi}_{t+1}^{*},{h}_{t+1})+{D}_{R}({\pi}_{t},{h}_{t+1})\right)$  (7)  
$={\displaystyle \frac{1}{{\eta}_{t}}}\left(\underset{\text{telescoping}}{\underset{\u23df}{{D}_{R}(\pi ,{\pi}_{t}){D}_{R}(\pi ,{\pi}_{t+1})}}+\underset{\text{projection error}}{\underset{\u23df}{{D}_{R}(\pi ,{\pi}_{t+1}){D}_{R}(\pi ,{\pi}_{t+1}^{*})}}\underset{\text{relative improvement}}{\underset{\u23df}{{D}_{R}({\pi}_{t+1}^{*},{h}_{t+1})+{D}_{R}({\pi}_{t},{h}_{t+1})}}\right)$  (8) 
Equation (5) is due to the gradient update rule in $\mathcal{F}$ space. Equation (6) is derived from definition of Bregman divergence. Equation (7) is due to the generalized Pythagorean theorem of Bregman projection ${D}_{R}(x,y)\ge {D}_{R}(x,{\mathrm{\text{Project}}}_{\mathrm{\Pi}}^{R}(x))+{D}_{R}({\mathrm{\text{Project}}}_{\mathrm{\Pi}}^{R}(x),y)$. The RHS of equation (7) are decomposed into three components that will be bounded separately.
Bounding projection error. By lemma (A.1) we have
$${D}_{R}(\pi ,{\pi}_{t+1}){D}_{R}(\pi ,{\pi}_{t+1}^{*})\le \u27e8\nabla R({\pi}_{t+1})\nabla R({\pi}_{t+1}^{*}),\pi {\pi}_{t+1}\u27e9$$  (9) 
$$\le \parallel \nabla R({\pi}_{t+1})\nabla R({\pi}_{t+1}^{*})\parallel {\parallel \pi {\pi}_{t+1}\parallel}_{*}$$  (10) 
$$\le {L}_{R}\parallel {\pi}_{t+1}{\pi}_{t+1}^{*}\parallel D\le \u03f5{L}_{R}D$$  (11) 
Equation (10) is due to Cauchy–Schwarz. Equation (11) is due to Lipschitz smoothness of $\nabla R$ and definition of $\u03f5$approximate projection.
Bounding relative improvement. This follows standard argument from analysis of mirror descent algorithm.
${D}_{R}({\pi}_{t},{h}_{t+1}){D}_{R}({\pi}_{t+1}^{*},{h}_{t+1})=R({\pi}_{t})R({\pi}_{t+1}^{*})+\u27e8\nabla R({h}_{t+1}),{\pi}_{t+1}^{*}{\pi}_{t}\u27e9$  (12)  
$\le \u27e8\nabla R({\pi}_{t}),{\pi}_{t}{\pi}_{t+1}^{*}\u27e9{\displaystyle \frac{\alpha}{2}}{\parallel {\pi}_{t+1}^{*}{\pi}_{t}\parallel}_{*}^{2}+\u27e8\nabla R({h}_{t+1}),{\pi}_{t+1}^{*}{\pi}_{t}\u27e9$  (13)  
$={\eta}_{t}\u27e8{\widehat{\nabla}}_{t},{\pi}_{t+1}^{*}{\pi}_{t}\u27e9{\displaystyle \frac{\alpha}{2}}{\parallel {\pi}_{t+1}^{*}{\pi}_{t}\parallel}^{2}$  (14)  
$\le {\displaystyle \frac{{\eta}_{t}^{2}}{2\alpha}}{\parallel {\widehat{\nabla}}_{t}\parallel}_{*}^{2}\le {\displaystyle \frac{{\eta}_{t}^{2}}{\alpha}}({\sigma}_{t}^{2}+{L}_{J}^{2})$  (15) 
Equation (13) is from the $\alpha $strong convexity property of regularizer $R$. Equation (14) is by definition of the gradient update. Combining the bounds on the three components and taking expectation, we thus have
$$\mathbb{E}\left[\u27e8{\widehat{\nabla}}_{t},{\pi}_{t}\pi \u27e9\right]\le \frac{1}{{\eta}_{t}}\left({D}_{R}(\pi ,{\pi}_{t}){D}_{R}(\pi ,{\pi}_{t+1})+\u03f5{L}_{R}D+\frac{{\eta}_{t}^{2}}{\alpha}({\sigma}_{t}^{2}+{L}_{J}^{2})\right)$$  (16) 
Next, the difference between estimated gradient ${\widehat{\nabla}}_{t}$ and actual gradient ${\nabla}_{t}$ factors into the bound via CauchySchwarz:
$$\mathbb{E}\left[\u27e8{\nabla}_{t}{\widehat{\nabla}}_{t},{\pi}_{t}\pi \u27e9\right]\le {\parallel {\nabla}_{t}\mathbb{E}[{\widehat{\nabla}}_{t}]\parallel}_{*}\parallel {\pi}_{t}\pi \parallel \le {\beta}_{t}D$$  (17) 
Unbiased gradient estimates. For the case when the gradient estimate is unbiased, assume the variance of the noise of gradient estimates is bounded by ${\sigma}^{2}$, we have the expected regret bound for all $pi\in \mathrm{\Pi}$
$$\mathbb{E}\left[\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})\right]J(\pi )\le \frac{{L}_{R}{D}^{2}}{\eta T}+\frac{\u03f5{L}_{R}D}{\eta}+\frac{\eta ({\sigma}^{2}+{L}_{J}^{2})}{\alpha}$$  (18) 
here to clarify, ${L}_{R}$ is the smoothness coefficient of regularizer $R$ (i.e., the gradient of $R$ is ${L}_{R}$Lipschitz, ${L}_{J}$ is Lipschitz constant of $J$, $D$ is the diameter of $\mathrm{\Pi}$ under norm $\parallel \cdot \parallel $, ${\sigma}^{2}$ is the upperbound on the variance of gradient estimates, and $\u03f5$ is the error from the projection procedure (i.e., imitation learning loss).
We can set learning rate $\eta =\sqrt{\frac{\frac{1}{T}+\u03f5}{{\sigma}^{2}}}$ to observe that the expected regret is bounded by $O(\sigma \sqrt{\frac{1}{T}+\u03f5})$.
Biased gradient estimates. Assume that the bias of gradient estimate at each round is upperbounded by ${\beta}_{t}\le \beta $. Similar to before, combining inequalities from (16) and (17), we have
$$\mathbb{E}\left[\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})\right]J(\pi )\le \frac{{L}_{R}{D}^{2}}{\eta T}+\frac{\u03f5{L}_{R}D}{\eta}+\frac{\eta ({\sigma}^{2}+{L}_{J}^{2})}{\alpha}+\beta D$$  (19) 
Similar to before, we can set learning rate $\eta =\sqrt{\frac{\frac{1}{T}+\u03f5}{{\sigma}^{2}}}$ to observe that on the expected regret is bounded by $O(\sigma \sqrt{\frac{1}{T}+\u03f5}+\beta )$. Compared to the bound on (18), in the biased case, the extra regret incurred per bound is simply a constant, and does not depend on $T$. ∎
A.3 FiniteSample Analysis
In this section, we provide overall finitesample analysis for Propel under some simplifying assumptions. We first consider the case where exact gradient estimate is available, before extending the result to the general case of noisy policy gradient update. Combining the two steps will give us the proof for the following statement (theorem 4.2 in the main paper)
Theorem A.3 (Finitesample guarantee).
At each iteration, we perform vanilla policy gradient estimate of $\pi $ (over $\mathrm{H}$) using $m$ trajectories and use DAgger algorithm to collect $M$ rollouts. Setting the learning rate $\eta \mathrm{=}\sqrt{\frac{\mathrm{1}}{{\sigma}^{\mathrm{2}}}\mathit{}\mathrm{\left(}\frac{\mathrm{1}}{T}\mathrm{+}\frac{H}{M}\mathrm{+}\sqrt{\frac{\mathrm{log}\mathit{}\mathrm{(}T\mathrm{/}\delta \mathrm{)}}{M}}\mathrm{\right)}}$, after $T$ rounds of the algorithm, we have that
$$\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})J({\pi}^{*})\le O\left(\sigma \sqrt{\frac{1}{T}+\frac{H}{M}+\sqrt{\frac{\mathrm{log}(T/\delta )}{M}}}\right)+O\left(\sigma \sqrt{\frac{\mathrm{log}(Tk/\delta )}{m}}+\frac{AH\mathrm{log}(Tk/\delta )}{m}\right)$$ 
holds with probability at least $\mathrm{1}\mathrm{}\delta $, with $H$ the task horizon, $A$ the cardinality of action space, ${\sigma}^{\mathrm{2}}$ the variance of policy gradient estimates, and $k$ the dimension $\mathrm{\Pi}$’s parameterization.
Exact gradient estimate case. Assuming that the policy gradients can be calculated exactly, it is straightforward to provide highprobability guarantee for the effect of the projection error. We start with the following result, adapted from [45] for the case of projection error bound. In this version of DAgger, we assume that we only collect a single (state, expert action) pair from each trajectory rollout. Result is similar, with tighter bound, when multiple data points are collected along the trajectory.
Lemma A.4 (Projection error bound from imitation learning procedure).
Using DAgger as the imitation learning subroutine for our Project operator in algorithm 3, let $M$ be the number of trajectories rolledout for learning, and $H$ be the horizon of the task. With probability at least $\mathrm{1}\mathrm{}\delta $, we have
$${D}_{R}(\pi ,{\pi}^{*})\le \stackrel{~}{O}(1/M)+\frac{2{\mathrm{\ell}}_{max}(1+H)}{M}+\sqrt{\frac{2{\mathrm{\ell}}_{max}\mathrm{log}(1/\delta ))}{M}}$$ 
where $\pi $ is the result of Project, ${\pi}^{\mathrm{*}}$ is the true Bregman projection of $h$ onto $\mathrm{\Pi}$, and ${\mathrm{\ell}}_{m\mathit{}a\mathit{}x}$ is the maximum value of the imitation learning loss function ${D}_{R}\mathit{}\mathrm{(}\mathrm{\cdot}\mathrm{,}\mathrm{\cdot}\mathrm{)}$
The bound in lemma A.4 is simpler than previous imitation learning results with cost information ([44, 45]. The reason is that the goal of the Project operator is more modest. Since we only care about the distance between the empirical projection $\pi $ and the true projection ${\pi}^{*}$, the loss objective in imitation learning is simplified (i.e., this is only a regret bound), and we can disregard how well policies in $\mathrm{\Pi}$ can imitate the expert $h$, as well as the performance of $J(\pi )$ relative to the true cost from the environment $J(h)$.
A consequence of this lemma is that for the number of trajectories at each round of imitation learning $M=O(\frac{\mathrm{log}1/\delta}{{\u03f5}^{2}})+O(\frac{H}{\u03f5})$, we have ${D}_{R}({\pi}_{t},{\pi}_{t}^{*})\le \u03f5$ with probability at least $1\delta $. Applying union bound across $T$ rounds of learning, we obtain the following guarantee (under no gradient estimation error)
Proposition A.5 (Finitesample Projection Error Bound).
To simplify the presentation of the result, we consider ${L}_{R}\mathrm{,}D\mathrm{,}L\mathrm{,}\alpha $ to be known constants. Using DAgger algorithm to collect $M\mathrm{=}O\mathit{}\mathrm{(}\frac{\mathrm{log}\mathit{}T\mathrm{/}\delta}{{\u03f5}^{\mathrm{2}}}\mathrm{)}\mathrm{+}O\mathit{}\mathrm{(}\frac{H}{\u03f5}\mathrm{)}$ rollouts at each iteration, we have the following regret guarantee after $T$ rounds of our main algorithm:
$$\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})J({\pi}^{*})\le O\left(\frac{1}{\eta T}+\frac{\u03f5}{\eta}+\eta \right)$$ 
with probability at least $\mathrm{1}\mathrm{}\delta $. Consequently, setting $\eta \mathrm{=}\sqrt{\frac{\mathrm{1}}{T}\mathrm{+}\frac{H}{M}\mathrm{+}\sqrt{\frac{\mathrm{log}\mathit{}\mathrm{(}T\mathrm{/}\delta \mathrm{)}}{M}}}$, we have that
$$\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})J({\pi}^{*})\le O\left(\sqrt{\frac{1}{T}+\frac{H}{M}+\sqrt{\frac{\mathrm{log}(T/\delta )}{M}}}\right)$$ 
with probability at least $\mathrm{1}\mathrm{}\delta $
Note that the dependence on the time horizon of the task is sublinear. This is different from standard imitation learning regret bounds, which are often at least linear in the task horizon. The main reason is that our comparison benchmark ${\pi}^{*}$ does live in the space $\mathrm{\Pi}$, whereas for DAgger, the expert policy may not reside in the same space.
Noisy gradient estimate case. We now turn to the issue of estimating the gradient of $\nabla J(\pi )$. We make the following simplifying assumption about the gradient estimation:

•
The $\pi $ is parameterized by vector $\theta \in {\mathbb{R}}^{k}$ (such as a neural network). The parameterization is differentiable with respect to $\theta $ (Alternatively, we can view $\mathrm{\Pi}$ as a differentiable subspace of $\mathcal{F}$, in which case we have $\mathscr{H}=\mathcal{F}$)

•
At each Update loop, the policy is rolled out $m$ times to collect the data, each trajectory has horizon length $H$

•
For each visited state $s\sim {d}_{h}$, the policy takes a uniformly random action $a$. The action space is finite with cardinality $A$.

•
The gradient $\nabla {h}_{\theta}$ is bounded by $B$
The gradient estimate is performed consistent with a generic policy gradient scheme, i.e.,
$$\widehat{\nabla}J(\theta )=\frac{A}{m}\sum _{i=1}^{H}\sum _{j=1}^{m}\nabla {\pi}_{\theta}({a}_{i}^{j}{s}_{i}^{j},\theta ){\widehat{Q}}_{i}^{j}$$ 
where ${\widehat{Q}}_{i}^{j}$ is the estimated costtogo [55].
Taking uniform random exploratory actions ensures that the samples are i.i.d. We can thus apply Bernstein’s inequality to obtain the bound between estimated gradient and the true gradient. Indeed, with probability at least $1\delta $, we have that the following bound on the bias componentwise:
$${\parallel \widehat{\nabla}J(\theta )\nabla J(\theta )\parallel}_{\mathrm{\infty}}\le \beta \text{when}m\ge \frac{(2{\sigma}^{2}+2AHB\frac{\beta}{3})\mathrm{log}\frac{k}{\delta}}{{\beta}^{2}}$$ 
which leads to similar bound with respect to $\parallel \cdot {\parallel}_{*}$ (here we leverage the equivalence of norms in finite dimensional setting):
$${\parallel {\nabla}_{t}{\widehat{\nabla}}_{t}\parallel}_{*}\le \beta \text{when}m=O\left(\frac{({\sigma}^{2}+AHB\beta )\mathrm{log}\frac{k}{\delta}}{{\beta}^{2}}\right)$$ 
Applying union bound of this result over $T$ rounds of learning, and combining with the result from proposition (A.5), we have the following finitesample guarantee in the simplifying policy gradient update. This is also the more detailed statement of theorem 4.2 in the main paper.
Proposition A.6 (Finitesample Guarantee under Noisy Gradient Updates and Projection Error).
At each iteration, we perform policy gradient estimate using $m\mathrm{=}O\mathit{}\mathrm{(}\frac{\mathrm{(}{\sigma}^{\mathrm{2}}\mathrm{+}A\mathit{}H\mathit{}B\mathit{}\beta \mathrm{)}\mathit{}\mathrm{log}\mathit{}\frac{T\mathit{}k}{\delta}}{{\beta}^{\mathrm{2}}}\mathrm{)}$ trajectories and use DAgger algorithm to collect $M\mathrm{=}O\mathit{}\mathrm{(}\frac{\mathrm{log}\mathit{}T\mathrm{/}\delta}{{\u03f5}^{\mathrm{2}}}\mathrm{)}\mathrm{+}O\mathit{}\mathrm{(}\frac{H}{\u03f5}\mathrm{)}$ rollouts. Setting the learning rate $\eta \mathrm{=}\sqrt{\frac{\mathrm{1}}{{\sigma}^{\mathrm{2}}}\mathit{}\mathrm{\left(}\frac{\mathrm{1}}{T}\mathrm{+}\frac{H}{M}\mathrm{+}\sqrt{\frac{\mathrm{log}\mathit{}\mathrm{(}T\mathrm{/}\delta \mathrm{)}}{M}}\mathrm{\right)}}$, after $T$ rounds of the algorithm, we have that
$$\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})J({\pi}^{*})\le O\left(\sigma \sqrt{\frac{1}{T}+\frac{H}{M}+\sqrt{\frac{\mathrm{log}(T/\delta )}{M}}}\right)+\beta $$ 
with probability at least $\mathrm{1}\mathrm{}\delta $.
Consequently, we also have the following regret bound:
$$\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})J({\pi}^{*})\le O\left(\sigma \sqrt{\frac{1}{T}+\frac{H}{M}+\sqrt{\frac{\mathrm{log}(T/\delta )}{M}}}\right)+O\left(\sigma \sqrt{\frac{\mathrm{log}(Tk/\delta )}{m}}+\frac{AH\mathrm{log}(Tk/\delta )}{m}\right)$$ 
holds with probability at least $\mathrm{1}\mathrm{}\delta $, where again $H$ is the task horizon, $A$ is the cardinality of action space, and $k$ is the dimension of function class $\mathrm{\Pi}$’s parameterization.
Proof.
(For both proposition (A.6) and (A.5)). The results follow by taking the inequality from equation (19), and by solving for $\u03f5$ and $\beta $ explicitly in terms of relevant quantities. Based on the specification of $M$ and $m$, we obtain the necessary precision for each round of learning in terms of number of trajectories:
$\beta $  $=O(\sigma {\displaystyle \frac{\mathrm{log}(k/\delta )}{m}}+{\displaystyle \frac{AHB\mathrm{log}(k/\delta )}{m}})$  
$\u03f5$  $=O({\displaystyle \frac{H}{M}}+\sqrt{{\displaystyle \frac{\mathrm{log}(1/\delta )}{M}}})$ 
Setting the learning rate $\eta =\sqrt{\frac{1}{{\sigma}^{2}}\left(\frac{1}{T}+\u03f5\right)}$ and rearranging the inequalities lead to the desired bounds. ∎
The regret bound depends on the variance ${\sigma}^{2}$ of the policy gradient estimates. It is wellknown that vanilla policy gradient updates suffer from high variance. We instead use functional regularization technique, based on CORERL, in the practical implementation of our algorithm. The CORERL subroutine has been demonstrated to reduce the variance in policy gradient updates [19].
A.4 Defining a consistent approximation of ${\nabla}_{\mathscr{H}}J(\pi )$  Proof of Proposition 4.3
We are using the notion of Fréchet derivative to define gradient of differentiable functional. Note that while Gateaux derivative can also be utilized, Fréchet derivative ensures continuity of the gradient operator that would be useful for our analysis.
Definition A.6 (Fréchet gradient).
A bounded linear operator $\nabla :\mathscr{H}\mapsto \mathscr{H}$ is called Fréchet functional gradient of $J$ at $h\in \mathscr{H}$ if $\underset{\parallel g\parallel \to 0}{lim}\frac{J(h+g)J(h)\u27e8\nabla J(h),g\u27e9}{\parallel g\parallel}=0$
We make the following assumption about $\mathscr{H}$ and $\mathcal{F}$. One interpretation of this assumption is that the space of policies $\mathrm{\Pi}$ and $\mathcal{F}$ that we consider have the property that a programmatic policy $\pi \in \mathrm{\Pi}$ can be wellapproximated by a large space of neural policies $f\in \mathcal{F}$.
Assumption 1.
$J$ is Fréchet differentiable on $\mathrm{H}$. $J$ is also differentiable on the restricted subspace $\mathrm{F}$. And $\mathrm{F}$ is dense in $\mathrm{H}$ (i.e., the closure $\text{[emailprotected]}\mathit{}\mathrm{\Delta}\mathit{}\text{[emailprotected]}\mathit{}\text{[emailprotected]}\mathit{}\text{[emailprotected]@skewchar}\mathit{}\text{[emailprotected]@a}\mathit{}\mathrm{111}\mathit{}\mathrm{F}\mathrm{=}\mathrm{H}$)
It is then clear that $\forall $ $f\in \mathcal{F}$ the Fréchet gradient ${\nabla}_{\mathcal{F}}J(f)$, restricted to the subspace $\mathcal{F}$ is equal to the gradient of $f$ in the ambient space $\mathscr{H}$ (since Fréchet gradient is unique). In general, given $\pi \in \mathrm{\Pi}$ and $f\in \mathcal{F}$, $\pi +f$ is not necessarily in $\mathcal{F}$. However, the restricted gradient on subspace $\mathcal{F}$ of $J(\pi +f)$ can be defined asymptotically.
Proposition A.7.
Fixing a policy $\pi \mathrm{\in}\mathrm{\Pi}$, define a sequence of policies ${f}_{k}\mathrm{\in}\mathrm{F}$, $k\mathrm{=}\mathrm{1}\mathrm{,}\mathrm{2}\mathrm{,}\mathrm{\dots}$ that converges to $\pi $: ${\mathrm{lim}}_{k\mathrm{\to}\mathrm{\infty}}\mathit{}\mathrm{\parallel}{f}_{k}\mathrm{}g\mathrm{\parallel}\mathrm{=}\mathrm{0}$, we then have ${\mathrm{lim}}_{k\mathrm{\to}\mathrm{\infty}}\mathit{}{\mathrm{\parallel}{\mathrm{\nabla}}_{\mathrm{F}}\mathit{}J\mathit{}\mathrm{(}{f}_{k}\mathrm{)}\mathrm{}{\mathrm{\nabla}}_{\mathrm{H}}\mathit{}J\mathit{}\mathrm{(}\pi \mathrm{)}\mathrm{\parallel}}_{\mathrm{*}}\mathrm{=}\mathrm{0}$
Proof.
Since Fréchet derivative is a continuous linear operator, we have ${lim}_{k\to \mathrm{\infty}}{\parallel {\nabla}_{\mathscr{H}}J({f}_{k}){\nabla}_{\mathscr{H}}J(\pi )\parallel}_{*}=0$. By the reasoning above, for $f\in \mathcal{F}$, the gradient ${\nabla}_{\mathcal{F}}J(f)$ defined via restriction to the space $\mathcal{F}$ does not change compared to ${\nabla}_{\mathscr{H}}J(f)$, the gradient defined over the ambient space $\mathscr{H}$. Thus we also have ${lim}_{k\to \mathrm{\infty}}{\parallel {\nabla}_{\mathcal{F}}J({f}_{k}){\nabla}_{\mathscr{H}}J(\pi )\parallel}_{*}=0$. By the same argument, we also have that for any given $\pi \in \mathrm{\Pi}$ and $f\in \mathcal{F}$, even if $\pi +f\notin \mathcal{F}$, the gradient ${\nabla}_{\mathcal{F}}J(\pi +f)$ with respect to the $\mathcal{F}$ can be approximated similarly. ∎
Note that we are not assuming $J(\pi )$ to be differentiable when restricting to the policy subspace $\mathrm{\Pi}$.
A.5 Theoretical motivation for Algorithm 3  Proof of Proposition 4.4 and 4.5
We consider the case where $\mathrm{\Pi}$ is not differentiable by parameterization. Note that this does not preclude $J(\pi )$ for $\pi \in \mathrm{\Pi}$ to be differentiable in the nonparametric function space. Two complications arise compared to our previous approximate mirror descent procedure. First, for each $\pi \in \mathrm{\Pi}$, estimating the gradient $\nabla J(\pi )$ (which may not exist under certain parameterization, per section 4.3) can become much more difficult. Second, the update rule $\nabla R(\pi ){\nabla}_{\mathcal{F}}J(\pi )$ may not be in the dual space of $\mathcal{F}$, as in the simple case where $\mathrm{\Pi}\subset \mathcal{F}$, thus making direct gradient update in the $\mathcal{F}$ space inappropriate.
Assumption 2.
$J$ is convex in $\mathrm{H}$.
By convexity of $J$ in $\mathscr{H}$, subgradients $\partial J(h)$ exists for all $h\in \mathscr{H}$. In particular, $\partial J(\pi )$ exists for all $\pi \in \mathrm{\Pi}$. Note that $\partial J(\pi )$ reflects subgradient of $\pi $ with respect to the ambient policy space $\mathscr{H}$.
We will make use of the following equivalent perspective to mirror descent[10], which consists of twostep process for each iteration $t$

1.
Solve for ${h}_{t+1}={argmin}_{h\in \mathscr{H}}\eta \u27e8\partial J({\pi}_{t}),h\u27e9+{D}_{R}(h,{\pi}_{t})$

2.
Solve for ${\pi}_{t+1}={argmin}_{\pi \in \mathrm{\Pi}}{D}_{R}(\pi ,{h}_{t+1})$
We will show how this version of the algorithm motivates our main algorithm. Consider step 1 of the main loop of Propel, where given a fixed $\pi \in \mathrm{\Pi}$, the optimization problem within $\mathscr{H}$ is
$$(\text{OBJECTIVE\_}1)=\underset{h\in \mathscr{H}}{\mathrm{min}}\eta \u27e8\partial J(\pi ),h\u27e9+{D}_{R}(h,\pi )$$  (20) 
Due to convexity of $\mathscr{H}$ and the objective, problem $(\text{OBJECTIVE\_}1)$ is equivalent to:
$(\text{OBJECTIVE\_}1)=$  $\mathrm{min}\u27e8\partial J(\pi ),h\u27e9$  (21)  
$\text{s.t.}{D}_{R}(h,\pi )\le \tau $  (22) 
where $\tau $ depends on $\eta $. Since $\pi $ is fixed, this optimization problem can be relaxed by choosing $\lambda \in [0,1]$, and a set of candidate policies $h=\pi +\lambda f$, for all $f\in \mathcal{F}$, such that ${D}_{R}(h,\pi )\le \tau $ is satisfied (Selection of $\lambda $ is possible with bounded spaces). Since this constraint set is potentially a restricted set compared to the space of policies satisfying inequality (22), the optimization problem (20) is relaxed into:
$$(\text{OBJECTIVE\_}1)\le (\text{OBJECTIVE\_}2)=\underset{f\in \mathcal{F}}{\mathrm{min}}\u27e8\partial J(\pi ),\pi +\lambda f\u27e9$$  (23) 
Due to convexity property of $J$, we have
$$\u27e8\partial J(\pi ),\lambda f\u27e9=\u27e8\partial J(\pi ),\pi +\lambda f\pi )\u27e9\le J(\pi +\lambda f)J(\pi )$$  (24) 
The original problem $\text{OBJECTIVE\_}1$ is thus upper bounded by:
$$\underset{h\in \mathscr{H}}{\mathrm{min}}\eta \u27e8\partial J(\pi ),h)\u27e9+D{}_{R}(h,\pi )\le \mathrm{min}{}_{f\in \mathcal{F}}J(\pi +\lambda f)J(\pi )+\u27e8\partial J(\pi ),\pi \u27e9$$ 
Thus, a relaxed version of original optimization problem $\text{OBJECTIVE\_}1$ can be obtained by miniziming $J(\pi +\lambda f)$ over $f\in \mathcal{F}$ (note that $\pi $ is fixed). This naturally motivates using functional regularization technique, such as CORERL algorithm [19], to update the parameters of differentiable function $f$ via policy gradient descent update:
$${f}^{\prime}=f\eta \lambda {\nabla}_{\mathcal{F}}\lambda J(\pi +\lambda f)$$ 
where the gradient of $J$ is taken with respect to the parameters of $f$ (neural networks). This is exactly the update step in algorithm 3 (also similar to iterative updte of CORERL algorithm), where the neural network policy is regularized by a prior controller $\pi $.
Statement and Proof of Proposition 4.5
Proposition A.8 (Regret bound for the relaxed optimization objective).
Assuming $J\mathit{}\mathrm{(}h\mathrm{)}$ is $L$strongly smooth over $\mathrm{H}$, i.e., ${\mathrm{\nabla}}_{\mathrm{H}}\mathit{}J\mathit{}\mathrm{(}h\mathrm{)}$ is $L$Lipschitz continuous, approximating ${\mathrm{\text{Update}}}_{\mathrm{H}}$ by ${\mathrm{\text{Update}}}_{F}$ per Alg. 3 leads to the expected regret bound: $\mathrm{E}\mathit{}\mathrm{\left[}\frac{\mathrm{1}}{T}\mathit{}{\mathrm{\sum}}_{t\mathrm{=}\mathrm{1}}^{T}J\mathit{}\mathrm{(}{\pi}_{t}\mathrm{)}\mathrm{\right]}\mathrm{}J\mathit{}\mathrm{(}{\pi}^{\mathrm{*}}\mathrm{)}\mathrm{=}O\mathit{}\mathrm{\left(}\lambda \mathit{}\sigma \mathit{}\sqrt{\frac{\mathrm{1}}{T}\mathrm{+}\u03f5}\mathrm{+}{\lambda}^{\mathrm{2}}\mathit{}{L}^{\mathrm{2}}\mathrm{\right)}$
Proof.
Instead of focusing on the bias of the gradient estimate ${\nabla}_{\mathscr{H}}J(\pi )$, we will shift our focus on the alternative proximal formulation of mirror descent, under optimization and projection errors. In particular, at each iteration $t$, let ${h}_{t+1}^{*}={argmin}_{h\in \mathscr{H}}\eta \u27e8\nabla J({\pi}_{t}),h\u27e9+{D}_{R}(h,{\pi}_{t})$ and let the optimization error be defined as ${\beta}_{t}$ where $\nabla R({h}_{t+1})=\nabla R({h}_{t+1}^{*})+{\beta}_{t}$. Note here that this is different from (but related to) the notion of bias from gradient estimate of $\nabla J(\pi )$ used in theorem 4.1 and theorem A.2. The projection error from imitation learning procedure is defined similarly to theorem 4.1: ${\pi}_{t+1}^{*}={argmin}_{\pi \in \mathrm{\Pi}}{D}_{R}(\pi ,{h}_{t+1})$ is the true projection, and $\parallel {\pi}_{t+1}{\pi}_{t+1}^{*}\parallel \le \u03f5$.
We start with similar bounding steps to the proof of theorem 4.1:
$\u27e8\nabla J({\pi}_{t}),{\pi}_{t}\pi \u27e9$  $={\displaystyle \frac{1}{\eta}}\u27e8\nabla R({h}_{t+1}^{*})\nabla R({\pi}_{t}),{\pi}_{t}\pi \u27e9$  
$={\displaystyle \frac{1}{\eta}}\left(\u27e8\nabla R({h}_{t+1})\nabla R({\pi}_{t}),{\pi}_{t}\pi \u27e9\u27e8{\beta}_{t},{\pi}_{t}\pi \u27e9\right)$  
$=\underset{\text{component\_1}}{\underset{\u23df}{{\displaystyle \frac{1}{\eta}}\left({D}_{R}(\pi ,{\pi}_{t}){D}_{R}(\pi ,{h}_{t+1})+{D}_{R}({\pi}_{t},{h}_{t+1})\right)}}+\underset{\text{component\_2}}{\underset{\u23df}{{\displaystyle \frac{1}{\eta}}\u27e8{\beta}_{t},{\pi}_{t}\pi \u27e9}}$  (25) 
As seen from the proof of theorem A.2, component_1 can be upperbounded by: $\frac{1}{\eta}\left(\underset{\text{telescoping}}{\underset{\u23df}{{D}_{R}(\pi ,{\pi}_{t}){D}_{R}(\pi ,{\pi}_{t+1})}}+\underset{\text{projection error}}{\underset{\u23df}{{D}_{R}(\pi ,{\pi}_{t+1}){D}_{R}(\pi ,{\pi}_{t+1}^{*})}}\underset{\text{relative improvement}}{\underset{\u23df}{{D}_{R}({\pi}_{t+1}^{*},{h}_{t+1})+{D}_{R}({\pi}_{t},{h}_{t+1})}}\right)$ The bound on projection error is identical to theorem A.2:
$${D}_{R}(\pi ,{\pi}_{t}){D}_{R}(\pi ,{\pi}_{t+1}^{*})\le \u03f5{L}_{R}D$$  (26) 
The bound on relative improvement is slightly different:
${D}_{R}({\pi}_{t},{h}_{t+1}){D}_{R}({\pi}_{t+1}^{*},{h}_{t+1})=R({\pi}_{t})R({\pi}_{t+1}^{*})+\u27e8\nabla R({h}_{t+1}),{\pi}_{t+1}^{*}{\pi}_{t}\u27e9$  
$=R({\pi}_{t})R({\pi}_{t+1}^{*}+\u27e8\nabla R({h}_{t+1}^{*}),{\pi}_{t+1}^{*}{\pi}_{t}\u27e9)+\u27e8{\beta}_{t},{\pi}_{t+1}^{*}{\pi}_{t}\u27e9$  
$\le \u27e8\nabla R({\pi}_{t}),{\pi}_{t}{\pi}_{t+1}^{*}\u27e9{\displaystyle \frac{\alpha}{2}}{\parallel {\pi}_{t+1}^{*}{\pi}_{t}\parallel}^{2}+\u27e8\nabla R({h}_{t+1}^{*}),{\pi}_{t+1}^{*}{\pi}_{t}\u27e9+\u27e8{\beta}_{t},{\pi}_{t+1}^{*}{\pi}_{t}\u27e9$  
$=\eta \u27e8\nabla {J}_{\mathscr{H}}({\pi}_{t}),{\pi}_{t+1}^{*}{\pi}_{t}\u27e9{\displaystyle \frac{\alpha}{2}}{\parallel {\pi}_{t+1}^{*}{\pi}_{t}\parallel}^{2}+\u27e8{\beta}_{t},{\pi}_{t+1}^{*}{\pi}_{t}\u27e9$  (27)  
$\le {\displaystyle \frac{{\eta}^{2}}{2\alpha}}{\parallel {\nabla}_{\mathscr{H}}J({\pi}_{t})\parallel}_{*}^{2}+\u27e8{\beta}_{t},{\pi}_{t+1}^{*}{\pi}_{t}\u27e9$  
$\le {\displaystyle \frac{{\eta}^{2}}{2\alpha}}{L}_{J}^{2}+\u27e8{\beta}_{t},{\pi}_{t+1}^{*}{\pi}_{t}\u27e9$  (28) 
Note here that the gradient ${\nabla}_{\mathscr{H}}J({\pi}_{t})$ is not the result of estimation. Combining equations (25), (26), (27), (28), we have:
$$\u27e8\nabla J({\pi}_{t}),{\pi}_{t}\pi \u27e9\le \frac{1}{\eta}\left({D}_{R}(\pi ,{\pi}_{t}){D}_{R}(\pi ,{\pi}_{t+1})+\u03f5{L}_{R}D+\frac{{\eta}^{2}}{2\alpha}{L}_{J}^{2}+\u27e8{\beta}_{t},{\pi}_{t+1}^{*}\pi \u27e9\right)$$  (29) 
Next, we want to bound ${\beta}_{t}$. Choose regularizer $R$ to be $\frac{1}{2}\parallel \cdot {\parallel}^{2}$ (consistent with the pseudocode in algorithm 3). We have that:
$${h}_{t+1}^{*}=\underset{h\in \mathscr{H}}{argmin}\eta \u27e8\nabla J({\pi}_{t}),h\u27e9+\frac{1}{2}{\parallel h{\pi}_{t}\parallel}^{2}$$ 
which is equivalent to:
$${h}_{t+1}^{*}={\pi}_{t}+\underset{f\in \mathcal{F}}{argmin}\eta \u27e8\nabla J({\pi}_{t}),f\u27e9+\frac{1}{2}{\parallel f\parallel}^{2}$$ 
Let ${f}_{t+1}^{*}={argmin}_{f\in \mathcal{F}}\eta \u27e8\nabla J({\pi}_{t}),f\u27e9+\frac{1}{2}{\parallel f\parallel}^{2}$. Taking the gradient over $f$, we can see that ${f}_{t+1}^{*}=\eta \nabla J({\pi}_{t})$. Let ${f}_{t+1}$ be the minimizer of ${\mathrm{min}}_{f\in \mathcal{F}}J({\pi}_{t}+\lambda f)$. We then have ${h}_{t+1}^{*}={\pi}_{t}+{f}_{t+1}^{*}$ and ${h}_{t+1}=\pi +\lambda {f}_{t+1}$. Thus ${\beta}_{t}={h}_{t+1}{h}_{t+1}^{*}={f}_{t+1}{f}_{t+1}^{*}$.
On one hand, we have
$$J({\pi}_{t}+\lambda {f}_{t+1})\le J({\pi}_{t}+\omega {f}_{t+1}^{*})\le J({\pi}_{t})+\u27e8\nabla J({\pi}_{t}),\omega {f}_{t+1}^{*}\u27e9+\frac{L}{2}{\parallel \omega {f}_{t+1}^{*}\parallel}^{2}$$ 
due to optimality of ${f}_{t+1}$ and strong smoothness property of $J$. On the other hand, since $J$ is convex, we also have the firstorder condition:
$$J({\pi}_{t}+\lambda {f}_{t+1})\ge J({\pi}_{t})+\u27e8\nabla J({\pi}_{t}),\lambda {f}_{t+1}\u27e9$$ 
Combine with the inequality above, and subtract $J({\pi}_{t})$ from both sides, and using the relationship ${f}_{t+1}^{*}=\eta \nabla J({\pi}_{t})$, we have that:
$$\u27e8\frac{1}{\eta}{f}_{t+1}^{*},\lambda {f}_{t+1}\u27e9\le \u27e8\frac{1}{\eta}{f}_{t+1}^{*},\omega {f}_{t+1}^{*}\u27e9+\frac{L{\omega}^{2}}{2}{\parallel {f}_{t+1}^{*}\parallel}^{2}$$ 
Since this is true $\forall \omega $, rearrange and choose $\omega $ such that $\frac{\omega}{\eta}\frac{L{\omega}^{2}}{2}=\frac{\lambda}{2\eta}$, namely $\omega =\frac{1\sqrt{1\lambda \eta L}}{L\eta}$, and complete the square, we can establish the bound that:
$$\parallel {f}_{t+1}{f}_{t+1}^{*}\parallel \le \eta {(\lambda L)}^{2}B$$  (30) 
for $B$ the upperbound on $\parallel {f}_{t+1}\parallel $. We thus have $\parallel {\beta}_{t}\parallel =O(\eta {(\lambda L)}^{2})$. Plugging the result from equation 30 into RHS of equation 29, we have:
$$\u27e8\nabla J({\pi}_{t}),{\pi}_{t}\pi \u27e9\le \frac{1}{\eta}\left({D}_{R}(\pi ,{\pi}_{t}){D}_{R}(\pi ,{\pi}_{t+1})+\u03f5{L}_{R}D+\frac{{\eta}^{2}}{2\alpha}{L}_{J}^{2}\right)+\left(\eta {(\lambda L)}^{2}B\right)$$  (31) 
Since $J$ is convex in $\mathscr{H}$, we have $J({\pi}_{t})J(\pi )\le \u27e8\nabla J({\pi}_{t}),{\pi}_{t}\pi \u27e9$. Similar to theorem 4.1, setting $\eta =\sqrt{\frac{1}{{\lambda}^{2}{\sigma}^{2}}(\frac{1}{T}+\u03f5)}$ and taking expectation on both sides, we have:
$$\mathbb{E}\left[\frac{1}{T}\sum _{t=1}^{T}J({\pi}_{t})\right]J({\pi}^{*})=O\left(\lambda \sigma \sqrt{\frac{1}{T}+\u03f5}+{\lambda}^{2}{L}^{2}\right)$$  (32) 
Note that unlike regret bound from theorem 4.1 under general bias, variance of gradient estimate and projection error, ${\sigma}^{2}$ here explicitly refers to the bound on neuralnetwork based policy gradient variance. The variance reduction of $\lambda \sigma $, at the expense of some bias, was also similarly noted in a recent functional regularization technique for policy gradient [19]. ∎
Appendix B Additional Experimental Results and Details
B.1 TORCS
We generate controllers for cars in The Open Racing Car Simulator (Torcs) [59]. In its full generality Torcs provides a rich environment with input from up to $89$ sensors, and optionally the 3D graphic from a chosen camera angle in the race. The controllers have to decide the values of $5$ parameters during game play, which correspond to the acceleration, brake, clutch, gear and steering of the car.
Apart from the immediate challenge of driving the car on the track, controllers also have to make racelevel strategy decisions, like making pitstops for fuel. A lower level of complexity is provided in the Practice Mode setting of TORCS. In this mode all racelevel strategies are removed. Currently, so far as we know, stateoftheart Drl models are capable of racing only in Practice Mode, and this is also the environment that we use. Here we consider the input from $29$ sensors, and decide values for the acceleration, steering, and braking actions.
We chose a suite of tracks that provide varying levels of difficulty for the learning algorithms. In particular, for the tracks Ruudskogen and Alpine2, the Ddpg agent is unable to reliably learn a policy that would complete a lap. We perform the experiments with twentyfive random seeds and report the median lap time over these twentyfive trials. However we note that the Torcs simulator is not deterministic even for a fixed random seed. Since we model the environment as a Markov Decision Process, this nondeterminism is consistent with our problem statement.
For our Deep Reinforcement Learning agents we used standard open source implementations (with pretuned hyperparameters) for the relevant domain.
All experiments were conducted on standard workstation with a 2.5 GHz Intel Core i7 CPU and a GTX 1080 Ti GPU card.
The code for the Torcs experiments can be found at: https://bitbucket.org/averma8053/propel
In Table 3 we show the lap time performance and crash ratios of Propel agents initialized with neural policies obtained via Ddpg. As discussed in Section 5, Ddpg often exhibits high variance across trials and this adversely affects the performance of the Propel agents when they are initialized via Ddpg. In Table 4 we show generalization results for the PropelTree agent. As noted in Section 5, the generalization results for PropelTree are in between those of Ddpg and PropelProg.
Verified Smoothness Property. For the program given in Figure 2 we proved using symbolic verification techniques, that $$. Here the function $\mathrm{\mathbf{p}\mathbf{e}\mathbf{e}\mathbf{k}}(.,i)$ takes in a history/sequence of sensor or action values and returns the value at position $i$, . Intuitively, the above logical implication means that if the sum of the consecutive differences of the last six $\mathrm{\U0001d681\U0001d67f\U0001d67c}$ sensor values is less than $0.003$, then the acceleration actions calculated at the last and penultimate step will not differ by more than $0.63$.
GTrack  ERoad  Aalborg  Ruudskogen  Alpine2  

Length  3186m  3260m  2588m  3274m  3774m 
PropelProgDdpg  97.76/.12  108.06/.08  140.48/.48  Cr / 0.68  Cr / 0.92 
PropelTreeDdpg  78.47/0.16  85.46/.04  Cr / 0.56  Cr / 0.68  Cr / 0.92 
GTrack  ERoad  Aalborg  Ruudskogen  Alpine2  

GTrack    95  Cr  Cr  Cr 
ERoad  84    Cr  Cr  Cr 
Aalborg  111  Cr    Cr  Cr 
Ruudskogen  154  Cr  Cr    Cr 
Alpine2  Cr  276  Cr  Cr   
MountainCar  Pendulum  

Prior  $00.59\pm 0.00$  $875.53\pm 0.00$ 
Ddpg  $97.16\pm 3.21$  $132.70\pm 6.44$ 
Trpo  $93.03\pm 1.86$  $131.54\pm 4.56$ 
Ndps  $66.98\pm 3.11$  $435.71\pm 4.83$ 
Viper  $64.86\pm 3.28$  $394.11\pm 4.97$ 
PropelProg  $95.63\pm 1.02$  $187.71\pm 2.35$ 
PropelTree  $96.56\pm 2.81$  $139.09\pm 3.31$ 
B.2 Classic Control
We present results from two classic control problems, MountainCar (with continuous actions) and Pendulum, in Table 5. We use the OpenAI Gym implementations of these environments. More information about these environments can be found at the links: MountainCar and Pendulum.
In MountainCar the goal is to drive an underpowered car up the side of a mountain in as few timesteps as possible. In Pendulum, the goal is to swing a pendulum up so that it stays upright. In both the environments an episode terminates after a maximum of $200$ timesteps.
In Table 5 we report the mean and standard deviation, over twentyfive random seeds, of the average scores over $100$ episodes for the listed agents and environments. In Figure 7 and Figure 7 we show the improvements made over the prior by the PropelProg agent in MountainCar and Pendulum respectively, with each iteration of the Propel algorithm.