Abstract
Automatic differentiation plays a prominent role in scientific computing andin modern machine learning, often in the context of powerful programmingsystems. The relation of the various embodiments of automatic differentiationto the mathematical notion of derivative is not always entirelycleardiscrepancies can arise, sometimes inadvertently. In order to studyautomatic differentiation in such programming contexts, we define a small butexpressive programming language that includes a construct for reversemodedifferentiation. We give operational and denotational semantics for thislanguage. The operational semantics employs popular implementation techniques,while the denotational semantics employs notions of differentiation familiarfrom real analysis. We establish that these semantics coincide.
Quick Read (beta)
A Simple Differentiable Programming Language
Abstract.
Automatic differentiation plays a prominent role in scientific computing and in modern machine learning, often in the context of powerful programming systems. The relation of the various embodiments of automatic differentiation to the mathematical notion of derivative is not always entirely clear—discrepancies can arise, sometimes inadvertently. In order to study automatic differentiation in such programming contexts, we define a small but expressive programming language that includes a construct for reversemode differentiation. We give operational and denotational semantics for this language. The operational semantics employs popular implementation techniques, while the denotational semantics employs notions of differentiation familiar from real analysis. We establish that these semantics coincide.
1
1. Introduction
Automatic differentiation is a set of techniques for calculating the derivatives of functions described by computer programs (e.g., (Pearlmutter and Siskind, 2008; Hascoët and Pascual, 2013; Baydin et al., 2018; Griewank, 2000)). These techniques are not required to produce symbolic representations for derivatives as in classic symbolic differentiation; on the other hand, neither do they employ finitedifference approximation methods common in numerical differentiation. Instead, they rely on the chain rule from calculus to obtain the desired derivatives from those of the programs’s basic operations. Thus, automatic differentiation is at the intersection of calculus and programming. However, the programs of interest are more than chains of operations: they may include controlflow constructs, data structures, and computational effects (e.g., sideeffects or exceptions). Calculus does not provide an immediate justification for the treatment of such programminglanguage features.
In the present work we help bridge the gap between rules for automatic differentiation in expressive programming languages and their mathematical justification in terms of denotational semantics. Specifically, we consider automatic differentiation from a programminglanguage perspective by defining and studying a small but powerful language. The language is a functional firstorder language, with only rudimentary data structures, but with conditionals and recursively defined functions (from which loops can be constructed). Additionally, it contains a construct for reversemode differentiation, explained in detail below. Our language is thus inspired by modern systems for machine learning, which include standard programming constructs and support reversemode differentiation. Reversemode differentiation permits the computation of gradients, forwardmode derivatives, and more. Indeed as our differentiation construct is a language primitive, differentiations can be nested within differentiations, allowing the computation of higherorder derivatives.
In the setting of a language such as ours, we can consider some common approaches to implementing differentiation:

•
One approach relies on code transformation, whether on source code or intermediate representations. For example, for the derivative of a conditional expression $\mathrm{\U0001d692\U0001d68f}B\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}{M}_{1}\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}{M}_{2}$, it would output $\mathrm{\U0001d692\U0001d68f}B\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}{N}_{1}\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}{N}_{2}$, where ${N}_{1}$ and ${N}_{2}$ are the derivatives of ${M}_{1}$ and ${M}_{2}$ respectively. This approach is employed, for instance, in Theano (Bergstra et al., 2010), Tensorflow 1.0 (Abadi et al., 2016a; Yu et al., 2018), and Tangent (van Merrienboer et al., 2018).

•
Another approach relies on tracing, typically eliminating control structures to produce a simpler form of code, which we call an execution trace, that can more easily be differentiated. For example, to produce the derivative of $\mathrm{\U0001d692\U0001d68f}B\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}{M}_{1}\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}{M}_{2}$, tracing would evaluate the conditional and produce a trace of the branch taken. Execution traces correspond to graphs of basic operations, and can be taken to be sequences of elementary assignments or else functional programs in Anormal form. Their derivatives can be calculated by applying the chain rule to those basic operations, perhaps via a code transformation (but now of a much simpler kind). Tracing may also record some intermediate values in an evaluation trace, to reduce, or eliminate, the need for recomputation.^{1}^{1} 1 Terminology in the automatic differentiation literature varies. Here we follow (Griewank, 2000) for evaluation traces. Our execution traces are, perhaps in somewhat different manifestations, variously termed Wengert lists or tapes or evaluation traces (Baydin et al., 2018; Pearlmutter and Siskind, 2008). They can also be seen as combinations of the operation and index traces of (Griewank, 2000).
This approach thereby conveniently avoids the problem of defining code transformations for conditionals and many other language constructs. It can also be implemented efficiently, sometimes in part with JIT compilation. For these reasons, tracebased differentiation is of growing importance. It is employed, for instance, in Autograd (Maclaurin et al., 2015), TensorFlow Eager Mode (Shankar and Dobson, 2017), Chainer (Tokui, 2018), PyTorch (PyTorch, 2018), and JAX (Frostig et al., 2018).
We therefore focus on tracebased differentiation in this paper, and give an operational semantics for our language using the tracebased approach. To do so, we define a sublanguage of execution trace terms (called simply trace terms below). These terms have no conditionals, function definitions or calls, or reversemode differentiations. They do have local definitions, corresponding to fanout in the graphs, but are not necessarily in Anormal form. Tracing is modeled by a new kind of evaluation, called symbolic evaluation. This uses an environment for the free variables of a term to remove conditionals and function calls. Function derivatives at a given value are evaluated in three stages: first, the function is traced at that value; next, the resulting trace term is symbolically differentiated (largely just using the chain rule), resulting in another such trace term; and, finally, that term is evaluated.
We do not account for some of the optimizations used in practice. Doing so would have been a more complicated enterprise, possibly with more arbitrary choices tied to implementation details, and we wished to get a more straightforward formalization working first.
From a mathematical perspective, both approaches to implementing differentiation pose correctness problems. In particular, functions defined using conditionals need not be continuous, let alone differentiable. Consider, for example, the following definition
$$ 
of the popular ReLU function (Goodfellow et al., 2016). This function is not differentiable at $0$. Further, changing the function body to $$ yields a noncontinuous function. What is more, both approaches can produce wrong answers even for differentiable functions! Consider, for example, the following definition of the identity function on the reals:
$$g(x:\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}):\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}=\mathrm{\U0001d692\U0001d68f}(x=0)\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}\mathrm{\hspace{0.33em}0}\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}x$$ 
The derivative of this function at $x=0$ is $1$. However, differentiation “by branches” (whether by code transformation or tracing) would produce the wrong answer, $0$.
In order to capture the mathematical perspective, in addition to its operational semantics we give our language a denotational semantics. This semantics is based on classical notions of differentiation from real analysis (see, for example, (Trench, 2003)). That theory concerns multivariate functions on the reals defined on open domains, i.e., partial such functions with open domains of definition. In our semantics, we make use of those that are smooth (that is, those that can be partially differentiated any number of times). A particularly pleasing aspect of this mathematical development is how well domain theory (needed to account for recursion) interacts with differentiation.
Partiality is necessary, as for any language with general recursion, but it also gives us useful flexibility in connection with differentiation. For example, let $$ be the approximation to $$ which is equal to it except on the diagonal (i.e., where both arguments are equal) where it is undefined. Then
$$ 
defines an approximation to ReLU which is undefined at $0$. The approximation to $$ is (unlike $$) continuous (i.e., the preimages of $\mathrm{\U0001d69d\U0001d69b\U0001d69e\U0001d68e}$ and $\mathrm{\U0001d68f\U0001d68a\U0001d695\U0001d69c\U0001d68e}$ are open sets), and the approximation to ReLU is differentiable wherever it is defined. Therefore, we design the semantics of our language so that it forbids functions such as $f$ but allows related approximations such as $\dot{f}$. An interesting question is how satisfactory an idealization this is of programming practice (which in any case works with approximate reals). We return to this point in the final section.
Proceeding in this way, we obtain adequacy theorems (i.e., operational soundness and completeness theorems) connecting the operational semantics of our language with a denotational semantics based on the classical theory of differentiation of partially defined multivariate real functions. Our theorems apply not only to conditional expressions but to the full language.
In sum, the main contributions of this paper are: (1) a firstorder language with conditionals, recursive function definitions, and a reversemode differentiation construct; (2) an operational semantics that models one form of tracebased differentiation; (3) a denotational semantics based on standard mathematical notions from real analysis and domain theory; and (4) theorems that show that the two semantics coincide, i.e., the derivatives computed by the operational semantics are indeed the correct derivatives in a mathematical sense. Beyond the specifics of these results, this paper aims to give some evidence of the relevance of ideas and techniques from the programminglanguages literature for programming systems that include automatic differentiation, such as current systems for machine learning.
Additional context
While traditionally associated with scientific computing, automatic differentiation is now a central component of many modern machine learning systems, and those for deep learning in particular (Goodfellow et al., 2016). These systems often employ automatic differentiation to compute the gradients of “loss functions” with respect to parameters, such as neural network weights. Loss functions measure the error resulting from particular values for the parameters. For example, when a machinelearning model is trained with a dataset consisting of pairs $({x}_{0},{y}_{0}),\mathrm{\dots},({x}_{n1},{y}_{n1})$, aiming to learn a function that maps the ${x}_{i}$’s to the ${y}_{i}$’s, the loss function may be the distance between the ${y}_{i}$’s and the values the model predicts when presented with the ${x}_{i}$’s. By applying gradient descent to adjust the parameters, this error can be reduced, until convergence or (more commonly) until the error is tolerable enough. This simple approach has proven remarkably effective: it is at the core of many recent successes of machine learning in a variety of domains.
Whereas gradients are for functions of type ${\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}\to \mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}$, for $n\ge 0$, treating the more general functions of type ${\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}\to {\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{m}$, for $n,m\ge 0$, works better with function composition, and with the composite structures such as tensors of reals used in deep learning. The literature contains two basic “modes” for differentiating such functions. Forwardmode extends the computation of the function, step by step, with the computation of derivatives; it can be seen as evaluating the function on dual numbers of the form $v+\dot{v}\epsilon $ where $\epsilon $ is nilpotent. In contrast, reversemode propagates derivatives backwards from each output, typically after the computation of the function. Reversemode differentiation is often preferred because of its superior efficiency for functions of type ${\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}\to {\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{m}$ with $$. In particular, systems for machine learning, which often deal with loss functions for which $m=1$, generally rely on reversemode differentiation. We refer the reader to the useful recent survey (Baydin et al., 2018) for additional background on these two modes of differentiation; it also discusses the use of higherorder differentiation.
Applications to machine learning are our main motivation. Accordingly, our language is loosely inspired by systems for machine learning, and the implementation strategies that we consider are ones of current interest there. We also deemphasize some concerns (e.g., numerical stability) that, at present, seem to play a larger role in scientific computing than in machine learning. As noted in (Baydin et al., 2016), the machine learning community has developed a mindset and a body of techniques distinct from those traditional in automatic differentiation.
The literature on scientific computing has addressed the correctness problem for conditionals (Beck and Fischer, 1994; Fischer, 2001), although not in the context of a formally defined programming language. In (Mayero, 2002) a formal proof of correctness was given in Coq (Bertot and Castéran, 2013) for an algorithm for the automatic differentiation of straightline sequences of Fortran assignments. Closer to machine learning, (Selsam et al., 2017) consider a stochastic graphical formalism where the nodes are random variables, and use the Lean theorem prover (de Moura et al., 2015) to establish the correctness of stochastic backpropagation. However, overall, the literature does not seem to contain semantics and theorems for a language of the kind we consider here.
Our work is also related to important papers by by Ehrhard, Regnier, et al., and by Di Gianantonio and Edalat. Ehrhard and Regnier introduce the differential $\lambda $calculus (see (Ehrhard and Regnier, 2003)); this is a simplytyped higherorder $\lambda $calculus with a forwardmode differentiation construct which can be applied to functions of any type. As described in (Blute et al., 2010), the differential $\lambda $calculus can be modeled using the category of convenient vector spaces and smooth functions between them (see (Kriegl and Michor, 1997)). They do not give an operational semantics but they do give rules for symbolic differentiation and it should not be too difficult to use them to give an operational semantics. However their language with its convenient vector space semantics only supports total functions. It therefore cannot be extended to include recursive function definitions or conditionals (even with total predicates, as continuous functions from ${\mathbb{R}}^{n}$ to the booleans are constant). Di Gianantonio and Edalat (Di Gianantonio and Edalat, 2013) prove adequacy theorems for their language, as do we, but their work differs from ours in several respects. In particular, their language has firstorder forwardmode but no reversemode differentiation: our language effectively supports both, and at all orders. On the other hand, their language allows recursivelydefined higherorder functions and accommodates functions, such as the ReLU function, which are differentiable in only a weaker sense. As far as we know, no other work on differentiable programming languages (e.g., (Pearlmutter and Siskind, 2008; Elliott, 2018; Wang et al., 2018; Shaikhha et al., 2018; Manzyuk, 2012)) gives operational and denotational semantics and proves adequacy theorems. Further afield, there is a good deal of work in the categorical literature on categories equipped with differential structure, for example (Blute et al., 2009; Bucciarelli et al., 2010).
Outline
Section 2 defines our language. Section 3 gives it an operational semantics with rules for symbolically evaluating general terms to trace terms, and for symbolically differentiating these terms. Sections 4 and 5 cover the needed mathematical material and the denotational semantics. Sections 6 establishes the correspondence between operational and denotational semantics. Section 7 concludes with discussion and some suggestions for future work.
2. A simple language
The types $S,T,U,\mathrm{\dots}$ of our language are given by the grammar:
$$\begin{array}{ccc}T\hfill & \hfill ::=\hfill & \mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}\mid \mathrm{\U0001d69e\U0001d697\U0001d692\U0001d69d}\mid T\times U\hfill \end{array}$$ 
We will make use of iterated products ${T}_{0}\times \mathrm{\dots}\times {T}_{n1}$, defined to be $\mathrm{\U0001d69e\U0001d697\U0001d692\U0001d69d}$ when $n=0$, ${T}_{0}$ when $n=1$, and, recursively, $({T}_{0}\times \mathrm{\dots}\times {T}_{n1})\times {T}_{n}$, when $n>1$; we write ${\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}$ for the $n$fold iterated product of $\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}$. Note that this type system includes the types of tensors (multidimensional arrays) of a given shape: the type of tensors of shape $({d}_{0},\mathrm{\dots},{d}_{n1})$ is the iterated product ${\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{{d}_{0}}\times \mathrm{\dots}\times {\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{{d}_{n1}}$. The terms $L,M,N,P,\mathrm{\dots}$ and boolean terms $B$ of the language are built from operation symbols $\mathrm{op}\in \mathrm{Op}$ and predicate symbols $\mathrm{pred}\in \mathrm{Pred}$. An example operation symbol could be ${\mathrm{\mathit{D}\mathit{P}\mathit{r}\mathit{o}\mathit{d}}}_{n}$ for dot product of vectors of dimension $n$ (for $n\in \mathbb{N}$); an example predicate symbol could be $$.
The terms are given by the following grammar, where $x$ and $f$ range over disjoint countably infinite sets of ordinary and function variables, respectively (so we have disjoint alphabets of ordinary and function variables):
$$\begin{array}{ccc}M\hfill & \hfill ::=\hfill & x\mid r(r\in \mathbb{R})\mid M+N\mid \mathrm{op}(M)\mid \hfill \\ & & \mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=M\mathrm{\U0001d692\U0001d697}N\mid \hfill \\ & & \ast \mid {\u27e8M,N\u27e9}_{T,U}\mid {\mathrm{\U0001d68f\U0001d69c\U0001d69d}}_{T,U}(M)\mid {\mathrm{\U0001d69c\U0001d697\U0001d68d}}_{T,U}(M)\mid \hfill \\ & & \mathrm{\U0001d692\U0001d68f}B\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}M\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}N\mid \hfill \\ & & \mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N\mid \hfill \\ & & f(M)\mid \hfill \\ & & M.{\mathrm{\U0001d69b\U0001d68d}}_{L}(x:T.N)\hfill \\ & & \\ B\hfill & \hfill ::=\hfill & \mathrm{pred}(M)\mid \mathrm{\U0001d69d\U0001d69b\U0001d69e\U0001d68e}\mid \mathrm{\U0001d68f\U0001d68a\U0001d695\U0001d69c\U0001d68e}\hfill \end{array}$$ 
These constructs are all fairly standard, except for $\mathrm{\U0001d69b\U0001d68d}$, which is for reversemode differentiation, and which we explain below. We treat addition separately from the operations to underline the fact that the commutative monoid it forms, together with zero, is basic for differentiation. For example, the rules for symbolic differentiation given below make essential use of this structure, but do not need more any more of the available vector space structure.
Note the type subscripts on pairing and projection terms. Below, we rely on these subscripts for symbolic differentiation. In practice they could, if needed, be added when typechecking.
The sets $\mathrm{FV}(M)$ and $\mathrm{FFV}(M)$ of free ordinary variables and free function variables of a term $M$ are understood as usual (and similarly for boolean terms). As is also usual, we do not distinguish $\alpha $equivalent terms (or boolean terms).
The useful abbreviation
$$\mathrm{\U0001d695\U0001d68e\U0001d69d}\u27e8{x}_{0}:{T}_{0},\mathrm{\dots},{x}_{n1}:{T}_{n1}\u27e9=M\mathrm{\U0001d692\U0001d697}N$$ 
provides an elimination construct for iterated products. When $n=0$ this is
$$\mathrm{\U0001d695\U0001d68e\U0001d69d}x:\mathrm{\U0001d69e\U0001d697\U0001d692\U0001d69d}=M\mathrm{\U0001d692\U0001d697}N$$ 
where $x\notin \mathrm{FV}(N)$; when $n=1$ it is the above let construct; otherwise, it is defined recursively by:
$$\mathrm{\U0001d695\U0001d68e\U0001d69d}\u27e8{x}_{0}:{T}_{0},\mathrm{\dots},{x}_{n}:{T}_{n}\u27e9=M\mathrm{\U0001d692\U0001d697}N=\begin{array}{c}\mathrm{\U0001d695\U0001d68e\U0001d69d}z:({T}_{0}\times \mathrm{\dots}\times {T}_{n})=M\mathrm{\U0001d692\U0001d697}\hfill \\ \mathrm{\U0001d695\U0001d68e\U0001d69d}\u27e8{x}_{0}:{T}_{0},\mathrm{\dots},{x}_{n1}:{T}_{n1}\u27e9={\mathrm{\U0001d68f\U0001d69c\U0001d69d}}_{{T}_{0}\times \mathrm{\dots}\times {T}_{n1},{T}_{n}}(z)\mathrm{\U0001d692\U0001d697}\hfill \\ \mathrm{\U0001d695\U0001d68e\U0001d69d}{x}_{n}:{T}_{n}={\mathrm{\U0001d69c\U0001d697\U0001d68d}}_{{T}_{0}\times \mathrm{\dots}\times {T}_{n1},{T}_{n}}(z)\mathrm{\U0001d692\U0001d697}N\hfill \end{array}$$ 
(where $z$ is chosen not free in $N$).
We have zero and addition only at type $\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}$. At other types we proceed inductively:
$${0}_{\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}=0\mathit{\hspace{1em}\hspace{1em}}{0}_{\mathrm{\U0001d69e\U0001d697\U0001d692\U0001d69d}}=\ast \mathit{\hspace{1em}\hspace{1em}}{0}_{T\times U}=\u27e8{0}_{T},{0}_{U}\u27e9$$ 
and:
$$\begin{array}{ccc}M{+}_{\mathrm{\U0001d69e\U0001d697\U0001d692\U0001d69d}}N\hfill & \hfill =\hfill & \begin{array}{c}\mathrm{\U0001d695\U0001d68e\U0001d69d}x:\mathrm{\U0001d69e\U0001d697\U0001d692\U0001d69d}=M\mathrm{\U0001d692\U0001d697}\hfill \\ \mathrm{\U0001d695\U0001d68e\U0001d69d}y:\mathrm{\U0001d69e\U0001d697\U0001d692\U0001d69d}=N\mathrm{\U0001d692\U0001d697}\ast \hfill \end{array}\hfill \\ & & \\ M{+}_{T\times U}N\hfill & \hfill =\hfill & \begin{array}{c}\mathrm{\U0001d695\U0001d68e\U0001d69d}\u27e8{x}_{1}:T,{x}_{2}:U\u27e9=M\mathrm{\U0001d692\U0001d697}\hfill \\ \mathrm{\U0001d695\U0001d68e\U0001d69d}\u27e8{y}_{1}:T,{y}_{2}:U\u27e9=N\mathrm{\U0001d692\U0001d697}\hfill \\ \u27e8{x}_{1}{+}_{T}{y}_{1},{x}_{2}{+}_{U}{y}_{2}\u27e9\hfill \end{array}\hfill \end{array}$$ 
Skating over the difference between terms and their denotations, $M.{\mathrm{\U0001d69b\U0001d68d}}_{L}(x:T.N)$ is the reversemode derivative at $L:T$, evaluated at $M:U$, of the function $f$, where $f(x:T):U=N$. Reversemode differentiation includes gradients as a special case. When $T={\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}$ and $U=\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}$, the gradient of $f$ at $L$ is given by:
$${\mathrm{\U0001d690\U0001d69b\U0001d68a\U0001d68d}}_{L}(x:{\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}.N)=\mathrm{\hspace{0.33em}1}.{\mathrm{\U0001d69b\U0001d68d}}_{L}(x:{\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}.N)$$ 
For definitions of gradients, Jacobians, and derivatives see Section 4.1 below, particularly equations (1), (2), (3), and (4). More generally, for an introduction to real analysis including vectorvalued functions of several variables and their differentials and Jacobians, see, for example, (Trench, 2003).
As in (Christianson, 2012), and as validated by equation (6) below, forwardmode differentiation can be defined using nested reversemode differentiation. We can set:
$$M.{\mathrm{\U0001d68f\U0001d68d}}_{U,L}(x:T.N)=M.{\mathrm{\U0001d69b\U0001d68d}}_{{0}_{U}}(y:U.y.{\mathrm{\U0001d69b\U0001d68d}}_{L}(x:T.N))$$ 
So our language effectively also has forwardmode differentiation.
Function definitions can be recursive. Indeed a function can even be defined in terms of its own derivative: in a recursive function definition $\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N$, the language allows occurrences of $f$ within the term ${N}^{\prime}$ in a subterm ${M}^{\prime}.{\mathrm{\U0001d69b\U0001d68d}}_{{L}^{\prime}}(y:{T}^{\prime}.{N}^{\prime})$ of $M$. This generality may be useful—examples (Maclaurin, 2017, p22) have arisen in the context of Autograd (Maclaurin et al., 2015). Pleasantly, both our operational and denotational semantics accommodate it without special complications. When we define a function without recursion, we may abbreviate $\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}$ to $\mathrm{\U0001d695\U0001d68e\U0001d69d}$.
Turning to typing, operation and predicate symbols have assumed arities, written $\mathrm{op}:T\to U$ and $\mathrm{pred}:T$; we write ${\mathrm{Op}}_{T,U}$ for the set of operation symbols of arity $T\to U$. As examples, we would have ${\mathrm{\mathit{D}\mathit{P}\mathit{r}\mathit{o}\mathit{d}}}_{n}:{\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}\times {\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}\to \mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}$ and $$. Figure 1 gives typing rules for sequents of the forms
$$\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T\mathit{\hspace{1em}\hspace{1em}}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2B$$ 
where (type) environments $\mathrm{\Gamma}$ have the form
$${x}_{0}:{T}_{0},\mathrm{\dots},{x}_{n1}:{T}_{n1}$$ 
(${x}_{i}$ all different) and where function (type) environments $\mathrm{\Phi}$ have the form
$${f}_{0}:{T}_{0}\to {U}_{0},\mathrm{\dots},{f}_{n1}:{T}_{n1}\to {U}_{n1}$$ 
(${f}_{i}$ all different). We adopt the usual overwriting notations $\mathrm{\Gamma}[{\mathrm{\Gamma}}^{\prime}]$ and $\mathrm{\Phi}[{\mathrm{\Phi}}^{\prime}]$ for type environments.
$$\begin{array}{c}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2x:T\hspace{1em}(x:T\in \mathrm{\Gamma})\hspace{1em}\hspace{1em}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2r:\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}\hspace{1em}(r\in \mathbb{R})\hfill \\ \\ \frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}\mathit{\hspace{1em}}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2N:\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M+N:\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}\hfill \\ \\ \frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2\mathrm{op}(M):U}\mathit{\hspace{1em}\hspace{0.5em}}\left(\mathrm{op}:T\to U\right)\hfill \\ \\ \frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T\mathit{\hspace{1em}\hspace{0.5em}}\mathrm{\Phi}\mid \mathrm{\Gamma}[x:T]\u22a2N:U}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=M\mathrm{\U0001d692\U0001d697}N:U}\hfill \\ \\ \mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2\ast :\mathrm{\U0001d69e\U0001d697\U0001d692\U0001d69d}\mathit{\hspace{1em}\hspace{1em}}\frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T\mathit{\hspace{1em}\hspace{1em}}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2N:U}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2{\u27e8M,N\u27e9}_{T,U}:T\times U}\hfill \\ \\ \frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T\times U}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2{\mathrm{\U0001d68f\U0001d69c\U0001d69d}}_{T,U}(M):T}\mathit{\hspace{1em}\hspace{1em}}\frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T\times U}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2{\mathrm{\U0001d69c\U0001d697\U0001d68d}}_{T,U}(M):U}\hfill \\ \\ \frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2B\mathit{\hspace{1em}\hspace{0.5em}}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T\mathit{\hspace{1em}\hspace{0.5em}}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2N:T}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2\mathrm{\U0001d692\U0001d68f}B\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}M\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}N:T}\hfill \\ \\ \frac{\mathrm{\Phi}[f:T\to U]\mid \mathrm{\Gamma}[x:T]\u22a2M:U\mathit{\hspace{1em}\hspace{0.5em}}\mathrm{\Phi}[f:T\to U]\mid \mathrm{\Gamma}\u22a2N:S}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N:S}\mathit{\hspace{1em}}\left(\mathrm{FV}\left(M\right)\subseteq \left\{x\right\}\right)\hfill \\ \\ \frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2f(M):U}\mathit{\hspace{1em}}\left(f:T\to U\in \mathrm{\Phi}\right)\hfill \\ \\ \frac{\mathrm{\Phi}\mid \mathrm{\Gamma}[x:T]\u22a2N:U\mathit{\hspace{1em}\hspace{0.5em}}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2L:T\mathit{\hspace{1em}\hspace{0.5em}}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:U}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M.{\mathrm{\U0001d69b\U0001d68d}}_{L}(x:T.N):T}\hfill \\ \\ \mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2\mathrm{\U0001d69d\U0001d69b\U0001d69e\U0001d68e}\mathit{\hspace{1em}}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2\mathrm{\U0001d68f\U0001d68a\U0001d695\U0001d69c\U0001d68e}\mathit{\hspace{1em}}\frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2\mathrm{pred}(M)}\left(\mathrm{pred}:T\right)\hfill \end{array}$$ 
The typing rule for function definitions forbids any global variable occurrences (i.e., free variables in function definitions). This restriction involves no loss in expressiveness: as in lambda lifting, one can just add any global variables to a function’s parameters, and then apply the function to the global variables wherever it is called. The restriction enabled us to prove part (2) of Theorem 6.2 (below), but we conjecture it is not needed.
Our various abbreviations have natural admissible typing rules:
$$\frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:{T}_{0}\times \mathrm{\dots},\times {T}_{n1}\mathit{\hspace{1em}\hspace{0.5em}}\mathrm{\Phi}\mid \mathrm{\Gamma}[{x}_{0}:{T}_{0},\mathrm{\dots},{x}_{n1}:{T}_{n1}]\u22a2N:U}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2\mathrm{\U0001d695\U0001d68e\U0001d69d}\u27e8{x}_{0}:{T}_{0},\mathrm{\dots},{x}_{n1}:{T}_{n1}\u27e9=M\mathrm{\U0001d692\U0001d697}N:U}$$ 
$$\frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2N:{\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}\mathit{\hspace{1em}\hspace{1em}}\mathrm{\Phi}\mid \mathrm{\Gamma}[x:{\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}]\u22a2M:\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2{\mathrm{\U0001d690\U0001d69b\U0001d68a\U0001d68d}}_{N}(x:{\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}.M):{\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}}^{n}}$$ 
$$\frac{\mathrm{\Phi}\mid \mathrm{\Gamma}[x:T]\u22a2N:U\mathit{\hspace{1em}\hspace{0.5em}}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2L:T\mathit{\hspace{1em}\hspace{0.5em}}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M.{\mathrm{\U0001d68f\U0001d68d}}_{U,L}(x:T.N):U}$$ 
$$\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2{0}_{T}:T\mathit{\hspace{1em}\hspace{1em}}\frac{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T\mathit{\hspace{1em}\hspace{1em}}\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2N:T}{\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M{+}_{T}N:T}$$ 
We may write $\mathrm{\Gamma}\u22a2M:T$ (or $\u22a2M:T$) instead of $\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T$ if $M$ has no free ordinary (or function) variables (and similarly for boolean terms). Typing is unique: for any $\mathrm{\Gamma}$, $\mathrm{\Phi}$, and $M$ there is at most one type $T$ such that $\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T$ holds.
As an example, we use our language to program a miniature version of an algorithm for training a machine learning model by gradient descent, loosely based on (Goodfellow et al., 2016, Algorithm 8.1). In such training, one often starts with an untrained model, which is a function from inputs (for example, images) and parameter values to output “predictions” (for example, image labels). Relying on a dataset of input/output pairs, one then picks values of the parameters by gradient descent, as indicated in the Introduction. In our miniature version, we treat inputs, parameter values, and outputs as reals, and we assume that the training data consists of only one fixed input/output pair $(a,b)$. We also assume that we have real constants ${w}_{0}$ (for the initial value for gradient descent), $\mathrm{\mathit{r}\mathit{a}\mathit{t}\mathit{e}}$ (for the learning rate, which is fixed) and $\mathrm{\mathit{m}\mathit{a}\mathit{x}\mathit{L}\mathit{o}\mathit{s}\mathit{s}}$ (for the desired maximum loss on the dataset), and the infix predicate symbol $$. We can then define the trained model from the untrained model and a loss function as follows:
$$ 
The example above is typical of what can be expressed in our language, and many variants of machine learning techniques that rely on gradient descent (e.g., of the kind presented in (Goodfellow et al., 2016), and commonly used in systems like TensorFlow) are in scope as well. For instance, there is no difficulty in expressing optimization with momentum, or differentially private stochastic gradient descent (e.g., (Song et al., 2013; Abadi et al., 2016b)). Probabilistic choice may be treated via random number generators, as is done in practice. Architectures that rely on convolutions or RNN cells can be expressed, even conveniently, with a suitable choice of primitives.
3. Operational semantics
We give a bigstep operational semantics, specified with Felleisen and Friedman’s method (Felleisen and Friedman, 1987) using evaluation contexts and redexes. Other styles of operational semantics accommodating differentiation are surely also possible.
Terms and boolean terms are (ordinarily) evaluated to closed values and (necessarily) closed boolean values. The most original aspect of our operational semantics concerns the evaluation of differential terms; this is based on the tracebased approach outlined in the Introduction, and uses a second mode of evaluation: symbolic evaluation.
The core idea is that to evaluate a differential term
$$M.{\mathrm{\U0001d69b\U0001d68d}}_{L}(x:T.N)$$ 
one first evaluates $L$ and $M$, and then performs differentiation before evaluating further. There are two differentiation stages. First, using the closed value $V$ of $L$ for the differentiation variable $x$, $N$ is symbolically evaluated to a trace term $C$, thereby removing all control constructs from $N$, but possibly keeping the variable $x$ free in $C$, as the derivative may well depend on it. For example, when $N$ is $$, the value $V$ allows us to evaluate the guard of the conditional, but we do not replace the occurrence of $x$ in the $\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}$ branch with $V$. Second, $C$ is symbolically differentiated at $V$ with respect to $x$.
However, this idea is not enough by itself as the differential term may occur inside yet another differential term. One therefore also needs to be able to symbolically evaluate the differential term. That is done much as in the case of ordinary evaluation, but now symbolically evaluating redexes in $L$ and $M$ until one is left with the problem of symbolically evaluating a term of the form
$$W.{\mathrm{\U0001d69b\U0001d68d}}_{V}(x:T.N)$$ 
where $V$ and $W$ are values that may contain free variables. One then proceeds as above, symbolically evaluating $N$ (now using the closed value ${V}^{\prime}$ of $V$) and then performing symbolic differentiation. As there is some duplication between these two symbolic and ordinary evaluation processes, our rule for ordinarily evaluating a differential term is designed, when executed, to first symbolically evaluate the term, and then ordinarily evaluate the resulting trace term.
The need to keep track of differentiation variables and their values for symbolic evaluation leads us to use value environments for ordinary variables. It is convenient to also use them for ordinary evaluation and to use function environments for function variables for both modes of evaluation.
Values $V,W,X,\mathrm{\dots}$ are terms given by the grammar:
$$\begin{array}{ccc}V\hfill & \hfill ::=\hfill & x\mid r(r\in \mathbb{R})\mid \ast \mid {\u27e8V,W\u27e9}_{T,U}\hfill \end{array}$$ 
Note that, as indicated above, values may have free variables for the purposes of differentiation. Boolean values ${V}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}$ are boolean terms given by:
$${V}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}::=\mathrm{\U0001d69d\U0001d69b\U0001d69e\U0001d68e}\mid \mathrm{\U0001d68f\U0001d68a\U0001d695\U0001d69c\U0001d68e}$$ 
Closed values have unique types $\u22a2V:{T}_{V}$; the set of closed values of type $T$ is ${\mathrm{Val}}_{T}$; and the set of boolean values is ${\mathrm{Val}}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}$. We assume available operation and predicate symbol evaluation functions
$$\mathrm{ev}:{\mathrm{Op}}_{T,U}\times {\mathrm{Val}}_{T}\rightharpoonup {\mathrm{Val}}_{U}\mathit{\hspace{1em}\hspace{1em}\hspace{1em}}\mathrm{bev}:{\mathrm{Pred}}_{T}\times {\mathrm{Val}}_{T}\rightharpoonup {\mathrm{Val}}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}$$ 
We also assume that for every operator $\mathrm{op}$ of arity $T\to U$ there is an operator ${\mathrm{op}}_{r}$ of arity $T\times U\to T$. The idea is that ${\mathrm{op}}_{r}(\u27e8L,M\u27e9)$ is the reversemode derivative of $\mathrm{op}$ at $L$ evaluated at $M$. We write $M.{\mathrm{op}}_{r}(L)$ for ${\mathrm{op}}_{r}(\u27e8L,M\u27e9)$. For example, for ${\mathrm{\mathit{D}\mathit{P}\mathit{r}\mathit{o}\mathit{d}}}_{\mathit{2}}$ we would have:
$$\mathrm{ev}({\mathrm{\mathit{D}\mathit{P}\mathit{r}\mathit{o}\mathit{d}}}_{2},\u27e8\u27e8a,b\u27e9,\u27e8{a}^{\prime},{b}^{\prime}\u27e9\u27e9)=a{a}^{\prime}+b{b}^{\prime}$$ 
and
$$\mathrm{ev}({({\mathrm{\mathit{D}\mathit{P}\mathit{r}\mathit{o}\mathit{d}}}_{\mathit{2}})}_{r},\u27e8\u27e8\u27e8a,b\u27e9,\u27e8{a}^{\prime},{b}^{\prime}\u27e9\u27e9,c\u27e9)=\u27e8\u27e8c{a}^{\prime},c{b}^{\prime}\u27e9,\u27e8ca,cb\u27e9\u27e9$$ 
We next define (value) environments $\rho $, function environments $\phi $, and (recursive function) closures $\mathrm{Cl}$, the last two mutually recursively:


Value environments are finite functions
$$\rho =\{{x}_{0}\mapsto {V}_{0},\mathrm{\dots},{x}_{n1}\mapsto {V}_{n1}\}$$ from ordinary variables to closed values.


Every finite function
$$\phi =\{{f}_{0}\mapsto {\mathrm{Cl}}_{0},\mathrm{\dots},{f}_{n1}\mapsto {\mathrm{Cl}}_{n1}\}$$ from function variables to closures is a function environment.


If $\mathrm{FV}(M)\subseteq \{x\}$ and $\mathrm{FFV}(M)\backslash \{f\}\subseteq \mathrm{Dom}(\phi )$ then $\u27e8\phi ,f,x,T,U,M\u27e9$ is a closure, written ${\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{\phi}(f(x:T):U.M)$.
For any $V$ and $\rho $ with $\mathrm{FV}(V)\subseteq \mathrm{Dom}(\rho )$, $\rho (V)$ is the closed value obtained by substituting $\rho (x)$ for all free occurrences of $x$ in $V$.
Trace terms $C,D,\mathrm{\dots}$, are defined as follows:
$$\begin{array}{ccc}C\hfill & \hfill ::=\hfill & x\mid r(r\in \mathbb{R})\mid C+D\mid \mathrm{op}(C)\mid \hfill \\ & & \mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=C\mathrm{\U0001d692\U0001d697}D\mid \hfill \\ & & \ast \mid {\u27e8C,D\u27e9}_{T,U}\mid {\mathrm{\U0001d68f\U0001d69c\U0001d69d}}_{T,U}(C)\mid {\mathrm{\U0001d69c\U0001d697\U0001d68d}}_{T,U}(C)\hfill \end{array}$$ 
They are the terms with no conditionals, function definitions or applications, or differentiations.
We will define two ordinary evaluation relations, and one symbolic one:

•
For all $\phi $ and $\rho $ we define evaluation relations between terms and closed values and between boolean terms and closed boolean values via rules establishing sequents of the forms:
$$\phi \mid \rho \u22a2M\Rightarrow V\mathit{\hspace{1em}\hspace{1em}\hspace{1em}}\phi \mid \rho \u22a2B\Rightarrow {V}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}$$ 
•
For all $\phi $ and $\rho $ we define a symbolic evaluation relation between terms and trace terms via rules establishing sequents of the form:
$$\phi \mid \rho \u22a2M\rightsquigarrow C$$
Evaluation contexts (boolean evaluation contexts), ranged over by $E$ (resp. ${E}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}$ ), are terms with a unique hole $[]$ :
$$\begin{array}{ccc}\hfill E\hfill & \hfill ::=\hfill & []\mid E+N\mid V+E\mid \mathrm{op}(E)\mid \hfill \\ & & \mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=E\mathrm{\U0001d692\U0001d697}N\mid \hfill \\ & & {\u27e8E,N\u27e9}_{T,U}\mid {\u27e8V,E\u27e9}_{T,U}\mid {\mathrm{\U0001d68f\U0001d69c\U0001d69d}}_{T,U}(E)\mid {\mathrm{\U0001d69c\U0001d697\U0001d68d}}_{T,U}(E)\mid \hfill \\ & & \mathrm{\U0001d692\U0001d68f}{E}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}M\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}N\mid \hfill \\ & & f(E)\mid \hfill \\ & & M.{\mathrm{\U0001d69b\U0001d68d}}_{E}(x:T.N)\mid E.{\mathrm{\U0001d69b\U0001d68d}}_{V}(x:T.N)\hfill \\ & & \\ \hfill {E}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}\hfill & \hfill ::=\hfill & \mathrm{pred}(E)\hfill \end{array}$$ 
We write $E[M]$ for the term obtained by replacing the hole $[]$ in $E$ by the term $M$ and ${E}_{bool}[M]$ similarly; a context $E$ is trivial if it is $[]$; and $\mathrm{FV}$ and $\mathrm{FFV}$ are extended to contexts. We have $\mathrm{FV}(E[M])=\mathrm{FV}(E)\cup \mathrm{FV}(M)$ and $\mathrm{FFV}(E[M])=\mathrm{FFV}(E)\cup \mathrm{FFV}(M)$ and similarly for boolean contexts.
Redexes, ranged over by $R$, and boolean redexes, ranged over by ${R}_{bool}$, are given by:
$$\begin{array}{ccc}R\hfill & \hfill ::=\hfill & V+W\mid \mathrm{op}(V)\mid \hfill \\ & & \mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}N\mid \hfill \\ & & {\mathrm{\U0001d68f\U0001d69c\U0001d69d}}_{T,U}(V)\mid {\mathrm{\U0001d69c\U0001d697\U0001d68d}}_{T,U}(V)\mid \hfill \\ & & \mathrm{\U0001d692\U0001d68f}{V}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}M\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}N\mid \hfill \\ & & \mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N\mid f(V)\mid \hfill \\ & & W.{\mathrm{\U0001d69b\U0001d68d}}_{V}(x:T.N)\hfill \\ & & \\ {R}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}\hfill & \hfill ::=\hfill & \mathrm{pred}(V)\hfill \end{array}$$ 
Note that boolean expressions are useful here in that they enable separate conditional and predicate redexes, and so evaluating predicates and making choices are distinct in the operational semantics.
The next lemma is the basis of a division into cases that supports operational semantics using evaluation contexts in the style of Felleisen and Friedman.
Lemma 3.1 (Evaluation context analysis).

(1)
Every term $M$, other than a value, has exactly one of the following two forms:

•
$E[R]$ for a unique evaluation context and redex, or

•
$E[{R}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}]$ for a unique, and nontrivial, evaluation context and boolean redex.

•

(2)
Every boolean term $B$, other than a boolean value, has exactly one of the following two forms:

•
${E}_{bool}[R]$ for a unique, and nontrivial, boolean evaluation context and redex, or

•
${E}_{bool}[{R}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}]$ for a unique boolean evaluation context and boolean redex.

•
The next lemma is useful to track types when proving theorems about the operational semantics.
Lemma 3.2 (Evaluation context polymorphism).
Suppose that $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}E\mathit{}\mathrm{[}M\mathrm{]}\mathrm{:}T$. Then, for some type $U$ we have $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}M\mathrm{:}U$ and, whenever $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}N\mathrm{:}U$, we have $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}E\mathit{}\mathrm{[}N\mathrm{]}\mathrm{:}T$.
Analogous results hold for typings of any of the forms $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}E\mathit{}\mathrm{[}B\mathrm{]}\mathrm{:}T$ or $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}{E}_{\mathrm{bool}}\mathit{}\mathrm{[}M\mathrm{]}$ or $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}{E}_{\mathrm{bool}}\mathit{}\mathrm{[}B\mathrm{]}$.
By the uniqueness of types, the types whose existence is claimed in the above lemma are unique.
$$\begin{array}{c}\phi \mid \rho \u22a2V\Rightarrow \rho \left(V\right)\hfill \\ \\ \frac{\phi \mid \rho \u22a2V\Rightarrow r\mathit{\hspace{1em}}\phi \mid \rho \u22a2W\Rightarrow s}{\phi \mid \rho \u22a2V+W\Rightarrow t}\hfill \\ \left(\text{where}t=r+s\right)\hfill \\ \\ \frac{\phi \mid \rho \u22a2V\Rightarrow {V}^{\prime}}{\phi \mid \rho \u22a2\mathrm{op}(V)\Rightarrow W}\hfill \\ \left(\text{where}\mathrm{ev}(\mathrm{op},{V}^{\prime})\simeq W\right)\hfill \\ \\ \frac{\phi \mid \rho \u22a2V\Rightarrow {V}^{\prime}\mathit{\hspace{1em}}\phi \mid \rho [{V}^{\prime}/x]\u22a2N\Rightarrow W}{\phi \mid \rho \u22a2\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}N\Rightarrow W}\hfill \\ \\ \frac{\phi \mid \rho \u22a2V\Rightarrow {\u27e8{W}_{1},{W}_{2}\u27e9}_{T,U}}{\phi \mid \rho \u22a2{\mathrm{\U0001d68f\U0001d69c\U0001d69d}}_{T,U}(V)\Rightarrow {W}_{1}}\hfill \\ \\ \frac{\phi \mid \rho \u22a2V\Rightarrow {\u27e8{W}_{1},{W}_{2}\u27e9}_{T,U}}{\phi \mid \rho \u22a2{\mathrm{\U0001d69c\U0001d697\U0001d68d}}_{T,U}(V)\Rightarrow {W}_{2}}\hfill \\ \end{array}$$ 
$$\begin{array}{c}\frac{\phi \mid \rho \u22a2M\Rightarrow V}{\phi \mid \rho \u22a2\mathrm{\U0001d692\U0001d68f}\mathrm{\U0001d69d\U0001d69b\U0001d69e\U0001d68e}\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}M\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}N\Rightarrow V}\hfill \\ \\ \frac{\phi \mid \rho \u22a2N\Rightarrow V}{\phi \mid \rho \u22a2\mathrm{\U0001d692\U0001d68f}\mathrm{\U0001d68f\U0001d68a\U0001d695\U0001d69c\U0001d68e}\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}M\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}N\Rightarrow V}\hfill \\ \\ \frac{\phi [{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{\phi}(f(x:T):U.M)/f]\mid \rho \u22a2N\Rightarrow V}{\phi \mid \rho \u22a2\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N\Rightarrow V}\hfill \\ \\ \frac{\phi \mid \rho \u22a2V\Rightarrow {V}^{\prime}\mathit{\hspace{1em}}{\phi}^{\prime}[\phi (f)/f]\mid \{x\mapsto {V}^{\prime}\}\u22a2M\Rightarrow W}{\phi \mid \rho \u22a2f(V)\Rightarrow W}\hfill \\ \left(\text{where}\phi \left(f\right)={\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{{\phi}^{\prime}}(f\left(x:T\right):U.M)\right)\hfill \\ \\ \frac{\phi \mid \rho \u22a2W.{\mathrm{\U0001d69b\U0001d68d}}_{V}(x:T.N)\rightsquigarrow C\mathit{\hspace{1em}}\phi \mid \rho \u22a2C\Rightarrow X}{\phi \mid \rho \u22a2W.{\mathrm{\U0001d69b\U0001d68d}}_{V}(x:T.N)\Rightarrow X}\hfill \\ \\ \frac{\phi \mid \rho \u22a2V\Rightarrow {V}^{\prime}}{\phi \mid \rho \u22a2\mathrm{pred}(V)\Rightarrow {W}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}}\hfill \\ \left(\text{where}\mathrm{bev}(\mathrm{pred},{V}^{\prime})\simeq {W}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}\right)\hfill \end{array}$$ 
$$\begin{array}{c}\frac{\phi \mid \rho \u22a2R\Rightarrow V\mathit{\hspace{1em}}\phi \mid \rho [V/x]\u22a2E[x]\Rightarrow W}{\phi \mid \rho \u22a2E[R]\Rightarrow W}\hspace{1em}\u2006\left(E\text{nontrivial,}x\notin \mathrm{Dom}\left(\rho \right)\right)\hfill \\ \\ \frac{\phi \mid \rho \u22a2{R}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}\Rightarrow {V}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}\mathit{\hspace{1em}}\phi \mid \rho \u22a2E[{V}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}]\Rightarrow W}{\phi \mid \rho \u22a2E[{R}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}]\Rightarrow W}\hfill \\ \\ \frac{\phi \mid \rho \u22a2R\Rightarrow V\mathit{\hspace{1em}}\phi \mid \rho [V/x]\u22a2{E}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}[x]\Rightarrow {W}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}}{\phi \mid \rho \u22a2{E}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}[R]\Rightarrow {W}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}}\hspace{1em}\hspace{1em}\left(x\notin \mathrm{Dom}\left(\rho \right)\right)\hfill \\ \\ \frac{\phi \mid \rho \u22a2{R}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}\Rightarrow {V}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}\mathit{\hspace{1em}}\phi \mid \rho \u22a2{E}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}[{V}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}]\Rightarrow {W}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}}{\phi \mid \rho \u22a2{E}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}[{R}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}]\Rightarrow {W}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}}\hfill \end{array}$$ 
$$\begin{array}{c}\phi \mid \rho \u22a2V\rightsquigarrow V\hfill \\ \\ \phi \mid \rho \u22a2V+W\rightsquigarrow V+W\hfill \\ \\ \phi \mid \rho \u22a2\mathrm{op}\left(V\right)\rightsquigarrow \mathrm{op}\left(V\right)\hfill \\ \\ \frac{\phi \mid \rho \u22a2V\Rightarrow {V}^{\prime}\mathit{\hspace{1em}}\phi \mid \rho [{V}^{\prime}/x]\u22a2N\rightsquigarrow C}{\phi \mid \rho \u22a2\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}N\rightsquigarrow \mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}C}\hfill \\ \\ \phi \mid \rho \u22a2{\mathrm{\U0001d68f\U0001d69c\U0001d69d}}_{T,U}\left(V\right)\rightsquigarrow {\mathrm{\U0001d68f\U0001d69c\U0001d69d}}_{T,U}\left(V\right)\hfill \\ \\ \phi \mid \rho \u22a2{\mathrm{\U0001d69c\U0001d697\U0001d68d}}_{T,U}\left(V\right)\rightsquigarrow {\mathrm{\U0001d69c\U0001d697\U0001d68d}}_{T,U}\left(V\right)\hfill \\ \end{array}$$ 
$$\begin{array}{c}\frac{\phi \mid \rho \u22a2M\rightsquigarrow C}{\phi \mid \rho \u22a2\mathrm{\U0001d692\U0001d68f}\mathrm{\U0001d69d\U0001d69b\U0001d69e\U0001d68e}\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}M\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}N\rightsquigarrow C}\hfill \\ \\ \frac{\phi \mid \rho \u22a2N\rightsquigarrow C}{\phi \mid \rho \u22a2\mathrm{\U0001d692\U0001d68f}\mathrm{\U0001d68f\U0001d68a\U0001d695\U0001d69c\U0001d68e}\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}M\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}N\rightsquigarrow C}\hfill \\ \\ \frac{\phi [{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{\phi}(f(x:T):U.M)/f]\mid \rho \u22a2N\rightsquigarrow C}{\phi \mid \rho \u22a2\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N\rightsquigarrow C}\hfill \\ \\ \frac{\phi \mid \rho \u22a2V\Rightarrow {V}^{\prime}\mathit{\hspace{1em}}{\phi}^{\prime}[\phi (f)/f]\mid \{x\mapsto {V}^{\prime}\}\u22a2M\rightsquigarrow C}{\phi \mid \rho \u22a2f(V)\rightsquigarrow \mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}C}\hfill \\ (\text{where}\phi \left(f\right)={\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{{\phi}^{\prime}}(f(x:T):U.M))\hfill \\ \\ \frac{\phi \mid \rho \u22a2V\Rightarrow {V}^{\prime}\mathit{\hspace{1em}}\phi \mid \rho [{V}^{\prime}/x]\u22a2N\rightsquigarrow C}{\phi \mid \rho \u22a2W.{\mathrm{\U0001d69b\U0001d68d}}_{V}(x:T.N)\rightsquigarrow W.{\mathcal{R}}_{V}(x:T.C)}\hfill \\ \end{array}$$ 
$$\begin{array}{c}\frac{\phi \mid \rho \u22a2R\rightsquigarrow C\mathit{\hspace{1em}}\phi \mid \rho \u22a2C\Rightarrow V\mathit{\hspace{1em}}\phi \mid \rho [V/x]\u22a2E[x]\rightsquigarrow D}{\phi \mid \rho \u22a2E[R]\rightsquigarrow \mathrm{\U0001d695\U0001d68e\U0001d69d}x:{T}_{V}=C\mathrm{\U0001d692\U0001d697}D}\hspace{1em}\left(E\text{nontrivial and}x\notin \mathrm{Dom}\left(\rho \right)\right)\hfill \\ \\ \frac{\phi \mid \rho \u22a2{R}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}\Rightarrow {V}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}\mathit{\hspace{1em}}\phi \mid \rho \u22a2E[{V}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}]\rightsquigarrow C}{\phi \mid \rho \u22a2E[{R}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}]\rightsquigarrow C}\hfill \end{array}$$ 
$$\begin{array}{ccc}W.{\mathcal{R}}_{V}(x:T.y)\hfill & =\hfill & \{\begin{array}{cc}({0}_{{T}_{0}},\mathrm{\dots},{0}_{{T}_{i1}},W,{0}_{{T}_{i+1}},\mathrm{\dots},{0}_{{T}_{n1}})\hfill & (y={x}_{i})\hfill \\ {0}_{T}\hfill & (y\ne \text{any}{x}_{i})\hfill \end{array}\hfill \\ & & \\ W.{\mathcal{R}}_{V}(x:T.r)\hfill & =\hfill & {0}_{T}\hspace{1em}\hspace{1em}(r\in \mathbb{R})\hfill \\ & & \\ W.{\mathcal{R}}_{V}(x:T.D+E)\hfill & =\hfill & W.{\mathcal{R}}_{V}(x:T.D){+}_{T}W.{\mathcal{R}}_{V}(x:T.E)\hfill \\ & & \\ W.{\mathcal{R}}_{V}(x:T.\mathrm{op}(D))\hfill & =\hfill & \begin{array}{c}\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}\hfill \\ \mathrm{\U0001d695\U0001d68e\U0001d69d}y:S=W.{\mathrm{op}}_{r}(D)\mathrm{\U0001d692\U0001d697}y.{\mathcal{R}}_{V}(x:T.D)\hfill \end{array}\hfill \\ & & (y\notin \mathrm{FV}(V),\mathrm{op}:S\to U)\hfill \\ & & \\ W.{\mathcal{R}}_{V}(x:T.\mathrm{\U0001d695\U0001d68e\U0001d69d}y:S=D\mathrm{\U0001d692\U0001d697}E)\hfill & =\hfill & \begin{array}{c}\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}\hfill \\ \mathrm{\U0001d695\U0001d68e\U0001d69d}y:S=D\mathrm{\U0001d692\U0001d697}\hfill \\ W.{\mathcal{R}}_{V}(x:T.E){+}_{T}\hfill \\ (\mathrm{\U0001d695\U0001d68e\U0001d69d}\overline{y}:S=W.{\mathcal{R}}_{y}(y:S.E)\mathrm{\U0001d692\U0001d697}\overline{y}.{\mathcal{R}}_{V}(x:T.D))\hfill \end{array}\hfill \\ & & (y\notin \mathrm{FV}(W),y,\overline{y}\notin \mathrm{FV}(V,D))\hfill \\ & & \\ W.{\mathcal{R}}_{V}(x:T.\ast )\hfill & =\hfill & {0}_{T}\hfill \\ & & \\ W.{\mathcal{R}}_{V}(x:T.{\u27e8M,N\u27e9}_{U,S})\hfill & =\hfill & \begin{array}{c}\mathrm{\U0001d695\U0001d68e\U0001d69d}y:U,z:S=W\mathrm{\U0001d692\U0001d697}\hfill \\ y.{\mathcal{R}}_{V}(x:T.M){+}_{T}z.{\mathcal{R}}_{V}(x:T.N)\hfill \\ (y,z\notin \mathrm{FV}(V,M,N))\hfill \end{array}\hfill \\ & & \\ W.{\mathcal{R}}_{V}(x:T.{\mathrm{\U0001d68f\U0001d69c\U0001d69d}}_{U,S}(D))\hfill & =\hfill & \begin{array}{c}\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}\hfill \\ \mathrm{\U0001d695\U0001d68e\U0001d69d}y:U\times S=D\mathrm{\U0001d692\U0001d697}\u27e8W,{0}_{S}\u27e9.{\mathcal{R}}_{V}(x:T.D)\hfill \end{array}\hfill \\ & & (y\notin \mathrm{FV}(V,W,D))\hfill \\ & & \\ W.{\mathcal{R}}_{V}(x:T.{\mathrm{\U0001d69c\U0001d697\U0001d68d}}_{U,S}(D))\hfill & =\hfill & \begin{array}{c}\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}\hfill \\ \mathrm{\U0001d695\U0001d68e\U0001d69d}y:U\times S=D\mathrm{\U0001d692\U0001d697}\u27e8{0}_{U},W\u27e9.{\mathcal{R}}_{V}(x:T.D)\hfill \end{array}\hfill \\ & & (y\notin \mathrm{FV}(V,W,D))\hfill \\ & & \end{array}$$ 
The rules for ordinary evaluation are given in Figures 2 and 3; those for symbolic evaluation are given in Figures 4 and 5. The definitions are mutually recursive. They make use of the symbolic differentiation of trace terms: given a trace term $C$, and values $V$ and $W$ (not necessarily closed), we define a trace term
$$W.{\mathcal{R}}_{V}(x:T.C)$$ 
intended to denote the reversemode derivative of the function $x:T\mapsto C$, at $V$, evaluated at $W$. A definition is given in Figure 6; in the definition we assume that $x\notin \mathrm{FV}(V,W)$, and, as is common, that all binding variables are different.
Proposition 3.3 ().
The following typing rule is admissible:
$$\frac{\mathrm{\Gamma}[x:T]\u22a2C:U\mathit{\hspace{1em}}\mathrm{\Gamma}\u22a2V:T\mathit{\hspace{1em}}\mathrm{\Gamma}\u22a2W:U}{\mathrm{\Gamma}\u22a2W.{\mathcal{R}}_{V}(x:T.C):T}$$ 
In large part because of the restrictions on trace terms, their symbolic differentiation is just a systematic, formal application of the chain rule. In our setting, this application requires a fair amount of attention to detail, for instance the use of the type decorations when giving derivatives of pairing and projection terms.
The reader may wish to try the following two evaluation examples with nested differentiation:
$$1.{\mathrm{\U0001d69b\U0001d68d}}_{1}(x:\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}.x\times \mathrm{\hspace{0.17em}1}.{\mathrm{\U0001d69b\U0001d68d}}_{1}(y:\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}.x+y))\Rightarrow 1$$ 
and
$$\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}f(x:\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}):\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}=\mathrm{\hspace{0.33em}1}.{\mathrm{\U0001d69b\U0001d68d}}_{1}(y:\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}.x+y)\mathrm{\U0001d692\U0001d697}\mathrm{\hspace{0.33em}\hspace{0.33em}1}.{\mathrm{\U0001d69b\U0001d68d}}_{1}(x:\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}.x+f(x))\Rightarrow 1$$ 
Examples of this kind can be used to illustrate perturbation confusion in forward differentiation, e.g., (Siskind and Pearlmutter, 2005, 2008).
We need some basic results on our evaluation relations. Two are standard: determinacy and type safety, and are used implicitly throughout the rest of the paper. The third connects symbolic and ordinary evaluation: one can interpolate symbolic evaluation within ordinary evaluation. It is principally helpful to reduce the completeness part of symbolic evaluation to the completeness of ordinary evaluation (see Theorem 6.7).
Proposition 3.4 (Determinacy of evaluation).
The following hold:

(1)
For any $\phi $, $\rho $, and $M$, there is at most one value $V$ s.t. $\phi \mid \rho \u22a2M\Rightarrow V$.

(2)
For any $\phi $, $\rho $, and $M$, there is at most one trace term $C$ s.t. $\phi \mid \rho \u22a2M\rightsquigarrow C$.
The following interpolation proposition establishes a certain consistency between the ordinary and symbolic evaluation relations.
Proposition 3.5 (Operational interpolation).
For all $\phi $, $\rho $, and closed values $V$, the following are equivalent:

(1)
$\phi \mid \rho \u22a2M\Rightarrow V$,

(2)
$\phi \mid \rho \u22a2M\rightsquigarrow C$ and $\phi \mid \rho \u22a2C\Rightarrow V$, for some $C$.
For a type safety theorem, we need typing judgments $\rho :\mathrm{\Gamma}$, $\phi :\mathrm{\Phi}$, and $\mathrm{Cl}:T\to U$ for environments, function environments, and closures (implicitly extending the notion of type). These are defined inductively by the following rules:
$$\frac{\u22a2{V}_{i}:{T}_{i}\mathit{\hspace{1em}}(i=0,n1)}{\u22a2\{\mathrm{\dots},{x}_{i}\mapsto {V}_{i},\mathrm{\dots}\}:\mathrm{\dots},{x}_{i}:{T}_{i},\mathrm{\dots}}$$ 
$$\frac{\u22a2{\mathrm{Cl}}_{i}:{T}_{i}\to {U}_{i}\mathit{\hspace{1em}}(i=0,n1)}{\u22a2\{\mathrm{\dots},{f}_{i}\mapsto {\mathrm{Cl}}_{i},\mathrm{\dots}\}:\mathrm{\dots},{f}_{i}:{T}_{i}\to {U}_{i},\mathrm{\dots}}$$ 
$$\frac{\u22a2\phi :\mathrm{\Phi}\mathit{\hspace{1em}}\mathrm{\Phi}[T\to U/f],x:T\u22a2M:U}{\u22a2{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{\phi}(f(x:T):U.M):T\to U}$$ 
Unique typing extends to environments, to function environments, and to closures (indeed, a closure ${\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{\phi}(f(x:T):U.M)$ can only have type $T\to U$).
Proposition 3.6 (Type safety).
Suppose $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}M\mathrm{:}T$, $\mathrm{\u22a2}\phi \mathrm{:}\mathrm{\Phi}$ and $\mathrm{\u22a2}\rho \mathrm{:}\mathrm{\Gamma}$. Then we have:
$$\phi \mid \rho \u22a2M\Rightarrow V\u27f9\u22a2V:T$$ 
and
$$\phi \mid \rho \u22a2M\rightsquigarrow C\u27f9\mathrm{\Gamma}\u22a2C:T$$ 
4. Mathematical preliminaries
We now turn to the mathematical facts needed for the denotational semantics of our language. These concern the two modes of differentiation and their interaction with domain theory. We follow (Abramsky and Jung, 1994) for domain theory, but write dcppo for pointed dcpo, and say a partial order is coherent iff every compatible subset has a lub. (A subset is compatible if any two of its elements have an upper bound.) Every coherent partial order is a dcppo.
The collection of partial functions $f:X\rightharpoonup Y$ with open domain between two topological spaces forms a partial order under graph inclusion:
$$f\le g\iff f\subseteq g$$ 
equivalently, using the Kleene order ^{2}^{2} 2 We write $e\u2aaf{e}^{\prime}$ for two mathematical expressions $e$ and ${e}^{\prime}$ to mean that if $e$ is defined so is ${e}^{\prime}$, and they are then equal.:
$$f\le g\iff \forall \mathbf{x}\in X.f\mathbf{x}\u2aafg\mathbf{x}$$ 
This partial order is a coherent dcppo with $\u27c2$ the everywhere undefined function and compatible sups given by unions. A partial function $f:X\rightharpoonup Y$ is continuous if ${f}^{1}(B)$ is open whenever $B$ is; the subcollection of continuous partial functions forms a coherent subdcppo. This holds as, for any open set $B\subseteq Y$ and compatible collection of partial functions ${f}_{i}(i\in I):X\rightharpoonup Y$, we have:
$${(\underset{i\in I}{\bigvee}{f}_{i})}^{1}(B)=\bigcup _{i\in I}{f}_{i}^{1}(B)$$ 
We write $\mathcal{C}[X,Y]$ for the dcppo of partial continuous functions from $X$ to $Y$.
It is convenient to use a variation on cartesian product when working with powers of $\mathbb{R}$. We set:
$${\mathbb{R}}^{m}\dot{\times}{\mathbb{R}}^{n}{=}_{\text{def}}{\mathbb{R}}^{m+n}\mathit{\hspace{1em}}(m,n\ge 0)$$ 
This version of product is associative. Vector concatenation then serves as tupling; however, for clarity, we may use the usual notation $({\mathbf{x}}_{0},\mathrm{\dots},{\mathbf{x}}_{k1})$ instead of ${\mathbf{x}}_{0}\mathrm{\dots}{\mathbf{x}}_{k1}$. There are evident definitions of the projections ${\pi}_{i}^{{m}_{0},\mathrm{\dots},{m}_{k1}}:{\mathbb{R}}^{{m}_{0}}\dot{\times}\mathrm{\dots}\dot{\times}{\mathbb{R}}^{{m}_{k1}}\to {\mathbb{R}}^{{m}_{i}}$, and of the tupling
$$\u27e8{f}_{0},\mathrm{\dots},{f}_{k1}\u27e9:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{{m}_{0}}\dot{\times}\mathrm{\dots}\dot{\times}{\mathbb{R}}^{{m}_{k1}}$$ 
of ${f}_{i}:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{{m}_{i}}$. We may ignore the superscripts on the projections when they can be understood from the context.
4.1. Continuity, differentiability, and smoothness
Standard multivariate analysis of vectorvalued real functions from ${\mathbb{R}}^{n}$ to ${\mathbb{R}}^{m}$ (with $n,m>0$) considers functions $f$ defined on an open domain of ${\mathbb{R}}^{n}$, see, e.g., (Trench, 2003). These are precisely the partial functions:
$$f:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}\mathit{\hspace{1em}}(n,m>0)$$ 
with open domain.
When $m=1$, such a function $f$ has partial derivatives
$${\partial}_{j}(f):{\mathbb{R}}^{n}\rightharpoonup \mathbb{R}\mathit{\hspace{1em}\hspace{1em}}(j=0,n1)$$ 
where
$${\partial}_{j}(f)({x}_{0},\mathrm{\dots},{x}_{n1}){\simeq}_{\mathrm{def}}\frac{\partial f}{\partial {x}_{j}}$$ 
viewing $f$ as a function of ${x}_{0},\mathrm{\dots},{x}_{n1}$. Taken together, these partial derivatives form its gradient
$$\nabla (f):{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{n}$$ 
where
(1)  $$\nabla (f)(\mathbf{x})\simeq \u27e8{\partial}_{0}(f)(\mathbf{x}),\mathrm{\dots},{\partial}_{n1}(f)(\mathbf{x})\u27e9$$ 
We write ${\nabla}_{\mathbf{x}}(f)$ for $\nabla (f)(\mathbf{x})$.
We say $f$ is continuously differentiable if all the ${\partial}_{j}(f)$ are continuous with domain that of $f$, equivalently if $\nabla (f)$ is continuous with domain that of $f$. As an example, removing $0$ from the domain of definition of the nondifferentiable ReLU function $f(x)=\mathrm{max}(x,0)$, we obtain a continuously differentiable partial function with domain $\mathbb{R}\backslash 0$; its derivative also has domain $\mathbb{R}\backslash 0$, with value $0$, if $$, and $1$, if $x>0$.
We now turn to the general case where $f:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}$. The Jacobian ${\mathrm{J}}_{\mathbf{x}}(f)$ of $f$ at $\mathbf{x}\in {\mathbb{R}}^{n}$ is the $m$ by $n$ matrix:
(2)  $${\mathrm{J}}_{\mathbf{x}}{(f)}_{i,j}\simeq {\partial}_{j}({\pi}_{i}\circ f)(\mathbf{x})$$ 
where the matrix is undefined if any of the ${\partial}_{j}({\pi}_{i}\circ f)(\mathbf{x})$ are. These Jacobians form a partial function:
$$\mathrm{J}(f):{\mathbb{R}}^{n}\rightharpoonup \mathrm{Mat}(m,n)$$ 
where $\mathrm{Mat}(m,n)$ is the collection of $m$ by $n$ matrices. Viewing $\mathrm{Mat}(m,n)$ as ${\mathbb{R}}^{m\times n}$, we say $f$ is continuously differentiable if $\mathrm{J}(f)$ is continuous and has the same domain as $f$ (equivalently if each component ${\pi}_{i}\circ f$ of $f$ is continuously differentiable).
The differential ^{3}^{3} 3 Differentials are discussed in (Trench, 2003) see: p325 for differentials of functions of several variables; p348 for higherorder differentials; p381 for differentials of vectorvalued functions of several variables; and p388 for the chain rule in differential terms.
$$\mathrm{d}(f):{\mathbb{R}}^{n}\dot{\times}{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}$$ 
of $f$ is defined by:
(3)  $$\mathrm{d}(f)(\mathbf{x},\mathbf{y})\simeq {\mathrm{J}}_{\mathbf{x}}(f)\cdot \mathbf{y}$$ 
and $f$ is continuously differentiable iff $\mathrm{d}(f)$ has domain $\mathrm{Dom}(f)\times {\mathbb{R}}^{n}$ and is continuous there.
We write ${\mathrm{d}}_{\mathbf{x}}(f)$ for the partial function $\mathrm{d}(f)(\mathbf{x},)$; it is either $\u27c2$ or everywhere defined and linear, the latter occurring precisely when ${\mathrm{J}}_{\mathbf{x}}(f)$ is defined. If $f$ is linear then ${\mathrm{d}}_{\mathbf{x}}f=f$, for all $\mathbf{x}\in {\mathbb{R}}^{n}$.
In the automatic differentiation literature, ${\mathrm{d}}_{\mathbf{x}}(f)$ is called the forwardmode derivative of $f$ at $\mathbf{x}$. For the reversemode derivative we define:
$${\mathrm{d}}^{r}(f):{\mathbb{R}}^{n}\dot{\times}{\mathbb{R}}^{m}\rightharpoonup {\mathbb{R}}^{n}$$ 
by:
(4)  $${\mathrm{d}}^{r}(f)(\mathbf{x},\mathbf{y})\simeq {\mathrm{J}}_{\mathbf{x}}{(f)}^{t}\cdot \mathbf{y}$$ 
and write ${\mathrm{d}}_{\mathbf{x}}^{r}(f)$ for the partial function ${\mathrm{d}}^{r}(f)(\mathbf{x},)$; $f$ is continuously differentiable iff ${\mathrm{d}}^{r}(f)$ has domain $\mathrm{Dom}(f)\times {\mathbb{R}}^{m}$ and is continuous there.
In terms of the differentials, the two modes are related by:
(5)  $${\mathrm{d}}_{\mathbf{x}}^{r}(f)={\mathrm{d}}_{\mathbf{x}}{(f)}^{\u2020}$$ 
(setting ${\u27c2}^{\u2020}=\u27c2$). So, if $f$ is linear, then ${\mathrm{d}}_{\mathbf{x}}^{r}(f)={\mathrm{d}}_{\mathbf{x}}{(f)}^{\u2020}={f}^{\u2020}$.
The semantic content of the definition of forwardmode from reversemode given in Section 2 is the following equality:
(6)  $${\mathrm{d}}_{\mathbf{x}}f={\mathrm{d}}_{\mathrm{\U0001d7ce}}^{R}({\mathrm{d}}_{\mathbf{x}}^{R}f)$$ 
This holds as: ${\mathrm{d}}_{\mathrm{\U0001d7ce}}^{R}({\mathrm{d}}_{\mathbf{x}}^{R}f)={({\mathrm{d}}_{\mathbf{x}}^{R}f)}^{\u2020}={({({\mathrm{d}}_{\mathbf{x}}f)}^{\u2020})}^{\u2020}={\mathrm{d}}_{\mathbf{x}}f$ (the first equality holds as ${\mathrm{d}}_{\mathbf{x}}^{R}f$ is linear if $\ne \u27c2$).
The continuously differentiable functions are closed under composition. If $h:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{l}$ is the composition of two such functions $f:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}$ and $g:{\mathbb{R}}^{m}\rightharpoonup {\mathbb{R}}^{l}$, then the chain rule expresses the derivative of $h$ in terms of those of $f$ and $g$. In terms of Jacobians, the chain rule is:
$$\begin{array}{cccc}{\mathrm{J}}_{\mathbf{x}}(h)\hfill & \hfill \simeq \hfill & {\mathrm{J}}_{f(\mathbf{x})}(g)\cdot {\mathrm{J}}_{\mathbf{x}}(f)\hfill & (\mathbf{x}\in \mathrm{Dom}(h))\hfill \end{array}$$ 
(Note that $\mathbf{x}\in \mathrm{Dom}(h)$ iff $\mathbf{x}\in \mathrm{Dom}(f)$ and $f(\mathbf{x})\in \mathrm{Dom}(g)$.) In terms of forwardmode derivatives the chain rule is:
(7)  $$\begin{array}{cccc}{\mathrm{d}}_{\mathbf{x}}(h)\hfill & \hfill =\hfill & {\mathrm{d}}_{f(\mathbf{x})}(g)\circ {\mathrm{d}}_{\mathbf{x}}(f)\hfill & (\mathbf{x}\in \mathrm{Dom}(h))\hfill \end{array}$$ 
and in terms of reversemode derivatives it is:
(8)  $$\begin{array}{cccc}{\mathrm{d}}_{\mathbf{x}}^{r}(h)\hfill & \hfill =\hfill & {\mathrm{d}}_{\mathbf{x}}^{r}(f)\circ {\mathrm{d}}_{f(\mathbf{x})}^{r}(g)\hfill & (\mathbf{x}\in \mathrm{Dom}(h))\hfill \end{array}$$ 
Derivatives with respect to two variables can be reduced to derivatives in each separately. Specifically suppose $f:{\mathbb{R}}^{n+{n}^{\prime}}\rightharpoonup {\mathbb{R}}^{m}$. Then, for $\mathbf{x}\in {\mathbb{R}}^{n}$ and $\mathbf{y}\in {\mathbb{R}}^{{n}^{\prime}}$, we have:
(9)  $${\mathrm{d}}_{\u27e8\mathbf{x},\mathbf{y}\u27e9}(f)={\mathrm{d}}_{\mathbf{x}}(f(,\mathbf{y}))+{\mathrm{d}}_{\mathbf{y}}(f(\mathbf{x},))$$ 
and
(10)  $${\mathrm{d}}_{\u27e8\mathbf{x},\mathbf{y}\u27e9}^{R}(f)=\u27e8{\mathrm{d}}_{\mathbf{x}}^{R}(f(,\mathbf{y})),{\mathrm{d}}_{\mathbf{y}}^{R}(f(\mathbf{x},))\u27e9$$ 
These equations are useful for dealing with fanin.
Our programming language has all finite product types of the reals. We will therefore need to work with partial functions with open domain $f:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}$ where $n$ or $m$ is zero. To this end we regard ${\mathbb{R}}^{0}$ as having as sole element the empty vector, the trivial topology, and the trivial vector space structure. Every such total function is linear, and this determines its adjoint.
We take such a function $f:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}$, where $n$ or $m$ is zero, to be continuously differentiable if it is continuous, and we define $\mathrm{d}(f):{\mathbb{R}}^{n}\dot{\times}{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}$ and ${\mathrm{d}}^{r}(f):{\mathbb{R}}^{n}\dot{\times}{\mathbb{R}}^{m}\rightharpoonup {\mathbb{R}}^{n}$ by:
$$\mathrm{d}(f)(\mathbf{x},\mathbf{y})\simeq \{\begin{array}{cc}0\hfill & (f(\mathbf{x})\downarrow )\hfill \\ \uparrow \hfill & (\text{otherwise})\hfill \end{array}\mathit{\hspace{1em}\hspace{1em}}{\mathrm{d}}^{r}(f)(\mathbf{x},\mathbf{y})\simeq \{\begin{array}{cc}0\hfill & (f(\mathbf{x})\downarrow )\hfill \\ \uparrow \hfill & (\text{otherwise})\hfill \end{array}$$ 
The derivative $\mathrm{d}(f)$ has domain $\mathrm{Dom}(f)\times {\mathbb{R}}^{n}$ and is continuous (and so continuously differentiable) iff $f$ is, and a similar remark applies to ${\mathrm{d}}^{r}(f)$. We understand ${\mathrm{d}}_{\mathbf{x}}(f)$ and ${\mathrm{d}}_{\mathbf{x}}^{r}(f)$ similarly to before. In case $f$ is linear (i.e., total) we have ${\mathrm{d}}_{\mathbf{x}}(f)=f$ and ${\mathrm{d}}_{\mathbf{x}}^{r}(f)={f}^{\u2020}$ as before and equation (5) relating the two modes continues to hold, as does equation (6) and also equations (10) and (9) concerning derivatives with respect to two variables. In particular for $\mathrm{t}:{\mathbb{R}}^{n}\to {\mathbb{R}}^{0}$ we have:
$$\begin{array}{ccc}({\mathrm{d}}_{\mathbf{x}}\mathrm{t}){\mathbf{x}}^{\prime}\hfill & \hfill =\hfill & \ast \mathbf{\hspace{1em}\hspace{1em}}({\mathrm{d}}_{\mathbf{x}}^{r}\mathrm{t})\ast =\mathrm{\U0001d7ce}\hfill \end{array}$$ 
Regarding compositions we note a useful fact:
Fact 1 ().
If $h\mathrm{:}{\mathrm{R}}^{n}\mathrm{\rightharpoonup}{\mathrm{R}}^{l}\mathrm{(}l\mathrm{,}n\mathrm{\ge}\mathrm{0}\mathrm{)}$ is constant on its domain, then it is continuously differentiable iff it is continuous and then, for any $\mathrm{x}$ in its domain, ${\mathrm{d}}_{\mathrm{x}}\mathit{}\mathrm{(}h\mathrm{)}$ is the constantly $\mathrm{0}$ function, as is ${\mathrm{d}}_{\mathrm{x}}^{r}\mathit{}\mathrm{(}h\mathrm{)}$.
It follows that the continuously differentiable functions, if taken in our wider sense, remain closed under composition and pointwise addition, and that the chain rule for forward derivatives continues to hold, as does that for reverse derivatives. From now on whenever we consider partial functions from an ${\mathbb{R}}^{n}$ to an ${\mathbb{R}}^{m}$, we include the cases where $n$ or $m$ is $0$.
The projections ${\pi}_{i}^{{m}_{0},\mathrm{\dots},{m}_{k1}}:{\mathbb{R}}^{{m}_{0}}\dot{\times}\mathrm{\dots}\dot{\times}{\mathbb{R}}^{{m}_{k1}}\to {\mathbb{R}}^{{m}_{i}}$ are total linear functions, so we have:
$$\begin{array}{ccc}{\mathrm{d}}_{\mathbf{x}}{\pi}_{i}^{{m}_{0},\mathrm{\dots},{m}_{k1}}\hfill & \hfill =\hfill & {\pi}_{i}^{{m}_{0},\mathrm{\dots},{m}_{k1}}\hfill \\ ({\mathrm{d}}_{\mathbf{x}}^{r}{\pi}_{i}^{{m}_{0},\mathrm{\dots},{m}_{k1}})\mathbf{y}\hfill & \hfill =\hfill & (0,\mathrm{\dots},0,\mathbf{y},0,\mathrm{\dots},0)\hfill \end{array}$$ 
Regarding the tupling $\u27e8{f}_{0},\mathrm{\dots},{f}_{k1}\u27e9:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{{m}_{0}}\dot{\times}\mathrm{\dots}\dot{\times}{\mathbb{R}}^{{m}_{k1}}$ of ${f}_{i}:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{{m}_{i}}$ we have:
$$\begin{array}{ccc}{\mathrm{d}}_{\mathbf{x}}\u27e8{f}_{0},\mathrm{\dots},{f}_{k1}\u27e9\hfill & \hfill =\hfill & \u27e8{\mathrm{d}}_{\mathbf{x}}{f}_{0},\mathrm{\dots},{\mathrm{d}}_{\mathbf{x}}{f}_{k1}\u27e9\hfill \\ ({\mathrm{d}}_{\mathbf{x}}^{r}\u27e8{f}_{0},\mathrm{\dots},{f}_{k1}\u27e9)({\mathbf{y}}_{0},\mathrm{\dots},{\mathbf{y}}_{k1})\hfill & \hfill \simeq \hfill & ({\mathrm{d}}_{\mathbf{x}}^{r}{f}_{0}){\mathbf{y}}_{0}+\mathrm{\dots}+({\mathrm{d}}_{\mathbf{x}}^{r}{f}_{k1}){\mathbf{y}}_{k1}\hfill \end{array}$$ 
For the semantics of our language we work with infinitely differentiable functions, i.e., smooth ones. First we define smoothness classes ${C}^{k}$. We say that a partial function $f:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}$ is ${C}^{0}$ if it is continuous, and, inductively, is ${C}^{k+1}$ if $\mathrm{d}(f)$ has domain $\mathrm{Dom}(f)\times {\mathbb{R}}^{n}$ and is ${C}^{k}$. This defines a decreasing sequence of classes of functions, and we say that $f$ is smooth or ${C}^{\mathrm{\infty}}$ if it is ${C}^{k}$ for all $k$. The ${C}^{1}$ functions are precisely the continuously differentiable ones. Using the chain rule for differentials one shows that the ${C}^{k}$ functions, and so too the smooth ones, are closed under composition. The projections are smooth, as are all linear functions and the ${C}^{k}$ functions, and so too the smooth ones, are closed under tupling.
4.2. Cppos of differentiable functions
The subgraph partial order on partial functions between powers of $\mathbb{R}$ interacts well with the differential structure. We have:
Proposition 4.1 ().

(1)
For any $f\le g:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}$ with open domain we have:
$$\mathrm{d}f=(\mathrm{d}g)\upharpoonright (\mathrm{Dom}(f)\times {\mathbb{R}}^{n})\mathit{\hspace{1em}\hspace{1em}}{\mathrm{d}}^{r}f=({\mathrm{d}}^{r}g)\upharpoonright (\mathrm{Dom}(f)\times {\mathbb{R}}^{m})$$ 
(2)
For any compatible family of functions with open domain ${f}_{i}:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}(i\in I)$, we have:
$$\mathrm{d}\underset{i\in I}{\bigvee}{f}_{i}=\underset{i\in I}{\bigvee}\mathrm{d}{f}_{i}\mathit{\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}}{\mathrm{d}}^{r}\underset{i\in I}{\bigvee}{f}_{i}=\underset{i\in I}{\bigvee}{\mathrm{d}}^{r}{f}_{i}$$
Proof.
For the first part, if $m$ or $n$ is 0, the conclusion is immediate using Fact 1. Otherwise, as $f$ and $g$ agree on. an open set including $x$ we have:
$${\mathrm{J}}_{\mathbf{x}}(f)\simeq {\mathrm{J}}_{\mathbf{x}}(g)$$ 
and the conclusion follows. For the second part, set $f=\bigvee {f}_{i}$. For the forwardmode derivative, using the first part we calculate:
$$\begin{array}{ccccc}\mathrm{d}f\hfill & \hfill =\hfill & \mathrm{d}f\upharpoonright (\mathrm{Dom}(f)\times {\mathbb{R}}^{n})\hfill & \hfill =\hfill & \mathrm{d}f\upharpoonright \bigcup _{i\in I}(\mathrm{Dom}({f}_{i})\times {\mathbb{R}}^{n})\hfill \\ & \hfill =\hfill & \underset{i\in I}{\bigvee}\mathrm{d}f\upharpoonright (\mathrm{Dom}({f}_{i})\times {\mathbb{R}}^{n})\hfill & \hfill =\hfill & \underset{i\in I}{\bigvee}\mathrm{d}{f}_{i}\hfill \end{array}$$ 
The proof for the reversemode derivative is similar.
∎
Proposition 4.2 ().

(1)
Let $f\le g:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}$ be partial functions with open domain. Then $f$ is smooth if $g$ is.

(2)
Let ${f}_{i}:{\mathbb{R}}^{n}\rightharpoonup {\mathbb{R}}^{m}$ be a compatible family of partial functions with open domain, and with sup $f$. Then $f$ is smooth if all the ${f}_{i}$ are.
Proof.
For the first part, we prove by induction that if $g$ is ${C}^{k}$ then so is $f$. For $k=0$ we note that for any open $V\subseteq {\mathbb{R}}^{m}$, ${f}^{1}(V)={g}^{1}(V)\cap \mathrm{Dom}(f)$. For $k+1$, as $g$ is ${C}^{k+1}$, $\mathrm{Dom}(\mathrm{d}g)=\mathrm{Dom}(g)\times {\mathbb{R}}^{n}$ and $\mathrm{d}g$ is ${C}^{k}$. From part (1) of Proposition 4.1 we have $\mathrm{d}f\le \mathrm{d}g$ and $\mathrm{Dom}(\mathrm{d}f)=\mathrm{Dom}(f)\times {\mathbb{R}}^{n}$. So $\mathrm{d}f$ and $\mathrm{d}g$ have open domain, $\mathrm{d}f\le \mathrm{d}g$, and $\mathrm{d}g$ is ${C}^{k}$. It follows from the induction hypothesis that $\mathrm{d}f$ is ${C}^{k}$. So $f$ is ${C}^{k+1}$, as required.
For the second part we prove, by induction on $k$ that if all the ${f}_{i}$ are ${C}^{k}$, then so is $f$. For $k=0$ this is clear. For $k+1$, we have, for all $i$, that $\mathrm{Dom}(\mathrm{d}{f}_{i})=\mathrm{Dom}({f}_{i})\times {\mathbb{R}}^{n}$ and ${f}_{i}$ is ${C}^{k}$. From part (2) of Proposition 4.1 we have $\mathrm{d}f={\bigvee}_{i}\mathrm{d}{f}_{i}$. So, first,
$$\begin{array}{ccccc}\mathrm{Dom}(\mathrm{d}f)\hfill & \hfill =\hfill & \mathrm{Dom}(\underset{i}{\bigvee}\mathrm{d}{f}_{i})\hfill & \hfill =\hfill & \bigcup _{i}\mathrm{Dom}(\mathrm{d}{f}_{i})\hfill \\ & \hfill =\hfill & \bigcup _{i}\mathrm{Dom}({f}_{i})\times {\mathbb{R}}^{n}\hfill & \hfill =\hfill & \mathrm{Dom}(f)\times {\mathbb{R}}^{n}\hfill \end{array}$$ 
and second, also using the induction hypothesis, we have that $\mathrm{d}f$ is ${C}^{k}$. So $f$ is ${C}^{k+1}$, as required. ∎
We write $\mathcal{S}[{\mathbb{R}}^{n},{\mathbb{R}}^{m}]$ for the coherent dcppo of smooth partial functions between ${\mathbb{R}}^{n}$ and ${\mathbb{R}}^{m}$.
4.3. Conditionals and recursion
Differentiation and conditionals interact well. The conditional combinator ${\mathrm{Cond}}_{n,m}$ is defined for $p:{R}^{n}\rightharpoonup \mathbb{T}$ and $f,g:{R}^{n}\rightharpoonup {\mathbb{R}}^{m}$ by:
$${\mathrm{Cond}}_{n,m}(p,f,g)(\mathbf{x})\simeq \{\begin{array}{cc}f(\mathbf{x})\hfill & (p(\mathbf{x})=tt)\hfill \\ g(\mathbf{x})\hfill & (p(\mathbf{x})=ff)\hfill \\ \uparrow \hfill & (\text{otherwise})\hfill \end{array}$$ 
where $\mathbb{T}=\{tt,ff\}$. The conditional combinator is continuous. For differentiability, with $\mathbb{T}$ a discrete topological space, we have:
Proposition 4.3 ().
Suppose $p$ is continuous (equivalently: both ${p}^{\mathrm{}\mathrm{1}}\mathit{}\mathrm{(}t\mathit{}t\mathrm{)}$ and ${p}^{\mathrm{}\mathrm{1}}\mathit{}\mathrm{(}f\mathit{}f\mathrm{)}$ are open). Then:
$$\begin{array}{ccc}\mathrm{d}({\mathrm{Cond}}_{n,m}(p,f,g))\hfill & \hfill =\hfill & {\mathrm{Cond}}_{(n+n),m}(p\circ {\pi}_{1},\mathrm{d}f,\mathrm{d}g)\hfill \end{array}$$ 
and
$$\begin{array}{ccc}{\mathrm{d}}^{r}({\mathrm{Cond}}_{n,m}(p,f,g))\hfill & \hfill =\hfill & {\mathrm{Cond}}_{(n+m),n}(p\circ {\pi}_{1},{\mathrm{d}}^{r}f,{\mathrm{d}}^{r}g)\hfill \end{array}$$ 
Further, if $f\mathrm{,}g$ are smooth so is ${\mathrm{Cond}}_{n\mathrm{,}m}\mathit{}\mathrm{(}p\mathrm{,}f\mathrm{,}g\mathrm{)}$.
Proof.
Assume that $p$ is continuous. Set $h={\mathrm{Cond}}_{n,m}(p,f,g)$. The domain of $h$ is open, indeed $\mathrm{Dom}(h)=({p}^{1}(tt)\cap \mathrm{Dom}(f))\cup ({p}^{1}(ff)\cap \mathrm{Dom}(g))$ so $\mathrm{d}h$ is defined.
To prove the equality, choose $\mathbf{x}\in {\mathbb{R}}^{n}$. There are three cases. First, if $p(\mathbf{x})\uparrow $ then $h(\mathbf{x})\uparrow $ and so $(\mathrm{d}h)(\mathbf{x})\uparrow $; the equality therefore holds at $\mathbf{x}$. Second if $p(\mathbf{x})=tt$ then, as $h\upharpoonright {p}^{1}(tt)=f\upharpoonright {p}^{1}(tt)$ and ${p}^{1}(tt)$ is open, we see that, by part (1) of Proposition 4.1, ${\mathrm{d}}_{\mathbf{x}}h={\mathrm{d}}_{\mathbf{x}}h\upharpoonright {p}^{1}(tt)={\mathrm{d}}_{\mathbf{x}}f\upharpoonright {p}^{1}(tt)={\mathrm{d}}_{\mathbf{x}}f$, and so the equality again holds at $\mathbf{x}$. The third case is similar to the second.
Suppose further that $f,g$ are smooth. We have $h\upharpoonright {p}^{1}(tt)\le f$, and so, by part (1) of Proposition 4.2, $h\upharpoonright {p}^{1}(tt)$ is smooth as $f$ is and $h\upharpoonright {p}^{1}(tt)$ has open domain. Similarly $h\upharpoonright {p}^{1}(ff)$ is smooth. But then, by part (2) of Proposition 4.2, $h$ is smooth as $h=h\upharpoonright {p}^{1}(\mathrm{\U0001d69d\U0001d69b\U0001d69e\U0001d68e})\vee h\upharpoonright {p}^{1}(ff)$. ∎
In less formal terms than the proof, equality holds because if, say, the condition $p$ holds at $\mathbf{x}\in {\mathbb{R}}^{n}$, it holds in a neighborhood $O$ of $\mathbf{x}$. So $f$ and the conditional are equal throughout $O$, and therefore have the same derivative there. The equality justifies the approaches to the differentiation of conditionals described in the Introduction.
Differentiation and recursion also interact well. For any continuous $f:P\times Q\to Q$ ($P$ a dcpo, $Q$ a dcppo) we write $\mu y:Q.f(x,y)$ for the least fixedpoint (l.f.p.) of $f(x,)$. It is the sup of the iterates ${\mu}_{n}y:Q.f(x,y)$, defined inductively by:
$$\begin{array}{ccc}{\mu}_{n+1}y:Q.f(x,y)\hfill & \hfill =\hfill & f(x,{\mu}_{n}y:Q.f(x,y))\hfill \end{array}$$ 
starting from ${\u27c2}_{Q}$. As functions of $P$, the l.f.p. and the iterates are continuous. When $f:Q\to Q$, we write $\mu y:Q.f(y)$, etc.
Proposition 4.4 ().

(1)
Set $Q=\mathcal{S}[{\mathbb{R}}^{m},{\mathbb{R}}^{l}]$ and $R=\mathcal{S}[{\mathbb{R}}^{m}\dot{\times}{\mathbb{R}}^{m},{\mathbb{R}}^{l}]$. Then if $F:{\mathbb{R}}^{n}\times Q\to Q$ and $G:{\mathbb{R}}^{n}\times Q\times R\to R$ are such that
$$\mathrm{d}F(x,f)=G(x,f,\mathrm{d}f)\mathit{\hspace{1em}}(x\in {\mathbb{R}}^{n},f\in Q)$$ we have:
$$\mathrm{d}(\mu f:Q.F(x,f))=\mu {f}^{\prime}:R.G(x,\mu f:Q.F(x,f),{f}^{\prime})$$ 
(2)
Set $Q=\mathcal{S}[{\mathbb{R}}^{m},{\mathbb{R}}^{l}]$ and $R=\mathcal{S}[{\mathbb{R}}^{m}\dot{\times}{\mathbb{R}}^{l},{\mathbb{R}}^{m}]$. Then if $F:{\mathbb{R}}^{n}\times Q\to Q$ and
$G:{\mathbb{R}}^{n}\times Q\times R\to R$ are such that$${\mathrm{d}}^{r}F(x,f)=G(x,f,{\mathrm{d}}^{r}f)\mathit{\hspace{1em}}(x\in {\mathbb{R}}^{n},f\in Q)$$ we have:
$${\mathrm{d}}^{r}(\mu f:Q.F(x,f))=\mu {f}^{\prime}:R.G(x,\mu f:Q.F(x,f),{f}^{\prime})$$
Proof.
We only consider the forwardmode case as the reversemode case is similar. In one direction we prove by induction on $n$ that
$$\mathrm{d}({\mu}_{n}f:Q.F(x,f))\le \mu {f}^{\prime}:R.G(x,\mu f:Q.F(x,f),{f}^{\prime})$$ 
This is evident for $n=0$. For $n+1$ we calculate (missing out types):
$$\begin{array}{ccc}\mathrm{d}({\mu}_{n+1}f.F(x,f))\hfill & \hfill =\hfill & \mathrm{d}(F(x,{\mu}_{n}f.F(x,f)))\hfill \\ & \hfill =\hfill & G(x,{\mu}_{n}f.F(x,f),\mathrm{d}{\mu}_{n}f.F(x,f))\hfill \\ & \hfill \le \hfill & G(x,\mu f.F(x,f),\mu {f}^{\prime}.G(x,\mu f.F(x,f),{f}^{\prime}))\hfill \\ & \hfill =\hfill & \mu {f}^{\prime}.G(x,\mu f.F(x,f),{f}^{\prime})\hfill \end{array}$$ 
In the other direction we prove by induction on $n$ that
$${\mu}_{n}{f}^{\prime}:R.G(x,\mu f:Q.F(x,f),{f}^{\prime})\le \mathrm{d}(\mu f:Q.F(x,f))$$ 
This is evident for $n=0$. For $n+1$ we calculate:
$$\begin{array}{ccc}{\mu}_{n+1}{f}^{\prime}.G(x,\mu f.F(x,f),{f}^{\prime})\hfill & \hfill =\hfill & G(x,\mu f.F(x,f),{\mu}_{n}{f}^{\prime}.G(x,\mu f.F(x,f),{f}^{\prime}))\hfill \\ & \hfill \le \hfill & G(x,\mu f.F(x,f),\mathrm{d}\mu f.F(x,f))\hfill \\ & \hfill =\hfill & \mathrm{d}F(x,\mu f.F(x,f))\hfill \\ & \hfill =\hfill & \mathrm{d}(\mu f.F(x,f))\hfill \end{array}$$ 
The conclusion then follows using the continuity of $\mathrm{d}$ (Part 2 of Proposition 4.1).
∎
Although we do not do so here, this proposition can be used to justify code transformations of recursive function definitions.
5. Denotational semantics
We begin with a denotational semantics of types as powers of the reals:
$$\begin{array}{ccc}[[\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}]]\hfill & \hfill =\hfill & \mathbb{R}\hfill \\ [[\mathrm{\U0001d69e\U0001d697\U0001d692\U0001d69d}]]\hfill & \hfill =\hfill & {\mathbb{R}}^{0}\hfill \\ [[T\times U]]\hfill & \hfill =\hfill & [[T]]\dot{\times}[[U]]\hfill \end{array}$$ 
Note that $[[T]]={\mathbb{R}}^{T}$, where, $\mathrm{\U0001d69b\U0001d68e\U0001d68a\U0001d695}=1$, $\mathrm{\U0001d69e\U0001d697\U0001d692\U0001d69d}=0$ and, recursively, $T\times U=T+U$. Then, for the semantics of environments $\mathrm{\Gamma}={x}_{0}:{T}_{0},\mathrm{\dots},{x}_{n1}:{T}_{n1}$ we set
$$[[\mathrm{\Gamma}]]=[[{T}_{0}]]\dot{\times}\mathrm{\dots}\dot{\times}[[{T}_{n1}]]$$ 
and for that of function environments $\mathrm{\Phi}={f}_{0}:{T}_{0}\to {U}_{n},\mathrm{\dots},{f}_{n1}:{T}_{n1}\to {U}_{n1}$ we set
$$[[\mathrm{\Phi}]]=\mathcal{S}[[[{T}_{0}]],[[{U}_{0}]]]\times \mathrm{\dots}\times \mathcal{S}[[[{T}_{n1}]],[[{U}_{n1}]]]$$ 
We use $\gamma ,\delta $ to range over $[[\mathrm{\Gamma}]]$ and $\varphi $ over $[[\mathrm{\Phi}]]$.
The denotational semantics of terms $\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T$ will be a continuous function:
$$[[\mathrm{\Phi}]]\stackrel{[[\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T]]}{\to}\mathcal{S}[[[\mathrm{\Gamma}]],[[T]]]$$ 
and that of boolean terms $\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2B$ will be a continuous function:
$$[[\mathrm{\Phi}]]\stackrel{[[\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2B]]}{\to}\mathcal{C}[[[\mathrm{\Gamma}]],\mathbb{T}]$$ 
When the environments and types are understood from the context, we may just write $[[M]]$ or $[[B]]$.
For the semantics of operation and predicate symbols, for every $\mathrm{op}:T\to U$ we assume available a smooth function $[[\mathrm{op}]]:[[T]]\rightharpoonup [[U]]$ and for every $\mathrm{pred}:T$ a continuous function $[[\mathrm{pred}]]:[[T]]\rightharpoonup \mathbb{T}$, such that:
(11)  $$[[\mathrm{op}]]([[V]])\simeq [[\mathrm{ev}(\mathrm{op},V)]]$$ 
for every closed value $V:T$, and
(12)  $$[[\mathrm{pred}]]([[V]])\simeq [[\mathrm{bev}(\mathrm{pred},V)]]$$ 
for every closed value $V:T$.
$$\begin{array}{cccc}[[x]]\varphi \gamma \hfill & \hfill \simeq \hfill & \gamma (x)\hfill & \\ & & & \\ [[r]]\varphi \gamma \hfill & \hfill \simeq \hfill & r\hspace{1em}\hspace{1em}(r\in \mathbb{R})\hfill & \\ & & & \\ [[M+N]]\varphi \gamma \hfill & \hfill \simeq \hfill & [[M]]\varphi \gamma +[[N]]\varphi \gamma \hfill & \\ & & & \\ [[\mathrm{op}(M)]]\varphi \gamma \hfill & \hfill \simeq \hfill & [[\mathrm{op}]]([[M]]\varphi \gamma )\hfill & \\ & & & \\ [[\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=M\mathrm{\U0001d692\U0001d697}N]]\varphi \gamma \hfill & \hfill \simeq \hfill & [[N]]\varphi (\gamma [[[M]]\varphi \gamma /x])\hfill & \\ & & & \\ [[\ast ]]\varphi \gamma \hfill & \hfill \simeq \hfill & \ast \hfill & \\ & & & \\ [[{\u27e8M,N\u27e9}_{T,U}]]\varphi \gamma \hfill & \hfill \simeq \hfill & \u27e8[[M]]\varphi \gamma ,[[N]]\varphi \gamma \u27e9\hfill & \\ & & & \\ [[{\mathrm{\U0001d68f\U0001d69c\U0001d69d}}_{T,U}(M)]]\varphi \gamma \hfill & \hfill \simeq \hfill & [[{\pi}_{0}]]([[M]]\varphi \gamma )\hfill & \\ & & & \\ [[{\mathrm{\U0001d69c\U0001d697\U0001d68d}}_{T,U}(M)]]\varphi \gamma \hfill & \hfill \simeq \hfill & [[{\pi}_{1}]]([[M]]\varphi \gamma )\hfill & \\ & & & \\ [[\mathrm{\U0001d692\U0001d68f}B\mathrm{\U0001d69d\U0001d691\U0001d68e\U0001d697}M\mathrm{\U0001d68e\U0001d695\U0001d69c\U0001d68e}N]]\varphi \gamma \hfill & \hfill \simeq \hfill & \{\begin{array}{cc}\hfill [[M]]\varphi \gamma \hfill & ([[B]]\varphi \gamma \simeq tt)\hfill \\ \hfill [[N]]\varphi \gamma \hfill & ([[B]]\varphi \gamma \simeq ff)\hfill \\ \hfill \uparrow \hfill & (\text{otherwise})\hfill \end{array}\hfill & \\ & & & \\ \begin{array}{c}[[\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N]]\varphi \gamma \hfill \end{array}\hfill & \hfill \simeq \hfill & \begin{array}{c}[[N]](\varphi [\mu \alpha :\mathcal{S}[[[T]],[[U]]].\hfill \\ \lambda a:[[T]].[[M]](\varphi [\alpha /f])\{x\mapsto a\}/f])\gamma \hfill \end{array}\hfill & \\ & & & \\ [[f(M)]]\varphi \gamma \hfill & \hfill \simeq \hfill & \varphi (f)([[M]]\varphi \gamma )\hfill & \\ & & & \\ [[M.{\mathrm{\U0001d69b\U0001d68d}}_{L}(x:T.N)]]\varphi \gamma \hfill & \hfill \simeq \hfill & \begin{array}{c}{\mathrm{d}}_{[[L]]\varphi \gamma}^{r}(a\in [[T]]\mapsto [[N]]\varphi (\gamma [a/x]))([[M]]\varphi \gamma )\hfill \end{array}\hfill & \\ & & & \\ [[\mathrm{\U0001d69d\U0001d69b\U0001d69e\U0001d68e}]]\varphi \gamma \hfill & \hfill \simeq \hfill & tt\hfill & \\ & & & \\ [[\mathrm{\U0001d68f\U0001d68a\U0001d695\U0001d69c\U0001d68e}]]\varphi \gamma \hfill & \hfill \simeq \hfill & ff\hfill & \\ & & & \\ [[\mathrm{pred}(M)]]\varphi \gamma \hfill & \hfill \simeq \hfill & [[\mathrm{pred}]]([[M]]\varphi \gamma )\hfill & \end{array}$$ 
The denotational semantics is given in Figure 7. Note that it uses the nofreevariable assumption in the clause for recursive functions. Apart from the semantics of reverse differentiation, which uses the reversemode derivative ${\mathrm{d}}^{r}$, it is quite standard. However, the facts that the denotations of terms carry smooth functions to smooth functions, and that the denotations of boolean terms carry smooth functions to continuous ones, use the mathematics developed in the previous section, particularly: the chain rule, e.g., for function application and let constructs; the preservation of smooth functions by the conditional combinator; the remarks on products; and, for recursive function definitions, the fact that the lub of an increasing sequence of smooth functions is smooth.
If a term $M$ contains no function variables (or variables), $[[M]]\varphi \gamma $ is independent of $\varphi $ (and $\gamma $), and we write $[[M]]\gamma $ (resp., $[[M]]$) for it. Trace terms $C$ have no function variables, and closed values $V$ have no function variables or variables.
Using the semantics of terms, we can define the denotational semantics of value environments, function environments, and closures, the latter two by a mutual structural induction.

•
For every $\rho :\mathrm{\Gamma}$, where $\rho =\{{x}_{0}\mapsto {V}_{0},\mathrm{\dots},{x}_{n1}\mapsto {V}_{n1}\}$, we define $[[\rho :\mathrm{\Gamma}]]\in [[\mathrm{\Gamma}]]$ by:
$$[[\rho :\mathrm{\Gamma}]]=([[{V}_{0}]],\mathrm{\dots},[[{V}_{n1}]])$$ 
•
For every $\phi :\mathrm{\Phi}$, where $\phi =\{{f}_{0}\mapsto {\mathrm{Cl}}_{0},\mathrm{\dots},{f}_{n1}\mapsto {\mathrm{Cl}}_{n1}\}$, we define $[[\phi :\mathrm{\Phi}]]\in [[\mathrm{\Phi}]]$ by:
$$[[\phi :\mathrm{\Phi}]]=([[C{l}_{0}]],\mathrm{\dots},[[C{l}_{n1}]])$$ 
•
For every $\mathrm{Cl}={\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{\phi}(f(x:T):U.M):T\to U$ we define $[[\mathrm{Cl}]]\in \mathcal{S}[[[T]],[[U]]]$ by:
$$\begin{array}{ccc}[[\mathrm{Cl}]]\hfill & \hfill =\hfill & \mu \alpha :\mathcal{S}[[[T]],[[U]]].\lambda a:[[T]].[[M]]([[\phi ]][\alpha /f])\{x\mapsto a\}\hfill \end{array}$$
We omit $\mathrm{\Gamma}$ and $\mathrm{\Phi}$ from $[[\rho :\mathrm{\Gamma}]]$ and $[[\phi :\mathrm{\Phi}]]$ when they can be understood from the context.
Lemma 5.1 ().
For any type $T$ we have:

(1)
The denotation $[[V]]$ of any value $V:T$ exists.

(2)
For any $v\in [[T]]$ there is a unique value $V:T$ such that $v=[[V]]$.
The following two fairly standard results about evaluation contexts are needed to show adequacy.
Lemma 5.2 (Evaluation context compositionality).
Suppose $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}M\mathrm{:}U$, $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}E\mathit{}\mathrm{[}M\mathrm{]}\mathrm{:}T$, and $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}N\mathrm{:}U$. Then:
$$[[M]]\varphi \gamma \simeq [[N]]\varphi \gamma \u27f9[[E[M]]]\varphi \gamma \simeq [[E[N]]]\varphi \gamma \mathit{\hspace{1em}\hspace{1em}}(\varphi \in [[\mathrm{\Phi}]],\gamma \in [[\mathrm{\Gamma}]])$$ 
Analogous results hold for: boolean terms in evaluation contexts, $E\mathit{}\mathrm{[}B\mathrm{]}$; terms in boolean evaluation contexts ${E}_{\mathrm{bool}}\mathit{}\mathrm{[}M\mathrm{]}$; and boolean terms in boolean evaluation contexts ${E}_{\mathrm{bool}}\mathit{}\mathrm{[}B\mathrm{]}$.
Lemma 5.3 (Evaluation context strictness).

(1)
Suppose that $\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2E[M]:T$ and $\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:U$. Then, for any $x\notin \mathrm{FV}(E)$, we have:
$$[[E[M]]]\varphi \gamma \downarrow \u27f9[[M]]\varphi \gamma \downarrow \mathit{\hspace{1em}}(\varphi \in [[\mathrm{\Phi}]],\gamma \in [[\mathrm{\Gamma}]])$$ and so:
$$[[E[M]]]\varphi \gamma \simeq [[E[x]]]\varphi (\gamma [[[M]]\varphi \gamma /x])\mathit{\hspace{1em}}(\varphi \in [[\mathrm{\Phi}]],\gamma \in [[\mathrm{\Gamma}]])$$ The analogous result holds for terms in boolean contexts ${E}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}[M]$.

(2)
Suppose that $\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2E[B]:T$ and $\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2B$. Then:
$$[[E[B]]]\varphi \gamma \downarrow \u27f9[[B]]\varphi \gamma \downarrow \mathit{\hspace{1em}}(\varphi \in [[\mathrm{\Phi}]],\gamma \in [[\mathrm{\Gamma}]])$$ The analogous result holds for boolean terms in boolean contexts ${E}_{\mathrm{\U0001d68b\U0001d698\U0001d698\U0001d695}}[B]$.
6. Adequacy
We present our main results, on the correspondence between operational and denotational semantics. Taken together these are our adequacy theorems. As well as the usual correctness and completeness theorems, there are results peculiar to differentiation, both of interest in themselves and also necessary for the others. Theorem 6.1 shows the correctness of our source code transformation of trace terms, in that formal differentiation corresponds to actual differentiation. Theorem 6.2 has two parts. The first is the usual statement of correctness for the ordinary evaluation relation. The second is a statement of correctness of symbolic evaluation. This states that the trace term resulting from symbolic evaluation of a term has the same denotation not only at the environment used for the symbolic evaluation but on a whole open set including it. This, and Theorem 6.1, are needed to prove the ordinary correctness of the ordinary evaluation relation in the case of differentiation as in that case ordinary evaluation proceeds by first symbolically differentiating and then applying our source code transformation. Finally Theorem 6.7 is the expected completeness theorem, that if the semantics of a term is defined, then both its ordinary and symbolic evaluation terminate. Its proof makes use of Proposition 3.5 which interpolates a symbolic evaluation inside any ordinary (nonboolean) evaluation.
6.1. Operational correctness
Theorem 6.1 (Reversemode differentiation).
Suppose that $\mathrm{\Gamma}\mathrm{[}x\mathrm{:}T\mathrm{]}\mathrm{\u22a2}C\mathrm{:}U$, $\mathrm{\Gamma}\mathrm{\u22a2}V\mathrm{:}T$ and $\mathrm{\Gamma}\mathrm{\u22a2}W\mathrm{:}U$ (and so $\mathrm{\Gamma}\mathrm{\u22a2}W\mathrm{.}{\mathrm{rd}}_{V}\mathrm{(}x\mathrm{:}T\mathrm{.}C\mathrm{)}\mathrm{:}T$). Then, for any $\gamma \mathrm{\in}\mathrm{[}\mathrm{[}\mathrm{\Gamma}\mathrm{]}\mathrm{]}$, we have:
$$[[W.{\mathcal{R}}_{V}(x:T.C)]]\gamma \simeq [[W.{\mathrm{\U0001d69b\U0001d68d}}_{V}(x:T.C)]]\gamma $$ 
Proof.
The proof is by structural induction on $C$. We give a representative case. Suppose $C$ is $\mathrm{\U0001d695\U0001d68e\U0001d69d}y:U=D\mathrm{\U0001d692\U0001d697}E$. Set ${\gamma}_{a}=\gamma [a/x]$, for any $a\in [[T]]$, ${\gamma}_{V}={\gamma}_{[[V]]\gamma}$, and ${\gamma}^{\prime}={\gamma}_{V}[[[D]]{\gamma}_{V}/y]$. Then
$$[[W.{\mathcal{R}}_{V}(x:T.C)]]\gamma \simeq [[W.{\mathcal{R}}_{V}(x:T.E){+}_{T}(\mathrm{\U0001d695\U0001d68e\U0001d69d}\overline{y}:S=W.{\mathcal{R}}_{y}(y:S.E)\mathrm{\U0001d692\U0001d697}\overline{y}.{\mathcal{R}}_{V}(x:T.D))]]{\gamma}^{\prime}$$ 
with $x\notin \mathrm{FV}(V,W)$, $y\notin \mathrm{FV}(W)$, and $y,\overline{y}\notin \mathrm{FV}(V,D)$. We calculate:
$$\begin{array}{ccc}{\mathrm{d}}_{[[V]]\gamma}^{r}(a\in [[T]]\mapsto [[\mathrm{\U0001d695\U0001d68e\U0001d69d}y:U=D\mathrm{\U0001d692\U0001d697}E]]{\gamma}_{a})[[W]]\gamma \hfill & & \\ \simeq \mathit{\hspace{1em}}{\mathrm{d}}_{[[V]]\gamma}^{r}(a\in [[T]]\mapsto [[E]]{\gamma}_{a}[[[D]]{\gamma}_{a}/y])[[W]]\gamma \hfill & & \\ \simeq \hspace{1em}{\mathrm{d}}_{[[V]]\gamma}^{r}((c\in [[T\times U]]\mapsto [[E]]{\gamma}_{{\pi}_{0}(c)}[{\pi}_{1}(c)/y])\circ \hfill & & \\ \u27e8a\in [[T]]\mapsto a,a\in [[T]]\mapsto [[D]]{\gamma}_{a}\u27e9)[[W]]\gamma \hfill & & \\ \simeq \hspace{1em}{\mathrm{d}}_{[[V]]\gamma}^{r}(\u27e8a\in [[T]]\mapsto a,a\in [[T]]\mapsto [[D]]{\gamma}_{a}\u27e9)[\hfill & & \\ ({\mathrm{d}}_{\u27e8[[V]]\gamma ,[[D]]{\gamma}_{V}\u27e9}^{r}(c\in [[T\times U]]\mapsto [[E]]{\gamma}_{{\pi}_{0}(c)}[{\pi}_{1}(c)/y]))[[W]]\gamma ]\hfill & & \\ \simeq \hspace{1em}{\mathrm{d}}_{[[V]]\gamma}^{r}(\u27e8a\in [[T]]\mapsto a,a\in [[T]]\mapsto [[D]]{\gamma}_{a}\u27e9)[\hfill & & \\ \u27e8{\mathrm{d}}_{[[V]]\gamma}^{r}(a\in [[T]]\mapsto [[E]]{\gamma}_{a}[[[D]]{\gamma}_{V}/y]),{\mathrm{d}}_{[[D]]{\gamma}_{V}}^{r}(b\in [[U]]\mapsto [[E]]{\gamma}_{[[V]]\gamma}[b/y])\u27e9[[W]]\gamma ]\hfill & & \\ \simeq \hspace{1em}{\mathrm{d}}_{[[V]]\gamma}^{r}(\u27e8a\in [[T]]\mapsto a,a\in [[T]]\mapsto [[D]]{\gamma}_{a}\u27e9)[\hfill & & \\ \u27e8{\mathrm{d}}_{[[V]]\gamma}^{r}(a\in [[T]]\mapsto [[E]]{\gamma}_{a}[[[D]]{\gamma}_{V}/y])[[W]]\gamma ,{\mathrm{d}}_{[[D]]{\gamma}_{V}}^{r}(b\in [[U]]\mapsto [[E]]{\gamma}_{[[V]]\gamma}[b/y])[[W]]\gamma \u27e9]\hfill & & \\ \simeq \hspace{1em}{\mathrm{d}}_{[[V]]{\gamma}^{\prime}}^{r}(a\in [[T]]\mapsto a)[{\mathrm{d}}_{[[V]]\gamma}^{r}(a\in [[T]]\mapsto [[E]]{\gamma}_{a}[[[D]]{\gamma}_{V}/y])[[W]]\gamma ]\hfill & & \\ +{\mathrm{d}}_{[[V]]{\gamma}^{\prime}}^{r}(a\in [[T]]\mapsto [[D]]{\gamma}^{\prime})[{\mathrm{d}}_{[[D]]{\gamma}_{V}}^{r}(b\in [[U]]\mapsto [[E]]{\gamma}_{[[V]]\gamma}[b/y])[[W]]\gamma ]\hfill & & \\ \simeq \mathit{\hspace{1em}}{\mathrm{d}}_{[[V]]\gamma}^{r}(a\in [[T]]\mapsto [[E]]{\gamma}_{a}[[[D]]{\gamma}_{V}/y])[[W]]\gamma \hfill & & \\ +{\mathrm{d}}_{[[V]]{\gamma}^{\prime}}^{r}(a\in [[T]]\mapsto [[D]]{\gamma}^{\prime})[{\mathrm{d}}_{[[D]]{\gamma}_{V}}^{r}(b\in [[U]]\mapsto [[E]]{\gamma}_{[[V]]\gamma}[b/y])[[W]]\gamma ]\hfill & & \\ \simeq \mathit{\hspace{1em}}[[W.{\mathcal{R}}_{V}(x:T.E){+}_{T}(\mathrm{\U0001d695\U0001d68e\U0001d69d}\overline{y}:S=W.{\mathcal{R}}_{y}(y:S.E)\mathrm{\U0001d692\U0001d697}\overline{y}.{\mathcal{R}}_{V}(x:T.D))]]{\gamma}^{\prime}\hfill & & \end{array}$$ 
Note the use of Equation (10) in the the fourth step. ∎
Theorem 6.2 (Operational correctness).
Suppose that $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}M\mathrm{:}T$, $\mathrm{\u22a2}\phi \mathrm{:}\mathrm{\Phi}$, and $\mathrm{\u22a2}\rho \mathrm{:}\mathrm{\Gamma}$. Then:

(1)
Operational semantics.
$$\phi \mid \rho \u22a2M\Rightarrow V\u27f9[[M]][[\phi ]][[\rho ]]=[[V]]$$ (and similarly for boolean terms).

(2)
Symbolic operational semantics.
$$\begin{array}{c}\phi \mid \rho \u22a2M\rightsquigarrow C\u27f9\exists O{\subseteq}_{\text{\mathit{o}\mathit{p}\mathit{e}\mathit{n}}}[[\mathrm{\Gamma}]].[[\rho ]]\in O\wedge \forall \gamma \in O.[[M]][[\phi ]]\gamma \simeq [[C]]\gamma \hfill \end{array}$$
Proof.
The two parts are proved by mutual induction on the size of the proofs that establish the given operational relations, and by cases on the form of $M$. As an example case of the second part, suppose $M$ has the form $\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}L$. Then for some $D$ and ${V}^{\prime}$ we have a smaller proof of $\phi \mid \rho [{V}^{\prime}/x]\u22a2L\rightsquigarrow D$, where ${V}^{\prime}=\rho (V)$, and $C$ has the form $\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}D$.
By the induction hypothesis there is an open set $O$ such that $[[\rho [{V}^{\prime}/x]]]\in O$ and, for all $\delta \in O$, we have $[[L]][[\phi ]]\delta \simeq [[D]]\delta $. Set $\theta (\gamma )=[[\mathrm{\Gamma}]]\mapsto \gamma [[[V]]\gamma /x]$. As $\theta $ is continuous, ${O}^{\prime}{=}_{\text{def}}{\theta}^{1}(O)$ is open. We show it is the required open set.


First, $[[\rho ]]\in {O}^{\prime}$ as we have: $\theta ([[\rho ]])=[[\rho ]][[[V]][[\rho ]]/x]=[[\rho ]][[[{V}^{\prime}]]/x]=[[\rho [{V}^{\prime}/x]]]\in O$.


Second, for any $\gamma \in {O}^{\prime}$ we have:
$$\begin{array}{ccc}[[\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}L]][[\phi ]]\gamma \hfill & \hfill \simeq \hfill & [[L]][[\phi ]]\gamma [[[V]]\gamma /x]\hfill \\ & \hfill \simeq \hfill & [[D]]\gamma [[[V]]\gamma /x]\hfill \\ & \hfill \simeq \hfill & [[\mathrm{\U0001d695\U0001d68e\U0001d69d}x:T=V\mathrm{\U0001d692\U0001d697}D]]\gamma \hfill \end{array}$$ with the second equality holding as $\theta (\gamma )\in O$.
∎
The following corollary shows that our strategy of first symbolically reducing to produce a trace term, then symbolically differentiating, is correct.
Corollary 6.3 ().
Suppose that $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{[}x\mathrm{:}T\mathrm{]}\mathrm{\u22a2}M\mathrm{:}U$, $\mathrm{\Gamma}\mathrm{\u22a2}V\mathrm{:}T$, and $\mathrm{\Gamma}\mathrm{\u22a2}W\mathrm{:}U$. Then, for any $\mathrm{\u22a2}\phi \mathrm{:}\mathrm{\Phi}$ and $\mathrm{\u22a2}\rho \mathrm{:}\mathrm{\Gamma}$ we have:
$$\begin{array}{c}\phi \mid \rho [V\rho /x]\u22a2M\rightsquigarrow C\u27f9[[W.{\mathrm{\U0001d69b\U0001d68d}}_{V}(x:T.M)]][[\phi ]][[\rho ]]\simeq [[W.{\mathrm{\U0001d69b\U0001d68d}}_{V}(x:T.C)]][[\phi ]][[\rho ]]\hfill \end{array}$$ 
6.2. Operational completeness
We turn to proving operational completeness, that the evaluation, or symbolic evaluation, of a term terminates when it should, i.e., when its denotation is defined. For terms $M$ write $\phi \mid \rho \u22a2M\Downarrow $ when, for some $V$, $\phi \mid \rho \u22a2M\Rightarrow V$ and, assuming $\phi $ and $\rho $ known from the context, say that $M$ terminates, and adopt similar terminology for boolean terms.
To prove operational completeness we use a standard strategy: first proving operational completeness for an auxiliary “approximation language” in which recursive definitions are replaced by approximations to them, which we call limited recursive definitions, and then lifting that result to the main language. Specifically we replace the syntactic form for recursive definitions by the family of syntactic forms:
$${\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}}_{n}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N\mathit{\hspace{1em}\hspace{1em}}(n\in \mathbb{N})$$ 
with the evident typing rule, and make analogous consequential changes in the various definitions.
In the definition of function environments and closures, the clause for closures becomes:


If $\mathrm{FFV}(M)\backslash \{f\}\subseteq \mathrm{Dom}(\phi )$, $\mathrm{FV}(M)\subseteq \{x\}$, and $n\in \mathbb{N}$, then
$$\u27e8n,\phi ,f,x,T,U,M\u27e9$$ is a closure, written as: ${\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n,\phi}(f(x:T):U.M)$.
The limited recursive definition redexes are ${\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}}_{n}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N$; their evaluation rules are in Figure 8. There, and below, we write ${\u22a2}_{l}$ for limited recursion language judgements.
$$\begin{array}{c}\frac{\phi [{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n,\phi}(f(x:T):U.M)/f]\mid \rho {\u22a2}_{l}N\Rightarrow {V}^{\prime}}{\phi \mid \rho {\u22a2}_{l}{\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}}_{n}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N\Rightarrow {V}^{\prime}}\hfill \\ \\ \frac{\phi \mid \rho {\u22a2}_{l}V\Rightarrow {V}^{\prime}{\phi}^{\prime}[{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n,{\phi}^{\prime}}(f(x:T):U.M)/f]\mid \{x\mapsto {V}^{\prime}\}{\u22a2}_{l}M\Rightarrow W}{\phi \mid \rho {\u22a2}_{l}f(V)\Rightarrow W}\hspace{1em}\text{where}\phi \left(f\right)={\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n+1,{\phi}^{\prime}}(f\left(x:T\right):U.M)\hfill \\ \\ \frac{\phi [{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n,\phi}(f(x:T):U.M)/f]\mid \rho {\u22a2}_{l}N\rightsquigarrow C}{\phi \mid \rho {\u22a2}_{l}{\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}}_{n}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N\rightsquigarrow C}\hfill \\ \\ \frac{\phi \mid \rho {\u22a2}_{l}V\Rightarrow {V}^{\prime}{\phi}^{\prime}[{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n,{\phi}^{\prime}}(f(x:T):U.M)/f]\mid \{x\mapsto {V}^{\prime}\}{\u22a2}_{l}M\rightsquigarrow C}{\phi \mid \rho {\u22a2}_{l}f(V)\rightsquigarrow C}\hspace{1em}\text{where}\phi \left(f\right)={\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n+1,{\phi}^{\prime}}(f\left(x:T\right):U.M)\hfill \end{array}$$ 
For the closure typing judgement $\mathrm{Cl}:T\to U$ we substitute:
$$\frac{{\u22a2}_{l}\phi :\mathrm{\Phi}\mathit{\hspace{1em}}\mathrm{\Phi},\{x\mapsto T\}{\u22a2}_{l}M:U}{{\u22a2}_{l}{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n,\phi}(f(x:T):U.M):T\to U}$$ 
As regards the denotational semantics, the clause for limited recursive definitions is:
$$\begin{array}{c}[[{\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}}_{n}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N]](\varphi )(\gamma )\simeq \hfill \\ [[N]](\varphi [{\mu}_{n}\alpha :\mathcal{S}[[[T]],[[U]]].\lambda a:[[T]].[[M]](\varphi [\alpha /f])\{x\mapsto a\}/f])\gamma \hfill \end{array}$$ 
The results for the operational and denotational semantics carry over to the restricted setting, and we refer to them in the same way as we do to the unrestricted versions.
Relating the two languages, the $n$th approximant ${M}^{\mathrm{(}n\mathrm{)}}$ of a language term $M$ is obtained by replacing every recursive definition in $M$ by an $n$limited one (similarly for boolean terms) and $n$th approximants of closures and function environments are defined by structural recursion:
$$\begin{array}{ccc}{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{\phi}{(f(x:T):U.M)}^{(n)}\hfill & \hfill =\hfill & {\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n,{\phi}^{(n)}}(f(x:T):U.{M}^{(n)})\hfill \end{array}$$ 
$${\{\mathrm{\dots},{f}_{i}\mapsto {\mathrm{Cl}}_{i},\mathrm{\dots}\}}^{(n)}=\{\mathrm{\dots},{f}_{i}\mapsto {\mathrm{Cl}}_{i}^{(n)},\mathrm{\dots}\}$$ 
The terms ${M}^{(n)}$ can be defined by structural recursion; we just give one clause of the definition:
$$\begin{array}{c}{(\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}f(x:T):U=M\mathrm{\U0001d692\U0001d697}N)}^{(n)}={\mathrm{\U0001d695\U0001d68e\U0001d69d\U0001d69b\U0001d68e\U0001d68c}}_{n}f(x:T):U={M}^{(n)}\mathrm{\U0001d692\U0001d697}{N}^{(n)}\hfill \end{array}$$ 
Approximation preserves typing judgments.
Termination is proved for the approximation language by structural induction via a suitable notion of computability.


A closure
$${\u22a2}_{l}{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n,\phi}(f(x:T):U.M):T\to U$$ is computable iff $n=0$ or $n>0$ and, for all ${\u22a2}_{l}V:T$ we have:
$$\begin{array}{c}[[M]][[\phi [{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n1,\phi}(f(x:T):U.M)/f]]]\{x\mapsto [[V]]\}\downarrow \u27f9\hfill \\ \phi [{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n1,\phi}(f(x:T):U.M)/f]\mid \{x\mapsto V\}{\u22a2}_{l}M\Downarrow \hfill \end{array}$$ 

A function environment ${\u22a2}_{l}\phi :\mathrm{\Phi}$ is computable iff ${\u22a2}_{l}\phi (f):\mathrm{\Phi}(f)$ is a computable closure, for every $f\in \mathrm{Dom}(\phi )$.


A term $\mathrm{\Phi}\mid \mathrm{\Gamma}{\u22a2}_{l}M:T$ is computable iff for every computable ${\u22a2}_{l}\phi :\mathrm{\Phi}$ and every ${\u22a2}_{l}\rho :\mathrm{\Gamma}$
$$[[M]][[\phi ]][[\rho ]]\downarrow \u27f9\phi ,\rho {\u22a2}_{l}M\Downarrow $$ (and similarly for boolean terms).
Strictly speaking, in the above we should say that it is the sequent $\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T$ that is computable and similarly for closures and function environments.
Lemma 6.4 ().

(1)
Every closure $\u22a2{\mathrm{\mathbf{c}\mathbf{l}\mathbf{o}}}_{n,\phi}(f(x:T):U.M):T\to U$ is computable.

(2)
Every function environment $\u22a2\phi :\mathrm{\Phi}$ is computable.

(3)
Every welltyped term $\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2M:T$ is computable.

(4)
Every welltyped boolean term $\mathrm{\Phi}\mid \mathrm{\Gamma}\u22a2B$ is computable.
The next two lemmas enable us to lift completeness from the approximation language to the main one. The first lets us pass from semantic existence in the main language to semantic existence in the approximation language; the second allows us to pass in the opposite direction from termination in the approximation language to termination in the main one.
Lemma 6.5 ().
For any welltyped term $M$ we have:
$$[[M]]=\underset{n\in \mathbb{N}}{\bigvee}[[{M}^{(n)}]]$$ 
and similarly for boolean terms, closures and function environments.
Lemma 6.6 ().
For any welltyped term $M$ we have:
$${\phi}^{(n)}\mid \rho {\u22a2}_{l}{M}^{(n)}\Rightarrow V\u27f9\phi \mid \rho \u22a2M\Rightarrow V$$ 
and similarly for boolean terms.
Operational completeness follows straightforwardly from these three lemmas:
Theorem 6.7 (Operational completeness).
Suppose that $\mathrm{\Phi}\mathrm{\mid}\mathrm{\Gamma}\mathrm{\u22a2}M\mathrm{:}T$, $\mathrm{\u22a2}\phi \mathrm{:}\mathrm{\Phi}$, and $\mathrm{\u22a2}\rho \mathrm{:}\mathrm{\Gamma}$, then:

(1)
Operational semantics.
$$[[M]][[\phi ]][[\rho ]]\downarrow \u27f9\phi \mid \rho \u22a2M\Downarrow $$ (and similarly for boolean terms).

(2)
Symbolic operational semantics.
$$[[M]][[\phi ]][[\rho ]]\downarrow \u27f9\exists C.\phi \mid \rho \u22a2M\rightsquigarrow C$$
Proof.

(1)
Suppose that $[[M]][[\phi ]][[\rho ]]\downarrow $. Then by Lemma 6.5 we have $[[{M}^{(n)}]][[{\phi}^{(n)}]][[\rho ]]\downarrow $, for some $n\in \mathbb{N}$. By Lemma 6.4, both ${M}^{(n)}$ and ${\phi}^{(n)}$ are computable. So ${\phi}^{(n)}\mid \rho \u22a2{M}^{(n)}\Downarrow $. But then, by Lemma 6.6, we have $\phi \mid \rho \u22a2M\Downarrow $, as required. The proof for boolean terms is similar.

(2)
This follows from the first part and Proposition 3.5.
∎
7. Discussion
There is much more to do on the theory of differentiable programming languages, even at a basic level; we briefly suggest some possibilities. Tracebased automatic differentiation systems generally work with Anormal forms, or equivalent structures. They also employ optimizations. For example, as mentioned in the Introduction, they may record auxiliary information in evaluation traces to reduce recomputation. Other automatic differentiation systems rely on code transformations for differentiation. It would be interesting to define and study such optimizations and alternative approaches, perhaps in the setting of our language.
Another direction of interest would be to deal with nondifferentiable functions used in practice like ReLU or with nonsmooth functions. For the former, one might use Clarke subgradients (Clarke, 1990), following (Di Gianantonio and Edalat, 2013) (the Clarke subgradient of ReLU at $0$ is the interval $[0,1]$); for the latter, one may use ${C}^{k}$ functions. Still another possible such direction of research would be to work with approximate reals, rather than reals, and to seek numerical accuracy theorems. Here it would be natural to work with Scott’s interval domain and, generalizing the Clarke subgradient, a domaintheoretic notion of subdifferentiation of functions over the interval domain, as discussed in (Edalat and Lieutier, 2004; Edalat and Maleki, 2018).
One would like results for richer languages, with a wider range of types or with computational effects. The problem then to be solved is how these additional features interact with differentiation. An extension to sideeffects would make contact with the literature on the automatic differentiation of imperative languages, such as Fortran. An extension to probability, in some form, would make contact with stochastic optimization and, further, with probabilistic languages for statistical learning. For higherorder types, there may be a domaintheoretic analogue of convenient vector spaces that supports the combination of higherorder types and recursion. Further, one might, as suggested in (Vákár et al., 2018), seek a domaintheoretic analogue of diffeological spaces (see (IglesiasZemmour, 2013)); that would also accommodate sum and recursive types. One might also wish to program with Riemannian manifolds, to accommodate natural gradient descent (Amari, 1996); these too should fit into a diffeological framework. In another direction, the work on categories with differential structure may yield an axiomatic version of adequacy theorems for programming languages with differentiation constructs; such categories further equipped with structure to model partiality (Cockett et al., 2011), are of particular interest.
Finally, if perhaps orthogonally, it is important to add explicit tensor (multidimensional array) types, with accompanying shape analysis. There is a long history of programminglanguage design in this area; a salient example is the design of Remora (Slepak et al., 2014).
References
 (1)
 Abadi et al. (2016a) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016a. TensorFlow: a system for largescale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, 265–283. https://www.usenix.org/conference/osdi16/technicalsessions/presentation/abadi
 Abadi et al. (2016b) Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016b. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS ’16). ACM, 308–318. https://doi.org/10.1145/2976749.2978318
 Abramsky and Jung (1994) Samson Abramsky and Achim Jung. 1994. Domain theory. In Handbook of Logic in Computer Science (Vol. 3), Samson Abramsky, Dov M. Gabbay, and T. S. E. Maibaum (Eds.). Oxford University Press, Inc., 1–168. http://dl.acm.org/citation.cfm?id=218742.218744
 Amari (1996) Shunichi Amari. 1996. Neural learning in structured parameter spaces — natural Riemannian gradient. In Advances in Neural Information Processing Systems 9, NIPS, Michael Mozer, Michael I. Jordan, and Thomas Petsche (Eds.). MIT Press, 127–133. http://papers.nips.cc/paper/1248neurallearninginstructuredparameterspacesnaturalriemanniangradient
 Baydin et al. (2018) Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2018. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research 18, 153 (2018), 1–43. http://jmlr.org/papers/v18/17468.html
 Baydin et al. (2016) Atilim Günes Baydin, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. Tricks from deep learning. CoRR abs/1611.03777 (2016). http://arxiv.org/abs/1611.03777
 Beck and Fischer (1994) Thomas Beck and Herbert Fischer. 1994. The ifproblem in automatic differentiation. J. Comput. Appl. Math. 50, 13 (May 1994), 119–131. https://doi.org/10.1016/03770427(94)902941
 Bergstra et al. (2010) James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David WardeFarley, and Yoshua Bengio. 2010. Theano: A CPU and GPU math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), Vol. 4. http://wwwetud.iro.umontreal.ca/~wardefar/publications/theano_scipy2010.pdf
 Bertot and Castéran (2013) Yves Bertot and Pierre Castéran. 2013. Interactive theorem proving and program development: Coq’Art: the calculus of inductive constructions. Springer Science & Business Media.
 Blute et al. (2010) Richard Blute, Thomas Ehrhard, and Christine Tasson. 2010. A convenient differential category. arXiv preprint arXiv:1006.3140 (2010).
 Blute et al. (2009) Richard F Blute, J Robin B Cockett, and Robert AG Seely. 2009. Cartesian differential categories. Theory and Applications of Categories 22, 23 (2009), 622–672.
 Bucciarelli et al. (2010) Antonio Bucciarelli, Thomas Ehrhard, and Giulio Manzonetto. 2010. Categorical models for simply typed resource calculi. Electronic Notes in Theoretical Computer Science 265 (2010), 213–230.
 Christianson (2012) Bruce Christianson. 2012. A Leibniz notation for automatic differentiation. In Recent Advances in Algorithmic Differentiation, Shaun Forth, Paul Hovland, Eric Phipps, Jean Utke, and Andrea Walther (Eds.). Lecture Notes in Computational Science and Engineering, Vol. 87. Springer, 1–9.
 Clarke (1990) Frank H. Clarke. 1990. Optimization and nonsmooth analysis. Classics in Applied Mathematics, Vol. 5. SIAM.
 Cockett et al. (2011) J Robin B Cockett, Geoff SH Cruttwell, and Jonathan D Gallagher. 2011. Differential restriction categories. Theory and Applications of Categories 25, 21 (2011), 537–613.
 de Moura et al. (2015) Leonardo Mendonça de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob von Raumer. 2015. The Lean Theorem Prover (System Description). In Automated Deduction  CADE25  25th International Conference on Automated Deduction, Berlin, Germany, August 17, 2015, Proceedings (Lecture Notes in Computer Science), Amy P. Felty and Aart Middeldorp (Eds.), Vol. 9195. Springer, 378–388. https://doi.org/10.1007/9783319214016_26
 Di Gianantonio and Edalat (2013) Pietro Di Gianantonio and Abbas Edalat. 2013. A language for differentiable functions. In Foundations of Software Science and Computation Structures, Frank Pfenning (Ed.). Springer, 337–352.
 Edalat and Lieutier (2004) Abbas Edalat and André Lieutier. 2004. Domain theory and differential calculus (functions of one variable). Mathematical Structures in Computer Science 14, 6 (2004), 771–802. https://doi.org/10.1017/S0960129504004359
 Edalat and Maleki (2018) Abbas Edalat and Mehrdad Maleki. 2018. Differential calculus with imprecise input and Its logical framework. In Foundations of Software Science and Computation Structures  21st International Conference, FOSSACS 2018. 459–475. https://doi.org/10.1007/9783319893662_25
 Ehrhard and Regnier (2003) Thomas Ehrhard and Laurent Regnier. 2003. The differential lambdacalculus. Theoretical Computer Science 309, 13 (2003), 1–41.
 Elliott (2018) Conal Elliott. 2018. The simple essence of automatic differentiation. In Proceedings of the ACM on Programming Languages (ICFP). http://conal.net/papers/essenceofad/
 Felleisen and Friedman (1987) Matthias Felleisen and Daniel P. Friedman. 1987. Control operators, the SECDmachine and the $\lambda $calculus. In Formal Description of Programming Concepts III, M. Wirsing (Ed.). Elsevier, 193–217.
 Fischer (2001) H. Fischer. 2001. Automatic differentiation: root problem and branch problem. In Encyclopedia of Optimization, C. A. Floudas and P. M. Pardalos (Eds.). Vol. I. Kluwer Academic Publishers, 118–122.
 Frostig et al. (2018) Roy Frostig, Matthew James Johnson, and Chris Leary. 2018. Compiling machine learning programs via highlevel tracing. Presented at SysML 2018, available at https://www.sysml.cc/doc/146.pdf.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. https://www.deeplearningbook.org Available at https://www.deeplearningbook.org.
 Griewank (2000) Andreas Griewank. 2000. Evaluating derivatives  principles and techniques of algorithmic differentiation. Frontiers in applied mathematics, Vol. 19. SIAM.
 Hascoët and Pascual (2013) L. Hascoët and V. Pascual. 2013. The Tapenade automatic differentiation tool: principles, model, and specification. ACM Transactions On Mathematical Software 39, 3 (2013). http://dx.doi.org/10.1145/2450153.2450158
 IglesiasZemmour (2013) P. IglesiasZemmour. 2013. Diffeology. American Mathematical Society. Available at https://books.google.com/books?id=Nb0xAAAAQBAJ.
 Kriegl and Michor (1997) Andreas Kriegl and Peter W Michor. 1997. The convenient setting of global analysis. Vol. 53. American Mathematical Soc.
 Maclaurin (2017) Dougal Maclaurin. 2017. The chain rule unchained: modeling, optimization and inference with Autograd. Slides of a talk based on joint work with David Duvenaud and Matthew J. Johnson, available at https://dougalmaclaurin.com/talk.pdf.
 Maclaurin et al. (2015) Dougal Maclaurin, David Duvenaud, and Ryan P Adams. 2015. Autograd: effortless gradients in Numpy. In ICML 2015 AutoML Workshop, Vol. 238.
 Manzyuk (2012) Oleksandr Manzyuk. 2012. A simply typed $\lambda $calculus of forward automatic differentiation. Electronic Notes in Theoretical Computer Science 286 (2012), 257 – 272. https://doi.org/10.1016/j.entcs.2012.08.017 Proceedings of the 28th Conference on the Mathematical Foundations of Programming Semantics (MFPS XXVIII).
 Mayero (2002) Micaela Mayero. 2002. Using theorem proving for numerical analysis (correctness proof of an automatic differentiation algorithm). In Theorem Proving in Higher Order Logics, 15th International Conference, TPHOLs 2002, Hampton, VA, USA, August 2023, 2002, Proceedings (Lecture Notes in Computer Science), Victor Carreño, César A. Muñoz, and Sofiène Tahar (Eds.), Vol. 2410. Springer, 246–262. https://doi.org/10.1007/3540456856_17
 Pearlmutter and Siskind (2008) Barak A. Pearlmutter and Jeffrey Mark Siskind. 2008. Reversemode AD in a functional framework: lambda the ultimate backpropagator. ACM Trans. Program. Lang. Syst. 30, 2 (2008), 7:1–7:36. https://doi.org/10.1145/1330017.1330018
 PyTorch (2018) PyTorch. 2018. http://pytorch.org http://pytorch.org.
 Selsam et al. (2017) Daniel Selsam, Percy Liang, and David L. Dill. 2017. Developing bugfree machine learning systems with formal mathematics. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 611 August 2017 (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 3047–3056. http://proceedings.mlr.press/v70/selsam17a.html
 Shaikhha et al. (2018) Amir Shaikhha, Andrew Fitzgibbon, Dimitrios Vytiniotis, Simon Peyton Jones, and Christoph Koch. 2018. Efficient differentiable programming in a functional arrayprocessing language. CoRR abs/1806.02136 (2018). arXiv:1806.02136 http://arxiv.org/abs/1806.02136 http://arxiv.org/abs/1806.02136.
 Shankar and Dobson (2017) Asim Shankar and Wolff Dobson. 2017. Eager execution: an imperative, definebyrun interface to TensorFlow. https://research.googleblog.com/2017/10/eagerexecutionimperativedefineby.html https://research.googleblog.com/2017/10/eagerexecutionimperativedefineby.html.
 Siskind and Pearlmutter (2005) Jeffrey Mark Siskind and Barak A. Pearlmutter. 2005. Perturbation confusion and referential transparency: correct functional implementation of forwardmode AD. In Implementation and Application of Functional Languages—17th International Workshop, IFL’05, Andrew Butterfield (Ed.). 1–9. Trinity College Dublin, Computer Science Department Technical Report TCDCS200560.
 Siskind and Pearlmutter (2008) Jeffrey Mark Siskind and Barak A. Pearlmutter. 2008. Nesting forwardmode AD in a functional framework. HigherOrder and Symbolic Computation 21, 4 (2008), 361–376. https://doi.org/10.1007/s1099000890371
 Slepak et al. (2014) Justin Slepak, Olin Shivers, and Panagiotis Manolios. 2014. An arrayoriented language with static rank polymorphism. In Proceedings of the 23rd European Symposium on Programming Languages and Systems  Volume 8410. SpringerVerlag New York, Inc., 27–46. https://doi.org/10.1007/9783642548338_3
 Song et al. (2013) Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. 2013. Stochastic gradient descent with differentially private updates. In IEEE Global Conference on Signal and Information Processing, GlobalSIP 2013, Austin, TX, USA, December 35, 2013. IEEE, 245–248. https://doi.org/10.1109/GlobalSIP.2013.6736861
 Tokui (2018) Seiya Tokui. 2018. Chainer: a powerful, flexible and intuitive framework of neural networks. http://chainer.org http://chainer.org.
 Trench (2003) W.F. Trench. 2003. Introduction to Real Analysis. Prentice Hall/Pearson Education. Available at https://books.google.com/books?id=NkFkQgAACAAJ.
 Vákár et al. (2018) M. Vákár, O. Kammar, and S. Staton. 2018. Diffeological spaces and semantics for differential programming. https://andrejbauer.github.io/domainsfloc2018/slides/MatthijsKammarStaton.pdf. Presented at Domains XIII Workshop.
 van Merrienboer et al. (2018) Bart van Merrienboer, Dan Moldovan, and Alexander B. Wiltschko. 2018. Tangent: automatic differentiation using sourcecode transformation for dynamically typed array programming. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 38 December 2018, Montréal, Canada., Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò CesaBianchi, and Roman Garnett (Eds.). 6259–6268. http://papers.nips.cc/paper/7863tangentautomaticdifferentiationusingsourcecodetransformationfordynamicallytypedarrayprogramming
 Wang et al. (2018) Fei Wang, Xilun Wu, Grégory M. Essertel, James M. Decker, and Tiark Rompf. 2018. Demystifying differentiable programming: shift/reset the penultimate backpropagator. CoRR abs/1803.10228 (2018). arXiv:1803.10228 http://arxiv.org/abs/1803.10228 http://arxiv.org/abs/1803.10228.
 Yu et al. (2018) Yuan Yu, Martín Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis, Jeff Dean, Sanjay Ghemawat, Tim Harley, Peter Hawkins, Michael Isard, Manjunath Kudlur, Rajat Monga, Derek Murray, and Xiaoqiang Zheng. 2018. Dynamic control flow in largescale machine learning. In Proceedings of the Thirteenth EuroSys Conference (EuroSys ’18). ACM, Article 18, 15 pages. https://doi.org/10.1145/3190508.3190551