A Deep Reinforcement Learning based Approach to
Learning Transferable Proof Guidance Strategies
Abstract
Traditional first-order logic (FOL) reasoning systems usually rely on manual heuristics for proof guidance. We propose TRAIL: a system that learns to perform proof guidance using reinforcement learning. A key design principle of our system is that it is general enough to allow transfer to problems in different domains that do not share the vocabulary of the training set. To do so, we developed a novel representation of the internal state of a prover in terms of clauses and inference actions. We also propose a novel neural-based attention mechanism to learn interactions between clauses. We demonstrate that this approach enables the system to generalize from training to test data across domains with different vocabularies, suggesting that the neural architecture in TRAIL is well suited for representing and processing logical formalisms. We also show that TRAIL’s learned strategies provide performance comparable to an established heuristics-based theorem prover.
1 Introduction
Automated theorem provers (ATPs) have established themselves as useful tools for solving problems that are expressible in a variety of knowledge representation formalisms (e.g., first-order logic). Such problems are commonplace in areas core to computer science (e.g., compilers (curzon1991verified; leroy2009formal), operating systems (klein2009operating), and even distributed systems (hawblitzel2015ironfleet; garland1998ioa)), where ATPs are used to prove that a system satisfies some formal design specification.
Unfortunately, while the formalisms that underlie such problems have been (more or less) fixed, the strategies needed to solve them have been anything but. With each new domain added to the purview of automated theorem proving, there has been a need for the development of new heuristics and strategies that restrict or order how an ATP searches for proofs. The process of guiding a theorem prover during proof search is referred to as proof guidance. Though proof guidance heuristics have been shown to have a drastic impact on theorem proving performance (schulz2016performance), the specifics of when and why to use a particular strategy are still often hard to define (schulzwe).
Many state-of-the-art ATPs use machine learning to automatically determine heuristics for assisting with proof guidance. Generally, the features considered by such learned heuristics have been manually designed (KUVIJCAI15features; JUCICM17enigma), though more recently they have been learned through deep learning (LoosISKLPAR17atpnn; ChvaJaSUCoRR19enigmang; paliwal2019graph). However, while deep learning has the obvious advantage of reducing the amount of expert knowledge needed to handcraft new heuristics, its practical application to theorem proving has been limited to domains with plenty of training data (e.g., larger theories like Mizar (mizar40for40) or the recent Holist (BaLoRSWiCoRR19holist)).
There has also been work exploring the use of reinforcement learning (RL), where the system automatically learns the right proof guidance strategy from proof attempts. Though RL has been successfully deployed in logics less expressive than first-order logic (FOL) (KuYaSaCoRR18intuitpl; LeRaSeCoRR18heuristicsatprl; CheTiCoRR18rewrl), for more expressive reasoning it has mostly been directed towards synthetic domains (zombori2019towards) or large domains with plentiful training data (KalUMONeurIPS18atprl; BaLoRSWiCoRR19learning).
In this paper, we deviate from prior work that operates only in data-rich or synthetic domains. We instead aim to use deep reinforcement learning to guide theorem proving in data-sparse domains, which have so far been unable to benefit from deep-learning-driven advances in theorem proving. To do this, we introduce TRAIL, an RL-based proof guidance system designed to learn transferable strategies that can be leveraged across different problem domains within a logical formalism. At the heart of TRAIL are (a) a novel representation of clauses, which abstracts away the specifics of the vocabulary of a problem, and (b) a novel representation of the entire state of a theorem prover, which allows the system to consider the problem as a whole. This approach forces the system to focus less on individual patterns and symbols in a clause, and more on interactions between clause representations, which helps the system generalize across problem domains. To our knowledge, we are the first to design a deep reinforcement learning based proof guidance system with the explicit goal of building abstractions that suit transfer across problem domains.
We also evaluate multiple training regimens for transfer to determine which mechanism provides the best generalization: (a) we randomly explore from a tabula rasa state as done in AlphaZero (alphazero) to build a proof guidance system that maximizes exploration, (b) we explore the effectiveness of bootstrapping the system based on proofs from existing reasoners in a specific problem domain, where the system can effectively learn from highly optimized reasoners, and (c) we combine the effects of imitation with exploration. We examine how well the three regimens provide transfer across problem domains.
In summary, the core contributions of our work are as follows: (a) We propose a novel deep reinforcement learning based proof guidance system for FOL, which is general enough to allow transfer to problems in different domains. (b) We demonstrate the efficacy of this system by showing that it is just as effective as a system which is trained on a problem set where all commonality in vocabulary between problems is removed. (c) We show significant generalization across different problem domains, from Mizar to TPTP and vice versa, when compared to training and testing on the same domain. For these experiments we ensure no overlap between the two data sets in terms of problems and vocabularies by (1) removing common problems and (2) consistently renaming, within each problem, all predicates, functions, and constants. (d) We show that the performance of our neural-based system is competitive with saturation-based FOL reasoners, suggesting that the generalized learning shown by TRAIL approaches that of manually optimized reasoners.
2 Background: Reasoning in FOL
We assume the reader has knowledge of basic first-order logic and automated theorem proving terminology and thus only briefly describe the terms commonly seen throughout this paper. Readers interested in learning more about logical formalisms and techniques can consult (thelogicbook; enderton2001mathematical).
In this work, we focus on first-order logic (FOL) with equality. In the standard FOL problem-solving setting, an ATP is given a conjecture (i.e., a formula to be proved true or false), axioms (i.e., formulas known to be true), and inference rules (i.e., rules that, based on given formulas, allow for the derivation of new true formulas). From these inputs, the ATP performs a proof search, which can be characterized as the successive application of inference rules to axioms and derived formulas until a sequence of derived formulas is found that represents a proof of the given conjecture.
The types of formulas considered here are clauses, i.e., disjunctions of literals (where a literal is a possibly negated formula that otherwise has no inner logical connectives). We further specify all variables to be implicitly universally quantified.
The theorem prover compared against in this work, Beagle (Beagle2015), is saturation-based. A saturation-based theorem prover maintains two sets of clauses, referred to as the processed and unprocessed sets of clauses. These two sets correspond to the clauses that have and have not yet been selected for inference. The actions that saturation-based theorem provers take are referred to as inferences. Inferences involve an inference rule (e.g., resolution, factoring) and a non-empty set of clauses, considered to be the premises of the rule. At each step in proof search, the ATP selects an inference with premises in the unprocessed set (some premises may be part of the processed set) and executes it. This generates a new set of clauses, each of which is added to the unprocessed set. The clauses in the premises that are members of the unprocessed set are then added to the processed set. This iteration continues until a clause is generated (in most cases, the empty clause) that signals a proof has been found.
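The processed/unprocessed loop described above can be illustrated with a deliberately small sketch. It is not Beagle's algorithm: it assumes propositional clauses (frozensets of non-zero integers, with a negative integer standing for a negated atom), uses resolution as the only inference rule, and selects clauses in first-in-first-out order. The function names `resolve` and `saturate` are hypothetical.

```python
def resolve(c1, c2):
    """Return all propositional resolvents of two clauses.

    Clauses are frozensets of non-zero ints; -n is the negation of n.
    """
    resolvents = []
    for lit in c1:
        if -lit in c2:
            resolvents.append(frozenset((c1 - {lit}) | (c2 - {-lit})))
    return resolvents


def saturate(clauses, max_steps=1000):
    """Toy saturation loop (assumption: FIFO selection, resolution only).

    Returns True if the empty clause is derived, False if the clause set
    saturates without it, and None if the step budget runs out.
    """
    processed = set()
    unprocessed = list(dict.fromkeys(clauses))  # de-duplicate, keep order
    for _ in range(max_steps):
        if not unprocessed:
            return False
        given = unprocessed.pop(0)  # the "selected" clause
        if not given:
            return True
        processed.add(given)  # move the premise to the processed set
        for other in list(processed):
            for new_clause in resolve(given, other):
                if not new_clause:
                    return True  # empty clause: a refutation was found
                if new_clause not in processed and new_clause not in unprocessed:
                    unprocessed.append(new_clause)
    return None
```

For example, the clause set {p}, {¬p ∨ q}, {¬q} is refuted, while {p}, {q} saturates without a contradiction. Proof guidance, the subject of this paper, amounts to replacing the FIFO choice of `given` with a better selection strategy.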
3 TRAIL
We first describe our overall approach to defining the proof guidance problem in terms of reinforcement learning. We then describe two novel aspects of TRAIL that help with generalized learning of proof guidance: (a) a vectorization process which represents all clauses and actions within a proof state in a way that abstracts away the specifics of the vocabulary of a problem and (b) an attention-based policy network that learns the interactions between clauses and actions to select the next action. Last, we describe the three different learning regimens we used to see if generalization varies as a function of training procedure.
3.1 RL-based Proof Guidance
We formalize the guidance process as an RL problem where the reasoner provides the environment in which the learning agent operates. Figure 1 shows how an ATP problem is solved in our framework. Given a conjecture and a set of axioms, TRAIL iteratively performs reasoning steps until a proof is found (within a provided time limit). The reasoner tracks the state of the proof, ${s}_{t}$, which encapsulates both the clauses that have been derived or used in the derivation so far and the actions that can be taken by the reasoner at the current step. At each step, this state is passed to the learning agent: an attention-based model (luong2015attention) that predicts a distribution over the actions and uses it to sample a corresponding action, ${a}_{t,i}$. This action is then given to the reasoner, which executes it and updates the proof state.
Formally, a state, ${s}_{t}=({\mathcal{C}}_{t},{\mathcal{A}}_{t})$, consists of:

• ${\mathcal{C}}_{t}={\{{c}_{t,j}\}}_{j=1}^{N}$, the set of processed clauses (i.e., all clauses selected by the agent up to step $t$), where ${\mathcal{C}}_{0}=\mathrm{\varnothing}$.
• ${\mathcal{A}}_{t}={\{{a}_{t,i}\}}_{i=1}^{M}$, the set of all available actions that the reasoner could execute at step $t$. An action, ${a}_{t,i}=({\xi}_{t,i},{\widehat{c}}_{t,i})$, is a pair comprising an inference rule, ${\xi}_{t,i}$, and a clause, ${\widehat{c}}_{t,i}$. ${\widehat{c}}_{t,i}$ is an axiom, the negated conjecture, or a derived clause that has yet to be selected by the agent. Informally, ${\mathcal{A}}_{t}$ represents the unprocessed clauses at step $t$ and the inference rules applicable to them. ${\mathcal{A}}_{0}$ is the cross product of the set of all inference rules (denoted by $\mathcal{I}$) and the set containing all axioms and the negated conjecture.
At step $t$, given a state ${s}_{t}$ (provided by the reasoner), the learning agent computes a probability distribution over the set of available actions ${\mathcal{A}}_{t}$, denoted by ${P}_{\theta}({a}_{t,i}\mid {s}_{t})$ (where $\theta $ is the set of parameters for the learning agent), and samples an action ${a}_{t,i}\in {\mathcal{A}}_{t}$. The sampled action ${a}_{t,i}=({\xi}_{t,i},{\widehat{c}}_{t,i})$ is then executed by the reasoner by applying ${\xi}_{t,i}$ to ${\widehat{c}}_{t,i}$ (which may involve other clauses from the processed set ${\mathcal{C}}_{t}$). This yields a set of new derived clauses, ${\overline{\mathcal{C}}}_{t}$, and a new state, ${s}_{t+1}=({\mathcal{C}}_{t+1},{\mathcal{A}}_{t+1})$, where ${\mathcal{C}}_{t+1}={\mathcal{C}}_{t}\cup \{{\widehat{c}}_{t,i}\}$ and ${\mathcal{A}}_{t+1}=({\mathcal{A}}_{t}\setminus \{{a}_{t,i}\})\cup (\mathcal{I}\times {\overline{\mathcal{C}}}_{t})$.
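The state transition can be written down directly as set operations. The sketch below is only an illustration of the update rule: clauses and inference rules are stand-in strings, the derived clauses are assumed to be supplied by the underlying reasoner, and the function name `next_state` is hypothetical.

```python
def next_state(processed, actions, chosen, derived, rules):
    """Compute (C_{t+1}, A_{t+1}) from (C_t, A_t).

    processed: set of processed clauses C_t
    actions:   set of (rule, clause) pairs A_t
    chosen:    the selected action (rule, clause), a member of `actions`
    derived:   clauses produced by executing `chosen` (from the reasoner)
    rules:     the set of all inference rules I
    """
    _, clause = chosen
    new_processed = processed | {clause}  # C_t union {selected clause}
    new_actions = (actions - {chosen}) | {(r, c) for r in rules for c in derived}
    return new_processed, new_actions
```

Note that, following the formula literally, only the selected pair is removed from the action set; other pairs that mention the same clause with a different inference rule remain available.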
Upon completion of a proof attempt, TRAIL must compute a loss and issue a reward that encourages the agent to optimize for decisions leading to a successful proof in the smallest number of steps. Specifically, for an unsuccessful proof attempt (i.e., the underlying reasoner fails to derive a contradiction within the time limit), each step $t$ in the attempt is assigned a reward ${r}_{t}=0$. For a successful proof attempt, in the final step, the underlying reasoner produces a parsimonious refutation proof $\mathcal{P}$ containing only the actions that generated derived facts directly or indirectly involved in the final contradiction. At step $t$ of a successful proof attempt where the action ${a}_{t,i}$ is selected, the reward ${r}_{t}$ is $0$ if ${a}_{t,i}$ is not part of this minimal refutation proof $\mathcal{P}$; otherwise ${r}_{t}$ is inversely proportional to the total number of steps in the proof attempt.
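A minimal sketch of this reward scheme follows. The text only says the reward is inversely proportional to the number of steps; the constant of proportionality (here 1) and the function name `assign_rewards` are assumptions made for illustration.

```python
def assign_rewards(attempt_actions, proof_actions, solved):
    """Return one reward per step of a proof attempt.

    attempt_actions: actions selected at steps 0..T-1, in order
    proof_actions:   set of actions that appear in the refutation proof P
    solved:          whether the attempt found a proof within the time limit
    """
    n_steps = len(attempt_actions)
    if not solved or n_steps == 0:
        return [0.0] * n_steps  # failed attempts receive zero reward everywhere
    # Assumption: reward = 1 / (total steps) for actions that are part of P.
    return [1.0 / n_steps if a in proof_actions else 0.0 for a in attempt_actions]
```

Under this scheme a four-step attempt whose proof uses two of the selected actions rewards those two steps with 0.25 each, so shorter attempts yield larger per-step rewards.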
The final loss consists of the standard policy gradient loss (sutton1998reinforcement) and an entropy regularization term to avoid collapse onto a suboptimal deterministic policy and to promote exploration.
$\mathcal{L}(\theta )=-\mathbb{E}\left[{r}_{t}{\displaystyle \sum _{i=1}^{M}}{\alpha}_{t}({a}_{t,i})\mathrm{log}({P}_{\theta}({a}_{t,i}\mid {s}_{t}))\right]-\lambda \mathbb{E}\left[-{\displaystyle \sum _{i=1}^{M}}{P}_{\theta}({a}_{t,i}\mid {s}_{t})\mathrm{log}({P}_{\theta}({a}_{t,i}\mid {s}_{t}))\right]$
where ${\alpha}_{t}$ indicates the action selected at step $t$ (i.e., ${\alpha}_{t}({a}_{t,i})=1$ if action ${a}_{t,i}$ is selected at $t$; otherwise ${\alpha}_{t}({a}_{t,i})=0$), and $\lambda $ is the entropy regularization hyperparameter.
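For a single step, the loss can be estimated as below. This sketch assumes the usual sign convention for a loss that is minimized: the negative reward-weighted log-probability of the selected action, minus $\lambda $ times the entropy of the policy at that step. The function name `step_loss` is hypothetical, and the expectation is replaced by one sample.

```python
import math


def step_loss(probs, selected, reward, lam):
    """Single-sample estimate of the policy-gradient loss with entropy bonus.

    probs:    P_theta(a_{t,i} | s_t) for every available action (sums to 1)
    selected: index of the action chosen at step t (where alpha_t is 1)
    reward:   r_t
    lam:      entropy regularization weight lambda
    """
    policy_gradient_term = -reward * math.log(probs[selected])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Subtracting lam * entropy means higher-entropy policies get lower loss,
    # which discourages collapse onto a deterministic policy.
    return policy_gradient_term - lam * entropy
```

With $\lambda =0$ this reduces to the plain policy-gradient term; increasing $\lambda $ trades reward-seeking for exploration.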
We use a normalized reward to improve stability of training as the intrinsic difficulty of problems can vary widely in our problem dataset. We explored (i) normalization by the inverse of the number of steps performed by a mature traditional reasoner (in this work, Beagle), (ii) normalization by the best reward obtained in repeated attempts to solve the same problem, and (iii) no normalization; the normalization strategy was a hyperparameter. This loss has the effect of giving actions that contributed to the most direct proofs for a given problem higher rewards, while dampening actions that contributed to lengthier proofs for the same problem.
3.2 Vectorization Process
Clause Vectorization: Figure 2 shows an example of the vectorizer operating on a clause with a single positive literal. Our method for vectorization is informed by how inference rules operate over clauses. Consider the resolution inference rule. Two clauses resolve if one contains a positive literal whose constituent atom is unifiable with the constituent atom of a negative literal in the other clause. Hence, vector representations of clauses should capture the relationship between literals and their negations as well as reflect structural similarities between literals that are indicative of unifiability.
Our approach captures these features by deconstructing input clauses into sets of patterns. We define a pattern to be a linear chain that begins from a predicate symbol and includes one argument (and its argument position) at each depth until it ends at a constant or variable. The set of all patterns for a given clause is then simply the set of all linear paths between each predicate and the constants and variables they bottom out with. Since the names of variables are arbitrary, they are replaced with a wildcard symbol, “$\ast $”, indicating that the element may match with anything. Argument position is also indicated with the use of wildcard symbols. Going back to the clause in Figure 2, we obtain the patterns $q(f(g(\ast ,\ast ),\ast ),\ast )$ and $q(\ast ,g(\ast ,\ast ))$.
We obtain a $d$-dimensional representation of a clause, ${\mathbf{c}}_{t,j}$, by hashing the linearization of each pattern $p$ using MD5 hashes (rivest1992md5) to compute a hash value $v$, and setting the element at index $v \bmod d$ to the number of occurrences of the pattern $p$ in the clause ${c}_{t,j}$. Furthermore, we explicitly encode the difference between patterns and their negations by doubling the representation size and hashing them separately, so that the first $d$ elements encode the positive patterns and the second $d$ elements encode the negated patterns. Feature hashing greatly condenses the representation size and has been shown to be useful in the ATP domain (JUCoRR19enigmahammering; ChvaJaSUCoRR19enigmang).
Even with MD5 hashing, when operating in domains with smaller vocabularies and shorter formulas it is possible (though unlikely) that vocabulary-specific features could still be learned. To mitigate this, our approach systematically renames patterns prior to hashing. Specifically, a unique identifier (generated from the proof attempt number) is appended to each predicate, function, and constant (e.g., in the second proof attempt our patterns would be ${q}_{2}({f}_{2}({g}_{2}(\ast ,\ast ),\ast ),\ast )$ and ${q}_{2}(\ast ,{g}_{2}(\ast ,\ast ))$). This does not affect the underlying semantics of a problem, as each predicate, function, and constant has been renamed consistently.
The pattern-based vectorization procedure is intended to produce sparse feature vectors that give reasonable estimates of purely structural similarity. Through both the hashing and renaming, TRAIL has minimal dependence on symbol-specific features. While this is certainly appealing from a transferability perspective, we note that it could also be useful when constrained to a single domain. For instance, in Mizar nearly 40% of symbols are introduced during the translation of theories into conjunctive normal form (these are skolem constants, skolem functions, or definitional predicates) (olvsak2019property), which means that any system being applied to Mizar must be able to handle problem-specific symbols that may appear only a handful of times.
Action Vectorization: Since actions are pairs of clauses and inference rules, our approach represents the clause in each action pair using the process described above and represents the inference rule as a one-hot encoding of size $|\mathcal{I}|$. We write $({\mathbf{z}}_{t,i},{\widehat{\mathbf{c}}}_{t,i})$ to denote the encoding for action ${a}_{t,i}$.
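The hashing step can be sketched as follows. The sketch assumes that patterns have already been extracted and linearized as strings (the exact linearization format is not specified here), and that occurrence counts are accumulated by incrementing, so colliding patterns add up. The function name `vectorize_clause` is hypothetical.

```python
import hashlib


def vectorize_clause(positive_patterns, negative_patterns, d):
    """Hash pattern strings into a sparse count vector of length 2 * d.

    The first d entries count patterns from positive literals and the
    last d entries count patterns from negated literals.
    """
    vector = [0] * (2 * d)
    for offset, patterns in ((0, positive_patterns), (d, negative_patterns)):
        for pattern in patterns:
            digest = hashlib.md5(pattern.encode("utf-8")).hexdigest()
            v = int(digest, 16)            # hash value v
            vector[offset + v % d] += 1    # one more occurrence at index v mod d
    return vector
```

Because only the hash of a pattern determines its index, the vector length is fixed by $d$ regardless of how many distinct symbols a problem contains.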
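The renaming step can be illustrated on linearized pattern strings. The sketch assumes symbols are alphanumeric identifiers that start with a letter and that the suffix is written as an underscore followed by the proof attempt number; both the string format and the function name `rename_symbols` are assumptions.

```python
import re


def rename_symbols(pattern, attempt_number):
    """Append the proof attempt number to every symbol in a pattern string.

    Wildcards ("*"), parentheses, and commas are left untouched.
    """
    return re.sub(
        r"[A-Za-z]\w*",
        lambda match: f"{match.group(0)}_{attempt_number}",
        pattern,
    )


print(rename_symbols("q(f(g(*,*),*),*)", 2))
```

Since every occurrence of a symbol receives the same suffix within an attempt, structure is preserved, while hashes of the same pattern differ between attempts.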
3.3 Attention-based Policy Network
The architecture of the policy network in TRAIL is shown in Figure 3. The inputs to the policy network are the sets of processed clause representations, $\{{\mathbf{c}}_{t,1},\mathrm{\dots},{\mathbf{c}}_{t,N}\}$, and action representations, $\{({\mathbf{z}}_{t,1},{\widehat{\mathbf{c}}}_{t,1}),\mathrm{\dots},({\mathbf{z}}_{t,M},{\widehat{\mathbf{c}}}_{t,M})\}$. First, we transform the sparse clause representations (from the set of processed clauses or actions) into dense representations by passing them through $k$ fully-connected layers. This yields sets $\{{\mathbf{h}}_{t,1},\mathrm{\dots},{\mathbf{h}}_{t,N}\}$ and $\{{\widehat{\mathbf{h}}}_{t,1},\mathrm{\dots},{\widehat{\mathbf{h}}}_{t,M}\}$ of dense representations for the processed and action clauses. Then, for each action pair, we concatenate the clause representation ${\widehat{\mathbf{h}}}_{t,i}$ with the corresponding inference representation ${\mathbf{z}}_{t,i}$ to form the new action representation ${\mathbf{a}}_{t,i}=[{\widehat{\mathbf{h}}}_{t,i},{\mathbf{z}}_{t,i}]$. The resulting sets of new clause and action representations are joined into matrices $\mathbf{C}$ and $\mathbf{A}$, respectively.
Throughout the reasoning process, the policy network must produce a distribution over the actions relative to the clauses that have been selected up to the current step, where both the actions and clauses are sets of variable length. This setting is analogous to ones seen in attention-based approaches to problems like machine translation (luong2015attention; vaswani2017attention) and video captioning (yu2016video; whitehead2018kavd), in which the model must score each encoder state with respect to a decoder state or other encoder states. To score each action relative to each clause, we compute a multiplicative attention (luong2015attention):
$\mathbf{H}={\mathbf{A}}^{\top}{\mathbf{W}}_{a}\mathbf{C},$
where ${\mathbf{W}}_{a}\in {\mathbb{R}}^{(2d+|\mathcal{I}|)\times 2d}$ is a learned parameter and the resulting matrix, $\mathbf{H}\in {\mathbb{R}}^{M\times N}$, is a heat map of interaction scores between processed clauses and available actions. We then perform max pooling over the columns (i.e., clauses) of $\mathbf{H}$ to find a single score for each action and apply a softmax normalization to the pooled scores to obtain the distribution over the actions, ${P}_{\theta}({a}_{t,i}\mid {s}_{t})$.
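The scoring, pooling, and normalization steps can be sketched in plain Python. This is an illustration under simplifying assumptions: action and clause representations are passed as lists of row vectors rather than as the matrices $\mathbf{A}$ and $\mathbf{C}$, the dense layers are omitted, at least one processed clause exists, and the function name `action_distribution` is hypothetical.

```python
import math


def action_distribution(action_vectors, clause_vectors, weights):
    """Score actions against processed clauses and return a distribution.

    action_vectors: M vectors of length p (one per available action)
    clause_vectors: N vectors of length q (one per processed clause), N >= 1
    weights:        p x q matrix playing the role of W_a
    """
    q = len(weights[0])
    pooled_scores = []
    for a in action_vectors:
        # a^T W_a: project the action into clause space
        projected = [sum(a[r] * weights[r][c] for r in range(len(a))) for c in range(q)]
        # one row of H: interaction score with every processed clause
        row = [sum(x * y for x, y in zip(projected, clause)) for clause in clause_vectors]
        pooled_scores.append(max(row))  # max pooling over clauses
    # numerically stable softmax over the pooled scores
    top = max(pooled_scores)
    exps = [math.exp(s - top) for s in pooled_scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Max pooling means an action is scored by its strongest interaction with any single processed clause, which keeps the output size equal to the number of actions no matter how many clauses have been processed.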
3.4 Training Regimens
To determine whether the training regimen affects how well the system generalizes, we use one of the following three strategies to attempt to solve all the problems in the training set.
Random Exploration: We randomly explore the search space as done in AlphaZero (alphazero) to establish performance when the system is started from a tabula rasa state (i.e., a randomly initialized policy network ${P}_{\theta}$). At training, at an early step $t$ (i.e., $t<{\tau}_{0}$, where ${\tau}_{0}$, the temperature threshold, is a hyperparameter that indicates the depth in the reasoning process at which training exploration stops), we sample the action ${a}_{t,i}$ in the set of available actions ${\mathcal{A}}_{t}$, according to the following probability distribution ${\widehat{P}}_{\theta}$ derived from the policy network ${P}_{\theta}$:
${\widehat{P}}_{\theta}({a}_{t,i}\mid {s}_{t})={\displaystyle \frac{{P}_{\theta}{({a}_{t,i}\mid {s}_{t})}^{1/\tau}}{{\sum}_{{a}_{t,j}\in {\mathcal{A}}_{t}}{P}_{\theta}{\left({a}_{t,j}\mid {s}_{t}\right)}^{1/\tau}}}$
where $\tau $, the temperature, is a hyperparameter that controls the exploration-exploitation trade-off and decays over the iterations (a higher temperature promotes more exploration). On the other hand, when the number of steps already performed has reached the temperature threshold (i.e., $t\ge {\tau}_{0}$), an action, ${a}_{t,i}$, with the highest probability from the policy network, is selected, i.e., ${a}_{t,i}=\mathrm{arg}{\mathrm{max}}_{{a}_{t,j}\in {\mathcal{A}}_{t}}{P}_{\theta}({a}_{t,j}\mid {s}_{t})$.
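The two selection modes can be sketched as follows, assuming exploration applies at steps before the threshold ($t<{\tau}_{0}$) and greedy selection from the threshold onward. The function names `tempered` and `select_action` are hypothetical.

```python
import random


def tempered(probs, tau):
    """Raise each probability to the power 1 / tau and renormalize."""
    powered = [p ** (1.0 / tau) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def select_action(probs, t, tau, tau0, rng=random):
    """Pick an action index at step t.

    Before the temperature threshold tau0, sample from the tempered
    distribution; from tau0 onward, take the most probable action.
    """
    if t < tau0:
        indices = range(len(probs))
        return rng.choices(indices, weights=tempered(probs, tau))[0]
    return max(range(len(probs)), key=probs.__getitem__)
```

With $\tau =1$ the tempered distribution equals the policy output; values above 1 flatten it (more exploration) and values below 1 sharpen it.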
At the end of training iteration $k$, the newly collected examples and those collected in the previous $w$ iterations ($w$ is the example buffer hyperparameter) are used to train, in a supervised manner, the policy network using the reward structure and loss function defined in Section 3.1. The updated policy network is retained for the next iteration if it is superior to the previous one in terms of number of problems solved on the validation problem set; otherwise, it is discarded. At validation and testing, exploration is disabled (i.e., the temperature threshold is set to 0).
This approach has the disadvantage that the system spends a significant amount of time in unproductive parts of the search space, but it may help transfer because of the random exploration of the search space.
Expert Bootstrapping Learning: We explore the effectiveness of bootstrapping the RL process. This approach is similar to random exploration with the exception that, at the first iteration, the randomly initialized policy network model is trained, in a supervised manner, using examples from problems from the training set solved by an existing reasoner (in this work we use Beagle (Beagle2015)). Thus, the first iteration ends with a model trained by an expert, after which training proceeds exactly as in the random exploration approach. We contrast this approach with the random exploration strategy for generalization.
Exploratory Imitation Learning: Similar to expert bootstrapping, we bootstrap the training with examples from an existing reasoner (our expert). But, in later iterations, we reduce our reliance on this reasoner for example collection. Specifically, at iteration $k$ at training, for a step $t$ in the reasoning process, with a probability ${\rho}^{k-1}$, we delegate the selection of the action from the list of available actions ${\mathcal{A}}_{t}$ to the expert, and, with a probability $1-{\rho}^{k-1}$, we follow the same action selection strategy as in the random exploration approach. Here $0<\rho <1$ is a hyperparameter controlling the decay of our reliance on the expert reasoner. At validation and testing, there is no reliance on the expert. This approach is the middle ground between random exploration and expert-based learning, and is hence a useful data point in understanding the effectiveness of either on generalization.
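The decaying reliance on the expert can be sketched as below. The sketch assumes the delegation probability at iteration $k$ is $\rho $ raised to the power $k-1$ with $\rho $ strictly between 0 and 1, so that the first iteration always defers to the expert; the function names are hypothetical.

```python
import random


def expert_probability(rho, k):
    """Probability of delegating action selection to the expert at iteration k."""
    return rho ** (k - 1)


def choose_source(rho, k, rng=random):
    """Return "expert" or "policy" for one reasoning step of iteration k."""
    if rng.random() < expert_probability(rho, k):
        return "expert"
    return "policy"


# Reliance on the expert over the first iterations for rho = 0.75.
for k in range(1, 6):
    print(k, round(expert_probability(0.75, k), 4))
```

With $\rho =0.75$ the expert is consulted with probability 1 in the first iteration and about 0.32 by the fifth.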
4 Experiments and Results
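Table 1: Hyperparameter values.
| $k$ (layers) | Units per layer | Dropout | $\lambda $ (reg.) | $2d$ (sparse vector size) | $\tau $ (temp.) | ${\tau}_{0}$ (temp. threshold) | $\rho $ (expert decay) | Reward normalization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 161 | 0.57 | 0.004 | 645 | 1.13 | 11 | 0.75 | (iii) No normalization |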
In this section, we aim to answer the following questions: (a) How well does the proof guidance system in TRAIL generalize across problem domains? To answer this question, we trained on two different datasets, and then measured generalization across them. (b) What factors contribute to generalization? To answer this question, we examined whether differences in training regimens in RL affect generalization. (c) Is TRAIL effective at generalized proof guidance? For this, we tested whether TRAIL’s performance was competitive with an existing reasoner.
4.1 Datasets
We evaluated TRAIL using problems drawn from both the Mizar (mizar) dataset (https://github.com/JUrban/deepmath/) and the Thousands of Problems for Theorem Provers (TPTP) dataset (http://tptp.cs.miami.edu/). Mizar is a well-known mathematical library of formalized and mechanically-verified mathematical problems. TPTP is the definitive benchmarking library for theorem provers, designed to test ATP performance across a wide range of problem domains (e.g., biology, geography, number theory, etc.). The different problem domains within the TPTP serve as the data-sparse setting in our transfer experiment, as each distinct problem domain within the TPTP has on average only 400 problems. Problems from Mizar were drawn from the subset used by (AlemiNIPS16deepmath), i.e., the subset of problems known to be solvable by existing ATPs (this subset was used to allow for a direct comparison against our baseline reasoner). From the TPTP dataset, a random subset of 2,000 problems was selected from various problem domains. For transfer and generalization experiments, we ensure no overlap between the two datasets in terms of problems and vocabularies by (a) removing common problems from the test sets, and (b) consistently renaming, within each problem, all predicates, functions, and constants.
4.2 Hyper-Parameter Tuning and Experimental Setup
We used gradient-boosted tree search from scikit-optimize (https://scikit-optimize.github.io/) to find effective hyperparameters using 10% of the Mizar dataset. This returned the hyperparameter values in Table 1. We then selected a different 10% of the dataset (completely disjoint from the one used for hyperparameter tuning), performed a 3-fold cross evaluation on it, and, for each iteration, we report the average across the combined set of problems in all folds. The maximum time limit for solving a problem was 100 seconds. Experiments were conducted over a cluster of 19 CPU machines (56 x 2.0 GHz cores & 247 GB RAM) and 10 GPU machines (2 x P100 GPU, 16 x 2.0 GHz CPU cores, & 120 GB RAM) over 4 to 5 days (for hyperparameter tuning, we added 5 CPU and 2 GPU machines).
We use three metrics to measure performance. The first is cumulative completion performance. Following (BaLoRSWiCoRR19learning), this is the cumulative number of distinct problems solved by TRAIL across all iterations divided by the total number of problems. The second metric is best iteration completion performance. This was reported in (KalUMONeurIPS18atprl) and is the number of problems solved at the best performing iteration divided by the total number of problems. The third metric is average proof length, which measures the average number of steps taken to find a proof.
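The three metrics can be computed as in the following sketch, which assumes that each training iteration is summarized by the set of problem identifiers it solved. The function names and this input format are illustrative assumptions, not part of TRAIL.

```python
def completion_metrics(solved_per_iteration, total_problems):
    """Return (cumulative completion, best iteration completion).

    solved_per_iteration: list of sets of problem ids, one set per iteration
    total_problems:       number of problems in the evaluated set
    """
    solved_anywhere = set().union(*solved_per_iteration)  # distinct problems
    best_count = max((len(s) for s in solved_per_iteration), default=0)
    return len(solved_anywhere) / total_problems, best_count / total_problems


def average_proof_length(step_counts):
    """Mean number of steps over the proofs that were found."""
    return sum(step_counts) / len(step_counts)
```

Cumulative completion is never lower than best iteration completion, since a problem solved in any single iteration also counts toward the cumulative total.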
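Table 2: Problems solved per TPTP domain when TRAIL is trained on Mizar.
| Domain | # Prob. | # TRAIL | # Beagle | # Random |
| --- | --- | --- | --- | --- |
| CSR | 27 | 16 | 16 | 11 |
| SET | 73 | 25 | 32 | 4 |
| GEO | 68 | 21 | 27 | 1 |
| SWC | 46 | 14 | 14 | 0 |
| KLE | 35 | 4 | 5 | 0 |
| LCL | 33 | 8 | 8 | 1 |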
4.3 Generalization Across Problem Domains
Table 2 shows TRAIL’s performance when it is trained on Mizar and tested on distinct problem domains from the TPTP. The subject matter between domains can vary greatly. For instance, CSR tests common sense reasoning, while GEO involves reasoning about geometry, and SWC covers software verification. This results in different needs between domains, e.g., the CSR domain requires heavy use of the resolution inference rule while the GEO domain involves more equational reasoning and thus requires the superposition inference rules. As can be seen in Table 2, each distinct domain TRAIL was evaluated on contains only a few problems. By training on Mizar (a completely different dataset designed for different purposes), TRAIL was able to nearly match Beagle in many categories. Furthermore, the performance of the untrained, random model shows that TRAIL was clearly learning transferable proof search strategies.
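Table 3: Percentage of problems solved on the Mizar and TPTP test sets by Beagle, a random model, and TRAIL under each training regimen.
| Training set | System or regimen | Tested on Mizar | Tested on TPTP |
| --- | --- | --- | --- |
| | Beagle | 64.1% | 37.9% |
| | Random model | 22.2% | 8.3% |
| Mizar | Tabula Rasa | 64.3% | 31.1% |
| Mizar | Expert Bootstrapping | 64.8% | 33.1% |
| Mizar | Exploratory Imitation | 64.4% | 30.5% |
| TPTP | Tabula Rasa | 58.5% | 31.0% |
| TPTP | Expert Bootstrapping | 61.0% | 29.6% |
| TPTP | Exploratory Imitation | 63.1% | 31.6% |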
4.4 Generalization Across Datasets
In this experiment, we show TRAIL’s performance when we train it on Mizar and test its performance on TPTP and vice versa. As described earlier, these two datasets have very different vocabularies. In examining the factors that help this sort of generalization across datasets, we examine the effect of each type of training regimen.
Training on Different Regimens: Table 3 shows the performance of the different training regimens and how much each regimen facilitates generalization. As a baseline, we also include the performance of a model with randomly initialized weights, without any training. This randomly initialized model solved 22.2% of Mizar problems and 8.3% of TPTP problems. First, TRAIL performed much better than the randomly initialized model when trained and tested on the same dataset (supported by a statistical test for both Mizar and TPTP), suggesting that learning did occur. Averaged across regimens, TRAIL solved 64.5% of the Mizar test problems when trained on the Mizar dataset, and fewer (60.9%) when trained on TPTP; a statistical test suggested that this difference is significant.
For TPTP, on average, training on TPTP (30.7% of TPTP test problems solved) was statistically just as good as training on Mizar (31.6% of TPTP test problems solved). In general, the choice of training regimen had no statistically significant effect on success rates for either Mizar or TPTP, whichever dataset was used for training, so we do not describe these differences further. We therefore conclude that TRAIL generalizes well across datasets, regardless of the training regimen.
4.5 Effectiveness of TRAIL
Baselines: Beagle (Beagle2015) is an established reasoner that provides competitive performance on ATP datasets. The current implementation of TRAIL uses Beagle as its underlying reasoner. This is purely an implementation choice, made mostly because Beagle is open source and could have its proof guidance removed and replaced with TRAIL. The purpose of Beagle in TRAIL is only to execute the actions selected by the TRAIL learning agent; i.e., Beagle’s proof guidance was completely disabled when it was embedded as a component in TRAIL. TRAIL is not reasoner-dependent, and any reasoner that can apply FOL inference rules can serve the same role as Beagle in TRAIL. We support this claim with an experiment in the following section that shows similar performance gains when Beagle is substituted with a different (in-house) reasoner.
The purpose of the baseline experiments is to ensure that proof guidance controlled by TRAIL functions at a level competitive with a manually optimized reasoner.
Comparison Against a Heuristics-based Reasoner: Table 3 also shows the cumulative completion performance, as the percentage of problems solved, of TRAIL compared to Beagle (Beagle2015). Beagle’s optimized (heuristics-based) strategy solved 64.1% of the Mizar problems and 37.9% of the TPTP problems. TRAIL with expert bootstrapping solved slightly more Mizar problems than Beagle, 64.8% (the difference is not statistically significant, with $z=0.57$ and $p=0.57$), and it solved 33.1% of the TPTP problems (again not statistically significant, with $z=1.46$ and $p=0.14$).
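The reported $z$ and $p$ values are consistent with a two-sided test against the standard normal distribution. The text does not name the test, so the sketch below assumes a pooled two-proportion z-test. The function names and the example counts are hypothetical, and the test-set sizes are not given here.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value of a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def two_proportion_z(solved_a: int, solved_b: int, n: int) -> float:
    """Pooled two-proportion z statistic for two systems that each
    attempted the same number n of problems (assumed test, see lead-in).
    Undefined when both systems solve none or all of the problems."""
    pooled = (solved_a + solved_b) / (2 * n)
    std_err = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n)
    return (solved_a / n - solved_b / n) / std_err


# Converting the z values reported in the text back into p-values.
p_mizar = two_sided_p(0.57)
p_tptp = two_sided_p(1.46)
print(f"Mizar: p = {p_mizar:.2f}, TPTP: p = {p_tptp:.2f}")

# Hypothetical counts, only to show how the statistic is formed.
z_example = two_proportion_z(solved_a=648, solved_b=641, n=1000)
```

Both conversions agree with the reported p-values after rounding to two decimals, which is why a two-sided normal test seems a reasonable reading.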
Comparison Against an RL-based Reasoner: We also compare TRAIL’s performance against (KalUMONeurIPS18atprl), using the best-iteration completion performance, as the percentage of problems solved on the Mizar dataset at testing (the same metric used by (KalUMONeurIPS18atprl)). While TRAIL solved 61.6% of the problems, (KalUMONeurIPS18atprl) solved only 50%. A direct comparison with TRAIL cannot be made because of different time limits and hardware. Still, (KalUMONeurIPS18atprl) reported that Vampire solved 90% of the problems in the same setting in which their system solved 50%, suggesting that the baseline performance of TRAIL is very promising.
Proof Length: Figure 4 shows how the average number of steps required to find a proof improves over iterations. For each problem, the score is the number of proof steps used by Beagle divided by the number of steps used by TRAIL to find the proof. This score is computed only on problems solved by both systems. A score greater than one means that TRAIL finds shorter, more efficient proofs than Beagle. All training regimens started from the same initial model weights; hence they have the same performance at iteration 1. After the first model update, the performance of all models improved significantly. The scores continue to improve over the iterations, almost matching Beagle for tabula rasa and ending slightly above Beagle for exploratory imitation at the last iteration.
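The proof-length score described above can be sketched as follows. The per-problem ratio and the restriction to problems solved by both systems follow the text. Aggregating the ratios with an arithmetic mean is an assumption, and the function name and step counts are made up for illustration.

```python
from typing import Dict


def proof_length_score(beagle_steps: Dict[str, int],
                       trail_steps: Dict[str, int]) -> float:
    """Mean of (Beagle steps / TRAIL steps) over problems solved by both.

    Each dict maps a problem name to the number of proof steps that the
    system needed; a problem a system failed on is simply absent.
    """
    solved_by_both = beagle_steps.keys() & trail_steps.keys()
    if not solved_by_both:
        raise ValueError("no problem was solved by both systems")
    ratios = [beagle_steps[p] / trail_steps[p] for p in solved_by_both]
    return sum(ratios) / len(ratios)


# Hypothetical step counts: p3 and p4 are each solved by only one system
# and are therefore ignored.
beagle = {"p1": 10, "p2": 20, "p3": 5}
trail = {"p1": 5, "p2": 40, "p4": 7}
score = proof_length_score(beagle, trail)
print(score)
```

In this made-up example the ratios are 2.0 and 0.5, so the score is above one even though TRAIL needed a longer proof on one of the two shared problems.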
Reasoner Agnosticism: TRAIL is a reasoner-agnostic system; i.e., one can use any reasoner after disabling the reasoner’s own proof guidance strategy, as long as this reasoner can successfully execute the actions proposed by TRAIL. To demonstrate that this is indeed the case, we also integrated a baseline reasoner (Basic) in TRAIL. Basic is an in-house reasoner that implements some of the more common state-of-the-art optimization techniques, such as subsumption checking, demodulation, and term indexing. The fully optimized version of Basic could solve 13.6% of Mizar problems. TRAIL integrated with Basic could solve only 3% of these problems prior to any training, but it improved to 15.6% after 30 iterations, an improvement of 2 percentage points over Basic’s fully optimized version. Our goal here was only to demonstrate the generality of the TRAIL system architecture, and to show that the results of prior sections were not due to the choice of a particular reasoner like Beagle.
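One way to picture this separation is a small interface between the guidance policy and the reasoner. The sketch below is purely illustrative: `Reasoner`, `ToyReasoner`, and `guided_search` are hypothetical names and do not reflect the actual interface of TRAIL, Beagle, or Basic. The only idea taken from the text is that the reasoner executes actions while an external policy selects them.

```python
from typing import Callable, List, Optional, Protocol, Sequence


class Reasoner(Protocol):
    """Hypothetical minimal interface a reasoner exposes to the agent."""

    def available_actions(self) -> Sequence[str]: ...
    def apply(self, action: str) -> None: ...
    def proof_found(self) -> bool: ...


class ToyReasoner:
    """Stand-in reasoner: the proof is 'found' once 'goal' is applied."""

    def __init__(self, actions: Sequence[str]) -> None:
        self._pending: List[str] = list(actions)
        self._done = False

    def available_actions(self) -> Sequence[str]:
        return list(self._pending)

    def apply(self, action: str) -> None:
        self._pending.remove(action)
        self._done = self._done or action == "goal"

    def proof_found(self) -> bool:
        return self._done


def guided_search(reasoner: Reasoner,
                  choose: Callable[[Sequence[str]], str],
                  max_steps: int) -> Optional[int]:
    """Let `choose` select every action; return steps used, or None."""
    for step in range(1, max_steps + 1):
        actions = reasoner.available_actions()
        if not actions:
            return None  # nothing left to try
        reasoner.apply(choose(actions))
        if reasoner.proof_found():
            return step
    return None  # step budget exhausted


# Two toy policies driving the same kind of reasoner.
naive = guided_search(ToyReasoner(["a", "b", "goal"]), lambda acts: acts[0], 10)
informed = guided_search(ToyReasoner(["a", "b", "goal"]), lambda acts: acts[-1], 10)
print(naive, informed)
```

Because `guided_search` depends only on the three interface methods, swapping the toy reasoner for another implementation would not change the policy code, which is the property this paragraph argues for.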
5 Related Work
Several approaches focus on the (sub)problem of premise selection (i.e., finding the axioms relevant to proving the considered problem) (AlamaHKTUjar14premisescorpuskernel; Blanchettejar16premisesisabellehol; AlemiNIPS16deepmath; WangTWDNIPS17deepgraph). As is often the case with automated theorem proving, most early approaches were based on manual heuristics (hoder2011sine; roederer2009divvy) and traditional machine learning (AlamaHKTUjar14premisescorpuskernel); though a few recent works are neural (AlemiNIPS16deepmath; WangTWDNIPS17deepgraph). As humans still outperform fully automated systems, there has also been research on using learning to support interactive theorem proving (Blanchettejar16premisesisabellehol; BancerekJAR2018mml).
Some early research applied (deep) RL to guiding inference (TaylorMSWFLAIRS07cycrl), and planning and machine learning techniques to inference in relational domains (surveyRLRD). Several papers consider propositional logic or other decidable FOL fragments and thus focus on much simpler algorithms than we do. Close to TRAIL is the approach described in (KalUMONeurIPS18atprl). It applies RL combined with Monte-Carlo tree search (MCTS) for automated theorem proving in FOL, but has some key limitations: 1) The input axioms are represented by features that depend on the vocabulary (i.e., user-defined predicates etc.). As a result, the approach would not transfer well to new problems with a different vocabulary. 2) The approach is specific to tableau-based reasoners and therefore may present difficulties for theories containing many equality axioms, which are better handled in the superposition calculus (bachmair1994refutational). 3) It relies upon simple linear learners and gradient boosting as policy and value predictors. Our work also aligns well with the recent proposal of an API for deep RL-based interactive theorem proving in HOL Light, using imitation learning from human proofs (BaLoRSWiCoRR19holist). That paper also describes an ATP as a proof-of-concept. However, their ATP is intended as a baseline and lacks more advanced features like our exploratory learning.
Non-RL-based approaches using deep learning to guide proof search include (ChvaJaSUCoRR19enigmang; LoosISKLPAR17atpnn; paliwal2019graph). Each of the listed works would use a neural network during proof guidance to rank the list of available clauses with respect to only the conjecture. Their underlying theorem prover would then expand proof search from the highest ranked clause. This ranking scheme was recognized as a limitation in (piotrowski2019guiding), where the authors described how such a methodology would fail to capture any dependencies between non-conjecture clauses. Their proposed solution was an RNN-based encoding scheme for embedding entire proof branches in a tableau-based reasoner. The choice of a tableau reasoner was due to the relative compactness of tableau proof branches, which helped to keep the RNN from incorrectly discarding information. It was unclear how to extend their method to saturation-based theorem provers, where a theorem prover state may include thousands of irrelevant clauses.
6 Conclusions
We presented TRAIL: a flexible, deep reinforcement learning based proof guidance system that transfers well across FOL problem domains. The transfer was robust across training regimens and changes in problem vocabularies. A next step is to see if training transfers across logical formalisms.