Abstract
Generating formal language represented by relational tuples, such as Lisp programs or mathematical expressions, from a natural-language input is an extremely challenging task because it requires explicitly capturing discrete symbolic structural information from the input to generate the output. Most state-of-the-art neural sequence models do not explicitly capture such structural information, and thus do not perform well on these tasks. In this paper, we propose a new encoder-decoder model based on Tensor Product Representations (TPRs) for Natural- to Formal-language generation, called TP-N2F. The encoder of TP-N2F employs TPR 'binding' to encode natural-language symbolic structure in vector space and the decoder uses TPR 'unbinding' to generate a sequence of relational tuples, each consisting of a relation (or operation) and a number of arguments, in symbolic space. TP-N2F considerably outperforms LSTM-based Seq2Seq models, creating new state-of-the-art results on two benchmarks: the MathQA dataset for math problem solving, and the AlgoLisp dataset for program synthesis. Ablation studies show that improvements are mainly attributed to the use of TPRs in both the encoder and decoder to explicitly capture relational structure information for symbolic reasoning.
Natural- to formal-language generation
using Tensor Product Representations
Kezhen Chen${}^{\dagger}$${}^{*}$, Qiuyuan Huang${}^{\ddagger}$, Hamid Palangi${}^{\ddagger}$, Paul Smolensky${}^{\ddagger\S}$, |
---|
Kenneth D. Forbus${}^{\dagger}$, Jianfeng Gao${}^{\ddagger}$ |
${}^{\dagger}$Northwestern University, Evanston, IL |
${}^{\ddagger}$Microsoft Research, Redmond, WA |
${}^{\S}$Johns Hopkins University, Baltimore, MD |
${}^{*}$Kezhen Chen contributed to this work as an intern at Microsoft Research, Redmond, WA. |
[email protected] |
{qihua,hpalangi,psmo,jfgao}@microsoft.com |
[email protected] |
[email protected] |
1 INTRODUCTION
When people perform explicit reasoning, they can typically describe the way to the conclusion step by step via relational descriptions. There is ample evidence that relational representations are important for human cognition (e.g., (Gentner2003; Forbus2017; Crouse2018; Chen2018; Chen2019)). Although a rapidly growing number of researchers use deep learning to solve complex symbolic reasoning and language tasks (a recent review is (gao2019neural)), most existing deep learning models, including sequence models such as LSTMs, do not explicitly capture human-like relational structure information.
In this paper we propose a novel neural architecture, TP-N2F, to solve natural- to formal-language generation tasks (N2F). In the tasks we study, math or programming problems are stated in natural-language, and answers are given as programs, sequences of relational representations, to solve the problem. TP-N2F encodes the natural-language symbolic structure of the problem in an input vector space, maps this to a vector in an intermediate space, and uses that vector to produce a sequence of output vectors that are decoded as relational structures. Both input and output structures are modelled as Tensor Product Representations (TPRs) (Smolensky1990). During encoding, NL-input symbolic structures are encoded as vector space embeddings using TPR ‘binding’ (following Palangi2018); during decoding, symbolic constituents are extracted from structure-embedding output vectors using TPR ‘unbinding’ (following Huang2018; Huang2019).
Our contributions in this work are as follows. (i) We propose a role-level analysis of N2F tasks. (ii) We present a new TP-N2F model which gives a neural-network-level implementation of a model solving the N2F task under the role-level description proposed in (i). To our knowledge, this is the first model to be proposed which combines both the binding and unbinding operations of TPRs to achieve generation tasks through deep learning. (iii) State-of-the-art performance on two recently developed N2F tasks shows that the TP-N2F model has significant structure learning ability on tasks requiring symbolic reasoning through program synthesis.
2 Background: Review of Tensor-Product Representation
The TPR mechanism is a method to create a vector space embedding of complex symbolic structures. The type of a symbol structure is defined by a set of structural positions or roles, such as the left-child-of-root position in a tree, or the second-argument-of-$R$ position of a given relation $R$. In a particular instance of a structural type, each of these roles may be occupied by a particular filler, which can be an atomic symbol or a substructure (e.g., the entire left sub-tree of a binary tree can serve as the filler of the role left-child-of-root). For now, we assume the fillers to be atomic symbols.^{1}
^{1} When fillers are structures themselves, binding can be used recursively, giving tensors of order higher than 2. In general, binding is done with the tensor product, since conflation with matrix algebra is only possible for order-2 tensors. Our unbinding of relational tuples involves the order-3 TPRs defined in Sec. 3.1.2.
The TPR embedding of a symbol structure is the sum of the embeddings of all its constituents, each constituent comprising a role together with its filler. The embedding of a constituent is constructed from the embedding of a role and the embedding of the filler of that role: these are joined together by the TPR ‘binding’ operation, the tensor (or generalized outer) product $\otimes $.
Formally, suppose a symbolic type is defined by the roles $\{{r}_{i}\}$, and suppose that in a particular instance of that type, $\mathtt{S}$, role ${r}_{i}$ is bound by filler ${f}_{i}$. The TPR embedding of $\mathtt{S}$ is the order-2 tensor
$\bm{T}={\displaystyle \sum _{i}}{\bm{f}}_{i}\otimes {\bm{r}}_{i}={\displaystyle \sum _{i}}{\bm{f}}_{i}{\bm{r}}_{i}^{\top}$ | (1) |
where $\{{\bm{f}}_{i}\}$ are vector embeddings of the fillers and $\{{\bm{r}}_{i}\}$ are vector embeddings of the roles. In Eq. 1, and below, for notational simplicity we conflate order-2 tensors and matrices.
As a simple example, consider the symbolic type string, and choose roles to be ${r}_{1}=$ first_element, ${r}_{2}=$ second_element, etc. Then in the specific string S = cba, the first role ${r}_{1}$ is filled by c, and ${r}_{2}$ and ${r}_{3}$ by b and a, respectively. The TPR for S is $\bm{c}\otimes {\bm{r}}_{1}+\bm{b}\otimes {\bm{r}}_{2}+\bm{a}\otimes {\bm{r}}_{3}$, where $\bm{a},\bm{b},\bm{c}$ are the vector embeddings of the symbols a, b, c, and ${\bm{r}}_{i}$ is the vector embedding of role ${r}_{i}$.
A TPR scheme for embedding a set of symbol structures is defined by a decomposition of those structures into roles bound to fillers, an embedding of each role as a role vector, and an embedding of each filler as a filler vector. Let the total number of roles and fillers available be ${n}_{\mathrm{R}},{n}_{\mathrm{F}}$, respectively. Define the matrix of all possible role vectors to be $\bm{R}\in {\mathbb{R}}^{{d}_{\mathrm{R}}\times {n}_{\mathrm{R}}}$, with column $i$, ${[\bm{R}]}_{:i}={\bm{r}}_{i}\in {\mathbb{R}}^{{d}_{\mathrm{R}}}$, comprising the embedding of ${r}_{i}$. Similarly let $\bm{F}\in {\mathbb{R}}^{{d}_{\mathrm{F}}\times {n}_{\mathrm{F}}}$ be the matrix of all possible filler vectors. The TPR $\bm{T}\in {\mathbb{R}}^{{d}_{\mathrm{F}}\times {d}_{\mathrm{R}}}$. Below, ${d}_{\mathrm{R}},{n}_{\mathrm{R}},{d}_{\mathrm{F}},{n}_{\mathrm{F}}$ will be hyper-parameters, while $\bm{R},\bm{F}$ will be learned parameter matrices.
Using summation in Eq. 1 to combine the vectors embedding the constituents of a structure risks non-recoverability of those constituents given the embedding $\bm{T}$ of the structure as a whole. The tensor product is chosen as the binding operation in order to enable recovery of the filler of any role in a structure $\mathtt{S}$ given its TPR $\bm{T}$. This can be done with perfect precision if the embeddings of the roles are linearly independent. In that case the role matrix $\bm{R}$ has a left inverse $\bm{U}$: $\bm{U}\bm{R}=\bm{I}$. Now define the unbinding (or dual) vector for role ${r}_{j}$, ${\bm{u}}_{j}$, to be the ${j}^{\mathrm{th}}$ column of ${\bm{U}}^{\top}$: ${\bm{U}}_{:j}^{\top}$. Then, since ${[\bm{I}]}_{ji}={[\bm{U}\bm{R}]}_{ji}={\bm{U}}_{j:}{\bm{R}}_{:i}={[{\bm{U}}_{:j}^{\top}]}^{\top}{\bm{R}}_{:i}={\bm{u}}_{j}^{\top}{\bm{r}}_{i}={\bm{r}}_{i}^{\top}{\bm{u}}_{j}$, we have ${\bm{r}}_{i}^{\top}{\bm{u}}_{j}={\delta}_{ji}$. This means that, to recover the filler of ${r}_{j}$ in the structure with TPR $\bm{T}$, we can take its tensor inner product (or matrix-vector product) with ${\bm{u}}_{j}$, as in Eq. 2.^{2}
^{2} When the role vectors are not linearly independent, this operation performs unbinding approximately, taking $\bm{U}$ to be the left pseudo-inverse of $\bm{R}$. Because randomly chosen vectors on the unit sphere in a high-dimensional space are approximately orthogonal, the approximation is often excellent (anonymous2019).
$\bm{T}{\bm{u}}_{j}=\left[{\displaystyle \sum _{i}}{\bm{f}}_{i}{\bm{r}}_{i}^{\top}\right]{\bm{u}}_{j}={\displaystyle \sum _{i}}{\bm{f}}_{i}{\delta}_{ij}={\bm{f}}_{j}$ | (2) |
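Eqs. 1 and 2 can be checked numerically on the string example cba. The following is a minimal NumPy sketch, not the authors' code: the filler and role values are arbitrary choices made for illustration, and the helper names `bind` and `unbind` are hypothetical. The role vectors are linearly independent but not orthogonal, so the unbinding vectors genuinely differ from the role vectors.

```python
import numpy as np

# Filler embeddings for the symbols a, b, c (d_F = 2; values are arbitrary).
fillers = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
    "c": np.array([1.0, 1.0]),
}

# Role matrix R (d_R x n_R): column i embeds the role of string position i+1.
# The columns are linearly independent but deliberately not orthogonal.
R = np.array([
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

def bind(symbols, fillers, R):
    """Eq. 1: T = sum_i f_i r_i^T for the string `symbols`."""
    return sum(np.outer(fillers[s], R[:, i]) for i, s in enumerate(symbols))

def unbind(T, R, j):
    """Eq. 2: recover the filler of role j as T u_j."""
    U = np.linalg.pinv(R)   # left inverse of R when its columns are independent
    u_j = U[j, :]           # row j of U is column j of U^T, i.e. u_j
    return T @ u_j

T = bind("cba", fillers, R)                      # shape (d_F, d_R) = (2, 3)
recovered = [unbind(T, R, j) for j in range(3)]  # fillers of positions 1, 2, 3
```

Here `recovered` matches the embeddings of c, b and a in that order, up to floating-point error, because $\bm{U}\bm{R}=\bm{I}$ holds for this choice of $\bm{R}$.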
In the architecture proposed here, we will make use of both TPR binding using the tensor product with role vectors ${\bm{r}}_{i}$ and TPR unbinding using the tensor inner product with unbinding vectors ${\bm{u}}_{j}$. Binding will be used to produce the order-2 tensor ${\bm{T}}_{S}$ embedding of the NL problem statement. Unbinding will be used to generate output relational tuples from an order-3 tensor $\bm{H}$. Because they pertain to different representations (of different orders in fact), the binding and unbinding vectors we will use are not related to one another.
3 TP-N2F Model
We propose a general TP-N2F neural network architecture operating over TPRs to solve N2F tasks under a proposed role-level description of those tasks. In this description, natural-language input is represented as a straightforward order-2 role structure, and formal-language relational representations of outputs are represented with a new order-3 recursive role structure proposed here. Figure 1 shows an overview diagram of the TP-N2F model. It depicts the following high-level description.
As shown in Figure 1, while the natural-language input is a sequence of words, the output is a sequence of multi-argument relational tuples such as $(R{A}_{1}{A}_{2})$, a 3-tuple consisting of a binary relation (or operation) $R$ with its two arguments. The “TP-N2F encoder” uses two LSTMs to produce a pair consisting of a filler vector and a role vector, which are bound together with the tensor product. These tensor products, concatenated, comprise the “context” over which attention will operate in the decoder. The sum of the word-level TPRs, flattened to a vector, is treated as a representation of the entire problem statement; it is fed to the “Reasoning MLP”, which transforms this encoding of the problem into a vector encoding the solution. This is the initial state of the “TP-N2F decoder” attentional LSTM, which outputs at each time step an order-3 tensor representing a relational tuple. To generate a correct tuple from decoder operations, the model must learn to give the order-3 tensor the form of a TPR for a $(R{A}_{1}{A}_{2})$ tuple (detailed explanation in Sec. 3.1.2). In the following sections, we first introduce the details of our proposed role-level description for N2F tasks, and then present how our proposed TP-N2F model uses TPR binding and unbinding operations to create a neural network implementation of this description of N2F tasks.
3.1 Role-level description of N2F tasks
In this section, we propose a role-level description of N2F tasks, which specifies the filler/role structures of the input natural-language symbolic expressions and the output relational representations.
3.1.1 Role-level description for natural-language input
Instead of encoding each token of a sentence with a non-compositional embedding vector looked up in a learned dictionary, we use a learned role-filler decomposition to compose a tensor representation for each token. Given a sentence $S$ with $n$ word tokens $\{{w}^{0},{w}^{1},\mathrm{\dots},{w}^{n-1}\}$, each word token ${w}^{t}$ is assigned a learned role vector ${\bm{r}}^{t}$, soft-selected from the learned dictionary $\bm{R}$, and a learned filler vector ${\bm{f}}^{t}$, soft-selected from the learned dictionary $\bm{F}$ (Sec. 2). The mechanism closely follows that of Palangi2018, and we hypothesize similar results: the role and filler approximately encode the grammatical role of the token and its lexical semantics, respectively.^{3} Then each word token ${w}^{t}$ is represented by the tensor product of the role vector and the filler vector: ${\bm{T}}^{t}={\bm{f}}^{t}\otimes {\bm{r}}^{t}$. In addition to the set of all its token embeddings $\{{\bm{T}}^{0},\mathrm{\dots},{\bm{T}}^{n-1}\}$, the sentence $S$ as a whole is assigned a TPR equal to the sum of the TPR embeddings of all its word tokens: ${\bm{T}}_{S}={\sum}_{t=0}^{n-1}{\bm{T}}^{t}$.
^{3} Although the TPR formalism treats fillers and roles symmetrically, in use, hyperparameters are selected so that the number of available fillers is greater than that of roles. Thus, on average, each role is assigned to more words, encouraging it to take on a more general function, such as a grammatical role.
Using TPRs to encode natural language has several advantages. First, natural-language TPRs can be interpreted by exploring the distribution of tokens grouped by the role and filler vectors they are assigned by a trained model (as in Palangi2018). Second, TPRs avoid the bag-of-words (BoW) confusion (Huang2018): the BoW encoding of Jay saw Kay is the same as the BoW encoding of Kay saw Jay, but the two encodings differ under TPR embedding, because the role filled by a symbol changes with its context.
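The bag-of-words point can be made concrete with a small NumPy sketch. It rests on simplifying assumptions: the roles (subject, verb, object) are assigned by hand and all embeddings are one-hot, whereas in the model both roles and fillers are learned and soft-selected. The helper names are hypothetical.

```python
import numpy as np

# Hypothetical filler embeddings (one per word) and role embeddings
# (subject, verb, object); one-hot vectors keep the arithmetic easy to follow.
filler = {"Jay": np.array([1.0, 0.0, 0.0]),
          "saw": np.array([0.0, 1.0, 0.0]),
          "Kay": np.array([0.0, 0.0, 1.0])}
role = {"subj": np.array([1.0, 0.0, 0.0]),
        "verb": np.array([0.0, 1.0, 0.0]),
        "obj":  np.array([0.0, 0.0, 1.0])}

def bow(words):
    """Bag-of-words encoding: the sum of the word vectors, order ignored."""
    return sum(filler[w] for w in words)

def sentence_tpr(words, roles):
    """T_S = sum_t f^t (x) r^t, with one role per token."""
    return sum(np.outer(filler[w], role[r]) for w, r in zip(words, roles))

roles = ["subj", "verb", "obj"]
s1, s2 = ["Jay", "saw", "Kay"], ["Kay", "saw", "Jay"]

same_bow = np.array_equal(bow(s1), bow(s2))                                   # True
same_tpr = np.array_equal(sentence_tpr(s1, roles), sentence_tpr(s2, roles))  # False
```

The two sentences share a BoW vector, but their TPRs differ because Jay is bound to the subject role in one and to the object role in the other.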
3.1.2 Role-level description for relational representations
In this section, we propose a novel recursive role-level description for representing symbolic relational tuples. Each relational tuple contains a relation token and multiple argument tokens. Given a binary relation $rel$, a relational tuple can be written as $(rel\ arg_{1}\ arg_{2})$ where $arg_{1},arg_{2}$ indicate two arguments of relation $rel$. Let us adopt the two positional roles, ${p}_{i}^{rel}=$ arg${}_{i}$-of-$rel$ for $i=1,2$. The filler of role ${p}_{i}^{rel}$ is $arg_{i}$. Now let us use role decomposition recursively, noting that the role ${p}_{i}^{rel}$ can itself be decomposed into a sub-role ${p}_{i}=$ arg${}_{i}$-of-$\underline{\ \ }$ which has a sub-filler $rel$. Suppose that $arg_{i},rel,{p}_{i}$ are embedded as vectors ${\bm{a}}_{i},{\bm{r}}_{rel},{\bm{p}}_{i}$. Then the TPR encoding of ${p}_{i}^{rel}$ is ${\bm{r}}_{rel}\otimes {\bm{p}}_{i}$, so the TPR encoding of filler $arg_{i}$ bound to role ${p}_{i}^{rel}$ is ${\bm{a}}_{i}\otimes ({\bm{r}}_{rel}\otimes {\bm{p}}_{i})$. The tensor product is associative, so we can omit parentheses and write the TPR for the formal-language expression, the relational tuple $(rel\ arg_{1}\ arg_{2})$, as:
$\bm{H}={\bm{a}}_{1}\otimes {\bm{r}}_{rel}\otimes {\bm{p}}_{1}+{\bm{a}}_{2}\otimes {\bm{r}}_{rel}\otimes {\bm{p}}_{2}.$ | (3) |
Given the unbinding vectors ${\bm{p}}_{i}^{\prime}$ for positional role vectors ${\bm{p}}_{i}$ and the unbinding vector ${\bm{r}}_{rel}^{\prime}$ for the vector ${\bm{r}}_{rel}$ that embeds relation $rel$, each argument can be unbound in two steps as shown in Eqs. 4–5.
$\bm{H}\cdot {\bm{p}}_{i}^{\prime}=\left[{\bm{a}}_{1}\otimes {\bm{r}}_{rel}\otimes {\bm{p}}_{1}+{\bm{a}}_{2}\otimes {\bm{r}}_{rel}\otimes {\bm{p}}_{2}\right]\cdot {\bm{p}}_{i}^{\prime}={\bm{a}}_{i}\otimes {\bm{r}}_{rel}$ | (4) | ||
$\left[{\bm{a}}_{i}\otimes {\bm{r}}_{rel}\right]\cdot {\bm{r}}_{rel}^{\prime}={\bm{a}}_{i}$ | (5) |
Here $\cdot $ denotes the tensor inner product, which for the order-3 $\bm{H}$ and order-1 ${\bm{p}}_{i}^{\prime}$ in Eq. 4 can be defined as ${[\bm{H}\cdot {\bm{p}}_{i}^{\prime}]}_{jk}={\sum}_{l}{[\bm{H}]}_{jkl}{[{\bm{p}}_{i}^{\prime}]}_{l}$; in Eq. 5, $\cdot $ is equivalent to the matrix-vector product.
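The two-step unbinding of Eqs. 4–5 can be traced with `np.einsum`. This is an illustrative sketch under simplifying assumptions: the position vectors are orthonormal and the relation vector has unit length, so each unbinding vector equals the vector it unbinds. All numeric values are arbitrary.

```python
import numpy as np

d_A, d_R, d_P = 4, 3, 2
a1 = np.array([1.0, 2.0, 0.0, -1.0])   # embedding of arg_1
a2 = np.array([0.0, 1.0, 3.0, 0.5])    # embedding of arg_2
r_rel = np.array([0.0, 1.0, 0.0])      # unit-length embedding of rel
p1, p2 = np.eye(d_P)                   # orthonormal position vectors

# Eq. 3: H = a1 (x) r_rel (x) p1 + a2 (x) r_rel (x) p2, an order-3 tensor.
H = (np.einsum("j,k,l->jkl", a1, r_rel, p1)
     + np.einsum("j,k,l->jkl", a2, r_rel, p2))

# Under the assumptions above, the unbinding vectors are the vectors themselves.
p1_u, p2_u, r_u = p1, p2, r_rel

# Eq. 4: contract the last index of H with a positional unbinding vector.
B1 = np.einsum("jkl,l->jk", H, p1_u)   # equals a1 (x) r_rel
B2 = np.einsum("jkl,l->jk", H, p2_u)   # equals a2 (x) r_rel

# Eq. 5: a matrix-vector product with the relational unbinding vector.
a1_rec = B1 @ r_u
a2_rec = B2 @ r_u
```

The `"jkl,l->jk"` subscript string is a direct transcription of ${[\bm{H}\cdot {\bm{p}}_{i}^{\prime}]}_{jk}={\sum}_{l}{[\bm{H}]}_{jkl}{[{\bm{p}}_{i}^{\prime}]}_{l}$.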
Our proposed scheme can be contrasted with the TPR scheme in which $(rel\ arg_{1}\ arg_{2})$ is embedded as ${\bm{r}}_{rel}\otimes {\bm{a}}_{1}\otimes {\bm{a}}_{2}$ (e.g., smolensky2016basic; Schlag2018). In that scheme, an $n$-ary-relation tuple is embedded as an order-($n+1$) tensor, and unbinding an argument requires knowing all the other arguments (to use their unbinding vectors). In the scheme proposed here, an $n$-ary-relation tuple is still embedded as an order-3 tensor: there are just $n$ terms in the sum in Eq. 3, using $n$ position vectors ${\bm{p}}_{1},\mathrm{\dots},{\bm{p}}_{n}$; unbinding simply requires knowing the unbinding vectors for these fixed position vectors.
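A short sketch of this point: with the positional scheme, the tensor order does not grow with the arity of the relation. The function name `encode_tuple`, the one-hot position vectors, and all numeric values are assumptions made for this example.

```python
import numpy as np

def encode_tuple(args, r_rel, P):
    """Order-3 TPR of (rel arg_1 ... arg_n): sum_i a_i (x) r_rel (x) p_i.

    args: list of n argument vectors; P: array whose row i embeds position i+1.
    """
    return sum(np.einsum("j,k,l->jkl", a, r_rel, P[i]) for i, a in enumerate(args))

d_A, d_R, n_max = 3, 2, 3
P = np.eye(n_max)                       # orthonormal position vectors p_1..p_3
r_rel = np.array([1.0, 0.0])            # unit-length relation embedding
args = [np.array([1.0, 0.0, 0.0]),
        np.array([0.0, 2.0, 0.0]),
        np.array([0.0, 0.0, 3.0])]

H2 = encode_tuple(args[:2], r_rel, P)   # binary relation
H3 = encode_tuple(args, r_rel, P)       # ternary relation, same tensor order

# Unbinding the third argument needs only p_3 and r_rel, not the other arguments.
third = np.einsum("jkl,l->jk", H3, P[2]) @ r_rel
```

Both `H2` and `H3` have shape `(d_A, d_R, n_max)`; only the number of summed terms changes.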
In the model, the order-3 tensor $\bm{H}$ of Eq. 3 has a different status than the order-2 tensor ${\bm{T}}_{S}$ of Sec. 3.1.1. ${\bm{T}}_{S}$ is a TPR by construction, whereas $\bm{H}$ is a TPR as a result of successful learning. To generate the output relational tuples, the decoder assumes each tuple has the form of Eq. 3, and performs the unbinding operations which that structure calls for. In Appendix Sec. A.3, it is shown that, if unbinding each of a set of roles from some unknown tensor $\bm{T}$ gives a target set of fillers, then $\bm{T}$ must equal the TPR generated by those role/filler pairs, plus some tensor that is irrelevant because unbinding from it produces the zero vector. In other words, if the decoder succeeds in producing filler vectors that correspond to output relational tuples that match the target, then, as far as the decoder can see, the tensor that it operates on is the TPR of Eq. 3.
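The "irrelevant tensor" in this argument can be illustrated in the order-2 case. The sketch below is not the appendix proof; it only exhibits one tensor `Z`, built on a direction orthogonal to every unbinding vector, that changes the tensor without changing any unbound filler. All values are chosen by hand.

```python
import numpy as np

# Two role vectors in a 3-dimensional role space (d_R = 3, n_R = 2).
R = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
U = np.linalg.pinv(R)                   # rows are the unbinding vectors u_1, u_2
f1, f2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])

T = np.outer(f1, R[:, 0]) + np.outer(f2, R[:, 1])   # the TPR of Eq. 1

# A tensor built on a role-space direction orthogonal to both unbinding vectors.
n = np.array([0.0, 0.0, 1.0])
Z = np.outer(np.array([5.0, 7.0]), n)

T_alt = T + Z                           # differs from T, yet unbinds identically
g1, g2 = T_alt @ U[0], T_alt @ U[1]
```

Unbinding `Z` with either `u_1` or `u_2` gives the zero vector, so `g1` and `g2` equal `f1` and `f2` even though `T_alt` is not the TPR `T`.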
3.1.3 The TP-N2F Scheme for Learning the input-output mapping
To generate formal relational tuples from natural-language descriptions, a learning strategy for the mapping between the two structures is particularly important. As shown in Eq. 6, we formalize the learning scheme as learning a mapping function ${f}_{\mathrm{mapping}}(\cdot )$, which, given a structural representation of the natural-language input, ${\bm{T}}_{S}$, outputs a tensor ${\bm{T}}_{F}$ from which the structural representation of the output can be generated. At the role level of description, there is nothing more to be said about this mapping; how it is modeled at the neural network level is discussed in Sec. 3.2.1.
${\bm{T}}_{\mathrm{F}}={f}_{\mathrm{mapping}}({\bm{T}}_{S})$ | (6) |
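As a rough picture of Eq. 6 at the shape level, the sketch below flattens an order-2 input tensor, passes it through a small MLP, and reshapes the result to order 3. Everything specific here is an assumption made for illustration: the single hidden layer, the tanh nonlinearity, the layer width, and the random weights are not taken from the paper, which gives its own details in the Appendix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_F, d_R = 5, 4                  # encoder filler/role dimensions (illustrative)
d_A, d_Rel, d_P = 3, 2, 2        # decoder argument/relation/position dimensions
d_hidden = 16                    # hypothetical hidden width

# Randomly initialised weights stand in for learned parameters.
W1 = rng.normal(size=(d_hidden, d_F * d_R))
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_A * d_Rel * d_P, d_hidden))
b2 = np.zeros(d_A * d_Rel * d_P)

def f_mapping(T_S):
    """Map an order-2 input TPR to an order-3 tensor (shape-level sketch of Eq. 6)."""
    x = T_S.reshape(-1)                       # flatten the (d_F, d_R) matrix
    h = np.tanh(W1 @ x + b1)                  # one hidden layer (an assumption)
    return (W2 @ h + b2).reshape(d_A, d_Rel, d_P)

T_S = rng.normal(size=(d_F, d_R))             # stand-in for an encoded sentence
T_F = f_mapping(T_S)
```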
3.2 The TP-N2F Model for Natural- to Formal-Language Generation
As shown in Figure 1, the TP-N2F model is implemented with three steps: encoding, mapping, and decoding. The encoding step is implemented by the TP-N2F natural-language encoder (TP-N2F Encoder), which takes the sequence of word tokens as inputs, and encodes them via TPR binding according to the TP-N2F role scheme for natural-language input given in Sec. 3.1.1. The mapping step is implemented by an MLP called the Reasoning Module, which takes the encoding produced by the TP-N2F Encoder as input. It learns to map the natural-language-structure encoding of the input to a representation that will be processed under the assumption that it follows the role scheme for output relational-tuples specified in Sec. 3.1.2: the model needs to learn to produce TPRs such that this processing generates correct output programs. The decoding step is implemented by the TP-N2F relational tuples decoder (TP-N2F Decoder), which takes the output from the Reasoning Module (Sec. 3.1.3) and decodes the target sequence of relational tuples via TPR unbinding. The TP-N2F Decoder utilizes an attention mechanism over the individual-word TPRs ${\bm{T}}^{t}$ produced by the TP-N2F Encoder. The detailed implementations are introduced below.
3.2.1 The TP-N2F natural-language Encoder
The TP-N2F encoder follows the role scheme in Sec. 3.1.1 to encode each word token ${w}^{t}$ by soft-selecting one of ${n}_{\mathrm{F}}$ fillers and one of ${n}_{\mathrm{R}}$ roles. The fillers and roles are embedded as vectors. These embedding vectors, and the functions for selecting fillers and roles, are learned by two LSTMs, the Filler-LSTM and the Role-LSTM. (See Figure 2.) At each time-step $t$, the Filler-LSTM and the Role-LSTM take a learned word-token embedding ${\bm{w}}^{t}$ as input. The hidden state of the Filler-LSTM, ${\bm{h}}_{\mathrm{F}}^{t}$, is used to compute softmax scores ${u}_{k}^{\mathrm{F}}$ over ${n}_{\mathrm{F}}$ filler slots, and a filler vector ${\bm{f}}^{t}=\bm{F}{\bm{u}}^{\mathrm{F}}$ is computed from the softmax scores (recall from Sec. 2 that $\bm{F}$ is the learned matrix of filler vectors). Similarly, a role vector is computed from the hidden state of the Role-LSTM, ${\bm{h}}_{\mathrm{R}}^{t}$. ${f}_{\mathrm{F}}$ and ${f}_{\mathrm{R}}$ denote the functions that generate ${\bm{f}}^{t}$ and ${\bm{r}}^{t}$ from the hidden states of the two LSTMs. The token ${w}^{t}$ is encoded as ${\bm{T}}^{t}$, the tensor product of ${\bm{f}}^{t}$ and ${\bm{r}}^{t}$. ${\bm{T}}^{t}$ replaces the hidden vector in each LSTM and is passed to the next time step, together with the LSTM cell-state vector ${\bm{c}}^{t}$: see (7)–(8). After encoding the whole sequence, the TP-N2F encoder outputs the sum of all tensor products ${\sum}_{t}{\bm{T}}^{t}$ to the next module. We use an MLP, called the Reasoning MLP, for TPR mapping; it takes an order-2 TPR from the encoder and maps it to the initial state of the decoder. Detailed equations and implementation are provided in Sec. A.2.1 of the Appendix.
${\bm{h}}_{\mathrm{F}}^{t}={f}_{\mathrm{Filler\text{-}LSTM}}({\bm{w}}^{t},{\bm{T}}^{t-1},{\bm{c}}_{\mathrm{F}}^{t-1})\qquad {\bm{h}}_{\mathrm{R}}^{t}={f}_{\mathrm{Role\text{-}LSTM}}({\bm{w}}^{t},{\bm{T}}^{t-1},{\bm{c}}_{\mathrm{R}}^{t-1})$ | (7) |
${\bm{T}}^{t}={\bm{f}}^{t}\otimes {\bm{r}}^{t}={f}_{\mathrm{F}}({\bm{h}}_{\mathrm{F}}^{t})\otimes {f}_{\mathrm{R}}({\bm{h}}_{\mathrm{R}}^{t})$ | (8) |
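The soft selection and binding in Eq. 8 can be sketched as follows. This is a simplified illustration, not the authors' implementation: random vectors stand in for the LSTM hidden states, the recurrence that feeds ${\bm{T}}^{t-1}$ back into the LSTMs (Eq. 7) is omitted, and the projection matrices `W_F` and `W_R` are hypothetical names for whatever produces the softmax scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_F, n_F, d_R, n_R = 8, 5, 6, 4, 3   # illustrative sizes only

F = rng.normal(size=(d_F, n_F))    # filler dictionary (learned in the model)
R = rng.normal(size=(d_R, n_R))    # role dictionary (learned in the model)
W_F = rng.normal(size=(n_F, d_h))  # hypothetical projection: hidden state -> filler scores
W_R = rng.normal(size=(n_R, d_h))  # hypothetical projection: hidden state -> role scores

def softmax(x):
    e = np.exp(x - x.max())        # subtract the max for numerical stability
    return e / e.sum()

def encode_token(h_F, h_R):
    """Soft-select a filler and a role, then bind them (Eq. 8)."""
    u_F = softmax(W_F @ h_F)       # scores over the n_F filler slots
    u_R = softmax(W_R @ h_R)       # scores over the n_R role slots
    f_t, r_t = F @ u_F, R @ u_R    # convex combinations of dictionary columns
    return np.outer(f_t, r_t)      # T^t, shape (d_F, d_R)

# Stand-ins for the Filler-LSTM / Role-LSTM hidden states of a 4-token sentence.
hidden = [(rng.normal(size=d_h), rng.normal(size=d_h)) for _ in range(4)]
token_tprs = [encode_token(h_F, h_R) for h_F, h_R in hidden]
T_S = sum(token_tprs)              # sentence TPR passed on to the Reasoning MLP
```

Each token TPR is a rank-1 matrix, while their sum `T_S` generally has higher rank; the individual `token_tprs` are what the decoder attends over.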
3.2.2 The TP-N2F Relational-Tuple Decoder
The TP-N2F Decoder is an RNN that takes the output from the Reasoning MLP as its initial hidden state for generating a sequence of relational tuples (Figure 3). This decoder contains an attentional LSTM called the Tuple-LSTM which feeds an unbinding module: attention operates on the context vector of the encoder, consisting of all individual encoder outputs $\{{\bm{T}}^{t}\}$. The hidden state $\bm{H}$ of the Tuple-LSTM is treated as a TPR of a relational tuple and is unbound to a relation and arguments. During training, the Tuple-LSTM needs to learn a way to make $\bm{H}$ suitably approximate a TPR. At each time step $t$, the hidden state ${\bm{H}}^{t}$ of the Tuple-LSTM with attention (the version in luong2015att; Eq. 9) is fed as input to the unbinding module, which regards ${\bm{H}}^{t}$ as if it were the TPR of a relational tuple with $m$ arguments possessing the role structure described in Sec. 3.1.2: ${\bm{H}}^{t}\approx {\sum}_{i=1}^{m}{\bm{a}}_{i}^{t}\otimes {\bm{r}}_{rel}^{t}\otimes {\bm{p}}_{i}$. (In Figure 3, the assumed hypothetical form of ${\bm{H}}^{t}$, as well as that of ${\bm{B}}_{i}^{t}$ below, is shown in a bubble with dashed border.) To decode a binary relational tuple, the unbinding module decodes it from ${\bm{H}}^{t}$ using the two steps of TPR unbinding given in Eqs. 4–5. The positional unbinding vectors ${\bm{p}}_{i}^{\prime}$ are learned during training and shared across all time steps. After the first unbinding step (Eq. 4), i.e., the inner product of ${\bm{H}}^{t}$ with ${\bm{p}}_{i}^{\prime}$, we get tensors ${\bm{B}}_{i}^{t}$ (Eq. 10). These are treated as the TPRs of two arguments ${\bm{a}}_{i}^{t}$ bound to a relation ${\bm{r}}_{rel}^{t}$. A relational unbinding vector ${\bm{r}}_{rel}^{\prime t}$ is computed by a linear function from the sum of the ${\bm{B}}_{i}^{t}$ and used to compute the inner product with each ${\bm{B}}_{i}^{t}$ to yield ${\bm{a}}_{i}^{t}$, which are treated as the embeddings of the arguments (Eq. 11).
Based on TPR theory, ${\bm{r}}_{rel}^{\prime t}$ is passed to a linear function to get ${\bm{r}}_{rel}^{t}$ as the embedding of the relation. Finally, the softmax probability distribution over symbolic outputs is computed for relations and arguments separately. In generation, the most probable symbol is selected. (Detailed equations are in Appendix Sec. A.2.3.)
${\bm{H}}^{t}=\mathrm{Atten}({f}_{\mathrm{Tuple\text{-}LSTM}}(rel^{t},arg_{1}^{t},arg_{2}^{t},{\bm{H}}^{t-1},{\bm{c}}^{t-1}),[{\bm{T}}^{0},\mathrm{\dots},{\bm{T}}^{n-1}])$ | (9) |
${\bm{B}}_{1}^{t}={\bm{H}}^{t}\cdot {\bm{p}}_{1}^{\prime}\qquad {\bm{B}}_{2}^{t}={\bm{H}}^{t}\cdot {\bm{p}}_{2}^{\prime}$ | (10) |
${\bm{r}}_{rel}^{\prime t}={f}_{\mathrm{linear}}({\bm{B}}_{1}^{t}+{\bm{B}}_{2}^{t})\qquad {\bm{a}}_{1}^{t}={\bm{B}}_{1}^{t}\cdot {\bm{r}}_{rel}^{\prime t}\qquad {\bm{a}}_{2}^{t}={\bm{B}}_{2}^{t}\cdot {\bm{r}}_{rel}^{\prime t}$ | (11) |
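The unbinding module of Eqs. 10–11 can be sketched for a single time step. The code below is an illustration under stated assumptions, not the model's implementation: all weights are random stand-ins for learned parameters, the names `W_lin`, `W_dual`, `W_rel` and `W_arg` are hypothetical, the linear function is assumed to act on the flattened sum of the ${\bm{B}}_{i}^{t}$, and the attentional Tuple-LSTM of Eq. 9 is replaced by a random tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
d_A, d_R, d_P = 5, 4, 3          # illustrative sizes only
n_rel, n_arg = 7, 9              # hypothetical vocabulary sizes

p_unbind = rng.normal(size=(2, d_P))        # p'_1, p'_2 (learned, shared over time)
W_lin = rng.normal(size=(d_R, d_A * d_R))   # f_linear of Eq. 11 (assumed to see a flattened input)
W_dual = rng.normal(size=(d_R, d_R))        # linear map from r'_rel to r_rel
W_rel = rng.normal(size=(n_rel, d_R))       # relation classifier
W_arg = rng.normal(size=(n_arg, d_A))       # argument classifier

def unbind_tuple(H):
    """Decode one (rel, arg_1, arg_2) tuple from an order-3 hidden state H."""
    B = [np.einsum("jkl,l->jk", H, p) for p in p_unbind]   # Eq. 10
    r_unbind = W_lin @ (B[0] + B[1]).reshape(-1)           # Eq. 11: r'_rel
    args = [B_i @ r_unbind for B_i in B]                   # Eq. 11: a_1, a_2
    r_rel = W_dual @ r_unbind                              # relation embedding
    # The argmax of the logits equals the argmax of their softmax (greedy choice).
    rel_id = int(np.argmax(W_rel @ r_rel))
    arg_ids = [int(np.argmax(W_arg @ a)) for a in args]
    return rel_id, arg_ids

H_t = rng.normal(size=(d_A, d_R, d_P))   # stand-in for the Tuple-LSTM hidden state
rel_id, arg_ids = unbind_tuple(H_t)
```

Relations and arguments are classified over separate vocabularies, matching the statement that the softmax distributions are computed separately.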
3.3 Inference and The Learning Strategy of the TP-N2F Model
During inference time, natural language questions are encoded via the encoder and the Reasoning MLP maps the output of the encoder to the input of the decoder. We use greedy decoding (selecting the most likely class) to decode one relation and its arguments. The relation and argument vectors are concatenated to construct a new vector as the input for the Tuple-LSTM in the next step.
TP-N2F is trained using back-propagation (rumelhart1986learning) with the Adam optimizer (adam2017) and teacher-forcing. At each time step, the ground-truth relational tuple is provided as the input for the next time step. As the TP-N2F decoder decodes a relational tuple at each time step, the relation token is selected only from the relation vocabulary and the argument tokens from the argument vocabulary. For an input $\mathcal{I}$ that generates $N$ output relational tuples, the loss is the sum of the cross entropy loss $\mathcal{L}$ between the true labels $L$ and predicted tokens for relations and arguments as shown in (12).
${\mathcal{L}}_{\mathcal{I}}={\displaystyle \sum _{i=0}^{N-1}}\mathcal{L}(rel^{i},{L}_{rel^{i}})+{\displaystyle \sum _{i=0}^{N-1}}{\displaystyle \sum _{j=1}^{2}}\mathcal{L}(arg_{j}^{i},{L}_{arg_{j}^{i}})$ | (12) |
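Eq. 12 amounts to summing one cross-entropy term per relation and two per tuple for the arguments. A minimal NumPy sketch follows; the function names and array layout are assumptions for this example, and a real training setup would use a framework's built-in loss instead.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross entropy between softmax(logits) and a one-hot label."""
    shifted = logits - logits.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return -log_probs[label]

def tuple_sequence_loss(rel_logits, arg_logits, rel_labels, arg_labels):
    """Eq. 12 for one input that generates N binary relational tuples.

    rel_logits: (N, n_rel); arg_logits: (N, 2, n_arg);
    rel_labels: N ints; arg_labels: N pairs of ints.
    """
    loss = 0.0
    for i in range(len(rel_labels)):
        loss += cross_entropy(rel_logits[i], rel_labels[i])
        for j in range(2):
            loss += cross_entropy(arg_logits[i, j], arg_labels[i][j])
    return loss

# With all-zero logits every prediction is uniform, so the loss has a closed form:
# N * log(n_rel) + 2 * N * log(n_arg).
N, n_rel, n_arg = 3, 4, 5
loss = tuple_sequence_loss(np.zeros((N, n_rel)), np.zeros((N, 2, n_arg)),
                           rel_labels=[0, 1, 2],
                           arg_labels=[(0, 1), (2, 3), (4, 0)])
```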
4 EXPERIMENTS
The proposed TP-N2F model is evaluated on two N2F tasks, generating operation sequences to solve math problems and generating Lisp programs. In both tasks, TP-N2F achieves state-of-the-art performance. We further analyze the behavior of the unbinding relation vectors in the proposed model. Results of each task and the analysis of the unbinding relation vectors are introduced in turn. Details of experiments and datasets are described in Sec. A.1 in the Appendix.
4.1 Generating operation sequences to solve math problems
Given a natural-language math problem, we need to generate a sequence of operations (operators and corresponding arguments) from a set of operators and arguments to solve the given problem. Each operation is regarded as a relational tuple by viewing the operator as the relation, e.g., $(add,n1,n2)$. We test TP-N2F for this task on the MathQA dataset (Amini2019). The MathQA dataset consists of about 37k math word problems, each with a corresponding list of multi-choice options and the corresponding operation sequence. In this task, TP-N2F is deployed to generate the operation sequence given the question. The generated operations are executed with the execution script from Amini2019 to select a multi-choice answer. Because about 30% of the data are noisy (the execution script returns the wrong answer when given the ground-truth program; see Sec. A.1 of the Appendix), we report both execution accuracy (of the final multi-choice answer after running the execution engine) and operation sequence accuracy (where the generated operation sequence must match the ground-truth sequence exactly). TP-N2F is compared to a baseline provided by the seq2prog model in Amini2019, an LSTM-based seq2seq model with attention. Our model outperforms both the original seq2prog, designated SEQ2PROG-orig, and the best reimplemented seq2prog after an extensive hyperparameter search, designated SEQ2PROG-best. Table 1 presents the results. To verify the importance of the TP-N2F encoder and decoder, we conducted experiments to replace either the encoder with a standard LSTM (denoted LSTM2TP) or the decoder with a standard attentional LSTM (denoted TP2LSTM). We observe that both TPR components of TP-N2F are important for achieving the observed performance gain relative to the baseline.
MODEL | Operation Accuracy(%) | Execution Accuracy(%) |
---|---|---|
SEQ2PROG-orig | 59.4 | 51.9 |
SEQ2PROG-best | 66.97 | 54.0 |
LSTM2TP (ours) | 68.21 | 54.61 |
TP2LSTM (ours) | 68.84 | 54.61 |
TP-N2F (ours) | 71.89 | 55.95 |
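Operation sequence accuracy, as described above, is an exact-match metric. The sketch below shows one way it could be computed; the function name, the tuple layout, and the `#0` notation for "result of step 0" are assumptions made for this example rather than the evaluation script's actual format.

```python
def operation_accuracy(predicted, reference):
    """Percentage of problems whose predicted tuple sequence equals the reference exactly."""
    assert len(predicted) == len(reference)
    hits = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * hits / len(reference)

reference = [
    [("add", "n0", "n1"), ("divide", "#0", "n2")],
    [("multiply", "n0", "n1")],
]
predicted = [
    [("add", "n0", "n1"), ("divide", "#0", "n2")],
    [("multiply", "n1", "n0")],   # same value when executed, but not an exact match
]
acc = operation_accuracy(predicted, reference)   # 50.0
```

The second prediction illustrates why the metric is strict: swapping the arguments of a commutative operator yields the same executed result but still counts as a miss.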
4.2 Generating program trees from natural-language descriptions
Generating Lisp programs requires sensitivity to structural information because Lisp code can be regarded as tree-structured. Given a natural-language query, we need to generate code containing function calls with parameters. Each function call is a relational tuple, which has a function as the relation and parameters as arguments. We evaluate our model on the AlgoLisp dataset for this task and achieve state-of-the-art performance. The AlgoLisp dataset (Polosukhin2018) is a program synthesis dataset. Each sample contains a problem description, a corresponding Lisp program tree, and 10 input-output testing pairs. We parse the program tree into a straight-line sequence of tuples (same style as in MathQA). AlgoLisp provides an execution script to run the generated program and has three evaluation metrics: the accuracy of passing all test cases (Acc), the accuracy of passing 50% of test cases (50p-Acc), and the accuracy of generating an exactly matching program (M-Acc). AlgoLisp has about 10% noisy data (details in the Appendix), so we report results both on the full test set and the cleaned test set (in which all noisy testing samples are removed). TP-N2F is compared with an LSTM seq2seq with attention model, the Seq2Tree model in Polosukhin2018, and a seq2seq model with a pre-trained tree decoder from the Tree2Tree autoencoder (SAPS) reported in Bednarek2019. As shown in Table 2, TP-N2F outperforms all existing models on both the full test set and the cleaned test set. Ablation experiments with TP2LSTM and LSTM2TP show that, for this task, the TP-N2F Decoder is more helpful than the TP-N2F Encoder. This may be because Lisp code relies more heavily on structural representations.
MODEL (%) | Acc (full test set) | 50p-Acc (full) | M-Acc (full) | Acc (cleaned test set) | 50p-Acc (cleaned) | M-Acc (cleaned) |
---|---|---|---|---|---|---|
Seq2Tree | 61.0 | | | | | |
LSTM2LSTM+atten | 67.54 | 70.89 | 75.12 | 76.83 | 78.86 | 75.42 |
TP2LSTM (ours) | 72.28 | 77.62 | 79.92 | 77.67 | 80.51 | 76.75 |
LSTM2TP (ours) | 75.31 | 79.26 | 83.05 | 84.44 | 86.13 | 83.43 |
SAPSpre-VH-Att-256 | 83.80 | 87.45 | | 92.98 | 94.15 | |
TP-N2F (ours) | 84.02 | 88.01 | 93.06 | 93.48 | 94.64 | 92.78 |
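The three AlgoLisp metrics can be illustrated with a short sketch. The sample layout (keys `passed`, `predicted`, `target`) is hypothetical, and "passing 50% of test cases" is read here as passing at least half of them, which is an assumption of this sketch rather than a detail confirmed by the dataset's execution script.

```python
def algolisp_metrics(samples):
    """Compute Acc, 50p-Acc and M-Acc (in %) over a list of samples.

    Each sample is a dict with hypothetical keys:
      "passed":    list of booleans, one per input-output test pair
      "predicted": generated command sequence
      "target":    ground-truth command sequence
    """
    n = len(samples)
    acc = sum(all(s["passed"]) for s in samples)
    acc_50p = sum(sum(s["passed"]) >= 0.5 * len(s["passed"]) for s in samples)
    m_acc = sum(s["predicted"] == s["target"] for s in samples)
    return tuple(100.0 * count / n for count in (acc, acc_50p, m_acc))

samples = [
    {"passed": [True] * 10, "predicted": ["a"], "target": ["a"]},
    {"passed": [True] * 6 + [False] * 4, "predicted": ["a"], "target": ["b"]},
    {"passed": [False] * 10, "predicted": ["c"], "target": ["c"]},  # noisy sample
    {"passed": [True] * 10, "predicted": ["x"], "target": ["y"]},
]
acc, acc_50p, m_acc = algolisp_metrics(samples)
```

The third toy sample matches the ground-truth program exactly yet fails its tests, which is the kind of noisy sample that is removed from the cleaned test set and that allows M-Acc to exceed Acc on the full test set.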
4.3 Interpretation of learned structure
To interpret the structure learned by the model, we extract the trained unbinding relation vectors from the TP-N2F decoder and reduce their dimensionality via Principal Component Analysis. K-means clustering results on the average vectors are presented in Figure 4 and Figure 5 (in Appendix A.6). Results show that unbinding vectors for operators or functions with similar semantics tend to be close to each other. For example, with 5 clusters in the MathQA dataset, arithmetic operators such as add, subtract, multiply, divide are clustered together, and geometry-related operators such as square or volume are clustered together. With 4 clusters in the AlgoLisp dataset, partial/lambda functions and sort functions are in one cluster, and string processing functions are clustered together. Note that there is no direct supervision to inform the model about the nature of the operations, and the TP-N2F decoder has induced this role structure using weak supervision signals from question/operation-sequence-answer pairs. More clustering results are presented in Appendix A.6.
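The analysis pipeline (PCA followed by K-means) can be sketched as follows. This is an illustrative NumPy-only version, not the code used to produce Figures 4 and 5: the helper names `pca` and `kmeans` are ours, and random vectors stand in for the averaged unbinding relation vectors (40 hypothetical operators with $d_{Rel}=20$).

```python
import numpy as np

def pca(x, n_components=2):
    """Project the rows of x onto the top principal components (via SVD)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = x[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
vectors = rng.normal(size=(40, 20))   # stand-in for averaged unbinding vectors
points = pca(vectors, n_components=2)
labels = kmeans(points, k=5)
```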
5 Related work
N2F tasks include many different subtasks such as symbolic reasoning or semantic parsing (spsurvey2019; Cai2019; LiaoQSE2018; Amini2019; Polosukhin2018; Bednarek2019). These tasks require models with strong structure-learning ability. TPR is a promising technique for encoding symbolic structural information and modeling symbolic reasoning in vector space. TPR binding has been used for encoding and exploring grammatical structural information of natural language (Palangi2018; Huang2019). TPR unbinding has also been used to generate natural language captions from images (Huang2018). Some researchers use TPRs for modeling deductive reasoning processes both on a rule-based model and deep learning models in vector space (Lee2016; smolensky2016basic; Schlag2018). However, none of these previous models takes advantage of combining TPR binding and TPR unbinding to learn structure representation mappings explicitly, as done in our model. Although researchers are paying increasing attention to N2F tasks, most of the proposed models either do not encode structural information explicitly or are specialized to particular tasks. Our proposed TP-N2F neural model can be applied to many tasks.
6 Conclusion and future work
In this paper we propose a new scheme for neural-symbolic relational representations and a new architecture, TP-N2F, for formal-language generation from natural-language descriptions. To our knowledge, TP-N2F is the first model that combines TPR binding and TPR unbinding in an encoder-decoder fashion. TP-N2F achieves state-of-the-art results on two instances of N2F tasks, showing significant structure-learning ability. The results show that both the TP-N2F encoder and the TP-N2F decoder are important for improving natural- to formal-language generation. We believe that the interpretation and symbolic structure encoding of TPRs are a promising direction for future work. We also plan to combine large-scale deep learning models such as BERT with TP-N2F to take advantage of structure learning for other generation tasks.
References
Appendix A Appendix
A.1 Implementations of TP-N2F for experiments
In this section, we present the implementation details of the TP-N2F experiments on the two datasets.
The MathQA dataset consists of about 37k math word problems (split 80/12/8% into training/dev/testing sets), each with a corresponding list of multi-choice options and a straight-line operation sequence program that solves the problem. An example from the dataset is presented in Appendix A.4. In this task, TP-N2F is deployed to generate the operation sequence given the question. The generated operations are executed to generate the solution for the given math problem. We use the execution script from Amini2019 to execute the generated operation sequence and compute the multi-choice accuracy for each problem. During our experiments we observed that about 30% of the examples are noisy (the execution script fails to get the correct answer on the ground-truth program). Therefore, we report both execution accuracy (the final multi-choice answer after running the execution engine) and operation sequence accuracy (where the generated operation sequence must match the ground-truth sequence exactly).
The AlgoLisp dataset (Polosukhin2018) is a program synthesis dataset, which has 79k/9k/10k training/dev/testing samples. Each sample contains a problem description, a corresponding Lisp program tree, and 10 input-output testing pairs. We parse the program tree into a straight-line sequence of commands from leaves to root and (as in MathQA) use the symbol ${\mathrm{\#}}_{i}$ to indicate the result of the ${i}^{\mathrm{th}}$ command (generated previously by the model). A dataset sample with our parsed command sequence is presented in Appendix A.4. AlgoLisp provides an execution script to run the generated program and has three evaluation metrics: accuracy of passing all test cases (Acc), accuracy of passing 50% of test cases (50p-Acc), and accuracy of generating an exactly matched program (M-Acc). AlgoLisp has about 10% noisy data (where the execution script fails to pass all test cases on the ground-truth program), so we report results both on the full test set and the cleaned test set (in which all noisy testing samples are removed).
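The leaves-to-root parsing described above amounts to a post-order traversal of the program tree. The sketch below is one way to write it, under the assumption that a program tree is given as a nested Python list; the function name `linearize` is ours and is not taken from the released code.

```python
def linearize(tree):
    """Flatten a nested program tree into commands ordered leaves to root.

    A tree is a list [function, arg, ...]; an arg is either an atom (str)
    or a nested list. Nested calls are emitted first and replaced by "#i",
    where i is the index of the command that computes them.
    """
    commands = []

    def visit(node):
        if not isinstance(node, list):
            return node                      # atoms are kept as they are
        head, *args = node
        flat_args = [visit(arg) for arg in args]
        commands.append([head, *flat_args])
        return f"#{len(commands) - 1}"       # reference to this command's result

    visit(tree)
    return commands

tree = ["map", "a", ["partial1", "b", "-"]]
commands = linearize(tree)
# commands == [["partial1", "b", "-"], ["map", "a", "#0"]]
```

Applied to the sample in Appendix A.4.2, this reproduces the command sequence shown there. Repeated subexpressions are emitted once per occurrence, as in the generated programs of Appendix A.5.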
We use ${d}_{\mathrm{R}},{n}_{\mathrm{R}},{d}_{\mathrm{F}},{n}_{\mathrm{F}}$ to denote the TP-N2F encoder hyperparameters: the dimension of role vectors, the number of roles, the dimension of filler vectors, and the number of fillers, respectively. Similarly, ${d}_{Rel},{d}_{Arg},{d}_{Pos}$ denote the TP-N2F decoder hyperparameters: the dimension of relation vectors, the dimension of argument vectors, and the dimension of position vectors.
In the experiment on the MathQA dataset, we use ${n}_{\mathrm{F}}=150$, ${n}_{\mathrm{R}}=50$, ${d}_{\mathrm{F}}=30$, ${d}_{\mathrm{R}}=20$, ${d}_{Rel}=20$, ${d}_{Arg}=10$, ${d}_{Pos}=5$ and we train the model for 60 epochs with learning rate 0.00115. The reasoning module contains only one layer. As most of the math operators in this dataset are binary, we replace all operators taking three arguments with a set of binary operators based on hand-encoded rules, and for all operators taking one argument, a padding symbol is appended. For the baseline SEQ2PROG-orig, TP2LSTM and LSTM2TP, we use single-direction, one-layer LSTMs with hidden size 100. For SEQ2PROG-best, we performed a hyperparameter search on the hidden size for both encoder and decoder; the best score is reported.
In the experiment on the AlgoLisp dataset, we use ${n}_{\mathrm{F}}=150$, ${n}_{\mathrm{R}}=50$, ${d}_{\mathrm{F}}=30$, ${d}_{\mathrm{R}}=30$, ${d}_{Rel}=30$, ${d}_{Arg}=20$, ${d}_{Pos}=5$ and we train the model for 50 epochs with learning rate 0.00115. As in MathQA, the reasoning module contains one layer. For this dataset, most function calls take three arguments, so we simply add padding symbols for those functions with fewer than three arguments.
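The padding step can be sketched in a few lines. The padding token `<pad>` and the helper `pad_command` are hypothetical names chosen for illustration; the hand-encoded rules that rewrite three-argument MathQA operators into binary ones are not covered here.

```python
PAD = "<pad>"  # hypothetical padding symbol

def pad_command(command, n_args):
    """Pad a [relation, arg, ...] command so it has exactly n_args arguments."""
    relation, *args = command
    if len(args) > n_args:
        raise ValueError(f"{relation} takes more than {n_args} arguments")
    padding = [PAD] * (n_args - len(args))
    return [relation, *args, *padding]

mathqa_cmd = pad_command(["sqrt", "#0"], n_args=2)        # unary operator, binary tuples
algolisp_cmd = pad_command(["map", "a", "#0"], n_args=3)  # two arguments, ternary tuples
```

With a fixed arity, every output tuple has the same number of argument slots, so the decoder can use the same set of position vectors at every step.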
A.2 Detailed equations of TP-N2F
A.2.1 TP-N2F encoder
Filler-LSTM in TP-N2F encoder
This is a standard LSTM, governed by the equations:
$${\bm{f}}_{\mathrm{f}}^{t}=\phi ({\bm{U}}_{\mathrm{ff}}{\bm{w}}^{t}+{\bm{V}}_{\mathrm{ff}}\flat ({\bm{T}}^{t-1})+{\bm{b}}_{\mathrm{ff}})$$ | (13) |
$${\bm{g}}_{\mathrm{f}}^{t}=\mathrm{tanh}({\bm{U}}_{\mathrm{fg}}{\bm{w}}^{t}+{\bm{V}}_{\mathrm{fg}}\flat ({\bm{T}}^{t-1})+{\bm{b}}_{\mathrm{fg}})$$ | (14) |
$${\bm{i}}_{\mathrm{f}}^{t}=\phi ({\bm{U}}_{\mathrm{fi}}{\bm{w}}^{t}+{\bm{V}}_{\mathrm{fi}}\flat ({\bm{T}}^{t-1})+{\bm{b}}_{\mathrm{fi}})$$ | (15) |
$${\bm{o}}_{\mathrm{f}}^{t}=\phi ({\bm{U}}_{\mathrm{fo}}{\bm{w}}^{t}+{\bm{V}}_{\mathrm{fo}}\flat ({\bm{T}}^{t-1})+{\bm{b}}_{\mathrm{fo}})$$ | (16) |
$${\bm{c}}_{\mathrm{f}}^{t}={\bm{f}}_{\mathrm{f}}^{t}\odot {\bm{c}}_{\mathrm{f}}^{t-1}+{\bm{i}}_{\mathrm{f}}^{t}\odot {\bm{g}}_{\mathrm{f}}^{t}$$ | (17) |
$${\bm{h}}_{\mathrm{f}}^{t}={\bm{o}}_{\mathrm{f}}^{t}\odot \mathrm{tanh}({\bm{c}}_{\mathrm{f}}^{t})$$ | (18) |
$\phi ,\mathrm{tanh}$ are the logistic sigmoid and tanh functions applied elementwise. $\flat $ flattens (reshapes) a matrix in ${\mathbb{R}}^{{d}_{\mathrm{F}}\times {d}_{\mathrm{R}}}$ into a vector in ${\mathbb{R}}^{{d}_{\mathrm{T}}}$, where ${d}_{\mathrm{T}}={d}_{\mathrm{F}}{d}_{\mathrm{R}}$. $\odot $ is elementwise multiplication. The variables have the following dimensions:
${\bm{f}}_{\mathrm{f}}^{t},{\bm{g}}_{\mathrm{f}}^{t},{\bm{i}}_{\mathrm{f}}^{t},{\bm{o}}_{\mathrm{f}}^{t},{\bm{c}}_{\mathrm{f}}^{t},{\bm{h}}_{\mathrm{f}}^{t},{\bm{b}}_{\mathrm{ff}},{\bm{b}}_{\mathrm{fg}},{\bm{b}}_{\mathrm{fi}},{\bm{b}}_{\mathrm{fo}},\flat ({\bm{T}}^{t-1})\in {\mathbb{R}}^{{d}_{\mathrm{T}}}$ | ||
${\bm{w}}^{t}\in {\mathbb{R}}^{d}$ | ||
${\bm{U}}_{\mathrm{ff}},{\bm{U}}_{\mathrm{fg}},{\bm{U}}_{\mathrm{fi}},{\bm{U}}_{\mathrm{fo}}\in {\mathbb{R}}^{{d}_{\mathrm{T}}\times d}$ | ||
${\bm{V}}_{\mathrm{ff}},{\bm{V}}_{\mathrm{fg}},{\bm{V}}_{\mathrm{fi}},{\bm{V}}_{\mathrm{fo}}\in {\mathbb{R}}^{{d}_{\mathrm{T}}\times {d}_{\mathrm{T}}}$ |
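As a rough illustration of Eqs. 13–18, the following NumPy sketch performs one Filler-LSTM step in which the recurrent input is the flattened previous TPR. The function `lstm_step`, the `params` layout and the toy dimensions are assumptions made for this sketch; the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_t, T_prev, c_prev, params):
    """One step of Eqs. 13-18.

    w_t:    input token embedding, shape (d,)
    T_prev: previous TPR, shape (d_F, d_R); flattened to (d_T,)
    c_prev: previous cell state, shape (d_T,)
    params: dict mapping each gate in "f", "g", "i", "o" to (U, V, b)
    """
    t_flat = T_prev.reshape(-1)              # the flattening operator

    def pre(gate):
        U, V, b = params[gate]
        return U @ w_t + V @ t_flat + b

    f = sigmoid(pre("f"))                    # forget gate, Eq. 13
    g = np.tanh(pre("g"))                    # candidate,   Eq. 14
    i = sigmoid(pre("i"))                    # input gate,  Eq. 15
    o = sigmoid(pre("o"))                    # output gate, Eq. 16
    c = f * c_prev + i * g                   # Eq. 17
    h = o * np.tanh(c)                       # Eq. 18
    return h, c

rng = np.random.default_rng(0)
d, d_F, d_R = 8, 3, 2                        # toy sizes, not the paper's settings
d_T = d_F * d_R
params = {}
for gate in "fgio":
    params[gate] = (rng.normal(size=(d_T, d)),
                    rng.normal(size=(d_T, d_T)),
                    np.zeros(d_T))
h, c = lstm_step(rng.normal(size=d), np.zeros((d_F, d_R)), np.zeros(d_T), params)
```

The Role-LSTM has the same form with its own parameters.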
Filler vector
The filler vector for input token ${w}^{t}$ is ${\bm{f}}^{t}$, defined through an attention vector over possible fillers, ${\bm{a}}_{\mathrm{f}}^{t}$:
$${\bm{a}}_{\mathrm{f}}^{t}=\mathrm{softmax}(({\bm{W}}_{\mathrm{fa}}{\bm{h}}_{\mathrm{f}}^{t})/T)$$ | (19) |
$${\bm{f}}^{t}={\bm{W}}_{\mathrm{f}}{\bm{a}}_{\mathrm{f}}^{t}$$ | (20) |
(${W}_{\mathrm{f}}$ is the same as $\bm{F}$ of Sec. 2.) The variables’ dimensions are:
${\bm{W}}_{\mathrm{fa}}\in {\mathbb{R}}^{{n}_{\mathrm{F}}\times {d}_{\mathrm{T}}}$ | ||
${\bm{a}}_{\mathrm{f}}^{t}\in {\mathbb{R}}^{{n}_{\mathrm{F}}}$ | ||
${\bm{W}}_{\mathrm{f}}\in {\mathbb{R}}^{{d}_{\mathrm{F}}\times {n}_{\mathrm{F}}}$ | ||
${\bm{f}}^{t}\in {\mathbb{R}}^{{d}_{\mathrm{F}}}$ |
$T$ is the temperature factor, which is fixed at 0.1.
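Eqs. 19–20 can be read as a soft lookup in a learned dictionary of fillers. A minimal NumPy sketch follows; the function name `soft_select` and the toy dimensions are ours, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

def soft_select(h, W_att, W_emb, temperature=0.1):
    """Eqs. 19-20: attend over n_F fillers and return their weighted sum.

    h:     Filler-LSTM hidden state, shape (d_T,)
    W_att: attention weights, shape (n_F, d_T)
    W_emb: filler embeddings, one column per filler, shape (d_F, n_F)
    """
    logits = (W_att @ h) / temperature
    logits = logits - logits.max()           # numerical stability only
    a = np.exp(logits)
    a /= a.sum()                             # softmax, Eq. 19
    return W_emb @ a, a                      # Eq. 20

rng = np.random.default_rng(0)
n_F, d_F, d_T = 5, 3, 6                      # toy sizes
f_t, a_t = soft_select(rng.normal(size=d_T),
                       rng.normal(size=(n_F, d_T)),
                       rng.normal(size=(d_F, n_F)))
```

Dividing the logits by a temperature below 1 sharpens the attention distribution, pushing the selected filler towards a single dictionary entry. The role vector of Eqs. 27–28 is computed in the same way with the role parameters.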
Role-LSTM in TP-N2F encoder
Similar to the Filler-LSTM, the Role-LSTM is also a standard LSTM, governed by the equations:
$${\bm{f}}_{\mathrm{r}}^{t}=\phi ({\bm{U}}_{\mathrm{rf}}{\bm{w}}^{t}+{\bm{V}}_{\mathrm{rf}}\flat ({\bm{T}}^{t-1})+{\bm{b}}_{\mathrm{rf}})$$ | (21) |
$${\bm{g}}_{\mathrm{r}}^{t}=\mathrm{tanh}({\bm{U}}_{\mathrm{rg}}{\bm{w}}^{t}+{\bm{V}}_{\mathrm{rg}}\flat ({\bm{T}}^{t-1})+{\bm{b}}_{\mathrm{rg}})$$ | (22) |
$${\bm{i}}_{\mathrm{r}}^{t}=\phi ({\bm{U}}_{\mathrm{ri}}{\bm{w}}^{t}+{\bm{V}}_{\mathrm{ri}}\flat ({\bm{T}}^{t-1})+{\bm{b}}_{\mathrm{ri}})$$ | (23) |
$${\bm{o}}_{\mathrm{r}}^{t}=\phi ({\bm{U}}_{\mathrm{ro}}{\bm{w}}^{t}+{\bm{V}}_{\mathrm{ro}}\flat ({\bm{T}}^{t-1})+{\bm{b}}_{\mathrm{ro}})$$ | (24) |
$${\bm{c}}_{\mathrm{r}}^{t}={\bm{f}}_{\mathrm{r}}^{t}\odot {\bm{c}}_{\mathrm{r}}^{t-1}+{\bm{i}}_{\mathrm{r}}^{t}\odot {\bm{g}}_{\mathrm{r}}^{t}$$ | (25) |
$${\bm{h}}_{\mathrm{r}}^{t}={\bm{o}}_{\mathrm{r}}^{t}\odot \mathrm{tanh}({\bm{c}}_{\mathrm{r}}^{t})$$ | (26) |
The variable dimensions are:
${\bm{f}}_{\mathrm{r}}^{t},{\bm{g}}_{\mathrm{r}}^{t},{\bm{i}}_{\mathrm{r}}^{t},{\bm{o}}_{\mathrm{r}}^{t},{\bm{c}}_{\mathrm{r}}^{t},{\bm{h}}_{\mathrm{r}}^{t},{\bm{b}}_{\mathrm{rf}},{\bm{b}}_{\mathrm{rg}},{\bm{b}}_{\mathrm{ri}},{\bm{b}}_{\mathrm{ro}},\flat ({\bm{T}}^{t-1})\in {\mathbb{R}}^{{d}_{\mathrm{T}}}$ | ||
${\bm{w}}^{t}\in {\mathbb{R}}^{d}$ | ||
${\bm{U}}_{\mathrm{rf}},{\bm{U}}_{\mathrm{rg}},{\bm{U}}_{\mathrm{ri}},{\bm{U}}_{\mathrm{ro}}\in {\mathbb{R}}^{{d}_{\mathrm{T}}\times d}$ | ||
${\bm{V}}_{\mathrm{rf}},{\bm{V}}_{\mathrm{rg}},{\bm{V}}_{\mathrm{ri}},{\bm{V}}_{\mathrm{ro}}\in {\mathbb{R}}^{{d}_{\mathrm{T}}\times {d}_{\mathrm{T}}}$ |
Role vector
The role vector for input token ${w}^{t}$ is determined analogously to its filler vector:
$${\bm{a}}_{\mathrm{r}}^{t}=\mathrm{softmax}(({\bm{W}}_{\mathrm{ra}}{\bm{h}}_{\mathrm{r}}^{t})/T)$$ | (27) |
$${\bm{r}}^{t}={\bm{W}}_{\mathrm{r}}{\bm{a}}_{\mathrm{r}}^{t}$$ | (28) |
The dimensions are:
${\bm{W}}_{\mathrm{ra}}\in {\mathbb{R}}^{{n}_{\mathrm{R}}\times {d}_{\mathrm{T}}}$ | ||
${\bm{a}}_{\mathrm{r}}^{t}\in {\mathbb{R}}^{{n}_{\mathrm{R}}}$ | ||
${\bm{W}}_{\mathrm{r}}\in {\mathbb{R}}^{{d}_{\mathrm{R}}\times {n}_{\mathrm{R}}}$ | ||
${\bm{r}}^{t}\in {\mathbb{R}}^{{d}_{\mathrm{R}}}$ |
Binding
The TPR for the filler/role binding for token ${w}^{t}$ is then:
$${\bm{T}}_{t}={\bm{r}}^{t}\otimes {\bm{f}}^{t}$$ | (29) |
where
${\bm{T}}_{t}\in {\mathbb{R}}^{{d}_{\mathrm{R}}\times {d}_{\mathrm{F}}}$ |
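The binding of Eq. 29 is an outer product. The sketch below binds two toy fillers to two roles, superimposes the bindings, and recovers one filler by unbinding. Superimposing two bindings and choosing orthonormal role vectors (so that each role is its own unbinding vector) are assumptions made purely for illustration; Eq. 29 itself binds a single filler/role pair per token.

```python
import numpy as np

d_R, d_F = 4, 3                              # toy sizes
rng = np.random.default_rng(0)

# Orthonormal role vectors (rows), so each role is its own unbinding vector.
roles = np.linalg.qr(rng.normal(size=(d_R, d_R)))[0].T
fillers = rng.normal(size=(2, d_F))

# Bind filler k to role k (Eq. 29) and superimpose the two bindings.
T = sum(np.outer(roles[k], fillers[k]) for k in range(2))   # shape (d_R, d_F)

# Unbinding with role 0 recovers filler 0.
recovered = roles[0] @ T
```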
A.2.2 Structure Mapping
$${\bm{H}}^{0}={f}_{\mathrm{mapping}}({\bm{T}}_{t})$$ | (30) |
${\bm{H}}^{0}\in {\mathbb{R}}^{{d}_{\mathrm{H}}}$, where ${d}_{\mathrm{H}}={d}_{\mathrm{A}}{d}_{\mathrm{O}}{d}_{\mathrm{P}}$ and ${d}_{\mathrm{A}},{d}_{\mathrm{O}},{d}_{\mathrm{P}}$ are the dimensions of the argument, operator and position vectors. ${f}_{\mathrm{mapping}}$ is implemented as an MLP (a linear layer followed by a tanh) that maps ${\bm{T}}_{t}\in {\mathbb{R}}^{{d}_{\mathrm{T}}}$ to the initial state of the decoder, ${\bm{H}}^{0}$.
A.2.3 TP-N2F decoder
Tuple-LSTM
The output tuples are also generated via a standard LSTM:
$${\bm{w}}_{d}^{t}=\gamma ({\bm{w}}_{Rel}^{t-1},{\bm{w}}_{Arg1}^{t-1},{\bm{w}}_{Arg2}^{t-1})$$ | (31) |
$${\bm{f}}^{t}=\phi ({\bm{U}}_{\mathrm{f}}{\bm{w}}_{d}^{t}+{\bm{V}}_{\mathrm{f}}\flat ({\bm{H}}^{t-1})+{\bm{b}}_{\mathrm{f}})$$ | (32) |
$${\bm{g}}^{t}=\mathrm{tanh}({\bm{U}}_{\mathrm{g}}{\bm{w}}_{d}^{t}+{\bm{V}}_{\mathrm{g}}\flat ({\bm{H}}^{t-1})+{\bm{b}}_{\mathrm{g}})$$ | (33) |
$${\bm{i}}^{t}=\phi ({\bm{U}}_{\mathrm{i}}{\bm{w}}_{d}^{t}+{\bm{V}}_{\mathrm{i}}\flat ({\bm{H}}^{t-1})+{\bm{b}}_{\mathrm{i}})$$ | (34) |
$${\bm{o}}^{t}=\phi ({\bm{U}}_{\mathrm{o}}{\bm{w}}_{d}^{t}+{\bm{V}}_{\mathrm{o}}\flat ({\bm{H}}^{t-1})+{\bm{b}}_{\mathrm{o}})$$ | (35) |
$${\bm{c}}^{t}={\bm{f}}^{t}\odot {\bm{c}}^{t-1}+{\bm{i}}^{t}\odot {\bm{g}}^{t}$$ | (36) |
$${\bm{h}}_{\mathrm{input}}^{t}={\bm{o}}^{t}\odot \mathrm{tanh}({\bm{c}}^{t})$$ | (37) |
$${\bm{H}}^{t}=\mathrm{Atten}({\bm{h}}_{\mathrm{input}}^{t},[{\bm{T}}_{0},\mathrm{\dots},{\bm{T}}_{n-1}])$$ | (38) |
Here, $\gamma $ is the concatenation function. ${\bm{w}}_{Rel}^{t-1}$ is the trained embedding vector for the Relation of the input binary tuple, ${\bm{w}}_{Arg1}^{t-1}$ is the embedding vector for the first argument and ${\bm{w}}_{Arg2}^{t-1}$ is the embedding vector for the second argument. Then the input for the Tuple LSTM is the concatenation of the embedding vectors of relation and arguments, with dimension ${d}_{\mathrm{dec}}$.
${\bm{f}}^{t},{\bm{g}}^{t},{\bm{i}}^{t},{\bm{o}}^{t},{\bm{c}}^{t},{\bm{h}}_{\mathrm{input}}^{t},{\bm{b}}_{\mathrm{f}},{\bm{b}}_{\mathrm{g}},{\bm{b}}_{\mathrm{i}},{\bm{b}}_{\mathrm{o}},\flat ({\bm{H}}^{t-1})\in {\mathbb{R}}^{{d}_{\mathrm{H}}}$ | ||
${\bm{w}}_{d}^{t}\in {\mathbb{R}}^{{d}_{\mathrm{dec}}}$ | ||
${\bm{U}}_{\mathrm{f}},{\bm{U}}_{\mathrm{g}},{\bm{U}}_{\mathrm{i}},{\bm{U}}_{\mathrm{o}}\in {\mathbb{R}}^{{d}_{\mathrm{H}}\times {d}_{\mathrm{dec}}}$ | ||
${\bm{V}}_{\mathrm{f}},{\bm{V}}_{\mathrm{g}},{\bm{V}}_{\mathrm{i}},{\bm{V}}_{\mathrm{o}}\in {\mathbb{R}}^{{d}_{\mathrm{H}}\times {d}_{\mathrm{H}}}$ | ||
${\bm{H}}^{t}\in {\mathbb{R}}^{{d}_{\mathrm{H}}}$ |
$\mathrm{Atten}$ is the attention mechanism used in luong2015att, which computes the dot product between ${\bm{h}}_{\mathrm{input}}^{t}$ and each ${\bm{T}}_{{t}^{\prime}}$. Then a linear function is used on the concatenation of ${\bm{h}}_{\mathrm{input}}^{t}$ and the softmax scores on all dot products to generate ${\bm{H}}^{t}$. The following equations show the attention mechanism:
$${\bm{d}}^{t}=\mathrm{score}({\bm{h}}_{\mathrm{input}}^{t},{\bm{C}}_{T})$$ | (39) |
$${\bm{s}}^{t}={\bm{C}}_{T}\mathrm{softmax}({\bm{d}}^{t})$$ | (40) |
$${\bm{H}}^{t}=\bm{K}\gamma ({\bm{h}}_{\mathrm{input}}^{t},{\bm{s}}^{t})$$ | (41) |
$\mathrm{score}$ is the score function of the attention. In this paper, the score function is dot product.
${\bm{C}}_{T}\in {\mathbb{R}}^{{d}_{\mathrm{H}}\times n}$ | ||
${\bm{d}}^{t}\in {\mathbb{R}}^{n}$ | ||
${\bm{s}}^{t}\in {\mathbb{R}}^{{d}_{\mathrm{H}}}$ | ||
$\bm{K}\in {\mathbb{R}}^{{d}_{\mathrm{H}}\times ({d}_{\mathrm{T}}+n)}$ |
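Eqs. 39–41 can be sketched in NumPy as below. Two points are assumptions of this sketch rather than statements about the implementation: the encoder vectors are stacked as columns of ${\bm{C}}_{T}$ with shape $({d}_{\mathrm{H}},n)$, and $\bm{K}$ is given shape $({d}_{\mathrm{H}},2{d}_{\mathrm{H}})$ so that it can multiply the concatenation of the two ${d}_{\mathrm{H}}$-dimensional vectors in Eq. 41.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(h_input, C_T, K):
    """Eqs. 39-41 with a dot-product score.

    h_input: Tuple-LSTM output, shape (d_H,)
    C_T:     encoder vectors stacked as columns, shape (d_H, n)
    K:       output projection, shape (d_H, 2 * d_H) (assumed shape)
    """
    d = C_T.T @ h_input                       # Eq. 39: one score per input position
    s = C_T @ softmax(d)                      # Eq. 40: context vector
    return K @ np.concatenate([h_input, s])   # Eq. 41

rng = np.random.default_rng(0)
d_H, n = 6, 4                                 # toy sizes
h_input = rng.normal(size=d_H)
C_T = rng.normal(size=(d_H, n))
K = rng.normal(size=(d_H, 2 * d_H))
H_t = attend(h_input, C_T, K)
```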
Unbinding
At each timestep $t$, the 2-step unbinding process described in Sec. 3.1.2 operates first on an encoding of the triple as a whole, $\bm{H}$, using two unbinding vectors ${\bm{p}}_{i}^{\prime}$ that are learned but fixed for all tuples. This first unbinding gives an encoding of the two operator-argument bindings, ${\bm{B}}_{i}$. The second unbinding operates on the ${\bm{B}}_{i}$, using a generated unbinding vector for the operator, ${\bm{r}}_{rel}^{\prime}$, giving encodings of the arguments, ${\bm{a}}_{i}$. The generated unbinding vector for the operator, ${\bm{r}}_{rel}^{\prime}$, and the generated encodings of the arguments, ${\bm{a}}_{i}$, each produce a probability distribution, over symbolic operator outputs $Rel$ and symbolic argument outputs $Arg_{i}$ respectively; these probabilities are used in the cross-entropy loss function. To generate a single symbolic output, the most probable symbol is selected.
$${\bm{B}}_{1}^{t}={\bm{H}}^{t}{\bm{p}}_{1}^{\prime}$$ | (42) |
$${\bm{B}}_{2}^{t}={\bm{H}}^{t}{\bm{p}}_{2}^{\prime}$$ | (43) |
$${\bm{r}}_{rel}^{\prime t}={\bm{W}}_{\mathrm{dual}}({\bm{B}}_{1}^{t}+{\bm{B}}_{2}^{t})$$ | (44) |
$${\bm{a}}_{1}^{t}={\bm{B}}_{1}^{t}{\bm{r}}_{rel}^{\prime t}$$ | (45) |
$${\bm{a}}_{2}^{t}={\bm{B}}_{2}^{t}{\bm{r}}_{rel}^{\prime t}$$ | (46) |
$${\bm{l}}_{r}^{t}={\bm{L}}_{r}^{t}{\bm{r}}_{rel}^{\prime t}$$ | (47) |
$${\bm{l}}_{{a}_{1}}^{t}={\bm{L}}_{a}^{t}{\bm{a}}_{1}^{t}$$ | (48) |
$${\bm{l}}_{{a}_{2}}^{t}={\bm{L}}_{a}^{t}{\bm{a}}_{2}^{t}$$ | (49) |
$$Rel^{t}=\mathrm{argmax}(\mathrm{softmax}({\bm{l}}_{r}^{t}))$$ | (50) |
$$Arg1^{t}=\mathrm{argmax}(\mathrm{softmax}({\bm{l}}_{{a}_{1}}^{t}))$$ | (51) |
$$Arg2^{t}=\mathrm{argmax}(\mathrm{softmax}({\bm{l}}_{{a}_{2}}^{t}))$$ | (52) |
The dimensions are:
${\bm{r}}_{rel}^{\prime t}\in {\mathbb{R}}^{{d}_{\mathrm{O}}}$ | ||
${\bm{a}}_{1}^{t},{\bm{a}}_{2}^{t}\in {\mathbb{R}}^{{d}_{\mathrm{A}}}$ | ||
${\bm{p}}_{1}^{\prime},{\bm{p}}_{2}^{\prime}\in {\mathbb{R}}^{{d}_{\mathrm{P}}}$ | ||
${\bm{B}}_{1}^{t},{\bm{B}}_{2}^{t}\in {\mathbb{R}}^{{d}_{\mathrm{A}}\times {d}_{\mathrm{O}}}$ | ||
${\bm{W}}_{\mathrm{dual}}\in {\mathbb{R}}^{{d}_{\mathrm{H}}}$ | ||
${\bm{L}}_{r}^{t}\in {\mathbb{R}}^{{n}_{\mathrm{O}}\times {d}_{\mathrm{O}}}$ | ||
${\bm{L}}_{a}^{t}\in {\mathbb{R}}^{{n}_{\mathrm{A}}\times {d}_{\mathrm{A}}}$ | ||
${\bm{l}}_{r}^{t}\in {\mathbb{R}}^{{n}_{\mathrm{O}}}$ | ||
${\bm{l}}_{{a}_{1}}^{t},{\bm{l}}_{{a}_{2}}^{t}\in {\mathbb{R}}^{{n}_{\mathrm{A}}}$ |
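The two-step unbinding of Eqs. 42–46 can be checked numerically on a hand-built tuple encoding. In this sketch the position vectors are orthonormal and the relation vector has unit norm, so that each vector serves as its own unbinding vector; this is an idealization. The relation unbinding vector is also supplied directly instead of being generated through ${\bm{W}}_{\mathrm{dual}}$ (Eq. 44), and the output layers of Eqs. 47–52 are omitted.

```python
import numpy as np

d_A, d_O, d_P = 5, 4, 3                      # toy sizes
rng = np.random.default_rng(0)

# Orthonormal position vectors p_1, p_2 and a unit-norm relation vector.
positions = np.linalg.qr(rng.normal(size=(d_P, d_P)))[0].T[:2]
r_rel = rng.normal(size=d_O)
r_rel /= np.linalg.norm(r_rel)
args = rng.normal(size=(2, d_A))             # argument vectors a_1, a_2

# Order-3 TPR of the tuple: H = sum_j a_j (x) r_rel (x) p_j.
H = sum(np.einsum("a,o,p->aop", args[j], r_rel, positions[j]) for j in range(2))

# First unbinding (Eqs. 42-43): contract the position axis.
B = [np.einsum("aop,p->ao", H, positions[j]) for j in range(2)]   # each (d_A, d_O)

# Second unbinding (Eqs. 45-46): contract the relation axis.
a_hat = [B[j] @ r_rel for j in range(2)]
```

Under these assumptions `a_hat[j]` reproduces `args[j]` up to floating-point error, which is the behaviour that the output layers rely on when scoring argument symbols.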
A.3 The tensor that is input to the decoder’s Unbinding Module is a TPR
Here we show that, if learning is successful, the order-3 tensor $\bm{H}$ that each iteration of the decoder’s Tuple LSTM feeds to the decoder’s Unbinding Module (Figure 3) will be a TPR of the form assumed in Eq. 3, repeated here:
$\bm{H}={\displaystyle \sum _{j}}{\bm{a}}_{j}\otimes {\bm{r}}_{rel}\otimes {\bm{p}}_{j}.$ | (53) |
The operations performed by the decoder are given in Eqs. 4–5, and Eqs. 10–11, rewritten here:
$\bm{H}\cdot {\bm{p}}_{i}^{\prime}={\bm{q}}_{i}$ | (54) | ||
${\bm{q}}_{i}\cdot {\bm{r}}_{rel}^{\prime}={\bm{a}}_{i}$ | (55) |
This is the standard TPR unbinding operation, used recursively: first with the unbinding vectors for positions, ${\bm{p}}_{i}^{\prime}$, then with the unbinding vector for the operator, ${\bm{r}}_{rel}^{\prime}$. It therefore suffices to analyze a single unbinding; the result can then be used recursively. This in effect reduces the problem to the order-2 case. What we will show is: given a set of unbinding vectors $\{{\bm{r}}_{i}^{\prime}\}$ which are dual to a set of role vectors $\{{\bm{r}}_{i}\}$, with $i$ ranging over some index set $I$, if $\bm{H}$ is an order-2 tensor such that
$\bm{H}\cdot {\bm{r}}_{i}^{\prime}={\bm{f}}_{i},\forall i\in I$ | (56) |
then
$\bm{H}={\displaystyle \sum _{i\in I}}{\bm{f}}_{i}{\bm{r}}_{i}^{\top}+\bm{Z}\equiv {\bm{H}}_{\mathrm{TPR}}+\bm{Z}$ | (57) |
for some tensor $\bm{Z}$ that annihilates all the unbinding vectors:
$\bm{Z}\cdot {\bm{r}}_{i}^{\prime}=\bm{0},\forall i\in I.$ | (58) |
If learning is successful, the processing in the decoder will generate the target relational tuple $(R,{A}_{1},{A}_{2})$ by obeying Eq. 54 in the first unbinding, where we have ${\bm{r}}_{i}^{\prime}={\bm{p}}_{i}^{\prime},{\bm{f}}_{i}={\bm{q}}_{i},I=\{1,2\}$, and obeying Eq. 55 in the second unbinding, where we have ${\bm{r}}_{i}^{\prime}={\bm{r}}_{rel}^{\prime},{\bm{f}}_{i}={\bm{a}}_{i}$, with $I=$ the set containing only the null index.
Treat order-2 tensors as matrices; then unbinding is simply matrix-vector multiplication. Assume the set of unbinding vectors is linearly independent (otherwise there would in general be no way to satisfy Eq. 56 exactly, contrary to assumption). Then expand the set of unbinding vectors, if necessary, into a basis ${\{{\bm{r}}_{k}^{\prime}\}}_{k\in K\supseteq I}$. Find the dual basis, with ${\bm{r}}_{k}$ dual to ${\bm{r}}_{k}^{\prime}$ (so that ${\bm{r}}_{l}^{\top}{\bm{r}}_{j}^{\prime}={\delta}_{lj}$). Because ${\{{\bm{r}}_{k}^{\prime}\}}_{k\in K}$ is a basis, so is ${\{{\bm{r}}_{k}\}}_{k\in K}$, so any matrix $\bm{H}$ can be expanded as $\bm{H}={\sum}_{k\in K}{\bm{v}}_{k}{\bm{r}}_{k}^{\top}$. Since $\bm{H}{\bm{r}}_{i}^{\prime}={\bm{f}}_{i},\forall i\in I$ are the unbinding conditions (Eq. 56), we must have ${\bm{v}}_{i}={\bm{f}}_{i},i\in I$. Let ${\bm{H}}_{\mathrm{TPR}}\equiv {\sum}_{i\in I}{\bm{f}}_{i}{\bm{r}}_{i}^{\top}$. This is the desired TPR, with fillers ${\bm{f}}_{i}$ bound to the role vectors ${\bm{r}}_{i}$ which are the duals of the unbinding vectors ${\bm{r}}_{i}^{\prime}$ ($i\in I$). Then we have $\bm{H}={\bm{H}}_{\mathrm{TPR}}+\bm{Z}$ (Eq. 57) where $\bm{Z}\equiv {\sum}_{j\in K,j\notin I}{\bm{v}}_{j}{\bm{r}}_{j}^{\top}$; so $\bm{Z}{\bm{r}}_{i}^{\prime}=\bm{0},i\in I$ (Eq. 58). Thus, if training is successful, the model must have learned how to feed the decoder with order-3 TPRs with the structure posited in Eq. 53.
The argument so far addresses the case where the unbinding vectors are linearly independent, making it possible to satisfy Eq. 56 exactly. In relatively high-dimensional vector spaces, it will often happen that even when the number of unbinding vectors exceeds the dimension of their space by a factor of 2 or 3 (which applies to the TP-N2F models presented here), there is a set of role vectors ${\{{\bm{r}}_{k}\}}_{k\in K}$ approximately dual to ${\{{\bm{r}}_{k}^{\prime}\}}_{k\in K}$, such that ${\bm{r}}_{l}^{\top}{\bm{r}}_{j}^{\prime}={\delta}_{lj}\forall l,j\in K$ holds to a good approximation. (If the distribution of normalized unbinding vectors is approximately uniform on the unit sphere, then choosing the approximate dual vectors to equal the unbinding vectors themselves will do, since they will be nearly orthonormal (anonymous2019). If the ${\{{\bm{r}}_{k}^{\prime}\}}_{k\in K}$ are not normalized, we just rescale the role vectors, choosing ${\bm{r}}_{k}={\bm{r}}_{k}^{\prime}/{\parallel {\bm{r}}_{k}^{\prime}\parallel}^{2}$.) When the number of such role vectors exceeds the dimension of the embedding space, they will be overcomplete, so while it is still true that any matrix $\bm{H}$ can be expanded as above ($\bm{H}={\sum}_{k\in K}{\bm{v}}_{k}{\bm{r}}_{k}^{\top}$), this expansion will no longer be unique. So while it remains true that $\bm{H}$ is a TPR, it is no longer uniquely decomposable into filler/role pairs. The claim above does not assert uniqueness in this sense, and remains true.
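The order-2 decomposition of Eqs. 56–58 can be verified numerically for the linearly independent case. In this sketch the unbinding vectors are the columns of a random (hence, with probability 1, invertible) matrix, and the dual role vectors are the rows of its inverse; all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3                        # role-space and filler-space dimensions
basis = rng.normal(size=(d, d))    # columns: unbinding vectors r'_k, k in K
duals = np.linalg.inv(basis)       # rows: role vectors r_k, with r_l^T r'_j = delta_lj
I = [0, 1]                         # indices actually used for unbinding

H = rng.normal(size=(m, d))        # an arbitrary order-2 tensor
fillers = {i: H @ basis[:, i] for i in I}              # Eq. 56

H_tpr = sum(np.outer(fillers[i], duals[i]) for i in I)
Z = H - H_tpr                                          # Eq. 57

# Eq. 58: Z annihilates every unbinding vector indexed by I.
residuals = [np.abs(Z @ basis[:, i]).max() for i in I]
```

The entries of `residuals` are zero up to floating-point error, while unbinding `H_tpr` with `basis[:, i]` returns `fillers[i]`, matching the claim that $\bm{H}$ and ${\bm{H}}_{\mathrm{TPR}}$ are indistinguishable to the unbinding operations.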
A.4 Dataset samples
A.4.1 Data sample from MathQA dataset
Problem: The present polulation of a town is 3888. Population increase rate is 20%. Find the population of town after 1 year?
Options: a) 2500, b) 2100, c) 3500, d) 3600, e) 2700
Operations: multiply(n0,n1), divide(#0,const-100), add(n0,#1)
A.4.2 Data sample from AlgoLisp dataset
Problem: Consider an array of numbers and a number, decrements each element in the given array by the given number, what is the given array?
Program Nested List: (map a (partial1 b -))
Command-Sequence: (partial1 b -), (map a #0)
A.5 Generated programs comparison
In this section, we display some generated samples from the two datasets, where the TP-N2F model generates correct programs but LSTM-Seq2Seq does not.
Question: A train running at the speed of 50 km per hour crosses a post in 4 seconds. What is the length of the train?
TP-N2F(correct):
(multiply,n0,const1000) (divide,#0,const3600) (multiply,n1,#1)
LSTM(wrong):
(multiply,n0,const0.2778) (multiply,n1,#0)
Question: 20 is subtracted from 60 percent of a number, the result is 88. Find the number?
TP-N2F(correct):
(add,n0,n2) (divide,n1,const100) (divide,#0,#1)
LSTM(wrong):
(add,n0,n2) (divide,n1,const100) (divide,#0,#1) (multiply,#2,n3) (subtract,#3,n0)
Question: The population of a village is 14300. It increases annually at the rate of 15 percent. What will be its population after 2 years?
TP-N2F(correct):
(divide,n1,const100) (add,#0,const1) (power,#1,n2) (multiply,n0,#2)
LSTM(wrong):
(multiply,const4,const100) (sqrt,#0)
Question: There are two groups of students in the sixth grade. There are 45 students in group a, and 55 students in group b. If, on a particular day, 20 percent of the students in group a forget their homework, and 40 percent of the students in group b forget their homework, then what percentage of the sixth graders forgot their homework?
TP-N2F(correct):
(add,n0,n1) (multiply,n0,n2) (multiply,n1,n3) (divide,#1,const100) (divide,#2,const100) (add,#3,#4) (divide,#5,#0) (multiply,#6,const100)
LSTM(wrong):
(multiply,n0,n1) (subtract,n0,n1) (divide,#0,#1)
Question: 1 divided by 0.05 is equal to
TP-N2F(correct):
(divide,n0,n1)
LSTM(wrong):
(divide,n0,n1) (multiply,n2,#0)
Question: Consider a number a, compute factorial of a
TP-N2F(correct):
( <=,arg1,1 ) ( -,arg1,1 ) ( self,#1 ) ( *,#2,arg1 ) ( if,#0,1,#3 ) ( lambda1,#4 ) ( invoke1,#5,a )
LSTM(wrong):
( <=,arg1,1 ) ( -,arg1,1 ) ( self,#1 ) ( *,#2,arg1 ) ( if,#0,1,#3 ) ( lambda1,#4 ) ( len,a ) ( invoke1,#5,#6 )
Question: Given an array of numbers and numbers b and c, add c to elements of the product of elements of the given array and b, what is the product of elements of the given array and b?
TP-N2F(correct):
( partial, b,* ) ( partial1,c,+ ) ( map,a,#0 ) ( map,#2,#1 )
LSTM(wrong):
( partial1,b,+ ) ( partial1,c,+ ) ( map,a,#0 ) ( map,#2,#1 )
Question: You are given an array of numbers a and numbers b , c and d , let how many times you can replace the median in a with sum of its digits before it becomes a single digit number and b be the coordinates of one end and c and d be the coordinates of another end of segment e , your task is to find the length of segment e rounded down
TP-N2F(correct):
( digits arg1 ) ( len #0 ) ( == #1 1 ) ( digits arg1 ) ( reduce #3 0 + ) ( self #4 ) ( + 1 #5 ) ( if #2 0 #6 ) ( lambda1 #7 ) ( sort a ) ( len a ) ( / #10 2 ) ( deref #9 #11 ) ( invoke1 #8 #12 ) ( - #13 c ) ( digits arg1 ) ( len #15 ) ( == #16 1 ) ( digits arg1 ) ( reduce #18 0 + ) ( self #19 ) ( + 1 #20 ) ( if #17 0 #21 ) ( lambda1 #22 ) ( sort a ) ( len a ) ( / #25 2 ) ( deref #24 #26 ) ( invoke1 #23 #27 ) ( - #28 c ) ( * #14 #29 ) ( - b d ) ( - b d ) ( * #31 #32 ) ( + #30 #33 ) ( sqrt #34 ) ( floor #35 )
LSTM(wrong):
( digits arg1 ) ( len #0 ) ( == #1 1 ) ( digits arg1 ) ( reduce #3 0 + ) ( self #4 ) ( + 1 #5 ) ( if #2 0 #6 ) ( lambda1 #7 ) ( sort a ) ( len a ) ( / #10 2 ) ( deref #9 #11 ) ( invoke1 #8 #12 c ) ( - #13 ) ( - b d ) ( - b d ) ( * #15 #16 ) ( * #14 #17 ) ( + #18 ) ( sqrt #19 ) ( floor #20 )
Question: Given numbers a , b , c and e , let d be c , reverse digits in d , let a and the number in the range from 1 to b inclusive that has the maximum value when its digits are reversed be the coordinates of one end and d and e be the coordinates of another end of segment f , find the length of segment f squared
TP-N2F(correct):
( digits c ) ( reverse #0 ) ( * arg1 10 ) ( + #2 arg2 ) ( lambda2 #3 ) ( reduce #1 0 #4 ) ( - a #5 ) ( digits c ) ( reverse #7 ) ( * arg1 10 ) ( + #9 arg2 ) ( lambda2 #10 ) ( reduce #8 0 #11 ) ( - a #12 ) ( * #6 #13 ) ( + b 1 ) ( range 0 #15 ) ( digits arg1 ) ( reverse #17 ) ( * arg1 10 ) ( + #19 arg2 ) ( lambda2 #20 ) ( reduce #18 0 #21 ) ( digits arg2 ) ( reverse #23 ) ( * arg1 10 ) ( + #25 arg2 ) ( lambda2 #26 ) ( reduce #24 0 #27 ) ( > #22 #28 ) ( if #29 arg1 arg2 ) ( lambda2 #30 ) ( reduce #16 0 #31 ) ( - #32 e ) ( + b 1 ) ( range 0 #34 ) ( digits arg1 ) ( reverse #36 ) ( * arg1 10 ) ( + #38 arg2 ) ( lambda2 #39 ) ( reduce #37 0 #40 ) ( digits arg2 ) ( reverse #42 ) ( * arg1 10 ) ( + #44 arg2 ) ( lambda2 #45 ) ( reduce #43 0 #46 ) ( > #41 #47 ) ( if #48 arg1 arg2 ) ( lambda2 #49 ) ( reduce #35 0 #50 ) ( - #51 e ) ( * #33 #52 ) ( + #14 #53 )
LSTM(wrong):
( - a d ) ( - a d ) ( * #0 #1 ) ( digits c ) ( reverse #3 ) ( * arg1 10 ) ( + #5 arg2 ) ( lambda2 #6 ) ( reduce #4 0 #7 ) ( - #8 e ) ( + b 1 ) ( range 0 #10 ) ( digits arg1 ) ( reverse #12 ) ( * arg1 10 ) ( + #14 arg2 ) ( lambda2 #15 ) ( reduce #13 0 #16 ) ( digits arg2 ) ( reverse #18 ) ( * arg1 10 ) ( + #20 arg2 ) ( lambda2 #21 ) ( reduce #19 0 #22 ) ( > #17 #23 ) ( if #24 arg1 arg2 ) ( lambda2 #25 ) ( reduce #11 0 #26 ) ( - #27 e ) ( * #9 #28 ) ( + #2 #29 )
A.6 Unbinding relation vector clustering
We run K-means clustering on both datasets with $k=3,4,5,6$ clusters and the results are displayed in Figure 4 and Figure 5. As described before, unbinding vectors for operators or functions with similar semantics tend to be closer to each other. For example, in the MathQA dataset, arithmetic operators such as add, subtract, multiply, divide are clustered together in the middle, and operators related to geometry such as square or volume are clustered together at the bottom left. In the AlgoLisp dataset, basic arithmetic functions are clustered in the middle, and string processing functions are clustered on the right.