Natural- to formal-language generation using Tensor Product Representations

  • 2019-10-05 22:57:04
  • Kezhen Chen, Qiuyuan Huang, Hamid Palangi, Paul Smolensky, Kenneth D. Forbus, Jianfeng Gao
  • 0

Abstract

Generating formal-language represented by relational tuples, such as Lispprograms or mathematical expressions, from a natural-language input is anextremely challenging task because it requires to explicitly capture discretesymbolic structural information from the input to generate the output. Moststate-of-the-art neural sequence models do not explicitly capture suchstructure information, and thus do not perform well on these tasks. In thispaper, we propose a new encoder-decoder model based on Tensor ProductRepresentations (TPRs) for Natural- to Formal-language generation, calledTP-N2F. The encoder of TP-N2F employs TPR 'binding' to encode natural-languagesymbolic structure in vector space and the decoder uses TPR 'unbinding' togenerate a sequence of relational tuples, each consisting of a relation (oroperation) and a number of arguments, in symbolic space. TP-N2F considerablyoutperforms LSTM-based Seq2Seq models, creating a new state of the art resultson two benchmarks: the MathQA dataset for math problem solving, and theAlgoList dataset for program synthesis. Ablation studies show that improvementsare mainly attributed to the use of TPRs in both the encoder and decoder toexplicitly capture relational structure information for symbolic reasoning.

 

Quick Read (beta)

Natural- to formal-language generation
using Tensor Product Representations

Kezhen Chen, Qiuyuan Huang, Hamid Palangi, Paul Smolensky§,
Kenneth D. Forbus, Jianfeng Gao
Northwestern University, Evanston, IL
Microsoft Research, Redmond, WA
§Johns Hopkins University, Baltimore, MD
[email protected]
{qihua,hpalangi,psmo,jfgao}@microsoft.com
[email protected]
[email protected]
Kezhen Chen contributed to this work as an intern at Microsoft Research, Redmond WA.
Abstract

Generating formal-language represented by relational tuples, such as Lisp programs or mathematical operations, from natural-language input is a challenging task because it requires explicitly capturing discrete symbolic structural information implicit in the input. Most state-of-the-art neural sequence models do not explicitly capture such structural information, limiting their performance on these tasks. In this paper we propose a new encoder-decoder model based on Tensor Product Representations (TPRs) for Natural- to Formal-language generation, called TP-N2F. The encoder of TP-N2F employs TPR ‘binding’ to encode natural-language symbolic structure in vector space and the decoder uses TPR ‘unbinding’ to generate, in symbolic space, a sequence of relational tuples, each consisting of a relation (or operation) and a number of arguments. On two benchmarks, TP-N2F considerably outperforms LSTM-based seq2seq models, creating new state-of-the-art results: the MathQA dataset for math problem solving, and the AlgoLisp dataset for program synthesis. Ablation studies show that improvements can be attributed to the use of TPRs in both the encoder and decoder to explicitly capture relational structure to support for reasoning.

Natural- to formal-language generation
using Tensor Product Representations

Kezhen Chenthanks: Kezhen Chen contributed to this work as an intern at Microsoft Research, Redmond WA., Qiuyuan Huang, Hamid Palangi, Paul Smolensky§,
Kenneth D. Forbus, Jianfeng Gao
Northwestern University, Evanston, IL
Microsoft Research, Redmond, WA
§Johns Hopkins University, Baltimore, MD
[email protected]
{qihua,hpalangi,psmo,jfgao}@microsoft.com
[email protected]
[email protected]

1 INTRODUCTION

When people perform explicit reasoning, they can typically describe the way to the conclusion step by step via relational descriptions. There is ample evidence that relational representations are important for human cognition (e.g., (Goldin-Meadow and Gentner, 2003; Forbus et al., 2017; Crouse et al., 2018; Chen and Forbus, 2018; Chen et al., 2019)). Although a rapidly growing number of researchers use deep learning to solve complex symbolic reasoning and language tasks (a recent review is (Gao et al., 2019)), most existing deep learning models, including sequence models such as LSTMs, do not explicitly capture human-like relational structure information.

In this paper we propose a novel neural architecture, TP-N2F, to solve natural- to formal-language generation tasks (N2F). In the tasks we study, math or programming problems are stated in natural-language, and answers are given as programs, each program being a sequence of operator/arguments tuples. These are our relational descriptions. TP-N2F encodes natural-language symbolic structure in vector space, and decodes it as relational structures, both types of structures being embedded as Tensor Product Representations (TPRs) (Smolensky, 1990). During encoding, this approach builds symbolic structures as vector space embeddings using TPR ‘binding’ (Palangi et al., 2018); during decoding, it extracts symbolic constituents from structure embedding vectors using TPR ‘unbinding’ (Huang et al., 2018; 2019).

As is typical of work using TPRs, it is useful to adopt three levels of description. At the highest symbolic level, we have the symbolic descriptions of the inputs and outputs: here, problems stated in natural-language and solutions stated in a formal-language of relational descriptions of program operations. At the lowest neural-network level, we have the activation vectors and connection weight matrices of the model performing the task. At the intermediate level, unique to TPR approaches, the input and output symbol structures are decomposed into structural roles, which are filled by content-bearing symbols called fillers: this is the role level spelled out in Section 2.

Our contributions in this work are as follows. (i) We propose a role-level analysis of N2F tasks. (ii) We present a new TP-N2F model which gives a neural-network-level implementation of a model solving the N2F task under our proposed role-level description (i). To our knowledge, this is the first model to be proposed which combines both binding and unbinding features of TPRs to achieve generation tasks in deep learning. (iii) State-of-the-art performance on two recently developed N2F tasks shows that the TP-N2F model has significant structure learning ability on tasks requiring symbolic reasoning through program synthesis.

2 Background: Review of Tensor-Product Representation

The TPR mechanism is a method to create a vector space embedding of complex symbolic structure. The type of a symbol structure is defined by a set of structural positions or roles, such as the left-child-of-root position in a tree, or the second-argument-of-R position of a particular relation R. In a particular instance of a structural type, each of these roles may be occupied by a particular filler, which can be an atomic symbol or a substructure (e.g., the left sub-tree of a binary tree can serve as the filler of the role left-child-of-root). For now, we assume the fillers to be atomic symbols.

The TPR embedding of a symbol structure is the sum of the embeddings of all its constituents, each constituent comprising a role together with its filler. The embedding of a constituent is constructed from the embedding of a role and the embedding of the filler of that role: these are joined together by the TPR ‘binding’ operation, the tensor (or generalized outer) product .

Suppose a symbolic type is defined by the roles {ri}, and suppose that in a particular instance of that type, 𝚂, role ri is bound by filler fi. Then the TPR embedding of 𝚂 is the order-2 tensor

𝑻=i𝒇i𝒓i=i𝒇i𝒓i (1)

where {𝒇i} are vector embeddings of the fillers and {𝒓i} are vector embeddings of the roles. As a simple example, consider the symbolic type string, and choose roles to be r1= first_element, r2= second_element, etc. Then in the specific string S = cba, the first role r1 is filled by c, and r2 and r3 by b and a, respectively. Then the TPR for S is 𝒄𝒓1+𝒃𝒓2+𝒂𝒓3, where 𝒂,𝒃,𝒄 are the vector embeddings of the symbols a, b, c, and 𝒓i is the vector embedding of role ri.

Define the matrix of all nR possible role vectors to be 𝑹dR×nR, with column i, [𝑹]:i=𝒓idR, comprising the embedding of ri. Similarly let 𝑭dF×nF be the matrix of all possible filler vectors. The TPR 𝑻dF×dR. dR,nR,dF,nF are hyperparameters, while 𝑹,𝑭 are learned parameter matrices.

Choosing the tensor product as the binding operation makes it possible to recover the filler of any role in a structure 𝚂 given the TPR 𝑻 of 𝚂. This can be done with perfect precision if the embeddings of the roles are linearly independent. In that case the role matrix 𝑹 is invertible: 𝑼=𝑹-1 exists such that 𝑼𝑹=𝑰. Now define the ‘unbinding’ vector for role rj, 𝒖j, to be the jth column of 𝑼: U:j. Then, since [𝑰]ji=[𝑼𝑹]ji=𝑼j:𝑹:i=[𝑼:j]𝑹:i=𝒖j𝒓i=𝒓i𝒖j, we have 𝒓i𝒖j=δji. This means that, to recover the filler of rj in the structure with TPR 𝑻, we can take its tensor inner product with 𝒖j (or equivalently, viewing 𝑻 as a matrix, take the matrix-vector product):

𝑻𝒖j=[i𝒇i𝒓i]𝒖j=i𝒇iδij=𝒇j (2)

In the architecture proposed here, we will make use of both TPR binding using the tensor product with role vectors 𝒓i and TPR unbinding using the tensor inner product with unbinding vectors 𝒖j.

3 TP-N2F Model

We propose a general TP-N2F neural-network architecture operating over TPRs to solve N2F tasks under a proposed role-level description of the task. In this description, natural-language input is represented as a straightforward order-2 role structure, and formal-language relational representations of outputs are represented with a new order-3 recursive role structure proposed here. Figure 1 shows an overview diagram of the TP-N2F model.

Figure 1: Overview diagram of TP-N2F.

As shown in Figure 1, while the natural-language input is a sequence of words, the output is a sequence of multi-argument relational tuples such as (RA1A2), a 3-tuple consisting of a binary relation (or operation) R with its two arguments. The model generates one complete tuple at a time. TP-N2F contains a “TP-N2F encoder”, which encodes the input natural-language sequence via TPR binding, and a “TP-N2F decoder”, which decodes relational tuples via TPR unbinding. In the following sections, we first introduce the details of our proposed role-level description for N2F tasks, and then present how our proposed TP-N2F model uses TPR binding and unbinding operations to create a neural-network implementation of this description of the N2F task.

3.1 Role-level description of N2F tasks

In this section, we propose a role-level description of N2F tasks, which specifies the structure of the input natural-language symbolic expressions and the output relational representations.

3.1.1 Role-level description for natural-language input

Instead of encoding each token of a sentence with a non-compositional embedding vector looked up in a learned dictionary, we use a learned role-filler decomposition to compose a tensor representation for each token from a role vector encoding the word’s structural role in the sentence and a filler vector encoding the word’s content. Given a sentence S with n word tokens {w0,w1,,wn-1}, each word token wt is assigned a learned role vector 𝒓t and a learned filler vector 𝒇t which, following the results of Palangi et al. (2018), we can hypothesize to approximately encode the grammatical role of the token and its lexical semantics, respectively. Then each word token wt is represented by the tensor product of the role vector and the filler vector: 𝑻t=𝒇t𝒓t. In addition to the set of all its token embeddings {𝑻0,,𝑻n-1}, the sentence S as a whole is assigned a TPR equal to the sum of the TPR embeddings of all its word tokens: 𝑻S=t=0n-1𝑻t.

Using TPRs to encode natural-language has several advantages. First, natural-language TPRs can be interpreted by exploring the distribution of tokens grouped by the role and symbol vectors they are assigned by a trained model (as in Palangi et al. (2018)). Second, TPRs avoid the Bag of Word (BoW) confusion (Huang et al., 2018): the BoW encoding of Jay saw Kay is the same as the BoW encoding of Kay saw Jay. However, in a TPR embedding using, say, grammatical roles, 𝒇Jay𝒓subject+𝒇saw𝒓verb+𝒇Kay𝒓object is different from 𝒇Kay𝒓subject+𝒇saw𝒓verb+𝒇Jay𝒓object, because the role filled by a symbol changes with its context.

3.1.2 Role-level description for relational representations

In this section, we propose a novel recursive role-level description for representing symbolic relational tuples. Each relational tuple contains a relation token and multiple argument tokens. Given a binary relation R, a relational tuple can be written as (RA1A2) where A1,A2 indicate two arguments of relation R. Let us adopt the two positional roles, piR= argi-of-R for i=1,2. The filler of role piR is Ai. Now let us use role decomposition recursively, noting that the role piR can itself be decomposed into a sub-role pi= argi-of-¯ which has a sub-filler R. Suppose that Ai,R,pi are embedded as vectors 𝒂i,𝒓,𝒑i. Then the TPR encoding of piR is 𝒓𝒑i, so the TPR encoding of filler Ai bound to role piR is 𝒂i(𝒓𝒑i). Since the tensor product is associative, we can write the TPR for the formal-language expression, the relational tuple (RA1A2), as:

𝑻F=𝒂1𝒓𝒑1+𝒂2𝒓𝒑2. (3)

Given the unbinding vectors 𝒑i for positional role vectors 𝒑i and the unbinding vector 𝒓 for the vector 𝒓 embedding relation R, each argument can be unbound in two steps as shown in (4)–(5).

𝑻F𝒑i=[𝒂1𝒓𝒑1+𝒂2𝒓𝒑2]𝒑i=𝒂i𝒓 (4)
[𝒂i𝒓]𝒓i=𝒂i (5)

Here denotes the tensor inner product, which for the order-3 𝑻F and order-1 𝒑i in (4) can be defined as [𝑻F𝒑i]jk=l[𝑻F]jkl[𝒑i]l; in (5), is equivalent to the matrix-vector product.

Our proposed scheme can be contrasted with the TPR scheme in which (RA1A2) is embedded as 𝒓𝒂1𝒂2 (e.g., Smolensky et al. (2016); Schlag and Schmidhuber (2018)). In that scheme, an n-ary-relation tuple is embedded as an order-(n+1) tensor, and unbinding an argument requires knowing all the other arguments (to use their unbinding vectors). In the scheme proposed here, an n-ary-relation tuple is still embedded as an order-3 tensor: there are just n terms in the sum in (3), using n position vectors 𝒑1,,𝒑n; unbinding simply requires knowing the unbinding vectors for these fixed position vectors.

3.1.3 The TP-N2F Scheme for Learning the input-output mapping

To generate formal relational tuples from natural-language descriptions, a learning strategy for the mapping between the two structures is particularly important. As shown in (6), we formalize the learning scheme as learning a mapping function fmapping(), which, given a structural representation of the natural-language input, 𝑻S, outputs a tensor 𝑻F from which the structural representation can be generated. In this paper, we use an MLP module to learn the TP-N2F mapping function. Other modules will be tested in future work.

𝑻F=fmapping(𝑻S) (6)

3.2 The TP-N2F Model for Natural- to Formal-Language Generation

As shown in Figure 1, the TP-N2F model is implemented with three steps: encoding, mapping, and decoding. The encoding step is implemented by the TP-N2F natural-language encoder (TP-N2F Encoder), which takes the sequence of word tokens as inputs, and encodes them via TPR binding according to the TP-N2F role scheme for natural-language input given in Sec. 3.1.1. The mapping step is implemented by an MLP called the Reasoning Module, which takes the encoding produced by the TP-N2F Encoder as input. The Reasoning Module learns to map the natural-language-structure encoding of the input to a representation that will be processed under the assumption that it follows the role scheme for output relational-tuples specified in Sec. 3.1.2: the model needs to learn to produce representations such that this processing generates correct output programs. The decoding step is implemented by the TP-N2F relational tuples decoder (TP-N2F Decoder), which takes the output from the Reasoning Module (Sec. 3.1.3) and decodes the target sequence of relational tuples via TPR unbinding. The TP-N2F Decoder utilizes an attention mechanism over the individual-word TPRs 𝑻t produced by the the TP-N2F Encoder. The implementation of the TP-N2F Encoder, the TP-N2F Decoder, and the Reasoning Module are introduced in turn.

3.2.1 The TP-N2F natural-language Encoder

The TP-N2F encoder follows the role scheme in Sec. 3.1.1 to encode each word token wt by selecting one of nF fillers and one of nR roles. The fillers and roles are embedded as vectors. These embedding vectors, and the functions for selecting fillers and roles, are learned by two LSTMs, the Filler-LSTM and the Role-LSTM. (See Figure 2.)

At each time-step t, the Filler-LSTM and the Role-LSTM take a learned word-token embedding 𝒘t as input. The hidden state of the Filler-LSTM, 𝒉Ft, is used to compute softmax scores ukF over nF filler slots, and a filler vector 𝒇t=𝑭𝒖F is computed from the softmax scores (recall 𝑭 is the learned matrix of filler vectors). Similarly, a role vector is computed from the hidden state of the Role-LSTM, 𝒉Rt. fF and fR denote the functions that generate 𝒇t and 𝒓t from the hidden states of the two LSTMs. The token wt is encoded as 𝑻t, the tensor product of 𝒇t and 𝒓t. 𝑻t replaces the hidden vector in each LSTM and is passed to the next time step, together with the LSTM cell-state vector 𝒄t: see (7)–(8). Detailed equations are provided in the Appendix. After encoding the whole sequence, the TP-N2F encoder outputs the sum of all tensor products t𝑻t to the next module.

𝒉Ft=fFiller-LSTM(𝒘t,𝑻t,𝒄Ft)    𝒉Rt=fRole-LSTM(𝒘t,𝑻t,𝒄Rt) (7)
𝑻t=𝒇t𝒓t=fF(𝒉Ft)fR(𝒉Rt) (8)
Figure 2: Implementation of TP-N2F encoder.

3.2.2 The TP-N2F Relational-Tuple Decoder

The TP-N2F Decoder is an RNN that takes the output from the reasoning MLP as its initial hidden state for generating a sequence of relational tuples (see Figure 3). This decoder contains an attentional LSTM called the Tuple-LSTM which feeds an unbinding module: attention operates on the context vector of the encoder, consisting of all individual encoder outputs {𝑻t}. Following Huang et al. (2018), the unbinding module is TPR-ready: it is designed so that if its input 𝑯 is a TPR of a certain form, then it will unbind it correctly to produce a relational tuple. 𝑯, which is produced by the Tuple-LSTM, is not constructed to explicitly be a TPR: rather, because the unbinding module operates on 𝑯 as if it were a TPR, the Tuple-LSTM tends to learn a way to make 𝑯 suitably approximate a TPR.

At each time step t, the hidden state 𝑯t of the Tuple-LSTM with attention (9) is fed as input to the unbinding module, which acts upon 𝑯t as if it were the TPR of a relational tuple with m arguments possessing the role structure described in Sec. 3.1.2: 𝑯ti=1m𝒂it𝒓t𝒑i. In this paper we assume an arity of m=2. (In Figure 3, the assumed hypothetical form of 𝑯t, as well as that of 𝑩it below, is shown in a bubble with dashed border.) Specifically, the unbinding module decodes a relational tuple from 𝑯t using the two steps of TPR unbinding given in (4)–(5). The positional unbinding vectors 𝒑i are learned during training and shared across all time steps. After the first unbinding step (4), i.e., inner product of 𝑯t with 𝒑i, we get tensors 𝑩it (10). These are treated as the TPRs of two arguments 𝒂it bound to a relation 𝒓t: 𝑩it𝒂it𝒓t. A relational unbinding vector 𝒓t is computed by a Relation Unbinding MLP from the sum of the 𝑩it (11): both these vectors, by hypothesis, contain information about 𝒓t. The inner product of each 𝑩it with the vector 𝒓t yields vectors 𝒂it which are treated as the embedding of relational arguments (11). Finally, symbolic values of the arguments and the relation in the generated relational tuple are predicted by classifiers over the vectors 𝒂it and 𝒓t. (More detailed equations are given in the Appendix.)

𝑯t=Atten(fTuple-LSTM(relt-1,arg1t-1,arg2t-1,𝑯t-1,ct-1)) (9)
𝑩1t=𝑯t𝒑1    𝑩2t=𝑯t𝒑2 (10)
𝒓t=flinear(𝑩1t+𝑩2t)    𝒂1t=𝑩1t𝒓t    𝒂2t=𝑩2t𝒓t (11)
Figure 3: Implementation of TP-N2F decoder.

3.3 The Learning Strategy of the TP-N2F Model

TP-N2F is trained using back-propagation (Rumelhart et al., 1986) with the Adam optimizer (Kingma and Ba, 2017) and teacher-forcing. At each time step, the ground-truth relational tuple is provided as the input for the next time step. As the TP-N2F decoder decodes a relational tuple at each time step, the relation token is selected only from the relation vocabulary and the argument tokens from the argument vocabulary. For an input that generates N output relational tuples, the loss is obtained by summing the cross entropy loss between the true labels L and predicted tokens for relations and arguments as shown in (12).

=i=0N(reli,Lreli)+i=0Nj=12(argji,Largji) (12)

4 EXPERIMENTS

The proposed TP-N2F model is evaluated on two N2F tasks, generating operation sequences to solve math problems and generating Lisp programs. In both tasks, TP-N2F achieves state-of-the-art performance. We further analyze the behavior of the unbinding relation vectors in the proposed model. Results of each task and analysis on unbinding relation vectors are introduced in turn.

4.1 Generating operation sequences to solve math problems

Given a natural-language math problem, we need to generate a sequence of operations (operators and corresponding arguments) from a set of operators and arguments to solve the given problem. Each operation is regarded as a relational tuple by viewing the operator as relation, e.g., (add,n1,n2). We test TP-N2F for this task on the MathQA dataset (Amini et al., 2019).

4.1.1 MathQA dataset

The MathQA dataset consists of about 37k math word problems ((80/12/8)% training/dev/testing problems), each with a corresponding list of multi-choice options and an straight-line operation sequence program to solve the problem. An example from the dataset is presented in the Appendix.

In this task, TP-N2F is deployed to generate the operation sequence given the question. The generated operations are executed to generate the solution for the given math problem. We use the execution script from Amini et al. (2019) to execute the generated operation sequence and compute the multi-choice accuracy for each problem. During our experiments we observed that there are about 30% noisy examples (on which the execution script fails to get the correct answer on the ground truth program). Therefore, we report both execution accuracy (the final multi-choice answer after running the execution engine) and operation sequence accuracy (where the generated operation sequence must match the ground truth sequence exactly). Details about preparing the data and hyperparameters of the TP-N2F model are described in the Appendix.

4.1.2 Results and discussion

TP-N2F is compared to a baseline provided by the seq2prog model in Amini et al. (2019), an LSTM-based seq2seq model with attention. Our model outperforms both the original seq2prog, designated SEQ2PROG-orig, and the best reimplemented seq2prog after an extensive hyperparameter search, designated SEQ2PROG-best. Table 1 presents the results. To verify the importance of the TP-N2F encoder and decoder, we conducted experiments to replace either the encoder with a standard LSTM (denoted LSTM2TP) or the decoder with a standard attentional LSTM (denoted TP2LSTM). We observe that both the TPR components of TP-N2F are important for achieving the observed performance gain relative to the baseline.

Table 1: Results on MathQA dataset testing set
MODEL Operation Accuracy(%) Execution Accuracy(%)
SEQ2PROG-orig 59.4 51.9
SEQ2PROG-best 66.97 54.0
TP2LSTM (ours) 68.84 54.61
LSTM2TP (ours) 68.21 54.61
TP-N2F (ours) 71.89 55.95

4.2 Generating program trees from natural-language descriptions

Generating Lisp programs requires sensitivity to structural information because Lisp code can be regarded as tree-structured. Given a natural-language query, we need to generate code containing function calls with parameters. Each function call is a relational tuple, which has a function as the relation and parameters as arguments. We evaluate our model on the AlgoLisp dataset for this task and achieve state-of-the-art performance.

4.2.1 AlgoLisp dataset

The AlgoLisp dataset (Polosukhin and Skidanov, 2018) is a program synthesis dataset, which has 79k/9k/10k training/dev/testing samples. Each sample contains a problem description, a corresponding Lisp program tree, and 10 input-output testing pairs. We parse the program tree into a straight-line sequence of commands from leaves to root and (as in MathQA) use the symbol #i to indicate the result of the ith command (generated previously by the model). A dataset sample with our parsed command sequence is presented in the Appendix.

AlgoLisp provides an execution script to run the generated program and has three evaluation metrics: accuracy of passing all test cases (Acc), accuracy of passing 50% of test cases (50p-Acc), and accuracy of generating an exactly matched program (M-Acc). AlgoLisp has about 10% noise data (where the execution script fails to pass all test cases on the ground truth program), so we report results both on the full test set and the cleaned test set (in which all noisy testing samples are removed). Details about hyperparameters of the TP-N2F model are reported in the Appendix.

4.2.2 Results and discussion

TP-N2F is compared with an LSTM seq2seq with attention model and a seq2seq model with a pre-trained tree decoder from the Tree2Tree autoencoder (SAPS) reported in Bednarek et al. (2019). As shown in Table 2, TP-N2F outperforms all existing models on both the full test set and the cleaned test set. Ablation experiments with TP2LSTM and LSTM2TP show that, for this task, the TP-N2F Decoder is more helpful than TP-N2F Encoder.

Table 2: Results of AlgoLisp dataset
Full Testing Set Cleaned Testing Set
MODEL (%) Acc 50p-Acc M-Acc Acc 50p-Acc M-Acc
LSTM2LSTM+atten 67.54 70.89 75.12 76.83 78.86 75.42
TP2LSTM (ours) 72.28 77.62 79.92 77.67 80.51 76.75
LSTM2TPR (ours) 75.31 79.26 83.05 84.44 86.13 83.43
SAPSpre-VH-Att-256 83.80 87.45 92.98 94.15
TP-N2F (ours) 84.02 88.01 93.06 93.48 94.64 92.78

4.3 Interpretation of learned structure

To interpret the structure learned by the model, we extract the trained unbinding relation vectors from the TP-N2F Decoder and reduce the dimension of vectors via Principal Component Analysis. K-means clustering results on the average vectors for each relation are presented in Figure 4. Results show that unbinding vectors for operators or functions with similar semantics tend to be close to each other. For example, with 5 clusters in the MathQA dataset, arithmetic operators such as add, subtract, multiply, divide are clustered together, and operators related to square or volume of geometry are clustered together. With 4 clusters in the AlgoLisp dataset, partial/lambda functions and sort functions are in one cluster, and string processing functions are clustered together. Note that there is no direct supervision to inform the model about the nature of the operations, and the TP-N2F decoder has induced this role structure using weak supervision signals from natural-language-question/operation-sequence-answer pairs. More clustering results are presented in the Appendix.

Figure 4: K-means clustering results: MathQA with 5 clusters and AlgoLisp with 4 clusters

5 Related work

TPR is a promising technique for encoding symbolic structural information and modeling symbolic reasoning in vector space. TPR binding has been used for encoding and exploring grammatical structural information of natural-language (Palangi et al., 2018; Huang et al., 2019). TPRs have also been used to unbind natural-language captions from images (Huang et al., 2018). Some researchers use TPRs for modeling deductive reasoning processes both on a rule-based model and deep learning models in vector space (Lee et al., 2016; Smolensky et al., 2016; Schlag and Schmidhuber, 2018). However, none of these previous models takes advantage of combining TPR binding and TPR unbinding in deep learning models to learn structure representation mappings explicitly, as done in our model. Many researchers have explored generating formal-language expressions from natural-language descriptions. For example, operation sequence generation from math problems has been formalized as a machine translation task on a word-token level and a modified LSTM model has been deployed on this task (Amini et al., 2019). Program generation has been regarded as decoding tree structure from natural-language and a tree decoder has been utilized to generate program trees (Polosukhin and Skidanov, 2018; Bednarek et al., 2019). Although researchers are paying increasing attention to N2F tasks, most of the proposed models either do not encode structural information explicitly or are applicable for specific tasks. Our proposed TP-N2F neural model can be applied to multiple tasks.

6 CONCLUSION AND FUTURE WORK

In this paper we propose a new scheme for neural-symbolic relational representations and a new architecture, TP-N2F, for formal-language generation from natural-language descriptions. To our knowledge, TP-N2F is the first model that combines TPR binding and TPR unbinding in the encoder-decoder fashion. TP-N2F achieves the state-of-the-art on two instances of N2F tasks, showing significant structure learning ability. The results show that both the TP-N2F encoder and the TP-N2F decoder are important for improving natural- to formal-language generation. We believe that the interpretation and symbolic structure encoding of TPRs are a promising direction for future work. We also plan to combine large-scale deep learning models such as BERT with TP-N2F to take advantage of structure learning for other generation tasks.

References

  • A. Amini, S. Gabriel, P. Lin, R. K. Kedziorski, Y. Choi, and H. Hajishirzi (2019) MathQA: towards interpretable math word problem solving with operation-based formalisms. In NACCL, Cited by: §4.1.1, §4.1.2, §4.1, §5.
  • J. Bednarek, K. Piaskowski, and K. Krawiec (2019) Ain’t nobody got time for coding: structure-aware program synthesis from natural language. In arXiv.org, Cited by: §4.2.2, §5.
  • K. Chen and K. D. Forbus (2018) Action recognition from skeleton data via analogical generalization over qualitative representations. In Thirty-Second AAAI Conference, Cited by: §1.
  • K. Chen, I. Rabkina, M. D. McLure, and K. D. Forbus (2019) Human-like sketch object recognition via analogical learning. In Thirty-Third AAAI Conference, Vol. 33, pp. 1336–1343. Cited by: §1.
  • M. Crouse, C. McFate, and K. D. Forbus (2018) Learning from unannotated qa pairs to analogically disanbiguate and answer questions. In Thirty-Second AAAI Conference, Cited by: §1.
  • Kenneth.D. Forbus, C. Liang, and I. Rabkina (2017) Representation and computation in cognitive models. In Top Cognitive System, Cited by: §1.
  • J. Gao, M. Galley, and L. Li (2019) Neural approaches to conversational ai. Foundations and Trends® in Information Retrieval 13 (2-3), pp. 127–298. Cited by: §1.
  • S. Goldin-Meadow and D. Gentner (2003) Language in mind: advances in the study of language and thought. MIT Press. Cited by: §1.
  • Q. Huang, L. Deng, D. Wu, c. Liu, and X. He (2019) Attentive tensor product learning. In Thirty-Third AAAI Conference, Vol. 33. Cited by: §1, §5.
  • Q. Huang, P. Smolensky, X. He, O. Wu, and L. Deng (2018) Tensor product generation networks for deep nlp modeling. In NAACL, Cited by: §1, §3.1.1, §3.2.2, §5.
  • D. P. Kingma and J. Ba (2017) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
  • M. Lee, X. He, W. Yih, J. Gao, L. Deng, and P. Smolensky (2016) Reasoning in vector space: an exploratory study of question answering. In ICLR, Cited by: §5.
  • H. Palangi, P. Smolensky, X. He, and L. Deng (2018) Question-answering with grammatically-interpretable representations. In AAAI, Cited by: §1, §3.1.1, §3.1.1, §5.
  • I. Polosukhin and A. Skidanov (2018) Neural program search: solving programming tasks from description and examples. In ICLR workshop, Cited by: §4.2.1, §5.
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning internal representations by error propagation. In Parallel distributed processing: Explorations in the microstructure of cognition, D. E. Rumelhart, J. L. McClelland, and the PDP Group (Eds.), Vol. 1, pp. 318–362. Cited by: §3.3.
  • I. Schlag and J. Schmidhuber (2018) Learning to reason with third order tensor products. In Neural Information Processing Systems, Cited by: §3.1.2, §5.
  • P. Smolensky, M. Lee, X. He, W. Yih, J. Gao, and L. Deng (2016) Basic reasoning with tensor product representations. arXiv preprint arXiv:1601.02745. Cited by: §3.1.2, §5.
  • P. Smolensky (1990) Tensor product variable binding and the representation of symbolic structures in connectionist networks. In Artificial Intelligence, Vol. 46, pp. 159–216. Cited by: §1.

Appendix A Appendix

A.1 Implementations of TP-N2F for experiments

In this section, we introduce the detailed implementation of TP-N2F on the two datasets. We use dR,nR,dF,nF to indicate the TP-N2F encoder hyperparameters, the dimension of role vectors, the number of roles, the dimension of filler vectors and the number of fillers. dRel,dArg,dPos indicate the TP-N2F decoder hyper-parameters, the dimension of relation vectors, the dimension of argument vectors, and the dimension of position vectors.

In the experiment on the MathQA dataset, we use nF=150, nR=50, dF=30, dR=20, dRel=10, dArg=20, dPos=5 and we train the model for 60 epochs with learning rate 0.00115. As most of the math operators in this dataset are binary, we replace all operators taking three arguments with a set of binary operators based on hand-encoded rules, and for all operators taking one argument, a padding symbol is appended.

In the experiment on the AlgoLisp dataset, we use nF=150, nR=50, dF=30, dR=30, dRel=30, dArg=20, dPos=5 and we train the model for 50 epochs with learning rate 0.00115. For this dataset, most function calls take three arguments so we simply add padding symbols for those functions with fewer than three arguments.

A.2 Detailed formulas of TP-N2F

A.2.1 TP-N2F encoder

Filler-LSTM in TP-N2F encoder (In the LSTM, we flatten the 𝑻 to multiply 𝑽)

𝒇ft=φ(𝑼ff𝒘t+𝑽ff𝑻t-1+𝒃ff) (13)
𝒈ft=tanh(𝑼fg𝒘t+𝑽fg𝑻t-1+𝒃fg) (14)
𝒊ft=φ(𝑼fi𝒘t+𝑽fi𝑻t-1+𝒃fi) (15)
𝒐ft=φ(𝑼fo𝒘t+𝑽fo𝑻t-1+𝒃fo) (16)
𝒄ft=𝒇ft*𝒄ft-1+𝒊ft*𝒈ft (17)
𝒉ft=𝒐ft*tanh(𝒄ft) (18)

Filler vector

𝒂ft=fsm(flinear(𝒉ft)/temp) (19)
𝒇t=flinear(𝒂ft) (20)

Role-LSTM in TP-N2F encoder (In the LSTM, we flatten the 𝑻 to multiply 𝑽)

𝒇rt=φ(𝑼rf𝒘t+𝑽rf𝑻t-1+𝒃rf) (21)
𝒈rt=tanh(𝑼rg𝒘t+𝑽rg𝑻t-1+𝒃rg) (22)
𝒊rt=φ(𝑼ri𝒘t+𝑽ri𝑻t-1+𝒃ri) (23)
𝒐rt=φ(𝑼ro𝒘t+𝑽ro𝑻t-1+𝒃rO) (24)
𝒄rt=𝒇rt*𝒄rt-1+𝒊rt*𝒈rt (25)
𝒉rt=𝒐rt*tanh(𝒄rt) (26)

Role vector

𝒂rt=fsm(flinear(𝒉rt)/temp) (27)
𝒓t=flinear(𝒂rt) (28)

A.2.2 TP-N2F decoder

Tuple-LSTM (In the LSTM, we flatten the 𝑯 to multiply 𝑽)

𝒇t=φ(𝑼f𝒘t+𝑽f𝑯t-1+𝒃f) (29)
𝒈t=tanh(𝑼G𝒘t+𝑽G𝑯t-1+𝒃G) (30)
𝒊t=φ(𝑼I𝒘t+𝑽I𝑯t-1+𝒃I) (31)
𝒐t=φ(𝑼O𝒘t+𝑽O𝑯t-1+𝒃O) (32)
𝒄t=𝒇t*𝒄t-1+𝒊t*𝒈t (33)
𝒉inputt=𝒐t*tanh(𝒄t) (34)
𝒂t=fsm(fscore(Context,flinear(𝒉inputt))) (35)
𝒄weightedt=i=0n(𝒂it𝒄i),𝒂it𝒂t,𝒄iContext (36)
𝑯t=tanh(flinear(concat(𝒄weightedt,𝒉inputt))) (37)

Unbinding

𝑩1t=𝒂1t𝒓t=𝑯t𝒑1 (38)
𝑩2t=𝒂2t𝒓t=𝑯t𝒑2 (39)
𝒓dualt=flinear(B1t+B2t) (40)
𝒂1t=𝑩1t𝒓t (41)
𝒂2t=𝑩2t𝒓t (42)
Relt=Classifierrel(𝒓t) (43)
Arg1t=Classifierarg(𝒂1t) (44)
Arg2t=Classifierarg(𝒂2t) (45)

A.3 Dataset samples

A.3.1 Data sample from MathQA dataset

Problem: The present polulation of a town is 3888. Population increase rate is 20%. Find the population of town after 1 year?
Options: a) 2500, b) 2100, c) 3500, d) 3600, e) 2700
Operations: multiply(n0,n1), divide(#0,const-100), add(n0,#1)

A.3.2 Data sample from AlgoLisp dataset

Problem: Consider an array of numbers and a number, decrements each element in the given array by the given number, what is the given array?
Program Nested List: (map a (partial1 b –))
Command-Sequence: (partial1 b –), (map a #0)

A.4 Generated programs comparison

In this section, we display some generated samples from the two datasets, where the TP-N2F model generates correct programs but LSTM-Seq2Seq does not.

Question: A train running at the speed of 50 km per hour crosses a post in 4 seconds. What is the length of the train?
TP-N2F(correct):
(multiply,n0,const1000) (divide,#0,const3600) (multiply,n1,#1)
LSTM(wrong):
(multiply,n0,const0.2778) (multiply,n1,#0)

Question: 20 is subtracted from 60 percent of a number, the result is 88. Find the number?
TP-N2F(correct):
(add,n0,n2) (divide,n1,const100) (divide,#0,#1)
LSTM(wrong):
(add,n0,n2) (divide,n1,const100) (divide,#0,#1) (multiply,#2,n3) (subtract,#3,n0)

Question: The population of a village is 14300. It increases annually at the rate of 15 percent. What will be its population after 2 years?
TP-N2F(correct):
(divide,n1,const100) (add,#0,const1) (power,#1,n2) (multiply,n0,#2)
LSTM(wrong):
(multiply,const4,const100) (sqrt,#0)

Question: There are two groups of students in the sixth grade. There are 45 students in group a, and 55 students in group b. If, on a particular day, 20 percent of the students in group a forget their homework, and 40 percent of the students in group b forget their homework, then what percentage of the sixth graders forgot their homework?
TP-N2F(correct):
(add,n0,n1) (multiply,n0,n2) (multiply,n1,n3) (divide,#1,const100) (divide,#2,const100) (add,#3,#4) (divide,#5,#0) (multiply,#6,const100)
LSTM(wrong):
(multiply,n0,n1) (subtract,n0,n1) (divide,#0,#1)

Question: 1 divided by 0.05 is equal to
TP-N2F(correct):
(divide,n0,n1)
LSTM(wrong):
(divide,n0,n1) (multiply,n2,#0)

Question: Consider a number a, compute factorial of a
TP-N2F(correct):
(,arg1,1) (-,arg1,1) (self,#1) (*,#2,arg1) (if,#0,1,#3) (lambda1,#4) (invoke1,#5,a)
LSTM(wrong):
(,arg1,1) (-,arg1,1) (self,#1) (*,#2,arg1) (if,#0,1,#3) (lambda1,#4) (len,a) (invoke1,#5,#6)

Question: Given an array of numbers and numbers b and c, add c to elements of the product of elements of the given array and b, what is the product of elements of the given array and b?
TP-N2F(correct):
(partial,b,*) (partial1,c,+) (map,a,#0) (map,#2,#1)
LSTM(wrong):
(partial1,b,+) (partial1,c,+) (map,a,#0) (map,#2,#1)

A.5 Unbinding relation vector clustering

We run K-means clustering on both datasets with k=3,4,5,6 clusters and the results are displayed in Figure 5 and Figure 6. As described before, unbinding-vectors for operators or functions with similar semantics tend to be closer to each other. For example, in the MathQA dataset, arithmetic operators such as add, subtract, multiply, divide are clustered together at middle, and operators related to geometry such as square or volume are clustered together at bottom left. In AlgoLisp dataset, basic arithmetic functions are clustered at middle, and string processing functions are clustered at right.

Figure 5: MathQA clustering results
Figure 6: AlgoLisp clustering results