Positional Attention-based Frame Identification with BERT: A Deep Learning Approach to Target Disambiguation and Semantic Frame Selection

  • 2019-10-31 15:51:04
  • Sang-Sang Tan, Jin-Cheon Na


Semantic parsing is the task of transforming sentences from natural language into formal representations of predicate-argument structures. Under this research area, frame-semantic parsing has attracted much interest. This parsing approach leverages the lexical information defined in FrameNet to associate marked predicates or targets with semantic frames, thereby assigning semantic roles to sentence components based on pre-specified frame elements in FrameNet. In this paper, a deep neural network architecture known as Positional Attention-based Frame Identification with BERT (PAFIBERT) is presented as a solution to the frame identification subtask in frame-semantic parsing. Although the importance of this subtask is well-established, prior research has yet to find a robust solution that works satisfactorily for both in-domain and out-of-domain data. This study thus set out to improve frame identification in light of recent advancements of language modeling and transfer learning in natural language processing. The proposed method is partially empowered by BERT, a pre-trained language model that excels at capturing contextual information in texts. By combining the language representation power of BERT with a position-based attention mechanism, PAFIBERT is able to attend to target-specific contexts in sentences for disambiguating targets and associating them with the most suitable semantic frames. Under various experimental settings, PAFIBERT outperformed existing solutions by a significant margin, achieving new state-of-the-art results for both in-domain and out-of-domain benchmark test sets.



Sang-Sang Tan (corresponding author)
Wee Kim Wee School of Communication and Information
Nanyang Technological University
31 Nanyang Link, Singapore 637718
[email protected]

Jin-Cheon Na
Wee Kim Wee School of Communication and Information
Nanyang Technological University
31 Nanyang Link, Singapore 637718
[email protected]


Keywords Frame Identification   Semantic Parsing   FrameNet   BERT   Attention   Word Sense Disambiguation

1 Introduction

Semantic parsing is essential for many knowledge-based applications that require semantic reasoning and interpretation of natural language. Such applications include question answering [1, 2], information extraction [3, 4], and knowledge graph construction [5, 6], among many others. Given a sentence, the general goal of a semantic parser is to (1) identify predicates in the sentence, (2) detect sentence components that form the semantic arguments related to the predicates, and (3) determine the roles (e.g., Agent, Patient, Temporal, Manner, Cause, and so forth) of the semantic arguments. Through the detection of predicates and their semantic arguments, semantic parsing allows sentences to be computationally interpreted to answer the question “Who did What to Whom, and How, When and Where?” [7, p.1].

Recent years have seen increasingly rapid advances in semantic parsing due to the continuing efforts devoted to creating and maintaining semantically-annotated lexical resources [8, 9]. These resources come with manually-labeled text corpora, which are often used as development data to build statistical models for automated semantic parsing. In this study, we focus on the frame identification subtask in frame-semantic parsing. This parsing approach makes use of the lexical information defined in FrameNet [8] to generate semantic representations from natural language. As a knowledge base stemming from the theory of Frame Semantics [10], FrameNet is centered on the idea that semantic frames are the conceptual categories and contexts that form the basis for meaning representation. Figure 1 shows an example of frame-semantic parsing and a simplified schematic representation of the [Desirability] frame retrieved from FrameNet 1.7. The majority of work in frame-semantic parsing has concentrated on the following subtasks:

Figure 1: An example of frame-semantic parsing (retrieved from FrameNet 1.7). The parsing process involves two steps: first, the frame identification step identifies [Desirability] as the most likely frame for ‘excellent’; second, the semantic role assignment step determines the roles (i.e., Circumstances and Evaluee) for the sentence components based on the identified frame.
  • Frame identification. Given a sentence, this task identifies the most likely semantic frame for a marked lexical predicate (also known as target), which can be any word or phrase in the sentence. This step makes use of the association relations between the frames and lexical units (LUs) defined in FrameNet.

  • Semantic role assignment. Given the target and its identified frame, the second subtask aims to associate sentence components with semantic roles, which are specified as frame elements in FrameNet.

Frame identification plays a vital role in frame-semantic parsing because recognizing frames correctly is essential for the subsequent step of semantic role assignment. Specifically, the frame identification step reduces the search scope for the most suitable semantic roles from hundreds of roles in FrameNet to at most 30 potential roles per frame [11]. The study conducted by [11] showed that frame identification errors could cause drastic degradation in the accuracy of role assignment.

Motivated by the need for a high-quality and robust frame identification solution, this study presents a new method called Positional Attention-based Frame Identification with BERT (PAFIBERT). The proposed method employs a deep learning neural network architecture that combines a position-based attention mechanism and a state-of-the-art pre-trained language model known as Bidirectional Encoder Representations from Transformers (BERT) [12]. As will be discussed in more detail in section 3, the position-based attention mechanism is essential for the frame identification problem, in which multiple occurrences of the same target (i.e., the same word or phrase) in a sentence may refer to different frames. Incorporating this attention mechanism thus allows PAFIBERT to distinguish different instances of the target based on their positions and select the most suitable frame for each instance by attending to its surrounding context.

This work is inspired by the recent promising achievements of pre-trained language models in a wide range of natural language processing (NLP) tasks [12, 13, 14]. These language models are pre-trained via unsupervised learning, for which an enormous amount of textual data can be obtained easily. The pre-training stage allows the models to capture basic linguistic information such as word contexts and inter-sentence relations [12]. During pre-training, the language models learn weights that can be fine-tuned later for downstream NLP tasks, thus eliminating the need to learn the weights from scratch in those tasks, for which labeled data are usually required but hard to come by.

The frame identification task can be regarded as a conceptualization process that maps a target to a more abstract concept (i.e., a semantic frame). Because of homonymy and polysemy, an ambiguous target may refer to multiple frames. For instance, ‘beat’ may be associated with either the [Cause_Harm] frame or the [Beat_Opponent] frame, depending on the context in which it is used. In this regard, frame identification is analogous to a word sense disambiguation task that disambiguates a target based on its sentential context. Therefore, frame identification may benefit from the deep, bidirectional architecture of BERT, which is designed to capture contextual information in texts from both left-to-right and right-to-left directions. The contributions of this study are as follows:

  • This study presents a novel neural network architecture called PAFIBERT that combines a position-based attention mechanism with BERT to solve the frame identification problem. By providing a solution for frame identification, this study makes a solid contribution to the field of semantic parsing and its potential application areas.

  • PAFIBERT produced new state-of-the-art results on two benchmark datasets: the in-domain Das’s test set [15] and the out-of-domain YAGS test set [11]. The former consists of data held out from FrameNet’s text corpora, whereas the latter contains data collected from other sources. To date, most studies have only reported test results for the former. However, as demonstrated by [11], there exists a disparity between the results produced from the in-domain and out-of-domain evaluations. Their findings have suggested that the out-of-domain evaluation is crucial in frame identification as its results serve as an indicator of model generalizability in real-world applications. The present study not only included the YAGS test set but also examined PAFIBERT qualitatively on other out-of-domain sample sentences that contain interesting cases of unseen targets. The results obtained in these quantitative and qualitative out-of-domain evaluations demonstrated the robustness and generalizability of our method.

  • In addition to obtaining promising results in frame identification, the proposed neural network design also provides insights into the adaptation of BERT [12], not only for this specific problem, but also for other similar NLP problems that rely on sentential contexts to make target-specific predictions. Examples of such problems include word sense disambiguation, aspect-based sentiment analysis, and named entity recognition, among others. The findings and achievements in this study provide convincing empirical proof that adds to the rapidly expanding body of research on language modeling and transfer learning in NLP [12, 13, 14].

2 Related Work

2.1 Frame Identification

Frame-semantic parsing began to attract more interest after the introduction of the SemEval 2007 shared task on frame-semantic structure extraction [16]. The best system in the shared task was presented by [17]. For the frame identification subtask, the researchers employed Support Vector Machines (SVMs) to learn a set of classifiers using the following features: lemmatized targets, word forms of the targets, features generated from dependency parsing such as dependents and parents of the targets, and so forth. This system was then outperformed by an earlier version of SEMAFOR [18], a frame-semantic parser that used a conditional log-linear model for frame identification. A key challenge in frame identification is dealing with unseen targets, i.e., the targets that do not exist in the FrameNet lexicon and the hand-labeled text corpora released along with FrameNet. While the best system in the SemEval 2007 shared task tried to handle unseen targets by expanding the LU coverage of FrameNet using WordNet synsets [17], the approach proposed by [18] used a latent variable to incorporate lexical-semantic relations into their model. These lexical-semantic features allowed the model to make predictions based on the relations between unseen targets and FrameNet’s LUs. Further improvement was also suggested by [15] to address the problem of unseen targets. They adopted a semi-supervised learning approach to build a graph from FrameNet’s text corpora and a large amount of unlabeled data. The graph vertices corresponded to targets, whereas the edges represented frame similarity and distributional similarity. For the known targets (a subset of the graph vertices), their weights were initialized based on the frame distribution in FrameNet’s text corpora. Through a label propagation process, the frames (i.e., labels) were transferred from the known targets to the unlabeled vertices of the graph. The constructed graph was then used to suggest candidate frames for unseen targets.

Earlier work on frame identification mainly focused on feature-based approaches, which relied on hand-engineered features. Since the popularization of distributed representations [19], more recent work in frame identification has gradually shifted toward the use of word embeddings or low-dimensional vector representations. The word embedding approach not only reduces the need for expensive feature engineering of semantic structures but also leads to improved accuracy due to better generalization [20]. One of the earliest studies that leveraged distributed representations for frame identification was carried out by [21]. Their technique extracted the syntactic contexts of targets from data and generated an initial high-dimensional vector space using word embeddings. The WSABIE algorithm [22] was then used to find a transformation function that mapped the initial high-dimensional vectors to low-dimensional representations. Embeddings were also learned for frame labels in the low-dimensional space such that for each target, the proximity between the target and its matching frame (i.e., gold frame) exceeded that between the target and the competing frames. At inference time, after a target and its context were projected using the learned function to the low-dimensional space, the nearest frame label was chosen as the prediction. The accuracy achieved by this technique was highly promising, surpassing the results reported in previous studies by a large margin. However, for the generation of the initial high-dimensional vectors, this technique still involved a great deal of feature engineering to extract features via syntactic parsing. In the same vein, a study conducted by [11] experimented with pre-trained word embeddings to represent targets and their contexts as inputs for learning two classification models: a two-layer neural network and a WSABIE model as proposed by [21]. Although their models did not outperform that of [21] for Das’s benchmark test set [15], the results they obtained on out-of-domain data seemed to suggest that their models generalized better in other domains.

Following the success of deep learning in many NLP tasks, researchers in frame-semantic parsing have also explored various types of neural networks for frame identification. The method proposed by [23] combined pre-trained word embeddings, learned word embeddings, and part-of-speech (POS) embeddings for frame identification using a long short-term memory (LSTM) neural network and achieved an accuracy similar to that of [11]. Another study [24] examined the effectiveness of a multi-layer neural network for frame identification and found that the overall model performance was competitive, and their method outperformed other models in predicting frames for ambiguous LUs.

In a recent study, [25] proposed a frame identification solution that employed a joint decoding approach. This approach used disjoint data to improve model performance in multi-task scenarios by allowing information sharing across tasks to benefit all the tasks. For the frame-semantic parsing problem, which consists of the frame identification and semantic role assignment subtasks, the joint decoding approach could learn model parameters jointly for both tasks by optimizing a multi-task objective. Their solution showed significant improvements over previous work, producing state-of-the-art accuracies of 90% and 89.1% for frame identification on FrameNet 1.5 and FrameNet 1.7 respectively.

Although extensive research has been carried out in this area, as pointed out by [11], most studies have only focused on increasing model accuracies for Das’s benchmark test set [15]. Without investigating rigorously the performance of frame identification models on other datasets that may contain more cases of unseen targets, domain-specific words, and jargon, it is hard to tell how well these models would perform in real-world NLP applications. The present study thus set out to develop a robust solution that not only outperforms other models on Das’s test set but also works reliably on other data.

2.2 Pre-trained Language Models

There are various ways to convert textual data into the numerical representations that are processable by machine learning algorithms. Local representations in the form of sparse, binary vectors can be obtained easily through one-hot encoding, but they tend to suffer several problems such as the curse of dimensionality and lack of semantics [26, 27]. To overcome the limitations of local representations, research in distributional semantics [19, 28, 29, 30] aims to produce numerical representations for linguistic items based on their distributions in textual data. This line of research has led to the popularization of word embedding, which has become a standard representation technique in NLP. In contrast to the local representations generated from one-hot encoding, distributed representations use low-dimensional vectors of real numbers to capture the semantic and syntactic relations between linguistic items [19, 29].
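The contrast between local and distributed representations can be illustrated with a small sketch. The vocabulary and the two-dimensional dense vectors below are hand-picked for illustration only (they are not learned embeddings and do not come from the paper); the point is simply that any two distinct one-hot vectors are orthogonal, whereas dense vectors can place related words close together.

```python
import math

def one_hot(index, size):
    """Return a sparse binary vector with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vocabulary (hypothetical).
vocab = ["excellent", "superb", "table"]
local = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Hand-picked dense vectors standing in for learned embeddings (illustrative only).
dense = {"excellent": [0.9, 0.1], "superb": [0.8, 0.2], "table": [0.1, 0.9]}

sim_local = cosine(local["excellent"], local["superb"])      # distinct one-hot vectors
sim_dense_near = cosine(dense["excellent"], dense["superb"])  # related words
sim_dense_far = cosine(dense["excellent"], dense["table"])    # unrelated words
```

With one-hot vectors the vector length also grows with the vocabulary size, which is the dimensionality problem mentioned above.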

Generating high-quality word embeddings efficiently is of significant interest to the research community. In particular, the production and optimization of pre-trained word embeddings have been an active research area (e.g., [31, 32]). Pre-trained word embeddings can be obtained via unsupervised learning on unlabeled data. Since the pre-training process is task-agnostic and domain-independent, the resulting word vectors are applicable to a wide range of downstream tasks. Pre-trained word vectors such as those produced by GloVe [32] and Word2Vec [19] have advanced NLP considerably. However, this type of pre-training approach is considered shallow because word embeddings usually serve as input features to initialize the first layer of neural network models, while the rest of the models are trained from scratch using task-specific data [13]. There have been some attempts to incorporate pre-trained embeddings into other layers of neural networks, but since the embeddings are still treated as fixed weights in those models, their effectiveness and usefulness are somewhat limited [13].

Outside of NLP, transfer learning in computer vision research has shown revolutionary improvements in image classification tasks [33, 34, 35]. It is thus not surprising that transfer learning has also attracted tremendous research interest among the NLP community in recent years. Transfer learning usually involves fine-tuning a pre-trained model, i.e., a model that was trained in an unsupervised setting on a massive unlabeled dataset. Instead of learning from scratch, the learning process for a downstream task can then benefit from the weights of the pre-trained models. In other words, the process only needs to adjust the weights to adapt to the specific task, usually with supervised learning on a relatively smaller set of labeled data. Compared to using pre-trained word embeddings, employing a pre-trained language modeling approach allows all model weights—including weights in the pre-trained layers and the task-specific layers—to be jointly updated for a downstream NLP task. Recent advancements in NLP have acquired convincing empirical proof that supports the shift of paradigm from shallow to deep pre-training [12, 13, 14]. Highly promising results have been achieved using pre-trained language models like Universal Language Model Fine-tuning (ULMFiT) [13], OpenAI’s Generative Pre-trained Transformer [14], and Google’s Bidirectional Encoder Representations from Transformers (BERT) [12] on a wide range of NLP tasks.

BERT was introduced by [12] to overcome some limitations in the implementation of language models. They argued that the unidirectional architecture of language models like OpenAI’s GPT [14] could be suboptimal. BERT is a deep bidirectional model trained using two novel pre-training tasks that allow the language model to capture (1) the contextual information of tokens from both left-to-right and right-to-left directions and (2) the relations between sentences. Although the concept underlying the development of BERT is simple, it has been empirically shown to be more effective than other pre-trained language models, setting new state-of-the-art results on benchmark datasets for 11 NLP tasks.

3 The Proposed Method

We propose an attention-based neural network architecture called Positional Attention-based Frame Identification with BERT (PAFIBERT) as a solution to the frame identification problem. In NLP, the use of attention mechanisms has its roots in machine translation [36]. Attention has since become an essential component of neural network models for many NLP problems like text summarization (e.g., [37, 38]), question answering (e.g., [39, 40]), and aspect-based sentiment analysis (e.g., [41, 42]). In general, the primary focus of attention mechanisms in deep learning is to compute alignment vectors consisting of weights that represent the importance of other elements in neural networks. Our method makes use of targets’ positions to attend to the sentence constituents that are useful for disambiguating the targets. In the following subsections, we define the frame identification problem in detail and describe our solution to the problem.

3.1 Problem Definition

Several examples of FrameNet annotations for frame identification are given in Figure 2. These examples are revealing in several ways: first, they show that targets are marked by indices that indicate their positions in sentences; second, a target may consist of non-contiguous stretches of words (as in Example 2); third, as demonstrated by Example 1, multiple instances of the same target may occur at different positions in a sentence and be associated with different frames.

Figure 2: Examples of FrameNet annotations

Given a sentence $s$ and a set of indices $x = \{(a_1, b_1), (a_2, b_2), \dots\}$ that indicates the positions of a target, the goal of frame identification is to determine the most suitable frame for the target based on its usage in $s$.
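As a rough illustration of this input format, the sketch below stores a sentence together with its index pairs. It assumes that each pair $(a_i, b_i)$ is an inclusive start and end offset of one contiguous stretch of the target, and it uses 0-based token indices for simplicity; whether the paper's indices are token-based or character-based is not stated in this section. The class name, field names, and example sentence are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FrameIdInstance:
    """One frame identification input: a sentence s and target indices x."""
    tokens: List[str]               # sentence s, pre-split into tokens
    spans: List[Tuple[int, int]]    # x = {(a_1, b_1), (a_2, b_2), ...}, inclusive

    def target_positions(self) -> List[int]:
        """Expand every (a, b) pair into the individual positions it covers."""
        positions = []
        for a, b in self.spans:
            positions.extend(range(a, b + 1))
        return sorted(positions)

    def target_text(self) -> str:
        """Join the target tokens, which may be non-contiguous."""
        return " ".join(self.tokens[i] for i in self.target_positions())

# Hypothetical example of a non-contiguous target ("gave ... up").
inst = FrameIdInstance(tokens=["She", "gave", "the", "idea", "up"],
                       spans=[(1, 1), (4, 4)])
```

Keeping the spans separate from the token list means that two instances of the same word in one sentence become two distinct inputs, which is the case the positional attention mechanism is meant to handle.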

3.2 Positional Attention-based Frame Identification with BERT (PAFIBERT)

The PAFIBERT method proposed in this study is built on top of BERT. The model architecture used by BERT is a multi-layer transformer encoder, implemented based on the original transformer architecture proposed by [43]. Two BERT models of different sizes are available: BERT$_\text{BASE}$ and BERT$_\text{LARGE}$. This study used BERT$_\text{BASE}$, which consists of 12 transformer blocks, 768 hidden dimensions, and 12 self-attention heads. Input sequences of BERT are constructed by combining token embeddings, segment embeddings, and position embeddings. Special tokens like [CLS] and [SEP] are used to indicate the beginning of sequences or to separate non-consecutive sequences in multi-sequence tasks. BERT adopts the WordPiece technique [44] for token representation, and non-initial word pieces (subwords) are prefixed with ##.

BERT is designed to handle various types of downstream NLP tasks, including those requiring sentence pairs instead of single sentences as inputs. As demonstrated by [12], BERT could be adopted easily for numerous NLP tasks by simply adding an output layer to the model. Several task-specific models were presented in their work for common NLP tasks like sentiment analysis, sentence similarity analysis, and textual entailment. However, as described in subsection 3.1 of this paper, for the frame identification problem, an ambiguous target with multiple occurrences in a given sentence may be associated with different frames, depending on its context or usage in the sentence. None of the task-specific models introduced by [12] provides a valid solution to this problem. For example, a model that uses sentence-target pairs as inputs would generate the same input representation for all occurrences of the target in the same sentence. This limitation prevents the model from disambiguating different instances of the target. As a remedy to this problem, PAFIBERT leverages targets’ positions for position-based attention, thus allowing the generation of target-specific representations for frame identification.

Two variants of PAFIBERT are presented in this study: without frame filtering and with frame filtering. The two neural network designs are depicted in Figures 3 and 4. The key difference between these two variants will be discussed in the next subsection, but in general, PAFIBERT needs two inputs: sentence $s$ and a target indicated by its position indices $x$. The input sequence generated from $s$ is fed through the transformer layers of BERT to produce $H$, an $(n \times d)$ matrix of hidden states where $n$ is the sequence length and $d$ is the hidden dimension. The attention mechanism transforms $H$ into a pooled attentional hidden state $\tilde{h}$, which is a $d$-dimensional vector that captures the information about the input sentence and the target. Using the default configuration in BERT$_\text{BASE}$, the hidden dimension is set to 768. As suggested by [45], the attentional hidden state can be computed as

$$\tilde{h} = \tanh(W_c [c; t]) \tag{1}$$

where the context vector $c$ and target vector $t$ are combined via concatenation and passed through a neural network layer. $W_c$ is the weight matrix of the layer, and the context vector $c$ is a $d$-dimensional vector derived as follows:

$$c = H^{\top} \alpha \tag{2}$$

Figure 3: PAFIBERT without frame filtering

Figure 4: PAFIBERT with frame filtering
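The shapes involved in Equations 1 and 2 can be checked with a small sketch. It treats the alignment weights and the target vector as given inputs, uses plain Python lists to stay dependency-free, and assumes that the weight matrix of Equation 1 has shape $d \times 2d$ so that the output is again $d$-dimensional (this shape is inferred from the surrounding text rather than stated explicitly). All numeric values and function names are illustrative.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def attentional_state(H, alpha, t, W_c):
    """Compute the attentional hidden state from Equations 1 and 2."""
    c = matvec(transpose(H), alpha)   # Eq. 2: c = H^T alpha, a d-dim vector
    concat = c + t                    # [c; t], a 2d-dim vector
    return [math.tanh(z) for z in matvec(W_c, concat)]  # Eq. 1

# Toy sizes: n = 3 hidden states, d = 2 hidden dimensions.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # (n x d) hidden states
alpha = [0.5, 0.5, 0.0]                    # n-dim alignment vector
t = [0.0, 1.0]                             # d-dim target vector
W_c = [[1.0, 0.0, 0.0, 0.0],               # assumed (d x 2d) weight matrix
       [0.0, 0.0, 0.0, 1.0]]

h_tilde = attentional_state(H, alpha, t, W_c)
```

Here the context vector is the weighted sum of the first two hidden states, and the toy weight matrix simply picks one component from the context and one from the target before the tanh.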

The context vector $c$ captures the relations between the hidden states and the $n$-dimensional alignment vector $\alpha$, which defines how much each hidden state contributes to the prediction of an output. Various alternatives exist for computing the alignment vector $\alpha$ in the above equation. PAFIBERT adopts a local attention mechanism [45] that restricts the attention to only a subset of the input hidden states. Compared to a global attention mechanism that computes the alignment scores for all hidden states of an input sequence, a local attention mechanism is computationally less expensive. The intuition behind using a local attention mechanism for the frame identification problem is that, given a target in a lengthy sentence, frame identification models may not need the whole sentence to predict the most likely frame for the target. Instead, the contextual words adjacent to the target are likely to be sufficient for this task. Hence, PAFIBERT applies a fixed window ($w$) to limit the scope of attention to only $w$ hidden states before and after the target positions. In other words, it only computes the alignment scores for the hidden states that fall within the range of $[\beta_1, \beta_2]$ where

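$$\beta_1 = \begin{cases} p_{\text{start}} - w & \text{if } (p_{\text{start}} - w) > 1, \\ 1 & \text{if } (p_{\text{start}} - w) \le 1, \end{cases} \tag{3}$$

$$\beta_2 = \begin{cases} p_{\text{end}} + w & \text{if } (p_{\text{end}} + w) < n, \\ n & \text{if } (p_{\text{end}} + w) \ge n. \end{cases} \tag{4}$$

In the above formulas, $p_{\text{start}}$ and $p_{\text{end}}$ denote the smallest and largest target positions ($p_{\text{start}} = p_{\text{end}}$ for single-word targets). By observing the typical scope needed to infer the correct frames for the hand-labeled annotations in FrameNet’s text corpora, we selected the size of the fixed window as $w = 10$ in this study. For each hidden state at position $i$ (i.e., for the $i$-th input token) within the range of $[\beta_1, \beta_2]$, its alignment score ($\alpha_i$) is calculated as follows:

$$\alpha_i = \frac{\exp(h_i \cdot t)}{\sum_{j=\beta_1}^{\beta_2} \exp(h_j \cdot t)} \tag{5}$$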
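The window bounds and the windowed softmax can be sketched as follows. The sketch uses 1-based positions to match Equations 3 and 4, reads the score inside the exponential as a dot product between a hidden state and the target vector, and assigns zero to positions outside the window. The function names are hypothetical, and subtracting the maximum score is a standard numerical-stability step that is not part of the paper's formula.

```python
import math

def window_bounds(p_start, p_end, w, n):
    """Eqs. 3-4: clip the window [p_start - w, p_end + w] to [1, n] (1-based)."""
    beta1 = p_start - w if (p_start - w) > 1 else 1
    beta2 = p_end + w if (p_end + w) < n else n
    return beta1, beta2

def alignment(H, t, p_start, p_end, w):
    """Eq. 5: softmax of dot-product scores inside the window, zero elsewhere."""
    n = len(H)
    beta1, beta2 = window_bounds(p_start, p_end, w, n)
    scores = {i: sum(h * x for h, x in zip(H[i - 1], t))
              for i in range(beta1, beta2 + 1)}
    top = max(scores.values())                       # for numerical stability
    exps = {i: math.exp(s - top) for i, s in scores.items()}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(1, n + 1)]

# Toy input: n = 6 hidden states of dimension d = 2, target at position 3, w = 1.
H = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [2.0, 0.0], [5.0, 5.0], [1.0, 1.0]]
alpha = alignment(H, [1.0, 0.0], p_start=3, p_end=3, w=1)
```

In the toy input, the fifth hidden state would receive a large score under global attention, but it lies outside the window of positions 2 to 4 and therefore gets zero weight.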

Note that Equation 5 describes a softmax function such that $\alpha$ represents a probability distribution that defines the importance of each hidden state. For the hidden states that fall outside the fixed window, the alignment scores are set to zero. The derivation of the target vector $t$ in Equations 1 and 5 is very much problem-specific and varies across different attention-based neural networks. In our method, $t$ is extracted from the hidden states that correspond to the positions of a specified target. The target positions are represented using an $n$-dimensional position vector ($p$), which consists of binary values indicating whether a particular position is a target position (value 1) or a non-target position (value 0). Since BERT uses the WordPiece technique [44] for text tokenization, the position vector $p$ is derived such that it is compatible with BERT’s input tokens. As depicted in Figure 5, if a target is split by BERT’s tokenizer into word pieces, the position vector marks the word pieces as part of the target. Based on the target’s positions, the $d$-dimensional target vector is then derived as follows from the hidden states in $H$:

$$t = H^{\top} p \tag{6}$$

Figure 5: An example of the position vector $p$
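A minimal sketch of how the position vector and the target vector might be built is given below. The word-piece split is written by hand (a real BERT tokenizer may split the words differently), and including [CLS] and [SEP] as non-target positions is an assumption, as are the function names and the toy hidden states. Because the position vector is binary, Equation 6 amounts to summing the hidden states at the target positions.

```python
def position_vector(word_pieces, target_words):
    """Build the token sequence and the binary position vector p.

    word_pieces: one list of word pieces per original word (hand-written here).
    target_words: set of 0-based word indices that belong to the target.
    """
    tokens, p = ["[CLS]"], [0]
    for idx, pieces in enumerate(word_pieces):
        for piece in pieces:
            tokens.append(piece)
            # Every piece of a target word is marked as a target position.
            p.append(1 if idx in target_words else 0)
    tokens.append("[SEP]")
    p.append(0)
    return tokens, p

def target_vector(H, p):
    """Eq. 6: t = H^T p, the sum of hidden states at target positions."""
    d = len(H[0])
    return [sum(H[i][j] * p[i] for i in range(len(H))) for j in range(d)]

# Hypothetical split of the target word into two word pieces.
word_pieces = [["the"], ["meal"], ["was"], ["excel", "##lent"]]
tokens, p = position_vector(word_pieces, target_words={3})

# Toy hidden states: row i is [i, 1], so the result is easy to verify by hand.
H = [[float(i), 1.0] for i in range(len(tokens))]
t = target_vector(H, p)
```

Since the result is a sum rather than an average, a target that spans more word pieces yields a target vector of larger magnitude under this formulation.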

3.3 Frame Filtering

Most frame identification studies have adopted the standard experimental setup in which FrameNet is used to specify the candidate frames given a target, and the most suitable frame for the target is then selected from the candidate frames [11]. Under this setup, if the target is unambiguous—it has only one candidate frame—the frame is selected directly. On the other hand, if the target is ambiguous, this frame filtering step allows frame identification models to limit their search to the candidate frames instead of considering every frame in FrameNet. This setup, however, makes strong assumptions about the comprehensiveness of FrameNet: It assumes that FrameNet specifies every possible sense of the target and that the list of LUs under each frame is exhaustive. These assumptions may be acceptable when applied to the development data released by FrameNet, but they certainly do not hold for real-world data. Therefore, following the suggestion of [11], we experimented with two neural network designs: with frame filtering and without frame filtering.

In the no-filtering model, the attentional hidden state (h~) is simply fed through a softmax layer to generate a k-dimensional vector y using Equation 7 below, where k is the number of frames in FrameNet. Vector y denotes the frame probability distribution given sentence s and a target marked by position indices x. Ws is the weight matrix of the softmax layer.

$y = \mathrm{softmax}(W_s \tilde{h})$ (7)

In the model with frame filtering, a filtering step is added before the softmax layer so that the probability values for the non-candidate frames are set to zero, whereas the probability values for the candidate frames sum to 1. This filtering step generates the filtered prediction vector o, a k-dimensional vector calculated as follows:

$o = (W_k \tilde{h}) \odot v$ (8)

In the above equation, $\tilde{h}$ is passed through a fully-connected layer consisting of k nodes, and $W_k$ is the weight matrix of that layer. Element-wise multiplication is then performed between the output of the fully-connected layer and vector v, a k-dimensional vector of binary values that specify whether a particular frame is a candidate frame. For unseen targets, this step assumes that every frame in FrameNet is a potential candidate. Because of the element-wise multiplication, only the candidate frames will have non-zero probability values. Finally, the model produces the k-dimensional frame probability distribution as follows:

$y = \mathrm{softmax}(W_s o)$ (9)
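The effect of frame filtering on the output distribution can be sketched as follows. This is not a reproduction of Equations 8 and 9: as a simplifying assumption, the sketch uses a masked softmax that normalizes only over the candidate frames, which is one way to obtain the stated property that non-candidate frames receive zero probability while candidate probabilities sum to 1. The scores, the mask helper, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_mask(candidates, k):
    """Binary vector v; an unseen target (no candidates) allows every frame."""
    if not candidates:
        return [1] * k
    return [1 if i in candidates else 0 for i in range(k)]


def filtered_distribution(scores, v):
    """Normalize only over frames with v[i] == 1; all other frames get 0."""
    kept = iter(softmax([s for s, keep in zip(scores, v) if keep]))
    return [next(kept) if keep else 0.0 for keep in v]


scores = [2.0, 1.0, 0.5, -1.0]        # toy frame scores, k = 4
v = candidate_mask({0, 2}, k=4)       # frames 0 and 2 are candidates
y = filtered_distribution(scores, v)
y_unseen = filtered_distribution(scores, candidate_mask(None, k=4))
print(y, y_unseen)
```

For an unseen target the mask is all ones, so the filtered distribution coincides with the unfiltered softmax over all k frames.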

3.4 Listing of Candidate Frames

As shown in Figure 2, FrameNet annotations associate each target with a gold LU and gold frame. Therefore, when implementing frame filtering in frame identification models, existing studies (e.g., [21, 23, 25]) have assumed the gold LUs are known. In other words, the models were allowed to use the gold LUs to retrieve candidate frames from FrameNet.

Clearly, this simplified setup does not accurately reflect real-world deployment scenarios. In a more realistic setup, the gold LUs should be unknown, and frame identification models should predict based on the given targets, which may occur in any word form (e.g., singular or plural, past or present tense, comparative or superlative). When applying frame filtering to PAFIBERT, we experimented with the following approaches:

  • Frame filtering by LUs. This is the simplified approach described above, in which the gold LUs are used to retrieve candidate frames for frame filtering.

  • Frame filtering by targets. This approach assumes that the gold LUs are unknown. Additional pre-processing steps are thus needed to enumerate potential candidate frames for frame filtering. To this end, we utilized the Pattern toolkit (https://www.clips.uantwerpen.be/pattern) to generate derived forms such as comparatives, superlatives, conjugated verbs, and so forth for every LU in FrameNet. A lookup table was then created to link these inflections to potential candidate frames. During model training and testing, the given targets were used as lookup keys to retrieve candidate frames from the table. For instance, for the LU ‘know.v’, the table construction step generated ‘know’, ‘knows’, ‘knowing’, and ‘knew’ as the derived forms. When one of these derived forms was marked as the target, the lookup table returned [Certainty], [Differentiation], [Awareness], and [Familiarity] as candidate frames.
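The target-based lookup described above can be sketched in a few lines. The inflected forms are written out by hand here as a stand-in for the output of an inflection toolkit, so no Pattern API calls are shown or implied. The 'know.v' entry follows the example in the text, and the dictionaries and function names are hypothetical.

```python
from collections import defaultdict

# Toy data: the frames and derived forms for 'know.v' follow the example in
# the text; a real table would cover every LU in FrameNet.
LU_FRAMES = {
    "know.v": ["Certainty", "Differentiation", "Awareness", "Familiarity"],
}
LU_FORMS = {
    "know.v": ["know", "knows", "knowing", "knew"],
}


def build_lookup(lu_frames, lu_forms):
    """Map every derived word form to the set of frames of its LUs."""
    table = defaultdict(set)
    for lu, frames in lu_frames.items():
        for form in lu_forms.get(lu, []):
            table[form].update(frames)
    return table


def candidate_frames(table, target):
    """Return the candidate frames for a target, or [] if it is unseen."""
    return sorted(table.get(target.lower(), ()))


table = build_lookup(LU_FRAMES, LU_FORMS)
print(candidate_frames(table, "knew"))
print(candidate_frames(table, "cakey"))
```

A target that is absent from the table returns an empty list, which corresponds to the unseen-target case in which every frame is treated as a potential candidate.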

4 Experiments

4.1 Model Training and Hyperparameter Tuning

The latest version of FrameNet is 1.7, but FrameNet 1.5 has been the version used almost exclusively in frame-semantic research. Compared to FrameNet 1.5, FrameNet 1.7 has a substantially larger number of frames and LUs, thus presenting a more challenging stage for frame identification. To facilitate the comparison of results with other studies, we conducted our experiments using both FrameNet versions. FrameNet provides the following types of hand-labeled data with each release:

  • The exemplary data, which contain example sentences of the LUs. Each sentence is annotated for one target and LU.

  • The full-text data, which consist of sentences with multiple annotations pertaining to multiple targets and LUs.

Both the exemplary data and the full-text data were used in this study for model development. We used the same train/test split as in previous work, and 16 documents from the full-text data were set aside for hyperparameter tuning [21]. The hyperparameters were selected from the ranges of values suggested by [12]. The ranges for the batch size and the learning rate are {16, 32} and {2e-5, 3e-5, 5e-5} respectively. However, we found that the range suggested for the number of epochs (i.e., {3, 4}) did not produce optimal models in our initial experiments. We thus increased the range to {3, 4, 5, 6, 7, 8} for the hyperparameter search. The maximum input sequence length (n) was set based on the length of the longest sentence in the training data. We chose n=260 for FrameNet 1.5 and n=320 for FrameNet 1.7. The selected hyperparameters are shown in Table 1.

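Table 1: The selected hyperparameters

| FrameNet version | Model variant | Batch size | Learning rate | Epochs |
| --- | --- | --- | --- | --- |
| 1.5 | Filtered by LUs | 32 | 3e-5 | 8 |
| 1.5 | Filtered by Targets | 32 | 3e-5 | 6 |
| 1.5 | No Filter | 32 | 3e-5 | 8 |
| 1.7 | Filtered by LUs | 16 | 3e-5 | 7 |
| 1.7 | Filtered by Targets | 16 | 3e-5 | 6 |
| 1.7 | No Filter | 16 | 3e-5 | 8 |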
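To give a sense of the size of the search space, the sketch below enumerates the hyperparameter ranges stated above. It assumes an exhaustive grid over the three ranges, which the text does not explicitly state; the variable and function names are illustrative.

```python
from itertools import product

BATCH_SIZES = [16, 32]
LEARNING_RATES = [2e-5, 3e-5, 5e-5]
EPOCHS = [3, 4, 5, 6, 7, 8]   # extended from the suggested {3, 4}


def search_grid():
    """Enumerate every combination of the three hyperparameter ranges."""
    return [
        {"batch_size": b, "learning_rate": lr, "epochs": e}
        for b, lr, e in product(BATCH_SIZES, LEARNING_RATES, EPOCHS)
    ]


grid = search_grid()
print(len(grid))  # 2 * 3 * 6 = 36 configurations per model variant
```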

As described in the previous section, the following variants of PAFIBERT were developed in this study:

  • No frame filtering

  • Frame filtering based on LUs

  • Frame filtering based on Targets

4.2 Model Evaluation

The above models were evaluated on two benchmark test sets: Das’s test set [15] and the YAGS test set [11]. The former is a widely used test set randomly sampled by [15] from FrameNet’s full-text data. It is an in-domain test set because it shares the same data sources as the training data. In FrameNet 1.5, Das’s test set contains 4,458 frame annotations from 23 documents. The number of frame annotations in these documents was increased to 6,728 with FrameNet 1.7. It is worth noting that some sentences in Das’s test set also occur in FrameNet’s exemplary data, but we removed the duplicated sentences from the training data to ensure that the test data were completely unseen during the model training stage. The YAGS test set was introduced by [11] as an out-of-domain test set. This test set contains 1,415 sentences sampled from Yahoo! Answers, and the number of frame annotations is 3,091. Clearly, for a frame identification model to be usable for real-world problems, it must remain robust on out-of-domain data. To evaluate our method more rigorously, in addition to the evaluation on the YAGS test set, we also examined PAFIBERT qualitatively on several sample sentences obtained from product reviews in a relatively under-explored domain, i.e., the beauty product domain. All tokens in the sample sentences, except certain stopwords, were passed as targets to our models. We included sentences that contain unseen targets and domain-specific terms to examine the capability of the models in handling these more complicated cases.

5 Results and Discussion

5.1 Model Performance on Benchmark Test Sets

Tables 2 and 3 compare the accuracies of our models to the results reported by several representative studies in frame-semantic parsing. Since the key challenge in frame identification lies in the ambiguous LUs, some studies have assessed these more challenging and interesting cases separately to provide a more useful performance measure. Likewise, we employed the same procedure to evaluate our models for (1) all targets and (2) the ambiguous targets.
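The two evaluation views can be sketched as follows. The sketch assumes each test instance records its gold frame, its predicted frame, and the number of candidate frames FrameNet lists for the target, with a target counted as ambiguous when it has more than one candidate. The records and field names are hypothetical toy data, not results from the experiments.

```python
def accuracy(records):
    """Percentage of records whose predicted frame equals the gold frame."""
    if not records:
        return None
    correct = sum(1 for r in records if r["gold"] == r["pred"])
    return 100.0 * correct / len(records)


def evaluate(records):
    """Report accuracy for all targets and for ambiguous targets only."""
    ambiguous = [r for r in records if r["n_candidates"] > 1]
    return {"all": accuracy(records), "ambiguous": accuracy(ambiguous)}


# Hypothetical toy records for illustration only.
records = [
    {"gold": "Possibility", "pred": "Capability", "n_candidates": 2},
    {"gold": "Possession", "pred": "Possession", "n_candidates": 2},
    {"gold": "Deserving", "pred": "Deserving", "n_candidates": 1},
    {"gold": "Color", "pred": "Color", "n_candidates": 1},
]
result = evaluate(records)
print(result)
```

Unambiguous targets are trivially correct under frame filtering, which is why the ambiguous-only figure is the more informative of the two.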

As can be seen from the tables, PAFIBERT outperformed other models on both the in-domain Das’s test set and the out-of-domain YAGS test set, yielding new state-of-the-art results for frame identification. As expected, the results obtained with LU-based frame filtering were slightly better than those produced using target-based frame filtering, suggesting that a realistic setup is crucial in measuring model performance more precisely. In the following discussion, we only refer to the target-based approach when the discussion involves PAFIBERT with frame filtering.

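Table 2: Results with frame filtering

| Test set | Model | FrameNet 1.5: All | FrameNet 1.5: Ambiguous | FrameNet 1.7: All | FrameNet 1.7: Ambiguous |
| --- | --- | --- | --- | --- | --- |
| Das’s Test Set [15] | SEMAFOR [46] | 83.60 | 69.19 | - | - |
| Das’s Test Set [15] | Hermann et al. [21] | 88.73 | 73.67 | - | - |
| Das’s Test Set [15] | Yang and Mitchell [24] | 88.20 | 75.70 | - | - |
| Das’s Test Set [15] | Hartmann et al. [11] | 87.63 | 73.80 | - | - |
| Das’s Test Set [15] | Botschen et al. [47] | 88.82 | 75.28 | - | - |
| Das’s Test Set [15] | Peng et al. [25] | 90.00 | 78.00 | 89.10 | 77.50 |
| Das’s Test Set [15] | PAFIBERT - Filtered by LUs | 92.22 | 82.90 | 91.44 | 82.55 |
| Das’s Test Set [15] | PAFIBERT - Filtered by Targets | 91.39 | 82.80 | 90.15 | 81.92 |
| YAGS Test Set [11] | SEMAFOR (Reported by [11]) | 60.01 | - | - | - |
| YAGS Test Set [11] | Hartmann et al. [11] | 62.51 | - | - | - |
| YAGS Test Set [11] | PAFIBERT - Filtered by LUs | 75.06 | 69.07 | - | - |
| YAGS Test Set [11] | PAFIBERT - Filtered by Targets | 75.01 | 69.07 | - | - |

Table 3: Results without frame filtering

| Test set | Model | FrameNet 1.5: All | FrameNet 1.5: Ambiguous | FrameNet 1.7: All | FrameNet 1.7: Ambiguous |
| --- | --- | --- | --- | --- | --- |
| Das’s Test Set [15] | Hartmann et al. [11] | 77.49 | - | - | - |
| Das’s Test Set [15] | Botschen et al. [47] | 81.21 | 72.51 | - | - |
| Das’s Test Set [15] | PAFIBERT - No Filter | 89.57 | 81.86 | 88.97 | 81.83 |
| YAGS Test Set [11] | PAFIBERT - No Filter | 72.17 | 67.17 | - | - |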

Table 4 shows the top five most frequently confused pairs of frames from the experiments. There is a substantial overlap in these confusing frames across models. In particular, our models seemed to have trouble distinguishing the following frames:

  • [Possibility] and [Capability]

  • [Possession] and [Have_associated]

Table 4: Most frequently confused pairs of frames (for FrameNet 1.7)

| Model | Gold Frame | Predicted Frame | # Misclassified Instances | Potentially Confusing LUs |
| --- | --- | --- | --- | --- |
| PAFIBERT - Filtered by Targets | [Possession] | [Have_associated] | 14 | have.v, have got.v |
| PAFIBERT - Filtered by Targets | [Possibility] | [Capability] | 13 | can.v |
| PAFIBERT - Filtered by Targets | [Measure_duration] | [Calendric_unit] | 12 | century.n, day.n, decade.n, fortnight.n, hour.n, millennium.n, minute.n, month.n, second.n, week.n, year.n |
| PAFIBERT - Filtered by Targets | [Political_locales] | [Locale_by_use] | 9 | country.n, city.n, village.n |
| PAFIBERT - Filtered by Targets | [Visiting] | [Arriving] | 9 | visit.v, visit.n |
| PAFIBERT - No Filter | [Possibility] | [Capability] | 13 | can.v |
| PAFIBERT - No Filter | [Possession] | [Have_associated] | 12 | have.v, have got.v |
| PAFIBERT - No Filter | [Capability] | [Possibility] | 10 | can.v |
| PAFIBERT - No Filter | [Locative_relation] | [Spatial_co_location] | 9 | at.prep, here.adv, there.adv |
| PAFIBERT - No Filter | [Visiting] | [Arriving] | 9 | visit.v, visit.n |
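A tally of confused frame pairs such as the one reported in Table 4 can be computed from parallel lists of gold and predicted frames. The sketch below treats (gold, predicted) as an ordered pair, so the two directions of a confusion are counted separately, as they are in the table. The input lists are hypothetical toy data and the function name is illustrative.

```python
from collections import Counter


def confused_pairs(gold, pred, top=5):
    """Count ordered (gold, predicted) pairs over misclassified instances."""
    counts = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return counts.most_common(top)


# Hypothetical toy predictions for illustration only.
gold = ["Possibility", "Possibility", "Capability", "Possession", "Visiting"]
pred = ["Capability", "Capability", "Possibility", "Possession", "Arriving"]
pairs = confused_pairs(gold, pred)
print(pairs)
```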

Some examples of the misclassified instances are shown in Table 5. Figure 6 gives the simplified schematic representations of the confused frames to aid the interpretation of the misclassified examples. From the examples, we found that distinguishing these confused senses is difficult and requires a deep understanding of the underlying sentential contexts. Despite the difficulty in distinguishing these more nuanced senses, the overall improvements made by PAFIBERT are still encouraging. In particular, the performance gain for ambiguous targets is promising: on Das’s test set, for example, PAFIBERT achieved absolute improvements of 4-5 percentage points with frame filtering (Table 2) and about 9 percentage points without frame filtering (Table 3). As reported by [11], frame identification models usually suffer a drastic drop in performance when tested on out-of-domain data. The same trend was observed in our experiments for the YAGS test set. Nevertheless, PAFIBERT still outperformed existing methods by a large margin on this out-of-domain test set.
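The absolute improvements for ambiguous targets can be checked directly against the accuracies in Tables 2 and 3. The sketch below compares PAFIBERT with the best previously reported ambiguous-target accuracy in each setting on Das’s test set. Using the target-based variant for the filtered setting is an assumption, since the text does not state which variant the quoted range refers to.

```python
# Ambiguous-target accuracies on Das's test set, copied from Tables 2 and 3.
PRIOR_BEST = {
    ("filtered", "1.5"): 78.00,    # Peng et al. [25]
    ("filtered", "1.7"): 77.50,    # Peng et al. [25]
    ("unfiltered", "1.5"): 72.51,  # Botschen et al. [47]
}
PAFIBERT = {
    ("filtered", "1.5"): 82.80,    # filtered by targets
    ("filtered", "1.7"): 81.92,    # filtered by targets
    ("unfiltered", "1.5"): 81.86,  # no filter
}

# Absolute gain in percentage points for each (setting, FrameNet version).
gains = {key: round(PAFIBERT[key] - PRIOR_BEST[key], 2) for key in PAFIBERT}
for key, gain in sorted(gains.items()):
    print(key, gain)
```

With LU-based filtering the corresponding gains are slightly larger, since that variant scored 82.90 and 82.55 on the ambiguous targets.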

Table 5: Examples of instances misclassified by PAFIBERT with target-based frame filtering

| Test Sentence | Gold Frame | Predicted Frame |
| --- | --- | --- |
| The people who can benefit most directly from your generosity have no time to waste . | [Possibility] | [Capability] |
| You can help them to know that feeling . | [Capability] | [Possibility] |
| Hong Kong is crowded - it has one of the world ’s greatest population densities . | [Possession] | [Have_associated] |
| The casinos have no admission charge and formal dress is optional , though long pants for men are required . | [Have_associated] | [Possession] |

Figure 6: Simplified schematic representations for the frequently confused frames

Moreover, in the more challenging setup that involved no frame filtering, the results obtained by PAFIBERT were on par with the prior state-of-the-art accuracies achieved with frame filtering. As will be demonstrated in the discussion below, PAFIBERT without frame filtering also has the advantages of generalizing well to other domains and making reasonable inferences for unseen targets.

5.2 Qualitative Evaluation on Out-of-Domain Sample Sentences

In this section, we present interesting frame identification cases for several sample sentences selected from beauty product reviews. The predictions made by the two PAFIBERT variants—without frame filtering and with target-based filtering—are compared. Figure 7 shows the sentences, targets, and predicted frames.

Figure 7: The sample sentences, targets, and predicted frames (based on FrameNet 1.7). Unseen targets are presented in boldface.

First and foremost, due to the way PAFIBERT was designed to utilize targets’ positions for attention-based frame identification, both variants of PAFIBERT were able to handle the ambiguous targets that occur multiple times in the same sentence but refer to different frames with each occurrence. This scenario is exemplified by Sentence 1, in which the word ‘so’ occurs twice. The first occurrence of ‘so’ was linked to the [Degree] frame, whereas the second occurrence was recognized as referring to the [Causation] frame.

What stands out the most from this qualitative evaluation is that although PAFIBERT with frame filtering achieved higher accuracies on the benchmark test sets, it did not produce reasonable predictions for the unseen targets in the sample sentences. On the other hand, PAFIBERT without frame filtering seemed to handle the unseen targets more satisfactorily. It is generally agreed that inadequate coverage of LUs is a major shortcoming of FrameNet [11, 47]. Previous studies have tried to address this issue in many ways: some extended FrameNet using other resources such as WordNet, VerbNet, and so forth [17, 48, 49]; others adopted statistical or graph-based strategies to generalize to unseen targets [15, 18]. Nevertheless, unseen targets remain a major obstacle in applying frame-semantic parsing to real-world problems. From the examples in Figure 7, we observed that PAFIBERT without frame filtering selected suitable frames for most of the unseen targets, including ‘toronto’ (Sentence 2), ‘walmart’ (Sentence 2), ‘cakey’ (Sentence 3), and ‘highlighting’ (Sentence 3). Particularly interesting is the ability of the model to associate ‘toronto’ with [Political_locales] and ‘walmart’ with [Businesses]. This behavior of the model resembles that of a named entity recognizer (e.g., [50]), but it encompasses a broader range of categories in comparison to the latter.

Also, since meanings of words are sometimes domain-specific, another attractive property of PAFIBERT without frame filtering is its ability to identify frames in accordance with the domain. For instance, associating the word ‘highlighting’ with [Body_decoration] (Sentence 3) is appropriate in the beauty domain because a highlighter is a cosmetic product applied to some regions of the face to reflect light and create contours. Another domain-aware identification is exemplified by the word ‘shade’ (Sentence 3). Although ‘shade.n’ is listed under the [Type] frame in FrameNet, it was tagged as [Color] by PAFIBERT without frame filtering. Indeed, in this specific context, [Color] seems to be a better match for ‘shade.’

Furthermore, PAFIBERT without frame filtering worked better for non-contiguous multi-word LUs, i.e., LUs that contain insertions. In Sentence 4, for example, when the constituents of the LU ‘give up.v’ are separated by the word ‘it,’ PAFIBERT with frame filtering failed to associate ‘up’ with [Surrendering_possession] because this frame is not a candidate frame for ‘up.’ However, without frame filtering and the constraints it enforces, PAFIBERT was able to recognize the word ‘up’ as part of the phrasal verb ‘give up.’

So far, PAFIBERT without frame filtering is in the lead in the above comparison of model generalizability and ability to handle unseen targets. However, there is a disadvantage in performing frame identification without frame filtering. Specifically, since a no-filtering model is not bounded by candidate frames, it relies solely on sentential contexts to make its predictions. This renders the model highly sensitive to contextual variations. In other words, the model might be more prone to identification errors when the context it encounters in a sentence differs significantly from the typical contexts seen in the training data. In contrast, PAFIBERT with frame filtering can take full advantage of the frame-LU relations defined in FrameNet to avoid straying from the correct frames, thus reducing the risk of being misled by unfamiliar contexts. For instance, with frame filtering, the target ‘worth’ in Sentence 1 was treated as unambiguous because FrameNet only returned [Deserving] as the candidate frame for this target. However, due to its sole reliance on sentential contexts, PAFIBERT without frame filtering chose [Earnings_and_losses] for this target. Although the predicted frame is not too far-fetched, it does not seem to be a better match than [Deserving]. This issue creates a dilemma: while frame filtering can reduce the errors caused by contextual variations, enforcing candidate frames reduces the generalizability of frame identification models. Future research should attempt to find a hybrid solution that biases frame predictions toward the candidate frames specified by FrameNet while maintaining the ability to deal with unseen targets satisfactorily and to choose other frames over candidate frames if deemed more suitable.

6 Conclusions

This paper presented a neural network architecture called Positional Attention-based Frame Identification with BERT (PAFIBERT) as a solution to frame identification, which is an indispensable subtask in frame-semantic parsing. PAFIBERT inherited the language representation power of BERT and combined it with a position-based attention mechanism to capture target-specific contextual information. The comparison with existing models revealed that PAFIBERT was empirically powerful and superior to other models. Under various experimental settings, PAFIBERT showed significant improvements for both in-domain and out-of-domain test sets. Two variants of PAFIBERT were created and compared in this study, and each variant has its own strengths and weaknesses. Future research may work toward finding a solution that combines the best of both worlds. In addition, through the analysis of the most frequently confused pairs of frames, we found that PAFIBERT still failed to recognize some subtle differences that required a more profound understanding of sentence contexts and semantics. Considerably more work will need to be done to find out how far the rich semantic representation of pre-trained language models can take us in dealing with this kind of nuanced sense distinction. Since the most frequently confused frames caused a substantial loss in model accuracy, another possible progression of this work is to consider a divide-and-conquer strategy that places additional emphasis on solving the more challenging cases.


The research leading to the results in this study has received funding from Nanyang Technological University, Singapore, under grant agreement M4082112.


  • [1] Srini Narayanan and Sanda Harabagiu. Question answering based on semantic structures. In Proceedings of the 20th international conference on Computational Linguistics, page 693. Association for Computational Linguistics, 2004.
  • [2] Dan Shen and Mirella Lapata. Using semantic roles to improve question answering. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), pages 12–21, 2007.
  • [3] Janara Christensen, Stephen Soderland, and Oren Etzioni. Semantic role labeling for open information extraction. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pages 52–60. Association for Computational Linguistics, 2010.
  • [4] Mihai Surdeanu, Sanda Harabagiu, John Williams, and Paul Aarseth. Using predicate-argument structures for information extraction. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 8–15, 2003.
  • [5] Aldo Gangemi, Diego Reforgiato Recupero, Misael Mongiovì, Andrea Giovanni Nuzzolese, and Valentina Presutti. Identifying motifs for evaluating open knowledge extraction on the web. Knowledge-Based Systems, 108:33–41, 2016.
  • [6] Mehwish Alam, Diego Reforgiato Recupero, Misael Mongiovi, Aldo Gangemi, and Petar Ristoski. Event-based knowledge reconciliation using frame embeddings and frame similarity. Knowledge-Based Systems, 135:192–203, 2017.
  • [7] Martha Palmer, Daniel Gildea, and Nianwen Xue. Semantic role labeling. Synthesis Lectures on Human Language Technologies, 3(1):1–103, 2010.
  • [8] Collin F Baker, Charles J Fillmore, and John B Lowe. The berkeley framenet project. In Proceedings of the 17th international conference on Computational linguistics-Volume 1, pages 86–90. Association for Computational Linguistics, 1998.
  • [9] Martha Palmer, Daniel Gildea, and Paul Kingsbury. The proposition bank: An annotated corpus of semantic roles. Computational linguistics, 31(1):71–106, 2005.
  • [10] Charles J Fillmore. Frame semantics and the nature of language. In Annals of the New York Academy of Sciences: Conference on the origin and development of language and speech, pages 20–32, 1976.
  • [11] Silvana Hartmann, Ilia Kuznetsov, Teresa Martin, and Iryna Gurevych. Out-of-domain framenet semantic role labeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 471–482, 2017.
  • [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [13] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
  • [14] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
  • [15] Dipanjan Das and Noah A Smith. Semi-supervised frame-semantic parsing for unknown predicates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 1435–1444. Association for Computational Linguistics, 2011.
  • [16] Collin Baker, Michael Ellsworth, and Katrin Erk. Semeval’07 task 19: frame semantic structure extraction. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 99–104. Association for Computational Linguistics, 2007.
  • [17] Richard Johansson and Pierre Nugues. Lth: semantic structure extraction using nonprojective dependency trees. In Proceedings of the fourth international workshop on semantic evaluations (SemEval-2007), pages 227–230, 2007.
  • [18] Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A Smith. Semafor 1.0: A probabilistic frame-semantic parser. Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 2010.
  • [19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • [20] Nicholas FitzGerald, Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. Semantic role labeling with neural network factors. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 960–970, 2015.
  • [21] Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. Semantic frame identification with distributed word representations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1448–1458, 2014.
  • [22] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
  • [23] Swabha Swayamdipta, Sam Thomson, Chris Dyer, and Noah A Smith. Frame-semantic parsing with softmax-margin segmental rnns and a syntactic scaffold. arXiv preprint arXiv:1706.09528, 2017.
  • [24] Bishan Yang and Tom Mitchell. A joint sequential and relational model for frame-semantic parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1247–1256, 2017.
  • [25] Hao Peng, Sam Thomson, Swabha Swayamdipta, and Noah A Smith. Learning joint semantic parsers from disjoint data. arXiv preprint arXiv:1804.05990, 2018.
  • [26] Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. In Advances in Neural Information Processing Systems, pages 400–406, 2000.
  • [27] Weinan Zhang, Tianming Du, and Jun Wang. Deep learning over multi-field categorical data. In European conference on information retrieval, pages 45–57. Springer, 2016.
  • [28] Elia Bruni, Nam-Khanh Tran, and Marco Baroni. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47, 2014.
  • [29] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • [30] Jeff Mitchell and Mirella Lapata. Composition in distributional models of semantics. Cognitive science, 34(8):1388–1429, 2010.
  • [31] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405, 2017.
  • [32] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • [33] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [34] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [35] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [36] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [37] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023, 2016.
  • [38] Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
  • [39] Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. In International conference on machine learning, pages 2397–2406, 2016.
  • [40] Liu Yang, Qingyao Ai, Jiafeng Guo, and W Bruce Croft. anmm: Ranking short answer texts with attention-based neural matching model. In Proceedings of the 25th ACM international on conference on information and knowledge management, pages 287–296. ACM, 2016.
  • [41] Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. Interactive attention networks for aspect-level sentiment classification. arXiv preprint arXiv:1709.00893, 2017.
  • [42] Yequan Wang, Minlie Huang, and Li Zhao. Attention-based lstm for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 606–615, 2016.
  • [43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [44] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
  • [45] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
  • [46] Dipanjan Das, Desai Chen, André FT Martins, Nathan Schneider, and Noah A Smith. Frame-semantic parsing. Computational linguistics, 40(1):9–56, 2014.
  • [47] Teresa Botschen, Iryna Gurevych, Jan-Christoph Klie, Hatem Mousselly-Sergieh, and Stefan Roth. Multimodal frame identification with multilingual evaluation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1481–1491, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
  • [48] Lei Shi and Rada Mihalcea. Putting pieces together: Combining framenet, verbnet and wordnet for robust semantic parsing. In International conference on intelligent text processing and computational linguistics, pages 100–111. Springer, 2005.
  • [49] Sara Tonelli and Claudio Giuliano. Wikipedia as frame information repository. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pages 276–285. Association for Computational Linguistics, 2009.
  • [50] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd annual meeting on association for computational linguistics, pages 363–370. Association for Computational Linguistics, 2005.