Anaphora Resolution in Dialogue Systems for South Asian Languages
Anaphora resolution is a challenging task that has interested NLP researchers for a long time. Traditional resolution techniques such as eliminative constraints and weighted preferences were successful in many languages. However, they are ineffective in free word order languages, which include most South Asian languages. Heuristic and rule-based techniques, which are constrained to a context and domain, were typical for these languages. In this paper, we propose a new strategy using neural networks for resolving anaphora in human-human dialogues. The architecture consists of three components: a shallow parser for extracting features, a feature vector generator which produces the word embeddings, and a neural network model which predicts the antecedent mention of an anaphor. The system has been trained and tested on a Telugu conversation corpus that we generated. By exploiting the semantic information in word embeddings and appending actor, gender, number, person and part-of-plural features, the model reaches an F1-score of 86.
Vinay Annam BML Munjal University annam.vinay.15csc @bml.edu.in Nikhil Koditala BML Munjal University nikhil.koditala.15cse @bml.edu.in Radhika Mamidi IIIT Hyderabad radhika.mamidi @iiit.ac.in
Throughout the information era, we have seen a shift in human-computer interaction, from clicks to chats. Conversational agents and dialogue systems are becoming prominent with the daily advances in the field of Artificial Intelligence. Technology is effective only if it can reach a wider population, which requires building computational models for widely spoken languages. According to Eberhard et al. (2019), Telugu, which belongs to the Dravidian family, is one of the actively growing languages and is ranked 16th among 7,111 living languages, with 93 million speakers worldwide. Despite such attention, Telugu has inadequate resources when compared to its counterparts. At the same time, with the advent of deep learning, many recent works are producing promising results for many languages.
In a discourse, an anaphor is a lexical device which acts as a substitute for an entity mentioned earlier. As example (1) shows, it is complicated to define a computable representation of the resolution process, because humans resolve anaphora subconsciously and are mostly unaware of the particulars.
Shyam: Will Ram come to our school tomorrow for the competition?
Prem: It is too far from his house.
Here the pronoun ‘his’ refers to Ram, ‘it’ refers to the school and ‘our’ refers to both Shyam and Prem.
Despite this intricacy, anaphora resolution systems are crucial in dialogue systems, machine translation, and information extraction. In this paper, we build a system that resolves anaphora in Telugu dialogues. In contrast to syntactic and rule-based systems, which are approximate solutions, our method appends a few handcrafted features to the word embeddings, focuses on semantic features, and works well on real conversations. We present a new strategy to resolve speaker-hearer mentions and plural mentions, which have not been tackled before. To the best of our knowledge, this is the first time deep learning has been successfully applied in Telugu dialogue NLP research.
2 Related Work
Hobbs (1978) was one of the pioneers of anaphora resolution, focusing on early syntactic heuristics. His algorithm takes the sentences up to the target pronoun as input and, as it traverses backward, finds the noun phrases with the same gender and number. Hobbs evaluated his algorithm manually and reported an accuracy of 88.3 percent. Hirst (1981) then directed the anaphora problem towards resolution in discourse. Lappin and Leass (1994) and Denber (1998) described several syntactic heuristics for reflexive, reciprocal and pleonastic anaphora. Grosz et al. (1995) claimed that at any given point there is a single entity being centered. Using this claim they proposed a centering algorithm which finds an entity which is divergent from other evoked entities. Mitkov et al. (1998) proposed a robust, knowledge-poor multilingual approach to resolving pronouns, in which each entity is given a score based on indicators and the entity with the highest score is considered the antecedent. Ng and Cardie (2002) suggested a machine learning approach to anaphora resolution. However, statistical learning methods suffer from the difficulties of small corpora and corpus-dependent learning.
Most of the work in Indian languages has been done in Hindi, Bengali, and Tamil. Dakwale et al. (2013) built a hybrid approach for anaphora resolution in Hindi using a dependency parser and a decision tree classifier. Jonnalagadda and Mamidi (2015) proposed a rule-based system for anaphora resolution in Telugu dialogue systems. After preprocessing the data using a morphological analyzer and a POS tagger, they used a set of hard-coded rules to deal with different types of pronouns.
Clark (2015) did pioneering work in coreference resolution using deep learning that automatically learns dense vector representations for mention pairs for English and Chinese. He built them using the word embeddings in the mention and its surrounding context, which preserves semantic similarity. While using only a few hand-engineered features, he trained an incremental coreference system that can utilize entity-level information. His mention pair model inspired our feature representations, and we adapted it for free word order languages. In free word order languages, changing the order of words in a sentence does not change the overall meaning of the sentence. As shown in example (2), Telugu is a free word order language. Later, Clark and Manning (2016) used reinforcement learning to optimize a neural mention ranking model for coreference resolution.
Ram gave Nikhil a book.
S1: rAmu nikhilki pustakam icchADu
(Ram Nikhil book gave)
S2: rAmu pustakam nikhilki icchADu.
(Ram book Nikhil gave)
Here the order of the words doesn’t affect the meaning of the Telugu sentence.
3 Anaphora Resolution in Telugu Language
In Telugu, verbs are formed by adding the grammatical information as suffixes. Along with gender, number and person (GNP), the verb also agrees with tense, aspect, and modality (TAM), which makes the complete structure of the verb: verb root + TAM suffix + GNP suffix. The pronoun should agree with all the components in order to refer to an entity in previous utterances. There are three genders (male, female, nonhuman), three persons (first, second and third) and two numbers (singular and plural) in Telugu. Example (3) shows the variations produced by changing the GNP variables for a common root word ‘icchaa’ (gave). Subject-verb agreement becomes more complex because of the honorific, proximity and formality features attached to the subject in Telugu culture (Subbarao and Murthy, 2000).
For the verb ‘gave’, when the subject is:
Male 1st singular: icchaanu
Male 2nd singular: icchaavu
Male 3rd singular: icchaaDu
Female 3rd singular: icchindi
Any 3rd plural: iccharu
3.1 Types of Anaphora
When two or more expressions refer to the same person or thing, this is known as coreference (Brown and Yule, 1983; Jurafsky and Martin, 2000). Coreference is of two types: exophoric and endophoric. In exophoric coreference, words or entities refer to something outside the text or discourse, whereas in endophoric coreference, entities refer to words which are present in the text. Endophoric coreference is further divided into two types: anaphora and cataphora.
In anaphoric reference, words refer to entities which are mentioned earlier in the discourse, whereas in cataphoric reference words refer to entities which are mentioned later in the discourse. Anaphoric references are of different types, such as repeated, pronominal, lexical and one-anaphora.
3.2 Types of Pronouns in Telugu
There is a wide variety of pronouns in Telugu. These pronouns differ in their usage based on gender, number, person or other semantic variables. Listed below are a few commonly used types of pronouns in Telugu:
Personal Pronouns: Telugu pronouns that are used as substitutes for known noun phrases. Ex: nEnu (I), manamu (we), nIvu (you), vAru (they).
Interrogative Pronouns: Telugu pronouns that indicate questions. Ex: EmI (what), Edi (which), EvaDu (who).
Possessive Pronouns: Telugu pronouns that indicate ownership. Ex: nA (my), atani (his), Amedi (hers).
Adverbial Pronouns: Telugu adverbs that are formed by combining a pronoun with a preposition. Ex: imducEta (whereby), anduvaLa (whereby), imdulO (wherein).
Reflexive Pronouns: These pronouns are used when the subject and object of a sentence are the same. Ex: tAnu (oneself), tAmu (themselves).
Demonstrative Pronouns: Pronouns that point to specific things. Ex: I (this/these), A (that/those).
Reciprocal Pronouns: Reciprocal pronouns are used to indicate that two or more parties perform an action on one another. Ex: Okarikokaru (each other).
According to Clark (2015), the primary purpose of a neural mention pair model is to perform a binary classification, predicting whether two mention vectors are coreferent or not. The vectors should capture the linguistic phenomena that appear in the nominal and pronominal mentions in the dialogues. We call these linguistic devices features. Since Telugu is a verb-final language and its verbs are more strongly inflected than in English, the noun and verb mentions agree more on gender, number, and person. Therefore, in contrast to the 17 features applied by Clark (2015), we suggest only 6 features:
Word embeddings (100 Dim)
Gender, Number, Person (10 Dim)
Part-of-Plural (1 Dim)
Speaker-Hearer (2 Dim)
This section introduces our framework to build the feature vectors and the deep learning model which associates anaphora with its antecedent. Our methodology can be mainly classified into three stages:
Parsing the dialogues
Feature vector generation
Neural network model
4.1 Parsing the dialogues
As Telugu is an agglutinative language, to get the mentions from the utterances we need a tokenizer and a sandhi splitter, which breaks complex terms into individual stems or root words. We then use a part-of-speech tagger to detect the mentions, followed by morphological analysis of each word to extract the gender, number and person features from the Telugu dialogues. We used an online shallow parser built by the LTRC at IIIT Hyderabad. This shallow parser takes a sentence as input in UTF-8 or WX format and produces output in Shakti Standard Format (SSF), described by Bharati et al. (2014). SSF acts as a common data format for all the Indian languages. See example (4) for output in SSF.
unnADu VM fs af=’unDu,v,m,sg,3,,A,A’ name=””
In the above example, we are able to capture the part of speech of the given word, which is ‘VM’, the gender, which is ‘m’ (male), the number, which is ‘sg’ (singular), and the person, which is 3 (third person). Gender, number and person are each encoded as one of three types. To encode these into the vector we need to one-hot encode them, so the GNP vector is a vector of 10 dimensions. In this way, we extract three important features of our model, i.e., gender, number and person. The shallow parser also helps us identify the nouns, pronouns and verb phrases in the dialogues, which are potential mentions of real entities.
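The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not the LTRC parser's own code: the function names are hypothetical, the token line follows the layout shown in example (4), and the category inventories (including an underspecified value `any` that brings the total to 10 slots) are an assumption, since the text does not spell out the exact one-hot layout.

```python
import re

# Assumed category inventories (not given in the text). "any" is treated as an
# extra underspecified value so that the three blocks add up to 10 slots.
GENDERS = ["m", "f", "n", "any"]
NUMBERS = ["sg", "pl", "any"]
PERSONS = ["1", "2", "3"]

# Accept straight or typographic quotes around the af value.
AF_PATTERN = re.compile(r"af=['’‘]([^'’‘]*)['’‘]")


def parse_ssf_token(line):
    """Extract the word, POS tag and GNP values from one SSF token line."""
    parts = line.split()
    word, pos = parts[0], parts[1]
    match = AF_PATTERN.search(line)
    if match is None:
        raise ValueError("no af= feature structure found")
    fields = match.group(1).split(",")
    # Positions used here: root, category, gender, number, person.
    return {
        "word": word,
        "pos": pos,
        "root": fields[0],
        "gender": fields[2],
        "number": fields[3],
        "person": fields[4],
    }


def one_hot(value, inventory):
    """Return a one-hot list; a value outside the inventory yields all zeros."""
    return [1 if value == item else 0 for item in inventory]


def gnp_vector(token):
    """Concatenate the gender, number and person one-hot blocks (10 values)."""
    return (
        one_hot(token["gender"], GENDERS)
        + one_hot(token["number"], NUMBERS)
        + one_hot(token["person"], PERSONS)
    )


token = parse_ssf_token("unnADu VM fs af='unDu,v,m,sg,3,,A,A' name=\"\"")
print(token["pos"], gnp_vector(token))
```

For the token in example (4), the sketch yields a vector with the `m`, `sg` and `3` slots set and every other slot zero.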
4.2 Feature Vector Generation
To generate the word embeddings for Telugu, we scraped Telugu pages from Wikipedia and the Andhrajyothi newspaper. From the Andhrajyothi website we scraped all the Telugu articles published between 2015 and 2017, which amounted to around 133,148 articles. Using Gensim, a word representation tool, we trained our own word2vec model on the scraped data. After training, we obtained 23,000 unique words (types) in our vocabulary. Each vector has 100 dimensions. Since the data collected from these sources is vast and a mixture of several domains, the vectors have a rich semantic description. Since the conversations involve plenty of 1st and 2nd person mentions, we suggest an experimental feature called Speaker-Hearer. It discriminates between the two actors by assigning a distinct value to each. Plural mention discontinuity is a well-known problem in coreference resolution systems, but no work has tackled it. Here we introduce a feature called Part-of-Plural that allows the model to treat plural definite noun mentions as single mentions. For each mention in the dialogue, we generate the feature vector by appending all the features, making it a 113-dimensional vector.
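As a rough sketch of how the four feature groups add up to 113 dimensions (100 + 10 + 1 + 2), the concatenation could look like the following. The function name, the ordering of the groups and the `[1, 0]` / `[0, 1]` Speaker-Hearer encoding are assumptions made for illustration; the embedding here is a placeholder rather than a trained word2vec vector.

```python
EMBEDDING_DIM = 100  # word2vec vector size stated in the text
GNP_DIM = 10         # one-hot gender, number and person block
SPEAKER_HEARER_DIM = 2


def mention_vector(embedding, gnp, part_of_plural, speaker_hearer):
    """Concatenate the feature groups of one mention into a 113-value list."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError("expected a 100-dimensional embedding")
    if len(gnp) != GNP_DIM:
        raise ValueError("expected a 10-dimensional GNP vector")
    if len(speaker_hearer) != SPEAKER_HEARER_DIM:
        raise ValueError("expected a 2-dimensional Speaker-Hearer vector")
    return list(embedding) + list(gnp) + [part_of_plural] + list(speaker_hearer)


# Toy values: a placeholder embedding, a male/singular/third-person GNP block,
# not part of a plural mention, uttered by the speaker (assumed encoding).
embedding = [0.0] * EMBEDDING_DIM
gnp = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
vector = mention_vector(embedding, gnp, 0, [1, 0])
print(len(vector))
```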
4.3 Neural Network Model
The model we built is a multilayer perceptron for binary classification that classifies a mention pair as a true or false antecedent-anaphora pair. The input is a feature vector created by appending the vectors of two mentions, making it a 226-dimensional mention pair vector. Given the small dimension of the input, no expensive computation is involved, so we use a dense neural network.
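The construction of the 226-dimensional input can be sketched as below. The pairing scheme, in which every mention is paired with each mention that precedes it in the dialogue, is a common choice for mention pair models and is an assumption here, as are the function names.

```python
MENTION_DIM = 113


def pair_vector(antecedent_vec, anaphor_vec):
    """Append two 113-value mention vectors into one 226-value pair vector."""
    if len(antecedent_vec) != MENTION_DIM or len(anaphor_vec) != MENTION_DIM:
        raise ValueError("both mention vectors must have 113 values")
    return list(antecedent_vec) + list(anaphor_vec)


def candidate_pairs(mention_vecs):
    """Pair each mention with every earlier mention (assumed pairing scheme).

    Returns tuples (antecedent_index, anaphor_index, pair_vector).
    """
    pairs = []
    for j in range(1, len(mention_vecs)):
        for i in range(j):
            pairs.append((i, j, pair_vector(mention_vecs[i], mention_vecs[j])))
    return pairs


# Four placeholder mentions give 4 * 3 / 2 = 6 candidate pairs.
mentions = [[float(index)] * MENTION_DIM for index in range(4)]
pairs = candidate_pairs(mentions)
print(len(pairs), len(pairs[0][2]))
```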
Let $m_i$ be the feature vector of the $i$-th mention and $p_{ij} = [m_i; m_j]$ be the mention pair vector that represents the antecedent-anaphora pair. We feed this vector into a fully connected dense neural network with two hidden layers.
The output layer consists of a single value $\hat{y}_{ij}$, which denotes the probability that the pair is a true antecedent-anaphora pair. We calculate the loss using the binary cross-entropy function:
$$L = -\sum_{m_i, m_j \in M} \left[ y_{ij} \log \hat{y}_{ij} + (1 - y_{ij}) \log (1 - \hat{y}_{ij}) \right]$$
Here $M$ is the set of all annotated mentions in the data set and $y_{ij}$ represents the actual label of the mention pair $(m_i, m_j)$: $y_{ij} = 0$ represents a false pair and $y_{ij} = 1$ represents a true pair. See figure 1 for the complete model.
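For concreteness, binary cross-entropy over a batch of mention pairs can be computed as in the sketch below. It averages over the pairs, which differs from a plain sum only by a constant factor, and clips the predictions to avoid taking the logarithm of zero; both choices and the function name are assumptions for illustration.

```python
import math


def binary_cross_entropy(labels, predictions, eps=1e-12):
    """Mean binary cross-entropy; labels are 0 or 1, predictions lie in (0, 1)."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and aligned")
    total = 0.0
    for label, prob in zip(labels, predictions):
        prob = min(max(prob, eps), 1.0 - eps)  # guard against log(0)
        total += label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return -total / len(labels)


# A confident, correct prediction costs little; a confident wrong one costs a lot.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```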
4.3.2 Hyper Parameters
After each hidden layer, a dropout layer is added for regularization. Regularization helps prevent over-fitting of the model. Training is optimized using the Adam optimizer (Kingma and Ba, 2014), a momentum-based gradient descent optimization technique. We train on mini-batches of mention pairs in each training epoch. We use Rectified Linear Unit activation functions in both hidden layers and a sigmoid in the last layer.
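The forward pass of such a network can be sketched in plain Python as follows. The hidden layer sizes (64 and 32), the dropout rate and the weight initialization are hypothetical placeholders chosen only to make the example run; training with Adam is not shown.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def dropout(values, rate, rng):
    """Inverted dropout: zero units with probability rate, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(pair_vec, layers, rng=None, rate=0.0):
    """Return the probability of a true pair; pass rng to enable dropout."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = relu(dense(pair_vec, w1, b1))
    if rng is not None:
        h1 = dropout(h1, rate, rng)
    h2 = relu(dense(h1, w2, b2))
    if rng is not None:
        h2 = dropout(h2, rate, rng)
    return sigmoid(dense(h2, w3, b3)[0])


rng = random.Random(0)
HIDDEN1, HIDDEN2 = 64, 32  # hypothetical sizes, not taken from the paper
layers = [init_layer(226, HIDDEN1, rng),
          init_layer(HIDDEN1, HIDDEN2, rng),
          init_layer(HIDDEN2, 1, rng)]
probability = forward([0.1] * 226, layers)
print(0.0 < probability < 1.0)
```

Dropout is applied only when a random generator is passed in, which mirrors the usual practice of disabling it at prediction time.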
5 Corpus and Annotation
Telugu is a language with limited digital resources. Most of the research for Telugu has been done in sentiment analysis, POS tagging, NER, and text summarization. No annotated dialogue dataset for Telugu is publicly available. We therefore built a corpus of 157 conversations, ranging from simple to complex dialogues of the kind we hear in daily life. We collected the corpus so that it covers all the possible pronoun types and so that mentions are balanced in gender, number, and person. About 50% of the conversations are hand-engineered, and the remaining 50% come from translation from English and online scraping. To translate conversations from English to Telugu we use the Google Translate API, after which a reviewer evaluates the correctness of the translation. These conversations are then parsed using the shallow parser discussed in section 4.1. The total number of mentions in the corpus is 775.
After the corpus is ready, the conversations are annotated using a web application we built specifically for annotating the mentions. The annotation tool allows the annotator to pair an antecedent mention with an anaphora mention in the conversation. If both mentions refer to a single real entity, the pair is labeled true; otherwise it is labeled false. There are 642 true mention pairs and 1818 false mention pairs. The total number of mention pairs in the corpus after oversampling is 3636. Note that the LTRC shallow parser for Telugu is far from human-level performance. So, to improve training, the semantic features are corrected and manually tagged with the help of the annotation tool. Each conversation is annotated by two reviewers, and in case of any conflict the conversation is sent to a third reviewer.
Consider a given context with $n$ mentions, of which $k$ mentions refer to the same entity, where $k \leq n$. Then there are $\binom{k}{2} = k(k-1)/2$ pairs which are true coreference mention pairs, and the $k(n-k)$ pairs that combine one of these $k$ mentions with one of the remaining $n-k$ mentions are false coreference mention pairs. The graph constructed from these two equations for a given $n$ and $k$ shows that the false pairs are more likely to dominate the true pairs. In figure 2, we can interpret from the region bounded by the two curves that the true and false mention pairs are unbalanced. This leads to bias while training the model on this corpus.
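A small calculation makes the imbalance concrete. The sketch below assumes a single entity that is mentioned k times among n mentions: it counts the pairs inside that group as true, and the pairs that join a group mention with an outside mention as false. This counting scheme and the function name are assumptions for illustration and may not match the exact formulas behind figure 2.

```python
from math import comb


def pair_counts(n, k):
    """Return (true_pairs, false_pairs) for one entity with k of n mentions."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    true_pairs = comb(k, 2)    # pairs within the coreferent group
    false_pairs = k * (n - k)  # group mention paired with an outside mention
    return true_pairs, false_pairs


# With n = 10 mentions, false pairs outnumber true pairs until k gets large.
for k in range(2, 11, 2):
    print(k, pair_counts(10, k))
```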
To fix this we used sampling strategies. There are two strategies for balancing the data. In undersampling, we randomly reduce the number of false pair instances. In oversampling, we inflate the number of true pair instances by generating synthetic samples using a distance-based technique called SMOTE (Chawla et al., 2002). For testing, a separate set of dialogues is used. See table 1 for a comparison of the model under both strategies.
To check the performance of the model with the features included as part of the embedding, we compared it to a baseline model. A baseline model is a naive model that is assumed to be the least intelligent reasonable system. Here we obtained the baseline by training the neural network only on the 100-dimensional word embeddings. To understand the significance of each feature, we trained the model considering one feature at a time. See table 2 for the comparison based on features.
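Both strategies can be sketched with the standard library alone. The oversampling function below is a simplified SMOTE-style interpolation (pick a minority sample, pick one of its nearest minority neighbours, and create a point on the segment between them); it is not the reference implementation of Chawla et al. (2002), and the function names, neighbour count and toy data are assumptions.

```python
import random


def undersample(majority, target_size, rng):
    """Randomly keep target_size majority (false pair) samples."""
    return rng.sample(majority, target_size)


def smote_like_oversample(minority, n_new, k, rng):
    """Create n_new synthetic minority samples by neighbour interpolation.

    Needs at least two minority samples.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            others,
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


rng = random.Random(0)
true_pairs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # toy minority
false_pairs = [[5.0 + i, 5.0 - i] for i in range(10)]                  # toy majority

# Undersampling: shrink the majority to the minority size.
reduced_false = undersample(false_pairs, len(true_pairs), rng)

# Oversampling: grow the minority to the majority size.
extra = smote_like_oversample(true_pairs, len(false_pairs) - len(true_pairs), 2, rng)
balanced_true = true_pairs + extra
print(len(reduced_false), len(balanced_true))
```

In the corpus described above, oversampling along these lines would raise the 642 true pairs to match the 1818 false pairs, which is consistent with the reported total of 3636 pairs.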
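Since the models are compared by F1-score, a short reminder of how it is derived from pair-level predictions may help. The counts below are made-up toy numbers, not results from the paper, and the function name is hypothetical.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from pair-level counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy counts: 8 true pairs found, 2 spurious pairs, 2 true pairs missed.
precision, recall, f1 = precision_recall_f1(8, 2, 2)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```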
7.1 Reported Speech
The feature representation we chose cannot deal with reported speech. See example (5).
Speaker: Ram said, ‘I am the king of the world’.
Here the pronoun ‘I’ refers to Ram. But our feature representation will resolve it to the speaker because it is 1st person.
When using the system in real conversations, the parser may not give correct GNP tags. This affects the predictions. Also, the morph analyzer sometimes over-tokenizes, which leads to unresolved mentions.
Sometimes the pronoun is part of a compound word, which is difficult to split with any computational sandhi splitter for Telugu.
Only he came.
atanu + okkaDu + vacchADu
he + alone + came
Here ‘he’ is part of the compound word, which cannot be split and resolved.
8 Conclusion and Future work
To the best of our knowledge, this model is the best-performing anaphora resolution system for Telugu dialogues. It can be used to build more natural conversational agents in Telugu. Since the languages of the Dravidian family share most of their linguistic properties, this work can be extended to other South Indian languages. The feature vectors can be constructed for any language. Our system has surpassed the recent state of the art in Telugu anaphora resolution (Jonnalagadda and Mamidi, 2015), which reported an accuracy of 61.1%. With more data and the discovery of more useful features, we can further improve this system.
- Bharati et al. (2014) Akshar Bharati, Rajeev Sangal, Dipti Misra Sharma, and Anil Kumar Singh. 2014. SSF: A common representation scheme for language analysis for language technology infrastructure development. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT. Association for Computational Linguistics and Dublin City University.
- Brown and Yule (1983) Gillian Brown and George Yule. 1983. Discourse Analysis, 6th edition. Cambridge: Cambridge University Press.
- Chawla et al. (2002) N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.
- Clark (2015) Kevin Clark. 2015. Neural coreference resolution.
- Clark and Manning (2016) Kevin Clark and Christopher D. Manning. 2016. Deep reinforcement learning for mention-ranking coreference models. Association for Computational Linguistics, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing:2256–2262.
- Dakwale et al. (2013) Praveen Dakwale, Vandan Mujadia, and Dipti M. Sharma. 2013. A hybrid approach for anaphora resolution in Hindi. Association for Computational Linguistics, Proceedings of the Sixth International Joint Conference on Natural Language Processing:977–981.
- Denber (1998) Michel Denber. 1998. Automatic Resolution of Anaphora in English. Eastman Kodak Co.
- Eberhard et al. (2019) Eberhard, David M., Gary F. Simons, and Charles D. Fennig. 2019.
- Grosz et al. (1995) Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. 1995. Centering: a framework for modeling the local coherence of discourse. In Association for Computational Linguistics, pages 203–225, Division of Applied Sciences, Harvard University,Cambridge.
- Hirst (1981) Graeme Hirst. 1981. Discourse-oriented anaphora resolution in natural language understanding: A review. American Journal of Computational Linguistics, 7(2):85–98.
- Hobbs (1978) Jerry R. Hobbs. 1978. Resolving pronoun references. In Lingua, pages 311–338, Dept. of Computer Sciences, City College, CUNY, New York U.S.A.
- Jonnalagadda and Mamidi (2015) Hemanth Reddy Jonnalagadda and Radhika Mamidi. 2015. Resolution of pronominal anaphora for Telugu dialogues. Association for Computational Linguistics, Proceedings of the 12th International Conference on Natural Language Processing:183–188.
- Jurafsky and Martin (2000) Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 1st edition. Prentice Hall PTR, Upper Saddle River, NJ, USA.
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.
- Lappin and Leass (1994) Shalom Lappin and Herbert J. Leass. 1994. An algorithm for pronominal anaphora resolution. Association for Computational Linguistics, pages 535–561.
- Mitkov et al. (1998) Ruslan Mitkov, Lamia Belguith, and Malgorzata Stys. 1998. Multilingual robust anaphora resolution. Association for Computational Linguistics.
- Ng and Cardie (2002) Vincent Ng and Claire Cardie. 2002. Improving machine learning approaches to coreference resolution. Department of Computer Science, Cornell University, NY.
- Subbarao and Murthy (2000) V. Subbarao and B. Lalitha Murthy. 2000. Lexical anaphors and pronouns in Telugu. In Barbara C. Lust, Kashi Wali, James W. Gair, and K. V. Subbarao, editors, Lexical Anaphors and Pronouns in Selected South Asian Languages. De Gruyter Mouton.