Falcon 2.0: An Entity and Relation Linking Tool over Wikidata
Natural Language Processing (NLP) tools and frameworks have contributed significantly to solving the problems of extracting entities and relations and linking them to the related knowledge graphs. Albeit effective, the majority of existing tools are available for only one knowledge graph. In this paper, we present Falcon 2.0, a rule-based tool capable of accurately mapping entities and relations in short texts to resources in both DBpedia and Wikidata, following the same approach in both cases. The input of Falcon 2.0 is a short natural language text in English. Falcon 2.0 resorts to fundamental principles of English morphology (e.g., N-Gram tiling and N-Gram splitting) and to background knowledge of label alignments obtained from the studied knowledge graph; the entity and relation resources it returns as output belong to either the DBpedia or the Wikidata knowledge graph. We have empirically studied the impact of using only Wikidata on Falcon 2.0 and observed that it is knowledge graph-agnostic, i.e., the performance and behavior of Falcon 2.0 are not affected by the knowledge graph used as background knowledge. Falcon 2.0 is public and can be reused by the community. Additionally, Falcon 2.0 and its background knowledge bases are available as resources at https://labs.tib.eu/falcon/falcon2/.
Keywords: NLP, Entity Linking, Relation Linking, Background Knowledge, English morphology, DBpedia, Wikidata
Resource Type: APIs and software frameworks
Web API https://labs.tib.eu/falcon/falcon2/
License: GNU General Public License v3.0
Entity linking (EL), also referred to as Named Entity Disambiguation (NED), is a well-studied research domain for aligning unstructured text to its structured mentions in various knowledge repositories (e.g., Wikipedia, DBpedia [DBLP:conf/semweb/AuerBKLCI07], Freebase [DBLP:conf/aaai/BollackerCT07] or Wikidata [DBLP:conf/www/Vrandecic12]). Entity linking comprises two sub-tasks. The first is named entity recognition (NER), in which an approach aims at identifying entity labels (or surface forms) in an input sentence. The second is entity disambiguation, where the goal is to link entity surface forms to semi-structured knowledge repositories. With the growing popularity of publicly available knowledge graphs, researchers have developed several approaches and tools for entity linking over knowledge graphs. Some of these approaches perform the NER task implicitly and directly link the entity surface forms in a sentence to their mentions in the knowledge graph (often referred to as end-to-end EL approaches). Other attempts (e.g., MAG [moussallem2017mag]) assume that the recognized surface forms of the entities are given as additional inputs, besides the input sentence, to perform entity linking. Irrespective of the input format and underlying technologies, the majority of the existing attempts in the EL research domain are confined to well-structured knowledge graphs such as DBpedia, Freebase, and Yago. These knowledge graphs rely on a well-defined process of extracting information directly from the Wikipedia infoboxes and do not give users direct access to add or delete entities. Wikidata, on the other hand, allows users to edit Wikidata pages directly, add new entities, and define new relations between objects. The popularity of Wikidata can be measured by the fact that, since its launch in 2012, users across the world have made over 1 billion edits (https://www.wikidata.org/wiki/Wikidata:Statistics).
Nevertheless, although EL has been extensively studied, linking to Wikidata remains challenging.
Motivation, Approach, and Contributions.
We motivate our work by the fact that, in spite of the vast popularity of Wikidata, there have been only limited attempts to target entity linking over Wikidata. In this paper, we present Falcon 2.0, a tool for joint entity and relation linking over Wikidata that provides Wikidata mentions of the entity and relation surface forms in a short sentence. In our previous work, we proposed Falcon [sakor2019old], a rule-based approach for entity and relation linking of short text over DBpedia. Falcon has two novel concepts: 1) a linguistics-based approach that relies on several English morphology principles such as tokenization and N-gram tiling; 2) a knowledge graph that serves as a source of background knowledge. This knowledge graph is a collection of entities from DBpedia enriched with the Wikidata labels. We resort to the Falcon approach for developing Falcon 2.0. Hence, we do not claim novelty in the underlying linguistics-based approach of Falcon 2.0. However, for Falcon 2.0, we extend the background knowledge graph of Falcon and enrich it with the Wikidata entities and their associated alias labels. In this paper, we propose the following two reusable, open source, and easily accessible resources:
Falcon 2.0: We propose Falcon 2.0, a tool for joint entity and relation linking over Wikidata. Falcon 2.0 relies on fundamental principles of English morphology (tokenization and compounding) and links the entity and relation surface forms in a short sentence to their Wikidata mentions. Falcon 2.0 is available as an online API and can be accessed at https://labs.tib.eu/falcon/falcon2/. We empirically evaluated Falcon 2.0 on question answering datasets tailored for Wikidata, and Falcon 2.0 significantly outperforms the baseline. For ease of use, we integrate the Falcon API (https://labs.tib.eu/falcon/) into Falcon 2.0, so users can also get the corresponding DBpedia URIs of the entities and predicates present in an input short text.
Falcon 2.0 Background Knowledge Base: We replaced the background knowledge base of Falcon with a new background KG specially tailored for Wikidata. We extracted 48,042,867 Wikidata entities from its public dump and aligned these entities with the aliases present in Wikidata. For example, Barack Obama is the Wikidata entity Wiki:Q76 (https://www.wikidata.org/wiki/Q76). We created a mapping between the label of Wiki:Q76 (Barack Obama) and its aliases, such as President Obama, Barack Hussein Obama, and Barry Obama, and stored it in the background knowledge graph. We did a similar alignment for 15,645 properties/relations of Wikidata. The background knowledge graph is an indexed graph and can be easily queried using ElasticSearch.
The rest of this paper is organized as follows: Section 2 describes our two resources and the approach used to build Falcon 2.0. Section 3 presents experiments to evaluate the performance of Falcon 2.0. Section 4 presents the importance and impact of this work for the research community. The availability and sustainability of the resources are explained in Section 5, and the related maintenance discussion is presented in Section 6. Section 7 reviews the state of the art, and we close with the conclusion and future work in Section 8.
2 Falcon 2.0
In this section, we present Falcon 2.0. We first explain the architecture of Falcon 2.0. Next, we discuss the background knowledge used to match the surface forms in the text to resources in a specific knowledge graph.
The Falcon 2.0 architecture is depicted in Figure 1. Falcon 2.0 receives short input texts and outputs a set of entities and relations extracted from the text; each entity and relation in the output is associated with a unique IRI in Wikidata. Falcon 2.0 resorts to background knowledge and a catalog of rules for performing entity and relation linking. The background knowledge combines Wikidata labels and their corresponding aliases. Additionally, it comprises alignments between nouns and entities in the Wikidata knowledge graph. Alignments are stored in a text search engine, e.g., ElasticSearch, while the knowledge source is maintained in an RDF triple store accessible via a SPARQL endpoint. The rules that represent the English morphology are maintained in a catalog; a forward chaining inference process is performed on top of the catalog during the tasks of extraction and linking. Falcon 2.0 also comprises several modules that identify and link entities and relations to the Wikidata knowledge graph. These modules implement POS Tagging, Tokenization & Compounding, N-Gram Tiling, Candidate List Generation, Matching & Ranking, Query Classifier, and N-Gram Splitting, and are reused from the implementation of Falcon.
2.2 Background Knowledge
Wikidata contains over 52 million entities and 3.9 billion facts (consisting of subject-predicate-object triples). A significant portion of this extensive information is not useful for entity and relation linking. Therefore, we sliced Wikidata and extracted all the entity and relation labels to create a local background knowledge graph. For example, the entity United States of America (https://www.wikidata.org/wiki/Q30) in Wikidata has the natural language label ‘United States of America’ and several other aliases (or known_as labels) such as the United States of America, America, U.S.A., the US, United States, and others. We extended our background knowledge graph with this information from Wikidata. Similarly, for relation labels, the background knowledge graph is enriched with known_as labels to provide synonyms and derived word forms. For example, the relation spouse (https://www.wikidata.org/wiki/Property:P26) in Wikidata has the label spouse, and its known_as labels include husband, wife, married to, wedded to, partner, and others. This variety of synonyms for each relation enables Falcon 2.0 to match a surface form in the text to a relation in the knowledge graph. This is possible even when the surface form (married to) differs considerably from the representative label (spouse) under a string similarity measure such as the Levenshtein algorithm (Rematch [DBLP:conf/i-semantics/MulangSO17] used the Levenshtein algorithm for the task of relation linking). Figure 2 illustrates how the background knowledge is built.
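The alignment between labels, aliases, and Wikidata identifiers can be pictured as a lookup from every surface form to the resources that carry it. The following is a minimal sketch in Python, assuming an in-memory dictionary in place of the ElasticSearch index; the record layout and the names `normalize`, `build_alignment_index`, and `lookup` are hypothetical and are not taken from the Falcon 2.0 code base.

```python
from collections import defaultdict

# Hypothetical records: (Wikidata ID, main label, aliases). The IDs and labels
# are the ones quoted in the text; the record layout is an assumption.
RECORDS = [
    ("Q30", "United States of America", ["America", "U.S.A.", "the US", "United States"]),
    ("P26", "spouse", ["husband", "wife", "married to", "wedded to", "partner"]),
]


def normalize(text):
    """Lowercase and collapse whitespace so lookups are case-insensitive."""
    return " ".join(text.lower().split())


def build_alignment_index(records):
    """Map every normalized label or alias to the set of IDs that carry it."""
    index = defaultdict(set)
    for resource_id, label, aliases in records:
        for surface in [label, *aliases]:
            index[normalize(surface)].add(resource_id)
    return index


def lookup(index, surface_form):
    """Return the sorted IDs aligned with a surface form (exact match only)."""
    return sorted(index.get(normalize(surface_form), set()))


index = build_alignment_index(RECORDS)
print(lookup(index, "married to"))  # ['P26']
print(lookup(index, "the us"))      # ['Q30']
```

Because "married to" is stored as an alias of P26, it resolves without any string similarity to the label "spouse". A text search engine would additionally return near matches, which this exact-match sketch does not attempt.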
2.3 Catalog of Rules
Falcon 2.0 is a rule-based approach. A catalog of rules is predefined to extract entities and relations from the text. The rules are based on English morphological principles. For example, Falcon 2.0 excludes all verbs from the entity candidate list based on the rule "verbs are not entities". Similarly, the N-Gram tiling module in the Falcon 2.0 architecture resorts to the rule "entities with only stopwords between them are one entity". Another example is the rule "When -> date, Where -> place", which resolves the ambiguity of matching the correct relation, when the short text is a question, by looking at the question headword. Some question words determine the range of the relation, which resolves the ambiguity. For example, given the two questions "When did Princess Diana die?" and "Where did Princess Diana die?", the relation expressed by "die" can be the death place or the death year. The question headword (When/Where) is the only clue for resolving the ambiguity here. When the question word is "where", Falcon 2.0 matches only relations that have a place as their range.
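As an illustration of the question headword rule, the sketch below filters candidate relations by the range that the headword implies. It is a simplified reading of the rule, not the catalog used by Falcon 2.0: the table `HEADWORD_RANGE`, the `range` field attached to each candidate, and the function `filter_by_headword` are hypothetical. P20 (place of death) and P570 (date of death) are Wikidata properties.

```python
# Hypothetical rule table: question headword -> expected range of the relation.
HEADWORD_RANGE = {"when": "date", "where": "place"}

# Candidate relations for the surface form "die". P20 and P570 are Wikidata
# properties; the "range" field is a simplified assumption for this sketch.
CANDIDATES = [
    {"id": "P20", "label": "place of death", "range": "place"},
    {"id": "P570", "label": "date of death", "range": "date"},
]


def filter_by_headword(question, candidates):
    """Keep only the candidates whose range matches the question headword."""
    words = question.strip().split()
    headword = words[0].lower() if words else ""
    expected = HEADWORD_RANGE.get(headword)
    if expected is None:
        return list(candidates)  # no rule applies, keep every candidate
    return [c for c in candidates if c["range"] == expected]


for question in ("When did Princess Diana die?", "Where did Princess Diana die?"):
    kept = filter_by_headword(question, CANDIDATES)
    print(question, [c["id"] for c in kept])
```

Questions whose headword has no entry in the table pass through unchanged, so the rule only narrows the candidate list when it has something to say.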
The extraction phase in Falcon 2.0 consists of three modules: POS tagging, tokenization & compounding, and N-Gram tiling. The input of this phase is the natural language text. The output of the phase is the list of surface forms that are related to entities or relations.
Part-of-Speech (POS) Tagging
receives the natural language text as an input. Then it tags each word in the text with its related tag, e.g., noun, verb, and adverb. This module differentiates between nouns and verbs with the aim of enabling the application of the morphological rules from the catalog. The output of the module is a list of the pairs of (word, tag).
Tokenization & Compounding
builds the token list by removing the stopwords from the input and splitting verbs from nouns. For example, if the input is What is the operating income for Qantas, the output of this module is a list of three tokens [operating, income, Qantas].
N-Gram Tiling
module combines tokens which have only stopwords between them, relying on one of the rules from the catalog of rules. For example, if we consider the output of the previous module as the input for the N-Gram tiling module, the tokens operating and income will be combined into one token. The output of the module is a list of two tokens [operating income, Qantas].
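The Qantas example can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the stopword list is a small hypothetical one, POS tagging and the splitting of verbs from nouns are left out, and tiling is reduced to merging tokens that are directly adjacent in the input, which is the narrowest reading that yields the output given in the text. The function names `tokenize` and `tile` are illustrative.

```python
import re

# Small hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"what", "is", "the", "for", "of", "a", "an", "in", "did", "was"}


def tokenize(text):
    """Return (position, word) pairs for the non-stopword words of the text."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return [(i, w) for i, w in enumerate(words) if w.lower() not in STOPWORDS]


def tile(tokens):
    """Merge tokens whose positions are consecutive into one surface form."""
    tiled = []
    previous_position = None
    for position, word in tokens:
        if previous_position is not None and position == previous_position + 1:
            tiled[-1] = tiled[-1] + " " + word  # extend the current surface form
        else:
            tiled.append(word)  # start a new surface form
        previous_position = position
    return tiled


tokens = tokenize("What is the operating income for Qantas")
print([word for _, word in tokens])  # ['operating', 'income', 'Qantas']
print(tile(tokens))                  # ['operating income', 'Qantas']
```

Keeping the word positions alongside the tokens is what lets the tiling step tell adjacent tokens (operating, income) apart from tokens separated by a removed word (income, Qantas).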
The linking phase consists of four modules: candidate list generation, matching & ranking, relevant rule selection, and n-gram splitting.
Candidate List Generation
receives the output of the recognition phase. The module queries the text search engine for each token, so that each token obtains an associated candidate list of resources. For example, the retrieved candidate list of the token operating income is [(P3362, operating income), (P2139, income), (P3362, operating profit)], where the first element of each pair is the Wikidata predicate identifier and the second is the associated label of the predicate that matched the query "operating income".
Matching & Ranking
ranks the candidate list received from the candidate list generation module and matches the candidate entities and relations. Since, in any knowledge graph, the facts are represented as triples, the matching and ranking module creates triples consisting of the entities and relations from the candidate lists. Then, for each pair of entity and relation, the module checks whether the triple exists in the RDF triple store (Wikidata). The check is done by executing a simple ASK query over the RDF triple store. For each existing triple, the module increases the rank of the involved relations and entities. The output of this module is a ranked and sorted list of candidates.
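A minimal sketch of this step in Python is given below. It is an illustration under assumptions, not the Falcon 2.0 implementation: the scoring scheme (one point per existing triple), the functions `ask_query` and `rank_candidates`, and the stand-in `fake_ask`, which answers from a small in-memory set instead of a SPARQL endpoint, are all hypothetical. The sketch also only tests the entity in subject position. The `wd:` and `wdt:` prefixes are the standard Wikidata namespaces for entities and direct properties.

```python
from itertools import product


def ask_query(entity_id, relation_id):
    """Build a SPARQL ASK query testing whether the entity has the relation."""
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/> "
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/> "
        f"ASK {{ wd:{entity_id} wdt:{relation_id} ?o }}"
    )


def rank_candidates(entities, relations, ask):
    """Add one point to every entity and relation that co-occur in a triple.

    `ask` is any callable that takes an ASK query string and returns a bool;
    in a deployment it would send the query to a SPARQL endpoint.
    """
    scores = {candidate: 0 for candidate in [*entities, *relations]}
    for entity_id, relation_id in product(entities, relations):
        if ask(ask_query(entity_id, relation_id)):
            scores[entity_id] += 1
            scores[relation_id] += 1
    # Stable sort: candidates with equal scores keep their original order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Stand-in for the triple store: (subject, predicate) pairs assumed to exist.
KNOWN_PAIRS = {("Q76", "P26")}


def fake_ask(query):
    """Answer an ASK query from KNOWN_PAIRS instead of a real endpoint."""
    return any(f"wd:{s} wdt:{p} " in query for s, p in KNOWN_PAIRS)


ranked = rank_candidates(["Q76"], ["P3362", "P26"], fake_ask)
print(ranked)  # [('Q76', 1), ('P26', 1), ('P3362', 0)]
```

Passing the ASK executor in as a callable keeps the ranking logic testable without network access; swapping `fake_ask` for a function that queries a SPARQL endpoint would not change `rank_candidates`.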
Relevant Rule Selection
interacts with the matching & ranking module by suggesting rank increases for some candidates, relying on the catalog of rules. One of the suggestions is to consider the question headword to resolve the ambiguity between two relations based on the range of the relations in the knowledge graph. For example, if the question word is "where", then the relation to be recognized should be linked to a property in the knowledge graph with the range "place".
N-Gram Splitting
is called if none of the triples tested in the matching & ranking module exists in the triple store, i.e., the compounding done in the tokenization & compounding module led to combining two separate entities. The module splits the tokens from the right side and passes the tokens again to the candidate list generation module. Splitting the tokens from the right side resorts to one of the fundamentals of English morphology: compound words in English always have their headword towards the right side [williams1981notions].
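One way to picture right-side splitting is to enumerate the possible cut points of a compound token, starting with the rightmost one, and to stop at the first split whose parts can both be matched. The sketch below follows that reading; the functions `split_from_right` and `resolve`, the stopping criterion, and the `has_candidates` callback (standing in for a query to the text search engine) are assumptions made for illustration.

```python
def split_from_right(token):
    """Yield (left, right) splits of a compound token, rightmost cut first."""
    words = token.split()
    for cut in range(len(words) - 1, 0, -1):
        yield " ".join(words[:cut]), " ".join(words[cut:])


def resolve(token, has_candidates):
    """Return the first split whose two parts both have candidates, else None."""
    for left, right in split_from_right(token):
        if has_candidates(left) and has_candidates(right):
            return [left, right]
    return None


# Hypothetical label set standing in for the text search engine.
KNOWN_LABELS = {"operating income", "qantas"}


def has_candidates(surface_form):
    """Report whether a surface form matches a known label (case-insensitive)."""
    return surface_form.lower() in KNOWN_LABELS


print(list(split_from_right("a b c")))
print(resolve("operating income Qantas", has_candidates))
```

For a three-word token the rightmost cut is tried first, so "operating income Qantas" is split into "operating income" and "Qantas" before any split that would isolate "operating" is considered.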
Text Search Engine
stores all the alignments of the labels. ElasticSearch [gormley2015elasticsearch] is used as the text search engine. It receives a token as input and returns all the related resources whose labels are similar to the received token.
RDF Triple store
can be seen as a local copy of the Wikidata endpoint. It consists of all the RDF triples of Wikidata that are labeled in the English language. The RDF triple store is used to check the existence of the triples passed from the Matching & Ranking module. The RDF triple store holds around 3.9 billion triples.
3 Experimental Study
We report on the following metrics: Precision, Recall, and F-measure. Precision is the fraction of relevant resources among the retrieved resources (Equation 1).
Recall is the fraction of relevant resources that have been retrieved over the total amount of relevant resources (Equation 2).
F-Measure or F-Score is a measure that combines Precision and Recall; it is the harmonic mean of precision and recall (Equation 3).
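The three metrics can be computed per question from the set of retrieved resources and the set of relevant (gold) resources. The sketch below follows the standard definitions; the function name `precision_recall_f1` and the convention of returning 0.0 when a denominator is empty are assumptions, and the text does not state how the per-question values are aggregated over a dataset.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute Precision, Recall, and F-measure for one question.

    retrieved: resources returned by the tool.
    relevant:  resources in the gold standard.
    Returns 0.0 for a metric whose denominator would be zero (assumption).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# One correct entity plus one spurious entity for a single-entity question.
p, r, f = precision_recall_f1({"Q76", "Q30"}, {"Q76"})
print(p, r, round(f, 3))  # 0.5 1.0 0.667
```

Because the F-measure is a harmonic mean, it stays close to the lower of the two values, so a tool cannot score well by returning many candidates to inflate Recall.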
We relied on two different question answering datasets, namely the SimpleQuestion dataset for Wikidata [diefenbach2017question] and LC-QuAD 2.0 [dubey2019lc]. The SimpleQuestion dataset contains 6,505 test questions which are answerable using Wikidata as the underlying knowledge graph. We randomly selected 1,000 questions from LC-QuAD 2.0 to test the robustness of our tool on complex questions.
We chose OpenTapioca [delpeuch2019opentapioca] as our baseline for entity linking. OpenTapioca is available as a web API; it provides Wikidata URIs for entities. We are not aware of any other tool or approach that provides end-to-end Wikidata entity linking.
A laptop with eight cores and 16GB RAM running Ubuntu 18.04 was used for implementing Falcon 2.0. We deployed its web API on a server with 723GB RAM and 96 cores (Intel(R) Xeon(R) Platinum 8160 CPU at 2.10GHz) running Ubuntu 18.04. This publicly available API was used to calculate the standard metrics of Precision, Recall, and F-score.
3.1 Experimental Results
Experimental Results 1
In the first experiment, we compare the entity linking performance of Falcon 2.0 with the baseline OpenTapioca. We first chose the SimpleQuestion dataset. Surprisingly, we observe that for the baseline, the values are approximately 0.0 for Precision, Recall, and F-score. We analyzed the source of errors and found that, out of 6,505 questions, only 246 have entity labels in uppercase letters. OpenTapioca cannot recognize or link any entity written in lowercase letters. Case sensitivity is a common issue for entity linking tools over short text, as reported by Singh et al. [singh2018no, DBLP:conf/www/SinghRBSLUVKP0V18] in a detailed analysis. For these 246 questions, OpenTapioca gives the correct answer for only 70. On the other hand, Falcon 2.0 reports an F-score of 0.63 on the same dataset (cf. Table 1). Given that OpenTapioca is limited by lowercase entity surface forms, we randomly selected 1,000 questions from the LC-QuAD 2.0 dataset and compared it against Falcon 2.0 on them. OpenTapioca reports an F-score of 0.25 against an F-score of 0.68 for Falcon 2.0, as reported in Table 1.
Compared to Falcon, Falcon 2.0 shows a drop in performance (please see [sakor2019old] for a detailed performance analysis of Falcon). We analyzed the source of errors. The first source of error is the datasets themselves. In both datasets, many questions are grammatically incorrect. For example, where was hank cochran birthed is one of the questions of the SimpleQuestion dataset. Falcon 2.0 resorts to fundamental principles of English morphology and outperforms the state of the art in the task of recognizing entities in grammatically correct questions. The same issue persists in LC-QuAD 2.0, where a large portion of the dataset consists of grammatically incorrect questions. Furthermore, in questions such as i) Tell me art movement whose name has the word yamato in it, and ii) which doctrine starts with the letter t, there is no clear Wikidata relation, and our tool is not able to identify any entity.
Figure 6 provides a more detailed description of the results of this experiment. As observed in Figure 6, the number of questions that have Recall equal to 0.0 is much lower for Falcon 2.0 than for OpenTapioca. As mentioned before, OpenTapioca cannot recognize entities that are not in uppercase, which explains its high number of questions with Recall equal to 0.0, whereas Falcon 2.0 is able to recognize lowercase entities.
Experimental Results 2
In the second experiment, we evaluate the relation linking performance of Falcon 2.0. We are not aware of any other baseline for relation linking over Wikidata. Table 2 summarizes the relation linking performance. For relation linking, Falcon 2.0 reports performance comparable with that of Falcon over DBpedia [sakor2019old].
In August 2019, Wikidata became the first Wikimedia project to cross 1 billion edits, and there are over 20,000 active Wikidata editors (https://www.wikidata.org/wiki/Wikidata:Statistics). A large subset of the Semantic Web community has built its research extensively around DBpedia and Wikidata, targeting different research problems such as knowledge graph completion, question answering, entity linking, and data quality assessment. Furthermore, entity and relation linking tasks have been studied well beyond Semantic Web research, especially in NLP and information extraction. Many entity linking tools and approaches have been developed by researchers and industry practitioners, but they focus only on other knowledge graphs such as DBpedia, Yago, or Freebase. Despite Wikidata being hugely popular, there is only one public web API (OpenTapioca) available for reuse and for aligning unstructured text to Wikidata mentions. However, when it comes to short text, the performance of OpenTapioca is limited. Falcon 2.0 targets entity and relation linking over Wikidata for short text. We believe the availability of Falcon 2.0 as a web API, along with open source access to its code, will provide researchers with an easy and reusable way to annotate unstructured text against Wikidata. It is important to note that Falcon 2.0 is being used for entity and relation linking in biomedical semi-structured data sources in the context of the EU H2020 projects iASiS (http://project-iasis.eu/) and BigMedilytics (https://www.bigmedilytics.eu/). The extracted entities and relations have enabled the linking of various types of entities, e.g., drugs, drug-drug interactions, diseases, and toxicities, to the corresponding concepts in DBpedia.
5 Adoption and Reusability
Falcon 2.0 is open source, and its code is available in our public GitHub repository, https://github.com/SDM-TIB/Falcon2.0, for reusability and reproducibility. It is currently available for the English language. However, there is no assumption in the approach, or in building the background knowledge base, that restricts its adaptation or extension to other languages. The background knowledge of Falcon 2.0 is available to the community. It consists of 48,042,867 alignments of Wikidata entities and 15,645 alignments of Wikidata predicates. The GNU General Public License v3.0 allows for the free distribution and reuse of Falcon 2.0. We hope the research community and industry practitioners will use the Falcon 2.0 resources for various purposes such as linking entities and relations to Wikidata, annotating unstructured text, developing new resources for low-resource languages, and others.
6 Maintenance and Sustainability
Falcon 2.0 is released as a publicly available resource offering of the Scientific Data Management (SDM) group at TIB, Hannover (https://www.tib.eu/en/research-development/scientific-data-management/). TIB is one of the largest libraries for science and technology in the world (https://www.tib.eu/en/tib/profile/). It is actively engaged in promoting open access to scientific artifacts, e.g., research data, scientific literature, non-textual material, and software. Similar to other publicly maintained repositories of SDM, Falcon 2.0 will be maintained and regularly updated to fix bugs and include new features (https://github.com/SDM-TIB). Similar to Falcon, the Falcon 2.0 API will also be sustained on the TIB servers to allow for unrestricted free access.
7 Related Work
There are several surveys that provide a detailed overview of the advancements in the techniques employed for entity linking over knowledge graphs [shen2015, balog_2018]. Various reading lists [hengji2019], online forums (http://nlpprogress.com/english/entity_linking.html), and GitHub repositories (https://github.com/sebastianruder/NLP-progress/blob/master/english/entity_linking.md) track the progress in the domain of entity linking. Initial attempts in EL considered Wikipedia as the underlying knowledge source. The research field is quite mature, and the state of the art is at nearly human-level performance [raiman2018deeptype]. With the advent of publicly available knowledge graphs such as DBpedia, Yago, and Freebase, the focus shifted to developing EL over knowledge graphs. The developments in Deep Learning have introduced a range of models that carry out both NER and NED as a single end-to-end step [kolitsas2018end, DBLP:conf/emnlp/GaneaH17]. NCEL [CaoYixin-2018] learns both local and global features from Wikipedia articles, hyperlinks, and entity links to derive joint embeddings of words and entities. These embeddings are used to train a deep Graph Convolutional Network (GCN) that integrates all the features through a Multi-layer Perceptron. The output is passed into a Sub-Graph Convolution Network, which finally resorts to a fully connected decoder. The decoder maps the output states to linked entities. The BI-LSTM+CRF model [Emrah-W18-2403] formulates entity linking as a sequence learning task in which the entity mentions form a sequence whose length equals that of the series of output entities. In this model, an RDF2Vec layer initially transforms each mention into a fixed-length vector. The Bidirectional LSTM then outputs latent vectors at different timestamps, which are used in the CRF layer to finally resolve the ambiguity.
There is concrete evidence in the literature that machine learning based models trained over generic datasets such as WikiDisamb30 [DBLP:conf/cikm/FerraginaS10] and CoNLL (YAGO) [DBLP:conf/emnlp/HoffartYBFPSTTW11] do not perform well when applied to short texts. Singh et al. evaluated over 20 entity linking tools for short text (questions in this case) and concluded that issues such as capitalization of surface forms, implicit entities, and multi-word entities affect the performance of EL tools on short input text. Falcon [sakor2019old] addresses specific challenges of short texts by applying a rule-based approach for EL over DBpedia. Falcon not only links entities to DBpedia, but also provides DBpedia URIs of the relations in a short text. EARL [banerjeejoint] is another tool that proposes a traveling salesman algorithm based approach for joint entity and relation linking over DBpedia. Besides EARL and Falcon, we are not aware of any other tool that provides joint entity and relation linking.
Entity linking over Wikidata is a relatively new domain. Cetoli et al. [cetoli2019neural] propose a neural network based approach for linking entities to Wikidata. The authors also align an existing Wikipedia corpus based dataset to Wikidata. However, this work only targets entity disambiguation and assumes that the entities are already recognized in the sentences. Arjun [mulang2019context] is the latest work on Wikidata entity linking and uses an attention based neural network for linking Wikidata entity labels. OpenTapioca [delpeuch2019opentapioca] is another attempt that performs end-to-end entity linking over Wikidata; it is the closest to our work, even though OpenTapioca does not provide Wikidata IDs of the relations in a sentence. OpenTapioca is also available as an API, and it is utilized as our baseline.
8 Conclusion and Future Work
We presented the resource Falcon 2.0, a rule-based entity and relation linking tool able to recognize entities and relations in short text and to link them to an existing knowledge graph, e.g., DBpedia or Wikidata. Although there are various approaches for entity and relation linking to DBpedia, Falcon 2.0 is one of the few tools able to perform this task over Wikidata. Thus, given the number of generic and domain-specific facts that compose Wikidata, Falcon 2.0 has the potential to impact researchers and practitioners who resort to NLP tools for transforming semi-structured data into structured facts. Falcon 2.0 is open source, and the API is publicly accessible and maintained on the servers of the TIB labs (https://labs.tib.eu). Falcon 2.0 has been empirically evaluated on two benchmarks, and the outcomes suggest that it is able to outperform the state of the art. Albeit promising, the experimental results can be improved. In the future, we plan to continue researching novel techniques that make it possible to adjust the catalog of rules and the alignments to changes in Wikidata.
This work has received funding from the EU H2020 Project No. 727658 (IASIS).