Neural Networks for Relational Data

  • 2020-01-13 17:26:00
  • Navdeep Kaur, Gautam Kunapuli, Saket Joshi, Kristian Kersting, Sriraam Natarajan
  • 0

Abstract

While deep networks have been enormously successful over the last decade,they rely on flat-feature vector representations, which makes them unsuitablefor richly structured domains such as those arising in applications like socialnetwork analysis. Such domains rely on relational representations to capturecomplex relationships between entities and their attributes. Thus, we considerthe problem of learning neural networks for relational data. We distinguishourselves from current approaches that rely on expert hand-coded rules bylearning relational random-walk-based features to capture local structuralinteractions and the resulting network architecture. We further exploitparameter tying of the network weights of the resulting relational neuralnetwork, where instances of the same type share parameters. Our experimentalresults across several standard relational data sets demonstrate theeffectiveness of the proposed approach over multiple neural net baselines aswell as state-of-the-art statistical relational models.

 

Quick Read (beta)

Neural Networks for Relational Data

Navdeep Kaur 1The University of Texas at Dallas
1{Navdeep.Kaur, Gautam.Kunapuli, Sriraam.Natarajan}@utdallas.edu
   Gautam Kunapuli 1The University of Texas at Dallas
1{Navdeep.Kaur, Gautam.Kunapuli, Sriraam.Natarajan}@utdallas.edu
   Saket Joshi 2Amazon Inc. 2{[email protected]}    Kristian Kersting 3TU Darmstadt, Germany 3{[email protected]}    Sriraam Natarajan 1The University of Texas at Dallas
1{Navdeep.Kaur, Gautam.Kunapuli, Sriraam.Natarajan}@utdallas.edu
Abstract

While deep networks have been enormously successful over the last decade, they rely on flat-feature vector representations, which makes them unsuitable for richly structured domains such as those arising in applications like social network analysis. Such domains rely on relational representations to capture complex relationships between entities and their attributes. Thus, we consider the problem of learning neural networks for relational data. We distinguish ourselves from current approaches that rely on expert hand-coded rules by learning relational random-walk-based features to capture local structural interactions and the resulting network architecture. We further exploit parameter tying of the network weights of the resulting relational neural network, where instances of the same type share parameters. Our experimental results across several standard relational data sets demonstrate the effectiveness of the proposed approach over multiple neural net baselines as well as state-of-the-art statistical relational models.

Keywords:
neural networks relational models

1 Introduction

While successful, deep networks have a few important limitations. Apart from the key issue of interpretability, the other major limitation is the requirement of a flat inputs (vectors, matrics, tensors), which limits applications to tabular, propositional representations. On the other hand, symbolic and structured representations [14, 7, 13, 38, 1] have the advantage of being interpretable, while also allowing for rich representations that allow for learning and reasoning with multiple levels of abstraction. This representability allows them to model complex data structures such as graphs far more easily and interpretably than basic propositional representations. While expressive, these models do not incorporate or discover latent relationships between features as effectively as deep networks.

Consequently, there has been focus on achieving the dream team of logical and statistical learning methods such as relational neural networks [19, 43]. While specific architectures differ, these methods generally employ hand-coded relational rules or Inductive Logic Programming (ILP, [24]) to identify the domain’s structural rules; these rules are then used with the observed data to unroll and learn a neural network. We improve upon these methods in two specific ways: (1) we employ a rule learner that has been recently successful to automatically extract interpretable rules that are then employed as hidden layer of the neural network; (2) we exploit the notion of parameter tying from the perspective of statistical relational learning models that allow multiple instances of the same rule share the same parameter. These two extensions significantly improve the adaptation of neural networks (NNs) for relational data.

We employ Relational Random Walks [22] to extract relational rules from a database, which are then used as the first layer of the NN. These random walks have the advantages of being learned from data (instead of time-consumingly hand-coded), and interpretable (as walks are rules in a database schema). Given evidence (facts), relational random walks are instantiated (grounded); parameter tying ensures that groundings of the same random walk share the same parameters with far fewer network parameters to be learned during training.

For combining outputs from different groundings of the same clause, we employ combination functions [30, 16]. For instance, given a rule: 𝐏𝐫𝐨𝐟𝐞𝐬𝐬𝐨𝐫(𝐏), 𝐀𝐮𝐭𝐡𝐨𝐫(𝐏,𝐔),𝐀𝐮𝐭𝐡𝐨𝐫(𝐒,𝐔),𝐒𝐭𝐮𝐝𝐞𝐧𝐭(𝐒), the 𝐚𝐧𝐚-𝐛𝐨𝐛 𝐏𝐫𝐨𝐟𝐞𝐬𝐬𝐨𝐫-𝐒𝐭𝐮𝐝𝐞𝐧𝐭 pair could have coauthored 6 papers, while the 𝐜𝐚𝐦-𝐝𝐚𝐧 pair could have coauthored 10 publications (𝐔). Combination functions are a natural way to compare such relational features arising from rules. Our network handles this in two steps: first, by ensuring that all instances (papers) of a particular 𝐏𝐫𝐨𝐟𝐞𝐬𝐬𝐨𝐫-𝐒𝐭𝐮𝐝𝐞𝐧𝐭 pair share the same weights. Second, by combining predictions from each of these instances (papers) using a combination function. We explore the use of Or, Max and Average combination functions. Once the network weights are appropriately constrained by parameter tying and combination functions, they can be learned using standard techniques such as backpropagation.

We make the following contributions: (1) we learn a NN that can be fully trained from data and with no significant engineering, unlike previous approaches; (2) we combine the successful paradigms of relational random walks and parameter tying from SRL methods; this allows the resulting NN to faithfully model relational data while being fully learnable; (3) we evaluate the proposed approach against recent relational NN approaches and demonstrate its efficacy.

2 Related Work

Lifted Relational Neural Networks. Our work is closest to Lifted Relational Neural Networks (LRNN) [43] due to Šourek et al., in terms of the architecture. LRNN uses expert hand-crafted relational rules as input, which are then instantiated (based on data) and rolled out as a ground network. While at a high-level, our approach appears similar to the LRNN framework, there are significant differences. First, while Šourek et al., exploit tied parameters across examples within the same rule, there is no parameter tying across multiple instances; our model, however, ensures parameter tying of multiple ground instances of the rule (in our case, a relational random walk). Second, since they adopt a fuzzy notion, their system supports weighted facts (called ground atoms in logic literature). We take a more standard approach and our observations are Boolean. Third, while the previous difference appears to be limiting in our case, note that this leads to a reduction in the number of network weight parameters.

Ŝourek et al., have extended their work to learn network structure using predicate invention [44]; our work learns relational random walks as rules for the network structure. As we show in our experiments, NNs cannot only easily handle such large number of such random walks, but can also use them effectively as a bag of weakly predictive intermediate layers capturing local features. This allows for learning a more robust model than the induced rules, which take a more global view of the domain. Another recent approach is due to Kazemi and Poole [19], who proposed a relational neural network by adding hidden layers to their Relational Logistic Regression [18] model. A key limitation of their work is that they are restricted to unary relation predictions, that is, they can only predict attributes of objects instead of relations between. In contrast, ours is a more general framework in that can be used to predict relations between objects.

Much of this recent work is closely related to a significant body of research called neural-symbolic integration [12], which aims to combine (arguably) two of the oldest formalisms in machine learning: symbolic representations with neural learning architectures. Some of the earliest systems such as KBANN [45] date back to the early 90s; KBANN also rolls out the network architecture from rules, though it only supports propositional rules. Current work, including ours, instead explores relational rules which serve as templates to roll out more complex architectures. Other recent approaches such as CILP++ [11] and Deep Relational Machines [26] incorporate relational information as network layers. However, such models propositionalize relational data into flat-feature vector and hence, cannot be seen as truly relational models. A rather distinctive approach in this vein is due to Hu et al. [15], where two independent networks incorporating rules and data are trained together. Finally, NNs have also been trained to approximate ILP clause evaluation [8], perform SLD-resolution in first-order logic [21], and approximate entailment operators in propositional logic [10].

Relational Random Walks. The Path Ranking Algorithm (PRA, [22]) is a key framework, where a combination of random walks replaces exhaustive search in order to answer queries. Recently, Das et al. [6] considered random walks between query entities to perform composition of embeddings of relations on each walk with recurrent neural networks. DeepWalks [34] performs random walks on graphs by treating each node as a word, which results in learning embeddings for each node of graph. Kaur et al.[17] consider relational random walks to generate count and existential features to train a relational restricted Boltzmann machine [23]. This feature transformation induces propositionalization that could potentially result in loss of information, as we show in our experiments.

Tensor Based Models. Recently, several tensor-based models [31, 4, 41, 3, 47] have been proposed to learn embeddings of objects and relations. Such models have been very effective for large-scale knowledge-base construction. However, they are computationally expensive as they learn parameters for each object and relation in the knowledge base. Furthermore, the embedding into some ambient vector space makes the models more difficult to interpret. Though rule distillation can yield human-readable rules [48], it is another computationally intensive post-processing step, which limits the size of the interpreted rules.

Other Models. Several NNs have been utilized with relational databases schemas [2, 37]. These models differ on how they handle 1-to-N joins, cyclicity, and indirect relationships between relations. However, they all learn one network per relation, which makes them computationally expensive. In the same vein, graph-based models take graph structure into consideration during training. Pham et al. [35] perform collective classification via a deep neural network where connections between adjacent layers are established according to given graph structure. Niepert et al. [32] proposed an algorithm that prepares the relational data to be directly input to standard convolutional network by assigning an ordering to enable feature convolution. Scarselli et al. [39] proposed Graph Neural Networks in which one neural network is installed at each node of the graph, which is trained by obtaining input from all the incoming edges of graph. One neural network per node makes the model computationally very expensive Finally, with the rapid growth of deep learning, relational counterparts of most of existing connectionist models have been also proposed [40, 33, 46, 49].

3 Neural Networks with Relational Parameter Tying

We first introduce some notation for relational logic, which is used for relational representation, with the domain being represented using constants, variables and predicates. We adopt the following conventions: (1) constants used to represent entities in the domain are written in lower-case (e.g., 𝐚𝐧𝐚, 𝐛𝐨𝐛); (2) variables and entity types are capitalized (e.g., 𝐒𝐭𝐮𝐝𝐞𝐧𝐭, 𝐏𝐫𝐨𝐟𝐞𝐬𝐬𝐨𝐫); and (3) relations and predicate symbols between entities and attributes are represented as 𝐐(,). A grounding is a predicate applied to a tuple of terms (i.e., either a full or partial instantiation), e.g. 𝐀𝐝𝐯𝐢𝐬𝐞𝐝𝐁𝐲(𝐒𝐭𝐮𝐝𝐞𝐧𝐭,𝐚𝐧𝐚), is a partial instantiation.

Rules are constructed from atoms using logical connectives (, ) and quantifiers (, ). Due to the use of relational random walks, the relational rules that we employ are universally conjunctions of the form 𝐡𝐛𝟏𝐛, where the head 𝐡 is the target of prediction and the body 𝐛𝟏𝐛 corresponds to conditions that make up the rule (that is, each literal 𝐛𝐢 in the body is a predicate 𝐐(,)). We do not consider negations in this work.

An example rule could be 𝐀𝐝𝐯𝐢𝐬𝐞𝐝𝐁𝐲(𝐒,𝐏)𝐏𝐫𝐨𝐟𝐞𝐬𝐬𝐨𝐫(𝐏)𝐖𝐨𝐫𝐤𝐬𝐈𝐧(𝐏,𝐓)𝐏𝐚𝐫𝐭𝐎𝐟(𝐓,𝐒)𝐒𝐭𝐮𝐝𝐞𝐧𝐭(𝐒). This rules states that if a 𝐒𝐭𝐮𝐝𝐞𝐧𝐭 is a part of the project that the 𝐏𝐫𝐨𝐟𝐞𝐬𝐬𝐨𝐫 works on, then the 𝐒𝐭𝐮𝐝𝐞𝐧𝐭 is advised by that 𝐏𝐫𝐨𝐟𝐞𝐬𝐬𝐨𝐫. The body of the rule is learned as a random walk that starts with 𝐏𝐫𝐨𝐟𝐞𝐬𝐬𝐨𝐫 and ends with 𝐒𝐭𝐮𝐝𝐞𝐧𝐭. Such a random walk represents a chain of relations that could possibly connect a 𝐏𝐫𝐨𝐟𝐞𝐬𝐬𝐨𝐫 to a 𝐒𝐭𝐮𝐝𝐞𝐧𝐭 and is a relational feature that could help in the prediction. The rule head is the target that we are interested in predicting. Since these rules are essentially “soft” rules, we can also associate clauses with weights, i.e., weighted rules: (𝐑,𝐰).

A relational neural network 𝒩 is a set of M weighted rules describing interactions in the domain {𝐑𝐣,𝐰𝐣)}j=1M. We are given a set of atomic facts , known to be true (the evidence) and labeled relational training examples {(𝐱i,yi)}i=1. In general, labels yi can take multiple values corresponding to a multi-class problem. We seek to learn a relational neural network model 𝒩{𝐑𝐣,𝐰𝐣)}j=1M to predict a 𝐓𝐚𝐫𝐠𝐞𝐭 relation, given relational examples 𝐱, that is: y=𝐓𝐚𝐫𝐠𝐞𝐭(𝐱).

Given: Set of instances , 𝐓𝐚𝐫𝐠𝐞𝐭 relation, relational data set (𝐱,y)𝒟; Construct (structure learning): 𝐑𝐣, relational random walk rules (relational feature describing the network structure of 𝒩); Train (parameter learning): wj, rule weights via gradient descent with rule-based parameter tying to identify a sparse set of network weights of 𝒩

Example

The movie domain contains the entity types (variables) Person(P), Movie(M) and Genre(G). In addition there are relations (features): Directed(P,M), ActedIn(P,G) and InGenre(M,G). The domain also has relations for entity resolution: SamePerson(P1,P2) and SameGenre(G1,G2). The task is to predict if P1 worked under P2, with the target predicate (label): WorkedUnder(P1,P2).

3.1 Generating Lifted Random Walks

The core component of a neural network model is the architecture, which determines how the various neurons are connected to each other, and ultimately how all the input features interact with each other. In a relational neural network, the architecture is determined by the domain structure, or the set of relational rules that determines how various relations, entities and attributes interact in the domain as shown earlier with the 𝐀𝐝𝐯𝐢𝐬𝐞𝐝𝐁𝐲 example. While previous approaches employed carefully hand-crafted rules, we, instead, use relational random walks to define the network architecture and model the local relational structure of the domain. A similar approach was also used by Kaur et al [17], though the random walk features were used to instantiate a restricted Boltzmann machine, which has a far more limited architecture and their work is not lifted since it instantiates the entire network before learning.

Relational data is often represented using a lifted graph, which defines the domain’s schema; in such a representation, a relation 𝐏𝐫𝐞𝐝𝐢𝐜𝐚𝐭𝐞(𝐓𝐲𝐩𝐞𝟏,𝐓𝐲𝐩𝐞𝟐) is a predicate edge between two type nodes: 𝐓𝐲𝐩𝐞𝟏𝐏𝐫𝐞𝐝𝐢𝐜𝐚𝐭𝐞𝐓𝐲𝐩𝐞𝟐. A relational random walk through a graph is a chain of such edges corresponding to a conjunction of predicates. For a random walk to be semantically sound, we should ensure that the input type (argument domain) of the (i+1)-th predicate is the same as the output type (argument range) of the i-th predicate.

Example (continued)

The body of the rule

𝐀𝐜𝐭𝐞𝐝𝐈𝐧(𝐏𝟏,𝐆𝟏)𝐒𝐚𝐦𝐞𝐆𝐞𝐧𝐫𝐞(𝐆𝟏,𝐆𝟐)𝐀𝐜𝐭𝐞𝐝𝐈𝐧-𝟏(𝐆𝟐,𝐏𝟐)
𝐒𝐚𝐦𝐞𝐏𝐞𝐫𝐬𝐨𝐧(𝐏𝟐,𝐏𝟑)𝐀𝐜𝐭𝐞𝐝𝐈𝐧-𝟏(𝐏𝟑,𝐌)𝐃𝐢𝐫𝐞𝐜𝐭𝐞𝐝(𝐌,𝐏𝟒) 𝐖𝐨𝐫𝐤𝐞𝐝𝐔𝐧𝐝𝐞𝐫(𝐏𝟏,𝐏𝟒)

can be represented graphically as

𝐏𝟏𝐀𝐜𝐭𝐞𝐝𝐈𝐧𝐆𝟏𝐒𝐚𝐦𝐞𝐆𝐞𝐧𝐫𝐞𝐆𝟐𝐀𝐜𝐭𝐞𝐝𝐈𝐧-𝟏𝐏𝟐𝐒𝐚𝐦𝐞𝐏𝐞𝐫𝐬𝐨𝐧𝐏𝟑𝐀𝐜𝐭𝐞𝐝𝐈𝐧-𝟏𝐌𝐃𝐢𝐫𝐞𝐜𝐭𝐞𝐝𝐏𝟒.

This is a lifted random walk between two entities P1P4 in the target predicate, WorkedUnder(P1,P4). It is semantically sound as it is possible to chain the second argument of a predicate to the first argument of the succeeding predicate. This walk also contains an inverse predicate ActedIn-1, which is distinct from ActedIn (since the argument types are reversed).

We use path-constrained random walks [22] approach to generate M lifted random walks 𝐑𝐣, j=1,,M. These random walks form the backbone of the lifted neural network, as they are templates for various feature combinations in the domain. They can also be interpreted as domain rules as they impart localized structure to the domain model, that is, they provide a qualitative description of the domain. When these rules, or lifted random walks have weights associated with them, we are then able to endow the rules with a quantitative influence on the target predicate. We now describe a novel approach to network instantiation using these random-walk-based relational features. A key component of the proposed instantiation is rule-based parameter tying, which reduces the number of network parameters to be learned significantly, while still effectively maintaining the quantitative influences as described by the relational random walks.

3.2 Network Instantiation

Figure 1: The relational neural network is unrolled in three stages, ensuring that the output is a function of facts through two hidden layers: the combining rules layer (with lifted random walks) and the grounding layer (with instantiated random walks). Weights are tied between the input and grounding layers based on which fact/feature ultimately contributes to which rule in the combining rules layer.

The relational random walks (𝐑𝐣) generated in the previous subsection are the relational features of the lifted relational neural network, 𝒩. Our goal is to unroll and ground the network with several intermediate layers that capture the relationships expressed by the random walks. A key difference in network construction between our proposed work and recent approaches such as that of Šourek et al., [42] is that we do not perform an exhaustive grounding to generate all possible instances before constructing the network. Instead, we only ground as needed leading to a much more compact network. We unroll the network in the following manner (cf. Figure 1).

Output Layer: For the 𝐓𝐚𝐫𝐠𝐞𝐭, which is also the head 𝐡 in all the rules 𝐑𝐣, introduce an output neuron called the target neuron, Ah. With one-hot encoding of the target labels, this architecture can handle multi-class problems. The target neuron uses the softmax activation function. Without loss of generality, we describe the rest of the network unrolling assuming a single output neuron.

Combining Rules Layer: The target neuron is connected to M lifted rule neurons, each corresponding to one of the lifted relational random walks, (𝐑𝐣,𝐰𝐣). Each rule 𝐑𝐣 is a conjunction of predicates defined by random walks:

𝐐𝟏𝐣(𝐗,)𝐐𝐋𝐣(,𝐙)𝐓𝐚𝐫𝐠𝐞𝐭(𝐗,𝐙),𝐣=𝟏,,𝐌, (1)

and corresponds to the lifted rule neuron Aj. This layer of neurons is fully connected to the output layer to ensure that all the lifted random walks (that capture the domain structure) influence the output. The extent of their influence is determined by learnable weights, uj between Aj and the output neuron Ah.

In Fig. 1, we see that the rule neuron Aj is connected to the neurons Aji; these neurons correspond to Nj instantiations of the random-walk 𝐑𝐣. The lifted rule neuron Aj aims to combine the influence of the groundings/instantiations of the random-walk feature 𝐑𝐣 that are true in the evidence. Thus, each lifted rule neuron can also be viewed as a rule combination neuron. The activation function of a rule combination neuron can be any aggregator or combining rule [30]. This can include value aggregators such as weighted mean, max0 or distribution aggregators (if inputs to the this layer are probabilities) such as Noisy-Or. Many such aggregators can be incorporated into the combining rules layer with appropriate weights (vji) and activation functions of the rule neurons. For instance, combining rule instantiations 𝗈𝗎𝗍(Aji) with a weighted mean will require learning vji, with the nodes using unit functions for activation. The formulation of this layer is much more general and subsumes the approach of Šourek et al [42], which uses a max combination layer.

Grounding Layer: For each instantiated (ground) random walk 𝐑𝐣𝜽𝐢,𝐢=𝟏,,𝐍𝐣, we introduce a ground rule neuron, Aji. This ground rule neuron represents the i-th instantiation (grounding) of the body of the j-th rule, 𝐑𝐣𝜽𝐢: 𝐐𝟏𝐣𝜽𝐢𝐐𝐣𝜽𝐢 (cf. eqn 1). The activation function of a ground rule neuron is a logical AND (); it is only activated when all its constituent inputs are true (that is, only when the entire instantiation is true in the evidence).

This requires all the constituent facts 𝐐𝟏𝐣𝜽𝐢,,𝐐𝐣𝜽𝐢 to be in the evidence. Thus, the (j,i)-th ground rule neuron is connected to all the fact neurons that appear in its corresponding instantiated rule body. A key novelty of our approach is regarding relational parameter tying: the weights of connections between the fact and grounding layers are tied by the rule these facts appear in together. This is described in detail further below.

Input Layer: Each instantiated (grounded) predicate that appears as a part of an instantiated rule body is a fact, that is 𝐐𝐤𝐣𝜽𝐢. For each such instantiated fact, we create a fact neuron Af, ensuring that each unique fact in evidence has only one single neuron associated with it. Every example is a collection of facts, that is, example 𝐱ii. Thus, an example is input into the system by simply activating its constituent facts in the input layer.

Relational Parameter Tying: The most important thing to note about this construction is that we employ rule-based parameter tying for the weights between the grounding layer and the input/facts layer. Parameter tying ensures that instances corresponding to an example all share the same weight wj if they occur in the same lifted rule 𝐑𝐣. The shared weights wj are propagated through the network in a bottom-up fashion, ensuring that weights in the succeeding hidden layers are influenced by them.

Our approach to parameter tying is in sharp contrast to that of Šourek et al., [42], who learn the weights of the network edges between the output layer and the combining rules layer. Furthermore, they also use fuzzy facts (weighted instances), whereas in our case, the facts/instances are Boolean, though their edge weights are tied. Our approach also differs from that of Kaur et al., [17] who also use relational random walks. From a parametric standpoint, Kaur et al., used relational random walks as features for a restricted Boltzmann machine, where the instance neurons and the rule neurons form a bipartite graph. Thus, the relational RBM formulation has significantly more edges, and commensurately many more parameters to optimize during learning.

Figure 2: Example: unrolling the network with relational parameter tying.
Example (continued, see Fig. 2)

Consider two lifted random walks (R1,w1) and (R2,w2) for the target predicate WorkedUnder(P1,P2)

𝐖𝐨𝐫𝐤𝐞𝐝𝐔𝐧𝐝𝐞𝐫(𝐏𝟏,𝐏𝟐) 𝐀𝐜𝐭𝐞𝐝𝐈𝐧(𝐏𝟏,𝐌)𝐃𝐢𝐫𝐞𝐜𝐭𝐞𝐝-𝟏(𝐌,𝐏𝟐),
𝐖𝐨𝐫𝐤𝐞𝐝𝐔𝐧𝐝𝐞𝐫(𝐏𝟏,𝐏𝟐) 𝐒𝐚𝐦𝐞𝐏𝐞𝐫𝐬𝐨𝐧(𝐏𝟏,𝐏𝟑)𝐀𝐜𝐭𝐞𝐝𝐈𝐧(𝐏𝟑,𝐌)𝐃𝐢𝐫𝐞𝐜𝐭𝐞𝐝-𝟏(𝐌,𝐏𝟐).

Note that while the inverse predicate Directed-1(M,P) is syntactically different from Directed(P,M) (argument order is reversed), they are both semantically same. The output layer consists of a single neuron Ah corresponding to the binary target WorkedUnder. The lifted rule layer (also known as combining rules layer) has two lifted rule nodes A1 corresponding to rule R1 and A2 corresponding to rule R2. These rule nodes combine inputs corresponding to instantiations that are true in the evidence. The network is unrolled based on the specific training example, for instance: WorkedUnder(Leo,Marty). For this example, the rule R1 has two instantiations that are true in the evidence. Then, we introduce a ground rule node for each such instantiation:

A11: 𝐀𝐜𝐭𝐞𝐝𝐈𝐧(𝐋𝐞𝐨,"𝐓𝐡𝐞𝐃𝐞𝐩𝐚𝐫𝐭𝐞𝐝")𝐃𝐢𝐫𝐞𝐜𝐭𝐞𝐝-𝟏("𝐓𝐡𝐞𝐃𝐞𝐩𝐚𝐫𝐭𝐞𝐝",𝐌𝐚𝐫𝐭𝐲),
A12: 𝐀𝐜𝐭𝐞𝐝𝐈𝐧(𝐋𝐞𝐨,"𝐓𝐡𝐞𝐀𝐯𝐢𝐚𝐭𝐨𝐫")𝐃𝐢𝐫𝐞𝐜𝐭𝐞𝐝-𝟏("𝐓𝐡𝐞𝐀𝐯𝐢𝐚𝐭𝐨𝐫",𝐌𝐚𝐫𝐭𝐲).

The rule R2 has only one instantiation, and consequently only one node:

A21: 𝐒𝐚𝐦𝐞𝐏𝐞𝐫𝐬𝐨𝐧(𝐋𝐞𝐨,𝐋𝐞𝐨𝐧𝐚𝐫𝐝𝐨)𝐀𝐜𝐭𝐞𝐝𝐈𝐧(𝐋𝐞𝐨,"𝐓𝐡𝐞𝐃𝐞𝐩𝐚𝐫𝐭𝐞𝐝")
𝐃𝐢𝐫𝐞𝐜𝐭𝐞𝐝-𝟏("𝐓𝐡𝐞𝐃𝐞𝐩𝐚𝐫𝐭𝐞𝐝",𝐌𝐚𝐫𝐭𝐲).

The grounding layer consists of ground rule nodes corresponding to instantiations of rules that are true in the evidence. The edges AjiAj have weights vji that depend on the combining rule implemented in Aj. In this example, the combining rule is average, so we have v11=v12=12 and v21=1. The input layer consists of atomics fact in evidence: fF. The fact nodes ActedIn(Leo,"TheAviator") and Directed-1("TheAviator", Marty) appear in the grounding R1𝛉2 and are connected to the corresponding ground rule neuron A12. Finally, parameters are tied on the edges between the facts layer and the grounding layer. This ensures that all facts that ultimately contribute to a rule are pooled together, which increases the influence of the rule during weight learning. This, in turn, ensures that a rule that holds strongly in the evidence gets a higher weight.

Once the network 𝒩𝜽 is instantiated, the weights wj and uj can be learned using standard techniques such as backpropagation. We denote our approach Neural Networks with Relational Parameter Tying (NNRPT). The tied parameters incorporate the structure captured by the relational features (lifted random walks), leading to a network with significantly fewer weights, while also endowing the it with semantic interpretability regarding the discriminative power of the relational features. We now demonstrate the importance of parameter tying and the use of relational random walks as compared to previous frameworks.

4 Experiments

Our empirical evaluation aims to answer the following questions explicitly11 1 https://github.com/navdeepkjohal/NNRPT: Q1:] How does 𝙽𝙽𝚁𝙿𝚃 compare to the state-of-the-art SRL models i.e., what the value of learning a neural net over standard models? Q2: How does 𝙽𝙽𝚁𝙿𝚃 compare to propositionalization models i.e., what is the need for parameterization of standard neural networks? Q3: How does 𝙽𝙽𝚁𝙿𝚃 compare to other relational neural networks in literature?

Data Sets:

We use five standard data sets to evaluate our algorithm (see Table 1): Uw-Cse. [38] is a standard data set that consists of predicates and relations such as 𝐏𝐫𝐨𝐟𝐞𝐬𝐬𝐨𝐫, 𝐒𝐭𝐮𝐝𝐞𝐧𝐭, 𝐏𝐮𝐛𝐥𝐢𝐜𝐚𝐭𝐢𝐨𝐧, 𝐇𝐚𝐬𝐏𝐨𝐬𝐢𝐭𝐢𝐨𝐧 and 𝐓𝐚𝐮𝐠𝐡𝐭𝐁𝐲 etc. The data set contains information from 5 different areas of computer science about professors, students and courses, and the task is to predict the 𝐀𝐝𝐯𝐢𝐬𝐞𝐝𝐁𝐲 relationship between a professor and a student. Imdb was first created by Mihalkova and Mooney [27] and contains nine predicates such as 𝐆𝐞𝐧𝐝𝐞𝐫, 𝐆𝐞𝐧𝐫𝐞, 𝐌𝐨𝐯𝐢𝐞, and 𝐃𝐢𝐫𝐞𝐜𝐭𝐨𝐫. We predict whether an actor has 𝐖𝐨𝐫𝐤𝐞𝐝𝐔𝐧𝐝𝐞𝐫 a director. Cora is a citation matching data set modified by Poon and Domingos [36]. It contains predicates 𝐀𝐮𝐭𝐡𝐨𝐫, 𝐓𝐢𝐭𝐥𝐞, 𝐕𝐞𝐧𝐮𝐞, 𝐇𝐚𝐬𝐖𝐨𝐫𝐝𝐀𝐮𝐭𝐡𝐨𝐫, 𝐇𝐚𝐬𝐖𝐨𝐫𝐝𝐓𝐢𝐭𝐥𝐞, 𝐇𝐚𝐬𝐖𝐨𝐫𝐝𝐕𝐞𝐧𝐮𝐞, 𝐒𝐚𝐦𝐞𝐀𝐮𝐭𝐡𝐨𝐫, and 𝐒𝐚𝐦𝐞𝐓𝐢𝐭𝐥𝐞. The task is to predict if one venue is 𝐒𝐚𝐦𝐞𝐕𝐞𝐧𝐮𝐞 as another.

Mutagenesis [25] was originally used to predict whether a compound is mutagenetic or not. It consists of properties of compounds, their constituent atoms and the type of bond that exists between atoms. We performed relation prediction of whether an atom is a constituent of a given molecule or not (𝐌𝐨𝐥𝐞𝐀𝐭𝐦(𝐀𝐭𝐨𝐦𝐈𝐃,𝐌𝐨𝐥𝐈𝐃)). Sports consists of facts from the sports domain crawled by the Never-Ending Language Learner (NELL, [5]) including details of players, sports, individual plays, league information etc. The goal is to predict which sport a particular team plays.

Table 1: Data sets used in our experiments to answer Q1Q3. The last column shows the number of sampled groundings of random walks per example for 𝙽𝙽𝚁𝙿𝚃.
Domain Target #Facts #Pos #Neg #RW #Samp/RW
Uw-Cse 𝚊𝚍𝚟𝚒𝚜𝚎𝚍𝙱𝚢 2817 90 180 2500 1000
Mutagenesis 𝙼𝚘𝚕𝚎𝙰𝚝𝚖 29986 1000 2000 100 100
Cora 𝚂𝚊𝚖𝚎𝚅𝚎𝚗𝚞𝚎 31086 2331 4662 100 100
Imdb 𝚆𝚘𝚛𝚔𝚎𝚍𝚄𝚗𝚍𝚎𝚛 914 305 710 80 -
Sports 𝚃𝚎𝚊𝚖𝙿𝚕𝚊𝚢𝚜𝚂𝚙𝚘𝚛𝚝 7824 200 400 200 100

Baselines and Experimental Details:

To answer Q1, we compare 𝙽𝙽𝚁𝙿𝚃 with the more recent and state-of-the-art relational gradient-boosting methods, 𝚁𝙳𝙽-𝙱𝚘𝚘𝚜𝚝[29], 𝙼𝙻𝙽-𝙱𝚘𝚘𝚜𝚝 [20], and relational restricted Boltzmann machines 𝚁𝚁𝙱𝙼-𝙴, 𝚁𝚁𝙱𝙼-𝙲 [17]. As the random walks chain binary predicates in our model, we convert unary and ternary predicates into binary predicates for all data sets. Further, to maintain consistency in experimentation, we use the same resulting predicates across all our baselines as well. We run 𝚁𝙳𝙽-𝙱𝚘𝚘𝚜𝚝 and 𝙼𝙻𝙽-𝙱𝚘𝚘𝚜𝚝 with their default settings and learn 20 trees for each model. Also, we train 𝚁𝚁𝙱𝙼-𝙴 and 𝚁𝚁𝙱𝙼-𝙲 according to the settings recommended in [17].

For 𝙽𝙽𝚁𝙿𝚃, we generate random walks by considering each predicate and its inverse to be two distinct predicates. Also, we avoid loops in the random walks by enforcing sanity constraints on the random walk generation. We consider 100 random walks for Mutagenesis, Cora, 80 random walks for Imdb, 200 random walks for Sports and 2500 random walks for Uw-Cse as suggested by Kaur et al [17] (see Table 1). Since we use a large number of random walks, exhaustive grounding becomes prohibitively expensive. To overcome this, we sample groundings for each random walk for large data sets. Specifically, we sample 100 groundings per random walk per example for Cora, Sports, Mutagenesis, and 1000 groundings per random walk per example for Uw-Cse (see Table 1).

For all experiments, we set the positive to negative example ratio to be 1:2 for training, set combination function to be average and perform 5-fold cross validation. For 𝙽𝙽𝚁𝙿𝚃, we set the learning rate to be 0.05, batch size to 1, and number of epochs to 1. We train our model with L1-regularized AdaGrad [9]. Since these are relational data sets where the data is skewed, AUC-PR and AUC-ROC are better measures than likelihood and accuracy.

To answer Q2, we generated flat feature vectors by Bottom Clause Propositionalization (BCP, [11]), according to which one bottom clause is generated for each example. BCP considers each predicate in the body of the bottom clause as a unique feature when it propositionalizes bottom clauses to flat feature vector. We use Progol [28] to generate these bottom clauses. After propositionalization, we train two connectionist models: a propositionalized restricted Boltzmann machine (𝙱𝙲𝙿-𝚁𝙱𝙼) and a propositionalized neural network (𝙱𝙲𝙿-𝙽𝙽). The NN has two hidden layers in our experiments, which makes 𝙱𝙲𝙿-𝙽𝙽 model a modified version of CILP++ [11] that had one hidden layer. The hyper-parameters of both the models were optimized by line search on validation set.

To answer Q3, we compare our model with Lifted Relational Neural Networks (𝙻𝚁𝙽𝙽, [42]). To ensure fairness, we perform structure learning by using PROGOL [28] and input the same clauses to both 𝙻𝚁𝙽𝙽 and 𝙽𝙽𝚁𝙿𝚃. PROGOL learned 4 clauses for Cora, 8 clauses for Imdb, 3 clauses for Sports, 10 clauses for Uw-Cse and 11 clauses for Mutagenesis in our experiment.

Table 2: Comparison of different learning algorithms based on AUC-ROC and AUC-PR. 𝙽𝙽𝚁𝙿𝚃 is comparable or better than standard SRL methods across all data sets.
\Xhline3 \Xhline3 Data Set Measure 𝚁𝙳𝙽-𝙱𝚘𝚘𝚜𝚝 𝙼𝙻𝙽-𝙱𝚘𝚘𝚜𝚝 𝚁𝚁𝙱𝙼-𝙴 𝚁𝚁𝙱𝙼-𝙲 𝙽𝙽𝚁𝙿𝚃
\Xhline3 \Xhline3 Uw-Cse AUC-ROC 0.973±0.014 0.968±0.014 0.975±0.013 0.968±0.011 0.959±0.024
AUC-PR 0.931±0.036 0.916±0.035 0.923±0.056 0.924±0.040 0.896±0.063
\Xhline3 Imdb AUC-ROC 0.955±0.046 0.944±0.070 1.000±0.000 0.997±0.006 0.984±0.025
AUC-PR 0.863±0.112 0.839±0.169 1.000±0.000 0.992±0.017 0.951±0.082
\Xhline3 Cora AUC-ROC 0.895±0.183 0.835±0.035 0.984±0.009 0.867±0.041 0.952±0.043
AUC-PR 0.833±0.259 0.799±0.034 0.948±0.042 0.825±0.050 0.899±0.070
\Xhline3 Mutag. AUC-ROC 0.999±0.000 0.999±0.000 0.999±0.000 0.998±0.001 0.981±0.024
AUC-PR 0.999±0.000 0.999±0.000 0.999±0.000 0.997±0.002 0.970±0.039
\Xhline3 Sports AUC-ROC 0.801±0.026 0.806±0.016 0.760±0.016 0.656±0.071 0.780±0.026
AUC-PR 0.670±0.028 0.652±0.032 0.634±0.020 0.648±0.085 0.668±0.070
\Xhline3 \Xhline3
Table 3: Comparison of 𝙽𝙽𝚁𝙿𝚃 with propositionalization-based approaches. 𝙽𝙽𝚁𝙿𝚃 is significantly better on a majority of data sets.
\Xhline3 \Xhline3 Data Set Measure 𝙱𝙲𝙿-𝚁𝙱𝙼 𝙱𝙲𝙿-𝙽𝙽 𝙽𝙽𝚁𝙿𝚃
\Xhline3 \Xhline3 Uw-Cse AUC-ROC 0.951±0.041 0.868±0.053 0.959±0.024
AUC-PR 0.860±0.114 0.869±0.033 0.896±0.063
\Xhline3 Imdb AUC-ROC 0.780±0.164 0.540±0.152 0.984±0.025
AUC-PR 0.367±0.139 0.536±0.231 0.951±0.082
\Xhline3 Cora AUC-ROC 0.801±0.017 0.670±0.064 0.952±0.043
AUC-PR 0.647±0.050 0.658±0.064 0.899±0.070
\Xhline3 Mutag. AUC-ROC 0.991±0.003 0.945±0.019 0.981±0.024
AUC-PR 0.995±0.001 0.973±0.012 0.970±0.039

\Xhline3 Sports
AUC-ROC 0.664±0.021 0.543±0.037 0.780±0.026
AUC-PR 0.532±0.041 0.499±0.065 0.668±0.070
\Xhline3 \Xhline3

Results:

Table 2 compares our 𝙽𝙽𝚁𝙿𝚃 to 𝙼𝙻𝙽-𝙱𝚘𝚘𝚜𝚝, 𝚁𝙳𝙽-𝙱𝚘𝚘𝚜𝚝, 𝚁𝚁𝙱𝙼-𝙴 and 𝚁𝚁𝙱𝙼-𝙲 to answer Q1. As we see, 𝙽𝙽𝚁𝙿𝚃 is significantly better than 𝚁𝚁𝙱𝙼-𝙲 for Cora and Sports on both AUC-ROC and AUC-PR, and performs comparably to the other data sets. It also performs better than 𝙼𝙻𝙽-𝙱𝚘𝚘𝚜𝚝, 𝚁𝙳𝙽-𝙱𝚘𝚘𝚜𝚝 on Imdb and Cora data sets, and comparably on other data sets. Similarly, it performs better than 𝚁𝚁𝙱𝙼-𝙴 on Sports, both on AUC-ROC and AUC-PR and comparably on other data sets. Broadly, Q1 can be answered affirmatively in that 𝙽𝙽𝚁𝙿𝚃 performs comparably to or better than state-of-the-art SRL models.

Table 3 shows the comparison of 𝙽𝙽𝚁𝙿𝚃 with two propositionalization models: 𝙱𝙲𝙿-𝚁𝙱𝙼 and 𝙱𝙲𝙿-𝙽𝙽 in order to answer Q2. 𝙽𝙽𝚁𝙿𝚃 performs better than 𝙱𝙲𝙿-𝚁𝙱𝙼 on all the data sets except Mutagenesis, where the two models have similar performance. 𝙽𝙽𝚁𝙿𝚃 also performs better than 𝙱𝙲𝙿-𝙽𝙽 on all data sets. It should be noted that BCP feature generation sometimes introduces a large positive-to-negative example skew (for example, in the Imdb data set), which can sometimes gravely affect the performance of the propositional model, as we observe in Table 3. This emphasizes the need for designing models that can handle relational data directly and without propositionalization; our proposed model as an effort in this direction. Q2 can now be answered affirmatively: that 𝙽𝙽𝚁𝙿𝚃 performs better than propositionalization models.

Table 4 compares the performance of 𝙽𝙽𝚁𝙿𝚃 and 𝙻𝚁𝙽𝙽 when both use clauses learned by PROGOL [28]. 𝙽𝙽𝚁𝙿𝚃 performs better on Uw-Cse, Sports evaluated using AUC-PR. This result is especially significant because these data sets are considerably skewed. 𝙽𝙽𝚁𝙿𝚃 also outperforms 𝙻𝚁𝙽𝙽 on Cora and Mutagenesis. Lastly, 𝙽𝙽𝚁𝙿𝚃 has comparable performance on Imdb on both AUC-ROC and AUC-PR. The reason for this big performance gap between the two models on Cora is likely because 𝙻𝚁𝙽𝙽 could not build effective models with the fewer number of clauses (i.e. four) typically learned by PROGOL. In contrast, even with very few clauses, 𝙽𝙽𝚁𝙿𝚃 is able to outperform 𝙻𝚁𝙽𝙽. This helps us answer Q3, affirmatively, that: 𝙽𝙽𝚁𝙿𝚃 offers many advantages over state-of-the-art relational neural networks.

In summary, our experiments clearly show the benefits of parameter tying as well as the expressivity of relational random walks in tightly integrating with a neural network model across a wide variety of domains and settings. The key strengths of 𝙽𝙽𝚁𝙿𝚃 are that it can (1) efficiently incorporate a large number of relational features, (2) capture local qualitative structure through relational random walk features, (3) tie feature weights (parameter-tying) in a manner that captures the global quantitative influences.

Table 4: Comparison of 𝙽𝙽𝚁𝙿𝚃 and 𝙻𝚁𝙽𝙽 on AUC-ROC and AUC-PR on different data sets. Both the models were provided clauses learnt by PROGOL, [28]. 𝙽𝙽𝚁𝙿𝚃 is capable of employing rules to improve performance in some data sets.
\Xhline3 \Xhline3 Model Measure Uw-Cse Imdb Cora Mutagen. Sports
\Xhline3 \Xhline3 𝙻𝚁𝙽𝙽 AUC-ROC 0.923±0.027 0.995±0.004 0.503±0.003 0.500±0.000 0.741±0.016
AUC-PR 0.826±0.056 0.985±0.013 0.356±0.006 0.335±0.000 0.527±0.036

\Xhline3 𝙽𝙽𝚁𝙿𝚃
AUC-ROC 0.700±0.186 0.997±0.007 0.968±0.022 0.532±0.019 0.657±0.014
AUC-PR 0.910±0.072 0.992±0.017 0.943±0.032 0.412±0.032 0.658±0.056

\Xhline3 \Xhline3

Discussion:

A typical convolutional neural network (CNN) is composed of three layers: convolution, max-pooling and (fully-connected) output layers. 𝙽𝙽𝚁𝙿𝚃 can be considered a special instance of a convolutional network in relational domains, where the fact-grounding layer edges are the equivalent of convolution, combining rules layer represents pooling, and softmax layer is the fully-connected layer. If we perform a full and exhaustive grounding of the neural network in 𝙽𝙽𝚁𝙿𝚃, M is the number of lifted random walks (template rules), N is the number of grounded random walks (instances of a template rule) and || is the number of all facts (atomic instances). The data can be represented as a three-dimensional tensor B of size M×N×||, whose elements are precisely Bijk=𝐐kj𝜽i (see the discussion of the Input Layer in Section 3.2). In addition, if we consider the rule layer as tensor T = M×1×||, where parameters are tied across ||, then [wm1f]m=1M constitutes the convolving filter that is repeatedly applied to each of || ground instances. The resulting tensor G=M×N×1 obtained by composing G=DT representing the output of grounded layer passes through a pooling layer (which is the rule-combination layer, here) to downsample the data produce a new tensor C=M×1×1. The tensor C, when composed with the fully-connected non-linear layer F=M×|𝒪| of our model produces tensor of size 1×|𝒪| that represents the probability of each class in the output: 𝒪.

5 Conclusion and Future Work

We considered the problem of learning neural networks from relational data. Our proposed architecture was able to exploit parameter tying i.e., different instances of the same rule shared the same parameters inside the same training example. In addition, we explored the use of relational random walks to create relational features for training these neural nets. Further experiments on larger data sets could yield insights into the scalability of this approach. Integration with an approximate-counting method could potentially reduce the training time. Given the relation to CNNs, stacking could allow for our method to be deeper. Finally, understanding the use of such random-walk-based neural network as a function approximator can allow for efficient and interpretable learning in relational domains with minimal feature engineering.

Acknowledgements: SN, GK & NK gratefully acknowledge AFOSR award FA9550-18-1-0462. The authors acknowledge the support of Amazon faculty award. KK acknowledges the support of the RMU project DeCoDeML. Any opinions, findings, and conclusion or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the AFOSR, Amazon, DeCoDeML or the US government.

References

  • [1] S. Bach, M. Broecheler, B. Huang, and L. Getoor (2017) Hinge-loss Markov random fields and probabilistic soft logic. JMLR. Cited by: §1.
  • [2] H. Blockeel and W. Uwents (2004) Using neural networks for relational learning. In ICML Workshop, Cited by: §2.
  • [3] A. Bordes, X. Glorot, J. Weston, and Y. Bengio (2012) Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS, Cited by: §2.
  • [4] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In NeurIPS, Cited by: §2.
  • [5] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka,Jr., and T. M. Mitchell (2010) Toward an architecture for never-ending language learning. In AAAI, Cited by: §4.
  • [6] R. Das, A. Neelakantan, D. Belanger, and A. McCallum (2017) Chains of reasoning over entities, relations, and text using recurrent neural networks. In EACL, Cited by: §2.
  • [7] L. De Raedt, K. Kersting, S. Natarajan, and D. Poole (2016) Statistical Relational Artificial Intelligence: Logic, Probability, and Computation. Morgan & Claypool. Cited by: §1.
  • [8] F. DiMaio and J. Shavlik (2004) Learning an approximation to inductive logic programming clause evaluation. In ILP, Cited by: §2.
  • [9] J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. JMLR. Cited by: §4.
  • [10] R. Evans et al. (2018) Can neural networks understand logical entailment?. ICLR. Cited by: §2.
  • [11] M. V. M. França, G. Zaverucha, and A. S. d’Avila Garcez (2014) Fast relational learning using bottom clause propositionalization with artificial neural networks. MLJ. Cited by: §2, §4.
  • [12] A. S. d. Garcez, D. M. Gabbay, and K. B. Broda (2002) Neural-symbolic learning system: foundations and applications. Springer-Verlag. Cited by: §2.
  • [13] L. Getoor, N. Friedman, D. Koller, and A. Pfeffer (2001) Learning probabilistic relational models. RDM. Cited by: §1.
  • [14] L. Getoor and B. Taskar (2007) Introduction to statistical relational learning. MIT Press. Cited by: §1.
  • [15] Z. Hu, X. Ma, Z. Liu, E. H. Hovy, and E. P. Xing (2016) Harnessing deep neural networks with logic rules. In ACL, Cited by: §2.
  • [16] M. Jaeger (2007) Parameter learning for relational bayesian networks. In ICML, Cited by: §1.
  • [17] N. Kaur, G. Kunapuli, T. Khot, K. Kersting, W. Cohen, and S. Natarajan (2017) Relational restricted boltzmann machines: a probabilistic logic learning approach. In ILP, Cited by: §2, §3.1, §3.2, §4, §4.
  • [18] S. M. Kazemi, D. Buchman, K. Kersting, S. Natarajan, and D. Poole (2014) Relational logistic regression. In KR, Cited by: §2.
  • [19] S. M. Kazemi and D. Poole (2018) RelNN: A deep neural model for relational learning. In AAAI, Cited by: §1, §2.
  • [20] T. Khot, S. Natarajan, K. Kersting, and J. Shavlik (2011) Learning Markov logic networks via functional gradient boosting. In ICDM, Cited by: §4.
  • [21] E. Komendantskaya (2007) First-order deduction in neural networks. In LATA, Cited by: §2.
  • [22] N. Lao and W. Cohen (2010) Relational retrieval using a combination of path-constrained random walks. JMLR. Cited by: §1, §2, §3.1.
  • [23] H. Larochelle and Y. Bengio (2008) Classification using discriminative restricted boltzmann machines. In ICML, Cited by: §2.
  • [24] N. Lavrac and Š. Džeroski (1993) Inductive logic programming: techniques and applications. Prentice Hall. Cited by: §1.
  • [25] H. Lodhi and S. Muggleton (2005) Is mutagenesis still challenging ?. In ILP, Cited by: §4.
  • [26] H. Lodhi (2013) Deep relational machines. In ICONIP, Cited by: §2.
  • [27] L. Mihalkova and R. Mooney (2007) Bottom-up learning of Markov logic network structure. In ICML, Cited by: §4.
  • [28] S. Muggleton (1995) Inverse entailment and Progol. New Generation Computing. Cited by: §4, §4, §4, Table 4.
  • [29] S. Natarajan, T. Khot, K. Kersting, B. Guttmann, and J. Shavlik (2012) Gradient-based boosting for statistical relational learning: relational dependency network case. MLJ. Cited by: §4.
  • [30] S. Natarajan, P. Tadepalli, T. G. Dietterich, and A. Fern (2008) Learning first-order probabilistic models with combining rules. ANN MATH ARTIF INTEL. Cited by: §1, §3.2.
  • [31] Nickel,M., V. Tresp, and H. P. Kriegel (2011) A three-way model for collective learning on multirelational data. In ICML, Cited by: §2.
  • [32] M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In ICML, Cited by: §2.
  • [33] R. B. Palm, U. Paquet, and O. Winther (2018) Recurrent relational networks for complex relational reasoning. In ICLR, Cited by: §2.
  • [34] B. Perozzi, R. Al-Rfou’, and S. Skiena (2014) DeepWalk: online learning of social representations. In KDD, Cited by: §2.
  • [35] T. Pham, T. Tran, D. Q. Phung, and S. Venkatesh (2016) Column networks for collective classification. In AAAI, Cited by: §2.
  • [36] H. Poon and P. Domingos (2007) Joint inference in information extraction. In AAAI, Cited by: §4.
  • [37] J. Ramon and L. D. Raedt (2000) Multi instance neural network. In ICML Workshop, Cited by: §2.
  • [38] M. Richardson and P. Domingos (2006) Markov logic networks. MLJ. Cited by: §1, §4.
  • [39] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE Transactions on Neural Networks. Cited by: §2.
  • [40] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In ESWC, Cited by: §2.
  • [41] R. Socher, D. Chen, C. Manning, and A. Ng (2013) Reasoning with neural tensor networks for knowledge base completion. In NeurIPS, Cited by: §2.
  • [42] G. Šourek, V. Aschenbrenner, F. Železny, and O. Kuželka (2015) Lifted relational neural networks. In NeurIPS Workshop, Cited by: §3.2, §3.2, §3.2, §4.
  • [43] G. Šourek, S. Manandhar, F. Železný, S. Schockaert, and O. Kuželka (2016) Learning predictive categories using lifted relational neural networks. In ILP, Cited by: §1, §2.
  • [44] G. Šourek, M. Svatoš, F. Železný, S. Schockaert, and O. Kuželka (2017) Stacked structure learning for lifted relational neural networks. In ILP, Cited by: §2.
  • [45] G. G. Towell, J. W. Shavlik, and M. O. Noordewier (1990) Refinement of approximate domain theories by knowledge-based neural networks. In AAAI, Cited by: §2.
  • [46] H. Wang, X. Shi, and D. Yeung (2015) Relational stacked denoising autoencoder for tag recommendation. In AAAI, Cited by: §2.
  • [47] Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph embedding by translating on hyperplanes. In AAAI, Cited by: §2.
  • [48] B. Yang, W. T. Yih, X. He, J. Gao, and L. Deng (2015) Embedding entitities and relations for learning and inference in knowledge bases. In ICLR, Cited by: §2.
  • [49] D. Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao (2014) Relation classification via convolutional deep neural network. In COLING, Cited by: §2.