# Neural Networks for Relational Data

###### Abstract

While deep networks have been enormously successful over the last decade, they rely on flat-feature vector representations, which makes them unsuitable for richly structured domains such as those arising in applications like social network analysis. Such domains rely on relational representations to capture complex relationships between entities and their attributes. Thus, we consider the problem of learning neural networks for relational data. We distinguish ourselves from current approaches that rely on expert hand-coded rules by learning relational random-walk-based features to capture local structural interactions and the resulting network architecture. We further exploit parameter tying of the network weights of the resulting relational neural network, where instances of the same type share parameters. Our experimental results across several standard relational data sets demonstrate the effectiveness of the proposed approach over multiple neural net baselines as well as state-of-the-art statistical relational models.

###### Keywords:

neural networks, relational models

## 1 Introduction

While successful, deep networks have a few important limitations. Apart from the key issue of interpretability, the other major limitation is the requirement of flat inputs (vectors, matrices, tensors), which limits applications to tabular, propositional representations. On the other hand, symbolic and structured representations [14, 7, 13, 38, 1] have the advantage of being interpretable, while also allowing for rich representations that support learning and reasoning at multiple levels of abstraction. This expressiveness allows them to model complex data structures such as graphs far more easily and interpretably than basic propositional representations. While expressive, these models do not incorporate or discover latent relationships between features as effectively as deep networks.

Consequently, there has been a focus on achieving the dream team of logical and statistical learning methods, such as relational neural networks [19, 43]. While specific architectures differ, these methods generally employ hand-coded relational rules or Inductive Logic Programming (ILP, [24]) to identify the domain’s structural rules; these rules are then used with the observed data to unroll and learn a neural network. We improve upon these methods in two specific ways: (1) we employ a recently successful rule learner to automatically extract interpretable rules, which are then employed as a hidden layer of the neural network; (2) we exploit the notion of parameter tying from statistical relational learning models, which allows multiple instances of the same rule to share the same parameter. These two extensions significantly improve the adaptation of neural networks (NNs) for relational data.

We employ Relational Random Walks [22] to extract relational rules from a database, which are then used as the first layer of the NN. These random walks have the advantages of being learned from data (instead of being laboriously hand-coded) and of being interpretable (as walks are rules in a database schema). Given evidence (facts), relational random walks are instantiated (grounded); parameter tying ensures that groundings of the same random walk share the same parameters, so that far fewer network parameters need to be learned during training.

For combining outputs from different groundings of the same clause, we employ combination functions [30, 16]. For instance, given a rule: $\mathbf{Professor}(\mathbf{P})$, $\mathbf{Author}(\mathbf{P},\mathbf{U})$, $\mathbf{Author}(\mathbf{S},\mathbf{U})$, $\mathbf{Student}(\mathbf{S})$, the $\mathbf{ana}$-$\mathbf{bob}$ $\mathbf{Professor}$-$\mathbf{Student}$ pair could have coauthored $6$ papers, while the $\mathbf{cam}$-$\mathbf{dan}$ pair could have coauthored $10$ publications ($\mathbf{U}$). Combination functions are a natural way to compare such relational features arising from rules. Our network handles this in two steps: first, by ensuring that all instances (papers) of a particular $\mathbf{Professor}$-$\mathbf{Student}$ pair share the same weights; second, by combining predictions from each of these instances (papers) using a combination function. We explore the use of Or, Max and Average combination functions. Once the network weights are appropriately constrained by parameter tying and combination functions, they can be learned using standard techniques such as backpropagation.
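To make the role of combination functions concrete, the following minimal sketch combines per-grounding outputs for the two pairs above. The per-paper output of $0.2$ and all function names are illustrative assumptions, not values or code from our system.

```python
import math

def noisy_or(probs):
    """Or: probability that at least one grounding 'fires'."""
    return 1.0 - math.prod(1.0 - p for p in probs)

def combine_max(probs):
    """Max: the strongest single grounding decides."""
    return max(probs)

def combine_average(probs):
    """Average: mean output over all groundings."""
    return sum(probs) / len(probs)

# Hypothetical per-grounding output: each co-authored paper contributes 0.2.
ana_bob = [0.2] * 6    # 6 co-authored papers
cam_dan = [0.2] * 10   # 10 co-authored papers

for name, groundings in [("ana-bob", ana_bob), ("cam-dan", cam_dan)]:
    print(name,
          round(noisy_or(groundings), 3),
          combine_max(groundings),
          round(combine_average(groundings), 3))
```

Under these assumed inputs, Or increases with the number of groundings, whereas Max and Average are insensitive to it; which behavior is preferable depends on the domain.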

We make the following contributions: (1) we learn a NN that can be fully trained from data and with no significant engineering, unlike previous approaches; (2) we combine the successful paradigms of relational random walks and parameter tying from SRL methods; this allows the resulting NN to faithfully model relational data while being fully learnable; (3) we evaluate the proposed approach against recent relational NN approaches and demonstrate its efficacy.

## 2 Related Work

Lifted Relational Neural Networks. Our work is closest, in terms of architecture, to Lifted Relational Neural Networks (LRNN) [43] due to Šourek et al. LRNN uses expert hand-crafted relational rules as input, which are then instantiated (based on data) and rolled out as a ground network. While our approach appears similar to the LRNN framework at a high level, there are significant differences. First, while Šourek et al. exploit tied parameters across examples within the same rule, there is no parameter tying across multiple instances; our model, however, ensures parameter tying of multiple ground instances of the rule (in our case, a relational random walk). Second, since they adopt a fuzzy notion, their system supports weighted facts (called ground atoms in the logic literature). We take a more standard approach and our observations are Boolean. Third, while the previous difference appears to be limiting in our case, note that it leads to a reduction in the number of network weight parameters.

Šourek et al. have extended their work to learn network structure using predicate invention [44]; our work learns relational random walks as rules for the network structure. As we show in our experiments, NNs can not only easily handle a large number of such random walks, but can also use them effectively as a bag of weakly predictive intermediate layers capturing local features. This allows for learning a more robust model than the induced rules, which take a more global view of the domain. Another recent approach is due to Kazemi and Poole [19], who proposed a relational neural network by adding hidden layers to their Relational Logistic Regression [18] model. A key limitation of their work is that they are restricted to unary relation predictions, that is, they can only predict attributes of objects instead of relations between objects. In contrast, ours is a more general framework in that it can be used to predict relations between objects.

Much of this recent work is closely related to a significant body of research called neural-symbolic integration [12], which aims to combine (arguably) two of the oldest formalisms in machine learning: symbolic representations and neural learning architectures. Some of the earliest systems such as KBANN [45] date back to the early 90s; KBANN also rolls out the network architecture from rules, though it only supports propositional rules. Current work, including ours, instead explores relational rules which serve as templates to roll out more complex architectures. Other recent approaches such as CILP++ [11] and Deep Relational Machines [26] incorporate relational information as network layers. However, such models propositionalize relational data into flat feature vectors and hence cannot be seen as truly relational models. A rather distinctive approach in this vein is due to Hu et al. [15], where two independent networks incorporating rules and data are trained together. Finally, NNs have also been trained to approximate ILP clause evaluation [8], perform SLD-resolution in first-order logic [21], and approximate entailment operators in propositional logic [10].

Relational Random Walks. The Path Ranking Algorithm (PRA, [22]) is a key framework, where a combination of random walks replaces exhaustive search in order to answer queries. Recently, Das et al. [6] considered random walks between query entities to perform composition of embeddings of relations on each walk with recurrent neural networks. DeepWalk [34] performs random walks on graphs by treating each node as a word, which results in learning embeddings for each node of the graph. Kaur et al. [17] consider relational random walks to generate count and existential features to train a relational restricted Boltzmann machine [23]. This feature transformation induces propositionalization that could potentially result in loss of information, as we show in our experiments.

Tensor Based Models. Recently, several tensor-based models [31, 4, 41, 3, 47] have been proposed to learn embeddings of objects and relations. Such models have been very effective for large-scale knowledge-base construction. However, they are computationally expensive as they learn parameters for each object and relation in the knowledge base. Furthermore, the embedding into some ambient vector space makes the models more difficult to interpret. Though rule distillation can yield human-readable rules [48], it is another computationally intensive post-processing step, which limits the size of the interpreted rules.

Other Models. Several NNs have been utilized with relational database schemas [2, 37]. These models differ in how they handle 1-to-$N$ joins, cyclicity, and indirect relationships between relations. However, they all learn one network per relation, which makes them computationally expensive. In the same vein, graph-based models take graph structure into consideration during training. Pham et al. [35] perform collective classification via a deep neural network where connections between adjacent layers are established according to the given graph structure. Niepert et al. [32] proposed an algorithm that prepares the relational data to be directly input to a standard convolutional network by assigning an ordering to enable feature convolution. Scarselli et al. [39] proposed Graph Neural Networks, in which one neural network is installed at each node of the graph and is trained by obtaining input from all the incoming edges of the graph. One neural network per node makes the model computationally very expensive. Finally, with the rapid growth of deep learning, relational counterparts of most existing connectionist models have also been proposed [40, 33, 46, 49].

## 3 Neural Networks with Relational Parameter Tying

We first introduce some notation for relational logic, which is used for relational representation, with the domain being represented using constants, variables and predicates. We adopt the following conventions: (1) constants used to represent entities in the domain are written in lower-case (e.g., $\mathbf{ana}$, $\mathbf{bob}$); (2) variables and entity types are capitalized (e.g., $\mathbf{Student}$, $\mathbf{Professor}$); and (3) relations and predicate symbols between entities and attributes are represented as $\mathbf{Q}(\cdot ,\cdot )$. A grounding is a predicate applied to a tuple of terms (i.e., either a full or partial instantiation); e.g., $\mathbf{AdvisedBy}(\mathbf{Student},\mathbf{ana})$ is a partial instantiation.

Rules are constructed from atoms using logical connectives ($\wedge $, $\vee $) and quantifiers ($\exists $, $\forall $). Due to the use of relational random walks, the relational rules that we employ are universally quantified conjunctions of the form $\mathbf{h}\Leftarrow {\mathbf{b}}_{\mathbf{1}}\wedge \dots \wedge {\mathbf{b}}_{\ell}$, where the head $\mathbf{h}$ is the target of prediction and the body ${\mathbf{b}}_{\mathbf{1}}\wedge \dots \wedge {\mathbf{b}}_{\ell}$ corresponds to the conditions that make up the rule (that is, each literal ${\mathbf{b}}_{\mathbf{i}}$ in the body is a predicate $\mathbf{Q}(\cdot ,\cdot )$). We do not consider negations in this work.

An example rule could be $\mathbf{AdvisedBy}(\mathbf{S},\mathbf{P})\Leftarrow \mathbf{Professor}(\mathbf{P})\wedge \mathbf{WorksIn}(\mathbf{P},\mathbf{T})\wedge \mathbf{PartOf}(\mathbf{T},\mathbf{S})\wedge \mathbf{Student}(\mathbf{S})$. This rule states that if a $\mathbf{Student}$ is a part of the project that the $\mathbf{Professor}$ works on, then the $\mathbf{Student}$ is advised by that $\mathbf{Professor}$. The body of the rule is learned as a random walk that starts with $\mathbf{Professor}$ and ends with $\mathbf{Student}$. Such a random walk represents a chain of relations that could possibly connect a $\mathbf{Professor}$ to a $\mathbf{Student}$ and is a relational feature that could help in the prediction. The rule head is the target that we are interested in predicting. Since these rules are essentially “soft” rules, we can also associate clauses with weights, i.e., weighted rules: $(\mathbf{R},\mathbf{w})$.

A relational neural network $\mathcal{N}$ is a set of $M$ weighted rules describing interactions in the domain, ${\{({\mathbf{R}}_{\mathbf{j}},{\mathbf{w}}_{\mathbf{j}})\}}_{j=1}^{M}$. We are given a set of atomic facts $\mathcal{F}$, known to be true (the evidence), and labeled relational training examples ${\{({\mathbf{x}}_{i},{y}_{i})\}}_{i=1}^{\ell}$. In general, labels ${y}_{i}$ can take multiple values corresponding to a multi-class problem. We seek to learn a relational neural network model $\mathcal{N}\equiv {\{({\mathbf{R}}_{\mathbf{j}},{\mathbf{w}}_{\mathbf{j}})\}}_{j=1}^{M}$ to predict a $\mathbf{Target}$ relation, given relational examples $\mathbf{x}$, that is: $y=\mathbf{Target}(\mathbf{x})$.

Given: Set of instances $\mathcal{F}$, $\mathbf{Target}$ relation, relational data set $(\mathbf{x},y)\in \mathcal{D}$.

Construct (structure learning): ${\mathbf{R}}_{\mathbf{j}}$, relational random walk rules (relational features describing the network structure of $\mathcal{N}$).

Train (parameter learning): ${w}_{j}$, rule weights via gradient descent with rule-based parameter tying to identify a sparse set of network weights of $\mathcal{N}$.

###### Example

The movie domain contains the entity types (variables) $\mathrm{Person}(\mathrm{P})$, $\mathrm{Movie}(\mathrm{M})$ and $\mathrm{Genre}(\mathrm{G})$. In addition there are relations (features): $\mathrm{Directed}(\mathrm{P},\mathrm{M})$, $\mathrm{ActedIn}(\mathrm{P},\mathrm{G})$ and $\mathrm{InGenre}(\mathrm{M},\mathrm{G})$. The domain also has relations for entity resolution: $\mathrm{SamePerson}({\mathrm{P}}_{1},{\mathrm{P}}_{2})$ and $\mathrm{SameGenre}({\mathrm{G}}_{1},{\mathrm{G}}_{2})$. The task is to predict if ${\mathrm{P}}_{1}$ worked under ${\mathrm{P}}_{2}$, with the target predicate (label): $\mathrm{WorkedUnder}({\mathrm{P}}_{1},{\mathrm{P}}_{2})$.

### 3.1 Generating Lifted Random Walks

The core component of a neural network model is the architecture, which determines how the various neurons are connected to each other, and ultimately how all the input features interact with each other. In a relational neural network, the architecture is determined by the domain structure, or the set of relational rules that determines how various relations, entities and attributes interact in the domain, as shown earlier with the $\mathbf{AdvisedBy}$ example. While previous approaches employed carefully hand-crafted rules, we instead use relational random walks to define the network architecture and model the local relational structure of the domain. A similar approach was also used by Kaur et al. [17], though there the random walk features were used to instantiate a restricted Boltzmann machine, which has a far more limited architecture. Moreover, their work is not lifted, since it instantiates the entire network before learning.

Relational data is often represented using a lifted graph, which defines the domain’s schema; in such a representation, a relation $\mathbf{Predicate}({\mathbf{Type}}_{\mathbf{1}},{\mathbf{Type}}_{\mathbf{2}})$ is a predicate edge between two type nodes: ${\mathbf{Type}}_{\mathbf{1}}\stackrel{\mathbf{Predicate}}{\to}{\mathbf{Type}}_{\mathbf{2}}$. A relational random walk through a graph is a chain of such edges corresponding to a conjunction of predicates. For a random walk to be semantically sound, we should ensure that the input type (argument domain) of the $(i+1)$-th predicate is the same as the output type (argument range) of the $i$-th predicate.
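The soundness condition can be checked mechanically on the lifted graph. The sketch below is only an illustration: the schema dictionary, the `^-1` naming for inverse edges, and the helper names are assumptions made for this example.

```python
# Hypothetical lifted schema: predicate -> (input type, output type).
SCHEMA = {
    "WorksIn": ("Professor", "Project"),
    "PartOf": ("Project", "Student"),
}

def with_inverses(schema):
    """Add an inverse edge Pred^-1 with reversed argument types for every predicate."""
    extended = dict(schema)
    for pred, (src, dst) in schema.items():
        extended[pred + "^-1"] = (dst, src)
    return extended

def is_sound(walk, schema):
    """True if each predicate's output type matches the next predicate's input type."""
    return all(schema[a][1] == schema[b][0] for a, b in zip(walk, walk[1:]))

schema = with_inverses(SCHEMA)
print(is_sound(["WorksIn", "PartOf"], schema))      # Professor -> Project -> Student
print(is_sound(["WorksIn", "WorksIn"], schema))     # Project does not match Professor
print(is_sound(["WorksIn", "WorksIn^-1"], schema))  # inverse edge leads back to Professor
```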

###### Example (continued)

The body of the rule

$$\begin{aligned}&\mathbf{ActedIn}({\mathbf{P}}_{\mathbf{1}},{\mathbf{G}}_{\mathbf{1}})\wedge \mathbf{SameGenre}({\mathbf{G}}_{\mathbf{1}},{\mathbf{G}}_{\mathbf{2}})\wedge {\mathbf{ActedIn}}^{-\mathbf{1}}({\mathbf{G}}_{\mathbf{2}},{\mathbf{P}}_{\mathbf{2}})\wedge {}\\ &\mathbf{SamePerson}({\mathbf{P}}_{\mathbf{2}},{\mathbf{P}}_{\mathbf{3}})\wedge {\mathbf{ActedIn}}^{-\mathbf{1}}({\mathbf{P}}_{\mathbf{3}},\mathbf{M})\wedge \mathbf{Directed}(\mathbf{M},{\mathbf{P}}_{\mathbf{4}})\Rightarrow \mathbf{WorkedUnder}({\mathbf{P}}_{\mathbf{1}},{\mathbf{P}}_{\mathbf{4}})\end{aligned}$$

can be represented graphically as

$${\mathbf{P}}_{\mathbf{1}}\stackrel{\mathbf{ActedIn}}{\to}{\mathbf{G}}_{\mathbf{1}}\stackrel{\mathbf{SameGenre}}{\to}{\mathbf{G}}_{\mathbf{2}}\stackrel{{\mathbf{ActedIn}}^{-\mathbf{1}}}{\to}{\mathbf{P}}_{\mathbf{2}}\stackrel{\mathbf{SamePerson}}{\to}{\mathbf{P}}_{\mathbf{3}}\stackrel{{\mathbf{ActedIn}}^{-\mathbf{1}}}{\to}\mathbf{M}\stackrel{\mathbf{Directed}}{\to}{\mathbf{P}}_{\mathbf{4}}.$$

This is a lifted random walk between two entities ${\mathrm{P}}_{1}\to {\mathrm{P}}_{4}$ in the target predicate, $\mathrm{WorkedUnder}({\mathrm{P}}_{1},{\mathrm{P}}_{4})$. It is semantically sound as it is possible to chain the second argument of a predicate to the first argument of the succeeding predicate. This walk also contains an inverse predicate ${\mathrm{ActedIn}}^{-1}$, which is distinct from $\mathrm{ActedIn}$ (since the argument types are reversed).

We use the path-constrained random walk approach [22] to generate $M$ lifted random walks ${\mathbf{R}}_{\mathbf{j}}$, $j=1,\dots ,M$. These random walks form the backbone of the lifted neural network, as they are templates for various feature combinations in the domain. They can also be interpreted as domain rules as they impart localized structure to the domain model, that is, they provide a qualitative description of the domain. When these rules, or lifted random walks, have weights associated with them, we are able to endow the rules with a quantitative influence on the target predicate. We now describe a novel approach to network instantiation using these random-walk-based relational features. A key component of the proposed instantiation is rule-based parameter tying, which significantly reduces the number of network parameters to be learned, while still effectively maintaining the quantitative influences described by the relational random walks.
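As a rough illustration of how lifted walks can be collected from a schema, the sketch below repeatedly samples type-consistent walks from a source type to a target type. It is a simplified stand-in and not the path-constrained procedure of [22]; the schema, the function names, and the parameters `max_len`, `attempts` and `seed` are all assumptions made for this example.

```python
import random

# Hypothetical lifted schema: predicate -> (input type, output type).
SCHEMA = {
    "WorksIn": ("Professor", "Project"),
    "PartOf": ("Project", "Student"),
    "Teaches": ("Professor", "Course"),
    "TakenBy": ("Course", "Student"),
}

def sample_walk(schema, start, end, max_len, rng):
    """Sample one type-consistent walk from `start` to `end`; None if it fails."""
    walk, current = [], start
    for _ in range(max_len):
        # Only predicates whose input type matches the current type are candidates.
        options = sorted(p for p, (src, _) in schema.items() if src == current)
        if not options:
            return None
        pred = rng.choice(options)
        walk.append(pred)
        current = schema[pred][1]
        if current == end:
            return tuple(walk)
    return None

def generate_walks(schema, start, end, max_len=4, attempts=200, seed=0):
    """Collect the distinct lifted walks found over several sampling attempts."""
    rng = random.Random(seed)
    walks = set()
    for _ in range(attempts):
        walk = sample_walk(schema, start, end, max_len, rng)
        if walk is not None:
            walks.add(walk)
    return sorted(walks)

walks = generate_walks(SCHEMA, "Professor", "Student")
for walk in walks:
    print(" -> ".join(walk))
```

Each returned tuple plays the role of one rule body ${\mathbf{R}}_{\mathbf{j}}$; inverse predicates could be supported by adding reversed edges to the schema before sampling.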

### 3.2 Network Instantiation

The relational random walks (${\mathbf{R}}_{\mathbf{j}}$) generated in the previous subsection are the relational features of the lifted relational neural network, $\mathcal{N}$. Our goal is to unroll and ground the network with several intermediate layers that capture the relationships expressed by the random walks. A key difference in network construction between our proposed work and recent approaches such as that of Šourek et al. [42] is that we do not perform an exhaustive grounding to generate all possible instances before constructing the network. Instead, we only ground as needed, leading to a much more compact network. We unroll the network in the following manner (cf. Figure 1).

Output Layer: For the $\mathbf{Target}$, which is also the head $\mathbf{h}$ in all the rules ${\mathbf{R}}_{\mathbf{j}}$, introduce an output neuron called the target neuron, ${A}_{h}$. With one-hot encoding of the target labels, this architecture can handle multi-class problems. The target neuron uses the softmax activation function. Without loss of generality, we describe the rest of the network unrolling assuming a single output neuron.

Combining Rules Layer: The target neuron is connected to $M$ lifted rule neurons, each corresponding to one of the lifted relational random walks, $({\mathbf{R}}_{\mathbf{j}},{\mathbf{w}}_{\mathbf{j}})$. Each rule ${\mathbf{R}}_{\mathbf{j}}$ is a conjunction of predicates defined by random walks:

$${\mathbf{Q}}_{\mathbf{1}}^{\mathbf{j}}(\mathbf{X},\cdot )\wedge \dots \wedge {\mathbf{Q}}_{\mathbf{L}}^{\mathbf{j}}(\cdot ,\mathbf{Z})\Rightarrow \mathbf{Target}(\mathbf{X},\mathbf{Z}),\quad \mathbf{j}=\mathbf{1},\dots ,\mathbf{M}, \tag{1}$$

and corresponds to the lifted rule neuron ${A}_{j}$. This layer of neurons is fully connected to the output layer to ensure that all the lifted random walks (that capture the domain structure) influence the output. The extent of their influence is determined by learnable weights, ${u}_{j}$ between ${A}_{j}$ and the output neuron ${A}_{h}$.

In Fig. 1, we see that the rule neuron ${A}_{j}$ is connected to the neurons ${A}_{ji}$; these neurons correspond to the ${N}_{j}$ instantiations of the random walk ${\mathbf{R}}_{\mathbf{j}}$. The lifted rule neuron ${A}_{j}$ aims to combine the influence of the groundings/instantiations of the random-walk feature ${\mathbf{R}}_{\mathbf{j}}$ that are true in the evidence. Thus, each lifted rule neuron can also be viewed as a rule combination neuron. The activation function of a rule combination neuron can be any aggregator or combining rule [30]. This can include value aggregators such as weighted mean or max, or distribution aggregators (if the inputs to this layer are probabilities) such as Noisy-Or. Many such aggregators can be incorporated into the combining rules layer with appropriate weights (${v}_{ji}$) and activation functions of the rule neurons. For instance, combining rule instantiations $\mathsf{out}({A}_{ji})$ with a weighted mean will require learning ${v}_{ji}$, with the nodes using unit functions for activation. The formulation of this layer is much more general and subsumes the approach of Šourek et al. [42], which uses a max combination layer.

Grounding Layer: For each instantiated (ground) random walk ${\mathbf{R}}_{\mathbf{j}}{\bm{\theta}}_{\mathbf{i}}$, $\mathbf{i}=\mathbf{1},\dots ,{\mathbf{N}}_{\mathbf{j}}$, we introduce a ground rule neuron, ${A}_{ji}$. This ground rule neuron represents the $i$-th instantiation (grounding) of the body of the $j$-th rule, ${\mathbf{R}}_{\mathbf{j}}{\bm{\theta}}_{\mathbf{i}}$: ${\mathbf{Q}}_{\mathbf{1}}^{\mathbf{j}}{\bm{\theta}}_{\mathbf{i}}\wedge \dots \wedge {\mathbf{Q}}_{\ell}^{\mathbf{j}}{\bm{\theta}}_{\mathbf{i}}$ (cf. Eq. 1). The activation function of a ground rule neuron is a logical AND ($\wedge $); it is only activated when all its constituent inputs are true (that is, only when the entire instantiation is true in the evidence).

This requires all the constituent facts ${\mathbf{Q}}_{\mathbf{1}}^{\mathbf{j}}{\bm{\theta}}_{\mathbf{i}},\dots ,{\mathbf{Q}}_{\ell}^{\mathbf{j}}{\bm{\theta}}_{\mathbf{i}}$ to be in the evidence. Thus, the $(j,i)$-th ground rule neuron is connected to all the fact neurons that appear in its corresponding instantiated rule body. A key novelty of our approach is regarding relational parameter tying: the weights of connections between the fact and grounding layers are tied by the rule these facts appear in together. This is described in detail further below.
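The grounding step can be pictured with a small sketch that finds every instantiation of a chain-shaped rule body linking the two entities of one example, and then applies the logical AND. The facts, entity names and helper functions are hypothetical, and inverse predicates are omitted for brevity.

```python
from collections import defaultdict

# Hypothetical evidence: (predicate, first argument, second argument) triples.
FACTS = {
    ("WorksIn", "ana", "proj1"),
    ("WorksIn", "ana", "proj2"),
    ("PartOf", "proj1", "bob"),
    ("PartOf", "proj2", "bob"),
    ("PartOf", "proj2", "dan"),
}

def ground_walk(walk, facts, start, end):
    """Return each grounding of `walk` from `start` to `end` as a list of facts."""
    index = defaultdict(list)
    for pred, a, b in facts:
        index[(pred, a)].append(b)
    partial = [(start, [])]  # (current entity, facts used so far)
    for pred in walk:
        # Extend every partial chain by one fact of the next predicate.
        partial = [
            (nxt, used + [(pred, cur, nxt)])
            for cur, used in partial
            for nxt in sorted(index[(pred, cur)])
        ]
    return [used for cur, used in partial if cur == end]

def ground_rule_activation(grounding, facts):
    """Logical AND: 1 only if every constituent fact is in the evidence."""
    return int(all(fact in facts for fact in grounding))

groundings = ground_walk(("WorksIn", "PartOf"), FACTS, "ana", "bob")
for g in groundings:
    print(g, ground_rule_activation(g, FACTS))
```

Only groundings that connect the entities of the current example are produced, which mirrors the idea of grounding as needed rather than exhaustively.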

Input Layer: Each instantiated (grounded) predicate that appears as a part of an instantiated rule body is a fact, that is ${\mathbf{Q}}_{\mathbf{k}}^{\mathbf{j}}{\bm{\theta}}_{\mathbf{i}}\in \mathcal{F}$. For each such instantiated fact, we create a fact neuron ${A}_{f}$, ensuring that each unique fact in evidence has only one single neuron associated with it. Every example is a collection of facts, that is, example ${\mathbf{x}}_{i}\equiv {\mathcal{F}}_{i}\subset \mathcal{F}$. Thus, an example is input into the system by simply activating its constituent facts in the input layer.

Relational Parameter Tying: The most important thing to note about this construction is that we employ rule-based parameter tying for the weights between the grounding layer and the input/facts layer. Parameter tying ensures that instances corresponding to an example all share the same weight ${w}_{j}$ if they occur in the same lifted rule ${\mathbf{R}}_{\mathbf{j}}$. The shared weights ${w}_{j}$ are propagated through the network in a bottom-up fashion, ensuring that weights in the succeeding hidden layers are influenced by them.
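One possible reading of this construction is sketched below as a forward pass for a single example. It is an assumption-laden illustration rather than our implementation: ground rule neurons are taken to emit a sigmoid of the tied weight, a sigmoid replaces the softmax for the single output neuron, and the numbers of groundings and all weight values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def combine(values, how):
    """Combining rule applied by a lifted rule neuron A_j."""
    if not values:  # rule j has no true grounding for this example
        return 0.0
    if how == "or":
        return 1.0 - math.prod(1.0 - v for v in values)
    if how == "max":
        return max(values)
    if how == "average":
        return sum(values) / len(values)
    raise ValueError(f"unknown combination function: {how}")

def forward(num_groundings, w, u, how="or"):
    """num_groundings[j] = N_j; w[j] = tied weight of rule j; u[j] = output weight."""
    rule_outputs = []
    for n_j, w_j in zip(num_groundings, w):
        # All N_j true groundings of rule j share the same tied weight w_j.
        ground_outputs = [sigmoid(w_j)] * n_j
        rule_outputs.append(combine(ground_outputs, how))
    z = sum(u_j * a_j for u_j, a_j in zip(u, rule_outputs))
    return sigmoid(z)

num_groundings = [2, 1]   # hypothetical N_1 and N_2
w = [0.5, -0.3]           # one tied weight per lifted rule
u = [1.0, 1.0]            # weights between rule neurons and the output neuron

print(forward(num_groundings, w, u, how="or"))
print("tied weights:", len(w), "| one weight per grounding:", sum(num_groundings))
```

In this sketch the number of grounding-layer weights equals the number of lifted rules $M$, regardless of how many groundings each example produces.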

Our approach to parameter tying is in sharp contrast to that of Šourek et al. [42], who learn the weights of the network edges between the output layer and the combining rules layer. Furthermore, they also use fuzzy facts (weighted instances), whereas in our case, the facts/instances are Boolean, though their edge weights are tied. Our approach also differs from that of Kaur et al. [17], who also use relational random walks. From a parametric standpoint, Kaur et al. used relational random walks as features for a restricted Boltzmann machine, where the instance neurons and the rule neurons form a bipartite graph. Thus, the relational RBM formulation has significantly more edges, and commensurately many more parameters to optimize during learning.

###### Example (continued, see Fig. 2)

Consider two lifted random walks $({\mathrm{R}}_{1},{w}_{1})$ and $({\mathrm{R}}_{2},{w}_{2})$ for the target predicate $\mathrm{WorkedUnder}({\mathrm{P}}_{1},{\mathrm{P}}_{2})$:

$\mathbf{WorkedUnder}({\mathbf{P}}_{\mathbf{1}},{\mathbf{P}}_{\mathbf{2}})\Leftarrow \mathbf{ActedIn}({\mathbf{P}}_{\mathbf{1}},\mathbf{M})\wedge {\mathbf{Directed}}^{-\mathbf{1}}(\mathbf{M},{\mathbf{P}}_{\mathbf{2}}),$

$\mathbf{WorkedUnder}({\mathbf{P}}_{\mathbf{1}},{\mathbf{P}}_{\mathbf{2}})\Leftarrow \mathbf{SamePerson}({\mathbf{P}}_{\mathbf{1}},{\mathbf{P}}_{\mathbf{3}})\wedge \mathbf{ActedIn}({\mathbf{P}}_{\mathbf{3}},\mathbf{M})\wedge {\mathbf{Directed}}^{-\mathbf{1}}(\mathbf{M},{\mathbf{P}}_{\mathbf{2}}).$

Note that while the inverse predicate ${\mathrm{Directed}}^{-1}(\mathrm{M},\mathrm{P})$ is syntactically different from $\mathrm{Directed}(\mathrm{P},\mathrm{M})$ (the argument order is reversed), the two are semantically the same. The output layer consists of a single neuron ${A}_{h}$ corresponding to the binary target $\mathrm{WorkedUnder}$. The lifted rule layer (also known as the combining rules layer) has two lifted rule nodes: ${A}_{1}$, corresponding to rule ${\mathrm{R}}_{1}$, and ${A}_{2}$, corresponding to rule ${\mathrm{R}}_{2}$. These rule nodes combine inputs corresponding to instantiations that are true in the evidence. The network is unrolled based on the specific training example, for instance $\mathrm{WorkedUnder}(\mathrm{Leo},\mathrm{Marty})$. For this example, the rule ${\mathrm{R}}_{1}$ has two instantiations that are true in the evidence. We then introduce a ground rule node for each such instantiation:

${A}_{11}:\ \mathbf{ActedIn}(\mathbf{Leo},\text{"}\mathbf{The\ Departed}\text{"})\wedge {\mathbf{Directed}}^{-\mathbf{1}}(\text{"}\mathbf{The\ Departed}\text{"},\mathbf{Marty}),$

${A}_{12}:\ \mathbf{ActedIn}(\mathbf{Leo},\text{"}\mathbf{The\ Aviator}\text{"})\wedge {\mathbf{Directed}}^{-\mathbf{1}}(\text{"}\mathbf{The\ Aviator}\text{"},\mathbf{Marty}).$

The rule ${\mathrm{R}}_{\mathrm{2}}$ has only one instantiation, and consequently only one node:

${A}_{21}:\ \mathbf{SamePerson}(\mathbf{Leo},\mathbf{Leonardo})\wedge \mathbf{ActedIn}(\mathbf{Leo},\text{"}\mathbf{The\ Departed}\text{"})\wedge {\mathbf{Directed}}^{-\mathbf{1}}(\text{"}\mathbf{The\ Departed}\text{"},\mathbf{Marty}).$

The grounding layer consists of ground rule nodes corresponding to instantiations of rules that are true in the evidence. The edges ${A}_{ji}\to {A}_{j}$ have weights ${v}_{ji}$ that depend on the combining rule implemented in ${A}_{j}$. In this example, the combining rule is average, so we have ${v}_{11}={v}_{12}=\frac{1}{2}$ and ${v}_{21}=1$. The input layer consists of the atomic facts in evidence: $f\in \mathcal{F}$. The fact nodes $\mathrm{ActedIn}(\mathrm{Leo},\text{"The Aviator"})$ and ${\mathrm{Directed}}^{-1}(\text{"The Aviator"},\mathrm{Marty})$ appear in the grounding ${\mathrm{R}}_{1}{\bm{\theta}}_{2}$ and are connected to the corresponding ground rule neuron ${A}_{12}$. Finally, parameters are tied on the edges between the facts layer and the grounding layer. This ensures that all facts that ultimately contribute to a rule are pooled together, which increases the influence of the rule during weight learning. This, in turn, ensures that a rule that holds strongly in the evidence gets a higher weight.
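The unrolled network of this example can be traced by hand with a few lines of Python. The following is a minimal sketch under stated assumptions, not the authors' implementation: the logistic activation at the ground rule neurons and at the output, the numeric values of the tied weights `w`, and the output-layer weights `u` are all placeholders chosen for illustration; only the groundings, the Boolean fact inputs, and the average combining rule come from the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

DEPARTED, AVIATOR = "The Departed", "The Aviator"

# Ground rule nodes of the example: rule index j -> list of instantiated rule bodies.
GROUNDINGS = {
    1: [  # R_1
        [("ActedIn", "Leo", DEPARTED), ("Directed^-1", DEPARTED, "Marty")],  # A_11
        [("ActedIn", "Leo", AVIATOR), ("Directed^-1", AVIATOR, "Marty")],    # A_12
    ],
    2: [  # R_2
        [("SamePerson", "Leo", "Leonardo"), ("ActedIn", "Leo", DEPARTED),
         ("Directed^-1", DEPARTED, "Marty")],                                # A_21
    ],
}

def forward(groundings, evidence, w, u):
    """Return ({j: activation of lifted rule node A_j}, output) for one example."""
    rule_acts = {}
    for j, bodies in groundings.items():
        ground_acts = []
        for body in bodies:
            # Boolean fact inputs; every edge into a grounding of R_j shares w[j].
            z = sum(w[j] * (1.0 if fact in evidence else 0.0) for fact in body)
            ground_acts.append(sigmoid(z))  # assumed activation
        v = 1.0 / len(ground_acts)  # average combining rule: v_ji = 1 / #groundings
        rule_acts[j] = sum(v * a for a in ground_acts)
    output = sigmoid(sum(u[j] * rule_acts[j] for j in rule_acts))
    return rule_acts, output

# All facts of the example are in the evidence.
evidence = {fact for bodies in GROUNDINGS.values() for body in bodies for fact in body}
w = {1: 0.5, 2: -0.25}  # hypothetical tied weights, one per lifted rule
u = {1: 1.0, 2: 1.0}    # hypothetical weights into the output neuron
rule_acts, output = forward(GROUNDINGS, evidence, w, u)
```

Because both groundings of the first rule have two active facts and share the same weight, they produce identical activations, and averaging them with $v_{11}=v_{12}=\frac{1}{2}$ returns that same value.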

Once the network $\mathcal{N}\bm{\theta}$ is instantiated, the weights ${w}_{j}$ and ${u}_{j}$ can be learned using standard techniques such as backpropagation. We denote our approach Neural Networks with Relational Parameter Tying (NNRPT). The tied parameters incorporate the structure captured by the relational features (lifted random walks), leading to a network with significantly fewer weights, while also endowing it with semantic interpretability regarding the discriminative power of the relational features. We now demonstrate the importance of parameter tying and the use of relational random walks as compared to previous frameworks.
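To make the saving in weights concrete, the sketch below counts the weights on fact-to-grounding edges for the running example. The untied count is a hypothetical baseline (one free weight per edge) introduced only for comparison; it is not a model from the paper.

```python
def count_fact_layer_weights(body_sizes, tied):
    """Count weights on fact-to-grounding edges.

    body_sizes maps a rule index j to the body length of each of its groundings.
    """
    if tied:
        return len(body_sizes)  # one shared w_j per lifted rule
    # Hypothetical untied baseline: one free weight per fact-to-grounding edge.
    return sum(sum(sizes) for sizes in body_sizes.values())

# Running example: A_11 and A_12 have two facts each, A_21 has three.
body_sizes = {1: [2, 2], 2: [3]}
n_tied = count_fact_layer_weights(body_sizes, tied=True)
n_untied = count_fact_layer_weights(body_sizes, tied=False)
print(n_tied, n_untied)  # prints "2 7"
```

With tying, the count stays at the number of lifted rules however many groundings an example has, whereas the untied count grows with every additional grounding.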

## 4 Experiments

Our empirical evaluation aims to answer the following questions explicitly (code: https://github.com/navdeepkjohal/NNRPT):

Q1: How does $\mathtt{NNRPT}$ compare to the state-of-the-art SRL models, i.e., what is the value of learning a neural net over standard models?

Q2: How does $\mathtt{NNRPT}$ compare to propositionalization models, i.e., what is the need for parameterization of standard neural networks?

Q3: How does $\mathtt{NNRPT}$ compare to other relational neural networks in the literature?

#### Data Sets:

We use five standard data sets to evaluate our algorithm (see Table 1). Uw-Cse [38] is a standard data set that consists of predicates and relations such as $\mathbf{Professor}$, $\mathbf{Student}$, $\mathbf{Publication}$, $\mathbf{HasPosition}$ and $\mathbf{TaughtBy}$. The data set contains information from $5$ different areas of computer science about professors, students and courses, and the task is to predict the $\mathbf{AdvisedBy}$ relationship between a professor and a student. Imdb was first created by Mihalkova and Mooney [27] and contains nine predicates such as $\mathbf{Gender}$, $\mathbf{Genre}$, $\mathbf{Movie}$, and $\mathbf{Director}$. We predict whether an actor has $\mathbf{WorkedUnder}$ a director. Cora is a citation matching data set modified by Poon and Domingos [36]. It contains the predicates $\mathbf{Author}$, $\mathbf{Title}$, $\mathbf{Venue}$, $\mathbf{HasWordAuthor}$, $\mathbf{HasWordTitle}$, $\mathbf{HasWordVenue}$, $\mathbf{SameAuthor}$, and $\mathbf{SameTitle}$. The task is to predict whether one venue is the $\mathbf{SameVenue}$ as another.

Mutagenesis [25] was originally used to predict whether a compound is mutagenic or not. It consists of properties of compounds, their constituent atoms and the type of bond that exists between atoms. We performed relation prediction of whether an atom is a constituent of a given molecule or not ($\mathbf{MoleAtm}(\mathbf{AtomID},\mathbf{MolID})$). Sports consists of facts from the sports domain crawled by the Never-Ending Language Learner (NELL, [5]), including details of players, sports, individual plays, and league information. The goal is to predict which sport a particular team plays.

Table 1: Data sets used in the experiments (#RW: number of random walks; #Samp/RW: number of groundings sampled per random walk per example).

| Domain | Target | #Facts | #Pos | #Neg | #RW | #Samp/RW |
|---|---|---|---|---|---|---|
| Uw-Cse | $\mathtt{advisedBy}$ | 2817 | 90 | 180 | 2500 | 1000 |
| Mutagenesis | $\mathtt{MoleAtm}$ | 29986 | 1000 | 2000 | 100 | 100 |
| Cora | $\mathtt{SameVenue}$ | 31086 | 2331 | 4662 | 100 | 100 |
| Imdb | $\mathtt{WorkedUnder}$ | 914 | 305 | 710 | 80 | - |
| Sports | $\mathtt{TeamPlaysSport}$ | 7824 | 200 | 400 | 200 | 100 |

#### Baselines and Experimental Details:

To answer Q1, we compare $\mathtt{NNRPT}$ with the more recent, state-of-the-art relational gradient-boosting methods $\mathtt{RDN}$-$\mathtt{Boost}$ [29] and $\mathtt{MLN}$-$\mathtt{Boost}$ [20], and with the relational restricted Boltzmann machines $\mathtt{RRBM}$-$\mathtt{E}$ and $\mathtt{RRBM}$-$\mathtt{C}$ [17]. As the random walks chain binary predicates in our model, we convert unary and ternary predicates into binary predicates for all data sets. Further, to maintain consistency in experimentation, we use the same resulting predicates across all our baselines as well. We run $\mathtt{RDN}$-$\mathtt{Boost}$ and $\mathtt{MLN}$-$\mathtt{Boost}$ with their default settings and learn $20$ trees for each model. Also, we train $\mathtt{RRBM}$-$\mathtt{E}$ and $\mathtt{RRBM}$-$\mathtt{C}$ according to the settings recommended in [17].

For $\mathtt{NNRPT}$, we generate random walks by considering each predicate and its inverse to be two distinct predicates. Also, we avoid loops in the random walks by enforcing sanity constraints on the random walk generation. We consider $100$ random walks for Mutagenesis and Cora, $80$ random walks for Imdb, $200$ random walks for Sports and $2500$ random walks for Uw-Cse, as suggested by Kaur et al. [17] (see Table 1). Since we use a large number of random walks, exhaustive grounding becomes prohibitively expensive. To overcome this, we sample groundings for each random walk for large data sets. Specifically, we sample $100$ groundings per random walk per example for Cora, Sports, and Mutagenesis, and $1000$ groundings per random walk per example for Uw-Cse (see Table 1).
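The two preprocessing ideas in this paragraph, inverse predicates and capped grounding samples, can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, and uniform sampling without replacement is an assumption, since the text does not state how groundings are drawn. The loop-avoiding constraints on walk generation are not covered here.

```python
import random

def add_inverses(facts):
    """Add, for each binary fact Pred(a, b), the distinct fact Pred^-1(b, a)."""
    return list(facts) + [(pred + "^-1", b, a) for (pred, a, b) in facts]

def sample_groundings(groundings, k, rng):
    """Keep at most k groundings of one random walk for one example."""
    if len(groundings) <= k:
        return list(groundings)      # small enough: ground exhaustively
    return rng.sample(groundings, k)  # assumed: uniform, without replacement

facts = add_inverses([("Directed", "Marty", "The Departed")])

rng = random.Random(0)  # fixed seed so the sketch is repeatable
all_groundings = [("grounding", i) for i in range(250)]  # stand-in groundings
kept = sample_groundings(all_groundings, 100, rng)
```

Here `k` plays the role of the #Samp/RW column of Table 1, for example 100 for Cora and 1000 for Uw-Cse.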

For all experiments, we set the positive to negative example ratio to be $1:2$ for training, set the combination function to be average and perform $5$-fold cross validation. For $\mathtt{NNRPT}$, we set the learning rate to be $0.05$, batch size to $1$, and number of epochs to $1$. We train our model with ${L}_{1}$-regularized AdaGrad [9]. Since these are relational data sets where the data is skewed, AUC-PR and AUC-ROC are better measures than likelihood and accuracy.
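For readers unfamiliar with the optimizer, one common form of an $L_1$-regularized AdaGrad step is a per-coordinate AdaGrad update followed by soft-thresholding. The sketch below assumes that form; the text does not say which variant was used, and the regularization strength `l1` and the stabilizer `eps` are hypothetical values. Only the learning rate of $0.05$ is taken from the paragraph above.

```python
import math

def adagrad_l1_step(w, grad, accum, lr=0.05, l1=0.01, eps=1e-8):
    """One per-coordinate AdaGrad step followed by L1 soft-thresholding."""
    new_w, new_accum = [], []
    for w_i, g_i, a_i in zip(w, grad, accum):
        a_i = a_i + g_i * g_i                  # accumulate squared gradients
        step = lr / (math.sqrt(a_i) + eps)     # per-coordinate learning rate
        z = w_i - step * g_i                   # plain AdaGrad update
        shrunk = max(abs(z) - step * l1, 0.0)  # proximal step for the L1 penalty
        new_w.append(math.copysign(shrunk, z))
        new_accum.append(a_i)
    return new_w, new_accum

# Two tied weights; the second has a zero gradient and stays at zero.
weights, accum = adagrad_l1_step([0.5, 0.0], [1.0, 0.0], [0.0, 0.0])
```

The soft-thresholding step is what can drive the weight of an uninformative random walk exactly to zero.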

To answer Q2, we generated flat feature vectors by Bottom Clause Propositionalization (BCP, [11]), according to which one bottom clause is generated for each example. BCP considers each predicate in the body of the bottom clause as a unique feature when it propositionalizes bottom clauses into a flat feature vector. We use Progol [28] to generate these bottom clauses. After propositionalization, we train two connectionist models: a propositionalized restricted Boltzmann machine ($\mathtt{BCP}$-$\mathtt{RBM}$) and a propositionalized neural network ($\mathtt{BCP}$-$\mathtt{NN}$). The NN has two hidden layers in our experiments, which makes the $\mathtt{BCP}$-$\mathtt{NN}$ model a modified version of CILP++ [11], which had one hidden layer. The hyper-parameters of both models were optimized by line search on a validation set.

To answer Q3, we compare our model with Lifted Relational Neural Networks ($\mathtt{LRNN}$, [42]). To ensure fairness, we perform structure learning by using PROGOL [28] and input the same clauses to both $\mathtt{LRNN}$ and $\mathtt{NNRPT}$. PROGOL learned $4$ clauses for Cora, $8$ clauses for Imdb, $3$ clauses for Sports, $10$ clauses for Uw-Cse and $11$ clauses for Mutagenesis in our experiment.

Table 2: Comparison of $\mathtt{NNRPT}$ with state-of-the-art SRL models (Q1).

| Data Set | Measure | $\mathtt{RDN}$-$\mathtt{Boost}$ | $\mathtt{MLN}$-$\mathtt{Boost}$ | $\mathtt{RRBM}$-$\mathtt{E}$ | $\mathtt{RRBM}$-$\mathtt{C}$ | $\mathtt{NNRPT}$ |
|---|---|---|---|---|---|---|
| Uw-Cse | AUC-ROC | 0.973 ± 0.014 | 0.968 ± 0.014 | 0.975 ± 0.013 | 0.968 ± 0.011 | 0.959 ± 0.024 |
| | AUC-PR | 0.931 ± 0.036 | 0.916 ± 0.035 | 0.923 ± 0.056 | 0.924 ± 0.040 | 0.896 ± 0.063 |
| Imdb | AUC-ROC | 0.955 ± 0.046 | 0.944 ± 0.070 | 1.000 ± 0.000 | 0.997 ± 0.006 | 0.984 ± 0.025 |
| | AUC-PR | 0.863 ± 0.112 | 0.839 ± 0.169 | 1.000 ± 0.000 | 0.992 ± 0.017 | 0.951 ± 0.082 |
| Cora | AUC-ROC | 0.895 ± 0.183 | 0.835 ± 0.035 | 0.984 ± 0.009 | 0.867 ± 0.041 | 0.952 ± 0.043 |
| | AUC-PR | 0.833 ± 0.259 | 0.799 ± 0.034 | 0.948 ± 0.042 | 0.825 ± 0.050 | 0.899 ± 0.070 |
| Mutag. | AUC-ROC | 0.999 ± 0.000 | 0.999 ± 0.000 | 0.999 ± 0.000 | 0.998 ± 0.001 | 0.981 ± 0.024 |
| | AUC-PR | 0.999 ± 0.000 | 0.999 ± 0.000 | 0.999 ± 0.000 | 0.997 ± 0.002 | 0.970 ± 0.039 |
| Sports | AUC-ROC | 0.801 ± 0.026 | 0.806 ± 0.016 | 0.760 ± 0.016 | 0.656 ± 0.071 | 0.780 ± 0.026 |
| | AUC-PR | 0.670 ± 0.028 | 0.652 ± 0.032 | 0.634 ± 0.020 | 0.648 ± 0.085 | 0.668 ± 0.070 |

Table 3: Comparison of $\mathtt{NNRPT}$ with propositionalization models (Q2).

| Data Set | Measure | $\mathtt{BCP}$-$\mathtt{RBM}$ | $\mathtt{BCP}$-$\mathtt{NN}$ | $\mathtt{NNRPT}$ |
|---|---|---|---|---|
| Uw-Cse | AUC-ROC | 0.951 ± 0.041 | 0.868 ± 0.053 | 0.959 ± 0.024 |
| | AUC-PR | 0.860 ± 0.114 | 0.869 ± 0.033 | 0.896 ± 0.063 |
| Imdb | AUC-ROC | 0.780 ± 0.164 | 0.540 ± 0.152 | 0.984 ± 0.025 |
| | AUC-PR | 0.367 ± 0.139 | 0.536 ± 0.231 | 0.951 ± 0.082 |
| Cora | AUC-ROC | 0.801 ± 0.017 | 0.670 ± 0.064 | 0.952 ± 0.043 |
| | AUC-PR | 0.647 ± 0.050 | 0.658 ± 0.064 | 0.899 ± 0.070 |
| Mutag. | AUC-ROC | 0.991 ± 0.003 | 0.945 ± 0.019 | 0.981 ± 0.024 |
| | AUC-PR | 0.995 ± 0.001 | 0.973 ± 0.012 | 0.970 ± 0.039 |
| Sports | AUC-ROC | 0.664 ± 0.021 | 0.543 ± 0.037 | 0.780 ± 0.026 |
| | AUC-PR | 0.532 ± 0.041 | 0.499 ± 0.065 | 0.668 ± 0.070 |

#### Results:

Table 2 compares our $\mathtt{NNRPT}$ to $\mathtt{MLN}$-$\mathtt{Boost}$, $\mathtt{RDN}$-$\mathtt{Boost}$, $\mathtt{RRBM}$-$\mathtt{E}$ and $\mathtt{RRBM}$-$\mathtt{C}$ to answer Q1. As we see, $\mathtt{NNRPT}$ is significantly better than $\mathtt{RRBM}$-$\mathtt{C}$ for Cora and Sports on both AUC-ROC and AUC-PR, and performs comparably on the other data sets. It also performs better than $\mathtt{MLN}$-$\mathtt{Boost}$ and $\mathtt{RDN}$-$\mathtt{Boost}$ on the Imdb and Cora data sets, and comparably on the other data sets. Similarly, it performs better than $\mathtt{RRBM}$-$\mathtt{E}$ on Sports, on both AUC-ROC and AUC-PR, and comparably on the other data sets. Broadly, Q1 can be answered affirmatively in that $\mathtt{NNRPT}$ performs comparably to or better than state-of-the-art SRL models.

Table 3 shows the comparison of $\mathtt{NNRPT}$ with two propositionalization models, $\mathtt{BCP}$-$\mathtt{RBM}$ and $\mathtt{BCP}$-$\mathtt{NN}$, in order to answer Q2. $\mathtt{NNRPT}$ performs better than $\mathtt{BCP}$-$\mathtt{RBM}$ on all the data sets except Mutagenesis, where the two models have similar performance. $\mathtt{NNRPT}$ also performs better than $\mathtt{BCP}$-$\mathtt{NN}$ on all data sets. It should be noted that BCP feature generation sometimes introduces a large positive-to-negative example skew (for example, in the Imdb data set), which can gravely affect the performance of the propositional model, as we observe in Table 3. This emphasizes the need for designing models that can handle relational data directly and without propositionalization; our proposed model is an effort in this direction. Q2 can now be answered affirmatively: $\mathtt{NNRPT}$ performs better than propositionalization models.

Table 4 compares the performance of $\mathtt{NNRPT}$ and $\mathtt{LRNN}$ when both use clauses learned by PROGOL [28]. $\mathtt{NNRPT}$ performs better on Uw-Cse and Sports when evaluated using AUC-PR. This result is especially significant because these data sets are considerably skewed. $\mathtt{NNRPT}$ also outperforms $\mathtt{LRNN}$ on Cora and Mutagenesis. Lastly, $\mathtt{NNRPT}$ has comparable performance on Imdb on both AUC-ROC and AUC-PR. The big performance gap between the two models on Cora is likely because $\mathtt{LRNN}$ could not build effective models with the small number of clauses (four) typically learned by PROGOL. In contrast, even with very few clauses, $\mathtt{NNRPT}$ is able to outperform $\mathtt{LRNN}$. This helps us answer Q3 affirmatively: $\mathtt{NNRPT}$ offers many advantages over state-of-the-art relational neural networks.

In summary, our experiments clearly show the benefits of parameter tying as well as the expressivity of relational random walks in tightly integrating with a neural network model across a wide variety of domains and settings. The key strengths of $\mathtt{NNRPT}$ are that it can (1) efficiently incorporate a large number of relational features, (2) capture local qualitative structure through relational random walk features, and (3) tie feature weights (parameter tying) in a manner that captures the global quantitative influences.

Table 4: Comparison of $\mathtt{NNRPT}$ and $\mathtt{LRNN}$ when both use clauses learned by PROGOL [28] (Q3).

| Model | Measure | Uw-Cse | Imdb | Cora | Mutagen. | Sports |
|---|---|---|---|---|---|---|
| $\mathtt{LRNN}$ | AUC-ROC | 0.923 ± 0.027 | 0.995 ± 0.004 | 0.503 ± 0.003 | 0.500 ± 0.000 | 0.741 ± 0.016 |
| | AUC-PR | 0.826 ± 0.056 | 0.985 ± 0.013 | 0.356 ± 0.006 | 0.335 ± 0.000 | 0.527 ± 0.036 |
| $\mathtt{NNRPT}$ | AUC-ROC | 0.700 ± 0.186 | 0.997 ± 0.007 | 0.968 ± 0.022 | 0.532 ± 0.019 | 0.657 ± 0.014 |
| | AUC-PR | 0.910 ± 0.072 | 0.992 ± 0.017 | 0.943 ± 0.032 | 0.412 ± 0.032 | 0.658 ± 0.056 |

#### Discussion:

A typical convolutional neural network (CNN) is composed of three kinds of layers: convolution, max-pooling and (fully-connected) output layers. $\mathtt{NNRPT}$ can be considered a special instance of a convolutional network in relational domains, where the fact-grounding layer edges are the equivalent of convolution, the combining rules layer represents pooling, and the softmax layer is the fully-connected layer. Suppose we perform a full and exhaustive grounding of the neural network in $\mathtt{NNRPT}$, and let $M$ be the number of lifted random walks (template rules), $N$ the number of grounded random walks (instances of a template rule) and $|\mathcal{F}|$ the number of all facts (atomic instances). The data can be represented as a three-dimensional tensor $B$ of size $M\times N\times |\mathcal{F}|$, whose elements are precisely ${B}_{ijk}={\mathbf{Q}}_{k}^{j}{\bm{\theta}}_{i}$ (see the discussion of the Input Layer in Section 3.2). In addition, if we consider the rule layer as a tensor $T$ of size $M\times 1\times |\mathcal{F}|$, where parameters are tied across $|\mathcal{F}|$, then ${[{w}_{m1f}]}_{m=1}^{M}$ constitutes the convolving filter that is repeatedly applied to each of $|\mathcal{F}|$ ground instances. The resulting tensor $G$ of size $M\times N\times 1$, obtained by the composition $G=B\circ T$ and representing the output of the grounding layer, passes through a pooling layer (here, the rule-combination layer) that downsamples the data to produce a new tensor $C$ of size $M\times 1\times 1$. The tensor $C$, when composed with the fully-connected non-linear layer $F$ of size $M\times |\mathcal{O}|$ of our model, produces a tensor of size $1\times |\mathcal{O}|$ that represents the probability of each class in the output $\mathcal{O}$.
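As a reading aid, the shapes in this analogy can be written out element-wise. The following is a sketch under stated assumptions rather than the paper's own equations: it omits the non-linearity at the grounding layer, writes the tied filter entry ${w}_{m1f}$ as ${w}_{m}$ since it does not vary with $f$, uses average pooling as in the experiments, and takes the output non-linearity to be a softmax.

$$G_{mn}=\sum_{f=1}^{|\mathcal{F}|} w_{m}\,B_{mnf},\qquad C_{m}=\frac{1}{N}\sum_{n=1}^{N} G_{mn},\qquad \mathbf{o}=\mathrm{softmax}\left(F^{\top}C\right)\in \mathbb{R}^{|\mathcal{O}|}.$$

Read left to right, the three expressions correspond to the convolution, pooling and fully-connected stages, with sizes $M\times N$, $M$ and $|\mathcal{O}|$ respectively.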

## 5 Conclusion and Future Work

We considered the problem of learning neural networks from relational data. Our proposed architecture was able to exploit parameter tying, i.e., different instances of the same rule shared the same parameters inside the same training example. In addition, we explored the use of relational random walks to create relational features for training these neural nets. Further experiments on larger data sets could yield insights into the scalability of this approach. Integration with an approximate-counting method could potentially reduce the training time. Given the relation to CNNs, stacking could allow our method to be deeper. Finally, understanding the use of such a random-walk-based neural network as a function approximator can allow for efficient and interpretable learning in relational domains with minimal feature engineering.

Acknowledgements: SN, GK & NK gratefully acknowledge AFOSR award FA9550-18-1-0462. The authors acknowledge the support of an Amazon faculty award. KK acknowledges the support of the RMU project DeCoDeML. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the AFOSR, Amazon, DeCoDeML or the US government.

## References

- [1] (2017) Hinge-loss Markov random fields and probabilistic soft logic. JMLR. Cited by: §1.
- [2] (2004) Using neural networks for relational learning. In ICML Workshop, Cited by: §2.
- [3] (2012) Joint learning of words and meaning representations for open-text semantic parsing. In AISTATS, Cited by: §2.
- [4] (2013) Translating embeddings for modeling multi-relational data. In NeurIPS, Cited by: §2.
- [5] (2010) Toward an architecture for never-ending language learning. In AAAI, Cited by: §4.
- [6] (2017) Chains of reasoning over entities, relations, and text using recurrent neural networks. In EACL, Cited by: §2.
- [7] (2016) Statistical Relational Artificial Intelligence: Logic, Probability, and Computation. Morgan & Claypool. Cited by: §1.
- [8] (2004) Learning an approximation to inductive logic programming clause evaluation. In ILP, Cited by: §2.
- [9] (2011) Adaptive subgradient methods for online learning and stochastic optimization. JMLR. Cited by: §4.
- [10] (2018) Can neural networks understand logical entailment?. ICLR. Cited by: §2.
- [11] (2014) Fast relational learning using bottom clause propositionalization with artificial neural networks. MLJ. Cited by: §2, §4.
- [12] (2002) Neural-symbolic learning system: foundations and applications. Springer-Verlag. Cited by: §2.
- [13] (2001) Learning probabilistic relational models. RDM. Cited by: §1.
- [14] (2007) Introduction to statistical relational learning. MIT Press. Cited by: §1.
- [15] (2016) Harnessing deep neural networks with logic rules. In ACL, Cited by: §2.
- [16] (2007) Parameter learning for relational bayesian networks. In ICML, Cited by: §1.
- [17] (2017) Relational restricted boltzmann machines: a probabilistic logic learning approach. In ILP, Cited by: §2, §3.1, §3.2, §4, §4.
- [18] (2014) Relational logistic regression. In KR, Cited by: §2.
- [19] (2018) RelNN: A deep neural model for relational learning. In AAAI, Cited by: §1, §2.
- [20] (2011) Learning Markov logic networks via functional gradient boosting. In ICDM, Cited by: §4.
- [21] (2007) First-order deduction in neural networks. In LATA, Cited by: §2.
- [22] (2010) Relational retrieval using a combination of path-constrained random walks. JMLR. Cited by: §1, §2, §3.1.
- [23] (2008) Classification using discriminative restricted boltzmann machines. In ICML, Cited by: §2.
- [24] (1993) Inductive logic programming: techniques and applications. Prentice Hall. Cited by: §1.
- [25] (2005) Is mutagenesis still challenging ?. In ILP, Cited by: §4.
- [26] (2013) Deep relational machines. In ICONIP, Cited by: §2.
- [27] (2007) Bottom-up learning of Markov logic network structure. In ICML, Cited by: §4.
- [28] (1995) Inverse entailment and Progol. New Generation Computing. Cited by: §4, §4, §4, Table 4.
- [29] (2012) Gradient-based boosting for statistical relational learning: relational dependency network case. MLJ. Cited by: §4.
- [30] (2008) Learning first-order probabilistic models with combining rules. ANN MATH ARTIF INTEL. Cited by: §1, §3.2.
- [31] (2011) A three-way model for collective learning on multirelational data. In ICML, Cited by: §2.
- [32] (2016) Learning convolutional neural networks for graphs. In ICML, Cited by: §2.
- [33] (2018) Recurrent relational networks for complex relational reasoning. In ICLR, Cited by: §2.
- [34] (2014) DeepWalk: online learning of social representations. In KDD, Cited by: §2.
- [35] (2016) Column networks for collective classification. In AAAI, Cited by: §2.
- [36] (2007) Joint inference in information extraction. In AAAI, Cited by: §4.
- [37] (2000) Multi instance neural network. In ICML Workshop, Cited by: §2.
- [38] (2006) Markov logic networks. MLJ. Cited by: §1, §4.
- [39] (2009) The graph neural network model. IEEE Transactions on Neural Networks. Cited by: §2.
- [40] (2018) Modeling relational data with graph convolutional networks. In ESWC, Cited by: §2.
- [41] (2013) Reasoning with neural tensor networks for knowledge base completion. In NeurIPS, Cited by: §2.
- [42] (2015) Lifted relational neural networks. In NeurIPS Workshop, Cited by: §3.2, §3.2, §3.2, §4.
- [43] (2016) Learning predictive categories using lifted relational neural networks. In ILP, Cited by: §1, §2.
- [44] (2017) Stacked structure learning for lifted relational neural networks. In ILP, Cited by: §2.
- [45] (1990) Refinement of approximate domain theories by knowledge-based neural networks. In AAAI, Cited by: §2.
- [46] (2015) Relational stacked denoising autoencoder for tag recommendation. In AAAI, Cited by: §2.
- [47] (2014) Knowledge graph embedding by translating on hyperplanes. In AAAI, Cited by: §2.
- [48] (2015) Embedding entities and relations for learning and inference in knowledge bases. In ICLR, Cited by: §2.
- [49] (2014) Relation classification via convolutional deep neural network. In COLING, Cited by: §2.