Abstract
We focus on a class of realworld domains, where gathering hierarchicalknowledge is required to accomplish a task. Many problems can be represented inthis manner, such as network penetration testing, targeted advertising ormedical diagnosis. In our formalization, the task is to sequentially requestpieces of information about a sample to build the knowledge hierarchy andterminate when suitable. Any of the learned pieces of information can befurther analyzed, resulting in a complex and variable action space. We presenta combination of techniques in which the knowledge hierarchy is explicitlyrepresented and given to a deep reinforcement learning algorithm as its input.To process the hierarchical input, we employ Hierarchical MultipleInstanceLearning and to cope with the complex action space, we factor it withhierarchical softmax. Our endtoend differentiable model is trained with A2C,a standard deep reinforcement learning algorithm. We demonstrate the method ina set of seven classification domains, where the task is to achieve the bestaccuracy with a set budget on the amount of information retrieved. Compared tobaseline algorithms, our method achieves not only better results, but alsobetter generalization.
Quick Read (beta)
Deep Reinforcement Learning with Explicitly Represented Knowledge and Variable State and Action Spaces
Abstract.
We focus on a class of realworld domains, where gathering hierarchical knowledge is required to accomplish a task. Many problems can be represented in this manner, such as network penetration testing, targeted advertising or medical diagnosis. In our formalization, the task is to sequentially request pieces of information about a sample to build the knowledge hierarchy and terminate when suitable. Any of the learned pieces of information can be further analyzed, resulting in a complex and variable action space. We present a combination of techniques in which the knowledge hierarchy is explicitly represented and given to a deep reinforcement learning algorithm as its input. To process the hierarchical input, we employ Hierarchical MultipleInstance Learning and to cope with the complex action space, we factor it with hierarchical softmax. Our endtoend differentiable model is trained with A2C, a standard deep reinforcement learning algorithm. We demonstrate the method in a set of seven classification domains, where the task is to achieve the best accuracy with a set budget on the amount of information retrieved. Compared to baseline algorithms, our method achieves not only better results, but also better generalization.
Artificial Intelligence Center, Department of Computer Science
Faculty of Electrical Engineering, Czech Technical University in Prague, Artificial Intelligence Center, Department of Computer Science
Faculty of Electrical Engineering, Czech Technical University in Prague
Additional Key Words and Phrases: deep reinforcement learning; relational reinforcement learning; graph classification; explicit knowledge; variable state space; variable action space; hierarchical multipleinstance learning; hierarchical softmax; a2c; classification with costly features
In many domains, information retrieval is the the core of the problem. Here, the task is to gather a structured knowledge about the current sample and decide what to do with it. Let’s look at an example from the domain of penetration testing. In an attack onto a computer network, the agent may start with a horizontal scan of the network to identify active hosts. Afterward, the agent may further scan any of these hosts (note that the set of hosts is not known at the beginning). The result of this scan is a set of ports corresponding to the services running on the host. For any of the services, the agent may fingerprint the service to learn the exact software package implementing it. Next, it may query vulnerability databases to find known exploits for the software package and choose to execute an exploit, with the goal of penetrating the network. If it fails, it may decide to continue with a different exploit, service or host.
As another example, in advertising on social networks an agent may decide whether to show a specific user an ad for football shoes or ballroom dancing lessons. In order to do that, it looks at user’s likes. If none of them seems relevant for either, it may acquire the list of friends and query their likes. Alternatively, it may query user’s groups and the likes of the people in those groups. Each new query comes with new objects from the world and new actions that can be executed. After gathering enough information, the agent decides what ad to show.
All the examples share their structure and in all of them, the task is to sequentially collect structured information and then perform a final, terminal action (i.e., to terminate the attack or show one of the ads). Naively, it is possible to design an agent that sees the problem as its game world, which it interacts with through primitive actions (such as moveleft, moveright and select). Instead, we offer another direction, exploiting the structure of the task. In our formalization, all the knowledge gathered so far is explicitly represented as a tree given to the agent through its input. At each step, the agent directly selects an unexpanded leaf for examination (i.e., to probe a port or query the user’s friends) or terminates the episode. The objective can be taskrelated (to maximize the success of the network penetration), classification (targeted advertising) or classification with a budget, where each action has a defined cost. We show the hierarchical knowledge in another example in Figure 1.
The introduction of variable state and action spaces raises several issues to overcome. First, we approach the variable treelike input with Hierarchical MultipleInstance Learning (Pevný and Somol, 2016). The result of the process is a fixedlength embedding of the whole tree. Second, the number of actions varies at each step – new actions are created when a node is expanded and performed actions cease to exist. To solve this issue, we employ a hierarchical policy decomposition inspired by methods used in natural language processing (Morin and Bengio, 2005). To train our agent, we use an A2C algorithm (Mnih et al., 2016), a standard policy gradient method used in deep reinforcement learning (deepRL). We demonstrate the method in several scenarios similar to those mentioned. Particularly, we choose classification domain with a budget on the amount of retrieved information, the Classification with Costly Features framework (DulacArnold et al., 2011; Janisch et al., 2019b; Shim et al., 2018) with relational data.
There are several contributions in our work that are related to several different domains. The first one is about knowledge representation. Even in modelfree deep reinforcement learning, the knowledge about the world in usually represented in some form, mainly because the solved domains are scarcely markovian and a state aliasing occurs. The knowledge representation techniques include: Providing the agent with a history of recent states (Mnih et al., 2015), recurrent neural architectures that learn to internally model the environment (Lin and Mitchell, 1992; Hausknecht and Stone, 2015) or forms of external memory (Graves et al., 2014; Graves et al., 2016), that can be used to store knowledge. In our work, we decide to explicitly represent all the gathered knowledge in a form of an exact tree, that is given to the agent as an input. There are pros and cons to this approach. The main advantage of this approach is that the information is presented in its exact form and the agent can precisely choose actions without forgetting. On the other side, recurrent networks given the information sequentially can learn more concise representation.
Our work is also related to a general field of Relational Reinforcement Learning (RRL) (Džeroski et al., 2001; Tadepalli et al., 2004; Van Otterlo, 2005), where the world is described in a form of firstorder logic – of objects, their relations and actions on those objects. We limit ourselves to domains with a specific structure, in which the objects relations form a hierarchy and the focus of actions is mainly on a retrieval of new knowledge. Zambaldi et al. (2018) also focus on RRL with deep learning, but trains their models with image inputs. In our work, we focus on domains with explicitly representable knowledge.
Second contribution is related to the complex action space. In our work, an action is to analyze any unexpanded leaf in the current knowledge tree. For this we use a hierarchical softmax (Morin and Bengio, 2005; Goodman, 2001). There are several other existing techniques to factorize complex action spaces in deepRL, such as in in mutlidimensional action spaces (Tang and Agrawal, 2019), or when multiple simultaneous actions are required (Chen et al., 2019). Metz et al. (2017) decompose a high dimensional action space into separate actions for each dimension that are sequentially performed in a transformed MDP with the same solution.
Third, the method can be seen as an extension of the Classification with Costly Features (CwCF) approach (DulacArnold et al., 2011; Janisch et al., 2019b; Shim et al., 2018) to work with relational data. Formerly, the algorithms were able to work only with fixedlength vectors. Ability to process hierarchical data provides a possibility to extend CwCF into many new domains.
Fourth, the problem is related to inference on graphs (Zhou et al., 2018). The classification task can be reformulated in the graph terminology as a classification of heterogeneous graphs with edge information, but in our case, we classify only the root node. Several methods were developed for graph classification, such as (Hamilton et al., 2017; Perozzi et al., 2014; Kipf and Welling, 2016). However, these methods are not as general, cannot optimize any objective and cannot work with a set budget.
In this work, we use several existing techniques described in this section. To create an embedding of the hierarchical input, we use Hierarchical MultipleInstance Learning (HMIL) (Pevný and Somol, 2016). To select the performed actions, we use hierarchical softmax (Morin and Bengio, 2005; Goodman, 2001). To train our agent, we use Advantage Actor Critic (A2C) (Mnih et al., 2016), a reinforcement learning algorithm from the policy gradient family. In experimental section, we use the Classification with Costly Features (CwCF) (Janisch et al., 2019b) framework to evaluate our algorithm.
MultipleInstance Learning (MIL) (Pevný and Somol, 2017) introduces a neural network architecture to create an embedding of an unordered set $\mathcal{B}$, composed of $m$ items ${x}_{\{1..m\}}\in {\mathbf{R}}^{n}$ (see Figure 2). The items are simultaneously processed with a parametrized function ${f}_{{\theta}_{\mathcal{B}}}$, where ${\theta}_{\mathcal{B}}$ are shared parameters for the set $\mathcal{B}$. This creates their embedding ${z}_{i}=\sigma ({f}_{{\theta}_{\mathcal{B}}}({x}_{i}))$, where $\sigma $ is a nonlinearity function (i.e., ReLu). All embeddings are finally processed with an aggregation function $g$, commonly defined as a mean or max operator. The whole process creates an embedding ${z}_{\mathcal{B}}$ of the set, and is differentiable.
Hierarchical MIL (Pevný and Somol, 2016) works with hierarchies of sets. In this case, the nested sets (where different types of sets have different parameters ${\theta}_{\mathcal{B}}$) are recursively evaluated as in MIL, and their embeddings are used as feature values. The soundness of the hierarchical approach is theoretically studied by Pevný and Kovařík (2019). The whole algorithm is easily parallelizable across samples.
This approach decomposes a probability $p(yx)$ into a tree, where each joint represents a decision point which branch to follow, itself being a proper probability distribution (Morin and Bengio, 2005; Goodman, 2001). The sampling from $p$ can be then regarded as a sequence of stochastic decisions at each joint, starting from the root and continuing down the tree. If we label the probabilities encountered on the path from a root node to $y$ with ${p}_{1},\mathrm{\dots},{p}_{n}$, then $p(yx)={\prod}_{i=1}^{n}{p}_{i}$. The complexity of sampling from $p(yx)$ can be brought down to $O(\mathrm{log}Y)$, as only the partial probabilities on the sampled path has to be computed.
Advantage Actor Critic algorithm (A2C), a synchronous version of A3C (Mnih et al., 2016), is an onpolicy, policygradient algorithm. It iteratively optimizes a policy ${\pi}_{\theta}$ and a value estimate ${V}_{\theta}$ with model parameters $\theta $ to achieve the best cumulative reward in a Markov Decision Process (MDP) $(\mathcal{S},\mathcal{A},r,t)$, where $\mathcal{S}$, $\mathcal{A}$ represent the state and action spaces, and $r$, $t$ are reward and transition functions. Let’s define a stateaction value function $Q(s,a)={\mathbb{E}}_{{s}^{\prime}\sim t(s,a)}[r(s,a,{s}^{\prime})+\gamma {V}_{\theta}({s}^{\prime})]$ and advantage function $A(s,a)=Q(s,a){V}_{\theta}(s)$. Then, the policy gradient is defined as:
$${\nabla}_{\theta}J=\underset{s,a\sim {\pi}_{\theta},t}{\mathbb{E}}\left[A(s,a)\cdot {\nabla}_{\theta}\mathrm{log}{\pi}_{\theta}(as)\right]$$ 
The value function gradient is:
$${\nabla}_{\theta}V=\underset{s,a\sim {\pi}_{\theta},t}{\mathbb{E}}{\nabla}_{\theta}{[{V}_{\theta}(s)Q(s,a)]}^{2}$$ 
with $Q(s,a)$ regarded as a constant. To prevent premature convergence, a regularization term in form of the average policy entropy is used:
$${\nabla}_{\theta}H({\pi}_{\theta})=\underset{s\sim {\pi}_{\theta},t}{\mathbb{E}}{\nabla}_{\theta}\left[\underset{a\sim {\pi}_{\theta}}{\mathbb{E}}\mathrm{log}{\pi}_{\theta}(as)\right]$$ 
The total gradient is then computed as $G={\nabla}_{\theta}(J+{\alpha}_{v}V{\alpha}_{h}H)$, with ${\alpha}_{v},{\alpha}_{h}$ learning coefficients.
The algorithm iteratively gathers sample runs according to a current policy ${\pi}_{\theta}$, and the traces are used as samples for the above expectations. Then, an arbitrary gradient descent method is used with the gradient $G$. Commonly, multiple environments are run in parallel to create a better gradient estimate. Mnih et al. (2016) use asynchronous gradient updates; in A2C the updates are made synchronously.
When the number of actions is high and the computation of their probabilities is expensive, the estimation of the policy entropy gradient can become a bottleneck. In these cases, it is beneficial to use an alternative entropy estimator (Zhang et al., 2018):
$${\nabla}_{\theta}H({\pi}_{\theta},s)=\underset{a\sim {\pi}_{\theta}(s)}{\mathbb{E}}\left[\mathrm{log}{\pi}_{\theta}(as)\cdot {\nabla}_{\theta}\mathrm{log}{\pi}_{\theta}(as)\right]$$ 
In this case, only the actually performed action can be used to sample the expectation. This works well in combination with the hierarchical softmax, because only that action’s probability is easily available.
Classification with Costly Features (CwCF) (Janisch et al., 2019b) is a problem of optimizing accuracy, along with a cost of features used in the process. Let $({y}_{\theta},{f}_{\theta})$ be a model where ${y}_{\theta}$ returns the label and ${f}_{\theta}$ the features used. Then the objective is:
$$\underset{\theta}{\mathrm{min}}\underset{(x,y)\in \mathcal{D}}{\mathbb{E}}\left[\mathrm{\ell}({y}_{\theta}(x),y)+\lambda c({f}_{\theta}(x))\right]$$  (1) 
In here, $(x,y)$ is a data point with its label, $\mathrm{\ell}$ is the classification loss function, $c$ is the cost function returning the cost of the given features and the parameter $\lambda $ is a tradeoff factor between the accuracy and the cost. In the original framework, each data point $x\in {\mathbf{R}}^{n}$ is a vector of $n$ features, each of which has has a defined cost.
Eq. (1) allows a construction of an MDP with an equivalent solution to the original goal. In this MDP, an agent classifies one data point $(x,y)$ per episode. The state $s$ represents the currently observed features. At each step, it can either choose to reveal a single feature $f$ (and receiving a reward of $\lambda c(f)$) or to classify with a label $\widehat{y}$, in which case the episode terminates and the agent receives a reward of $\mathrm{\ell}(\widehat{y},y)$. A common choice for the loss function $\mathrm{\ell}$ is a binary loss (1 in case of mismatch, 0 otherwise).
It has been shown that solving this MDP is equivalent to the original objective and so the standard reinforcement learning (RL) techniques can be used. In a following work, Janisch et al. (2019a) shows that the $\lambda $ parameter can be interpreted as a setting for an average budget. They also adapt the method to a setting with a strict hard budget and also to a directly specified average budget, removing the need for the parameter $\lambda $. In this work, we focus only on the average budget with a set $\lambda $, but the modifications are available, if needed.
By setting $\lambda =0$, the algorithm solves a simple classification problem without any budget. The authors have shown that even in this case the algorithm learns to use only the features contributing to the objective. Hence, this setting can be useful if the costs are unavailable.
The input to the algorithm is a distribution of samples (i.e., distinct computer networks), where each sample comes in a form of a tree of information, and a description of the samples’ structure. There exist two distinct tasks that can be solved. In the first task, the goal for each sample is to sequentially create a partial subtree of the given sample, which contains a defined quality (i.e., a successfully penetrated computer network) and the overall goal of the algorithm is to optimize the taskrelated objective in expectation over all samples (i.e, maximize the network attack success rate). In the second task, labels for the samples are provided and the overall objective changes into classification. In both cases, the nodes can be valuated with a cost (defined in the dataset description) and a budget can be defined. In this case, the objective is constrained so that the total cost of all nodes in the created subtree does not exceed the budget, either on average or strictly for any sample. Moreover, we do not just want to solve the objective for the given training data – we want to create a model that generalizes well to unseen samples.
The tree structure of a sample is defined as follows. There are three types of nodes: items (a host in a network), their properties (an IP address) and sets of other items. Each item can possess several properties and sets. The properties have their values, either numerical, categorical or strings; sets are collections of sametyped items (all items in a set share their structure). The structure of a particular dataset is fixed and known (i.e., hosts have ports), but the number of items in sets can differ across samples (a set also can be empty).
With a single sample, the algorithm starts only with a knowledge about the root node and its children, with their values hidden. Sequentially, it can select a previously unexplored leaf in its currently observed subtree. Selecting a property reveals its value, selecting a set reveals its list of items. Discovered items are automatically expanded, so that the knowledge about the existence of their properties and sets is available. To enable a broader set of problems to be represented, some nodes can only be selected after some precondition is met (defined in the description; i.e., a port can be attacked only after a corresponding exploit is discovered). To satisfy a precondition, an item with a defined property has to be present in the current subtree (i.e., an exploit applicable to port 80). After a sequence of selections, the algorithm can decide to terminate, and the provided subtree is evaluated. In the classification case, the algorithm also provides a label for the current sample.
The problem can be naturally represented as an Markov Decision Process (MDP) that process one sample in a single episode and in which the current subtree is a part of the state, the set of actions is composed of one terminal action and one action per an unselected leaf in the subtree. Let $\mathrm{\ell}$ be the loss function to minimize (negative of the objective) with $\tau $ being either the final subtree or the returned label, depending on the task. Then the reward for the terminal action is $\mathrm{\ell}(\tau )$. Let $c$ be a cost function returning a cost of a selected node. Then the reward for action selecting a node $n$ is $\lambda c(n)$, where $\lambda $ is a tradeoff parameter between the main objective and cost. This exactly follows the formalism of CwCF (see Section Deep Reinforcement Learning with Explicitly Represented Knowledge and Variable State and Action Spaces), with the defined $\mathrm{\ell},c,\lambda $ and the main objective of the algorithm is to optimize the eq. (1). If $\lambda =0$, the problem degrades into objective optimization without a budget. However, the algorithm can still learn to prune parts of the tree that do not contribute to the objective.
In our work, the model is a differentiable function (a neural network) taking a partially observed sample as an input and outputting a probability distribution over all actions (unexplored properties and sets and the terminal action). The state consist of the sample itself, augmented with information about the missing parts. A single float value, called a mask, is added to each property in the currently observed tree, with value $1$ if the property is revealed and $0$ if not. Similarly, a single float value is added to each set in the tree, signalizing what portion of the corresponding branch is revealed, with a value between $0$ and $1$. Currently, we limit ourselves to samples with a limited depth, so the set mask can be recursively computed as an average of masks of properties and sets of all items the set. If a property of an item is not revealed, it’s replaced with zeros.
The data types present in the samples have to be transformed into floats prior their processing by the model. For strings, we observed a good performance with a character trigram histograms (Damashek, 1995). This hashing mechanism is simple, fast and conserves similarities between strings. Categorical properties are translated into onehot encoded array.
The prepared input is processed with HMIL (Section Deep Reinforcement Learning with Explicitly Represented Knowledge and Variable State and Action Spaces), resulting into embeddings of all present items and the embedding of the whole subtree $z$ (see Figure 3). Separate linear functions with their own parameters are used to output:

(1)
class probability distribution $p(z)$

(2)
the value function $V(z)$

(3)
presoftmax energy of the terminal action ${a}_{t}(z)$
The class probability output is used only if the task is classification. When available, the $p$ output can be used to initialize the first part of the network by sampling incomplete subtrees from the dataset, and training only the classification part.
The actions are to terminate or expand a single available unexpanded leaf. To obtain a probability distribution over all possible actions, we employ the hierarchical softmax (see Section Deep Reinforcement Learning with Explicitly Represented Knowledge and Variable State and Action Spaces). Starting at the root of the current subtree, a stochastic decision is made at each node to continue down the tree. An exception is with sets, where all items’ properties and sets are considered at once (instead of choosing an item first and then continuing down; see Figure 4). Each type of set $\mathcal{B}$ share parameters ${\varphi}_{\mathcal{B}}$ (different from HMIL set parameters ${\theta}_{\mathcal{B}}$). The rootlevel embedding $z$ and previously computed embedding of each item ${z}_{i}=\sigma ({f}_{{\theta}_{\mathcal{B}}}({x}_{i}))$ is used to compute presoftmax energies for each property or set in the item, with ${f}_{{\varphi}_{\mathcal{B}}}(z,{z}_{i})$. All action energies across all items in the set are then fed through the softmax function to obtain the final probability matrix. Already revealed properties, or completely explored branches (easily checked for their mask to be $1$) are excluded from the softmax. For convenient and efficient implementation, their presoftmax energies can simply be replaced with $\mathrm{\infty}$. At the root level, the terminal action ${a}_{t}$ is added to the softmax.
The actual performed action is sampled with the hierarchical softmax – through a sequence of stochastic decisions at each level. By taking a product of the partial probabilities of the path from the root to a leaf, the process retrieves a differentiable probability $\pi (as)$, a piece of the policy $\pi $, for the sampled action $a$ in a state $s$. The model is trained with the A2C algorithm (Section Deep Reinforcement Learning with Explicitly Represented Knowledge and Variable State and Action Spaces), by using random samples from a training distribution.
To demonstrate our method, we make experiments on a set of seven datasets from multiple domains (see Table 1). The demonstrated task is the budgeted classification, because we obtained only classification datasets. All datasets are taken from publicly available sources and most of them are originally relational databases. For each dataset, a description of its structure is available to the algorithm. The carcinogenesis, hepatitis, mutagenesis, sap and stats are relational datasets retrieved from (Motl and Schulte, 2015). For our purposes, we transformed them into trees by fixing the root and unfolding into the defined depth by following the links. Numerical values were normalized. The ingredients dataset comes from Kaggle^{1}^{1} 1 https://kaggle.com/alisapugacheva/recipesdata. The web dataset was retrieved by querying ThreatCrowd^{2}^{2} 2 https://threatcrowd.org service with a list of benign and malicious domains, recursively into a defined depth. The costs of various properties in the datasets were manually defined, based on our intuition. The preconditions of nodes, as defined in Section Deep Reinforcement Learning with Explicitly Represented Knowledge and Variable State and Action Spaces, are not used. Strings were processed with the trigram histogram method (Damashek, 1995), with simple modulo 13 index hashing. The datasets were split into training, validation and testing sets. For each dataset, two hyperparameters exist: eplen, controlling an epoch length and some other parameters and bsize, the batch size. All other hyperparameters are shared.
We refer to the method described in this paper as rl. We choose two baseline algorithms to compare to. First, we use HMIL trained with complete data (referred to as hmil), that should give the upper bound on the accuracy for the given dataset (as discussed later, this is often not the case, as the algorithm is often surpassed by the RL method). Second, we train HMIL with randomly constructed partial subtrees, that have some defined cost (referred to as rw, as for randomwalk). With different costs, we train different models. This approach serves as the lower bound, because it does not make informed decisions what features to include. This method is inspired by Hamilton et al. (2017), a method to create embeddings of general graphs, but constrained to always start in the root node and to a set budget.
Dataset  samples  classes  depth  eplen  bsize 

carcinogenesis  329  2  2  100  128 
Predicting the carcinogenity of molecules.  
hepatitis  500  2  2  100  128 
Predicting a diabetes type of a patient.  
mutagenesis  188  2  3  100  128 
Predicting a mutagenicity of molecules.  
ingredients  40k  20  2  1000  1024 
Predicting a cuisine type from ingredients.  
sap  35k  2  2  1000  1024 
Predicting a customer conversion.  
stats  8k  3  3  100  256 
Predicting an user’s age in a social network.  
web  1171  2  3  100  256 
Identifying malicious domains. 
The model’s parameters are initialized according to the provided dataset description. Parameters ${\theta}_{\mathcal{B}},{\varphi}_{\mathcal{B}}$ are created for each set $\mathcal{B}$. We use a fixed embedding size of 64 (the output of ${f}_{{\theta}_{\mathcal{B}}}$), across all sets and datasets. ReLu is used as the $\sigma $ activation function. The learning weight of the A2C algorithms were initialized as ${\alpha}_{v}=0.5,{\alpha}_{h}=0.05$, where the policy entropy controlling weight (${\alpha}_{h}$) exponentially decays by a factor $0.5$ every eplen steps. We use Adam optimizer (Kingma and Ba, 2015), with L2 regularization ${10}^{4}$. The gradients are clipped to a norm of $0.1$. The network is initialized by pretraining the classifier with randomly generated subtrees from the dataset, with crossentropy loss, learning rate $3\times {10}^{3}$ and early stopping. The learning rate of the main training exponentially decays from $3\times {10}^{3}$ by a factor of $0.5$ every $10\times eplen$ steps. During the main training, the classifier is trained only in states where the agent decides to terminate. For each dataset, we run the algorithm for $100\times eplen$ steps, and select the best performing iteration based on its validation performance (reward). The source code will be released on github.
For each dataset, we run the algorithm with three different seeds for each of six different tradeoff values $\lambda $. We make comparable number of runs with the baseline methods. We evaluate the resulting models in the testing sets and plot them in the costaccuracy plane, as seen in Figure 5. To better visualize the best performance of the algorithms, we also plot the pareto frontier of the results. Note that in the costaccuracy plane, the whole range of costs has to be taken into account when comparing the algorithms.
Based on the clustering of the result points and the pareto frontier, it can be said that the rl method achieves a better performance than rw on all but ingredients dataset. This is expected, as the rw method is uninformed and the rl method uses it for its initialization (in the classifier pretraining phase). Surprisingly, the baseline hmil method, which should give the top performance, does not work as well as expected. Although it receives the complete sample, it performs much worse than rl and often even than rw with only a fraction of the information in all but ingredients and sap datasets. This indicates that the full information is often not only unnecessary, but also harmful in the training. We analyzed the performance of the algorithms and found that the hmil model often overfits the training data, leading to a poor generalization. The rl method achieved worse performance during training, but the gap between training and testing data was lower. Similar results were observed by Janisch et al. (2019a) with fixedlength vectors, indicating that the RL model generalizes better because it solves a much harder task (with missing data).
In the terms of the computation needed by the rl method, the datasets with shorter eplen parameter (carcinogenesis, hepatitis, mutagenesis, stats and web) need about 1 hour to train. The other datasets (ingredients and sap) take about 24 hours to train. The hmil method usually finished in terms of minutes and rw method in tens of minutes.
We described that many realworld domains can be represented as a hierarchical information retrieval problem, possibly budgeted. Combining several techniques, we created a deep reinforcement learning method that takes an explicit knowledge tree as its input and sequentially decides which part of this tree to expand, what more information it needs to accomplish a defined task. The task can take a form of a general objective optimization or classification; each setting can incorporate a budget on the amount of acquired information. We demonstrated the method in seven datasets from realworld budgeted classification domains. According to our expectations, the method outperformed a simple baseline that is learned with randomly selected information. However, the method’s performance also surpassed an algorithm trained with complete data, without the budget cap. The results also indicate strong generalization to unseen instances.
References
 (1)
 Chen et al. (2019) YangEn Chen, KaiFu Tang, YuShao Peng, and Edward Y Chang. 2019. Effective Medical Test Suggestions Using Deep Reinforcement Learning. arXiv preprint arXiv:1905.12916 (2019).
 Damashek (1995) Marc Damashek. 1995. Gauging similarity with ngrams: Languageindependent categorization of text. Science 267, 5199 (1995), 843–848.
 DulacArnold et al. (2011) Gabriel DulacArnold, Ludovic Denoyer, Philippe Preux, and Patrick Gallinari. 2011. Datumwise classification: a sequential approach to sparsity. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 375–390.
 Džeroski et al. (2001) Sašo Džeroski, Luc De Raedt, and Kurt Driessens. 2001. Relational reinforcement learning. Machine learning 43, 12 (2001), 7–52.
 Goodman (2001) Joshua Goodman. 2001. Classes for fast maximum entropy training. arXiv preprint cs/0108006 (2001).
 Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401 (2014).
 Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka GrabskaBarwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538, 7626 (2016), 471.
 Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
 Hausknecht and Stone (2015) Matthew Hausknecht and Peter Stone. 2015. Deep recurrent qlearning for partially observable mdps. In 2015 AAAI Fall Symposium Series.
 Janisch et al. (2019a) Jaromír Janisch, Tomáš Pevnỳ, and Viliam Lisỳ. 2019a. Classification with Costly Features as a Sequential DecisionMaking Problem. arXiv preprint arXiv:1909.02564 (2019).
 Janisch et al. (2019b) Jaromír Janisch, Tomáš Pevný, and Viliam Lisý. 2019b. Classification with Costly Features using Deep Reinforcement Learning. In Proceedings of 33^{rd} AAAI Conference on Artificial Intelligence.
 Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
 Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
 Lin and Mitchell (1992) LongJi Lin and Tom M Mitchell. 1992. Memory approaches to reinforcement learning in nonMarkovian domains. Citeseer.
 Metz et al. (2017) Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. 2017. Discrete sequential prediction of continuous actions for deep rl. arXiv preprint arXiv:1705.05035 (2017).
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. 1928–1937.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Humanlevel control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
 Morin and Bengio (2005) Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model.. In Aistats, Vol. 5. Citeseer, 246–252.
 Motl and Schulte (2015) Jan Motl and Oliver Schulte. 2015. The CTU prague relational learning repository. arXiv preprint arXiv:1511.03086 (2015). https://relational.fit.cvut.cz/
 Perozzi et al. (2014) Bryan Perozzi, Rami AlRfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
 Pevný and Kovařík (2019) Tomáš Pevný and Vojtěch Kovařík. 2019. Approximation capability of neural networks on spaces of probability measures and treestructured domains. arXiv preprint arXiv:1906.00764 (2019).
 Pevný and Somol (2016) Tomáš Pevný and Petr Somol. 2016. Discriminative models for multiinstance problems with tree structure. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security. ACM, 83–91.
 Pevný and Somol (2017) Tomáš Pevný and Petr Somol. 2017. Using neural network formalism to solve multipleinstance problems. In International Symposium on Neural Networks. Springer, 135–142.
 Shim et al. (2018) Hajin Shim, Sung Ju Hwang, and Eunho Yang. 2018. Joint Active Feature Acquisition and Classification with VariableSize Set Encoding. In Advances in Neural Information Processing Systems. 1375–1385.
 Tadepalli et al. (2004) Prasad Tadepalli, Robert Givan, and Kurt Driessens. 2004. Relational reinforcement learning: An overview. In Proceedings of the ICML2004 workshop on relational reinforcement learning. 1–9.
 Tang and Agrawal (2019) Yunhao Tang and Shipra Agrawal. 2019. Discretizing continuous action space for onpolicy optimization. arXiv preprint arXiv:1901.10500 (2019).
 Van Otterlo (2005) Martijn Van Otterlo. 2005. A survey of reinforcement learning in relational domains. Centre for Telematics and Information Technology (CTIT) University of Twente, Tech. Rep (2005).
 Zambaldi et al. (2018) Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, et al. 2018. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830 (2018).
 Zhang et al. (2018) Yiming Zhang, Quan Ho Vuong, Kenny Song, XiaoYue Gong, and Keith W Ross. 2018. Efficient Entropy for Policy Gradient with Multidimensional Action Space. arXiv preprint arXiv:1806.00589 (2018).
 Zhou et al. (2018) Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. 2018. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434 (2018).