ProLoNets: Neural-encoding Human Experts' Domain Knowledge to Warm Start Reinforcement Learning

  • 2019-07-02 14:23:30
  • Andrew Silva, Matthew Gombolay
  • 0


Deep reinforcement learning has seen great success across a breadth of tasks,such as in game playing and robotic manipulation. However, the modern practiceof attempting to learn tabula rasa disregards the logical structure of manydomains and the wealth of readily available domain experts' knowledge thatcould help "warm start" the learning process. Further, learning fromdemonstration techniques are not yet efficient enough to infer this knowledgethrough sampling-based mechanisms in large state and action spaces. We presenta new reinforcement learning architecture that can encode expert knowledge, inthe form of propositional logic, directly into a neural, tree-like structure offuzzy propositions amenable to gradient descent and show that our novelarchitecture is able to outperform reinforcement and imitation learningtechniques across an array of reinforcement learning challenges.


Quick Read (beta)

ProLoNets: Neural-encoding Human Experts’ Domain Knowledge to Warm Start Reinforcement Learning

Andrew Silva
Georgia Institute of Technology
Atlanta, GA 30308
[email protected]
\AndMatthew Gombolay
Georgia Institute of Technology
Atlanta, GA 30308
[email protected]

Deep reinforcement learning has seen great success across a breadth of tasks, such as in game playing and robotic manipulation. However, the modern practice of attempting to learn tabula rasa disregards the logical structure of many domains and the wealth of readily available domain experts’ knowledge that could help “warm start” the learning process. Further, learning from demonstration techniques are not yet efficient enough to infer this knowledge through sampling-based mechanisms in large state and action spaces. We present a new reinforcement learning architecture that can encode expert knowledge, in the form of propositional logic, directly into a neural, tree-like structure of fuzzy propositions amenable to gradient descent and show that our novel architecture is able to outperform reinforcement and imitation learning techniques across an array of reinforcement learning challenges.


ProLoNets: Neural-encoding Human Experts’ Domain Knowledge to Warm Start Reinforcement Learning

  Andrew Silva Georgia Institute of Technology Atlanta, GA 30308 [email protected] Matthew Gombolay Georgia Institute of Technology Atlanta, GA 30308 [email protected]


noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

Reinforcement learning (RL) has seen impressive success in many domainsMnih et al. (2013); Espeholt et al. (2018); Andrychowicz et al. (2018, 2016); Sun et al. (2018); Edwards et al. (2016); Rajeswaran et al. (2017); Hahn et al. (2019); however, as research examines increasingly complex domains, such as real-time strategy games Synnaeve & Bessiere (2016); Sun et al. (2018) or robotic manipulationRajeswaran et al. (2017); Andrychowicz et al. (2018), current state-of-the-art approaches require incredible amounts of computation, data, and time not readily available to most researchers and practitioners across academia and industry. Further, reinforcement and imitation learning approaches Mnih et al. (2013); Cheng et al. (2018) fail to efficiently capture the wealth of domain knowledge that does exists for many of these domains.

We aim to reduce the need for voluminous data and computational power by better blending humans and machines. Human domain experts develop strategies or heuristics that allow them to outcompete the most advanced computational approaches Young (2019). Advantageously, human factors practices have been developed to solicit this domain knowledge in the form of a condensed set of heuristics or rules. Missing in current deep RL is a mechanism to translate this domain expertise into a cohesive and expressive learning framework, combining the best of both humans and machines. Apart from using humans as oracles to label training or replay data, which Amershi et al. (2014) argue is unreasonable, no such combined mechanism exists. Furthermore, many real-world domains do not have large, crowd-sourced datasets to leverage and potentially only have a handful of human experts.

Figure 1: A traditional decision tree is on the left, and the ProLoNet version is on the right.

To achieve this blending of human domain knowledge and the strengths of machine computation, we propose a new approach, which we call Propositional Logic Nets (ProLoNets), that directly encodes domain expert knowledge as a set of propositional rules into a neural network. By incorporating a set of logical propositions for initialization of neural network weights, an RL agent can immediately begin learning helpful actions or strategies rather than expending hundreds or thousands of CPU hours in a complex domain without the ability to explore it in a meaningful way. This approach leverages readily available domain knowledge–obtained via standard knowledge solicitation techniques in human factors Crandall et al. (2006); Newell et al. (1983)–while still retaining the ability to learn and improve over time, eventually outperforming the expertise with which it was initialized. By exploiting structure and logical rules that are inherent to many tasks RL is applied to, we can bypass early random exploration and expedite an agent’s learning in a new domain. In addition to demonstrating that our approach can outperform standard deep RL techniques, we also show that our framework is superior to state-of-the-art imitation-learning-based RL Cheng et al. (2018) despite observation of that same domain expert knowledge.

We make three primary contributions in this work. First, we propose a novel approach to initializing deep RL agents using a new architecture that we call ProLoNets. This architecture allows for domain expertise to be translated into a deep RL agent’s policy before the agent begins exploring a new domain. Second, we demonstrate the efficiency of the ProLoNet as an architecture for learning, with and without intelligent initialization, which proves more robust to the need for hyperparameter tuning than a fully-connected baseline. Third, we offer a new approach to dynamic network growth that enables our ProLoNet architecture to expand its capacity dynamically and outgrow its initialization. We also include our implementation and code for experiments11 1

2 Approach

We provide a visual overview of the ProLoNet architecture in Figure 1. To intelligently initialize a ProLoNet, an expert first provides a policy in the form of some hierarchical set of decisions. These policies may be solicited through well-established human factors techniques, such as cognitive and hierarchical task analyses Klein (2000); Schraagen et al. (2000); Stanton (2006), which yield a set of rules or propositions to follow in the form of first-order logic. The propositional description of the expert’s decision-making process is then translated into a set of weights wW and comparator values cC representing each rule. Each weight wi determines which input features to consider and how to weight them. The comparator ci is used as a threshold for the weighted features. We note that c could also be a vector c where each element maps directly to a weighted input feature. For the remainder of this paper, however, we assume that c is a scalar compared against the sum of the weighted input. Each decision node Dn throughout the network is represented as

Dn=σ[α(wnT*X-cn)] (1)

where X is the input data, σ is the sigmoid function, and α controls how smooth the decisions are, where the smoother a decision is, the less deterministic it is. A smoother tree allows for more uncertainty in decision making Yuan & Shaw (1995), and therefore, allows for more exploration, including from an expert initialization. High values of α will emphasize the difference between the comparator and the weighted input, therefore pushing the tree to be more boolean. Lower values of α will encourage a smoother tree, with α=0 producing random decisions. We allow α to be a learned parameter.

After all decision nodes are processed, the values of Dn from each node represent the likelihood of that condition being met, while (1-Dn) represents the likelihood of the condition not being met. With these likelihoods, the network then multiplies out the probabilities for different paths to all of the leaf nodes. Every leaf lL contains a path zZ, a set of decision nodes which should be met or not met in order to reach l, as well as a prior set of weights for each output action aa. For example, in Figure 1, z1=D1*D2, and z3=(1-D1)*D3. The likelihood of each action a in leaf li is then determined by multiplying the probability of reaching leaf li by the prior weight of the outputs within leaf li

P(a|li)=zi*li,a (2)

While the ProLoNet paths always yield valid probability distributions, the leaf weights are trainable parameters, and therefore, do not necessarily maintain output values constrained to valid probability distributions. After calculating the outputs for every leaf, the leaves are summed and passed through a softmax function to provide the final probability distribution over all output actions.

To illustrate this more practically, we consider the simplest case of a cart pole (Section 3.2.1) ProLoNet with a single decision node. Assume we have solicited the following from a domain expert: “If the cart’s x position is right of center, move left; otherwise, move right,” and that they indicate x_position is the first input feature and that the center is at 0. We therefore initialize our primary node D0 with w0=[1,0,0,0] and c0=0. We then specify l0 to be a new leaf with a prior of [1,0] over the two available actions and l1 to be a new leaf with the prior [0,1]. Finally, we set the path to l0 to be D0 and the path to l1 to be (1-D0). Consequently for each state, the probability distribution over the agent’s two actions is a softmax over (D0*l0+(1-D0)*l1).

Critic Initialization

The ProLoNet architecture allows for intelligent initialization of a policy, and we use a duplicate of that policy to initialize an agent’s value network. Our critic network is instantiated with the same parameters for weights, comparators, and leaves as the actor, with the only difference being that we do not run the final critic distribution through a softmax function, as a sample from a probability distribution for value predictions is not needed.


While our initialized ProLoNets are able to follow expert strategies immediately, they may lack expressive capacity to learn more optimal policies once they are deployed into a domain. For instance, if an expert policy only involves a small number of decisions, the network will only have a small number of weight vectors and comparators to use for its entire existence. To enable the ProLoNet architecture to continue to grow beyond its initial definition, we introduce a dynamic deepening procedure. The process is outlined in Algorithm 1, where H is an entropy function and ϵ enforces a minimum confidence. In our experiments, we set ϵ=0.1. A visual example of this process can be seen in Figure 1(a).

Algorithm 1 Dynamic Deepening
1:   Input: ProLoNet Pd, Deeper ProLoNet Pd+1
2:   for liLPd do
3:      Calculate H(li)
4:      Calculate H(ld1), H(ld2) for leaves under li in Pd+1
5:      if H(ls)>(H(ld1)+H(ld2)+ϵ) then
6:         Deepen Pd at li using ld1 and ld2
7:         Deepen Pd+1 at ld1 and ld2 randomly
8:      end if
9:   end for

Upon initialization, a ProLoNet agent maintains two copies of its actor: the shallower, unaltered initialized version and a deeper version, in which each leaf is transformed into a randomly initialized node with two new randomly initialized leaves (line 1 of Algorithm 1). As the agent interacts with its environment, it relies on the shallower networks to generate actions and value predictions and to gather experience. After each episode, our off-policy update is run over the shallower and deeper networks. Finally, after the off-policy updates, the agent compares the entropy of the shallower actor’s leaves (line 3) to the entropy of the deeper actor’s leaves (line 4) and selectively deepens when the leaves of the deeper actor are less uniform than those of the shallower actor (lines 5-7). We find that this dynamic deepening improves stability and ameliorates policy degradation.

3 Experiments and Results

In our experiments, we evaluate ProLoNets against RL and imitation-learning-based techniques. We also conduct an ablation study that examines the value of the various components of the ProLoNet architecture, including intelligent initialization, the architecture itself, and our deepening procedure. We assess our algorithm in the StarCraft II Learning Environment (SC2LE) domain Vinyals et al. (2017) for both the macro and micro battles and the standard lunar lander and cart pole Barto et al. (1983) environments from the OpenAI Gym Brockman et al. (2016) for more detailed analysis.

Our actors are updated with a normal proximal policy optimization (PPO) loss function Schulman et al. (2017), though for more complex domains, we found that augmenting the PPO update with a Kullback-Leibler (KL) divergence term yielded superior performance. Additional details are provided in the supplementary material. The critic’s loss function is the mean-squared error between the output of the critic and the reward from the state-action pair. All approaches are trained with the RMSProp Tieleman & Hinton (2012) optimizer. We set our reward discount factor to 0.99, and learning rates and batch sizes are given in the supplementary material.

3.1 Agents

We compare different agents across our experimental domains, including baseline linear and long short-term memory (LSTM) Hochreiter & Schmidhuber (1997) models, an imitation learning model, and a ProLoNet.

  • ProLoNet: a ProLoNet agent as described above

  • Randomly Initialized ProLoNet: a ProLoNet with random initialization

  • FC: a fully-connected layer agent with ReLU Nair & Hinton (2010) activations

  • LSTM: an LSTM agent with ReLU activations

  • LOKI: a fully-connected layer agent as above but which is trained via imitation learning for the first N episodes Cheng et al. (2018), where N is a tuned hyperparameter. We use the same heuristic to train this agent that we use to initialize our ProLoNet agents.

For the FC and LSTM agents, we experiment with the number and size of linear layers and report results with the best performing agent in each experiment. Architectures are provided in the supplementary material.

3.2 Environments

We consider four environments to evaluate ProLoNets: cart pole, lunar lander, the FindAndDefeatZerglings minigame from the SC2LE Vinyals et al. (2017), and a full game of StarCraft II (SC2) against the in-game artificial intelligence (AI). We provide a brief overview of each domain and present results below. More details, such as hyperparameter settings or descriptions of state-action spaces, are given in supplementary material. All experiments are run on an i7-7700K 4.2 GHz Quad-Core Processor.

To evaluate the impact of deepening and intelligent initialization, our ablation results from all experiments are shown in Table 1. For each n-mistake agent, weights, comparators, and leaves are randomly negated according to n, up to a maximum of 2n for each category. For example, “N = 0.05” has a maximum of 10% of its weights, comparators, and leaves negated, each with probability 0.05.

Figure 2: A visualization of the deepening process where the deeper ProLoNet is shown in paler colors with dashed lines (left) and a demonstration of its stabilizing effects in cart pole (right)

3.2.1 Cart Pole

We find that ProLoNets always solve cart pole, regardless of how they are initialized. Running reward in this domain is averaged across five runs Henderson et al. (2018) for 1,000 episodes each and is depicted in Figure 2(a). As this figure shows, ProLoNets are able to outperform standard reinforcement and imitation-based reinforcement Cheng et al. (2018) learning architectures.

In our ablation study, we find that the ProLoNet is also robust to errant initialization, as shown by the success of the architecture across a variety of sizes with random as well as human-given initialization, notedas shown in Table 1. We test between two- and 32-leaf architectures with random initialization and find that all sizes succeed in solving the problem. FC and LSTM agents however are highly sensitive to such hyperparameter tuning, requiring very specific architectures and learning rates. Even without intelligent initialization, this result demonstrates that the ProLoNet architecture is easier to deploy for a simple RL challenge, and it is robust to a variety of hyperparameter choices.

To evaluate the effect of deepening in a lifelong learning scenario, we compare a ProLoNet to a non-deepening ProLoNet using the cart pole problem for three runs of 10,000 episodes each. We select the cart pole problem for its simplicity and low computational overhead, allowing us to evaluate how well the agent handle continued updates to an already-optimal policy. We present a comparison between a ProLoNet and non-deepening ProLoNet in Figure 1(b). While the non-deepening ProLoNet is able to perform well and is relatively robust to policy degradation, we can see that deepening improves the agent’s stability and long-term reward. Our agents use a dynamically growing batch size for this domain, a technique shown to ameliorate overfitting Smith et al. (2017), and even with this technique, the agent suffers from instability without dynamic deepening.

(a) Cart Pole
(b) Lunar Lander
Figure 3: A comparison of architectures on cart pole (left) and lunar lander (right)

3.2.2 Lunar Lander

For our lunar lander experiment, ProLoNets are successful at solving this domain, using anywhere from 10 to 14 decision nodes and 10 to 15 leaves. Just as with the cart pole problem, FC and LSTM agents are much more sensitive to hyperparameter tuning. Running reward in this domain is averaged across five runs of 1,500 episodes each and is depicted in Figure 2(b). Again, we see that the ProLoNet outperforms baseline architectures and an imitation learning baseline and remains robust to policy degradation. Furthermore, Table 1 shows the architecture’s ability to optimize long-term expected reward even with sub-optimal or random initialization.

3.2.3 StarCraft II: FindAndDefeatZerglings

FindAndDefeatZerglings is a minigame from the SC2LE Vinyals et al. (2017) designed to challenge RL agents to learn how to effectively micromanage their individual attacking units in SC2. The agent controls three attacking units on a small, partially observable map and must explore the map while killing enemy units. We leverage the SC2 API 22 2 to manufacture a 32D state and a 10D action space which are specified by a domain expert. For this problem, we assign an agent to each individual allied unit, which generates actions for only that unit.

Figure 4: A comparison of architectures on FindAndDefeatZerglings

ProLoNet agents for this domain are more sensitive to their initial policies, though we again find success with a variety of initial intelligent policies and architectures. We report results for the best-performing policy, which is a simple policy with five decision nodes and seven leaves. Running reward is averaged across five runs for 1,000 episodes and is depicted in Figure 4. Intelligent initialization is crucial in this more complex domain, in which random initialization fails to find much success, despite having the same architecture as an intelligently initialized ProLoNet. The n-mistake agents–shown in Table 1–obtain significantly lower cumulative reward than the intelligently initialized agent, even with only 5-10% of their parameters negated. Even so, we note that a sub-optimal initialization and a random initialization both significantly outperform baseline RL approaches. In this domain, the imitation learning baseline performs on par with the heuristic it follows, but is unable to generalize well beyond that initial success.

Table 1: ProLoNets ablation study of average cumulative reward. Units are in thousands.
Random. Shallow
Domain ProLoNet ProLoNet ProLoNet N = 0.05 N = 0.1 N = 0.15
Cart Pole 449±15 401±26 415± 27 426± 30 369± 28 424± 29
Lunar 86 ± 33 55±19 49± 20 50± 22 45± 22 45± 22
Zerglings 8.9±1.5 -1.3±0.6 8.8±1.5 5.1±1.1 5.9±1.2 4.1±1.1

3.2.4 StarCraft II: Full Game

The complexity and state space of SC2 is exceptionally large, includes various levels of detail, and is partially observable. We again use the API to extract exact unit types and counts and game state information, reducing the state to a 193D vector of information, such as unit and resource counts. As in TStarBot Sun et al. (2018), we construct a set of heuristics to simplify the specification of our expert policy, though we do not use the macro-actions created in prior work. Our action space is ultimately 44D and includes build, attack, scout, harvest, defend, and "do nothing" commands. Full details are given in the supplementary material.

In such a difficult and complex domain, making the right parameter update is a significant challenge for RL agents. As such, we validate each agent’s network updates by freezing the actor and critic parameters and playing out several games with the new policy. If the agent’s chance of a victory is at least as good as it was before the update, then the parameters are unfrozen and the agent continues its learning. However, if the agent’s probability of success is lower after the update, then the parameters are reverted, and the agent gathers experience for a new update. Probability of success is simply the number of victories under a new policy divided by the number of games under a new policy, and we determine when the result is significant enough on which to base a decision by performing a Bernoulli test after every five games.

Table 2: Win rates for against SC2 in-game AI
ProLoNet ProLoNet at Random Init.
AI Difficulty (Ours) Initialization FC LSTM LOKI Cheng et al. (2018) ProLoNet
VeryEasy 100% 14.1 % 0 % 0 % 0 % 0 %
Easy 100% 10.9 % 0 % 0 % 0 % 0 %
Medium 82.2% 11.3 % 0 % 0 % 0 % 0 %
Hard 26% 10.7 % 0 % 0 % 0 % 0 %

As shown in Table 2, after 5,000 episodes, neither the FC, LSTM, nor the randomly initialized ProLoNet agents are able to win a single game against the in-game AI at the easiest setting. Even the LOKI agent, which has access to the same heuristics used by the ProLoNet, is unable to win one game. The intelligently initialized ProLoNet, on the other hand, is able to progress to the “ hard” in-game AI, achieving 100% win rates against easier opponents as it progressed. Even against the “ hard” in-game AI, the ProLoNet agent is able to double its win rate from initialization. This result demonstrates the importance of an intelligent initialization in complex domains, where only a very narrow and specific set of actions yield successful results. Agents that take random actions will be unproductive for thousands of episodes as the network continues to output actions that are not possible to perform (i.e., “Train Zealot” before a Gateway has been constructed).

4 Discussion

Our results indicate that it is possible to encode heuristic policies into deep networks for warm starting RL agents and that such warm starting can help agents begin to immediately explore and learn superior policies without wasting time taking random actions. As we can see in Figures 2(a), 2(b), and 4, ProLoNets outperform both baseline architectures and imitation learning baselines, and intelligent-initialization can make a significant difference in an agent’s long-term success in complex domains. We also observe that the ProLoNet architecture is robust to sub-optimal or even random initialization, improving from such initialization to solve simple domains and consistently outperforming RL baselines.

For our two simpler domains–cart pole and lunar lander–we observe that our approach is as fast or faster than baseline methods to learn an optimal policy and that it is more stable and robust to policy degradation than baseline methods. We also see the impact of dynamic network growth in Figure 1(b). Finally, we note that our architecture is able to solve these domains across a variety of initial sizes, while baseline methods such as FC or LSTM require extensive hyperparameter tuning.

In our more complex domains, we identify the importance of an intelligent initialization. While the imitation learning baseline is able to perform well in the FindAndDefeatZerglings minigame, it is unable to improve on the imitated policy. Similarly, noisily initialized ProLoNets are unable to quickly find competitive solutions to the problem, even when given the exact same architecture as their successful counterparts.

5 Related Work

Our work is related to several active areas of research, including warm starting, efficient exploration for RL agents, imitation learning, and dynamic network growth. We summarize each area briefly below.

Warm Starting and Neural Trees –

There has been an increase in researchers investigating ways to improve the initialization of deep neural networks, particularly in the RL domain in which agents can spend a tremendous amount of time without learning anything meaningful before starting to accrue useful reward signals. Warm starts have been used for RL in healthcare Zhu & Liao (2017), as well as in supervised learning for natural language processing Wang et al. (2017) or other classification tasks Hu et al. (2016). While these works have provided interesting insight into the efficacy of warm starts in various domains, they either involve large labeled datasets, or they require RL agents to solve the same domain repeatedly. In domains where an RL agent will struggle to ever find a solution without a warm start, this is not a practical assumption nor is it always possible to acquire a large labeled dataset for new domains.

Researchers have also sought to bridge the gap between decision trees and deep neural networks Laptev & Buhmann (2014); Kontschieder et al. (2015). This work has been focused on either partitioning a subspace of the data for more efficient inference Tanno et al. (2018) or for more explicit interpretability through visualization of a network’s classification policy Frosst & Hinton (2017). Nonetheless, our work is the first to leverage propositional logic in the form of a decision tree to intelligently initialize a deep network.

Finally, researchers have proposed deep jointly-informed neural networks (DJINN) Humbird et al. (2018), using a decision tree to initialize a deep network for classification while preserving the decision tree rules for immediately accurate prediction from the network. However, DJINN applies a decision tree trained on a target dataset for network initialization, relying on a large set of labeled examples. Our approach instead seeks to translate an expert policy into a network without the need for a supervised training set to construct a decision tree for initialization. Alternatively, we convert a set of propositional logical rules directly into a set of neural network weights.

Imitation Learning –

Imitation learning–the process of learning a policy by directly mimicking an expert–often requires large datasets on which to train. Methods such as ILPO Edwards et al. (2018) require large labeled datasets, and approaches that combine imitating and exploring, such as DAgger Ross et al. (2011) and LOKI Cheng et al. (2018), require a pre-trained policy or heuristic to act as an oracle. Even with a pre-trained policy, LOKI still requires extensive domain experience before beginning the reinforcement learning stage. A human can also act as an oracle for imitation learning, but it is not reasonable to expect a human to patiently label replay data for the entirety of an imitation-learning agent’s life Amershi et al. (2014). While there are many methods for extracting policies or general “rules of thumb” from humans Newell et al. (1983); Crandall et al. (2006); Annett (2003), these heuristics or rules must be translated into oracles which can be used to provide labels for imitation learning systems, and then these oracles must be applied across large amounts of data. Our approach can leverage the same human factors research for extracting policies from humans, though we translate them directly into an RL agent’s policy and begin RL immediately, sidestepping the imitation learning phase.

Dynamic Network Growth –

Adaptive neural network architectures and dynamic network growth have been areas of interest within the lifelong learning and continual learning communities. Recent work in lifelong learning for multiple tasks Lee et al. (2017) uses a dynamically expanding network to learn new tasks without catastrophically forgetting Kirkpatrick et al. (2017) old tasks, adding neurons throughout the network if old task performance ever suffers while learning new tasks or features. Xiao et al. (2014) used a hierarchy of trained models arranged in a tree-like structure for incremental learning of new classes over small batches of data. While most related work uses some classification loss or reconstruction loss Zhou et al. (2012) to determine when a network should be grown, Susan & Dwivedi (2014) uses a measure of entropy on new data to determine whether or not input data would be considered anomalous by the network and therefore, require an increase in network capacity. Our work differs from these prior approaches in that our method for deepening relies on entropy within the model weights themselves to estimate when a model is unable to decide on a single action.

6 Conclusion and Future Work

We present a new architecture for deep RL agents, ProLoNets, which permits intelligent initialization of agents and grant agents the ability to grow their network capacity as necessary. We show that ProLoNets are robust to the number of parameters in the network, to policy degradation, and even to sub-optimal expert initialization in simpler domains. We demonstrate that our approach is superior to imitation and reinforcement learning on traditional architectures and that intelligent initialization allows deep RL agents to explore and learn in environments that are too complex for randomly initialized agents.

We believe that there are many opportunities for future work in this area. First, we have not outlined ways in which we can incorporate unstructured information, such as images or audio, into a ProLoNet framework. Second, we have not investigated ProLoNets with continuous output, which would allow for deployment into a greater variety of domains and challenges. Finally, our architecture does not currently employ any kind of recurrence or memory, presenting a key area for further development in domains with long time horizons or in lifelong learning scenarios.


  • Amershi et al. (2014) Amershi, S., Cakmak, M., Knox, W. B., and Kulesza, T. Power to the people: The role of humans in interactive machine learning. AI Magazine, 35(4):105–120, 2014.
  • Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.
  • Andrychowicz et al. (2018) Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
  • Annett (2003) Annett, J. Hierarchical task analysis. In Handbook of cognitive task design, pp. 41–60. CRC Press, 2003.
  • Barto et al. (1983) Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, SMC-13(5):834–846, 1983.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.
  • Cheng et al. (2018) Cheng, C.-A., Yan, X., Wagener, N., and Boots, B. Fast policy learning through imitation and reinforcement. arXiv preprint arXiv:1805.10413, 2018.
  • Crandall et al. (2006) Crandall, B., Klein, G., Klein, G. A., and Hoffman, R. R. Working minds: A practitioner’s guide to cognitive task analysis. Mit Press, 2006.
  • Edwards et al. (2016) Edwards, A., Isbell, C., and Takanishi, A. Perceptual reward functions. arXiv preprint arXiv:1608.03824, 2016.
  • Edwards et al. (2018) Edwards, A. D., Sahni, H., Schroeker, Y., and Isbell, C. L. Imitating latent policies from observation. arXiv preprint arXiv:1805.07914, 2018.
  • Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
  • Frosst & Hinton (2017) Frosst, N. and Hinton, G. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784, 2017.
  • Hahn et al. (2019) Hahn, M., Kadav, A., Rehg, J. M., and Graf, H. P. Tripping through time: Efficient localization of activities in videos. arXiv preprint arXiv:1904.09936, 2019.
  • Henderson et al. (2018) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hu et al. (2016) Hu, Z., Ma, X., Liu, Z., Hovy, E., and Xing, E. Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318, 2016.
  • Humbird et al. (2018) Humbird, K. D., Peterson, J. L., and McClarren, R. G. Deep neural network initialization with decision trees. IEEE transactions on neural networks and learning systems, 2018.
  • Kirkpatrick et al. (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, pp. 201611835, 2017.
  • Klein (2000) Klein, G. Cognitive task analysis of teams. Cognitive task analysis, 11:417–29, 2000.
  • Kontschieder et al. (2015) Kontschieder, P., Fiterau, M., Criminisi, A., and Rota Bulo, S. Deep neural decision forests. In Proceedings of the IEEE international conference on computer vision, pp. 1467–1475, 2015.
  • Laptev & Buhmann (2014) Laptev, D. and Buhmann, J. M. Convolutional decision trees for feature learning and segmentation. In German Conference on Pattern Recognition, pp. 95–106. Springer, 2014.
  • Lee et al. (2017) Lee, J., Yun, J., Hwang, S., and Yang, E. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Nair & Hinton (2010) Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
  • Newell et al. (1983) Newell, A., Moran, T., and Card, S. The psychology of human computer interaction. L. Erlbaum Associates, 1983.
  • Rajeswaran et al. (2017) Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., and Levine, S. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
  • Ross et al. (2011) Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.
  • Schraagen et al. (2000) Schraagen, J. M., Chipman, S. F., and Shalin, V. L. Cognitive task analysis. Psychology Press, 2000.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Smith et al. (2017) Smith, S. L., Kindermans, P.-J., Ying, C., and Le, Q. V. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
  • Stanton (2006) Stanton, N. A. Hierarchical task analysis: Developments, applications, and extensions. Applied ergonomics, 37(1):55–79, 2006.
  • Sun et al. (2018) Sun, P., Sun, X., Han, L., Xiong, J., Wang, Q., Li, B., Zheng, Y., Liu, J., Liu, Y., Liu, H., et al. Tstarbots: Defeating the cheating level builtin ai in starcraft ii in the full game. arXiv preprint arXiv:1809.07193, 2018.
  • Susan & Dwivedi (2014) Susan, S. and Dwivedi, M. Dynamic growth of hidden-layer neurons using the non-extensive entropy. In Communication Systems and Network Technologies (CSNT), 2014 Fourth International Conference on, pp. 491–495. IEEE, 2014.
  • Synnaeve & Bessiere (2016) Synnaeve, G. and Bessiere, P. Multiscale bayesian modeling for rts games: An application to starcraft ai. IEEE Transactions on Computational intelligence and AI in Games, 8(4):338–350, 2016.
  • Tanno et al. (2018) Tanno, R., Arulkumaran, K., Alexander, D. C., Criminisi, A., and Nori, A. V. Adaptive neural trees. arXiv preprint arXiv:1807.06699, 2018.
  • Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
  • Vinyals et al. (2017) Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. Starcraft ii: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
  • Wang et al. (2017) Wang, J., Wang, Z., Zhang, D., and Yan, J. Combining knowledge with deep convolutional neural networks for short text classification. In Proceedings of IJCAI, volume 350, 2017.
  • Xiao et al. (2014) Xiao, T., Zhang, J., Yang, K., Peng, Y., and Zhang, Z. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 177–186. ACM, 2014.
  • Young (2019) Young, S. Human ingenuity beats ai’s machine precision., 2019. Accessed: 2019-05-12.
  • Yuan & Shaw (1995) Yuan, Y. and Shaw, M. J. Induction of fuzzy decision trees. Fuzzy Sets and systems, 69(2):125–139, 1995.
  • Zhou et al. (2012) Zhou, G., Sohn, K., and Lee, H. Online incremental feature learning with denoising autoencoders. In Artificial intelligence and statistics, pp. 1453–1461, 2012.
  • Zhu & Liao (2017) Zhu, F. and Liao, P. Effective warm start for the online actor-critic reinforcement learning based mhealth intervention. arXiv preprint arXiv:1704.04866, 2017.

Appendix A Experimental Domain Details

A.1 Cart Pole

Cart pole is an RL domain Barto et al. (1983) where the object is to balance an inverted pendulum on a cart that moves left or right. The state space is a 4D vector representing {cart position, cart velocity, pole angle, pole velocity}, and the action space is is {left, right}. We use the cart pole domain from the OpenAI Gym Brockman et al. (2016).

For the cart pole domain, we set all agent’s learning rates to 0.01, the batch size is set to dynamically grow as there is more replay experience available, we initialized α=1, and each agent trains on all data gathered after each episode, then empties its replay buffer. All agents train on 2 simulations concurrently, pooling replay experience after each episode, and updating their policy parameters. For the LOKI agent, we set N=200. All agents are updated according to the standard PPO loss function. We selected all parameters empirically to produce the best results for each method.

A.2 Lunar Lander

Lunar lander is the second domain we use from the OpenAI Gym Brockman et al. (2016), and is based on the classic Atari game of the same name. Lunar lander is a game where the player attempts to land a small ship (the lander) safely on the ground, keeping the lander upright and touching down slowly. The 8D state consists of the lander’s {x, y} position and velocity, the lander’s angle and angular velocity, and two binary flags which are true when the left or right legs have touched down.

We use the discrete lunar lander domain, and so the 4D action space contains {do nothing, left engine, main engine, right engine}. For the lunar lander domain, we set most hyperparameters to the same values as in the cart pole domain. The two exceptions are the number of concurrent processes, which we set to 4, and the LOKI agent’s N, which is set to 300. All agents use the standard PPO loss function.

A.3 FindAndDefeatZerglings

FindAndDefeatZerglings is a minigame from the SC2LE designed to challenge RL agents to learn how to effectively micromanage their individual attacking units in SC2. The agent controls three attacking units on a small, partially-observable map, and must explore the map while killing enemy units. The agent receives +1 reward for each enemy unit that is killed, and -1 for each allied unit that is killed. Enemy units respawn in random locations, and so the best agents are ones that continuously explore and kill enemy units until the three minute timer has elapsed.

We leverage the SC2 API 33 3 to manufacture a 32D state which contains {x_position, y_position, health, weapon_cooldown} for three allied units, and the five nearest visible enemy units. Missing information is filled with -1. Our action space is 10D, containing move commands for north, east, south, west, attack commands for each of the five nearest visible enemies, and a “do nothing” command. For this problem, we assign an agent to each individual allied unit, which generates actions for only that unit. Experience from each agent stops accumulating when the unit dies. All experience is pooled for policy updates after each episode, and parameters are shared between agents.

For the SC2LE minigame, we set all agents’ learning rates to 0.001, we again initialized α=1, and the batch size to 4. Each agent trains on replay data for 50 update iterations per episode, and pools experience from 2 concurrent processes. The LOKI agent’s N, is set to 500. The agents in this domain update according to the loss function in Equation 3.

L(a,s,πnew,πold)=(A*log(a|πnew))KL(P(a|πnew,s),P(a|πold,s)) (3)

Where A is the advantage gained by taking action a in state s, πnew is the current set of model parameters, and πold is the set of model parameters used during the episode which generated this state-action pair. a is the probability distribution over all actions that a policy π yields given state s. As in prior work, the advantage A is calculated by subtracting the reward (obtained by taking action a in state s) from the value prediction for taking action a in state s, given by a critic network.

A.4 SC2 Full Game

Our simplified StarCraft 2 state contains:

  • Allied Unit Counts: A 36x1 vector in which each index corresponds to a type of allied unit, and the value corresponds to how many of those units exist.

  • Pending Unit Counts: As above, but for units that are currently in production and do not exist yet.

  • Enemy Unit Counts: A 112x1 vector in which each index corresponds to a type of unit, and the value corresponds to how many of those types are visible.

  • Player State: A 9x1 vector of specific player state information, including minerals, vespene gas, supply, etc.

The disparity between allied unit counts and enemy unit counts is due to the fact that we only play as the Protoss race, but we can play against any of the three races.

The number of actions in SC2 can be well into the thousands if one considers every individual unit’s abilities. As we seek to encode a high-level strategy, rather than rules for moving every individual unit, we restrict the action space for our agent. Rather than using exact mouse and camera commands for individual units, we abstract actions out to simply: “Build Pylon.” As such, our agents have 44 available actions, including 35 building and unit production commands, 4 research commands, and 5 commands for attack, defend, harvest resources, scout, and do nothing.

For the full SC2 game, we set all agents’ learning rates to 0.0001, we again initialized α=1, set the batch size to 4, and updates per episode to 8. We run 4 episodes between updates, and set the LOKI N=1000. Agents train for as long as necessary to achieve a >80% win-rate against the easiest AI, then move up to successive levels of difficulty as they achieve >80% win-rates. The agents in this domain update according to the loss function in Equation 3.

Appendix B Initialization Policies

B.1 Cart Pole Heuristics

We use a simple set of heuristics for the cart pole problem, visualized in Figure 5. If the cart is close enough to the center, we move in the direction opposite to the lean of the pole, as long as that motion will not push us too far from the center. If the cart is close to an edge, the agent attempts to account for the cart’s velocity and recenter the cart, though this is often an unrecoverable situation for the heuristic. The longest run we saw for a ProLoNet with no training was about 80 timesteps.

Figure 5: Visualization of the heuristics used to initialize the cart pole ProLoNet, and to train the LOKI agent

B.2 Lunar Lander Heuristics

For the lunar lander problem, the heuristic rules are split into two primary phases. The first phase is engaged at the beginning of an episode while the lander is still high above the surface. In this phase, the lander focuses on keeping the lander’s angle as close to 0 as possible. Phase two occurs when the lander gets closer to the surface, and the agent then focuses on keeping the y_velocity lower than 0.2. As is depicted in Figure 6, there are many checks for both lander legs being down. We found that both LOKI and ProLoNets were prone to landing successfully, but continuing to fire their left or right boosters. In an attempt to ameliorate this problem, we added the extra “legs down” checks.

Figure 6: Visualization of the heuristics used to initialize the lunar lander ProLoNet, and to train the LOKI agent

B.3 FindAndDefeatZerglings Heuristics

For the SC2LE minigame, the overall strategy of our heuristic is to stay grouped up and fight or explore as a group. As such, the first four checks are all in place to ensure that the marines are all close to each other. After they pass the proximity checks, they attack whatever is nearest. If nothing is nearby, they will move in a counter-clockwise sweep around the periphery of the map, searching for more zerglings. Our heuristic is shown in Figure 7.

Figure 7: Visualization of the heuristics used to initialize the FindAndDefeatZerglings ProLoNet, and to train the LOKI agent

B.4 SC2 Full Game Heuristics

The SC2 full game heuristic first checks for important actions that should always be high priority, such as attacking, defending, harvesting resources, and scouting. Once initial checks for these are all passed, the heuristic descends into the build order, where it simply uses building or unit count checks to determine when certain units should be built or trained. After enough attacking units are trained, the heuristic indicates that it is time to attack. The SC2 full game heuristic is depicted in Figure. 8.

Figure 8: Visualization of the heuristics used to initialize the SC2 full game ProLoNet, and to train the LOKI agent

Appendix C Agent Architectures Per Experiment

In this section we briefly overview the FC and LSTM layer information. The LOKI agent maintained the same architecture as the FC agent.

C.1 Cart Pole

The cart pole FC network is a 3-layer network following the sequence: 4x4 – 4x4 – 4x2.

We experimented with sizes ranging from 4-64 and numbers of hidden layers from 1 to 10, and found that the small network performed the best.

The LSTM network for cart pole is the same as the FC network, though with an LSTM unit inserted between the first and second layers. The LSTM unit’s hidden size is 4, so the final sequence is: 4x4 – LSTM(4x4) – 4x4 – 4x2.

We experimented with hidden-sizes for the LSTM unit from 4 to 64, though none were overwhelmingly successful, and we varied the number of layers after the LSTM unit from 1-10.

The ProLoNet agent for this task used 9 decision nodes and 11 leaves. For the deepening experiment, we tested an agent with only a single node and 2 leaves, and found that it still solved the task very quickly. We tested randomly initialized architectures from 1 to 9 nodes and from 2 to 11 leaves, and we found that all combinations successfully solved the task.

C.2 Lunar Lander

The lunar lander FC network is a 4-layer network, following the sequence: 8x8 – 8x64 – 64x8 – 8x4.

We again experimented with sizes from 8-64 and number of hidden layers from 2 to 11.

The LSTM network for lunar lander mimics the architecture from cart pole. The LSTM unit’s hidden size is 8, so the final sequence is: 8x8 – LSTM(8x8) – 8x8 – 8x4.

We experimented with hidden-sizes for the LSTM unit from 8 to 64, and again we varied the number of layers succeeding the LSTM unit from 1 to 10.

The ProLoNet agent for this task featured 14 decision nodes and 15 leaves. We experimented with intelligent initialization architectures ranging from 10 nodes to 14 and from 10 to 15 leaves, and found little difference between their performances. The additional nodes were an attempt to encourage the agent to “do nothing” once successfully landing, though this was only moderately successful.

C.3 FindAndDefeatZerglings

We failed to find a FC architecture that succeeded in this task, and so we choose one that compromised between the depth of the ProLoNet and the simplicity that FC agents seemed to prefer for toy domains. The final network is a 7-layer network with the following sequence: 32x32 – 32x32 – 32x32 – 32x32 – 32x32 – 32x32 – 32x10.

We choose to keep the size to 32 after testing 32 and 64 as sizes, and deciding that trying to get as close to the ProLoNet architecture was the best bet.

The LSTM network for FindAndDefeatZerglings features more hidden layers than the LSTM for lunar lander and cart pole. The hidden size is set to 32, and the LSTM unit is followed by 5 layers. The final sequence is: 32x32 – LSTM(32x32) – 32x32 – 32x32 – 32x32 – 32x32 – 32x10.

We experimented with hidden-sizes for the LSTM unit from 32 to 64 and varied the number of successive layers from 4-10.

The ProLoNet agent for FindAndDefeatZerglings featured 10 nodes and 11 leaves. We tested architectures from 6 to 15 nodes and from 7 to 13 leaves, and found that the initialized policy and architecture had more of an immediate impact for this task. The 7 node policy allowed agents to spread out too much, and they died quickly, whereas the 15 node policy had agents moving more than shooting, and they would walk around while being overrun.

C.4 SC2 Full Game

We again failed to find a FC architecture that succeeded in this task, and so used a similar architecture to that of the FindAndDefeatZerglings task. The 7-layer network is of the sequence: 194x194 – 194x194 – 194x194 – 194x194 – 194x194 – 194x194 – 194x44.

We again experimented with a variety of shapes and number of layers, though none succeeded.

Again, the LSTM network shadows the FC network for this task. As in the FindAndDefeatZerglings task, we experimented with a variety of LSTM hidden unit sizes, hidden layer sizes, and hidden layer numbers. The final architecture reflects the FindAndDefeatZerglings sequence: 194x194 – LSTM(194x194) – 194x194 – 194x194 – 194x194 – 194x194 – 194x44.

The ProLoNet agent for the SC2 full game featured 10 nodes and 11 leaves. We tested architectures from 10 to 16 nodes and from 1 to 17 leaves, and found that the initialized policy and architecture was not as important for this task as it was for the FindAndDefeatZerglings task. As long as we included a basic build order and the “attack” command, the agent would manage to defeat the VeryEasy in-game AI at least 10% of the time. We found that constraining the policy to fewer nodes and leaves provided less noise as updates progressed, and kept the policy close to initialization while also providing improvements. An initialization with too many parameters often seemed to degrade quickly, presumably due to small changes over many parameters having a larger impact than small changes over few parameters.