Construction of Macro Actions for Deep Reinforcement Learning

  • 2019-08-05 05:59:40
  • Yi-Hsiang Chang, Kuan-Yu Chang, Henry Kuo, Chun-Yi Lee
  • 6

Abstract

Conventional deep reinforcement learning typically determines an appropriateprimitive action at each timestep, which requires enormous amount of time andeffort for learning an effective policy, especially in large and complexenvironments. To deal with the issue fundamentally, we incorporate macroactions, defined as sequences of primitive actions, into the primitive actionspace to form an augmented action space. The problem lies in how to find anappropriate macro action to augment the primitive action space. The agent usinga proper augmented action space is able to jump to a farther state and thusspeed up the exploration process as well as facilitate the learning procedure.In previous researches, macro actions are developed by mining the mostfrequently used action sequences or repeating previous actions. However, themost frequently used action sequences are extracted from a past policy, whichmay only reinforce the original behavior of that policy. On the other hand,repeating actions may limit the diversity of behaviors of the agent. Instead,we propose to construct macro actions by a genetic algorithm, which eliminatesthe dependency of the macro action derivation procedure from the past policiesof the agent. Our approach appends a macro action to the primitive action spaceonce at a time and evaluates whether the augmented action space leads topromising performance or not. We perform extensive experiments and show thatthe constructed macro actions are able to speed up the learning process for avariety of deep reinforcement learning methods. Our experimental results alsodemonstrate that the macro actions suggested by our approach are transferableamong deep reinforcement learning methods and similar environments. We furtherprovide a comprehensive set of ablation analysis to validate the proposedmethodology.

 

Quick Read (beta)

Construction of Macro Actions for
Deep Reinforcement Learning

Yi-Hsiang Chang
National Tsing Hua University
[email protected]
&Kuan-Yu Chang
National Tsing Hua University
[email protected]
&Henry Kuo
National Hsinchu Senior High School
[email protected]
&Chun-Yi Lee
National Tsing Hua University
[email protected]
Abstract

Conventional deep reinforcement learning typically determines an appropriate primitive action at each timestep, which requires enormous amount of time and effort for learning an effective policy, especially in large and complex environments. To deal with the issue fundamentally, we incorporate macro actions, defined as sequences of primitive actions, into the primitive action space to form an augmented action space. The problem lies in how to find an appropriate macro action to augment the primitive action space. The agent using a proper augmented action space is able to jump to a farther state and thus speed up the exploration process as well as facilitate the learning procedure. In previous researches, macro actions are developed by mining the most frequently used action sequences or repeating previous actions. However, the most frequently used action sequences are extracted from a past policy, which may only reinforce the original behavior of that policy. On the other hand, repeating actions may limit the diversity of behaviors of the agent. Instead, we propose to construct macro actions by a genetic algorithm, which eliminates the dependency of the macro action derivation procedure from the past policies of the agent. Our approach appends a macro action to the primitive action space once at a time and evaluates whether the augmented action space leads to promising performance or not. We perform extensive experiments and show that the constructed macro actions are able to speed up the learning process for a variety of deep reinforcement learning methods. Our experimental results also demonstrate that the macro actions suggested by our approach are transferable among deep reinforcement learning methods and similar environments. We further provide a comprehensive set of ablation analysis to validate the proposed methodology.

\algnewcommand\IfThen

[2]\State\algorithmicif #1 \algorithmicthen #2

 

Construction of Macro Actions for
Deep Reinforcement Learning


  Yi-Hsiang Chang National Tsing Hua University [email protected] Kuan-Yu Chang National Tsing Hua University [email protected] Henry Kuo National Hsinchu Senior High School [email protected] Chun-Yi Lee National Tsing Hua University [email protected]

\@float

noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

Conventional deep reinforcement learning (DRL) has been shown to demonstrate super-human performance on a variety of environments and tasks [1, 2, 3, 4, 5, 6, 7]\fxwarning[more citations, including GA with RL]. However, in conventional methods, agents are restricted to make decisions at each timestep, which differs much from the temporally-extended framework of decision-making in human beings. As a consequence, traditional methods [1, 3, 8] \fxwarningcite require enormous amounts of sampling data in environments where goals are hard to reach or rewards are sparse. In complex environments where goals can only be achieved by executing a long sequence of primitive actions, it is difficult to perform exploration efficiently. As most real-world environments are large, complex, and usually provide sparse rewards, finding an optimal policy for an agent is still hard and challenging. It becomes crucial to explore new mechanisms for DRL to deal with these types of environments more efficiently and effectively.

Researchers in the past few years have attempted various techniques to expand the realm of DRL to temporally-extended frameworks [9, 10, 11, 12, 13, 14, 15]. In such frameworks, a high-level controller interacts with the environment by selecting temporal-extended policies usually named as "options". Once an option is selected, it interacts with the environment in a certain timesteps and perform primitive actions until a termination condition is met. The controller then moves on to select another option. However, developing effective options directly either require significant domain knowledge [16] \fxwarningcite, or only works in low-dimensional and/or relatively simple environments [12, 17, 11].

Instead of developing options, another branch of research directions focus on constructing macro actions [18, 19, 20, 21, 22, 23, 24]. A macro action (or simply "a macro") is an open-loop [25] policy composed of a finite sequence of primitive actions. Once a macro is chosen, the actions will be taken by the agent without any further decision making process. Some researches in DRL attempt to construct macros from the experience of an agent [26, 27, 28, 29]. A key benefit of these approaches is its ease for them to construct a desired macro without supervision [26]. However, these approaches may lead to biased macros. For example, the most frequently used sequence of actions may not correspond to a macro that can lead the agent to outperform its past policies. Furthermore, as agents in general perform exploration extensively in the early stages of training, the inconsistency in the early experience may perturb the construction of macros. A few researchers proposed to employ a reduced form of macro action called action repeat [30, 31]. In this formulation, primitive actions are repeated several times in a macro before the agent makes another decision. However, this formulation may limit the diversity of macro actions. By relaxing the agent to perform macros consisting of diversified actions, the agent is granted more chances to achieve higher performance. In addition, there are a handful of researches that requires human supervision to derive macros for improving training efficiency. McGovern et al. [32] show that handcrafted macros can speed up training in certain tasks but hinder performance in others. Heecheol et al. [33] generate macros from expert demonstrations via a variational auto-encoder. However, the process of obtaining expert demonstrations is expensive. It would thus be favorable if there exists a method for finding a macro action without human intervention. Unfortunately, little attention has been paid to the construction of such macros.

Our goal is to develop a methodology for constructing a macro action from possible candidates. As possible macros are allowed to have different lengths and arbitrary compositions of primitive actions, such diversified macro actions essentially form an enormous space. We define this space as the macro action space (or simply "macro space"). Repeated action sequences are simply a small subset of the macro space. For a specific task in an environment, we hypothesize that there are good macros and bad macros in the macro space. Different macro actions have different performance impacts to an agent. Good macro actions enable the agent to jump over multiple states and reach a target state quicker and easier. On the other hand, bad macro actions may lead the agent to undesirable states. We argue that whether a macro is good or bad can only be determined by direct evaluation. In this study, we propose an evaluation method to test whether a macro is satisfactory for an agent to perform a specific task in an environment. Our method first relaxes the conventional action space [34] with a macro to form an augmented action space. We then equip the agent with the augmented action space, and utilize the performance results as the basis for our evaluation. In order to find a good macro in the vast macro space, a systematic method is critically important and necessary. The method entails two prerequisites: a macro construction mechanism and a macro evaluation method. Although the second one is addressed above, there is still a lack of an appropriate approach to construct macros.

To satisfy the above requirement, we embrace an genetic algorithm (or simply "GA") for macro construction. GA offers two promising properties. First, it eliminates the dependency of the macro action derivation procedure from the past policies of an agent and/or human supervision. Second, it produces diversified macros by mutation. In order to combine GA with our evaluation method, our approach comprises of three phases: (1) macro construction by GA; (2) action space augmentation; and (3) evaluation of the augmented action space. Our augmented action space contains not only the original action space defined by DRL, but also the macro(s) constructed by GA. To validate the proposed approach, we perform our experiments on Atari 2600 [35] and ViZDoom [36], and compare them to two representative DRL baseline methods. We demonstrate that our proposed method is complementary to existing DRL methods, and perform favorably against the baselines. Moreover, we show that the choice of the macro have a crucial impact on the performance of an agent. Furthermore, our results reveal the existence of transferability of a few macros over similar environments or DRL methods. We additionally provide a comprehensive set of ablation analysis to justify various aspects of our proposed approach. The primary contributions of this paper are summarized as the following:

  • We define the proposed approach as a framework.

  • We provide a definition of macro action space.

  • We introduce an augmentation method for action spaces.

  • We propose an evaluation method to determine whether a macro is good or not.

  • We establish a macro action construction method using GA for DRL.

  • We investigate and reveal the transferability of macro actions.

The rest of this paper is organized as follows. Section 2 walks through our proposed framework. Section 3 describes our implementation details of GA. Section 4 discusses the experimental results. Section 5 concludes the paper.

2 Framework Formulation

Table 1: Essential notations.
Symbol Name Symbol Name
|| sequence length or set size environment
{} probability DRL method
𝔼{} expected value 𝔐 macro action space
t timestep augmented action space
𝒮 state space m primitive action or macro action
s regular state mt m taken at t
st s perceived by agent at t rsm (discounted) expected reward
𝒜 primitive action space ν(s,m) policy over
a primitive action pssm environment dynamic
rt one-step reward Vν(s) value function
γ discount factor V*(s) optimal value function
DRL method collection evaluation method collection
𝒜 augmentation method collection 𝒞 macro construction method collection

In this section, we first provide the definition of macro actions. Then, we provide a model of the environment with macros, which is a special case of Semi-Markov Decision Processes (SMDP) [9]. Next, we provide definitions of functions in DRL. Finally, we formulate a framework for constructing good macros. The essential notations used in this paper can be referenced in Table 1.

Macro action.  A macro action is defined as a sequence of primitive actions m=(a1,a2,,ak), for some natural number k. Macros can be selected atomically as one of the actions by an agent. The set of macros form a macro action space 𝔐, which can be represented as 𝔐=𝒜+𝒜, where ‘+’ stands for Kleene plus [37]. Please notice that 𝔐 does not contain 𝒜.

Environment modeled.  The environment we concerned can be modeled as a special case of SMDP, which can be represented as a 4-tuple (𝒮,,pssm,rsm), where 𝒮 is a finite set of states, the finite set containing macros and all primitive actions provided by the environment, pssm the transition probability from s to s when executing m, and rsm the reward received by the agent after executing m. The expressions of , pssm, and rsm can be represented as Eqs. 12, and 3, respectively.

Deep reinforcement learning.  An agent interacts with an environment under a policy ν, where ν is a mapping, ν:𝒮×[0,1]. The expected cumulative rewards it receives from each state s under ν can be denoted as Vν(s). The objective of DRL is to train an agent to learn an optimal policy such that it is able to maximize its expected return. The maximal expected return from each state s under the optimal policy can be denoted as V*(s). The expressions of Vν and V* can be represented as Eqs. 4 and 5, respectively, where γ[0,1]. We only use γ between macros but not between the primitive actions within a macro. This encourages the agent to prefer the provided macros over a series of primitive actions.

=𝒜M, where M𝔐 (1)
pssm ={st+|m|=s|st=s,mt=m} (2)
rsm =𝔼{τ=0|m|-1rt+τ|st=s,mt=m} (3)
Vν(s) =mν(s,m)[rsm+γs𝒮pssmVν(s)] (4)
V* =maxm[rsm+γs𝒮pssmV*(s)] (5)

Framework.  We define our framework as a 4-tuple (,𝒜,,𝒞), where is the collection of all DRL methods, 𝒜 the collection of all action space augmentation methods, the collection of all evaluation methods, and 𝒞 the collection of all macro action construction methods. Following this framework, in this study, we select Proximal Policy Optimization (PPO) [5] and Advantage Actor Critic (A2C) [3] as our DRL methods. Our implementations of the latter three components of the 4-tuple are formulated as pseudocodes and are presented in Algorithms 33, and  3, respectively.

3 Generic Algorithm for Constructing Macro Actions

{algorithm}

[H] Action space augmentation method

{algorithmic}

[1] \Stateinput: Primitive action space 𝒜 \Stateinput: Macro action m \Stateoutput: Augmented action space \FunctionAugment𝒜, m \Statereturn =𝒜{m} \EndFunction

{algorithm}

[H] Macro evaluation method

{algorithmic}

[1] \Stateinput: Environment ; DRL method ; Augmented action space ; Evaluation epochs N \Stateoutput: Fitness score of \FunctionEvaluate, , , N \Stateinitialize: E=[e1,,eN] eE,e=0 \Fori in 1 to N \CommentN epoch \StateLearn a policy over in using \Stateei = the average of all "last 100 episode mean rewards" \EndFor\Statereturn Average of E \EndFunction

{algorithm}

[H] Macro construction algorithm based on GA

{algorithmic}

[1] \Stateinput: Environment ; DRL algorithm ; Total number of evaluated macros k \Stateinput: The sizes of sets Q,Q+,Q* = q,q+,q* respectively \Stateoutput: List of the top q highest-performing macros evaluated Q \FunctionConstruction, , k, q, q+, q* \Stateinitialize: 𝒜 = the primitive action space of \Stateinitialize: M=[m1,,mq], a list of q randomly generated macros such that mM,|m|=2 \Stateinitialize: F=[f1,,fq],fF,f=0, the fitness scores for all the macros in M \Stateinitialize: Q=,Q+=,Q*=, i=0 \Whilei<k \CommentEach iteration is one generation \Forj in 1 to q \State=\CallAugment𝒜,mj \Statefj= \CallEvaluate, , \Statei=i+1 \IfThenikbreak \EndFor\StateM¯=[m1,,mq] the top q scoring macros in M according to F \StateQ = the top q scoring macros in QM¯ \Form in List of q+ randomly selected macros in Q \StateQ+ = Q+{\CallAppend𝒜,m} \EndFor\Form in List of q* randomly selected macros in Q \StateQ* = Q*{\CallAlter𝒜,m} \EndFor\StateM=Q+Q* \EndWhile\Statereturn Q \EndFunction

In this section, we present our implementation details of GA for constructing macro actions. We formulate our macro construction algorithm based on GA as Algorithm 3, which is established atop (1) the action space augmentation method (presented as Algorithm 3 and accessed at line 11 of Algorithm 3), (2) the macro evaluation method (presented as Algorithm 3 and accessed at line 12 of Algorithm 3), as well as (3) the append operator (accessed at line 19) and (4) the alteration operator (accessed at line 22) presented in Fig. 1 (a) and the supplementary material for mutating the macros.

We walk through Algorithm 3 and highlight the four phases of GA as follows. Lines 1-3 declare the input and output of the algorithm. Lines 5-8 correspond to the “initialization phase”, which initializes the essential parameters. The population of the macros M are initialized at line 6. Lines 10-15 correspond to the “fitness phase”, which augments the action space and performs fitness evaluation. The fitness phase can be executed in parallel by multiple threads. Lines 16-17 corresponds to the “selection phase”, which selects the top performers from the population. Lastly, lines 18-23 correspond to the “mutation phase”, which alters the selected macros and updates the population.

4 Experimental Results

In this section, we present our experimental results of Algorithm 3 and discuss their implications. We arrange the organization of our presentation as follows in order to validate the methodology we proposed in the previous sections. First, we walk through the experimental setups. Second, we compare the constructed macro actions of different generations to examine the effectiveness of Algorithm 3. Third, we investigate the compatibility of the constructed macros with two off-the-shelf DRL methods. Then, we explore the transferability of the constructed macros between the two DRL methods. We next demonstrate that the constructed macros can be transferred to harder environments. Finally, we present a set of ablation analysis to inspect the proposed methodology from different perspectives. Please note that we only present the most representative results in this section. We strongly suggest the interested readers to refer to our supplementary material for more details.

4.1 Experimental Setup

We first present the environments used in our experiments, followed by a brief description of the baselines adopted for comparison purposes. All of the macros presented throughout this section are constructed by Algorithm 3, if not specifically specified. We tabularize the environmental configurations, the hyper-parameters of Algorithm 3, and the hyper-parameters used during the training phase in our supplementary material. Except for Section 4.2, all of the curves presented in this section are generated based on five random seeds and drawn with 95% confidence interval (displayed as the shaded areas).

\thesubsubfigure The genetic operators.
\thesubsubfigure The map for the Dense, Sparse, and Very Sparse tasks.
\thesubsubfigure The map for the Super Sparse task.
Figure 1: (a) The generic operators for the append operation and the alteration operation in Section 3. (b)(c) Map layouts of the ViZDoom environment. The blue points denote the random spawn points for the Dense task, while the orange, red, and purple arrows denote the fixed spawn points of the agents and their initial orientations for the Sparse, Very Sparse, and Super Sparse tasks, respectively.

Environments.  We employ two environments with discrete primitive action spaces in our experiments: Atari 2600 [38] and ViZDoom [39]. For Atari 2600, we select seven representative games to evaluate our methodology: BeamRider, Breakout, Pong, Q*bert, Seaquest, SpaceInvaders, and Enduro. Due to the limited space, we use only the former six games for presenting our comparisons and analyses in Section 4, while leaving the remainder of our results in the supplementary material. For ViZDoom, we evaluate our methodology on the default task my_way_home (denoted as “Dense”) for comparing the macros constructed among generations. We further use the “Sparse”, “Very Sparse” and, “Super Sparse” (developed by us) tasks for analyzing the properties of compatibility and transferability of the constructed macros mentioned above. The Super Sparse task comes with extended rooms and corridors in which the distance between the spawn point of the agent and the goal is farther than the other three tasks. The map layouts for these four tasks are depicted in Figs. 1 and 1.

Baselines.  We select PPO [5] and A2C [3] as our DRL baselines for training the agents. For ViZDoom, we further incorporate the intrinsic curiosity module (ICM) [40] in the Sparse, Very Sparse, and Super Sparse tasks to generate intrinsic rewards for motivating exploration.

ViZDoom setups for transferability evaluation.   We use ViZDoom to validate our hypothesis that macros constructed from an easy environment can be transferred to complex environments. In order to perform this validation study, we use Algorithm 3 to construct macros from the Dense task and evaluate their performance after 5M training timesteps. The highest performing macro is then selected and evaluated in the Sparse, Very Sparse, and Super Sparse tasks after 10M training timesteps.

4.2 Comparison of Macros among Generations

\thesubsubfigure BeamRider.
\thesubsubfigure SpaceInvaders.
\thesubsubfigure Q*bert.
\thesubsubfigure Dense.
Figure 2: Learning curves of the A2C agents trained with the early, middle, and late generations of the constructed macros. Please note that “Gen” denotes “generation”.

Fig. 2 compares the learning curves of the A2C agents trained with different generations of the constructed macros in BeamRider, SpaceInvaders, Q*bert, and Dense. Each curve along with the corresponding confidence interval stand for the average performance of the entire generation (i.e., M in Algorithm 3). It is observed from the trends that the mean episode rewards obtained by the agents improve with generations, revealing that later generations of the constructed macros may outperform earlier generations. The improving trends also suggest that the evaluation method presented in Algorithm 3 is effective and reliable to be used by Algorithm 3.

4.3 Analysis of Compatibility of the Constructed Macros with DRL Methods

In order to examine whether the best macros constructed by our methodology is complementary to the baseline DRL methods, we first execute Algorithm 3 with A2C and PPO, and determine the best macros m^A2C and m^PPO for the two DRL baselines respectively. We then form the augmented action spaces A2C=A{m^A2C} and PPO=A{m^PPO}. We next train A2C using A2C and PPO using PPO for 10M timesteps respectively. The results evaluated on BeamRider are shown in Fig. 3. The learning curves reveal that the baseline DRL methods are both significantly benefited from the augmented action spaces. More supporting evidences are presented in Figs. 3 and 4.

4.4 Analysis of Transferability of the Constructed Macros among DRL Methods

\thesubsubfigure BeamRider.
\thesubsubfigure Q*bert.
\thesubsubfigure Pong.
\thesubsubfigure Breakout.
Figure 3: (a) Learning curves of the baseline DRL methods with/without the constructed macros. (b)(c)(d) Learning curves of the baseline DRL methods with/without m^A2C.

In order to inspect whether the macro constructed for one DRL baseline can be transferred to and utilized by another DRL baseline, we train PPO with A2C for 10M timesteps and plot the results on Q*bert, Pong, and Breakout in Figs. 3,  3, and  3, respectively. The results show that PPO is able to use m^A2C to enhance both the performance and the learning speed of the agents. More results are provided in our supplementary material.

4.5 Analysis of Transferability of the Constructed Macros to Harder Environments

\thesubsubfigure Dense.
\thesubsubfigure Sparse.
\thesubsubfigure Very Sparse.
\thesubsubfigure Super Sparse.
Figure 4: Learning curves of Curiosity with/without m^D for four different ViZDoom tasks.

In order to investigate the transferability of the constructed macros among similar environments, we first execute Algorithm 3 with A2C and ICM (together denoted as “Curiosity”) on Dense to construct a macro m^D. We choose to train on Dense because the agent can learn from both the dense and sparse reward scenarios at different spawn points, enabling it to adapt to different reward sparsity settings easier. We then form the augmented action space =A{m^D} and train the agent to use with Curiosity on similar but harder environments with sparser rewards. The results are plotted in Fig. 4. Figs. 44 demonstrate that the agents with learn significantly faster than the agents without it. These results thus validate the transferability of the constructed macros to harder environments. Fig. 4 also reveals that the gap between ‘Curiosity+m^D’ and ‘Curiosity’ grows as the sparsity of the reward signal increases. For the Super Sparse task, ‘Curiosity+m^D’ converges at around 13M timestep, while ‘Curiosity’ just begins to learn at about 25M timestep. We further compare the mean timesteps required to reach the goal w/ and w/o m^D over 100 episodes in our supplementary material.

The macro m^D constructed by Algorithm 3 is (2,2,1), which corresponds to two forward moves followed by one right turn. We speculate that this macro is transferable because the map layouts favor forward moves and right turns. Moving forward is more essential than turning right for the four ViZDoom tasks, as the optimal trajectories comprise of several straight moves. In contrast, left turn is rarely performed by the agent, as it is only used when the agent is deviated from its optimal routes.

4.6 Ablation Analysis

\thesubsubfigure Seaquest.
\thesubsubfigure Very Sparse.
\thesubsubfigure Seaquest.
\thesubsubfigure Dense.
\thesubsubfigure BeamRider.
Figure 5: (a)(b) The best macros constructed by our methodology versus random macros and action repeat. (c)(d) Distribution of the constructed macros over different ranges of the evaluated mean scores. (e) Comparison of the macros constructed from random and handcrafted initial populations.

In order to justify our decision of selecting GA as our macro construction method, we additionally compare GA with three different macro construction methods to demonstrate that GA is superior to them in most cases and possess several promising properties required by our methodology.

Comparison with random macros.  We first compare the performance of the macros constructed by Algorithm 3 against randomly constructed macros. We present the learning curves of ‘A2C with m^A2C’ and ‘A2C with m^D’ against ‘A2C with the best random macro m^Random’ in Figs. 5 and 5, respectively. For a fair comparison, we randomly select 50 out of all possible macros with lengths from 2 to 5, and evaluate each random macro for 5M timesteps for 2 times similar to how GA evaluates. We limit the lengths of the macros from 2 to 5 because it is the range GA has attempted. Following this procedure, the macro with highest evaluation score is selected to be m^Random. The agents are then trained for 10M timesteps, and the results are plotted in Figs. 5 and 5 (denoted as “Random”). It is observed that the two curves for ‘A2C with m^Random’ are not obviously superior to those of ‘A2C with m^A2C’ and ‘A2C with m^D’. We further compare the 50 randomly constructed macros with the 50 macros constructed in the course of Algorithm 3, and summarize their impacts on performance as distributions over mean scores in Figs. 5 and 5. It is observed that a larger portion of the macros constructed by our approach tend to result in higher mean scores than those constructed randomly. A detailed breakdown and analysis are offered in our supplementary material.

Comparison with action repeat.  We next compare the performance of ‘A2C with m^A2C’ and ‘A2C with m^D’ against ‘A2C with the best action repeat macro m^Repeat’, and plot the learning curves in Figs. 5 and 5. In order to construct m^Repeat for the two figures, we evaluate all possible action repeat macros with the same length to m^A2C and m^D for five random seeds and for 10M timesteps. The macro with highest evaluation score is then selected as m^Repeat. It is observed that the curves of ‘A2C with m^Repeat’ are significantly worse than those of ‘A2C with m^A2C’, ‘A2C with m^D’, and ‘A2C with m^Random’ in the figures. The low performance is probably due to the lack of diversity in m^Repeat.

Comparison with handcrafted macros.  We further investigate whether human intervention affects the outcome of Algorithm 3. As the initial population of the macros used by Algorithm 3 are randomly constructed, in this ablation study we replace the initial population by handcrafted macros (1,1,1),(1,1,1,1),(1,1,1,1,1), which correspond to consecutive ‘fire’ action in BeamRider. These handcrafted macros are advantageous to the agent from the perspective of human players, as waves of enemies continue to emerge throughout the game. The macro constructed in this manner is denoted as m^Human. Fig. 5 shows a comparison of the learning curves of ‘A2C with m^A2C’ and ‘A2C with m^Human’. It is observed that the two curves are almost aligned. We therefore conclude that our approach does not need human intervention and is unaffected by the initial population of the macros.

5 Conclusions

We have formally presented a methodology to construct macro actions that may potentially improve both the performance and learning efficiency of the existing DRL methods. The methodology falls within the scope of a broader framework that permits other possible combinations of the DRL method, the action space augmentation method, the evaluation method, as well as the macro action construction method. We formulated the proposed methodology as a set of algorithms, and used them as the basis for investigating the interesting properties of macro actions. Our results revealed that the macro actions constructed by our methodology are complementary to two representative DRL methods, and may demonstrate transferability among different DRL methods and similar environments. We additionally compared our methodology against three other macro construction methods to justify our design decisions. Our work paves a way for future research on macro actions and their applications.

References

  • Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, Dec. 2013.
  • Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb. 2015.
  • Mnih et al. [2016] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proc. Int. Conf. Machine Learning (ICML), pages 1928–1937, Jun. 2016.
  • Salimans et al. [2017] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864, Sep. 2017.
  • Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, Aug. 2017.
  • Moriarty et al. [1999] D. E. Moriarty, A. C. Schultz, and J. J. Grefenstette. Evolutionary algorithms for reinforcement learning. J. Artificial Intelligence Research (JAIR), 11:241–276, 1999.
  • Such et al. [2018] F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv:1712.06567, Apr. 2018.
  • Houthooft et al. [2016] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. VIME: Variational information maximizing exploration. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 1109–1117, Dec. 2016.
  • Sutton et al. [1999] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, Aug. 1999.
  • Vezhnevets et al. [2016] A. Vezhnevets, V. Mnih, S. Osindero, A. Graves, O. Vinyals, J. Agapiou, et al. Strategic attentive writer for learning macro-actions. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 3486–3494, Dec. 2016.
  • Kulkarni et al. [2016] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 3675–3683, Dec. 2016.
  • Bacon et al. [2017] P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Proc. the Thirty-First AAAI Conf. Artificial Intelligence (AAAI-17), pages 1726–1734, Feb. 2017.
  • Frans et al. [2017] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. Meta learning shared hierarchies. In Proc. Int. Conf. Learning Represenations (ICLR), Apr.-May 2017.
  • Daniel et al. [2016] C. Daniel, H. Van Hoof, J. Peters, and G. Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, 104(2-3):337–357, Sep. 2016.
  • Florensa et al. [2017] C. Florensa, Y. Duan, and P. Abbeel. Stochastic neural networks for hierarchical reinforcement learning. Proc. Int. Conf. Learning Represenations (ICLR), Apr. 2017.
  • Girgin et al. [2010] S. Girgin, F. Polat, and R. Alhajj. Improving reinforcement learning by using sequence trees. Machine Learning, 81(3):283–331, Dec. 2010.
  • Heess et al. [2016] N. Heess, G. Wayne, Y. Tassa, T. Lillicrap, M. Riedmiller, and D. Silver. Learning and transfer of modulated locomotor controllers. arXiv:1610.05182, Oct. 2016.
  • Fikes and Nilsson [1971] R. E. Fikes and N. J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3-4):189–208, Sep. 1971.
  • Siklossy and Dowson [1977] L. Siklossy and C. Dowson. The role of preprocessing in problem solving systems. In Proc. Int. Joint Conf. Artificial Intelligence (IJCAI), pages 465–471, Aug. 1977.
  • Minton [1985] S. Minton. Selectively generalizing plans for problem-solving. In Proc. Int. Joint Conf. Artificial Intelligence (IJCAI), pages 596–599, Aug. 1985.
  • Pickett and Barto [2002] M. Pickett and A. G. Barto. PolicyBlocks: An algorithm for creating useful macro-actions in reinforcement learning. In Proc. Int. Conf. Machine Learning (ICML), pages 506–513, Jul. 2002.
  • Botea et al. [2005] A. Botea, M. Enzenberger, M. Müller, and J. Schaeffer. Macro-FF: Improving AI planning with automatically learned macro-operators. J. Artificial Intelligence Research (JAIR), 24:581–621, Oct. 2005.
  • Newton et al. [2005] M. Newton, J. Levine, and M. Fox. Genetically evolved macro-actions in AI planning problems. Proc. the UK Planning and Scheduling Special Interest Group (PlanSIG) Wksp., pages 163–172, 2005.
  • Newton et al. [2007] M. A. H. Newton, J. Levine, M. Fox, and D. Long. Learning macro-actions for arbitrary planners and domains. In Proc. Int. Conf. Automated Planning and Scheduling (ICAPS), pages 256–263, Sep. 2007.
  • DiStefano III et al. [1967] J. J. DiStefano III, A. R. Stubberud, and I. J. Williams. Feedback and control systems. McGraw-Hill, 1967.
  • Durugkar et al. [2016] I. P. Durugkar, C. Rosenbaum, S. Dernbach, and S. Mahadevan. Deep reinforcement learning with macro-actions. arXiv:1606.04615, Jun. 2016.
  • Randlov [1999] J. Randlov. Learning macro-actions in reinforcement learning. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pages 1045–1051, Dec. 1999.
  • Yoshikawa and Kurihara [2006] T. Yoshikawa and M. Kurihara. An acquiring method of macro-actions in reinforcement learning. In Proc. IEEE Int. Conf. Systems, Man and Cybernetics, pages 4813–4817, Nov. 2006.
  • Onda and Ozawa [2009] H. Onda and S. Ozawa. A reinforcement learning model using macro-actions in multi-task grid-world problems. In Proc. IEEE Int. Conf. Systems, Man and Cybernetics, pages 3088–3093, Oct. 2009.
  • Lakshminarayanan et al. [2017] A. S. Lakshminarayanan, S. Sharma, and B. Ravindran. Dynamic action repetition for deep reinforcement learning. In Proc. the Thirty-First AAAI Conf. Artificial Intelligence (AAAI-17), Feb. 2017.
  • Sharma et al. [2017] S. Sharma, A. S. Lakshminarayanan, and B. Ravindran. Learning to repeat: Fine grained action repetition for deep reinforcement learning. In Proc. Int. Conf. Learning Represenations (ICLR), Apr.-May 2017.
  • McGovern et al. [1997] A. McGovern, R. S. Sutton, and A. H. Fagg. Roles of macro-actions in accelerating reinforcement learning. In Proc. Grace Hopper celebration of women in computing, volume 1317, 1997.
  • Heecheol et al. [2019] K. Heecheol, M. Yamada, K. Miyoshi, and H. Yamakawa. Macro action reinforcement learning with sequence disentanglement using variational autoencoder. arXiv:1903.09366, May 2019.
  • Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2018.
  • Brockman et al. [2016] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv:1606.01540, Jun. 2016.
  • Kempka et al. [2016] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowskim. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In IEEE Conf. Computational Intelligence and Games (CIG), pages 1–8, Sep. 2016.
  • Hopcroft and Ullman [1979] J. E. Hopcroft and J. D. Ullman. Introduction to automata theory, languages, and computation. Addison-Wesley, 1979.
  • Bellemare et al. [2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artificial Intelligence Research (JAIR), 47:253–279, Jun. 2013.
  • Wydmuch et al. [2018] M. Wydmuch, M. Kempka, and W. Jaśkowski. ViZDoom competitions: Playing Doom from pixels. IEEE Trans. Games, 2018.
  • Pathak et al. [2017] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proc. Int. Conf. Machine Learning (ICML), Aug. 2017.