Abstract
Real-world tasks are often highly structured. Hierarchical reinforcement learning (HRL) has attracted research interest as an approach for leveraging the hierarchical structure of a given task in reinforcement learning (RL). However, identifying the hierarchical policy structure that enhances the performance of RL is not a trivial task. In this paper, we propose an HRL method that learns a latent variable of a hierarchical policy using mutual information maximization. Our approach can be interpreted as a way to learn a discrete and latent representation of the state-action space. To learn option policies that correspond to modes of the advantage function, we introduce advantage-weighted importance sampling. In our HRL method, the gating policy learns to select option policies based on an option-value function, and these option policies are optimized based on the deterministic policy gradient method. This framework is derived by leveraging the analogy between a monolithic policy in standard RL and a hierarchical policy in HRL by using a deterministic option policy. Experimental results indicate that our HRL approach can learn a diversity of options and that it can enhance the performance of RL in continuous control tasks.