Hierarchical Programmatic Reinforcement Learning via Learning to Compose Programs

Abstract

Aiming to produce reinforcement learning (RL) policies that are human-interpretable and can generalize better to novel scenarios, Trivedi et al. (2021) present a method (LEAPS) that first learns a program embedding space to continuously parameterize diverse programs from a pre-generated program dataset, and then searches for a task-solving program in the learned program embedding space when given a task. Despite the encouraging results, the program policies that LEAPS can produce are limited by the distribution of the program dataset. Furthermore, during searching, LEAPS evaluates each candidate program solely based on its return, failing to precisely reward correct parts of programs and penalize incorrect parts. To address these issues, we propose to learn a meta-policy that composes a series of programs sampled from the learned program embedding space. By learning to compose programs, our proposed hierarchical programmatic reinforcement learning (HPRL) framework can produce program policies that describe out-of-distributionally complex behaviors and directly assign credits to programs that induce desired behaviors. The experimental results in the Karel domain show that our proposed framework outperforms baselines. The ablation studies confirm the limitations of LEAPS and justify our design choices.
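
The composition idea can be illustrated with a toy sketch. This is not the HPRL implementation. It assumes a hypothetical one-dimensional task (walk from position 0 to a goal), a hypothetical `decode` function standing in for the learned program decoder, and a greedy selection rule swapped in for the RL-trained meta-policy. The cap `MAX_STEPS_PER_PROGRAM` is an assumed stand-in for the limit that the program dataset distribution places on any single program.

```python
from typing import List, Tuple

# Assumed cap on single-program length; stands in for the dataset-distribution limit.
MAX_STEPS_PER_PROGRAM = 2


def decode(z: float) -> List[int]:
    """Hypothetical decoder: map a scalar embedding to a program of +1/-1 moves."""
    steps = min(MAX_STEPS_PER_PROGRAM, round(abs(z)))
    direction = 1 if z >= 0 else -1
    return [direction] * steps


def execute(program: List[int], position: int, goal: int) -> Tuple[int, float]:
    """Run one program; its return is the reduction in distance to the goal."""
    before = abs(goal - position)
    for move in program:
        position += move
    return position, float(before - abs(goal - position))


def compose(goal: int, candidates: List[float], horizon: int):
    """Pick one embedding per step and record the return of each chosen program."""
    position = 0
    chosen: List[float] = []
    credits: List[float] = []
    for _ in range(horizon):
        # Greedy stand-in for the meta-policy (the paper learns this with RL).
        best_z = max(candidates, key=lambda z: execute(decode(z), position, goal)[1])
        position, reward = execute(decode(best_z), position, goal)
        chosen.append(best_z)
        credits.append(reward)  # credit is attached to this program alone
        if position == goal:
            break
    return position, chosen, credits


candidates = [-2.0, -1.0, 1.0, 2.0]
final_position, chosen, credits = compose(goal=5, candidates=candidates, horizon=4)

# Searching for one program only: the farthest any single program gets toward the goal.
best_single = max(execute(decode(z), 0, 5)[0] for z in candidates)

print(final_position, chosen, credits, best_single)
```

In this toy setting no single decoded program reaches the goal at position 5, while a composed sequence does, and each program in the sequence receives its own return rather than sharing one episode-level score.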
