Abstract
Solutions to most complex tasks can be decomposed into simpler, intermediate skills, reusable across wider ranges of problems. We follow this concept and introduce Hindsight Off-policy Options (HO2), a new algorithm for efficient and robust option learning. The algorithm relies on critic-weighted maximum likelihood estimation and an efficient dynamic programming inference procedure over off-policy trajectories. We can backpropagate through the inference procedure across time and through the policy components at every time step, making it possible to train all components' parameters off-policy, independently of the data-generating behavior policy. Experimentally, we demonstrate that HO2 outperforms competitive baselines and solves demanding robot stacking and ball-in-cup tasks from raw pixel inputs in simulation. We further compare autoregressive option policies with simple mixture policies, providing insights into the relative impact of two types of abstractions common in the options framework: action abstraction and temporal abstraction. Finally, we illustrate challenges caused by stale data in off-policy options learning and provide effective solutions.