On Mechanisms for Transfer using Landmark Value Functions in Multi-Task Lifelong Reinforcement Learning
Abstract
Transfer learning across different reinforcement learning (RL) tasks is becoming an increasingly valuable area of research. We consider a goal-based multi-task RL framework and mechanisms by which previously solved tasks can reduce sample complexity and regret when the agent is faced with a new task. Specifically, we introduce two metrics on the state space that encode notions of traversability of the state space for an agent. Using these metrics, a topological covering is constructed by way of a set of landmark states in a fully self-supervised manner. We show that these landmark coverings confer theoretical advantages for transfer learning within the goal-based multi-task RL setting. Specifically, we demonstrate three mechanisms by which landmark coverings can be used for successful transfer learning. First, we extend the Landmark Options Via Reflection (LOVR) framework to this new topological covering; second, we use the landmark-centric value functions themselves as features and define a greedy zombie policy that achieves near-oracle performance on a sequence of zero-shot transfer tasks; finally, motivated by the second transfer mechanism, we introduce a learned reward function that provides a denser reward signal for goal-based RL. Our novel topological landmark covering confers beneficial theoretical results, bounding the $Q$ values at each state-action pair. In doing so, we introduce a mechanism that prunes infeasible actions, which cannot possibly be part of an optimal policy for the current goal. In effect, this action-pruning mechanism can be interpreted as a sense of safety, as it prevents the agent from taking any actions that are sufficiently detrimental towards accomplishing its goal. Empirical results on the cliff-walker domain support these theoretical results.
Finally, since each of the transfer mechanisms has its own benefits and deficits in terms of finite-time analysis, we demonstrate empirically that a hierarchical agent that uses a multi-armed bandit controller at the start of each episode to select from these transfer mechanisms learns to use each mechanism when it confers the most advantage, and to discontinue its use when it is no longer advantageous.
Nick Denis, Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON K1N 6N5, [email protected]
Preprint. Work in progress.
1 Introduction
The strength of reinforcement learning (RL) in solving sequential decision problems is apparent in the broad and increasing range of problem types considered by RL researchers. One example is the lifelong learning setting [1], which assumes that an RL agent is faced with a sequence of MDPs that share certain properties and are thus viewed as coming from a common environment. As in transfer learning [2], rather than beginning afresh with each task, an efficient lifelong agent should be able to leverage past knowledge of its experience within the environment in order to solve each new task as quickly as possible. Hierarchical RL and the use of temporally extended actions are essential in overcoming the curse of dimensionality in many settings [3]. How to structure control and learning within an RL hierarchy is an active research question [4-7]. More generally, hierarchical methods, representation learning and semi-supervised learning (SSL) are all well-established areas of research in their own right (see, respectively, [8,9] for surveys), with fruitful interplay between them; see [10,11] for two recent examples.
We address how previously solved tasks can be used to reduce sample complexity and/or regret on future tasks sampled from the same environment. Specifically, we ask: how can access to optimal $Q$ value functions with respect to a set of landmark states help in solving a new goal-based task from the same environment? To this end, we extend the Landmark Options Via Reflection (LOVR) framework by replacing the notion of an $\eta$-reachability covering of landmarks, $\mathbb{L}$, with a true topological covering under metrics on $\mathcal{S}$ that carry meaningful semantics for the goal-based multi-task setting. This landmark covering is first learned in a purely self-supervised setting, after which we introduce different mechanisms by which $\{Q^{\ell}\}_{\ell\in\mathbb{L}}$ can be used to solve a new task. The first mechanism simply extends the LOVR framework [12]. The second mechanism involves representing the current state $s\in\mathcal{S}$ as $\{V^{\ell}(s)\}_{\ell\in\mathbb{L}}\equiv V^{\mathbb{L}}(s)\equiv(V^{\ell_1}(s),V^{\ell_2}(s),\ldots,V^{\ell_{|\mathbb{L}|}}(s))$. This approach finds inspiration in the philosophy of phenomenology, specifically that of Martin Heidegger, whereby our representation of objects in the world is not objective in any sense, but rather comes to us through how the object provides us subjective utility, being ready-to-hand.
Hence, this state representation, $V^{\mathbb{L}}(s)$, represents the current state of the environment in terms of value or utility from a vector of contexts. Depending on the environment, if $\mathcal{S}=\mathbb{R}^{d}$ and $|\mathbb{L}|<d$, then $V^{\mathbb{L}}(s)$ has the benefit of inducing a dimensionality reduction on the state representation. In this study, this state representation is coupled to a greedy and deterministically defined zombie policy, which empirically achieves near-oracle levels of regret on future tasks despite no further learning updates taking place. This represents an effective zero-shot transfer learning approach. Finally, the third mechanism for transfer is motivated by the $V^{\mathbb{L}}$ state representation and defines a novel reward function that is denser and more varied than the action-penalty reward structure normally considered. We also show that a hierarchical agent comprising a bandit controller, which at the start of each episode selects among baseline learning and the transfer mechanisms, achieves state-of-the-art performance.
From a theoretical standpoint, we show that under the topological landmark covering, lower and upper bounds on the optimal state-action value function can be derived for any new goal task $g$. With these bounds, action-pruning can be used to prevent actions at particular states that cannot possibly realize the optimal state-action value function for the given task. In effect, this action-pruning mechanism can be interpreted as a sense of safety, as it prevents the agent from taking any actions that may be detrimental towards accomplishing its goal. Empirical results on the cliff-walker domain support these theoretical results.
Our contributions include:

• We introduce two metrics on $\mathcal{S}$ that are semantically meaningful for the goal-based multi-task RL setting.

• We extend the LOVR framework using a landmark covering, $\mathbb{L}$, which is a true topological covering of $\mathcal{S}$.

• For any new goal task, $g$, we show that $\mathbb{L}$ can be used to obtain theoretical bounds on $V^{g^*}(s)$, $\forall s\in\mathcal{S}$.

• The aforementioned theoretical bounds are used to motivate an action-constrained exploration strategy whereby actions that cannot possibly be taken by an optimal policy are not considered during exploration. This action-constrained exploration approach can be interpreted as a sense of safety in exploration, which is demonstrated empirically in the cliff-walker domain.

• We introduce a novel state representation approach by way of $V^{\mathbb{L}}(s)\equiv\{V^{\ell}(s)\}_{\ell\in\mathbb{L}}$.

• A greedy and deterministic zombie policy is introduced, which, when coupled with the $V^{\mathbb{L}}$ state representation, achieves near-oracle regret on future tasks, despite no further learning updates being made.

• A novel form of transfer is introduced through a reward function that utilizes $V^{\mathbb{L}}$, and empirically achieves lower finite-time regret compared to baseline.

• A hierarchical agent utilizing a bandit controller to select among the transfer mechanisms and the baseline learning algorithm makes use of each approach at distinct periods of the learning process, leading to strong empirical performance.
In the remainder of the paper, we first review relevant background and related research, including the LOVR framework. Next, we introduce the metrics on $\mathcal{S}$ used in this study, and discuss the topological covering $\mathbb{L}$ and the theoretical results stemming from its use in the goal-based multi-task RL setting, including the novel action-constrained exploration strategy. We demonstrate empirical results for the action-constrained exploration strategy on the cliff-walker domain, before introducing the $V^{\mathbb{L}}$ state representation, the zombie policy $\pi^{\mathbb{L}}$, and transfer through the reward function $r^{\mathbb{L}}$. Finally, we report empirical results on the MNIST world domain.
2 Background
Goal-based multi-task reinforcement learning
We consider a multi-task RL setting where the agent is confined to a stationary environment and, throughout its lifetime, is assigned a sequence of tasks. A finite sequence of tasks is represented as episodic MDPs $\mathcal{M}_i=\langle\mathcal{S},\mathcal{A},P,\mathcal{R},s_{g_i},\nu\rangle$, $i\in[T]$, with respective terminal states $s_{g_i}$. In summary: the state space $\mathcal{S}$, set of actions $\mathcal{A}$, transition probability kernel $P$, reward function $\mathcal{R}$ and initial state distribution $\nu$ all remain fixed across tasks, and only the goal state varies. The goal state $s_{g_i}$ thus encodes the $i$-th task, though when speaking of a single task we denote it simply as $s_g$. In many settings it may be more applicable to consider goal sets $\mathcal{G}_i\subset\mathcal{S}$ rather than a single goal state. Our usage of terminal state follows the common formulation [13] for episodic tasks where one wants the agent, such as a robot, to perform a single specific function and, upon completion, to become inactive and remain so until a new episode or task is begun. In general one could let $K_i$ be the number of episodes for which $\mathcal{M}_i$ is run before the next task is assigned, but we take $K_i\equiv K$ to be independent of $i$ in the present paper. We consider the action-penalty reward structure [14] of $-1$ for all transitions, which reinforces the agent towards a policy of arriving at the current goal in as few steps as possible. The goal of the lifelong learning agent is to solve for a sequence of policies $\{\pi_i^*\}_{i=1}^{T}$ that minimizes cumulative regret.
Under this reward structure, $V^{g}(s)$ is clearly the negative expected number of steps from state $s$ to goal $g$, and since $\pi^*$ is optimal, $V^{g^*}(s)$ encodes the "fastest" path to the goal.
Options Framework
The options framework is a mathematically principled approach for temporally extended actions [15]. An option $o\in\mathcal{O}$ is a triple, $o=\langle\mathcal{I}_o,\pi_o,\beta_o\rangle$, where $\mathcal{I}_o\subseteq\mathcal{S}$ is the initiation set, $\pi_o$ is the option policy that maps states to primitive actions, and $\beta_o$ is the termination function, which controls when to terminate $\pi_o$ and return control back to $\pi$. The inclusion of options in an MDP results in a semi-MDP (SMDP), to which standard RL algorithms apply [15]. Landmark options were first introduced in [16,15] as policies leading towards "landmark" states. We retain this aspect, but the way they are employed in our framework is different [12].
3 Introduced Traversability Metrics and Landmark Coverings
We begin by introducing two metrics on the state space of MDPs. These metrics consider first hitting times between states, which carry with them a meaningful and useful interpretation under goal-based RL. Before introducing the two metrics, we recall the notion of a topological covering.
Definition 1.
(Covering) Let $(\mathcal{X},d)$ be a metric space. Let $\eta>0$. We say $\mathcal{C}\subset\mathcal{X}$ is an $\eta$-cover of $\mathcal{X}$ if $\forall x\in\mathcal{X}$, $\exists c\in\mathcal{C}$ such that $d(x,c)\le\eta$.
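Definition 1 can be checked mechanically on a finite metric space. The sketch below is only an illustration: the greedy construction, the function names and the toy metric $d(x,y)=|x-y|$ are assumptions made here, not the procedure the paper uses to learn its landmark covering.

```python
def is_eta_cover(points, centers, dist, eta):
    """True if every point lies within distance eta of at least one center."""
    return all(any(dist(x, c) <= eta for c in centers) for x in points)


def greedy_eta_cover(points, dist, eta):
    """Scan the points once, promoting any point not yet covered to a center."""
    centers = []
    for x in points:
        if not any(dist(x, c) <= eta for c in centers):
            centers.append(x)
    return centers


# Hypothetical toy metric space: ten points on a line with d(x, y) = |x - y|.
points = list(range(10))
dist = lambda x, y: abs(x - y)

centers = greedy_eta_cover(points, dist, eta=2)
print(centers, is_eta_cover(points, centers, dist, eta=2))
```

The greedy pass always returns a valid $\eta$-cover, because every point is either covered when it is visited or becomes a center itself, but it makes no attempt to minimize the number of centers.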
Recall that for all tasks, $g$, we use the action-penalty reward function $r\equiv-1$. It is easy to show that under this reward function, $\forall g,s\in\mathcal{S}$, $|V^{g^*}(s)|=\mathbb{E}_{\pi^*}[\text{min. number of steps from } s \text{ to } g]$. That is, $|V^{g^*}(s)|$ is the expected first hitting time of state $g$ from state $s$ under some optimal shortest path policy. With this, we introduce two metrics.
Definition 2.
(Round-trip metric) For $d_{rt}:\mathcal{S}\times\mathcal{S}\to[0,2D]$, $D$ the diameter of $\mathcal{S}$, we define the round-trip function $d_{rt}$:
$d_{rt}(x,y)=|V^{x^*}(y)|+|V^{y^*}(x)|$
Definition 3.
(Max one-way metric) For $d_{\infty}:\mathcal{S}\times\mathcal{S}\to[0,D]$, $D$ the diameter of $\mathcal{S}$, we define the max one-way function $d_{\infty}$:
$d_{\infty}(x,y)=\max\{|V^{x^*}(y)|,|V^{y^*}(x)|\}$
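As an illustration of Definitions 2 and 3, the following sketch computes both functions from shortest-path step counts in a small deterministic environment. The one-way ring and the helper names are hypothetical; in a deterministic MDP with reward $-1$ per step, the step count `h[s][g]` plays the role of $|V^{g^*}(s)|$.

```python
from collections import deque


def shortest_steps(succ):
    """h[s][g] = minimum number of steps from s to g (breadth-first search).

    succ[s] lists the states reachable from s with one action.
    """
    n = len(succ)
    h = [[float("inf")] * n for _ in range(n)]
    for s in range(n):
        h[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in succ[u]:
                if h[s][v] == float("inf"):
                    h[s][v] = h[s][u] + 1
                    queue.append(v)
    return h


def d_rt(h, x, y):
    """Round-trip function: steps from x to y plus steps from y back to x."""
    return h[x][y] + h[y][x]


def d_inf(h, x, y):
    """Max one-way function: the slower of the two directions."""
    return max(h[x][y], h[y][x])


# Hypothetical toy environment: a one-way ring 0 -> 1 -> 2 -> 3 -> 0.
ring = [[1], [2], [3], [0]]
h = shortest_steps(ring)
print(h[0][1], h[1][0], d_rt(h, 0, 1), d_inf(h, 0, 1))
```

On the ring, going from state 0 to state 1 takes one step but the return trip takes three, so a single hitting time is not symmetric. Combining the two directions, by sum or by maximum, is what restores symmetry.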
It is straightforward to prove that both the round-trip and max one-way functions are in fact metrics on $\mathcal{S}$. These metrics are helpful for goal-based RL as they carry meaningful semantics no matter the state representation, as opposed to other metrics such as $\mathcal{L}_p$ norms, which don't carry any meaning in some state representation spaces (e.g. images). Using these metrics, we look to construct $\eta$-coverings of $\mathcal{S}$ and see how they can confer theoretical benefits in the goal-based multi-task RL setting. First, we show that an agent with access to such a covering, $\mathbb{L}$, is able to immediately bound the value of any state $s\in\mathcal{S}$ for any novel task $g$.
Proposition 4.
Let $\mathbb{L}$ be an $\eta$-cover of $\mathcal{S}$ under the round-trip metric $d_{rt}$. Let $g\in\mathcal{S}$ be a task. Then $\exists\ell\in\mathbb{L}$ such that $|V^{g^*}(\ell)|\le\eta-|V^{\ell^*}(g)|$, and $\forall s\in\mathcal{S}$, $\exists\ell\in\mathbb{L}$:
$|V^{g^*}(s)|\in[\max\{0,|V^{\ell^*}(s)|-|V^{\ell^*}(g)|\},\;|V^{\ell^*}(s)|-|V^{\ell^*}(g)|+\eta]$
Moreover, these bounds are tight, in the sense that there exists an MDP $\mathcal{M}$ for which these bounds are realized.
Proposition 5.
Let $\mathbb{L}$ be an $\eta$-cover of $\mathcal{S}$ under the max one-way metric $d_{\infty}$. Let $g\in\mathcal{S}$ be a task. Then $\exists\ell\in\mathbb{L}$ such that $|V^{g^*}(\ell)|\le\eta$, and $\forall s\in\mathcal{S}$, $\exists\ell\in\mathbb{L}$:
$|V^{g^*}(s)|\in[\max\{0,|V^{\ell^*}(s)|-|V^{\ell^*}(g)|\},\;|V^{\ell^*}(s)|+\eta]$
Moreover, these bounds are tight, in the sense that there exists an MDP $\mathcal{M}$ for which these bounds are realized.
Note that for both Propositions 4 and 5, the $\ell\in\mathbb{L}$ referenced in the existence statements is the landmark that witnesses the goal, that is, $d(g,\ell)\le\eta$.
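The two intervals can be evaluated directly once the landmark values are known. The sketch below does so on a hypothetical seven-state corridor where moving between neighbouring states costs one step, so that `h[x][y] = |x - y|` stands in for $|V^{y^*}(x)|$. The function names and the choice of $g$ and $\ell$ are assumptions for illustration; $\eta$ is set to the smallest value for which $\ell$ witnesses $g$ under each metric.

```python
N = 7  # hypothetical corridor with states 0..6
# h[x][y] = minimum number of steps from x to y, standing in for |V^{y*}(x)|.
h = [[abs(x - y) for y in range(N)] for x in range(N)]


def bounds_round_trip(h, s, g, l, eta):
    """Interval for |V^{g*}(s)| when d_rt(g, l) <= eta (Proposition 4)."""
    diff = h[s][l] - h[g][l]
    return max(0, diff), diff + eta


def bounds_max_one_way(h, s, g, l, eta):
    """Interval for |V^{g*}(s)| when d_inf(g, l) <= eta (Proposition 5)."""
    diff = h[s][l] - h[g][l]
    return max(0, diff), h[s][l] + eta


g, l = 4, 3                          # goal and the landmark that witnesses it
eta_rt = h[g][l] + h[l][g]           # d_rt(g, l)
eta_inf = max(h[g][l], h[l][g])      # d_inf(g, l)

for s in range(N):
    print(s, h[s][g],
          bounds_round_trip(h, s, g, l, eta_rt),
          bounds_max_one_way(h, s, g, l, eta_inf))
```

In this toy corridor the upper bound is attained for states to the left of the landmark and the lower bound for states to the right of the goal, which is consistent with the tightness claim in the propositions.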
4 Proposed Transfer Mechanisms
4.1 Transfer through Safety via Action Pruning
The first transfer mechanism we introduce relates to a notion of safety during exploration. Having bounds on the optimal value function of any novel goal task $g$, and thus on the expected first hitting time along some optimal policy, is quite beneficial. With both upper and lower bounds on the expected first hitting time for some optimal policy, inductive biases can be encoded into exploration policies that remove from consideration actions that cannot possibly be selected by an optimal policy. We believe that this encoding can be interpreted as a sense of safety during exploration. Our aim is to introduce an action-pruning mechanism based on these theoretical bounds. We begin by motivating the intuition behind this action-pruning mechanism.
Definition 6.
(Oracle function) Let $o:\mathcal{S}\times\mathcal{S}\to[0,D]$ be the oracle function, where $o(s,g)\equiv o_g(s)\equiv|V^{g^*}(s)|.$
Hence, the oracle function $o$ takes as arguments an arbitrary state-goal pair, and returns the expected first hitting time from the state to the goal under some optimal policy.
Definition 7.
(Oracle feasible actions) The oracle feasible actions, $A_o(s,g,\mathbb{L})$, are those actions that are feasible, or plausibly possible, given the optimal landmark $Q$ values and the oracle. Symbolically,
$A_o(s,g,\mathbb{L}):=\{a\in\mathcal{A}:|Q^{\ell^*}(s,a)|\le|V^{\ell^*}(g)|+o_g(s),\;\forall\ell\in\mathbb{L}\}$
$A_o(s,g,\mathbb{L})$ contains all the feasible actions in the sense that, for a given goal $g$ and state $s$, if $\pi_g^*(s)=a$ then $a\in A_o(s,g,\mathbb{L})$, and equivalently, if $a\notin A_o(s,g,\mathbb{L})$ then $\pi_g^*(s)\neq a$. The first claim is trivial to show since $\pi_g^*(s)$ is the action that maximizes $Q^{g^*}(s,a)$, which also realizes $o_g(s)$. To show that if a given action $a\notin A_o(s,g,\mathbb{L})$ then $\pi_g^*(s)\neq a$, suppose otherwise. Since $a\notin A_o(s,g,\mathbb{L})$, $\exists\ell\in\mathbb{L}$ such that $|Q^{\ell^*}(s,a)|>|V^{\ell^*}(g)|+|V^{g^*}(s)|$. However, this implies that the expected number of steps to arrive at $\ell$ from $s$ by taking action $a$ and then acting optimally towards $\ell$ is strictly greater than that of following the optimal policy to $g$ and then the optimal policy from $g$ to $\ell$. But this is a contradiction: since $a$ is the optimal action $\pi_g^*(s)$, the latter route itself begins with $a$, so it cannot be faster than acting optimally towards $\ell$ after taking $a$. We make no assumption of having access to an oracle $o$, and therefore we do not have access to $A_o(s,g,\mathbb{L})$. However, we will construct sets of feasible actions based on each metric, which are supersets of $A_o$, and hence any $a$ not in the feasible sets cannot be part of any optimal policy.
We do so by using our upper-bound estimates on $o_g(s)$ as given by Propositions 4 and 5.
Definition 8.
(Feasible sets) Given $\mathbb{L}$ an $\eta$-cover of $\mathcal{S}$ under $d_{rt}$ or $d_{\infty}$, the feasible sets for each covering are defined, respectively:
$A_{rt}=\{a\in\mathcal{A}:|Q^{\ell^*}(s,a)|\le|V^{\ell^*}(s)|+\eta,\;\forall\ell\in\{\ell'\in\mathbb{L}\mid d_{rt}(g,\ell')\le\eta\}\}$
$A_{\infty}=\{a\in\mathcal{A}:|Q^{\ell^*}(s,a)|\le|V^{\ell^*}(g)|+|V^{\ell^*}(s)|+\eta,\;\forall\ell\in\{\ell'\in\mathbb{L}\mid d_{\infty}(g,\ell')\le\eta\}\}$
We see that the covering $\mathbb{L}$ can now be used to make exploration more efficient by restricting action selection to the feasible actions. In many states it is likely that the feasible actions are simply the entire set of actions $\mathcal{A}$. However, actions that are not among the feasible actions will, in expectation, significantly delay the agent from arriving at the goal state and hence should not be taken. Note that the feasible sets are defined purely with respect to value functions associated with the landmark states $\mathbb{L}$, which can be solved in a self-supervised manner in an initial phase, before the agent is assigned the sequence of tasks. We first demonstrate how restricting exploration to only those actions within the feasible sets can confer notions of safety, and thereby reduce sample complexity, by performing experiments on the cliff-walker domain (Results Section).
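A minimal sketch of the round-trip feasible set is given below. Everything about the environment is an assumption made for illustration: a deterministic eight-state corridor with a hypothetical `reset` action that sends the agent back to state 0, loosely mimicking a cliff, and a hand-picked landmark set. Step counts `h[s][l]` stand in for $|V^{\ell^*}(s)|$, and `q_steps` for $|Q^{\ell^*}(s,a)|$.

```python
from collections import deque

N = 8                                  # corridor states 0..7
ACTIONS = ("left", "right", "reset")   # "reset" is a hypothetical cliff-like action


def step(s, a):
    """Deterministic transition; "reset" sends the agent back to state 0."""
    if a == "left":
        return max(s - 1, 0)
    if a == "right":
        return min(s + 1, N - 1)
    return 0


def all_pairs_steps():
    """h[s][t] = minimum number of steps from s to t, i.e. |V^{t*}(s)|."""
    h = [[None] * N for _ in range(N)]
    for s in range(N):
        h[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for a in ACTIONS:
                v = step(u, a)
                if h[s][v] is None:
                    h[s][v] = h[s][u] + 1
                    queue.append(v)
    return h


def q_steps(h, s, a, l):
    """|Q^{l*}(s, a)|: take a once, then follow a shortest path to landmark l."""
    return 1 + h[step(s, a)][l]


def feasible_round_trip(h, s, g, landmarks, eta):
    """Keep a if |Q^{l*}(s,a)| <= |V^{l*}(s)| + eta for every witness l of g."""
    witnesses = [l for l in landmarks if h[g][l] + h[l][g] <= eta]
    return [a for a in ACTIONS
            if all(q_steps(h, s, a, l) <= h[s][l] + eta for l in witnesses)]


h = all_pairs_steps()
landmarks, eta, g = [1, 4, 7], 2, 6    # {1, 4, 7} is a 2-cover under d_rt here
for s in range(N):
    print(s, feasible_round_trip(h, s, g, landmarks, eta))
```

In this toy example the `reset` action is pruned everywhere except near state 0, where resetting costs almost nothing, while the action that is optimal for the goal is never removed.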
4.2 Transfer through Landmark Options
The second transfer mechanism considered is simply an extension of the previously introduced Landmark Options Via Reflection (LOVR) framework; a more detailed explanation can be found in [12]. Briefly, by properly defining the initiation sets and termination conditions of the set of options, each landmark state is associated with a landmark-centric value function, $\{Q^{\ell}\}_{\ell\in\mathbb{L}}$, which induces a landmark option policy $o_{\ell}$. Specifically, at the start of each episode, the landmark option associated with the landmark state that is closest to the current goal $g$ under the $\eta$-covering is selected. The option is terminated when the current state is within an $\eta$-ball of the landmark closest to the current goal. Due to the metrics considered in these coverings, the current goal $g$ is also within an $\eta$-ball around this landmark. Upon terminating the landmark option, the initiation sets of all landmark options are empty, and hence the agent can only select primitive actions. In essence, this encoding of LOVR drives the agent to an $\eta$-ball around the landmark state that is closest to the current goal, and while within this $\eta$-ball, the agent explores and solves for an optimal policy towards the goal. Whenever the agent leaves the $\eta$-ball, it selects the landmark option that drives the agent back to the ball. In effect, LOVR reduces the state space of the current task to an $\eta$-ball, where $\eta$ is a hyperparameter set by the experimenter. Rather than exploring over the entire state space $\mathcal{S}$, the agent performs exploration only within an $\eta$-ball, which can be a dramatic reduction in size.
By controlling the size of $\eta$, the experimenter trades off an initial "overhead" for learning optimal policies for the covering $\mathbb{L}$ against a reduced effective state space size on all future tasks. Prior work introducing LOVR showed a dramatic improvement in finite sample complexity over baseline methods for a lifelong multi-task RL agent [12].
4.3 Transfer Through Intermediate State Representation: $V^{\mathbb{L}}$
Due to the semantic nature of the value functions considered in this setting, the evaluation of a single state with respect to the value functions of multiple previously solved tasks can be used to represent a state as a finite-length vector of expected first arrival times to various landmark states that cover the environment. This state representation was inspired by Martin Heidegger's Being and Time. All too briefly, Heidegger introduces the notion of objects being ready-to-hand. Our experience of an object that is ready-to-hand is rather that we do not experience the object at all. We take no notice of a pen when we use it to write; rather, it is simply an extension of ourselves. It is only when there is a malfunction (e.g. the pen is out of ink) that we become aware of the pen, as it has lost its "pen-ness" by way of a loss of functionality. Objects that are ready-to-hand are extensions of ourselves, and we come to experience and view the world through the utility of these objects, just as we experience and view the world through the utility of our own bodies. In this way, we make use of Heidegger's phenomenology by considering a set of previously learned options as extensions of self. We encode this by representing the state not as it is given to the agent, but rather as a vectorized representation of the values of the current state with respect to the previously learned options. In this way, the options can be interpreted as acting as an intermediate representation. This representation asks the question "What value is this current state with respect to all of my skills and knowledge of the world?", the answer being $V^{\mathbb{L}}(s)$.
Given $\mathbb{L}$, we define $V^{\mathbb{L}}(s)=(V^{\ell_1}(s),V^{\ell_2}(s),\ldots,V^{\ell_{|\mathbb{L}|}}(s))$ for some fixed indexing of $\mathbb{L}$. Hence, $V^{\mathbb{L}}$ is an $|\mathbb{L}|$-dimensional state representation that encodes semantics about the traversability of the state space. $Q^{\mathbb{L}}$ is defined similarly. If $|\mathbb{L}|=1$, then $V^{\mathbb{L}}(s)$ representations may actually be harmful, as they may alias the state space: many different states can have (approximately) the same expected first hitting time to $\ell$, which converts the problem to a partially observable MDP. However, if $|\mathbb{L}|$ is large enough, it is reasonable to assume that the map $\phi(s)=V^{\mathbb{L}}(s)$ becomes injective, and hence the aliasing effect will no longer be an issue. Moreover, in settings where the goal state $g$ is provided to the agent, the agent can represent this state as $V^{\mathbb{L}}(g)$. Such a representation can be favorable, as it can result in a dramatic dimensionality reduction of the state representation, especially in visual (image) environments. This can allow smaller function approximation architectures to be used, which speeds up learning and also provides stability.
Moreover, this state representation carries with it a natural and useful semantics, that of a localization of $s$ with respect to reference states ${\mathrm{\u0e42\x84\x93}}_{1},\mathrm{\u0e42\x80\u0e06},{\mathrm{\u0e42\x84\x93}}_{\mathrm{\u0e50\x9d\x95\x83}}$ in terms of expected first hitting times. We expect this state representation to be easy to exploit, both immediately by following deterministic greedy policies that minimize the distance between $s$ and $g$ in this representation space, and by use of function approximations such as neural networks which we believe may more quickly learn useful feature maps from this already semantically rich state representation.
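The aliasing and injectivity remarks above can be illustrated with a minimal sketch. It assumes a small, obstacle-free, deterministic grid, where the first hitting time to a landmark reduces to the Manhattan distance; the grid size, landmark positions, function names, and the positive-cost sign convention are all hypothetical choices made for illustration and are not taken from the experiments.

```python
from itertools import product

def hitting_time(state, landmark):
    """Steps needed to reach `landmark` from `state` on an obstacle-free,
    deterministic grid (Manhattan distance); a stand-in for V^l(s)."""
    return abs(state[0] - landmark[0]) + abs(state[1] - landmark[1])

def landmark_representation(state, landmarks):
    """V^L(s): one hitting time per landmark, in a fixed landmark order."""
    return tuple(hitting_time(state, l) for l in landmarks)

def is_injective(states, landmarks):
    """True if no two states share the same landmark representation."""
    reps = [landmark_representation(s, landmarks) for s in states]
    return len(set(reps)) == len(reps)

# Hypothetical 4 x 4 grid with one landmark versus three landmarks.
states = list(product(range(4), range(4)))
one_landmark = [(0, 0)]
three_landmarks = [(0, 0), (0, 3), (3, 0)]

print(is_injective(states, one_landmark))     # aliasing: (0, 1) and (1, 0) collide
print(is_injective(states, three_landmarks))  # every state gets a distinct vector
```

With a single landmark, all states on the same anti-diagonal share one hitting time, which is the aliasing described above; in this toy grid a few spread-out landmarks are already enough to separate every state.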
4.3.1 Zombie Policy
Using the $V^{\mathbb{L}}$ state representation, we consider a greedy heuristic policy termed the zombie policy. As discussed previously, if $V^{\mathbb{L}}$ is an injective map on $\mathcal{S}$, then state aliasing is of no concern, and this state representation still satisfies the Markov property. Moreover, this state representation carries semantic content relevant to the goal-based RL setting studied here. Noting that
$V^{\mathbb{L}}(s) - V^{\mathbb{L}}(g) = (0, 0, \ldots, 0) = \mathbf{0} \iff V^{\mathbb{L}}(s) = V^{\mathbb{L}}(g) \iff s = g,$
we reason that a potentially simple and useful heuristic is to take actions that move the agent greedily towards $V^{\mathbb{L}}(g)$ in this representation space. We define the zombie policy, $\pi^{\mathbb{L}}_{\mathcal{Z}}$, as
$\pi^{\mathbb{L}}_{\mathcal{Z}}(s) := \underset{a \in \mathcal{A}}{\mathrm{argmin}} \; \| Q^{\mathbb{L}}(s,a) - V^{\mathbb{L}}(g) \|_{1}.$
This policy selects the action that is expected to bring the agent towards a state whose expected first hitting times to each of the landmarks are similar to those of the current goal. Proper theoretical analysis is still required; however, our initial intuition is that, given a set of landmarks sufficiently spread out over $\mathcal{S}$, an environment with low connectivity (e.g., each state has a small number of possible successor states), and injectivity of this representation, the zombie policy should confer strong benefits in terms of sample complexity and regret bounds.
We highlight that following the zombie policy involves no learning updates and is completely deterministic; it is therefore a purely zero-shot transfer mechanism, as no further learning occurs on the target task.
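As a minimal sketch of the zombie policy defined above, consider a hypothetical deterministic corridor with a landmark at each end. The corridor, the landmark placement, the positive-cost convention (values are step counts rather than negative returns), and all function names are assumptions made for illustration; the sketch does not establish that the greedy rule reaches the goal in general.

```python
N_STATES = 5            # hypothetical corridor with states 0..4
LANDMARKS = (0, 4)      # hypothetical landmarks at both ends
ACTIONS = ("left", "right")

def step(state, action):
    """Deterministic corridor dynamics; walls keep the agent in place."""
    delta = -1 if action == "left" else 1
    return min(max(state + delta, 0), N_STATES - 1)

def v_landmarks(state):
    """V^L(state): steps to each landmark (positive-cost convention)."""
    return tuple(abs(state - l) for l in LANDMARKS)

def q_landmarks(state, action):
    """Q^L(state, action): one step plus the hitting times from the successor."""
    return tuple(1 + v for v in v_landmarks(step(state, action)))

def zombie_action(state, goal):
    """argmin_a || Q^L(state, a) - V^L(goal) ||_1; ties go to the first action."""
    target = v_landmarks(goal)

    def l1_distance(action):
        return sum(abs(q - t) for q, t in zip(q_landmarks(state, action), target))

    return min(ACTIONS, key=l1_distance)

def rollout(start, goal, max_steps=20):
    """Follow the zombie policy with no learning updates."""
    state, path = start, [start]
    while state != goal and len(path) <= max_steps:
        state = step(state, zombie_action(state, goal))
        path.append(state)
    return path

print(rollout(start=0, goal=3))
```

No value is updated during the rollout: the agent only reads the fixed landmark tables, which is what makes the mechanism zero-shot.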
4.4 Transfer Through Reward Function
The final mechanism for transfer in goal-based multitask RL considered here involves the agent using a highly dense reward function to solve each subsequent task. This reward function is built from $V^{\mathbb{L}}$ and follows a similar line of reasoning to that used for the zombie policy $\pi^{\mathbb{L}}_{\mathcal{Z}}$. Often in RL environments such as robotics tasks, the reward function is defined as the (scaled) negative distance between the state and the target goal state. Though this reward function is highly dense, especially compared to the action-penalty reward function, in real-world settings oracle knowledge of distances is not provided. Moreover, metrics on raw state spaces tend not to carry meaningful semantics; the distance between two states in pixel space, for example, does not carry any semantics. However, under the framework presented here, the agent has learned a useful state representation, which can be used to define the distance $d^{\mathbb{L}}(s,g) := \| V^{\mathbb{L}}(s) - V^{\mathbb{L}}(g) \|_{1}$. This distance measures the total difference, summed over all $\ell \in \mathbb{L}$, between the expected number of steps from $s$ to $\ell$ and from $g$ to $\ell$. Hence, we use the transfer reward function $r^{\mathbb{L}}(s,a,s') = -\beta \, d^{\mathbb{L}}(s',g)$ in our experiments, for some $\beta > 0$.
We note that, unlike $\pi^{\mathbb{L}}_{\mathcal{Z}}$, which assumes injectivity of $V^{\mathbb{L}}$ and is not capable of any further learning, $r^{\mathbb{L}}(s,a,s')$ does not require injectivity of $V^{\mathbb{L}}$; it requires only the much weaker condition that $V^{\mathbb{L}}(s) = V^{\mathbb{L}}(g) \iff s = g$, which gives $d^{\mathbb{L}}(s,g) = 0 \iff s = g$. Under this assumption, $r^{\mathbb{L}}(s,a,s')$ provides a dense reward at every step, and in optimizing the undiscounted sum of rewards the agent is unlikely to settle on a policy that receives small but negative rewards indefinitely; instead it will favor arriving at the goal state $g$, and doing so "quickly". For this reason, we believe that $r^{\mathbb{L}}(s,a,s')$ should work quite well relative to baseline methods, especially with the $V^{\mathbb{L}}$ state representation, and should asymptotically outperform $\pi^{\mathbb{L}}_{\mathcal{Z}}$. For our experiments we used $\beta = 0.1$.
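The transfer reward can be sketched in a few lines. The example below assumes a hypothetical obstacle-free deterministic grid in which hitting times are Manhattan distances, with made-up landmark cells; only $\beta = 0.1$ and the form of the reward (negative scaled $\ell_1$ distance in landmark space, evaluated at the successor state) follow the text.

```python
BETA = 0.1                             # value reported for the experiments
LANDMARKS = [(0, 0), (0, 3), (3, 0)]   # hypothetical landmark cells

def v_landmarks(state):
    """V^L(state) on an obstacle-free deterministic grid (assumed stand-in)."""
    return tuple(abs(state[0] - r) + abs(state[1] - c) for r, c in LANDMARKS)

def d_landmarks(state, goal):
    """d^L(state, goal) = || V^L(state) - V^L(goal) ||_1."""
    return sum(abs(vs - vg) for vs, vg in zip(v_landmarks(state), v_landmarks(goal)))

def transfer_reward(state, action, next_state, goal, beta=BETA):
    """r^L(s, a, s') = -beta * d^L(s', g); depends only on the successor state."""
    return -beta * d_landmarks(next_state, goal)

goal = (3, 3)
print(transfer_reward((1, 1), "right", (1, 2), goal))  # negative away from the goal
print(transfer_reward((3, 2), "right", (3, 3), goal))  # zero on arrival at the goal
```

The reward is non-positive everywhere and vanishes only where the landmark vectors of $s'$ and $g$ coincide, which is why the weaker condition $V^{\mathbb{L}}(s) = V^{\mathbb{L}}(g) \iff s = g$ suffices.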
5 Results
5.1 Safety through ActionPruning
To demonstrate how the theoretical results above translate into notions of safety, we use the feasible action set as defined above in the cliffwalker domain. Cliffwalker is a stochastic gridworld domain that simulates walking along a windy cliff. As seen in Figure 1, we use a $4 \times 12$ gridworld, where the cliff runs along the bottom row between the first and last columns. The agent is equipped with four actions, each associated with movement in a cardinal direction. The agent begins each episode at the bottom-left state. We simulate a windy cliffwalker environment where, for all non-cliff states, the agent moves according to the selected action with probability 0.8, but with probability 0.2 moves down, if possible. Moreover, if the agent selects the down action, it moves down one or two spaces with equal probability (when possible). In this setup, the cliff acts as a sticky state: with probability 0.95 the agent remains at that state regardless of the action, and with probability 0.05 the action taken is successful. We use the $d_{\infty}$ metric to cover $\mathcal{S}$ with $\eta = 8$. For this environment, this results in $\mathbb{L}$ comprising three non-cliff states and all the cliff states (see Figure 1). We consider experiments where the goal $g$ is the bottom-right state. The optimal policy involves first moving upward, then moving as far right as possible. Moving downwards in any of the cliff columns is considered dangerous, especially in the bottom three rows. We run experiments with a baseline Q-learning agent (the no-transfer baseline), a Q-learning agent with action-pruning, a Q-learning LOVR agent, and a Q-learning LOVR agent with action-pruning. Exploration is performed using $\epsilon$-greedy exploration with $\epsilon = 0.1$.
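The windy cliffwalker dynamics described above can be written as an explicit transition distribution. The sketch below is one possible reading and makes several assumptions that the description leaves open: when the wind cannot push the agent down, the intended action is applied instead; selecting the down action replaces the wind rule with the one-or-two-cell rule; moves are clamped at the grid boundary; and the goal is not treated as terminal. All names are hypothetical.

```python
ROWS, COLS = 4, 12
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
START, GOAL = (ROWS - 1, 0), (ROWS - 1, COLS - 1)

def is_cliff(state):
    """Cliff cells: bottom row, strictly between the first and last columns."""
    row, col = state
    return row == ROWS - 1 and 0 < col < COLS - 1

def move(state, action, steps=1):
    """Apply `action` for `steps` cells, clamping at the grid boundary."""
    d_row, d_col = MOVES[action]
    row = min(max(state[0] + steps * d_row, 0), ROWS - 1)
    col = min(max(state[1] + steps * d_col, 0), COLS - 1)
    return (row, col)

def transition_distribution(state, action):
    """Return {next_state: probability} under the assumed dynamics."""
    dist = {}

    def add(next_state, prob):
        dist[next_state] = dist.get(next_state, 0.0) + prob

    if is_cliff(state):
        add(state, 0.95)                  # sticky: the agent usually stays put
        add(move(state, action), 0.05)    # the action occasionally succeeds
    elif action == "down":
        add(move(state, "down", 1), 0.5)  # one cell down
        add(move(state, "down", 2), 0.5)  # two cells down, clamped at the bottom
    else:
        intended = move(state, action)
        blown = move(state, "down")
        add(intended, 0.8)
        # Assumption: if the wind cannot move the agent down, the action applies.
        add(blown if blown != state else intended, 0.2)
    return dist

print(transition_distribution((0, 5), "right"))
print(transition_distribution((3, 5), "up"))
```

Under this reading, a downward move from the second row of a cliff column reaches the cliff half of the time, which is consistent with such moves being described as dangerous.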
Each implementation is run for $n = 100$ experiments, each experiment lasting 1000 episodes, and each episode lasting a maximum of 1000 time steps (or until the agent reaches the goal state). For each experiment we count the number of time steps the agent spends on the cliff and use this as a measure of how safe the agent is during exploration.
The baseline Q-learning agent is on the cliff for an average of 22754.31 time steps per experiment. For all other agents we report the relative percent reduction in time steps spent on the cliff compared to the baseline. Table 1 shows that the action-pruning exploration approach, which only selects actions from the feasible set of actions, achieves a $45.2\%$ reduction in time steps spent on the cliff compared to the baseline. The LOVR agent, which explores within an $\eta$-ball around the landmark nearest to the goal state, achieves a $94.03\%$ reduction in time steps spent on the cliff, while the agent that combines the LOVR implementation with action-pruning achieves a $98.95\%$ reduction. To the best of our knowledge, these results are the first to report a transfer mechanism that confers notions of safety solely by using the value functions of previously solved tasks. Moreover, this safety mechanism makes no particular assumptions about the environment itself, only that a landmark covering has previously been learned using the action-penalty reward structure, which can be implemented independently of any endogenous reward function provided by an environment.
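Table 1: Relative percent reduction in time steps spent on the cliff compared to the baseline Q-learning agent.

| Action-Pruning | LOVR | Action-Pruning $+$ LOVR |
|---|---|---|
| $45.19$ | $94.03$ | $98.95$ |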
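The action-pruning exploration used in these experiments restricts $\epsilon$-greedy action selection to the feasible action set. A minimal sketch follows; the feasible set is taken as a given input rather than computed from the landmark value bounds, and the function name and the fallback to the full action set when pruning removes every action are assumptions made for illustration.

```python
import random

def pruned_epsilon_greedy(q_values, feasible_actions, epsilon=0.1, rng=random):
    """Epsilon-greedy selection restricted to a feasible action set.

    q_values: {action: Q(s, a)} for the current state.
    feasible_actions: actions not pruned at the current state (given as input).
    """
    candidates = [a for a in q_values if a in feasible_actions]
    if not candidates:
        # Assumption: fall back to all actions if pruning removed everything.
        candidates = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(candidates)                    # explore among feasible actions
    return max(candidates, key=lambda a: q_values[a])    # exploit among feasible actions

# Hypothetical Q-values where the greedy action ("down") has been pruned.
q = {"up": -12.0, "down": -3.0, "left": -15.0, "right": -9.0}
feasible = {"up", "right"}
print(pruned_epsilon_greedy(q, feasible, epsilon=0.0))
```

Both the exploratory and the greedy branch draw only from the feasible set, so a pruned action is never taken even when its current Q-value estimate is the largest.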
5.2 MNISTworld experiments
The next set of experiments evaluates the other proposed transfer mechanisms in the multitask RL setting. For these experiments we use an MNISTworld domain, where at each time step the agent receives three concatenated MNIST digit images, two representing the index of the x-coordinate and the third representing the index of the y-coordinate, of a $100 \times 10$ grid world. At each time step the appropriate MNIST images are sampled from the training set of digits 0-9 in order to build the representation of the current state in MNISTworld. The agent can move in each cardinal direction; each action moves it in the intended direction with probability $p = 0.85$, and in each of the other directions with equal probability. There is also a sticky absorbing state at $(55, 5)$, where an action succeeds with probability $0.05$ and the agent remains stuck with probability $0.95$. For all experiments a landmark covering is constructed using the $d_{rt}$ metric, with $\eta = 10, 20$. The landmark covering was built by initializing $\mathbb{L} = \emptyset$, then stochastically adding a single state to $\mathbb{L}$ until the state space was covered. Ultimately 27 landmark states made up the covering. The landmark-centric value functions were represented using Deep Q-Networks (DQN), one for each $\ell \in \mathbb{L}$, and the landmark covering was then used to perform the appropriate transfer mechanism. Transfer mechanism experiments involve a sequence of 20 novel goal tasks, each run for 1000 episodes, with each episode lasting a maximum of 500 time steps. Each experiment was repeated 5 times and we report the mean over the five experiments, with figures plotting $\pm 1$ standard deviation.
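The covering procedure described above (start from an empty landmark set and add states at random until every state is covered) can be sketched as follows. The sketch assumes that new landmarks are drawn from the currently uncovered states and that a state counts as covered when it lies within distance $\eta$ of some landmark; it also substitutes the Manhattan distance for $d_{rt}$ purely as a placeholder, so it is not expected to reproduce the 27 landmarks reported. Function and variable names are hypothetical.

```python
import random

def build_landmark_covering(states, distance, eta, seed=0):
    """Randomly add landmarks until every state is within `eta` of one."""
    rng = random.Random(seed)
    landmarks = []
    uncovered = list(states)
    while uncovered:
        new_landmark = rng.choice(uncovered)   # assumed: sample an uncovered state
        landmarks.append(new_landmark)
        uncovered = [s for s in uncovered if distance(s, new_landmark) > eta]
    return landmarks

def manhattan(a, b):
    """Placeholder metric; the experiments use the d_rt metric instead."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Grid coordinates of a 100 x 10 world, matching the MNISTworld layout.
states = [(x, y) for x in range(100) for y in range(10)]
landmarks = build_landmark_covering(states, manhattan, eta=10)
print(len(landmarks))
```

Because each new landmark is drawn from the uncovered states, landmarks end up pairwise more than $\eta$ apart, which keeps the covering from becoming redundant.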
For the DQN architectures: for computational efficiency, an autoencoder was first trained to reconstruct the MNIST dataset, with a bottleneck layer of dimension 20. These 20-dimensional representations of the MNIST images were used as inputs to the DQN; hence each state comprises 3 concatenated 20-dimensional vectors. The DQN itself was a simple multilayer perceptron with two hidden layers of 200 units each and a linear output layer of size $|\mathcal{A}| = 4$. An experience replay buffer of size 25000 was used, and the network was trained on minibatches of size 512 using the Adam optimizer with an initial learning rate of $10^{-4}$.
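To make the layer shapes concrete, the following dependency-free sketch builds a multilayer perceptron with the stated sizes (60 inputs from three 20-dimensional codes, two hidden layers of 200 units, 4 linear outputs) and runs a forward pass. The ReLU hidden activation and the Gaussian initialization scale are assumptions, since neither is specified above; the autoencoder, replay buffer, and optimizer are omitted.

```python
import random

# 3 x 20-dim autoencoder codes -> 200 -> 200 -> |A| = 4 Q-values
LAYER_SIZES = [60, 200, 200, 4]

def init_params(sizes, seed=0):
    """Random weights and zero biases for each layer (initialization is assumed)."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / n_in) ** 0.5
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params

def forward(params, x):
    """Forward pass: ReLU on hidden layers (assumed), linear output layer."""
    for i, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(params) - 1:
            x = [max(0.0, v) for v in x]
    return x

def count_params(params):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)

params = init_params(LAYER_SIZES)
q_values = forward(params, [0.1] * 60)   # one Q-value per action
print(len(q_values), count_params(params))
```

At these sizes each landmark network is small, in line with the point that the compressed input permits compact function approximators.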
Figure 2 shows the mean episode regret of each agent implementation, where regret is computed with respect to the optimal (oracle) value functions for each goal, given the randomly sampled initial state. As seen in Figure 2, quite similar trends are observed for both $\eta = 10$ and $\eta = 20$. First, we note that the baseline DQN agent is slower to converge than the other agent implementations, but asymptotically outperforms all other implementations except the bandit agent. The baseline DQN agent also exhibits the highest variance of all implementations. The LOVR agent (green) performs quite well, converging to a good, but not optimal, policy within roughly 100 episodes. The lack of optimality is due to the inductive bias encoded in the LOVR framework, which forces the agent to drive towards the nearest landmark first, before taking actions towards the current goal state. This asymptotic inefficiency is discussed in [12]. Empirically, we find that the LOVR agent solves for a roughly $2\eta$-optimal policy quite quickly, which is in line with the theoretical results previously presented [12].
The zombie policy $\pi^{\mathbb{L}}_{\mathcal{Z}}$ performs exceptionally well. It is worth repeating that no learning updates whatsoever occur during these experiments, and hence this is a purely zero-shot transfer approach. The zombie policy agent uses the $V^{\mathbb{L}}$ state representation and deterministically takes actions as described above. The zombie agent achieves near-oracle levels of regret in this zero-shot transfer setting, with an average per-episode regret of 15.4 time steps relative to an oracle policy. It takes the baseline DQN agent roughly 700 episodes of learning to finally overtake the $V^{\mathbb{L}}$ zombie agent in terms of mean per-episode regret. An interesting second set of experiments would have been to continue running both agents to see how many episodes it would take for the baseline DQN agent to overtake the zombie agent in terms of cumulative lifelong regret, and not just per-episode average regret. However, assuming that after episode 1000 the baseline DQN agent achieved zero regret per episode and the zombie agent continued with an average per-episode regret of 15.4, it would take over 5385 episodes for the baseline DQN agent to overtake the zombie agent with respect to cumulative lifelong regret under these experimental conditions.
A closer look at the average per-episode regret of the $V^{\mathbb{L}}$ zombie agent is given in Figure 3. We found that of the 20 tasks, 10 achieve better mean per-episode regret than a fully converged baseline DQN agent with 1000 episodes of learning, despite the $V^{\mathbb{L}}$ zombie agent being a purely zero-shot transfer implementation, with a mean episode regret as low as 1.5 time steps for some tasks. For 3 of the 25 tasks, the $V^{\mathbb{L}}$ zombie agent incurred high per-episode regret, in the 20-28 time step range. Further analysis attributed this to the fact that, for these three tasks, there were some initial states from which the agent could not reach the goal, and hence it always incurred the maximum level of regret for those episodes. Despite being unable to learn from these mistakes, the $V^{\mathbb{L}}$ zombie agent still performs extremely strongly, on average, across all 25 tasks. These results demonstrate that the $V^{\mathbb{L}}$ zombie agent may be an attractive finite-time zero-shot transfer agent when the number of episodes for which the agent is assigned a task is small, which limits the ability to learn an optimal policy. Finally, we see that the $r^{\mathbb{L}}$ reward transfer agent learned slightly faster than the baseline DQN agent, but converged to a policy of poorer quality than the other agents.
The transfer mechanisms tested, including the no-transfer baseline DQN agent, each have their own strengths and perform well under different conditions and at different points throughout learning. For example, the LOVR agent begins performing quite well quickly, but is outperformed by the baseline DQN agent asymptotically. With this in mind, we implemented a bandit agent, which uses a simple multi-armed bandit controller to select which agent implementation to use at the start of each episode. The experience of each selected agent is used to train all the agents considered, off-policy. In this way, we expect the bandit controller agent to reap the benefits of each transfer mechanism and learn when to use each. To test this hypothesis we redo the previous experiments with a simple bandit controller as just described. The bandit controller follows the UCB algorithm with hyperparameter $C = 200$, where the reward of an arm pull is the negative number of steps taken by the arm. Here, each transfer mechanism, including the baseline DQN agent, is an arm. The results can be seen in Figure 2 (black). For both $\eta = 10$ and $\eta = 20$, the bandit agent solves for the optimal policy of each task quite fast, quickly overtaking the $V^{\mathbb{L}}$ zombie agent and outperforming the baseline DQN asymptotic policy in as few as 300 episodes (versus 1000 for the baseline). To investigate when the bandit controller selects each of the arms, we plot the relative proportion with which each arm is pulled during the course of learning (Figure 4) for experiments with $\eta = 10$. The plot is noisy for the first four episodes, since the bandit algorithm deterministically selects each of the arms in sequence until they have all been selected once, before beginning the UCB algorithm.
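A minimal sketch of such a bandit controller is given below. It assumes the common UCB1-style exploration bonus $C \sqrt{\ln t / n_a}$, since the exact form of the UCB rule is not stated; the class name, the arm labels, and the step counts in the usage example are hypothetical. As described above, every arm is pulled once in sequence before the UCB rule takes over, and the reward of a pull is the negative number of steps taken.

```python
import math

class UCBController:
    """Selects which agent implementation (arm) to run for the next episode."""

    def __init__(self, arms, c=200.0):
        self.arms = list(arms)
        self.c = c
        self.counts = {arm: 0 for arm in self.arms}
        self.means = {arm: 0.0 for arm in self.arms}
        self.total_pulls = 0

    def select(self):
        # Initial phase: pull each arm once, in order.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm

        def score(arm):
            bonus = self.c * math.sqrt(math.log(self.total_pulls) / self.counts[arm])
            return self.means[arm] + bonus

        return max(self.arms, key=score)

    def update(self, arm, steps_taken):
        # Reward of an arm pull is the negative number of steps in the episode.
        reward = -float(steps_taken)
        self.counts[arm] += 1
        self.total_pulls += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

controller = UCBController(["dqn", "lovr", "zombie", "reward"])
first_four = []
for steps in (400, 120, 60, 300):        # hypothetical episode lengths
    arm = controller.select()
    first_four.append(arm)
    controller.update(arm, steps)
print(first_four)
```

A large $C$ relative to typical episode lengths keeps all arms in play for longer, which matters here because the quality of the learning arms changes over time.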
Setting aside this initial arm selection phase, it can be seen that, on average across the 20 tasks, the bandit controller tends to select the LOVR agent (red) and the $V^{\mathbb{L}}$ zombie agent (green) early in learning, during approximately the first 150 episodes. The $r^{\mathbb{L}}$ agent (blue) is consistently selected the least, even less frequently than the baseline DQN agent (black). These results are consistent with the regret curves in Figure 2. Beginning at around 150 episodes, the baseline DQN agent is selected more and more frequently by the bandit controller, along with a gradual decrease in the selection of the $V^{\mathbb{L}}$ zombie agent, whereas the LOVR agent's selection frequency remains steady until roughly 250 episodes, after which it gradually decreases. During episodes 600-800, the bandit controller selects the baseline DQN agent and the $V^{\mathbb{L}}$ zombie agent essentially equally often, which corresponds to when the baseline DQN agent tends to overtake all other transfer agents' mean per-episode regret, as seen in Figure 2. Finally, the last (approximately) 200 episodes see a slight increase in selection of the baseline DQN agent over the $V^{\mathbb{L}}$ zombie agent by the bandit controller. The bandit controller's arm pull proportions are consistent with the previous experiments and with our hypothesis. The bandit controller leveraged the "fast to perform well" transfer agent implementations by favoring them early in learning, but gradually increased the selection of the asymptotically optimal baseline DQN agent. These results support the notion that each landmark covering transfer mechanism presented confers benefits at different stages of the learning process, and that each, with its own inductive biases, favors learning at different stages.
6 Discussion
We presented two novel metrics on the state space of a Markov Decision Process, and studied various benefits, both theoretical and empirical, that a topological covering under such metrics can confer to an agent in the multitask goal-based RL setting. Such a covering was introduced as a set of landmarks, $\mathbb{L}$, and was an immediate extension to the Landmark Options Via Reflection (LOVR) framework. Developing such a covering can be quite general in nature: it can be learned in a self-supervised manner, in an online fashion (adding landmark states to $\mathbb{L}$ until the state space is covered), or selected by a researcher. We showed how the landmark value functions from such a topological covering can be used to confer benefits to transfer learning under the multitask goal-based RL setting. First, we explored the notion of safety and showed that the landmark value functions can be used to prune actions that may be dangerous, or infeasible with respect to the optimal policy for the current task. Empirical results on the cliffwalker domain supported the potential safety benefits landmark value functions can confer to a multitask goal-based RL agent. Second, an extension to the LOVR framework that uses landmark value functions as landmark options was introduced, which had strong empirical performance. Third, motivated by the phenomenology of Heidegger, we introduced a novel state representation which uses the landmark value functions themselves as a state representation, and coupled this representation to a greedy heuristic policy called the zombie agent. This agent was a purely zero-shot transfer mechanism and performed incredibly well on 25 novel tasks unseen by the agent. Finally, a transfer mechanism using a learned reward function based on the $V^{\mathbb{L}}$ state representation was introduced.
6.1 Preprint, work in progress
Landmark coverings and landmark value functions carry strong empirical and theoretical [12] benefits for transfer learning in the multitask goal-based RL setting. This document is acting as a placeholder, and represents work in progress. The theoretical aspects of the $V^{\mathbb{L}}$ representation and the zombie agent require much further exploration. Initial experiments on randomly generated MDPs that are highly connected show that the zombie agent performs quite poorly; it is not meant to be touted as a general mechanism that should or will work for any given environment. However, our initial work and conjectures suggest that for state spaces with low connectivity, such as grid world environments or video games, the zombie agent should confer benefits. We also leave a more thorough discussion of relevant studies and recent research for a later draft of this paper, as well as a comparison to other baselines such as Successor Representations (SRs), which we believe also have similarities to the $V^{\mathbb{L}}$ representation.
7 References
[1] D. Silver, Q. Yang, and L. Li. Lifelong machine learning systems: beyond learning algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, pages 49–55, 2013.
[2] J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
[3] A.G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13:341–379, 2003.
[4] E. Brunskill and L. Li. Sample complexity of multi-task reinforcement learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
[5] K. Frans, J. Ho, P. Abbeel, and J. Schulman. Meta learning shared hierarchies. Technical report, 2017. https://arxiv.org/pdf/1710.09767[cs.LG].
[6] R. Laroche, M. Fatemi, J. Romoff, and H. van Seijen. Multi-advisor reinforcement learning. Technical report, 2017. https://arxiv.org/pdf/1704.00756[cs.LG].
[7] H. van Seijen, M. Fatemi, J. Romoff, and R. Laroche. Separation of concerns in reinforcement learning. Technical report, 2017. https://arxiv.org/pdf/1612.05159[cs.LG].
[8] Y. Bengio, A. Courville, and P. Vincent. Representation learning: a review and new perspectives. Technical report, 2014. https://arxiv.org/pdf/1206.5538[cs.LG].
[9] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, 2005.
[10] M. Fraser. Multi-step learning and underlying structure in statistical models. In NIPS, pages 4815–4823, 2016.
[11] J. Godwin, P. Stenetorp, and S. Riedel. Deep semi-supervised learning with linguistically motivated sequence labelling task hierarchies. Technical report, 2016. https://arxiv.org/pdf/1612.09113[cs.CL].
[12] N. Denis and M. Fraser. Options in multi-task reinforcement learning. In 32nd Canadian Conference on Artificial Intelligence, pages 225–237, 2019.
[13] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2016.
[14] S. Koenig and R.G. Simmons. Complexity analysis of real-time reinforcement learning. AAAI, pages 99–105, 1993.
[15] R.S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
[16] T.A. Mann, S. Mannor, and D. Precup. Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research, 53:375–438, 2015.