Abstract
Meta reinforcement learning sets a distribution over a set of tasks on whichthe agent can train at will, then is asked to learn an optimal policy for anytest task efficiently. In this paper, we consider a finite set of tasks modeledthrough Markov decision processes with various dynamics. We assume to haveendured a long training phase, from which the set of tasks is perfectlyrecovered, and we focus on regret minimization against the optimal policy inthe unknown test task. Under a separation condition that states the existenceof a state-action pair revealing a task against another, Chen et al. (2022)show that $O(M^2 \log(H))$ regret can be achieved, where $M, H$ are the numberof tasks in the set and test episodes, respectively. In our first contribution,we demonstrate that the latter rate is nearly optimal by developing a novellower bound for test-time regret minimization under separation, showing that alinear dependence with $M$ is unavoidable. Then, we present a family ofstronger yet reasonable assumptions beyond separation, which we call strongidentifiability, enabling algorithms achieving fast rates $\log (H)$ andsublinear dependence with $M$ simultaneously. Our paper provides a newunderstanding of the statistical barriers of test-time regret minimization andwhen fast rates can be achieved.