Abstract
Learning new task-specific skills from a few trials is a fundamental challenge for artificial intelligence. Meta reinforcement learning (meta-RL) tackles this problem by learning transferable policies that support few-shot adaptation to unseen tasks. Despite recent advances in meta-RL, most existing methods require access to the environmental reward function of new tasks to infer the task objective, which is not realistic in many practical applications. To bridge this gap, we study the problem of few-shot adaptation in the context of human-in-the-loop reinforcement learning. We develop a meta-RL algorithm that enables fast policy adaptation with preference-based feedback. The agent can adapt to new tasks by querying a human's preference between behavior trajectories instead of using per-step numeric rewards. By extending techniques from information theory, our approach can design query sequences to maximize the information gain from human interactions while tolerating the inherent error of a non-expert human oracle. In experiments, we extensively evaluate our method, Adaptation with Noisy OracLE (ANOLE), on a variety of meta-RL benchmark tasks and demonstrate substantial improvement over baseline algorithms in terms of both feedback efficiency and error tolerance.