Provably Learning from Language Feedback

Abstract

Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a complexity measure to characterize the hardness of LLF problems. We show that transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.
