Sample-Efficiency in Multi-Batch Reinforcement Learning: The Need for Dimension-Dependent Adaptivity

Abstract

We theoretically explore the relationship between sample-efficiency andadaptivity in reinforcement learning. An algorithm is sample-efficient if ituses a number of queries $n$ to the environment that is polynomial in thedimension $d$ of the problem. Adaptivity refers to the frequency at whichqueries are sent and feedback is processed to update the querying strategy. Toinvestigate this interplay, we employ a learning framework that allows sendingqueries in $K$ batches, with feedback being processed and queries updated aftereach batch. This model encompasses the whole adaptivity spectrum, ranging fromnon-adaptive 'offline' ($K=1$) to fully adaptive ($K=n$) scenarios, and regimesin between. For the problems of policy evaluation and best-policyidentification under $d$-dimensional linear function approximation, weestablish $\Omega(\log \log d)$ lower bounds on the number of batches $K$required for sample-efficient algorithms with $n = O(poly(d))$ queries. Ourresults show that just having adaptivity ($K>1$) does not necessarily guaranteesample-efficiency. Notably, the adaptivity-boundary for sample-efficiency isnot between offline reinforcement learning ($K=1$), where sample-efficiency wasknown to not be possible, and adaptive settings. Instead, the boundary liesbetween different regimes of adaptivity and depends on the problem dimension.

Quick Read (beta)

loading the full paper ...