Abstract
We theoretically explore the relationship between sample-efficiency andadaptivity in reinforcement learning. An algorithm is sample-efficient if ituses a number of queries $n$ to the environment that is polynomial in thedimension $d$ of the problem. Adaptivity refers to the frequency at whichqueries are sent and feedback is processed to update the querying strategy. Toinvestigate this interplay, we employ a learning framework that allows sendingqueries in $K$ batches, with feedback being processed and queries updated aftereach batch. This model encompasses the whole adaptivity spectrum, ranging fromnon-adaptive 'offline' ($K=1$) to fully adaptive ($K=n$) scenarios, and regimesin between. For the problems of policy evaluation and best-policyidentification under $d$-dimensional linear function approximation, weestablish $\Omega(\log \log d)$ lower bounds on the number of batches $K$required for sample-efficient algorithms with $n = O(poly(d))$ queries. Ourresults show that just having adaptivity ($K>1$) does not necessarily guaranteesample-efficiency. Notably, the adaptivity-boundary for sample-efficiency isnot between offline reinforcement learning ($K=1$), where sample-efficiency wasknown to not be possible, and adaptive settings. Instead, the boundary liesbetween different regimes of adaptivity and depends on the problem dimension.