Uniform Last-Iterate Guarantee for Bandits and Reinforcement Learning

Abstract

Existing metrics for reinforcement learning (RL) such as regret, PAC bounds,or uniform-PAC (Dann et al., 2017), typically evaluate the cumulativeperformance, while allowing the agent to play an arbitrarily bad policy at anyfinite time t. Such a behavior can be highly detrimental in high-stakesapplications. This paper introduces a stronger metric, uniform last-iterate(ULI) guarantee, capturing both cumulative and instantaneous performance of RLalgorithms. Specifically, ULI characterizes the instantaneous performance byensuring that the per-round suboptimality of the played policy is bounded by afunction, monotonically decreasing w.r.t. round t, preventing revisiting badpolicies when sufficient samples are available. We demonstrate that anear-optimal ULI guarantee directly implies near-optimal cumulative performanceacross aforementioned metrics, but not the other way around. To examine theachievability of ULI, we first provide two positive results for bandit problemswith finite arms, showing that elimination-based algorithms andhigh-probability adversarial algorithms with stronger analysis or additionaldesigns, can attain near-optimal ULI guarantees. We also provide a negativeresult, indicating that optimistic algorithms cannot achieve near-optimal ULIguarantee. Furthermore, we propose an efficient algorithm for linear banditswith infinitely many arms, which achieves the ULI guarantee, given access to anoptimization oracle. Finally, we propose an algorithm that achievesnear-optimal ULI guarantee for the online reinforcement learning setting.

Quick Read (beta)

loading the full paper ...