Abstract
The Exploration-Exploitation tradeoff arises in Reinforcement Learning when one cannot tell whether a policy is optimal. Then, there is a constant need to explore new actions instead of exploiting past experience. In practice, it is common to resolve the tradeoff by using a fixed exploration mechanism, such as $\epsilon$-greedy exploration or adding Gaussian noise, while still trying to learn an optimal policy. In this work, we take a different approach and study exploration-conscious criteria that result in optimal policies with respect to the exploration mechanism. Solving these criteria, as we establish, amounts to solving a surrogate Markov Decision Process. We then analyze properties of exploration-conscious optimal policies and characterize two general approaches for solving such criteria. Building on these approaches, we apply simple changes to existing tabular and deep Reinforcement Learning algorithms and empirically demonstrate superior performance relative to their non-exploration-conscious counterparts, for both discrete and continuous action spaces.