### Abstract

Learning a near optimal policy in a partially observable system remains anelusive challenge in contemporary reinforcement learning. In this work, weconsider episodic reinforcement learning in a reward-mixing Markov decisionprocess (MDP). There, a reward function is drawn from one of multiple possiblereward models at the beginning of every episode, but the identity of the chosenreward model is not revealed to the agent. Hence, the latent state space, forwhich the dynamics are Markovian, is not given to the agent. We study theproblem of learning a near optimal policy for two reward-mixing MDPs. Unlikeexisting approaches that rely on strong assumptions on the dynamics, we make noassumptions and study the problem in full generality. Indeed, with no furtherassumptions, even for two switching reward-models, the problem requires severalnew ideas beyond existing algorithmic and analysis techniques for efficientexploration. We provide the first polynomial-time algorithm that finds an$\epsilon$-optimal policy after exploring $\tilde{O}(poly(H,\epsilon^{-1})\cdot S^2 A^2)$ episodes, where $H$ is time-horizon and $S, A$ are the numberof states and actions respectively. This is the first efficient algorithm thatdoes not require any assumptions in partially observed environments where theobservation space is smaller than the latent state space.