Humans are capable of attributing latent mental contents such as beliefs or intentions to others. This social skill is critical in daily life for reasoning about the potential consequences of others' behaviors so as to plan ahead. It is known that humans use such reasoning ability recursively by considering what others believe about their own beliefs. In this paper, we start from level-$1$ recursion and introduce a probabilistic recursive reasoning (PR2) framework for multi-agent reinforcement learning. Our hypothesis is that it is beneficial for each agent to account for how the opponents would react to its future behaviors. Under the PR2 framework, we adopt variational Bayes methods to approximate the opponents' conditional policies, to which each agent finds the best response and then improves its own policy. We develop decentralized-training-decentralized-execution algorithms, namely PR2-Q and PR2-Actor-Critic, that are proved to converge in self-play scenarios when there exists one Nash equilibrium. Our methods are tested on both the matrix game and the differential game, which have a non-trivial equilibrium where common gradient-based methods fail to converge. Our experiments show that it is critical to reason about what the opponents believe about what the agent believes. We expect our work to contribute a new idea of modeling the opponents to the multi-agent reinforcement learning community.
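
To make the idea of level-$1$ recursion concrete, the following is a minimal illustrative sketch and not the PR2-Q or PR2-Actor-Critic algorithm itself. It assumes a hypothetical two-action matrix game, and it replaces the variational approximation of the opponent's conditional policy with a simple softmax over the opponent's payoffs; the payoff tables, the temperature, and all function names are assumptions introduced here for illustration only.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def opponent_conditional_policy(opp_payoff, temperature=1.0):
    """Assumed opponent model: rho[a_i][a_j] = P(opponent plays a_j | agent plays a_i).

    A softmax over the opponent's payoffs stands in for the learned
    (variational) approximation mentioned in the text.
    """
    return [softmax(row, temperature) for row in opp_payoff]


def expected_payoffs(own_payoff, rho):
    """Expected payoff of each agent action under the conditional policy rho."""
    return [
        sum(p * r for p, r in zip(rho_row, pay_row))
        for rho_row, pay_row in zip(rho, own_payoff)
    ]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical payoffs, indexed as payoff[agent_action][opponent_action].
own_payoff = [[3.0, 0.0], [2.0, 2.0]]
opp_payoff = [[3.0, 0.0], [0.0, 1.0]]

# Level-0 baseline: the opponent is assumed to act uniformly at random,
# independently of the agent's action.
uniform = [[0.5, 0.5], [0.5, 0.5]]
level0_expected = expected_payoffs(own_payoff, uniform)
level0_action = argmax(level0_expected)

# Level-1 reasoning: the agent accounts for how the opponent would react
# to each of its own actions, then picks the best response.
rho = opponent_conditional_policy(opp_payoff, temperature=0.1)
expected = expected_payoffs(own_payoff, rho)
best_action = argmax(expected)

print("level-0 action:", level0_action)
print("level-1 action:", best_action)
```

In this toy game the level-0 agent settles for the safe action 1, whereas the level-1 agent selects action 0 because it anticipates that the opponent would respond to action 0 with the mutually preferred action.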