When observing the actions of others, humans carry out inferences about why the others acted as they did, and what this implies about their view of the world. Humans also use the fact that their actions will be interpreted in this manner when observed by others, allowing them to act informatively and thereby communicate efficiently with others. Although learning algorithms have recently achieved superhuman performance in a number of two-player, zero-sum games, scalable multi-agent reinforcement learning algorithms that can discover effective strategies and conventions in complex, partially observable settings have proven elusive. We present the Bayesian action decoder (BAD), a new multi-agent learning method that uses an approximate Bayesian update to obtain a public belief that conditions on the actions taken by all agents in the environment. Together with the public belief, this Bayesian update effectively defines a new Markov decision process, the public belief MDP, in which the action space consists of deterministic partial policies, parameterised by deep neural networks, that can be sampled for a given public state. It exploits the fact that an agent acting only on this public belief state can still learn to use its private information if the action space is augmented to be over partial policies mapping private information into environment actions. The Bayesian update is also closely related to the theory of mind reasoning that humans carry out when observing others' actions. We first validate BAD on a proof-of-principle two-step matrix game, where it outperforms traditional policy gradient methods. We then evaluate BAD on the challenging, cooperative partial-information card game Hanabi, where in the two-player setting the method surpasses all previously published learning and hand-coded approaches.
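To make the public belief update concrete, the following is a minimal sketch of an exact Bayesian update for a toy case with a handful of discrete private features. It is not the method's implementation: BAD uses an approximate update and partial policies parameterised by deep neural networks, whereas here the partial policy is a plain lookup table. The function name `update_public_belief` and the feature and action labels are hypothetical, chosen only for illustration.

```python
from typing import Dict, Hashable


def update_public_belief(
    belief: Dict[Hashable, float],
    partial_policy: Dict[Hashable, Hashable],
    observed_action: Hashable,
) -> Dict[Hashable, float]:
    """Return the posterior over private features after an action is observed.

    belief:          prior P(f) over the acting agent's private features f
                     (hypothetical toy representation, publicly known).
    partial_policy:  deterministic map from private feature to environment
                     action (publicly known).
    observed_action: the environment action that was actually taken.
    """
    # For a deterministic partial policy the likelihood of the observed
    # action is 1 if the policy maps f to that action, and 0 otherwise.
    unnormalised = {
        f: (p if partial_policy[f] == observed_action else 0.0)
        for f, p in belief.items()
    }
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observed action is impossible under the current belief")
    # Renormalise so the posterior sums to one.
    return {f: p / total for f, p in unnormalised.items()}


# Toy example: three possible private cards, two environment actions.
prior = {"card_1": 0.5, "card_2": 0.25, "card_3": 0.25}
policy = {"card_1": "A", "card_2": "B", "card_3": "B"}
posterior = update_public_belief(prior, policy, "B")
print(posterior)
```

In this toy case, observing action "B" rules out `card_1`, because the public partial policy would have played "A" for it, and the remaining probability mass is renormalised over `card_2` and `card_3`. This is the sense in which an action chosen through a publicly known partial policy communicates private information to observers.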