### Abstract

State-of-the-art deep learning algorithms mostly rely on gradientbackpropagation to train a deep artificial neural network, which is generallyregarded to be biologically implausible. For a network of stochastic unitstrained on a reinforcement learning task or a supervised learning task, onebiologically plausible way of learning is to train each unit by REINFORCE. Inthis case, only a global reward signal has to be broadcast to all units, andthe learning rule given is local, which can be interpreted as reward-modulatedspike-timing-dependent plasticity (R-STDP) that is observed biologically.Although this learning rule follows the gradient of return in expectation, itsuffers from high variance and cannot be used to train a deep network inpractice. In this paper, we propose an algorithm called MAP propagation thatcan reduce this variance significantly while retaining the local property oflearning rule. Different from prior works on local learning rules (e.g.Contrastive Divergence) which mostly applies to undirected models inunsupervised learning tasks, our proposed algorithm applies to directed modelsin reinforcement learning tasks. We show that the newly proposed algorithm cansolve common reinforcement learning tasks at a speed similar to that ofbackpropagation when applied to an actor-critic network.