Model-Free Mean-Field Reinforcement Learning: Mean-Field MDP and Mean-Field Q-Learning

  • 2019-10-28 16:56:46
  • René Carmona, Mathieu Laurière, Zongjun Tan
  • 0


We develop a general reinforcement learning framework for mean field control(MFC) problems. Such problems arise for instance as the limit of collaborativemulti-agent control problems when the number of agents is very large. Theasymptotic problem can be phrased as the optimal control of a non-lineardynamics. This can also be viewed as a Markov decision process (MDP) but thekey difference with the usual RL setup is that the dynamics and the reward nowdepend on the state's probability distribution itself. Alternatively, it can berecast as a MDP on the Wasserstein space of measures. In this work, weintroduce generic model-free algorithms based on the state-action valuefunction at the mean field level and we prove convergence for a prototypicalQ-learning method. We then implement an actor-critic method and reportnumerical results on two archetypal problems: a finite space model motivated bya cyber security application and a continuous space model motivated by anapplication to swarm motion.


Quick Read (beta)


We develop a general reinforcement learning framework for mean field control (MFC) problems. Such problems arise for instance as the limit of collaborative multi-agent control problems when the number of agents is very large. The asymptotic problem can be phrased as the optimal control of a non-linear dynamics. This can also be viewed as a Markov decision process (MDP) but the key difference with the usual RL setup is that the dynamics and the reward now depend on the state’s probability distribution itself. Alternatively, it can be recast as a MDP on the Wasserstein space of measures. In this work, we introduce generic model-free algorithms based on the state-action value function at the mean field level and we prove convergence for a prototypical Q-learning method. We then implement an actor-critic method and report numerical results on two archetypal problems: a finite space model motivated by a cyber security application and a continuous space model motivated by an application to swarm motion.


Model-Free Mean-Field Reinforcement Learning:
Mean-Field MDP and Mean-Field Q-Learning


René Carmona                        Mathieu Laurière                        Zongjun Tan

1 Introduction

Typical reinforcement learning (RL) applications involve the search for a procedure to learn by trial and error the optimal behavior so as to maximize a reward. While similar in spirit to optimal control applications, a key difference is that in the latter, the model is assumed to be known to the controller. This is in contrast with RL for which the environment has to be explored, and the reward cannot be predicted with certainty. Still, the RL paradigm has generated numerous theoretical developments and found plenty practical applications. As a matter of fact, bidirectional links with the optimal control literature have been unveiled as common tools lie at the heart of many studies. Mean field control (MFC), also called optimal control of McKean-Vlasov (MKV) dynamics, is an extension of stochastic control which has recently attracted a surge of interest (see e.g. [5, 11, 12]). From a theoretical standpoint, the main peculiarity of this type of problems is that the transition and reward functions not only involve the state and the action of the controller, but also the distribution of the state (and possibly of the control). Practically speaking, these problems appear as the asymptotic limits for the control of a large number of collaborative agents. They can also be introduced as single agent problems whose evolution and costs depend upon the distribution of her state (and potentially of her control). Such problems have found a wide range of applications in distributed robotics, energy, drone fleet management, risk management, finance, etc. Although they are bona fide control problems for which a dynamic programming principle can be formulated, they generally lead to Bellman equations on the infinite dimensional space of measures, which are extremely difficult to solve ([6, 32, 35]). Figure 1 contains a schematic diagram of the relationships between optimal control (i.e., planning with a model) and how the paradigms of MFC and RL are combined in mean-field reinforcement learning (MFRL).

Figure 1: Relationships between optimal control (OC), mean-field control (MFC), reinforcement learning (RL) and mean-field reinforcement learning (MFRL). Horizontal arrows: extension in the direction of model-free learning. Vertical arrows: generalization by letting the number of agents grow to infinity or by controlling a MKV dynamics.

Main contributions. Our first contribution is conceptual and consists in introducing a general framework of MFC problems in discrete time, with infinite horizon, discount and common noise. We argue that this setup, which has not been covered in the classical literature on MFC problems, is particularly relevant to develop a theory of reinforcement learning for mean field problems. We then rephrase the problem as a Markov decision process (MDP) on the space of measures. This point of view leads to the introduction of a state value function and a state-action value function as well as their associated Bellman equations on the Wasserstein space of measures. Our second contribution is to propose two model-free methods to learn an approximation of the state-action value function by trial and error. The first method relies on a discretization of the simplex and a tabular version of Q-learning, for which we prove a convergence result. The second method is based on an actor-critic method (Deep Deterministic Policy Gradient). Last, our third contribution is to implement the latter method and assess its convergence numerically. Numerical tests are conducted on two prototypical examples drawn from the mean-field literature: a finite state model motivated by a cyber security application and a continuous state and action model motivated by an application to swarm motion.

2 Mean Field Control

In this section, we keep the discussion at an informal level in order to encompass both finite and continuous state spaces. Specific methods and examples for each setting are presented in Sections 3 and 4.

Definition of the problem.

We denote by 𝒮 and 𝒜 respectively the state space and the action space. Typically, we have in mind the finite space case where both are finite sets, say 𝒮={1,,|𝒮|} and 𝒜={1,,|𝒜|}, or the continuous case, where 𝒮=d and 𝒜=k. A generic discrete time, infinite horizon, discounted mean field control (MFC) problem (or control of McKean-Vlasov dynamics) takes the following form: Maximize over the control process (or policy) π the reward functional

J(μ0,π)=𝔼[t=0+γtf(xtπ,μ0,μtπ,μ0,πt)] (1)

where the state process has initial distribution μ0 and dynamics

(xt+1π,μ0|xtπ,μ0,μtπ,μ0,πt,ϵt0)=pxtπ,μ0,μtπ,μ0,πt,ϵt0(). (2)

Here p:𝒮×𝒫(𝒮)×𝒜×𝒮×0 encodes the transition. The mean-field nature of the model stems from the presence of μtπ,μ0=(xtπ,μ0|ϵ0)𝒫(𝒮), which is the law of xtπ,μ0 conditioned on the realization of ϵ0 up to time t-1, where (ϵt0)t0 is a stochastic process taking values in a set 0. For simplicity we assume that the ϵt0 are i.i.d. They play the role of a so-called common noise affecting the state transitions. Although the presence of this noise is not necessary for the model to be meaningful and we postpone the rigorous mathematical framework to future work, we believe that its presence is important for applications. Motivations and examples for this type of noise are provided in the sequel. Randomness in the rewards as well as interactions through the control’s distribution could also be handled, but for the sake of simplicity of the presentation we limit ourselves to a reward f which is a deterministic function of (x,μ,a)𝒮×𝒫(𝒮)×𝒜 and to the interaction through the conditional distribution of the state only.

Remark 1.

Note that J depends on μ0. Although this dependence is usually omitted in the MFC literature, it is important to remember that the solution of the MFC problem changes if we let this initial distribution vary.

In the finite case, the above dynamics take the following form, where we identify 𝒫(𝒮) with the |𝒮|-simplex denoted by SS: for every (x,μ,a,s,e0)𝒮×SS×𝒜×𝒮×0, px,μ,a,e0(x) corresponds to


Such an evolution can be interpreted in terms of a transition rate matrix, and the common noise can for instance affect the coefficients of this matrix. In the continuous case, (2) can come from a continuous time model, for example a stochastic differential equation of the McKean Vlasov type (MKV SDE) via an Euler scheme [8], in which case:

xt+1π,μ0=xtπ,μ0+b(xtπ,μ0,μtπ,μ0,πt)+ϵt+1+ϵt+10, (3)

where the random variables ϵt,ϵt0,t1 are independent (e.g. with Gaussian distributions) and are interpreted as sources of noise. This type of setting has been studied in [16] with a linear dynamics and a quadratic cost.

When there is no ambiguity from the context, we will drop the superscripts π and μ0. Intuitively,  (1) is the limiting problem, as N grows to infinity, of the following control problem for N agents: Maximize over (π1,,πN) the social average reward


where μ¯tN=j=1Nδxtj/N is the empirical distribution, x0i are i.i.d. with distribution μ0 and the dynamics are (xt+1i|xti,μ¯tN,πti,ϵt0)=pxti,μ¯tN,πti,ϵt0(). Note that the same ϵt0 appears in the transitions of all (xti)i=1,,N. In other words, the system is affected by two sources of noise: one perturbs each state xi independently, and one affects all the agents. The first noise is idiosyncratic to each agent whereas the second one is common to the whole population. This latter type of noise is important in many applications, see e.g. [13, 2] for models of systemic risk or energy management in the context of mean field games. Since, in this N-agent problem, the goal is to maximize JN, the problem can be interpreted e.g. as a large collaborative game (i.e. a Pareto optimum, rather than a Nash equilibrium as in mean field games), or as the problem of a central planner trying to find the best way to control a large group of robots.

Reformulation as an MDP on the Wasserstein space of measures.

We reformulate the MFC problem (1) as the optimal control of a Markov decision process in which the state space is the space of measures, in the spirit of e.g. [23]. We restrict our attention to controls which are stationary feedback functions of ((xt),xt), namely processes π for which there exists a (deterministic) function a:𝒫(𝒮)×𝒮𝒜 such that for all t


In this situation, a typical agent takes her decision at each time step based on only two pieces of information: her current state and the distribution of the population’s states. For such a and π, we will write J(a) instead of J(π). We denote by 𝔸 the set of such functions a and it will sometimes be convenient to view them as functions from 𝒫(𝒮) to the set 𝔸~={a~:𝒮𝒜}. The initial MFC problem (1) can be recast as an optimal control problem on the distribution flow, namely,

J(μ0,a)=𝔼(ϵt0)t0t=0+γtf~(μtμ0,a,a(μtμ0,a)) (4)

under the constraint: μtμ0,a=(xtμ0,a|ϵ0), where xμ0,a solves (2). Here f~:𝒫(𝒮)×𝔸~ is defined by


and the evolution of the distribution is given by

μt+1μ0,a=Φa(μtμ0,a,),ϵt0(μtμ0,a) (5)

where Φ:𝔸~×0×𝒫(𝒮)𝒫(𝒮),(a~,e0,μ)Φa~,e0(μ), formalizing the transition (2) in our new set of notations. Note that Φ depends (possibly in a non-linear way) on the distribution at time t, in accordance with the idea of MKV dynamics. If Φ is constant with respect to the common noise, then the evolution of the distribution is deterministic and the expectation symbol in (4) is superfluous. To alleviate the presentation, we sometimes omit the dependence on ϵ0 (since it is now the only source of randomness) and see Φ as a stochastic map from 𝔸~×𝒫(𝒮) to 𝒫(𝒮).

We are facing an MDP over the space 𝒫(𝒮) of probability measures on the underlying state 𝒮. Note that 𝒫(𝒮) is always continuous and infinite dimensional unless 𝒮 is finite. If 𝒮 is finite, the distribution μt can be viewed as a vector in |𝒮| whose dynamics can be written as

μt+1μ0,a=Pμtμ0,a,a(μtμ0,a),ϵt0μtμ0,a, (6)

where Pμtμ0,a,a(μtμ0,a),ϵt0|𝒮|×|𝒮| is the transition matrix of the distribution, which can be interpreted in terms of the transition rate matrix of a typical player.

Dynamic programming.

Let V be the value function associated to the above problem (4), defined by: for μ𝒫(𝒮),

V(μ)=supa𝔸𝔼t=0+γtf~(μtμ,a,a(μtμ,a)) (7)

under the constraint: (μtμ,a)t0 solves (5) with initial condition μ. One can check, at least formally, that a dynamic programming principle holds in the sense that V solves the following Bellman equation:

V(μ)=supa𝔸{f~(μ,a(μ))+γ𝔼V(Φa(μ)(μ))}=supa~𝔸~{f~(μ,a~)+γ𝔼V(Φa~(μ))}, (8)

where the expectation is over the randomness of Φ, which stems from the common noise.

Moreover, under suitable conditions, we expect a verification theorem to hold, namely, any function satisfying the above Bellman equation (8) coincides with the value function of the MFC problem: V(μ0)=supa𝔸J(μ0,a) for any initial distribution μ0, and the optimal control is given by the maximizer in (8). The value function and its connection to the so-called Master equation and to the mean-field PDE system has been a very active research direction in recent years, see e.g. [5, 11, 12].

3 Mean-Field Q-Learning

We now turn our attention to the question of learning the solution of the MFC problem in a model-free setting, i.e., assuming the model is unknown but one has access to a simulator which can provide samples of trajectories and rewards.

State-action value function.

From now on, we see the distribution as the state in 𝒫(𝒮) and the action taken by the central planner given this distribution as an element of 𝔸~. We then introduce the following state-action value function Q:𝒫(𝒮)×𝔸~ defined, for every (μ,a~)𝒫(𝒮)×𝔸~ by

Q(μ,a~)=f~(μ,a~)+γ𝔼maxa~𝔸~Q(Φa~(μ),a~), (9)

where we recall that we dropped ϵ0 from the notation in the expectation and in Φ. Under suitable conditions, V and Q are well defined and V(μ)=supa~Q(μ,a~). Note that finding one maximizer a~μ𝔸~ for each μ𝒫(𝒮) amounts to find a maximizer in 𝔸 for (7).

In the rest of this section, we propose two model-free algorithms relying on this state-action value function. We first focus on the case where 𝒮 and 𝒜 are finite and we explain later on how to adapt these techniques to the case of continuous spaces.

First method: Tabular MFQ-learning with projection.

Note that even when 𝒮 is finite, μ𝒫(𝒮) is an element of the |𝒮|-simplex SS, which is a continuous space, and hence a tabular version of Q-learning can not be applied directly. One possible workaround is to first replace SS by a finite subset SSˇ and then project μt on SSˇ at each time step t, thus letting the mean-field term take only a finite number of values. We can then approximate the function Q by a function Qˇ:SSˇ×𝔸~, which can be represented by a matrix (or a “table”) in |SSˇ|×|𝒜||𝒮| (since 𝔸~ is the set of functions from 𝒮 to 𝒜). We introduce the following “projected MFC problem”: Maximize over aˇAˇ={aˇ:SSˇ×SA}

Jˇ(μ0,aˇ)=𝔼t=0+γtf~(μˇtμ0,aˇ,aˇ(μˇtμ0,aˇ)) (10)

under the constraint: μˇt+1μ0,aˇ=Φˇaˇ(μˇtμ0,aˇ),ϵt0(μˇtμ0,aˇ), where Φˇ:A~××E0SSˇSSˇ, (a~,e0,μˇ)ProjSSˇ(Φa~,e0(μˇ)). This problem is still an optimal control problem of some sequence of distributions, except that at each time step the distribution is pushed forward by Φˇa~=ProjSSˇΦa~ (instead of Φa~) when using control a~. In this case, a straightforward adaptation of the tabular Q-learning algorithm leads to Algorithm 1. Note that, even in the absence of common noise, this algorithm is possibly stochastic since at each episode, the order in which the state-action pairs are picked is potentially random. In practice, the order could be fixed in advance or stem from a sampled trajectory.

\DontPrintSemicolon\KwDataA number of episodes Nepi; a sequence of learning rates (αt)t=0,,Nepi-1. \KwResultAn approximation of Q on SSˇ×𝔸~. \Begin Initialize table Qˇ00|SSˇ|×|𝒜||𝒮| \[email protected]
\For episode t=0,1,Nepi-1 Set Qˇt+1Qˇt\[email protected]
\For(μˇ,a~)SSˇ×𝔸~ Execute action a~, observe μˇ=Φˇa~(μˇ) and reward f~=f~(μˇ,a~)\[email protected]
Replace Qˇt+1(μˇ,a~) by
\algorithmcfname 1 Mean-Field Q-learning (MFQ)

For this elementary algorithm, as a proof of concept we provide a convergence result for the approximation of the Q-function of the original MFC problem by the table returned at the end of Algorithm 1. To this end, in order to keep the paper at a reasonable length, we will make the following simplifying assumptions.

We endow the simplex SS seen as a subset of |𝒮| with the Euclidean distance denoted by dSS.

  1. 1.

    Regularity of the data: f~ is bounded and Lipschitz continuous with respect to (μ,a~) with constant Lf~ and Φ is Lipschitz continuous with respect to μ with constant LΦ, uniformly in a~ in expectation over the randomness of the common noise, namely: for every a~𝔸~,μ1,μ2SS,

  2. 2.

    Regularity of the value function: V is Lipschitz continuous wrt μ with constant LV.

  3. 3.

    Simplex discretization: There exists ϵS>0 s.t.: μSS,μˇSSˇ s.t. dSS(μ,μˇ)ϵS.

  4. 4.

    Covering time: There exists a finite Tcov such that with probability 1/2 (over the randomness of the common noise and of Algorithm 1) the following holds: For every starting point in SSˇ×𝔸~, every element of SSˇ×𝔸~ has been visited before time Tcov during the execution of Algorithm 1.

  5. 5.

    Learning rates: There exists κ(1/2,1) such that for every (μˇ,a~)SSˇ×𝔸~, αt(μˇ,a~)=1/(1+n(μˇ,a~,t))κ for each t0, where n(μˇ,a~,t) is the number of times up to t that the pair (μˇ,a~) has been visited in Algorithm 1.

The regularity of V in 2 can typically be ensured through suitable conditions on the data of the problem, as e.g. in [17, 9, 12], while to obtain 3, one can consider an ϵS-net as in [24]. Assumption 4 holds for instance either by using exploring starts (if the learner can query an oracle which simulates transitions from any (μ,a~)), or by following a long enough sampled trajectory (provided some form of irreducibility or ergodicity of the dynamics, ensuring full exploration). Note that the boundedness of the running reward f~ from Assumption 1 together with the fact that γ(0,1) ensures the existence of a finite upper bound Vˇmax for the value function of the projected MFC problem. We denote by β=(1-γ)/2 the horizon of the MDP corresponding to the projected MFC problem, and for δ(0,1), we let Tcov(δ)=Tcovlog2(1/(2δ)).

Theorem 2.

Let δ(0,1) and ϵ>0. Under Assumptions 15, if the number of episodes Nepi is of order

Ω(((Tcov(δ))1+3κVˇmax2ln(|SSˇ||𝒜||𝒮|Vˇmax/(2δβϵ))β2ϵ2)1κ+((Tcov(δ))βln(Vˇmaxϵ))11-κ), (11)

then with probability 1-δ, for all (μ,a~)SS×A~,


where ϵ=ϵ+[γ(2-γ)1-γLV(1+LΦ)+Lf~]ϵS.

Note that ϵ can be chosen as small as desired provided Nepi is large enough. The second term in the error ϵ is proportional to ϵS, which is somehow unavoidable in general due to the projection on SSˇ. However, this error vanishes as ϵS0, i.e., as SSˇ is better and better an approximation of SS.

The proof is deferred to Section C of the appendix. It relies on the following three steps: (1) Since Nepi is large enough, then QˇNepiQˇ on SSˇ×𝔸~; (2) QˇQ on SSˇ×𝔸~; (3) For every (μ,a~)SS×𝔸~, Q(ProjSSˇ(μ),a~)Q(μ,a~). The first step relies on standard Q-learning convergence results [19], while the two other steps stem from the regularity assumptions and the approximation of SS by SSˇ.

Let us now derive a consequence in terms of the control function. We will use the following additional assumption on the gap between the values of the best and second-best actions, which is rather standard in approximation algorithms based on tabular Q-functions [20, 4].

  1. 1.

    Action gap: There exists KA>0 such that: For every μˇSSˇ and a~𝔸~\argmaxQ(μˇ,), max𝔸~Q(μˇ,)-Q(μˇ,a~)KA.

For τ>0 and xn, we use the notation



Corollary 3.

In the setting of Theorem 2, if in addition Assumption 1 holds, then for every μˇSSˇ,


where QˇNepi is the table returned by Algorithm 1 and ϵ is the error term appearing in Theorem 2.

The proof is deferred to Section C of the appendix. The argmaxe in the second term is here in case there are several optimal controls. The softmax regularizes the best action predicted by the estimation QˇNepi of the function Q.

Second method: DDPG for MFC.

Although the elementary structure of the above method allows us to analyze it, one drawback is that it requires the computation of a projection on a discretization of the simplex at each time step, which can be quite cumbersome and computationally costly when the number of states in 𝒮 is large. For this reason, we propose a different RL method based on an approximation of the Q-function by neural networks which can deal with inputs and outputs in continuous spaces. Still in the case of finite 𝒮 and 𝒜, since the state of distributions 𝒫(𝒮) and the set of actions 𝒜𝒮 are finite dimensional, we can try to approximate the Q-function by a neural network taking inputs in |𝒮|×|𝒜||𝒮| and outputting a real number. For the learning procedure, we employ the Deep Deterministic Policy Gradient (DDPG) proposed in [33]. It relies on two neural networks, one for the Q-function (the critic) and one for the policy (the actor). The heart of the algorithm consists in updating alternatively the critic by minimizing an empirical square error and the actor by making one step of gradient descent. To improve exploration, a Gaussian noise is added to the action prescribed by the actor, and for more stability, target networks are also added. The algorithm is summarized in the appendix (see Algorithm 2).

Adaptation to continuous spaces.

When the underlying state space 𝒮 is continuous, the distribution μt is an element of the infinite-dimensional space 𝒫(𝒮). In order to develop a reinforcement learning method, we thus need some kind of finite-dimensional approximation. Motivated by applications to physical models such as swarm of robots or drones in which the underlying space 𝒮 is in low dimension (typically 𝒮d with d=1,2 or 3), we propose to represent μt by a histogram of its values at a finite number of points. In other words, we consider a finite number Np of points in 𝒮 and let MtNp be a vector which approximates μt. Similarly, an action a𝔸~ can be approximated by its values on the Np points chosen in 𝒮 and can hence be represented by a vector in Np. The problem then reduces to the finite-state case discussed above.

4 Numerical Examples

General setup.

We present numerical results obtained using the second method introduced above. We assumed that the central planner has access to an oracle which can simulate the evolution of Mt as discussed at the end of the previous section. In the numerical implementation, we provided the learning algorithm with a black-box which computes transitions of Mt. This oracle has been implemented using a finite-difference scheme for Kolmorov-Fokker-Planck equations, in line with recent research on numerical methods for mean field games [1]. The actor and critic networks have been implemented using a feedforward fully connected architecture with 2 hidden layers of width less than 300 neurons. We used random initial states at each episode, and the noise used on the action is a Gaussian noise with mean 0 and variance 0.02. We used Adam optimizer with initial learning rate 0.0001 and minibatches of size 16. For the sake of clarity and in order to benchmark the aforementioned method, we provide illustrations on examples without common noise but analogous results can also be obtained in the presence of a common noise.

Example 1: Cyber security model.

For a first testbed, we start with a finite state problem. We revisit the cyber security example introduced in [28], but here from the point of view of a central planner (such as a large company or a state) trying to protect its computers against the attacks of a hacker. The situation can hence be phrased as a mean field control problem.

In this model, the population consists of a large group of computers which can be either defended (D) or undefended (U) and either infected (I) or susceptible (S) of infection. The set 𝒮 has hence four elements corresponding to the four possible combinations: DI, DS, UI, US. The action set is 𝒜={0,1}, where 0 is interpreted as the fact that the central planner is satisfied with the current level of protection (D or U) of the computer under consideration, whereas 1 means that she wants to change this level of protection. In the latter case, the update occurs at a (fixed) rate λ>0. At each of the four states, all the computers are indistinguishable and hence the central planner only chooses one action per state and applies it to all the computers at that state. When infected, each computer may recover at rate qrecD or qrecU depending on whether it is defended or not. On the other hand, a computer may be infected either directly by a hacker, at rate vHqinfD (resp. vHqinfU) if it is defended (resp. undefended), or by undefended infected computers, at rate βUUμ({UI}) (resp. βUDμ({UI})) if it is undefended (resp. defended), or by defended infected computers, at rate βDUμ({DI}) (resp. βDDμ({DI})) if it is undefended (resp. defended).

The transition matrix from (6) is given by




and all the instances of should be replaced by the negative of the sum of the entries of the row in which appears on the diagonal. At each time step, the central planner pays a protection cost kD>0 for each defended computer, and a penalty kI>0 for each infected computer. The total reward at time t is hence f~t=-[kDμt({DI,DS})+kIμt({DI,UI})].

Although the underlying state space 𝒮 is finite, instead of using MFQ (see Algorithm 1), we use the second method introduced above based on DDPG. In the present case, this approach has the advantage to avoid discretizing 𝒫(𝒮) since we instead deal directly with the distribution as a vector in dimension 4. The DDPG method learns a control, that we then apply to the three reference cases studied in [11, Section 7.2.3] (for the sake of brevity, we do not reproduce here the values of the parameters). The resulting distribution’s evolution are shown in Figure 2. We can see that we recover the same evolution for the three initial distributions considered, namely (0.25,0.25,0.25,0.25),(1,0,0,0) and (0,0,0,1). In particular, at T=10, we obtain in all three simulations the distribution μ10=(0.0,0.0,0.4376,0.5624), which is close to the values found in [11, Section 7.2] for the stationary distribution, namely (0.0,0.0,0.44,0.56).



(a) Test 1


(b) Test 2


(c) Test 3
Figure 2: Cyber security example: Evolution of the distribution when applying the control learnt by DDPG.

Example 2: Swarm motion.

We then turn our attention to a model in continuous space. More precisely, let us consider a model of swarm motion with aversion to crowded regions introduced in [3] (in the context of mean field games). Since here we simply want to provide a proof of principle for our method, we take (as in the aforementioned work) the interval [0,1] with periodic boundary condition (i.e. the unit torus) as the state space 𝒮. The dynamics of a typical agent is driven by (3) with b(x,μ,a)=a. In other words, the central planner chooses the velocity of each agent. The instantaneous reward (appearing in (1)) of a typical agent at location x and using action a while the population’s state is μ, is defined as f(x,μ,a)=-12|a|2+φ(x)-ln(μ(x)). Here, the first term penalizes a large velocity (it can be interpreted as a kind of cost proportional to the kinetic energy of the agent), φ encodes a preference for certain positions in space, and the last term models crowd aversion since it penalizes the fact of being at a location where the density of agents is high. By choosing


we obtain a model for which, when ϵt have Gaussian distribution and ϵ00, the MFC admits an explicit ergodic solution that we can use as a benchmark. Indeed, in this case the optimal ergodic control is given by a~(x)=2πcos(2πx) and the ergodic distribution of the corresponding MKV dynamics has density μ(x)=e2sin(2πx)/e2sin(2πx)𝑑x.

To implement the second RL method described above, we discretize [0,1] with a mesh of Np points and use a finite difference scheme to simulate the evolution of the dynamics. The DDPG method uses this as a black-box and, for a given action a~Np, can only access the resulting new distribution and the associated reward. Figure 3 presents results obtained using this method after 3000 episodes. The system has been trained on initial distributions which are Gaussian with random mean and random variance. As illustrated in the figures, the system has learnt how to drive this type of initial distributions towards the analytical stationary distribution and then how to use an approximation of the stationary optimal control in order to keep the system in the stationary regime.



(a) μt at several t


(b) μt at several t


(c) Evolution of μt


(d) Evolution of μt


(e) Control learnt


(f) Control learnt
Figure 3: Swarm motion: Evolution of the distribution and control learnt for two different initial distributions.

5 Conclusion and future research

In this work, we explored the central role played by MKV dynamics in multi-agent RL. We developed a framework and model-free methods to learn mean field optimal control. As a proof of principle, we established a rate of convergence for a Q-learning method and our numerical tests assess empirical convergence of an actor-critic method on examples from the literature. An important feature of our model is the presence of common noise, whose impact had to be controlled.

Our results can be extended in several directions. The analysis and the numerical implementations could be applied to other mean-field problems such as mean field games or mean field control problems with several populations, with important applications to multi-agent sweeping and tracking. The proof of convergence of the actor-critic method is also postponed for future work. Last, it would be interesting to investigate other types of simulators, such as Monte-Carlo simulators based on samples of a finite population.

Related work. Our work is at the intersection of RL and MFC. The latter has recently attracted a lot of attention, particularly since the introduction of mean field games (MFG) by Lasry and Lions [2006a, 2006b, 2007] and by Caines, Huang and Malhamé [2006, 2007]. MFGs correspond to the asymptotic limit of Nash equilibria for games with mean field interactions. They are defined through a fixed point procedure and hence, differ both conceptually and numerically, from MFC problems which correspond to social optima, see e.g. [11] for details. Most works on learning in the presence of mean field interactions have focused on MFGs, see e.g. [40, 10] for “learning” (or rather solving) MFGs based on the full knowledge of the model, and [27, 38, 39, 24, 34, 18, 37, 21] for RL based methods. In contrast, our work focuses on MFC problems. Along these lines, we have studied policy gradient methods for MFC in [16]. However, this work was restricted to linear-quadratic models. While completing the present work, we became aware of the very recent work [36], which studies MFC with policy gradient methods too. However, their work is restricted to finite state and action spaces whereas we also consider continuous spaces. Furthermore, we provide a rate of convergence (see Theorem 2). We also stress that although some tools are common, our work differs significantly from [24, 18] because the latter works deal with a mean field game. Their learning procedure is embedded in a fixed point on the distribution and, for this reason, the Q-learning step is only needed to solve a classical control problem and not a mean field one. Here, the key novelty is that our learning methods are designed for MDPs on the space of measures. Last, we would also like to mention that the first two authors have recently proposed machine learning methods for solving mean field control problems and mean field games, see [14, 15]. These methods are based on the knowledge of the model, since one relies on it to compute gradients of the cost functional and implement a stochastic gradient descent. The present work can be viewed as an extension of these methods, where one tries to be free from the model and learn the solution by trial and error.


M.L. is grateful to Matthieu Geist and Julien Pérolat for helpful discussions on the DDPG algorithm.


  • Achdou and Capuzzo-Dolcetta [2010] Achdou, Y. and Capuzzo-Dolcetta, I. (2010). Mean field games: numerical methods. SIAM J. Numer. Anal., 48(3):1136–1162.
  • Alasseur et al. [2017] Alasseur, C., Ben Tahar, I., and Matoussi, A. (2017). An extended mean field game for storage in smart grids. arXiv preprint arXiv:1710.08991.
  • Almulla et al. [2017] Almulla, N., Ferreira, R., and Gomes, D. (2017). Two numerical approaches to stationary mean-field games. Dyn. Games Appl., 7(4):657–682.
  • Bellemare et al. [2016] Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., and Munos, R. (2016). Increasing the action gap: New operators for reinforcement learning. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Bensoussan et al. [2013] Bensoussan, A., Frehse, J., and Yam, S. C. P. (2013). Mean field games and mean field type control theory. Springer Briefs in Mathematics. Springer, New York.
  • Bensoussan et al. [2015] Bensoussan, A., Frehse, J., and Yam, S. C. P. (2015). The master equation in mean field theory. J. Math. Pures Appl. (9), 103(6):1441–1474.
  • Bertsekas [2012] Bertsekas, D. P. (2012). Dynamic programming and optimal control. Vol. II. Approximate dynamic programming. Athena Scientific, Belmont, MA, fourth edition.
  • Bossy and Talay [1997] Bossy, M. and Talay, D. (1997). A stochastic particle method for the McKean-Vlasov and the Burgers equation. Math. Comp., 66(217):157–192.
  • Cardaliaguet et al. [2019] Cardaliaguet, P., Delarue, F., Lasry, J.-M., and Lions, P.-L. (2019). The master equation and the convergence problem in mean field games, volume 201 of Annals of Mathematics Studies. Princeton University Press, Princeton, NJ.
  • Cardaliaguet and Hadikhanloo [2017] Cardaliaguet, P. and Hadikhanloo, S. (2017). Learning in mean field games: the fictitious play. ESAIM: Control, Optimisation and Calculus of Variations, 23(2):569–591.
  • Carmona and Delarue [2018a] Carmona, R. and Delarue, F. (2018a). Probabilistic theory of mean field games with applications. I, volume 83 of Probability Theory and Stochastic Modelling. Springer, Cham. Mean field FBSDEs, control, and games.
  • Carmona and Delarue [2018b] Carmona, R. and Delarue, F. (2018b). Probabilistic theory of mean field games with applications. II, volume 84 of Probability Theory and Stochastic Modelling. Springer, Cham. Mean field games with common noise and master equations.
  • Carmona et al. [2015] Carmona, R., Fouque, J.-P., and Sun, L.-H. (2015). Mean field games and systemic risk. Commun. Math. Sci., 13(4):911–933.
  • Carmona and Laurière [2019a] Carmona, R. and Laurière, M. (2019a). Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: I - the ergodic case. arXiv preprint arXiv:1907.05980.
  • Carmona and Laurière [2019b] Carmona, R. and Laurière, M. (2019b). Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: II - the finite horizon case. arXiv preprint arXiv:1908.01613.
  • Carmona et al. [2019] Carmona, R., Laurière, M., and Tan, Z. (2019). Linear-quadratic mean-field reinforcement learning: Convergence of policy gradient methods. arXiv preprint arXiv:1910.04295.
  • Chassagneux et al. [2014] Chassagneux, J.-F., Crisan, D., and Delarue, F. (2014). A probabilistic approach to classical solutions of the master equation for large population equilibria. arXiv:1411.3009.
  • Elie et al. [2019] Elie, R., Pérolat, J., Laurière, M., Geist, M., and Pietquin, O. (2019). Approximate fictitious play for mean field games. arXiv preprint arXiv:1907.02633.
  • Even-Dar and Mansour [2003] Even-Dar, E. and Mansour, Y. (2003). Learning rates for Q-learning. J. Mach. Learn. Res., 5:1–25.
  • Farahmand [2011] Farahmand, A.-m. (2011). Action-gap phenomenon in reinforcement learning. In Advances in Neural Information Processing Systems, pages 172–180.
  • Fu et al. [2019] Fu, Z., Yang, Z., Chen, Y., and Wang, Z. (2019). Actor-critic provably finds nash equilibria of linear-quadratic mean-field games. arXiv preprint arXiv:1910.07498.
  • Gao and Pavel [2017] Gao, B. and Pavel, L. (2017). On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805.
  • Gast et al. [2012] Gast, N., Gaujal, B., and Le Boudec, J.-Y. (2012). Mean field for markov decision processes: from discrete to continuous optimization. IEEE Transactions on Automatic Control, 57(9):2266–2280.
  • Guo et al. [2019] Guo, X., Hu, A., Xu, R., and Zhang, J. (2019). Learning mean-field games. arXiv preprint arXiv:1901.09585.
  • Huang et al. [2007] Huang, M., Caines, P. E., and Malhamé, R. P. (2007). Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ϵ-Nash equilibria. IEEE Trans. Automat. Control, 52(9):1560–1571.
  • Huang et al. [2006] Huang, M., Malhamé, R. P., and Caines, P. E. (2006). Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Commun. Inf. Syst., 6(3):221–251.
  • Iyer et al. [2014] Iyer, K., Johari, R., and Sundararajan, M. (2014). Mean field equilibria of dynamic auctions with learning. Management Science, 60(12):2949–2970.
  • Kolokoltsov and Bensoussan [2016] Kolokoltsov, V. N. and Bensoussan, A. (2016). Mean-field-game model for botnet defense in cyber-security. Appl. Math. Optim., 74(3):669–692.
  • Lasry and Lions [2006a] Lasry, J.-M. and Lions, P.-L. (2006a). Jeux à champ moyen. I. Le cas stationnaire. C. R. Math. Acad. Sci. Paris, 343(9):619–625.
  • Lasry and Lions [2006b] Lasry, J.-M. and Lions, P.-L. (2006b). Jeux à champ moyen. II. Horizon fini et contrôle optimal. C. R. Math. Acad. Sci. Paris, 343(10):679–684.
  • Lasry and Lions [2007] Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Jpn. J. Math., 2(1):229–260.
  • Laurière and Pironneau [2016] Laurière, M. and Pironneau, O. (2016). Dynamic programming for mean-field type control. J. Optim. Theory Appl., 169(3):902–924.
  • Lillicrap et al. [2016] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR 2016).
  • Mguni et al. [2018] Mguni, D., Jennings, J., and de Cote, E. M. (2018). Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Pham and Wei [2017] Pham, H. and Wei, X. (2017). Dynamic programming for optimal control of stochastic McKean-Vlasov dynamics. SIAM J. Control Optim., 55(2):1069–1101.
  • Subramanian and Mahajan [2019] Subramanian, J. and Mahajan, A. (2019). Reinforcement learning in stationary mean-field games. In Proceedings. 18th International Conference on Autonomous Agents and Multiagent Systems.
  • Tiwari et al. [2019] Tiwari, N., Ghosh, A., and Aggarwal, V. (2019). Reinforcement learning for mean field game. arXiv preprint arXiv:1905.13357.
  • Yang et al. [2018a] Yang, J., Ye, X., Trivedi, R., Xu, H., and Zha, H. (2018a). Deep mean field games for learning optimal behavior policy of large populations. In International Conference on Learning Representations.
  • Yang et al. [2018b] Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., and Wang, J. (2018b). Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5567–5576.
  • Yin et al. [2013] Yin, H., Mehta, P. G., Meyn, S. P., and Shanbhag, U. V. (2013). Learning in mean-field games. IEEE Transactions on Automatic Control, 59(3):629–644.

Appendix A Derivation of Bellman equation

Although the derivation of the Bellman equation (7) is rather standard, see e.g. [7], we include it for the sake of completeness.

To alleviate the presentation, we assume that the running reward f is bounded by a constant Cf and we focus on stationary controls. To avoid measurability issues, we assume that the set 0 is countable. More general settings could be considered at the expense of technicalities which are beyond the scope of the present work. We rewrite the value function as


which leads us to introduce the following dynamic programming operators defined, for a bounded function v and a𝔸, by


The above expression for V then rewrites


where (𝒯n0) represents the result of 𝒯 composed n times and applied to the function which is identically 0 on 𝒫(𝒮). Thanks to the bound on f, one can check that for every μ𝒫(𝒮) and every integer n,


Applying 𝒯 to the above relation yields


By taking n+, we deduce


which is the desired relation. One can also check that the solution to (7) is unique. Similarly, denoting J~(a)(μ)=J(μ,a), we have for any a𝔸 and μ𝒫(𝒮), J~(a) is the only fixed point of 𝒯a.

In addition, let us consider an stationary control a* which is optimal in the sense that, for any μ0𝒫(𝒮), it achieves the supremum over a𝔸 in (4), i.e., J~(a*)=V. Then, using the Bellman equations for V and J~(a*), we have

(𝒯V)(μ0) =V(μ0)

Conversely, assume a is such that for every μ0,


Then, since (𝒯V)(μ0)=V(μ0), we obtain that V(μ0)=(𝒯aV)(μ0). By uniqueness of the fixed point of 𝒯a, we have that V=J~(a) and hence a is optimal.

Appendix B Link with MDPs arising in Mean Field Games

Here, we clarify the difference between the Mean Field MDPs studied in the present work and the ones arising in the context of finite state Mean Field Games [38, 24]. Although both problems involve the term “mean field”, we argue that in a MFG, the MDP is a rather standard one.

In [38], the authors make the following point (at the beginning of their Section 4): given the evolution of the population (i.e., the mean-field term), an infinitesimal player solves a (standard) optimal control problem, to which corresponds a (standard) MDP. In other words, the population’s distribution appears as a given parameter in the MDP and not as the state over which the optimization is performed, as in our case. To emphasize, in our notation, the difference between their setting and ours, let us consider the evolution (6) for a finite state space 𝒮={1,2,,|𝒮|}. As in [38], let us assume that the players control directly their transition probabilities. In other words, the action space is the set of probability distributions on 𝒮, and all the players in a state x𝒮 control the probability with which they will go to each other state x𝒮. In this case, each element of 𝒜 can be identified with a vector of length |𝒮| representing a probability distribution on 𝒮, and each element of a~𝔸~ can be identified with a matrix Pa~ such that Px,xa~=a~(x)(x). In [38], the initial distribution μ0 is fixed (hence we omit to denote explicitly the dependence on μ0), and there is no common noise. In this case, one can look for feedback controls which are functions of time and x only. Then the MFC problem (4) rewrites: Find 𝐚~=(a~t)t0,atA maximizing


under the constraint: μ0𝐚~=μ0 and for t0,

μt+1𝒂~=Pa~tμt𝒂~, (12)

where P is as defined above. The corresponding MFG, analogous to the one studied in [38], can be formulated as follows: Find a sequence of distributions 𝐦=(mt)t0 and a sequence of controls 𝐚~=(a~t),a~tA such that: (1) 𝐚~ maximizes


under the constraint: μ0𝐚~,𝐦=μ0 and for t0,

μt+1𝒂~,𝒎=Pa~tμt𝒂~,𝒎, (13)

and (2) for every t0, mt=μ0𝐚~,𝐦. Step (1) above corresponds to the problem faced by a typical agent, when the evolution of the population is given by 𝒎. The dynamics (13) can be viewed as an MDP on the space of measures, but the evolution is purely linear, in the sense that at time t, the state of the MDP, namely μt𝒂~,𝒎, is not involved in the transition matrix, namely Pa~t. In contrast, the dynamics (6) is in general non-linear.

In [24], the authors build the MDP upon the same insight in a different way. To stress the difference with our notion of MDP, let us go back to the MFC problem (1) and consider again a finite state space 𝒮 and no common noise. The corresponding MFG would be: Find a flow of distributions 𝐦=(mt)t0 and a policy π such that: (1) π maximizes


where the state process x has initial distribution μ0 and dynamics


and (2) For every t0, mt is the distribution of xtπ,𝐦. As mentioned above, in the optimization problem of a typical player, the flow of distributions 𝒎 appears as a given parameter. The corresponding MDP is hence parameterized by this 𝒎 but evolves in the finite state space 𝒮. Furthermore, in [24], although they consider interactions through the control’s distributions (omitted here to alleviate the notations), the authors focus on a stationary Nash equilibrium. In other words, they look for a solution of the following type of problems: Find mP(S) and π=(πt)t0 such that, letting 𝐦=(m,m,), we have: (1) π maximizes


where the state process has initial distribution m and dynamics


and (2) For every t0, m is the distribution of xtπ,𝐦. In this case, the rewards and dynamics of an infinitesimal player is parameterized by a single distribution m and the same remark holds for the corresponding MDP.

Appendix C Proof of the convergence results

Proof of Theorem 2.

Let us denote by Vˇ and Qˇ respectively the state value function and the state-action value function of the projected MFC problem. We split the proof into several steps.

Step 1. If Nepi is large enough, then for every (μˇ,a~)SSˇ×𝔸~,


This comes from standard convergence results on Q-learning for finite state-action spaces. More precisely, under Assumptions 1, 4 and 5, we can apply Theorem 4 and Corollary 34 in [19] for asynchronous Q-learning and polynomial learning rates, and we obtain that, with probability at least 1-δ, QˇNepi-Qˇϵ, given that Nepi is of order (11).

Step 2. For every (μˇ,a~)SSˇ×𝔸~,


This amounts to say that the projection on SSˇ realized at each step does not perturb too much the value function. Let us start by noting that, for every μˇSSˇ and a~𝔸~,


where the last inequality holds by Lipschitz continuity of V, see Assumption 2, and because for every μˇSSˇ,


To conclude, we use Assumptions 3 and 1, and we obtain:


Combining the above bounds yields


Step 3. For every (μ,a~)SS×𝔸~,


Indeed, for every μSSˇ and a~𝔸~, letting μˇ=ProjSSˇ(μ) to alleviate the notation, we have


where we used the Lipschitz continuity of f~,V,Φ and the assumption on SSˇ, see Assumptions 1, 2 and 3. ∎

Proof of Corollary 3.

We use Proposition 4 of [22], which states that softmaxτ is τ-Lipschitz and Lemma 7 [24], which states that for (xi)i=1,,n,


where δ=max(x)-maxxi<max(x)xi, and δ= if all xi are equal. We can apply this latter result to Q(μˇ,) thanks to our Assumption 1 on the action gap, with n=|𝔸~| and δ=KA. Combining this with the result of Theorem 2, we have, for every μˇ,


Appendix D Deep Deterministic Policy Gradient Algorithm

In this section, we recall for the sake of completeness the Deep Deterministic Policy Gradient (DDPG) proposed in [33], adapted to our mean field control problem. See also [18] for how the same algorithm can be used for MFG. In the latter case, it is used to compute the (approximate) best response of a single infinitesimal player instead of the optimal control for the whole population as in MFC problems.

\DontPrintSemicolon\KwDataA number of episodes Nepi; a length T for each episode; a minibatch size Nbatch; a learning rate τ. \KwResultA control a𝔸. \Begin Randomly initialize critic network QθQ and actor network πθπ with parameters θQ and θπ respectively\[email protected]
Randomly initialize target networks QθQ and network πθπ with θQθQ and θπθπ \[email protected]
\For episode k=0,1,Nepi-1 Initialize M0 \[email protected]
Initialize replay buffer R \[email protected]
\Fort=0,1,T-1 Select an action at=πθπ(Mt)+𝒩tNp\[email protected]
Execute at, observe reward f~t and Mt+1\[email protected]
Store transition (Mt,at,f~t,Mt+1) in R \[email protected]
Sample a random minibatch of Nbatch transitions (Mi,ai,f~i,Mi+1) from R \[email protected]
Set yi=f~i+γQθQ(Mi+1,πθπ(Mi+1)), for i=1,Nbatch \[email protected]
Update the critic by minimizing the loss: L(θQ)=1Nbatchi(yi-QθQ(xi,ai))2 \[email protected]
Update the actor policy using the sampled policy gradient θπJ of
Update target networks: θQτθQ+(1-τ)θQ and θπτθπ+(1-τ)θπ\[email protected]
\algorithmcfname 2 DDPG for MFC