Abstract
We develop a general reinforcement learning framework for mean field control(MFC) problems. Such problems arise for instance as the limit of collaborativemultiagent control problems when the number of agents is very large. Theasymptotic problem can be phrased as the optimal control of a nonlineardynamics. This can also be viewed as a Markov decision process (MDP) but thekey difference with the usual RL setup is that the dynamics and the reward nowdepend on the state's probability distribution itself. Alternatively, it can berecast as a MDP on the Wasserstein space of measures. In this work, weintroduce generic modelfree algorithms based on the stateaction valuefunction at the mean field level and we prove convergence for a prototypicalQlearning method. We then implement an actorcritic method and reportnumerical results on two archetypal problems: a finite space model motivated bya cyber security application and a continuous space model motivated by anapplication to swarm motion.
Quick Read (beta)
Abstract
We develop a general reinforcement learning framework for mean field control (MFC) problems. Such problems arise for instance as the limit of collaborative multiagent control problems when the number of agents is very large. The asymptotic problem can be phrased as the optimal control of a nonlinear dynamics. This can also be viewed as a Markov decision process (MDP) but the key difference with the usual RL setup is that the dynamics and the reward now depend on the state’s probability distribution itself. Alternatively, it can be recast as a MDP on the Wasserstein space of measures. In this work, we introduce generic modelfree algorithms based on the stateaction value function at the mean field level and we prove convergence for a prototypical Qlearning method. We then implement an actorcritic method and report numerical results on two archetypal problems: a finite space model motivated by a cyber security application and a continuous space model motivated by an application to swarm motion.
ModelFree MeanField Reinforcement Learning:
MeanField MDP and MeanField QLearning
René Carmona${}^{\mathrm{\u2020}}$ Mathieu Laurière${}^{\mathrm{\u2020}}$ Zongjun Tan${}^{\mathrm{\u2020}}$
1 Introduction
Typical reinforcement learning (RL) applications involve the search for a procedure to learn by trial and error the optimal behavior so as to maximize a reward. While similar in spirit to optimal control applications, a key difference is that in the latter, the model is assumed to be known to the controller. This is in contrast with RL for which the environment has to be explored, and the reward cannot be predicted with certainty. Still, the RL paradigm has generated numerous theoretical developments and found plenty practical applications. As a matter of fact, bidirectional links with the optimal control literature have been unveiled as common tools lie at the heart of many studies. Mean field control (MFC), also called optimal control of McKeanVlasov (MKV) dynamics, is an extension of stochastic control which has recently attracted a surge of interest (see e.g. [5, 11, 12]). From a theoretical standpoint, the main peculiarity of this type of problems is that the transition and reward functions not only involve the state and the action of the controller, but also the distribution of the state (and possibly of the control). Practically speaking, these problems appear as the asymptotic limits for the control of a large number of collaborative agents. They can also be introduced as single agent problems whose evolution and costs depend upon the distribution of her state (and potentially of her control). Such problems have found a wide range of applications in distributed robotics, energy, drone fleet management, risk management, finance, etc. Although they are bona fide control problems for which a dynamic programming principle can be formulated, they generally lead to Bellman equations on the infinite dimensional space of measures, which are extremely difficult to solve ([6, 32, 35]). Figure 1 contains a schematic diagram of the relationships between optimal control (i.e., planning with a model) and how the paradigms of MFC and RL are combined in meanfield reinforcement learning (MFRL).
$$\text{xymatrix}\text{txt}OC\text{ar}{[d]}_{N\to \mathrm{\infty}/MKV}\text{ar}{[rrr]}^{\text{txt}\text{\mathit{l}\mathit{e}\mathit{a}\mathit{r}\mathit{n}\mathit{i}\mathit{n}\mathit{g}}}\mathrm{\&}\mathrm{\&}\mathrm{\&}\text{txt}RL\text{ar}{[d]}^{N\to \mathrm{\infty}/MKV}\text{txt}MFC\text{ar}{[rrr]}_{\text{txt}\text{\mathit{l}\mathit{e}\mathit{a}\mathit{r}\mathit{n}\mathit{i}\mathit{n}\mathit{g}}}\mathrm{\&}\mathrm{\&}\mathrm{\&}\text{txt}MFRL$$ 
Main contributions. Our first contribution is conceptual and consists in introducing a general framework of MFC problems in discrete time, with infinite horizon, discount and common noise. We argue that this setup, which has not been covered in the classical literature on MFC problems, is particularly relevant to develop a theory of reinforcement learning for mean field problems. We then rephrase the problem as a Markov decision process (MDP) on the space of measures. This point of view leads to the introduction of a state value function and a stateaction value function as well as their associated Bellman equations on the Wasserstein space of measures. Our second contribution is to propose two modelfree methods to learn an approximation of the stateaction value function by trial and error. The first method relies on a discretization of the simplex and a tabular version of $Q$learning, for which we prove a convergence result. The second method is based on an actorcritic method (Deep Deterministic Policy Gradient). Last, our third contribution is to implement the latter method and assess its convergence numerically. Numerical tests are conducted on two prototypical examples drawn from the meanfield literature: a finite state model motivated by a cyber security application and a continuous state and action model motivated by an application to swarm motion.
2 Mean Field Control
In this section, we keep the discussion at an informal level in order to encompass both finite and continuous state spaces. Specific methods and examples for each setting are presented in Sections 3 and 4.
Definition of the problem.
We denote by $\mathcal{S}$ and $\mathcal{A}$ respectively the state space and the action space. Typically, we have in mind the finite space case where both are finite sets, say $\mathcal{S}=\{1,\mathrm{\dots},\mathcal{S}\}$ and $\mathcal{A}=\{1,\mathrm{\dots},\mathcal{A}\}$, or the continuous case, where $\mathcal{S}={\mathbb{R}}^{d}$ and $\mathcal{A}={\mathbb{R}}^{k}$. A generic discrete time, infinite horizon, discounted mean field control (MFC) problem (or control of McKeanVlasov dynamics) takes the following form: Maximize over the control process (or policy) $\pi $ the reward functional
$$J({\mu}_{0},\pi )=\mathbb{E}\left[\sum _{t=0}^{+\mathrm{\infty}}{\gamma}^{t}f({x}_{t}^{\pi ,{\mu}_{0}},{\mu}_{t}^{\pi ,{\mu}_{0}},{\pi}_{t})\right]$$  (1) 
where the state process has initial distribution ${\mu}_{\mathrm{0}}$ and dynamics
$$\mathbb{P}({x}_{t+1}^{\pi ,{\mu}_{0}}\in \cdot {x}_{t}^{\pi ,{\mu}_{0}},{\mu}_{t}^{\pi ,{\mu}_{0}},{\pi}_{t},{\u03f5}_{t}^{0})={p}_{{x}_{t}^{\pi ,{\mu}_{0}},{\mu}_{t}^{\pi ,{\mu}_{0}},{\pi}_{t},{\u03f5}_{t}^{0}}(\cdot ).$$  (2) 
Here $p:\mathcal{S}\times \mathcal{P}(\mathcal{S})\times \mathcal{A}\times \mathcal{S}\times {\mathcal{E}}^{0}\to \mathbb{R}$ encodes the transition. The meanfield nature of the model stems from the presence of ${\mu}_{t}^{\pi ,{\mu}_{0}}=\mathcal{L}({x}_{t}^{\pi ,{\mu}_{0}}{\u03f5}^{0})\in \mathcal{P}(\mathcal{S})$, which is the law of ${x}_{t}^{\pi ,{\mu}_{0}}$ conditioned on the realization of ${\u03f5}^{0}$ up to time $t1$, where ${({\u03f5}_{t}^{0})}_{t\ge 0}$ is a stochastic process taking values in a set ${\mathcal{E}}^{0}$. For simplicity we assume that the ${\u03f5}_{t}^{0}$ are i.i.d. They play the role of a socalled common noise affecting the state transitions. Although the presence of this noise is not necessary for the model to be meaningful and we postpone the rigorous mathematical framework to future work, we believe that its presence is important for applications. Motivations and examples for this type of noise are provided in the sequel. Randomness in the rewards as well as interactions through the control’s distribution could also be handled, but for the sake of simplicity of the presentation we limit ourselves to a reward $f$ which is a deterministic function of $(x,\mu ,a)\in \mathcal{S}\times \mathcal{P}(\mathcal{S})\times \mathcal{A}$ and to the interaction through the conditional distribution of the state only.
Remark 1.
Note that $J$ depends on ${\mu}_{\mathrm{0}}$. Although this dependence is usually omitted in the MFC literature, it is important to remember that the solution of the MFC problem changes if we let this initial distribution vary.
In the finite case, the above dynamics take the following form, where we identify $\mathcal{P}(\mathcal{S})$ with the $\mathcal{S}$simplex denoted by $\mathrm{SS}$: for every $(x,\mu ,a,s,{e}^{0})\in \mathcal{S}\times \mathrm{SS}\times \mathcal{A}\times \mathcal{S}\times {\mathcal{E}}^{0}$, ${p}_{x,\mu ,a,{e}^{0}}({x}^{\prime})$ corresponds to
$$\mathbb{P}({x}_{t+1}^{\pi ,{\mu}_{0}}={x}^{\prime}({x}_{t}^{\pi ,{\mu}_{0}},{\mu}_{t}^{\pi ,{\mu}_{0}},{\pi}_{t},{\u03f5}_{t}^{0})=(x,\mu ,a,{e}^{0})).$$ 
Such an evolution can be interpreted in terms of a transition rate matrix, and the common noise can for instance affect the coefficients of this matrix. In the continuous case, (2) can come from a continuous time model, for example a stochastic differential equation of the McKean Vlasov type (MKV SDE) via an Euler scheme [8], in which case:
$${x}_{t+1}^{\pi ,{\mu}_{0}}={x}_{t}^{\pi ,{\mu}_{0}}+b({x}_{t}^{\pi ,{\mu}_{0}},{\mu}_{t}^{\pi ,{\mu}_{0}},{\pi}_{t})+{\u03f5}_{t+1}+{\u03f5}_{t+1}^{0},$$  (3) 
where the random variables ${\u03f5}_{t},{\u03f5}_{t}^{0},t\ge 1$ are independent (e.g. with Gaussian distributions) and are interpreted as sources of noise. This type of setting has been studied in [16] with a linear dynamics and a quadratic cost.
When there is no ambiguity from the context, we will drop the superscripts $\pi $ and ${\mu}_{0}$. Intuitively, (1) is the limiting problem, as $N$ grows to infinity, of the following control problem for $N$ agents: Maximize over $\mathrm{(}{\pi}^{\mathrm{1}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\pi}^{N}\mathrm{)}$ the social average reward
$${J}^{N}({\mu}_{0},{\pi}^{1},\mathrm{\dots},{\pi}^{N})=\frac{1}{N}\sum _{i=1}^{N}\mathbb{E}\left[\sum _{t=0}^{+\mathrm{\infty}}{\gamma}^{t}f({x}_{t}^{i},{\overline{\mu}}_{t}^{N},{\pi}_{t}^{i})\right],$$ 
where ${\overline{\mu}}_{t}^{N}\mathrm{=}{\mathrm{\sum}}_{j\mathrm{=}\mathrm{1}}^{N}{\delta}_{{x}_{t}^{j}}\mathrm{/}N$ is the empirical distribution, ${x}_{\mathrm{0}}^{i}$ are i.i.d. with distribution ${\mu}_{\mathrm{0}}$ and the dynamics are $\mathbb{P}({x}_{t+1}^{i}\in \cdot {x}_{t}^{i},{\overline{\mu}}_{t}^{N},{\pi}_{t}^{i},{\u03f5}_{t}^{0})={p}_{{x}_{t}^{i},{\overline{\mu}}_{t}^{N},{\pi}_{t}^{i},{\u03f5}_{t}^{0}}(\cdot ).$ Note that the same ${\u03f5}_{t}^{0}$ appears in the transitions of all ${({x}_{t}^{i})}_{i=1,\mathrm{\dots},N}$. In other words, the system is affected by two sources of noise: one perturbs each state ${x}^{i}$ independently, and one affects all the agents. The first noise is idiosyncratic to each agent whereas the second one is common to the whole population. This latter type of noise is important in many applications, see e.g. [13, 2] for models of systemic risk or energy management in the context of mean field games. Since, in this $N$agent problem, the goal is to maximize ${J}^{N}$, the problem can be interpreted e.g. as a large collaborative game (i.e. a Pareto optimum, rather than a Nash equilibrium as in mean field games), or as the problem of a central planner trying to find the best way to control a large group of robots.
Reformulation as an MDP on the Wasserstein space of measures.
We reformulate the MFC problem (1) as the optimal control of a Markov decision process in which the state space is the space of measures, in the spirit of e.g. [23]. We restrict our attention to controls which are stationary feedback functions of $(\mathcal{L}({x}_{t}),{x}_{t})$, namely processes $\pi $ for which there exists a (deterministic) function $a:\mathcal{P}(\mathcal{S})\times \mathcal{S}\to \mathcal{A}$ such that for all $t$
$${\pi}_{t}=a(\mathcal{L}({x}_{t}^{\pi ,{\mu}_{0}}),{x}_{t}^{\pi ,{\mu}_{0}}).$$ 
In this situation, a typical agent takes her decision at each time step based on only two pieces of information: her current state and the distribution of the population’s states. For such $a$ and $\pi ,$ we will write $J(a)$ instead of $J(\pi )$. We denote by $\mathbb{A}$ the set of such functions $a$ and it will sometimes be convenient to view them as functions from $\mathcal{P}(\mathcal{S})$ to the set $\stackrel{~}{\mathbb{A}}=\{\stackrel{~}{a}:\mathcal{S}\to \mathcal{A}\}$. The initial MFC problem (1) can be recast as an optimal control problem on the distribution flow, namely,
$$J({\mu}_{0},a)={\mathbb{E}}_{{({\u03f5}_{t}^{0})}_{t\ge 0}}\sum _{t=0}^{+\mathrm{\infty}}{\gamma}^{t}\stackrel{~}{f}({\mu}_{t}^{{\mu}_{0},a},a({\mu}_{t}^{{\mu}_{0},a}))$$  (4) 
under the constraint: ${\mu}_{t}^{{\mu}_{0},a}=\mathcal{L}({x}_{t}^{{\mu}_{0},a}{\u03f5}^{0})$, where ${x}^{{\mu}_{0},a}$ solves (2). Here $\stackrel{~}{f}:\mathcal{P}(\mathcal{S})\times \stackrel{~}{\mathbb{A}}\to \mathbb{R}$ is defined by
$$\stackrel{~}{f}(\mu ,\stackrel{~}{a})={\mathbb{E}}_{x\sim \mu}[f(x,\mu ,\stackrel{~}{a}(x))],$$ 
and the evolution of the distribution is given by
$${\mu}_{t+1}^{{\mu}_{0},a}={\mathrm{\Phi}}^{a({\mu}_{t}^{{\mu}_{0},a},\cdot ),{\u03f5}_{t}^{0}}({\mu}_{t}^{{\mu}_{0},a})$$  (5) 
where $\mathrm{\Phi}:\stackrel{~}{\mathbb{A}}\times {\mathcal{E}}^{0}\times \mathcal{P}(\mathcal{S})\to \mathcal{P}(\mathcal{S}),(\stackrel{~}{a},{e}^{0},\mu )\mapsto {\mathrm{\Phi}}^{\stackrel{~}{a},{e}^{0}}({\mu}^{\prime})$, formalizing the transition (2) in our new set of notations. Note that $\mathrm{\Phi}$ depends (possibly in a nonlinear way) on the distribution at time $t$, in accordance with the idea of MKV dynamics. If $\mathrm{\Phi}$ is constant with respect to the common noise, then the evolution of the distribution is deterministic and the expectation symbol in (4) is superfluous. To alleviate the presentation, we sometimes omit the dependence on ${\u03f5}^{0}$ (since it is now the only source of randomness) and see $\mathrm{\Phi}$ as a stochastic map from $\stackrel{~}{\mathbb{A}}\times \mathcal{P}(\mathcal{S})$ to $\mathcal{P}(\mathcal{S})$.
We are facing an MDP over the space $\mathcal{P}(\mathcal{S})$ of probability measures on the underlying state $\mathcal{S}$. Note that $\mathcal{P}(\mathcal{S})$ is always continuous and infinite dimensional unless $\mathcal{S}$ is finite. If $\mathcal{S}$ is finite, the distribution ${\mu}_{t}$ can be viewed as a vector in ${\mathbb{R}}^{\mathcal{S}}$ whose dynamics can be written as
$${\mu}_{t+1}^{{\mu}_{0},a}={P}^{{\mu}_{t}^{{\mu}_{0},a},a({\mu}_{t}^{{\mu}_{0},a}),{\u03f5}_{t}^{0}}{\mu}_{t}^{{\mu}_{0},a},$$  (6) 
where ${P}^{{\mu}_{t}^{{\mu}_{0},a},a({\mu}_{t}^{{\mu}_{0},a}),{\u03f5}_{t}^{0}}\in {\mathbb{R}}^{\mathcal{S}\times \mathcal{S}}$ is the transition matrix of the distribution, which can be interpreted in terms of the transition rate matrix of a typical player.
Dynamic programming.
Let $V$ be the value function associated to the above problem (4), defined by: for $\mu \in \mathcal{P}(\mathcal{S})$,
$$V(\mu )=\underset{a\in \mathbb{A}}{sup}\mathbb{E}\sum _{t=0}^{+\mathrm{\infty}}{\gamma}^{t}\stackrel{~}{f}({\mu}_{t}^{\mu ,a},a({\mu}_{t}^{\mu ,a}))$$  (7) 
under the constraint: ${({\mu}_{t}^{\mu ,a})}_{t\ge 0}$ solves (5) with initial condition $\mu $. One can check, at least formally, that a dynamic programming principle holds in the sense that $V$ solves the following Bellman equation:
$$\begin{array}{cc}\hfill V(\mu )& =\underset{a\in \mathbb{A}}{sup}\left\{\stackrel{~}{f}(\mu ,a(\mu ))+\gamma \mathbb{E}V\left({\mathrm{\Phi}}^{a(\mu )}(\mu )\right)\right\}\hfill \\ & =\underset{\stackrel{~}{a}\in \stackrel{~}{\mathbb{A}}}{sup}\left\{\stackrel{~}{f}(\mu ,\stackrel{~}{a})+\gamma \mathbb{E}V\left({\mathrm{\Phi}}^{\stackrel{~}{a}}(\mu )\right)\right\},\hfill \end{array}$$  (8) 
where the expectation is over the randomness of $\mathrm{\Phi}$, which stems from the common noise.
Moreover, under suitable conditions, we expect a verification theorem to hold, namely, any function satisfying the above Bellman equation (8) coincides with the value function of the MFC problem: $V({\mu}_{0})={sup}_{a\in \mathbb{A}}J({\mu}_{0},a)$ for any initial distribution ${\mu}_{0}$, and the optimal control is given by the maximizer in (8). The value function and its connection to the socalled Master equation and to the meanfield PDE system has been a very active research direction in recent years, see e.g. [5, 11, 12].
3 MeanField QLearning
We now turn our attention to the question of learning the solution of the MFC problem in a modelfree setting, i.e., assuming the model is unknown but one has access to a simulator which can provide samples of trajectories and rewards.
Stateaction value function.
From now on, we see the distribution as the state in $\mathcal{P}(\mathcal{S})$ and the action taken by the central planner given this distribution as an element of $\stackrel{~}{\mathbb{A}}$. We then introduce the following stateaction value function $Q:\mathcal{P}(\mathcal{S})\times \stackrel{~}{\mathbb{A}}\to \mathbb{R}$ defined, for every $(\mu ,\stackrel{~}{a})\in \mathcal{P}(\mathcal{S})\times \stackrel{~}{\mathbb{A}}$ by
$$Q(\mu ,\stackrel{~}{a})=\stackrel{~}{f}(\mu ,\stackrel{~}{a})+\gamma \mathbb{E}\underset{{\stackrel{~}{a}}^{\prime}\in \stackrel{~}{\mathbb{A}}}{\mathrm{max}}Q({\mathrm{\Phi}}^{\stackrel{~}{a}}(\mu ),{\stackrel{~}{a}}^{\prime}),$$  (9) 
where we recall that we dropped ${\u03f5}^{0}$ from the notation in the expectation and in $\mathrm{\Phi}$. Under suitable conditions, $V$ and $Q$ are well defined and $V(\mu )={sup}_{\stackrel{~}{a}}Q(\mu ,\stackrel{~}{a})$. Note that finding one maximizer ${\stackrel{~}{a}}_{\mu}\in \stackrel{~}{\mathbb{A}}$ for each $\mu \in \mathcal{P}(\mathcal{S})$ amounts to find a maximizer in $\mathbb{A}$ for (7).
In the rest of this section, we propose two modelfree algorithms relying on this stateaction value function. We first focus on the case where $\mathcal{S}$ and $\mathcal{A}$ are finite and we explain later on how to adapt these techniques to the case of continuous spaces.
First method: Tabular MFQlearning with projection.
Note that even when $\mathcal{S}$ is finite, $\mu \in \mathcal{P}(\mathcal{S})$ is an element of the $\mathcal{S}$simplex $\mathrm{SS}$, which is a continuous space, and hence a tabular version of Qlearning can not be applied directly. One possible workaround is to first replace $\mathrm{SS}$ by a finite subset $\stackrel{\u02c7}{\mathrm{SS}}$ and then project ${\mu}_{t}$ on $\stackrel{\u02c7}{\mathrm{SS}}$ at each time step $t$, thus letting the meanfield term take only a finite number of values. We can then approximate the function $Q$ by a function $\stackrel{\u02c7}{Q}:\stackrel{\u02c7}{\mathrm{SS}}\times \stackrel{~}{\mathbb{A}}\to \mathbb{R}$, which can be represented by a matrix (or a “table”) in ${\mathbb{R}}^{\stackrel{\u02c7}{\mathrm{SS}}\times {\mathcal{A}}^{\mathcal{S}}}$ (since $\stackrel{~}{\mathbb{A}}$ is the set of functions from $\mathcal{S}$ to $\mathcal{A}$). We introduce the following “projected MFC problem”: Maximize over $\stackrel{\mathrm{\u02c7}}{a}\mathrm{\in}\stackrel{\mathrm{\u02c7}}{\mathrm{A}}\mathrm{=}\mathrm{\{}\stackrel{\mathrm{\u02c7}}{a}\mathrm{:}\stackrel{\mathrm{\u02c7}}{\mathrm{SS}}\mathrm{\times}\mathrm{S}\mathrm{\to}\mathrm{A}\mathrm{\}}$
$$\stackrel{\u02c7}{J}({\mu}_{0},\stackrel{\u02c7}{a})=\mathbb{E}\sum _{t=0}^{+\mathrm{\infty}}{\gamma}^{t}\stackrel{~}{f}({\stackrel{\u02c7}{\mu}}_{t}^{{\mu}_{0},\stackrel{\u02c7}{a}},\stackrel{\u02c7}{a}({\stackrel{\u02c7}{\mu}}_{t}^{{\mu}_{0},\stackrel{\u02c7}{a}}))$$  (10) 
under the constraint: ${\stackrel{\mathrm{\u02c7}}{\mu}}_{t\mathrm{+}\mathrm{1}}^{{\mu}_{\mathrm{0}}\mathrm{,}\stackrel{\mathrm{\u02c7}}{a}}\mathrm{=}{\stackrel{\mathrm{\u02c7}}{\mathrm{\Phi}}}^{\stackrel{\mathrm{\u02c7}}{a}\mathit{}\mathrm{(}{\stackrel{\mathrm{\u02c7}}{\mu}}_{t}^{{\mu}_{\mathrm{0}}\mathrm{,}\stackrel{\mathrm{\u02c7}}{a}}\mathrm{)}\mathrm{,}{\u03f5}_{t}^{\mathrm{0}}}\mathit{}\mathrm{(}{\stackrel{\mathrm{\u02c7}}{\mu}}_{t}^{{\mu}_{\mathrm{0}}\mathrm{,}\stackrel{\mathrm{\u02c7}}{a}}\mathrm{)}\mathrm{,}$ where $\stackrel{\mathrm{\u02c7}}{\mathrm{\Phi}}\mathrm{:}\stackrel{\mathrm{~}}{\mathrm{A}}\mathrm{\times}\mathrm{\times}{\mathrm{E}}^{\mathrm{0}}\stackrel{\mathrm{\u02c7}}{\mathrm{SS}}\mathrm{\to}\stackrel{\mathrm{\u02c7}}{\mathrm{SS}}\mathrm{,}$ $\mathrm{(}\stackrel{\mathrm{~}}{a}\mathrm{,}{e}^{\mathrm{0}}\mathrm{,}\stackrel{\mathrm{\u02c7}}{\mu}\mathrm{)}\mathrm{\mapsto}{\mathrm{Proj}}_{\stackrel{\mathrm{\u02c7}}{\mathrm{SS}}}\mathit{}\mathrm{\left(}{\mathrm{\Phi}}^{\stackrel{\mathrm{~}}{a}\mathrm{,}{e}^{\mathrm{0}}}\mathit{}\mathrm{(}\stackrel{\mathrm{\u02c7}}{\mu}\mathrm{)}\mathrm{\right)}$. This problem is still an optimal control problem of some sequence of distributions, except that at each time step the distribution is pushed forward by ${\stackrel{\u02c7}{\mathrm{\Phi}}}^{\stackrel{~}{a}}={\mathrm{Proj}}_{\stackrel{\u02c7}{\mathrm{SS}}}\circ {\mathrm{\Phi}}^{\stackrel{~}{a}}$ (instead of ${\mathrm{\Phi}}^{\stackrel{~}{a}}$) when using control $\stackrel{~}{a}$. In this case, a straightforward adaptation of the tabular Qlearning algorithm leads to Algorithm 1. Note that, even in the absence of common noise, this algorithm is possibly stochastic since at each episode, the order in which the stateaction pairs are picked is potentially random. In practice, the order could be fixed in advance or stem from a sampled trajectory.
$(1{\alpha}_{t}(\stackrel{\u02c7}{\mu},\stackrel{~}{a})){\stackrel{\u02c7}{Q}}_{t}(\stackrel{\u02c7}{\mu},\stackrel{~}{a})$  
$\mathrm{\hspace{1em}\hspace{1em}}+{\alpha}_{t}(\stackrel{\u02c7}{\mu},\stackrel{~}{a})\left(\stackrel{~}{f}+\gamma \underset{{\stackrel{~}{a}}^{\prime}\in \stackrel{~}{\mathbb{A}}}{\mathrm{max}}{\stackrel{\u02c7}{Q}}_{t}({\stackrel{\u02c7}{\mu}}^{\prime},{\stackrel{~}{a}}^{\prime})\right)$ 
For this elementary algorithm, as a proof of concept we provide a convergence result for the approximation of the Qfunction of the original MFC problem by the table returned at the end of Algorithm 1. To this end, in order to keep the paper at a reasonable length, we will make the following simplifying assumptions.
We endow the simplex $\mathrm{SS}$ seen as a subset of ${\mathbb{R}}^{\mathcal{S}}$ with the Euclidean distance denoted by ${d}_{\mathrm{SS}}$.

1.
Regularity of the data: $\stackrel{~}{f}$ is bounded and Lipschitz continuous with respect to $(\mu ,\stackrel{~}{a})$ with constant ${L}_{\stackrel{~}{f}}$ and $\mathrm{\Phi}$ is Lipschitz continuous with respect to $\mu $ with constant ${L}_{\mathrm{\Phi}}$, uniformly in $\stackrel{~}{a}$ in expectation over the randomness of the common noise, namely: for every $\stackrel{~}{a}\in \stackrel{~}{\mathbb{A}},{\mu}_{1},{\mu}_{2}\in \mathrm{SS}$,
${\mathbb{E}}_{{e}^{0}}\left[{d}_{\mathrm{SS}}({\mathrm{\Phi}}^{\stackrel{~}{a},{e}^{0}}({\mu}_{1}),{\mathrm{\Phi}}^{\stackrel{~}{a},{e}^{0}}({\mu}_{2}))\right]\le {L}_{\mathrm{\Phi}}{d}_{\mathrm{SS}}({\mu}_{1},{\mu}_{2}).$ 
2.
Regularity of the value function: $V$ is Lipschitz continuous wrt $\mu $ with constant ${L}_{V}$.

3.
Simplex discretization: There exists ${\u03f5}_{S}>0$ s.t.: $\forall \mu \in \mathrm{SS},\exists \stackrel{\u02c7}{\mu}\in \stackrel{\u02c7}{\mathrm{SS}}$ s.t. ${d}_{\mathrm{SS}}(\mu ,\stackrel{\u02c7}{\mu})\le {\u03f5}_{S}$.

4.
Covering time: There exists a finite ${T}_{cov}$ such that with probability $1/2$ (over the randomness of the common noise and of Algorithm 1) the following holds: For every starting point in $\stackrel{\u02c7}{\mathrm{SS}}\times \stackrel{~}{\mathbb{A}}$, every element of $\stackrel{\u02c7}{\mathrm{SS}}\times \stackrel{~}{\mathbb{A}}$ has been visited before time ${T}_{cov}$ during the execution of Algorithm 1.

5.
Learning rates: There exists $\kappa \in (1/2,1)$ such that for every $(\stackrel{\u02c7}{\mu},\stackrel{~}{a})\in \stackrel{\u02c7}{\mathrm{SS}}\times \stackrel{~}{\mathbb{A}}$, ${\alpha}_{t}(\stackrel{\u02c7}{\mu},\stackrel{~}{a})=1/{\left(1+n(\stackrel{\u02c7}{\mu},\stackrel{~}{a},t)\right)}^{\kappa}$ for each $t\ge 0$, where $n(\stackrel{\u02c7}{\mu},\stackrel{~}{a},t)$ is the number of times up to $t$ that the pair $(\stackrel{\u02c7}{\mu},\stackrel{~}{a})$ has been visited in Algorithm 1.
The regularity of $V$ in 2 can typically be ensured through suitable conditions on the data of the problem, as e.g. in [17, 9, 12], while to obtain 3, one can consider an ${\u03f5}_{S}$net as in [24]. Assumption 4 holds for instance either by using exploring starts (if the learner can query an oracle which simulates transitions from any $(\mu ,\stackrel{~}{a})$), or by following a long enough sampled trajectory (provided some form of irreducibility or ergodicity of the dynamics, ensuring full exploration). Note that the boundedness of the running reward $\stackrel{~}{f}$ from Assumption 1 together with the fact that $\gamma \in (0,1)$ ensures the existence of a finite upper bound ${\stackrel{\u02c7}{V}}_{max}$ for the value function of the projected MFC problem. We denote by $\beta =(1\gamma )/2$ the horizon of the MDP corresponding to the projected MFC problem, and for $\delta \in (0,1)$, we let ${T}_{cov}(\delta )=\lceil {T}_{cov}{\mathrm{log}}_{2}(1/(2\delta ))\rceil $.
Theorem 2.
Let $\delta \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$ and $\u03f5\mathrm{>}\mathrm{0}$. Under Assumptions 1–5, if the number of episodes ${N}_{e\mathit{}p\mathit{}i}$ is of order
$$\begin{array}{cc}& \mathrm{\Omega}({\left(\frac{{({T}_{cov}(\delta ))}^{1+3\kappa}{\stackrel{\u02c7}{V}}_{max}^{2}\mathrm{ln}\left(\stackrel{\u02c7}{\mathrm{SS}}{\mathcal{A}}^{\mathcal{S}}{\stackrel{\u02c7}{V}}_{max}/(2\delta \beta \u03f5)\right)}{{\beta}^{2}{\u03f5}^{2}}\right)}^{\frac{1}{\kappa}}\hfill \\ & +{\left(\frac{({T}_{cov}(\delta ))}{\beta}\mathrm{ln}\left(\frac{{\stackrel{\u02c7}{V}}_{max}}{\u03f5}\right)\right)}^{\frac{1}{1\kappa}}),\hfill \end{array}$$  (11) 
then with probability $\mathrm{1}\mathrm{}\delta $, for all $\mathrm{(}\mu \mathrm{,}\stackrel{\mathrm{~}}{a}\mathrm{)}\mathrm{\in}\mathrm{SS}\mathrm{\times}\stackrel{\mathrm{~}}{\mathrm{A}}$,
$${\stackrel{\u02c7}{Q}}_{{N}_{epi}}({\mathrm{Proj}}_{\stackrel{\u02c7}{\mathrm{SS}}}(\mu ),\stackrel{~}{a})Q(\mu ,\stackrel{~}{a})\le {\u03f5}^{\prime},$$ 
where ${\u03f5}^{\mathrm{\prime}}\mathrm{=}\u03f5\mathrm{+}\mathrm{\left[}\frac{\gamma \mathit{}\mathrm{(}\mathrm{2}\mathrm{}\gamma \mathrm{)}}{\mathrm{1}\mathrm{}\gamma}\mathit{}{L}_{V}\mathit{}\mathrm{(}\mathrm{1}\mathrm{+}{L}_{\mathrm{\Phi}}\mathrm{)}\mathrm{+}{L}_{\stackrel{\mathrm{~}}{f}}\mathrm{\right]}\mathit{}{\u03f5}_{S}\mathrm{.}$
Note that $\u03f5$ can be chosen as small as desired provided ${N}_{epi}$ is large enough. The second term in the error ${\u03f5}^{\prime}$ is proportional to ${\u03f5}_{S}$, which is somehow unavoidable in general due to the projection on $\stackrel{\u02c7}{\mathrm{SS}}$. However, this error vanishes as ${\u03f5}_{S}\to 0$, i.e., as $\stackrel{\u02c7}{\mathrm{SS}}$ is better and better an approximation of $\mathrm{SS}$.
The proof is deferred to Section C of the appendix. It relies on the following three steps: (1) Since ${N}_{epi}$ is large enough, then ${\stackrel{\u02c7}{Q}}_{{N}_{epi}}\approx \stackrel{\u02c7}{Q}$ on $\stackrel{\u02c7}{\mathrm{SS}}\times \stackrel{~}{\mathbb{A}}$; (2) $\stackrel{\u02c7}{Q}\approx Q$ on $\stackrel{\u02c7}{\mathrm{SS}}\times \stackrel{~}{\mathbb{A}}$; (3) For every $(\mu ,\stackrel{~}{a})\in \mathrm{SS}\times \stackrel{~}{\mathbb{A}}$, $Q({\mathrm{Proj}}_{\stackrel{\u02c7}{\mathrm{SS}}}(\mu ),\stackrel{~}{a})\approx Q(\mu ,\stackrel{~}{a}).$ The first step relies on standard $Q$learning convergence results [19], while the two other steps stem from the regularity assumptions and the approximation of $\mathrm{SS}$ by $\stackrel{\u02c7}{\mathrm{SS}}$.
Let us now derive a consequence in terms of the control function. We will use the following additional assumption on the gap between the values of the best and secondbest actions, which is rather standard in approximation algorithms based on tabular $Q$functions [20, 4].

1.
Action gap: There exists ${K}_{A}>0$ such that: For every $\stackrel{\u02c7}{\mu}\in \stackrel{\u02c7}{\mathrm{SS}}$ and $\stackrel{~}{a}\in \stackrel{~}{\mathbb{A}}\backslash \mathrm{arg}\mathrm{max}Q(\stackrel{\u02c7}{\mu},\cdot )$, ${\mathrm{max}}_{\stackrel{~}{\mathbb{A}}}Q(\stackrel{\u02c7}{\mu},\cdot )Q(\stackrel{\u02c7}{\mu},\stackrel{~}{a})\ge {K}_{A}.$
For $\tau >0$ and $x\in {\mathbb{R}}^{n}$, we use the notation
$${\mathrm{softmax}}_{\tau}(x)=({e}^{\tau {x}_{1}},\mathrm{\dots},{e}^{\tau {x}_{n}})/\sum _{j}{e}^{\tau {x}_{j}},$$ 
and
$$\mathrm{argmaxe}(x)={({\mathrm{\U0001d7cf}}_{i\in \mathrm{arg}\mathrm{max}(x)})}_{i=1,\mathrm{\dots},n}/\mathrm{arg}\mathrm{max}(x).$$ 
Corollary 3.
In the setting of Theorem 2, if in addition Assumption 1 holds, then for every $\stackrel{\mathrm{\u02c7}}{\mu}\mathrm{\in}\stackrel{\mathrm{\u02c7}}{\mathrm{SS}}$,
${\parallel {\mathrm{softmax}}_{\tau}\left({\stackrel{\u02c7}{Q}}_{{N}_{epi}}(\stackrel{\u02c7}{\mu},\cdot )\right)\mathrm{argmaxe}\left(Q(\stackrel{\u02c7}{\mu},\cdot )\right)\parallel}_{2}$  
$\le \tau {\u03f5}^{\prime}+2\stackrel{~}{\mathbb{A}}{e}^{\tau {K}_{A}},$ 
where ${\stackrel{\mathrm{\u02c7}}{Q}}_{{N}_{e\mathit{}p\mathit{}i}}$ is the table returned by Algorithm 1 and ${\u03f5}^{\mathrm{\prime}}$ is the error term appearing in Theorem 2.
The proof is deferred to Section C of the appendix. The $\mathrm{argmaxe}$ in the second term is here in case there are several optimal controls. The $\mathrm{softmax}$ regularizes the best action predicted by the estimation ${\stackrel{\u02c7}{Q}}_{{N}_{epi}}$ of the function $Q$.
Second method: DDPG for MFC.
Although the elementary structure of the above method allows us to analyze it, one drawback is that it requires the computation of a projection on a discretization of the simplex at each time step, which can be quite cumbersome and computationally costly when the number of states in $\mathcal{S}$ is large. For this reason, we propose a different RL method based on an approximation of the Qfunction by neural networks which can deal with inputs and outputs in continuous spaces. Still in the case of finite $\mathcal{S}$ and $\mathcal{A}$, since the state of distributions $\mathcal{P}(\mathcal{S})$ and the set of actions ${\mathcal{A}}^{\mathcal{S}}$ are finite dimensional, we can try to approximate the Qfunction by a neural network taking inputs in ${\mathbb{R}}^{\mathcal{S}\times {\mathcal{A}}^{\mathcal{S}}}$ and outputting a real number. For the learning procedure, we employ the Deep Deterministic Policy Gradient (DDPG) proposed in [33]. It relies on two neural networks, one for the Qfunction (the critic) and one for the policy (the actor). The heart of the algorithm consists in updating alternatively the critic by minimizing an empirical square error and the actor by making one step of gradient descent. To improve exploration, a Gaussian noise is added to the action prescribed by the actor, and for more stability, target networks are also added. The algorithm is summarized in the appendix (see Algorithm 2).
Adaptation to continuous spaces.
When the underlying state space $\mathcal{S}$ is continuous, the distribution ${\mu}_{t}$ is an element of the infinitedimensional space $\mathcal{P}(\mathcal{S})$. In order to develop a reinforcement learning method, we thus need some kind of finitedimensional approximation. Motivated by applications to physical models such as swarm of robots or drones in which the underlying space $\mathcal{S}$ is in low dimension (typically $\mathcal{S}\subset {\mathbb{R}}^{d}$ with $d=1,2$ or $3$), we propose to represent ${\mu}_{t}$ by a histogram of its values at a finite number of points. In other words, we consider a finite number ${N}_{p}$ of points in $\mathcal{S}$ and let ${M}_{t}\in {\mathbb{R}}^{{N}_{p}}$ be a vector which approximates ${\mu}_{t}$. Similarly, an action $a\in \stackrel{~}{\mathbb{A}}$ can be approximated by its values on the ${N}_{p}$ points chosen in $\mathcal{S}$ and can hence be represented by a vector in ${\mathbb{R}}^{{N}_{p}}$. The problem then reduces to the finitestate case discussed above.
4 Numerical Examples
General setup.
We present numerical results obtained using the second method introduced above. We assumed that the central planner has access to an oracle which can simulate the evolution of ${M}_{t}$ as discussed at the end of the previous section. In the numerical implementation, we provided the learning algorithm with a blackbox which computes transitions of ${M}_{t}$. This oracle has been implemented using a finitedifference scheme for KolmorovFokkerPlanck equations, in line with recent research on numerical methods for mean field games [1]. The actor and critic networks have been implemented using a feedforward fully connected architecture with $2$ hidden layers of width less than $300$ neurons. We used random initial states at each episode, and the noise used on the action is a Gaussian noise with mean $0$ and variance $0.02$. We used Adam optimizer with initial learning rate $0.0001$ and minibatches of size $16$. For the sake of clarity and in order to benchmark the aforementioned method, we provide illustrations on examples without common noise but analogous results can also be obtained in the presence of a common noise.
Example 1: Cyber security model.
For a first testbed, we start with a finite state problem. We revisit the cyber security example introduced in [28], but here from the point of view of a central planner (such as a large company or a state) trying to protect its computers against the attacks of a hacker. The situation can hence be phrased as a mean field control problem.
In this model, the population consists of a large group of computers which can be either defended (D) or undefended (U) and either infected (I) or susceptible (S) of infection. The set $\mathcal{S}$ has hence four elements corresponding to the four possible combinations: DI, DS, UI, US. The action set is $\mathcal{A}=\{0,1\}$, where $0$ is interpreted as the fact that the central planner is satisfied with the current level of protection (D or U) of the computer under consideration, whereas $1$ means that she wants to change this level of protection. In the latter case, the update occurs at a (fixed) rate $\lambda >0$. At each of the four states, all the computers are indistinguishable and hence the central planner only chooses one action per state and applies it to all the computers at that state. When infected, each computer may recover at rate ${q}_{rec}^{D}$ or ${q}_{rec}^{U}$ depending on whether it is defended or not. On the other hand, a computer may be infected either directly by a hacker, at rate ${v}_{H}{q}_{inf}^{D}$ (resp. ${v}_{H}{q}_{inf}^{U}$) if it is defended (resp. undefended), or by undefended infected computers, at rate ${\beta}_{UU}\mu (\{UI\})$ (resp. ${\beta}_{UD}\mu (\{UI\})$) if it is undefended (resp. defended), or by defended infected computers, at rate ${\beta}_{DU}\mu (\{DI\})$ (resp. ${\beta}_{DD}\mu (\{DI\})$) if it is undefended (resp. defended).
The transition matrix from (6) is given by
$${P}^{\mu ,a}=\left(\begin{array}{cccc}\hfill \mathrm{\dots}\hfill & \hfill {P}_{DS\to DI}^{\mu ,a}\hfill & \hfill \lambda a\hfill & \hfill 0\hfill \\ \hfill {q}_{rec}^{D}\hfill & \hfill \mathrm{\dots}\hfill & \hfill 0\hfill & \hfill \lambda a\hfill \\ \hfill \lambda a\hfill & \hfill 0\hfill & \hfill \mathrm{\dots}\hfill & \hfill {P}_{US\to UI}^{\mu ,a}\hfill \\ \hfill 0\hfill & \hfill \lambda a\hfill & \hfill {q}_{rec}^{U}\hfill & \hfill \mathrm{\dots}\hfill \end{array}\right)$$ 
where
${P}_{DS\to DI}^{\mu ,a}={v}_{H}{q}_{inf}^{D}+{\beta}_{DD}\mu (\{DI\})+{\beta}_{UD}\mu (\{UI\}),$  
${P}_{US\to UI}^{\mu ,a}={v}_{H}{q}_{inf}^{U}+{\beta}_{UU}\mu (\{UI\})+{\beta}_{DU}\mu (\{DI\}),$ 
and all the instances of $\mathrm{\dots}$ should be replaced by the negative of the sum of the entries of the row in which $\mathrm{\dots}$ appears on the diagonal. At each time step, the central planner pays a protection cost ${k}_{D}>0$ for each defended computer, and a penalty ${k}_{I}>0$ for each infected computer. The total reward at time $t$ is hence ${\stackrel{~}{f}}_{t}=\left[{k}_{D}{\mu}_{t}(\{DI,DS\})+{k}_{I}{\mu}_{t}(\{DI,UI\})\right]$.
Although the underlying state space $\mathcal{S}$ is finite, instead of using MFQ (see Algorithm 1), we use the second method introduced above based on DDPG. In the present case, this approach has the advantage to avoid discretizing $\mathcal{P}(\mathcal{S})$ since we instead deal directly with the distribution as a vector in dimension $4$. The DDPG method learns a control, that we then apply to the three reference cases studied in [11, Section 7.2.3] (for the sake of brevity, we do not reproduce here the values of the parameters). The resulting distribution’s evolution are shown in Figure 2. We can see that we recover the same evolution for the three initial distributions considered, namely $(0.25,0.25,0.25,0.25),(1,0,0,0)$ and $(0,0,0,1)$. In particular, at $T=10$, we obtain in all three simulations the distribution ${\mu}_{10}=(0.0,0.0,0.4376,0.5624)$, which is close to the values found in [11, Section 7.2] for the stationary distribution, namely $(0.0,0.0,0.44,0.56)$.
Example 2: Swarm motion.
We then turn our attention to a model in continuous space. More precisely, let us consider a model of swarm motion with aversion to crowded regions introduced in [3] (in the context of mean field games). Since here we simply want to provide a proof of principle for our method, we take (as in the aforementioned work) the interval $[0,1]$ with periodic boundary condition (i.e. the unit torus) as the state space $\mathcal{S}$. The dynamics of a typical agent is driven by (3) with $b(x,\mu ,a)=a$. In other words, the central planner chooses the velocity of each agent. The instantaneous reward (appearing in (1)) of a typical agent at location $x$ and using action $a$ while the population’s state is $\mu $, is defined as $f(x,\mu ,a)=\frac{1}{2}{a}^{2}+\phi (x)\mathrm{ln}(\mu (x)).$ Here, the first term penalizes a large velocity (it can be interpreted as a kind of cost proportional to the kinetic energy of the agent), $\phi $ encodes a preference for certain positions in space, and the last term models crowd aversion since it penalizes the fact of being at a location where the density of agents is high. By choosing
$$\phi (x)=2{\pi}^{2}\left[\mathrm{sin}(2\pi x)+{\mathrm{cos}(2\pi x)}^{2}\right]+2\mathrm{sin}(2\pi x),$$ 
we obtain a model for which, when ${\u03f5}_{t}$ have Gaussian distribution and ${\u03f5}^{0}\equiv 0$, the MFC admits an explicit ergodic solution that we can use as a benchmark. Indeed, in this case the optimal ergodic control is given by $\stackrel{~}{a}(x)=2\pi \mathrm{cos}(2\pi x)$ and the ergodic distribution of the corresponding MKV dynamics has density $\mu (x)={e}^{2\mathrm{sin}(2\pi x)}/\int {e}^{2\mathrm{sin}(2\pi {x}^{\prime})}\mathit{d}{x}^{\prime}$.
To implement the second RL method described above, we discretize $[0,1]$ with a mesh of ${N}_{p}$ points and use a finite difference scheme to simulate the evolution of the dynamics. The DDPG method uses this as a blackbox and, for a given action $\stackrel{~}{a}\in {\mathbb{R}}^{{N}_{p}}$, can only access the resulting new distribution and the associated reward. Figure 3 presents results obtained using this method after $3000$ episodes. The system has been trained on initial distributions which are Gaussian with random mean and random variance. As illustrated in the figures, the system has learnt how to drive this type of initial distributions towards the analytical stationary distribution and then how to use an approximation of the stationary optimal control in order to keep the system in the stationary regime.
5 Conclusion and future research
In this work, we explored the central role played by MKV dynamics in multiagent RL. We developed a framework and modelfree methods to learn mean field optimal control. As a proof of principle, we established a rate of convergence for a $Q$learning method and our numerical tests assess empirical convergence of an actorcritic method on examples from the literature. An important feature of our model is the presence of common noise, whose impact had to be controlled.
Our results can be extended in several directions. The analysis and the numerical implementations could be applied to other meanfield problems such as mean field games or mean field control problems with several populations, with important applications to multiagent sweeping and tracking. The proof of convergence of the actorcritic method is also postponed for future work. Last, it would be interesting to investigate other types of simulators, such as MonteCarlo simulators based on samples of a finite population.
Related work. Our work is at the intersection of RL and MFC. The latter has recently attracted a lot of attention, particularly since the introduction of mean field games (MFG) by Lasry and Lions [2006a, 2006b, 2007] and by Caines, Huang and Malhamé [2006, 2007]. MFGs correspond to the asymptotic limit of Nash equilibria for games with mean field interactions. They are defined through a fixed point procedure and hence, differ both conceptually and numerically, from MFC problems which correspond to social optima, see e.g. [11] for details. Most works on learning in the presence of mean field interactions have focused on MFGs, see e.g. [40, 10] for “learning” (or rather solving) MFGs based on the full knowledge of the model, and [27, 38, 39, 24, 34, 18, 37, 21] for RL based methods. In contrast, our work focuses on MFC problems. Along these lines, we have studied policy gradient methods for MFC in [16]. However, this work was restricted to linearquadratic models. While completing the present work, we became aware of the very recent work [36], which studies MFC with policy gradient methods too. However, their work is restricted to finite state and action spaces whereas we also consider continuous spaces. Furthermore, we provide a rate of convergence (see Theorem 2). We also stress that although some tools are common, our work differs significantly from [24, 18] because the latter works deal with a mean field game. Their learning procedure is embedded in a fixed point on the distribution and, for this reason, the $Q$learning step is only needed to solve a classical control problem and not a mean field one. Here, the key novelty is that our learning methods are designed for MDPs on the space of measures. Last, we would also like to mention that the first two authors have recently proposed machine learning methods for solving mean field control problems and mean field games, see [14, 15]. These methods are based on the knowledge of the model, since one relies on it to compute gradients of the cost functional and implement a stochastic gradient descent. The present work can be viewed as an extension of these methods, where one tries to be free from the model and learn the solution by trial and error.
Acknowledgements
M.L. is grateful to Matthieu Geist and Julien Pérolat for helpful discussions on the DDPG algorithm.
References
 Achdou and CapuzzoDolcetta [2010] Achdou, Y. and CapuzzoDolcetta, I. (2010). Mean field games: numerical methods. SIAM J. Numer. Anal., 48(3):1136–1162.
 Alasseur et al. [2017] Alasseur, C., Ben Tahar, I., and Matoussi, A. (2017). An extended mean field game for storage in smart grids. arXiv preprint arXiv:1710.08991.
 Almulla et al. [2017] Almulla, N., Ferreira, R., and Gomes, D. (2017). Two numerical approaches to stationary meanfield games. Dyn. Games Appl., 7(4):657–682.
 Bellemare et al. [2016] Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., and Munos, R. (2016). Increasing the action gap: New operators for reinforcement learning. In Thirtieth AAAI Conference on Artificial Intelligence.
 Bensoussan et al. [2013] Bensoussan, A., Frehse, J., and Yam, S. C. P. (2013). Mean field games and mean field type control theory. Springer Briefs in Mathematics. Springer, New York.
 Bensoussan et al. [2015] Bensoussan, A., Frehse, J., and Yam, S. C. P. (2015). The master equation in mean field theory. J. Math. Pures Appl. (9), 103(6):1441–1474.
 Bertsekas [2012] Bertsekas, D. P. (2012). Dynamic programming and optimal control. Vol. II. Approximate dynamic programming. Athena Scientific, Belmont, MA, fourth edition.
 Bossy and Talay [1997] Bossy, M. and Talay, D. (1997). A stochastic particle method for the McKeanVlasov and the Burgers equation. Math. Comp., 66(217):157–192.
 Cardaliaguet et al. [2019] Cardaliaguet, P., Delarue, F., Lasry, J.M., and Lions, P.L. (2019). The master equation and the convergence problem in mean field games, volume 201 of Annals of Mathematics Studies. Princeton University Press, Princeton, NJ.
 Cardaliaguet and Hadikhanloo [2017] Cardaliaguet, P. and Hadikhanloo, S. (2017). Learning in mean field games: the fictitious play. ESAIM: Control, Optimisation and Calculus of Variations, 23(2):569–591.
 Carmona and Delarue [2018a] Carmona, R. and Delarue, F. (2018a). Probabilistic theory of mean field games with applications. I, volume 83 of Probability Theory and Stochastic Modelling. Springer, Cham. Mean field FBSDEs, control, and games.
 Carmona and Delarue [2018b] Carmona, R. and Delarue, F. (2018b). Probabilistic theory of mean field games with applications. II, volume 84 of Probability Theory and Stochastic Modelling. Springer, Cham. Mean field games with common noise and master equations.
 Carmona et al. [2015] Carmona, R., Fouque, J.P., and Sun, L.H. (2015). Mean field games and systemic risk. Commun. Math. Sci., 13(4):911–933.
 Carmona and Laurière [2019a] Carmona, R. and Laurière, M. (2019a). Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: I  the ergodic case. arXiv preprint arXiv:1907.05980.
 Carmona and Laurière [2019b] Carmona, R. and Laurière, M. (2019b). Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: II  the finite horizon case. arXiv preprint arXiv:1908.01613.
 Carmona et al. [2019] Carmona, R., Laurière, M., and Tan, Z. (2019). Linearquadratic meanfield reinforcement learning: Convergence of policy gradient methods. arXiv preprint arXiv:1910.04295.
 Chassagneux et al. [2014] Chassagneux, J.F., Crisan, D., and Delarue, F. (2014). A probabilistic approach to classical solutions of the master equation for large population equilibria. arXiv:1411.3009.
 Elie et al. [2019] Elie, R., Pérolat, J., Laurière, M., Geist, M., and Pietquin, O. (2019). Approximate fictitious play for mean field games. arXiv preprint arXiv:1907.02633.
 EvenDar and Mansour [2003] EvenDar, E. and Mansour, Y. (2003). Learning rates for Qlearning. J. Mach. Learn. Res., 5:1–25.
 Farahmand [2011] Farahmand, A.m. (2011). Actiongap phenomenon in reinforcement learning. In Advances in Neural Information Processing Systems, pages 172–180.
 Fu et al. [2019] Fu, Z., Yang, Z., Chen, Y., and Wang, Z. (2019). Actorcritic provably finds nash equilibria of linearquadratic meanfield games. arXiv preprint arXiv:1910.07498.
 Gao and Pavel [2017] Gao, B. and Pavel, L. (2017). On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805.
 Gast et al. [2012] Gast, N., Gaujal, B., and Le Boudec, J.Y. (2012). Mean field for markov decision processes: from discrete to continuous optimization. IEEE Transactions on Automatic Control, 57(9):2266–2280.
 Guo et al. [2019] Guo, X., Hu, A., Xu, R., and Zhang, J. (2019). Learning meanfield games. arXiv preprint arXiv:1901.09585.
 Huang et al. [2007] Huang, M., Caines, P. E., and Malhamé, R. P. (2007). Largepopulation costcoupled LQG problems with nonuniform agents: individualmass behavior and decentralized $\u03f5$Nash equilibria. IEEE Trans. Automat. Control, 52(9):1560–1571.
 Huang et al. [2006] Huang, M., Malhamé, R. P., and Caines, P. E. (2006). Large population stochastic dynamic games: closedloop McKeanVlasov systems and the Nash certainty equivalence principle. Commun. Inf. Syst., 6(3):221–251.
 Iyer et al. [2014] Iyer, K., Johari, R., and Sundararajan, M. (2014). Mean field equilibria of dynamic auctions with learning. Management Science, 60(12):2949–2970.
 Kolokoltsov and Bensoussan [2016] Kolokoltsov, V. N. and Bensoussan, A. (2016). Meanfieldgame model for botnet defense in cybersecurity. Appl. Math. Optim., 74(3):669–692.
 Lasry and Lions [2006a] Lasry, J.M. and Lions, P.L. (2006a). Jeux à champ moyen. I. Le cas stationnaire. C. R. Math. Acad. Sci. Paris, 343(9):619–625.
 Lasry and Lions [2006b] Lasry, J.M. and Lions, P.L. (2006b). Jeux à champ moyen. II. Horizon fini et contrôle optimal. C. R. Math. Acad. Sci. Paris, 343(10):679–684.
 Lasry and Lions [2007] Lasry, J.M. and Lions, P.L. (2007). Mean field games. Jpn. J. Math., 2(1):229–260.
 Laurière and Pironneau [2016] Laurière, M. and Pironneau, O. (2016). Dynamic programming for meanfield type control. J. Optim. Theory Appl., 169(3):902–924.
 Lillicrap et al. [2016] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR 2016).
 Mguni et al. [2018] Mguni, D., Jennings, J., and de Cote, E. M. (2018). Decentralised learning in systems with many, many strategic agents. In ThirtySecond AAAI Conference on Artificial Intelligence.
 Pham and Wei [2017] Pham, H. and Wei, X. (2017). Dynamic programming for optimal control of stochastic McKeanVlasov dynamics. SIAM J. Control Optim., 55(2):1069–1101.
 Subramanian and Mahajan [2019] Subramanian, J. and Mahajan, A. (2019). Reinforcement learning in stationary meanfield games. In Proceedings. 18th International Conference on Autonomous Agents and Multiagent Systems.
 Tiwari et al. [2019] Tiwari, N., Ghosh, A., and Aggarwal, V. (2019). Reinforcement learning for mean field game. arXiv preprint arXiv:1905.13357.
 Yang et al. [2018a] Yang, J., Ye, X., Trivedi, R., Xu, H., and Zha, H. (2018a). Deep mean field games for learning optimal behavior policy of large populations. In International Conference on Learning Representations.
 Yang et al. [2018b] Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., and Wang, J. (2018b). Mean field multiagent reinforcement learning. In International Conference on Machine Learning, pages 5567–5576.
 Yin et al. [2013] Yin, H., Mehta, P. G., Meyn, S. P., and Shanbhag, U. V. (2013). Learning in meanfield games. IEEE Transactions on Automatic Control, 59(3):629–644.
Appendix A Derivation of Bellman equation
Although the derivation of the Bellman equation (7) is rather standard, see e.g. [7], we include it for the sake of completeness.
To alleviate the presentation, we assume that the running reward $f$ is bounded by a constant ${C}_{f}$ and we focus on stationary controls. To avoid measurability issues, we assume that the set ${\mathcal{E}}^{0}$ is countable. More general settings could be considered at the expense of technicalities which are beyond the scope of the present work. We rewrite the value function as
$$V(\mu )=\underset{n\to +\mathrm{\infty}}{lim}\underset{a\in \mathbb{A}}{sup}\mathbb{E}\sum _{t=0}^{n}{\gamma}^{t}\stackrel{~}{f}({\mu}_{t},a({\mu}_{t})),$$ 
which leads us to introduce the following dynamic programming operators defined, for a bounded function $v$ and $a\in \mathbb{A}$, by
$({\mathcal{T}}_{a}v)(\mu )=\stackrel{~}{f}(\mu ,a(\mu ))+\gamma \mathbb{E}v({\mathrm{\Phi}}^{a(\mu )}(\mu )),$  
$(\mathcal{T}v)(\mu )=\underset{a\in \mathbb{A}}{sup}\left\{{\mathcal{T}}_{a}v\right\}.$ 
The above expression for $V$ then rewrites
$$V(\mu )=\underset{n\to +\mathrm{\infty}}{lim}({\mathcal{T}}^{n}0)(\mu ),$$ 
where $({\mathcal{T}}^{n}0)$ represents the result of $\mathcal{T}$ composed $n$ times and applied to the function which is identically $0$ on $\mathcal{P}(\mathcal{S})$. Thanks to the bound on $f$, one can check that for every $\mu \in \mathcal{P}(\mathcal{S})$ and every integer $n$,
$$V(\mu )\frac{{\gamma}^{n}}{1\gamma}{C}_{f}\le ({\mathcal{T}}^{n}0)(\mu )\le V(\mu )+\frac{{\gamma}^{n}}{1\gamma}{C}_{f},$$ 
Applying $\mathcal{T}$ to the above relation yields
$$(\mathcal{T}V)(\mu )\frac{{\gamma}^{n+1}}{1\gamma}{C}_{f}\le ({\mathcal{T}}^{n+1}0)(\mu )\le (\mathcal{T}V)(\mu )+\frac{{\gamma}^{n+1}}{1\gamma}{C}_{f}.$$ 
By taking $n\to +\mathrm{\infty}$, we deduce
$$(\mathcal{T}V)(\mu )=\underset{n\to +\mathrm{\infty}}{lim}({\mathcal{T}}^{n+1}0)(\mu )=V(\mu ),$$ 
which is the desired relation. One can also check that the solution to (7) is unique. Similarly, denoting $\stackrel{~}{J}(a)(\mu )=J(\mu ,a)$, we have for any $a\in \mathbb{A}$ and $\mu \in \mathcal{P}(\mathcal{S})$, $\stackrel{~}{J}(a)$ is the only fixed point of ${\mathcal{T}}_{a}$.
In addition, let us consider an stationary control ${a}^{*}$ which is optimal in the sense that, for any ${\mu}_{0}\in \mathcal{P}(\mathcal{S})$, it achieves the supremum over $a\in \mathbb{A}$ in (4), i.e., $\stackrel{~}{J}({a}^{*})=V$. Then, using the Bellman equations for $V$ and $\stackrel{~}{J}({a}^{*})$, we have
$(\mathcal{T}V)({\mu}_{0})$  $=V({\mu}_{0})$  
$=\stackrel{~}{J}({a}^{*})({\mu}_{0})$  
$=({\mathcal{T}}_{{a}^{*}}\stackrel{~}{J}({a}^{*}))({\mu}_{0})$  
$=({\mathcal{T}}_{{a}^{*}}V)({\mu}_{0}).$ 
Conversely, assume $a$ is such that for every ${\mu}_{0}$,
$(\mathcal{T}V)({\mu}_{0})=({\mathcal{T}}_{a}V)({\mu}_{0}).$ 
Then, since $(\mathcal{T}V)({\mu}_{0})=V({\mu}_{0}),$ we obtain that $V({\mu}_{0})=({\mathcal{T}}_{a}V)({\mu}_{0})$. By uniqueness of the fixed point of ${\mathcal{T}}_{a}$, we have that $V=\stackrel{~}{J}(a)$ and hence $a$ is optimal.
Appendix B Link with MDPs arising in Mean Field Games
Here, we clarify the difference between the Mean Field MDPs studied in the present work and the ones arising in the context of finite state Mean Field Games [38, 24]. Although both problems involve the term “mean field”, we argue that in a MFG, the MDP is a rather standard one.
In [38], the authors make the following point (at the beginning of their Section 4): given the evolution of the population (i.e., the meanfield term), an infinitesimal player solves a (standard) optimal control problem, to which corresponds a (standard) MDP. In other words, the population’s distribution appears as a given parameter in the MDP and not as the state over which the optimization is performed, as in our case. To emphasize, in our notation, the difference between their setting and ours, let us consider the evolution (6) for a finite state space $\mathcal{S}=\{1,2,\mathrm{\dots},\mathcal{S}\}$. As in [38], let us assume that the players control directly their transition probabilities. In other words, the action space is the set of probability distributions on $\mathcal{S}$, and all the players in a state $x\in \mathcal{S}$ control the probability with which they will go to each other state ${x}^{\prime}\in \mathcal{S}$. In this case, each element of $\mathcal{A}$ can be identified with a vector of length $\mathcal{S}$ representing a probability distribution on $\mathcal{S}$, and each element of $\stackrel{~}{a}\in \stackrel{~}{\mathbb{A}}$ can be identified with a matrix ${P}^{\stackrel{~}{a}}$ such that ${P}_{{x}^{\prime},x}^{\stackrel{~}{a}}=\stackrel{~}{a}(x)({x}^{\prime})$. In [38], the initial distribution ${\mu}_{0}$ is fixed (hence we omit to denote explicitly the dependence on ${\mu}_{0}$), and there is no common noise. In this case, one can look for feedback controls which are functions of time and $x$ only. Then the MFC problem (4) rewrites: Find $\stackrel{\mathrm{~}}{\mathbf{a}}\mathrm{=}{\mathrm{(}{\stackrel{\mathrm{~}}{a}}_{t}\mathrm{)}}_{t\mathrm{\ge}\mathrm{0}}\mathrm{,}{a}_{t}\mathrm{\in}\mathrm{A}$ maximizing
$$\begin{array}{cc}\hfill J(\stackrel{\mathbf{~}}{\bm{a}})& =\sum _{t=0}^{+\mathrm{\infty}}{\gamma}^{t}{\mathbb{E}}_{x\sim {\mu}_{t}^{\stackrel{\mathbf{~}}{\bm{a}}}}\left[f(x,{\mu}_{t}^{\stackrel{\mathbf{~}}{\bm{a}}},{\stackrel{~}{a}}_{t}(x))\right]\hfill \\ & =\sum _{t=0}^{+\mathrm{\infty}}{\gamma}^{t}\sum _{x\in \mathcal{S}}{\mu}_{t}^{\stackrel{\mathbf{~}}{\bm{a}}}(x)f(x,{\mu}_{t}^{\stackrel{\mathbf{~}}{\bm{a}}},{\stackrel{~}{a}}_{t}(x)),\hfill \end{array}$$ 
under the constraint: ${\mu}_{\mathrm{0}}^{\stackrel{\mathrm{~}}{\mathbf{a}}}\mathrm{=}{\mu}_{\mathrm{0}}$ and for $t\mathrm{\ge}\mathrm{0}$,
$${\mu}_{t+1}^{\stackrel{\mathbf{~}}{\bm{a}}}={P}^{{\stackrel{~}{a}}_{t}}{\mu}_{t}^{\stackrel{\mathbf{~}}{\bm{a}}},$$  (12) 
where ${P}^{\mathrm{\cdot}}$ is as defined above. The corresponding MFG, analogous to the one studied in [38], can be formulated as follows: Find a sequence of distributions $\mathbf{m}\mathrm{=}{\mathrm{(}{m}_{t}\mathrm{)}}_{t\mathrm{\ge}\mathrm{0}}$ and a sequence of controls $\stackrel{\mathrm{~}}{\mathbf{a}}\mathrm{=}\mathrm{(}{\stackrel{\mathrm{~}}{a}}_{t}\mathrm{)}\mathrm{,}{\stackrel{\mathrm{~}}{a}}_{t}\mathrm{\in}\mathrm{A}$ such that: (1) $\stackrel{\mathrm{~}}{\mathbf{a}}$ maximizes
$$\begin{array}{cc}\hfill {J}^{MFG}(\stackrel{\mathbf{~}}{\bm{a}};\bm{m})& =\sum _{t=0}^{+\mathrm{\infty}}{\gamma}^{t}\sum _{x\in \mathcal{S}}{\mu}_{t}^{\stackrel{\mathbf{~}}{\bm{a}},\bm{m}}(x)f(x,{m}_{t},{\stackrel{~}{a}}_{t}(x)),\hfill \end{array}$$ 
under the constraint: ${\mu}_{\mathrm{0}}^{\stackrel{\mathrm{~}}{\mathbf{a}}\mathrm{,}\mathbf{m}}\mathrm{=}{\mu}_{\mathrm{0}}$ and for $t\mathrm{\ge}\mathrm{0}$,
$${\mu}_{t+1}^{\stackrel{\mathbf{~}}{\bm{a}},\bm{m}}={P}^{{\stackrel{~}{a}}_{t}}{\mu}_{t}^{\stackrel{\mathbf{~}}{\bm{a}},\bm{m}},$$  (13) 
and (2) for every $t\mathrm{\ge}\mathrm{0}$, ${m}_{t}\mathrm{=}{\mu}_{\mathrm{0}}^{\stackrel{\mathrm{~}}{\mathbf{a}}\mathrm{,}\mathbf{m}}$. Step (1) above corresponds to the problem faced by a typical agent, when the evolution of the population is given by $\bm{m}$. The dynamics (13) can be viewed as an MDP on the space of measures, but the evolution is purely linear, in the sense that at time $t$, the state of the MDP, namely ${\mu}_{t}^{\stackrel{\mathbf{~}}{\bm{a}},\bm{m}}$, is not involved in the transition matrix, namely ${P}^{{\stackrel{~}{a}}_{t}}$. In contrast, the dynamics (6) is in general nonlinear.
In [24], the authors build the MDP upon the same insight in a different way. To stress the difference with our notion of MDP, let us go back to the MFC problem (1) and consider again a finite state space $\mathcal{S}$ and no common noise. The corresponding MFG would be: Find a flow of distributions $\mathbf{m}\mathrm{=}{\mathrm{(}{m}_{t}\mathrm{)}}_{t\mathrm{\ge}\mathrm{0}}$ and a policy $\pi $ such that: (1) $\pi $ maximizes
$${J}^{MFG}(\pi ;\bm{m})=\mathbb{E}\left[\sum _{t=0}^{+\mathrm{\infty}}{\gamma}^{t}f({x}_{t}^{\pi ,\bm{m}},{m}_{t},{\pi}_{t})\right]$$ 
where the state process $x$ has initial distribution ${\mu}_{\mathrm{0}}$ and dynamics
$$\mathbb{P}({x}_{t+1}^{\pi ,\bm{m}}={x}^{\prime}{x}_{t}^{\pi ,\bm{m}},{m}_{t},{\pi}_{t})={p}_{{x}_{t}^{\pi ,\bm{m}},{m}_{t},{\pi}_{t}}({x}^{\prime}),$$ 
and (2) For every $t\mathrm{\ge}\mathrm{0}$, ${m}_{t}$ is the distribution of ${x}_{t}^{\pi \mathrm{,}\mathbf{m}}$. As mentioned above, in the optimization problem of a typical player, the flow of distributions $\bm{m}$ appears as a given parameter. The corresponding MDP is hence parameterized by this $\bm{m}$ but evolves in the finite state space $\mathcal{S}$. Furthermore, in [24], although they consider interactions through the control’s distributions (omitted here to alleviate the notations), the authors focus on a stationary Nash equilibrium. In other words, they look for a solution of the following type of problems: Find $m\mathrm{\in}\mathrm{P}\mathit{}\mathrm{(}\mathrm{S}\mathrm{)}$ and $\pi \mathrm{=}{\mathrm{(}{\pi}_{t}\mathrm{)}}_{t\mathrm{\ge}\mathrm{0}}$ such that, letting ${\mathbf{m}}^{\mathrm{\infty}}\mathrm{=}\mathrm{(}m\mathrm{,}m\mathrm{,}\mathrm{\dots}\mathrm{)}$, we have: (1) $\pi $ maximizes
$${J}^{MFG}(\pi ;{\bm{m}}^{\mathrm{\infty}})=\mathbb{E}\left[\sum _{t=0}^{+\mathrm{\infty}}{\gamma}^{t}f({x}_{t}^{\pi ,m},m,{\pi}_{t})\right]$$ 
where the state process has initial distribution $m$ and dynamics
$$\mathbb{P}({x}_{t+1}^{\pi ,{\bm{m}}^{\mathrm{\infty}}}={x}^{\prime}{x}_{t}^{\pi ,{\bm{m}}^{\mathrm{\infty}}},m,{\pi}_{t},{\u03f5}_{t}^{0})={p}_{{x}_{t}^{\pi ,{\bm{m}}^{\mathrm{\infty}}},m,{\pi}_{t},{\u03f5}_{t}^{0}}({x}^{\prime}),$$ 
and (2) For every $t\mathrm{\ge}\mathrm{0}$, $m$ is the distribution of ${x}_{t}^{\pi \mathrm{,}{\mathbf{m}}^{\mathrm{\infty}}}$. In this case, the rewards and dynamics of an infinitesimal player is parameterized by a single distribution $m$ and the same remark holds for the corresponding MDP.
Appendix C Proof of the convergence results
Proof of Theorem 2.
Let us denote by $\stackrel{\u02c7}{V}$ and $\stackrel{\u02c7}{Q}$ respectively the state value function and the stateaction value function of the projected MFC problem. We split the proof into several steps.
Step 1. If ${N}_{epi}$ is large enough, then for every $(\stackrel{\u02c7}{\mu},\stackrel{~}{a})\in \stackrel{\u02c7}{\mathrm{SS}}\times \stackrel{~}{\mathbb{A}}$,
$${\stackrel{\u02c7}{Q}}_{{N}_{epi}}(\stackrel{\u02c7}{\mu},\stackrel{~}{a})\approx \stackrel{\u02c7}{Q}(\stackrel{\u02c7}{\mu},\stackrel{~}{a}).$$ 
This comes from standard convergence results on Qlearning for finite stateaction spaces. More precisely, under Assumptions 1, 4 and 5, we can apply Theorem 4 and Corollary 34 in [19] for asynchronous Qlearning and polynomial learning rates, and we obtain that, with probability at least $1\delta $, ${\parallel {\stackrel{\u02c7}{Q}}_{{N}_{epi}}\stackrel{\u02c7}{Q}\parallel}_{\mathrm{\infty}}\le \u03f5$, given that ${N}_{epi}$ is of order (11).
Step 2. For every $(\stackrel{\u02c7}{\mu},\stackrel{~}{a})\in \stackrel{\u02c7}{\mathrm{SS}}\times \stackrel{~}{\mathbb{A}}$,
$$\stackrel{\u02c7}{Q}(\stackrel{\u02c7}{\mu},\stackrel{~}{a})\approx Q(\stackrel{\u02c7}{\mu},\stackrel{~}{a}).$$ 
This amounts to say that the projection on $\stackrel{\u02c7}{\mathrm{SS}}$ realized at each step does not perturb too much the value function. Let us start by noting that, for every $\stackrel{\u02c7}{\mu}\in \stackrel{\u02c7}{\mathrm{SS}}$ and $\stackrel{~}{a}\in \stackrel{~}{\mathbb{A}}$,
$\left\stackrel{\u02c7}{Q}(\stackrel{\u02c7}{\mu},\stackrel{~}{a})Q(\stackrel{\u02c7}{\mu},\stackrel{~}{a})\right$  
$=\gamma \left\mathbb{E}\left[\underset{{\stackrel{~}{a}}^{\prime}}{\mathrm{max}}\stackrel{\u02c7}{Q}({\stackrel{\u02c7}{\mathrm{\Phi}}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}),{\stackrel{~}{a}}^{\prime})\underset{{\stackrel{~}{a}}^{\prime}}{\mathrm{max}}Q({\mathrm{\Phi}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}),{\stackrel{~}{a}}^{\prime})\right]\right$  
$\le \gamma \mathbb{E}\left\stackrel{\u02c7}{V}({\stackrel{\u02c7}{\mathrm{\Phi}}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}))V({\mathrm{\Phi}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}))\right$  
$\le \gamma \mathbb{E}\left\stackrel{\u02c7}{V}({\stackrel{\u02c7}{\mathrm{\Phi}}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}))V({\stackrel{\u02c7}{\mathrm{\Phi}}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}))\right$  
$\mathrm{\hspace{1em}\hspace{1em}}\mathit{\hspace{1em}\hspace{1em}}+\gamma \mathbb{E}\leftV({\stackrel{\u02c7}{\mathrm{\Phi}}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}))V({\mathrm{\Phi}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}))\right$  
$\le \gamma {\parallel \stackrel{\u02c7}{Q}Q\parallel}_{\mathrm{\infty}}+\gamma {L}_{V}\mathbb{E}\left[{d}_{\mathrm{SS}}({\stackrel{\u02c7}{\mathrm{\Phi}}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}),{\mathrm{\Phi}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}))\right],$ 
where the last inequality holds by Lipschitz continuity of $V$, see Assumption 2, and because for every ${\stackrel{\u02c7}{\mu}}^{\prime}\in \stackrel{\u02c7}{\mathrm{SS}}$,
$\left\stackrel{\u02c7}{V}({\stackrel{\u02c7}{\mu}}^{\prime})V({\stackrel{\u02c7}{\mu}}^{\prime})\right$  
$=\left\underset{{\stackrel{~}{a}}^{\prime}}{sup}\stackrel{\u02c7}{Q}({\stackrel{\u02c7}{\mu}}^{\prime},{\stackrel{~}{a}}^{\prime})\underset{{\stackrel{~}{a}}^{\prime}}{sup}Q({\stackrel{\u02c7}{\mu}}^{\prime},{\stackrel{~}{a}}^{\prime})\right$  
$\le \underset{{\stackrel{\u02c7}{\mu}}^{\prime \prime},{\stackrel{~}{a}}^{\prime}}{sup}\left\stackrel{\u02c7}{Q}({\stackrel{\u02c7}{\mu}}^{\prime \prime},{\stackrel{~}{a}}^{\prime})Q({\stackrel{\u02c7}{\mu}}^{\prime \prime},{\stackrel{~}{a}}^{\prime})\right={\parallel \stackrel{\u02c7}{Q}Q\parallel}_{\mathrm{\infty}}.$ 
To conclude, we use Assumptions 3 and 1, and we obtain:
${\mathbb{E}}_{{e}^{0}}\left[{d}_{\mathrm{SS}}({\stackrel{\u02c7}{\mathrm{\Phi}}}^{\stackrel{~}{a},{e}^{0}}(\stackrel{\u02c7}{\mu}),{\mathrm{\Phi}}^{\stackrel{~}{a},{e}^{0}}(\mu ))\right]$  
$\le {\u03f5}_{S}+{\mathbb{E}}_{{e}^{0}}\left[{d}_{\mathrm{SS}}({\mathrm{\Phi}}^{\stackrel{~}{a},{e}^{0}}(\stackrel{\u02c7}{\mu}),{\mathrm{\Phi}}^{\stackrel{~}{a},{e}^{0}}(\mu ))\right]$  
$\le {\u03f5}_{S}+{L}_{\mathrm{\Phi}}{d}_{\mathrm{SS}}(\stackrel{\u02c7}{\mu},\mu )$  
$\le (1+{L}_{\mathrm{\Phi}}){\u03f5}_{S}.$ 
Combining the above bounds yields
$${\parallel \stackrel{\u02c7}{Q}Q\parallel}_{\mathrm{\infty}}\le \frac{\gamma}{1\gamma}{L}_{V}(1+{L}_{\mathrm{\Phi}}){\u03f5}_{S}.$$ 
Step 3. For every $(\mu ,\stackrel{~}{a})\in \mathrm{SS}\times \stackrel{~}{\mathbb{A}}$,
$$Q({\mathrm{Proj}}_{\stackrel{\u02c7}{\mathrm{SS}}}(\mu ),\stackrel{~}{a})\approx Q(\mu ,\stackrel{~}{a}).$$ 
Indeed, for every $\mu \in \stackrel{\u02c7}{\mathrm{SS}}$ and $\stackrel{~}{a}\in \stackrel{~}{\mathbb{A}}$, letting $\stackrel{\u02c7}{\mu}={\mathrm{Proj}}_{\stackrel{\u02c7}{\mathrm{SS}}}(\mu )$ to alleviate the notation, we have
$\leftQ(\stackrel{\u02c7}{\mu},\stackrel{~}{a})Q(\mu ,\stackrel{~}{a})\right$  
$\le \left\stackrel{~}{f}(\stackrel{\u02c7}{\mu},\stackrel{~}{a})\stackrel{~}{f}(\mu ,\stackrel{~}{a})\right+$  
$\mathrm{\hspace{1em}}+\gamma \mathbb{E}\left\underset{{\stackrel{~}{a}}^{\prime}}{\mathrm{max}}Q({\mathrm{\Phi}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}),{\stackrel{~}{a}}^{\prime})\underset{{\stackrel{~}{a}}^{\prime}}{\mathrm{max}}Q({\mathrm{\Phi}}^{\stackrel{~}{a}}(\mu ),{\stackrel{~}{a}}^{\prime})\right$  
$\le {L}_{\stackrel{~}{f}}{d}_{\mathrm{SS}}(\stackrel{\u02c7}{\mu},\mu )+\gamma \mathbb{E}\leftV({\mathrm{\Phi}}^{\stackrel{~}{a}}(\stackrel{\u02c7}{\mu}))V({\mathrm{\Phi}}^{\stackrel{~}{a}}(\mu ))\right$  
$\le \left({L}_{\stackrel{~}{f}}+\gamma {L}_{V}{L}_{\mathrm{\Phi}}\right){d}_{\mathrm{SS}}(\stackrel{\u02c7}{\mu},\mu )$  
$\le \left({L}_{\stackrel{~}{f}}+\gamma {L}_{V}{L}_{\mathrm{\Phi}}\right){\u03f5}_{S},$ 
where we used the Lipschitz continuity of $\stackrel{~}{f},V,\mathrm{\Phi}$ and the assumption on $\stackrel{\u02c7}{\mathrm{SS}}$, see Assumptions 1, 2 and 3. ∎
Proof of Corollary 3.
We use Proposition 4 of [22], which states that ${\mathrm{softmax}}_{\tau}$ is $\tau $Lipschitz and Lemma 7 [24], which states that for ${({x}_{i})}_{i=1,\mathrm{\dots},n}$,
$${\parallel {\mathrm{softmax}}_{\tau}(x)\mathrm{argmaxe}(x)\parallel}_{2}\le 2n{e}^{\tau \delta},$$ 
where $$, and $\delta =\mathrm{\infty}$ if all ${x}_{i}$ are equal. We can apply this latter result to $Q(\stackrel{\u02c7}{\mu},\cdot )$ thanks to our Assumption 1 on the action gap, with $n=\stackrel{~}{\mathbb{A}}$ and $\delta ={K}_{A}$. Combining this with the result of Theorem 2, we have, for every $\stackrel{\u02c7}{\mu}$,
${\parallel {\mathrm{softmax}}_{\tau}\stackrel{\u02c7}{Q}(\stackrel{\u02c7}{\mu},\cdot )\mathrm{argmaxe}Q(\stackrel{\u02c7}{\mu},\cdot )\parallel}_{2}$  
$\le {\parallel {\mathrm{softmax}}_{\tau}\stackrel{\u02c7}{Q}(\stackrel{\u02c7}{\mu},\cdot ){\mathrm{softmax}}_{\tau}Q(\stackrel{\u02c7}{\mu},\cdot )\parallel}_{2}$  
$\mathrm{\hspace{1em}\hspace{1em}}+{\parallel {\mathrm{softmax}}_{\tau}Q(\stackrel{\u02c7}{\mu},\cdot )\mathrm{argmaxe}Q(\stackrel{\u02c7}{\mu},\cdot )\parallel}_{2}$  
$\le \tau {\parallel \stackrel{\u02c7}{Q}(\stackrel{\u02c7}{\mu},\cdot )Q(\stackrel{\u02c7}{\mu},\cdot )\parallel}_{{\mathrm{\ell}}^{\mathrm{\infty}}(\stackrel{~}{\mathbb{A}})}+2\stackrel{~}{\mathbb{A}}{e}^{\tau {K}_{A}}$  
$\le \tau {\u03f5}^{\prime}+2\stackrel{~}{\mathbb{A}}{e}^{\tau {K}_{A}}.$ 
∎
Appendix D Deep Deterministic Policy Gradient Algorithm
In this section, we recall for the sake of completeness the Deep Deterministic Policy Gradient (DDPG) proposed in [33], adapted to our mean field control problem. See also [18] for how the same algorithm can be used for MFG. In the latter case, it is used to compute the (approximate) best response of a single infinitesimal player instead of the optimal control for the whole population as in MFC problems.
$$J({\theta}^{\pi})=\frac{1}{{N}_{batch}}\sum _{i}{\nabla}_{a}{Q}_{{\theta}^{Q}}({M}_{i},{\pi}_{{\theta}^{\pi}}({M}_{i}))$$ 