We develop a general reinforcement learning framework for mean field control(MFC) problems. Such problems arise for instance as the limit of collaborativemulti-agent control problems when the number of agents is very large. Theasymptotic problem can be phrased as the optimal control of a non-lineardynamics. This can also be viewed as a Markov decision process (MDP) but thekey difference with the usual RL setup is that the dynamics and the reward nowdepend on the state's probability distribution itself. Alternatively, it can berecast as a MDP on the Wasserstein space of measures. In this work, weintroduce generic model-free algorithms based on the state-action valuefunction at the mean field level and we prove convergence for a prototypicalQ-learning method. We then implement an actor-critic method and reportnumerical results on two archetypal problems: a finite space model motivated bya cyber security application and a continuous space model motivated by anapplication to swarm motion.
Quick Read (beta)
We develop a general reinforcement learning framework for mean field control (MFC) problems. Such problems arise for instance as the limit of collaborative multi-agent control problems when the number of agents is very large. The asymptotic problem can be phrased as the optimal control of a non-linear dynamics. This can also be viewed as a Markov decision process (MDP) but the key difference with the usual RL setup is that the dynamics and the reward now depend on the state’s probability distribution itself. Alternatively, it can be recast as a MDP on the Wasserstein space of measures. In this work, we introduce generic model-free algorithms based on the state-action value function at the mean field level and we prove convergence for a prototypical Q-learning method. We then implement an actor-critic method and report numerical results on two archetypal problems: a finite space model motivated by a cyber security application and a continuous space model motivated by an application to swarm motion.
Model-Free Mean-Field Reinforcement Learning:
Mean-Field MDP and Mean-Field Q-Learning
René Carmona Mathieu Laurière Zongjun Tan
Typical reinforcement learning (RL) applications involve the search for a procedure to learn by trial and error the optimal behavior so as to maximize a reward. While similar in spirit to optimal control applications, a key difference is that in the latter, the model is assumed to be known to the controller. This is in contrast with RL for which the environment has to be explored, and the reward cannot be predicted with certainty. Still, the RL paradigm has generated numerous theoretical developments and found plenty practical applications. As a matter of fact, bidirectional links with the optimal control literature have been unveiled as common tools lie at the heart of many studies. Mean field control (MFC), also called optimal control of McKean-Vlasov (MKV) dynamics, is an extension of stochastic control which has recently attracted a surge of interest (see e.g. [5, 11, 12]). From a theoretical standpoint, the main peculiarity of this type of problems is that the transition and reward functions not only involve the state and the action of the controller, but also the distribution of the state (and possibly of the control). Practically speaking, these problems appear as the asymptotic limits for the control of a large number of collaborative agents. They can also be introduced as single agent problems whose evolution and costs depend upon the distribution of her state (and potentially of her control). Such problems have found a wide range of applications in distributed robotics, energy, drone fleet management, risk management, finance, etc. Although they are bona fide control problems for which a dynamic programming principle can be formulated, they generally lead to Bellman equations on the infinite dimensional space of measures, which are extremely difficult to solve ([6, 32, 35]). Figure 1 contains a schematic diagram of the relationships between optimal control (i.e., planning with a model) and how the paradigms of MFC and RL are combined in mean-field reinforcement learning (MFRL).
Main contributions. Our first contribution is conceptual and consists in introducing a general framework of MFC problems in discrete time, with infinite horizon, discount and common noise. We argue that this setup, which has not been covered in the classical literature on MFC problems, is particularly relevant to develop a theory of reinforcement learning for mean field problems. We then rephrase the problem as a Markov decision process (MDP) on the space of measures. This point of view leads to the introduction of a state value function and a state-action value function as well as their associated Bellman equations on the Wasserstein space of measures. Our second contribution is to propose two model-free methods to learn an approximation of the state-action value function by trial and error. The first method relies on a discretization of the simplex and a tabular version of -learning, for which we prove a convergence result. The second method is based on an actor-critic method (Deep Deterministic Policy Gradient). Last, our third contribution is to implement the latter method and assess its convergence numerically. Numerical tests are conducted on two prototypical examples drawn from the mean-field literature: a finite state model motivated by a cyber security application and a continuous state and action model motivated by an application to swarm motion.
2 Mean Field Control
In this section, we keep the discussion at an informal level in order to encompass both finite and continuous state spaces. Specific methods and examples for each setting are presented in Sections 3 and 4.
Definition of the problem.
We denote by and respectively the state space and the action space. Typically, we have in mind the finite space case where both are finite sets, say and , or the continuous case, where and . A generic discrete time, infinite horizon, discounted mean field control (MFC) problem (or control of McKean-Vlasov dynamics) takes the following form: Maximize over the control process (or policy) the reward functional
where the state process has initial distribution and dynamics
Here encodes the transition. The mean-field nature of the model stems from the presence of , which is the law of conditioned on the realization of up to time , where is a stochastic process taking values in a set . For simplicity we assume that the are i.i.d. They play the role of a so-called common noise affecting the state transitions. Although the presence of this noise is not necessary for the model to be meaningful and we postpone the rigorous mathematical framework to future work, we believe that its presence is important for applications. Motivations and examples for this type of noise are provided in the sequel. Randomness in the rewards as well as interactions through the control’s distribution could also be handled, but for the sake of simplicity of the presentation we limit ourselves to a reward which is a deterministic function of and to the interaction through the conditional distribution of the state only.
Note that depends on . Although this dependence is usually omitted in the MFC literature, it is important to remember that the solution of the MFC problem changes if we let this initial distribution vary.
In the finite case, the above dynamics take the following form, where we identify with the -simplex denoted by : for every , corresponds to
Such an evolution can be interpreted in terms of a transition rate matrix, and the common noise can for instance affect the coefficients of this matrix. In the continuous case, (2) can come from a continuous time model, for example a stochastic differential equation of the McKean Vlasov type (MKV SDE) via an Euler scheme , in which case:
where the random variables are independent (e.g. with Gaussian distributions) and are interpreted as sources of noise. This type of setting has been studied in  with a linear dynamics and a quadratic cost.
When there is no ambiguity from the context, we will drop the superscripts and . Intuitively, (1) is the limiting problem, as grows to infinity, of the following control problem for agents: Maximize over the social average reward
where is the empirical distribution, are i.i.d. with distribution and the dynamics are Note that the same appears in the transitions of all . In other words, the system is affected by two sources of noise: one perturbs each state independently, and one affects all the agents. The first noise is idiosyncratic to each agent whereas the second one is common to the whole population. This latter type of noise is important in many applications, see e.g. [13, 2] for models of systemic risk or energy management in the context of mean field games. Since, in this -agent problem, the goal is to maximize , the problem can be interpreted e.g. as a large collaborative game (i.e. a Pareto optimum, rather than a Nash equilibrium as in mean field games), or as the problem of a central planner trying to find the best way to control a large group of robots.
Reformulation as an MDP on the Wasserstein space of measures.
We reformulate the MFC problem (1) as the optimal control of a Markov decision process in which the state space is the space of measures, in the spirit of e.g. . We restrict our attention to controls which are stationary feedback functions of , namely processes for which there exists a (deterministic) function such that for all
In this situation, a typical agent takes her decision at each time step based on only two pieces of information: her current state and the distribution of the population’s states. For such and we will write instead of . We denote by the set of such functions and it will sometimes be convenient to view them as functions from to the set . The initial MFC problem (1) can be recast as an optimal control problem on the distribution flow, namely,
under the constraint: , where solves (2). Here is defined by
and the evolution of the distribution is given by
where , formalizing the transition (2) in our new set of notations. Note that depends (possibly in a non-linear way) on the distribution at time , in accordance with the idea of MKV dynamics. If is constant with respect to the common noise, then the evolution of the distribution is deterministic and the expectation symbol in (4) is superfluous. To alleviate the presentation, we sometimes omit the dependence on (since it is now the only source of randomness) and see as a stochastic map from to .
We are facing an MDP over the space of probability measures on the underlying state . Note that is always continuous and infinite dimensional unless is finite. If is finite, the distribution can be viewed as a vector in whose dynamics can be written as
where is the transition matrix of the distribution, which can be interpreted in terms of the transition rate matrix of a typical player.
Let be the value function associated to the above problem (4), defined by: for ,
under the constraint: solves (5) with initial condition . One can check, at least formally, that a dynamic programming principle holds in the sense that solves the following Bellman equation:
where the expectation is over the randomness of , which stems from the common noise.
Moreover, under suitable conditions, we expect a verification theorem to hold, namely, any function satisfying the above Bellman equation (8) coincides with the value function of the MFC problem: for any initial distribution , and the optimal control is given by the maximizer in (8). The value function and its connection to the so-called Master equation and to the mean-field PDE system has been a very active research direction in recent years, see e.g. [5, 11, 12].
3 Mean-Field Q-Learning
We now turn our attention to the question of learning the solution of the MFC problem in a model-free setting, i.e., assuming the model is unknown but one has access to a simulator which can provide samples of trajectories and rewards.
State-action value function.
From now on, we see the distribution as the state in and the action taken by the central planner given this distribution as an element of . We then introduce the following state-action value function defined, for every by
where we recall that we dropped from the notation in the expectation and in . Under suitable conditions, and are well defined and . Note that finding one maximizer for each amounts to find a maximizer in for (7).
In the rest of this section, we propose two model-free algorithms relying on this state-action value function. We first focus on the case where and are finite and we explain later on how to adapt these techniques to the case of continuous spaces.
First method: Tabular MFQ-learning with projection.
Note that even when is finite, is an element of the -simplex , which is a continuous space, and hence a tabular version of Q-learning can not be applied directly. One possible workaround is to first replace by a finite subset and then project on at each time step , thus letting the mean-field term take only a finite number of values. We can then approximate the function by a function , which can be represented by a matrix (or a “table”) in (since is the set of functions from to ). We introduce the following “projected MFC problem”: Maximize over
under the constraint: where . This problem is still an optimal control problem of some sequence of distributions, except that at each time step the distribution is pushed forward by (instead of ) when using control . In this case, a straightforward adaptation of the tabular Q-learning algorithm leads to Algorithm 1. Note that, even in the absence of common noise, this algorithm is possibly stochastic since at each episode, the order in which the state-action pairs are picked is potentially random. In practice, the order could be fixed in advance or stem from a sampled trajectory.
For this elementary algorithm, as a proof of concept we provide a convergence result for the approximation of the Q-function of the original MFC problem by the table returned at the end of Algorithm 1. To this end, in order to keep the paper at a reasonable length, we will make the following simplifying assumptions.
We endow the simplex seen as a subset of with the Euclidean distance denoted by .
Regularity of the data: is bounded and Lipschitz continuous with respect to with constant and is Lipschitz continuous with respect to with constant , uniformly in in expectation over the randomness of the common noise, namely: for every ,
Regularity of the value function: is Lipschitz continuous wrt with constant .
Simplex discretization: There exists s.t.: s.t. .
Learning rates: There exists such that for every , for each , where is the number of times up to that the pair has been visited in Algorithm 1.
The regularity of in 2 can typically be ensured through suitable conditions on the data of the problem, as e.g. in [17, 9, 12], while to obtain 3, one can consider an -net as in . Assumption 4 holds for instance either by using exploring starts (if the learner can query an oracle which simulates transitions from any ), or by following a long enough sampled trajectory (provided some form of irreducibility or ergodicity of the dynamics, ensuring full exploration). Note that the boundedness of the running reward from Assumption 1 together with the fact that ensures the existence of a finite upper bound for the value function of the projected MFC problem. We denote by the horizon of the MDP corresponding to the projected MFC problem, and for , we let .
Note that can be chosen as small as desired provided is large enough. The second term in the error is proportional to , which is somehow unavoidable in general due to the projection on . However, this error vanishes as , i.e., as is better and better an approximation of .
The proof is deferred to Section C of the appendix. It relies on the following three steps: (1) Since is large enough, then on ; (2) on ; (3) For every , The first step relies on standard -learning convergence results , while the two other steps stem from the regularity assumptions and the approximation of by .
Let us now derive a consequence in terms of the control function. We will use the following additional assumption on the gap between the values of the best and second-best actions, which is rather standard in approximation algorithms based on tabular -functions [20, 4].
Action gap: There exists such that: For every and ,
For and , we use the notation
The proof is deferred to Section C of the appendix. The in the second term is here in case there are several optimal controls. The regularizes the best action predicted by the estimation of the function .
Second method: DDPG for MFC.
Although the elementary structure of the above method allows us to analyze it, one drawback is that it requires the computation of a projection on a discretization of the simplex at each time step, which can be quite cumbersome and computationally costly when the number of states in is large. For this reason, we propose a different RL method based on an approximation of the Q-function by neural networks which can deal with inputs and outputs in continuous spaces. Still in the case of finite and , since the state of distributions and the set of actions are finite dimensional, we can try to approximate the Q-function by a neural network taking inputs in and outputting a real number. For the learning procedure, we employ the Deep Deterministic Policy Gradient (DDPG) proposed in . It relies on two neural networks, one for the Q-function (the critic) and one for the policy (the actor). The heart of the algorithm consists in updating alternatively the critic by minimizing an empirical square error and the actor by making one step of gradient descent. To improve exploration, a Gaussian noise is added to the action prescribed by the actor, and for more stability, target networks are also added. The algorithm is summarized in the appendix (see Algorithm 2).
Adaptation to continuous spaces.
When the underlying state space is continuous, the distribution is an element of the infinite-dimensional space . In order to develop a reinforcement learning method, we thus need some kind of finite-dimensional approximation. Motivated by applications to physical models such as swarm of robots or drones in which the underlying space is in low dimension (typically with or ), we propose to represent by a histogram of its values at a finite number of points. In other words, we consider a finite number of points in and let be a vector which approximates . Similarly, an action can be approximated by its values on the points chosen in and can hence be represented by a vector in . The problem then reduces to the finite-state case discussed above.
4 Numerical Examples
We present numerical results obtained using the second method introduced above. We assumed that the central planner has access to an oracle which can simulate the evolution of as discussed at the end of the previous section. In the numerical implementation, we provided the learning algorithm with a black-box which computes transitions of . This oracle has been implemented using a finite-difference scheme for Kolmorov-Fokker-Planck equations, in line with recent research on numerical methods for mean field games . The actor and critic networks have been implemented using a feedforward fully connected architecture with hidden layers of width less than neurons. We used random initial states at each episode, and the noise used on the action is a Gaussian noise with mean and variance . We used Adam optimizer with initial learning rate and minibatches of size . For the sake of clarity and in order to benchmark the aforementioned method, we provide illustrations on examples without common noise but analogous results can also be obtained in the presence of a common noise.
Example 1: Cyber security model.
For a first testbed, we start with a finite state problem. We revisit the cyber security example introduced in , but here from the point of view of a central planner (such as a large company or a state) trying to protect its computers against the attacks of a hacker. The situation can hence be phrased as a mean field control problem.
In this model, the population consists of a large group of computers which can be either defended (D) or undefended (U) and either infected (I) or susceptible (S) of infection. The set has hence four elements corresponding to the four possible combinations: DI, DS, UI, US. The action set is , where is interpreted as the fact that the central planner is satisfied with the current level of protection (D or U) of the computer under consideration, whereas means that she wants to change this level of protection. In the latter case, the update occurs at a (fixed) rate . At each of the four states, all the computers are indistinguishable and hence the central planner only chooses one action per state and applies it to all the computers at that state. When infected, each computer may recover at rate or depending on whether it is defended or not. On the other hand, a computer may be infected either directly by a hacker, at rate (resp. ) if it is defended (resp. undefended), or by undefended infected computers, at rate (resp. ) if it is undefended (resp. defended), or by defended infected computers, at rate (resp. ) if it is undefended (resp. defended).
The transition matrix from (6) is given by
and all the instances of should be replaced by the negative of the sum of the entries of the row in which appears on the diagonal. At each time step, the central planner pays a protection cost for each defended computer, and a penalty for each infected computer. The total reward at time is hence .
Although the underlying state space is finite, instead of using MFQ (see Algorithm 1), we use the second method introduced above based on DDPG. In the present case, this approach has the advantage to avoid discretizing since we instead deal directly with the distribution as a vector in dimension . The DDPG method learns a control, that we then apply to the three reference cases studied in [11, Section 7.2.3] (for the sake of brevity, we do not reproduce here the values of the parameters). The resulting distribution’s evolution are shown in Figure 2. We can see that we recover the same evolution for the three initial distributions considered, namely and . In particular, at , we obtain in all three simulations the distribution , which is close to the values found in [11, Section 7.2] for the stationary distribution, namely .
Example 2: Swarm motion.
We then turn our attention to a model in continuous space. More precisely, let us consider a model of swarm motion with aversion to crowded regions introduced in  (in the context of mean field games). Since here we simply want to provide a proof of principle for our method, we take (as in the aforementioned work) the interval with periodic boundary condition (i.e. the unit torus) as the state space . The dynamics of a typical agent is driven by (3) with . In other words, the central planner chooses the velocity of each agent. The instantaneous reward (appearing in (1)) of a typical agent at location and using action while the population’s state is , is defined as Here, the first term penalizes a large velocity (it can be interpreted as a kind of cost proportional to the kinetic energy of the agent), encodes a preference for certain positions in space, and the last term models crowd aversion since it penalizes the fact of being at a location where the density of agents is high. By choosing
we obtain a model for which, when have Gaussian distribution and , the MFC admits an explicit ergodic solution that we can use as a benchmark. Indeed, in this case the optimal ergodic control is given by and the ergodic distribution of the corresponding MKV dynamics has density .
To implement the second RL method described above, we discretize with a mesh of points and use a finite difference scheme to simulate the evolution of the dynamics. The DDPG method uses this as a black-box and, for a given action , can only access the resulting new distribution and the associated reward. Figure 3 presents results obtained using this method after episodes. The system has been trained on initial distributions which are Gaussian with random mean and random variance. As illustrated in the figures, the system has learnt how to drive this type of initial distributions towards the analytical stationary distribution and then how to use an approximation of the stationary optimal control in order to keep the system in the stationary regime.
5 Conclusion and future research
In this work, we explored the central role played by MKV dynamics in multi-agent RL. We developed a framework and model-free methods to learn mean field optimal control. As a proof of principle, we established a rate of convergence for a -learning method and our numerical tests assess empirical convergence of an actor-critic method on examples from the literature. An important feature of our model is the presence of common noise, whose impact had to be controlled.
Our results can be extended in several directions. The analysis and the numerical implementations could be applied to other mean-field problems such as mean field games or mean field control problems with several populations, with important applications to multi-agent sweeping and tracking. The proof of convergence of the actor-critic method is also postponed for future work. Last, it would be interesting to investigate other types of simulators, such as Monte-Carlo simulators based on samples of a finite population.
Related work. Our work is at the intersection of RL and MFC. The latter has recently attracted a lot of attention, particularly since the introduction of mean field games (MFG) by Lasry and Lions [2006a, 2006b, 2007] and by Caines, Huang and Malhamé [2006, 2007]. MFGs correspond to the asymptotic limit of Nash equilibria for games with mean field interactions. They are defined through a fixed point procedure and hence, differ both conceptually and numerically, from MFC problems which correspond to social optima, see e.g.  for details. Most works on learning in the presence of mean field interactions have focused on MFGs, see e.g. [40, 10] for “learning” (or rather solving) MFGs based on the full knowledge of the model, and [27, 38, 39, 24, 34, 18, 37, 21] for RL based methods. In contrast, our work focuses on MFC problems. Along these lines, we have studied policy gradient methods for MFC in . However, this work was restricted to linear-quadratic models. While completing the present work, we became aware of the very recent work , which studies MFC with policy gradient methods too. However, their work is restricted to finite state and action spaces whereas we also consider continuous spaces. Furthermore, we provide a rate of convergence (see Theorem 2). We also stress that although some tools are common, our work differs significantly from [24, 18] because the latter works deal with a mean field game. Their learning procedure is embedded in a fixed point on the distribution and, for this reason, the -learning step is only needed to solve a classical control problem and not a mean field one. Here, the key novelty is that our learning methods are designed for MDPs on the space of measures. Last, we would also like to mention that the first two authors have recently proposed machine learning methods for solving mean field control problems and mean field games, see [14, 15]. These methods are based on the knowledge of the model, since one relies on it to compute gradients of the cost functional and implement a stochastic gradient descent. The present work can be viewed as an extension of these methods, where one tries to be free from the model and learn the solution by trial and error.
M.L. is grateful to Matthieu Geist and Julien Pérolat for helpful discussions on the DDPG algorithm.
- Achdou and Capuzzo-Dolcetta  Achdou, Y. and Capuzzo-Dolcetta, I. (2010). Mean field games: numerical methods. SIAM J. Numer. Anal., 48(3):1136–1162.
- Alasseur et al.  Alasseur, C., Ben Tahar, I., and Matoussi, A. (2017). An extended mean field game for storage in smart grids. arXiv preprint arXiv:1710.08991.
- Almulla et al.  Almulla, N., Ferreira, R., and Gomes, D. (2017). Two numerical approaches to stationary mean-field games. Dyn. Games Appl., 7(4):657–682.
- Bellemare et al.  Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., and Munos, R. (2016). Increasing the action gap: New operators for reinforcement learning. In Thirtieth AAAI Conference on Artificial Intelligence.
- Bensoussan et al.  Bensoussan, A., Frehse, J., and Yam, S. C. P. (2013). Mean field games and mean field type control theory. Springer Briefs in Mathematics. Springer, New York.
- Bensoussan et al.  Bensoussan, A., Frehse, J., and Yam, S. C. P. (2015). The master equation in mean field theory. J. Math. Pures Appl. (9), 103(6):1441–1474.
- Bertsekas  Bertsekas, D. P. (2012). Dynamic programming and optimal control. Vol. II. Approximate dynamic programming. Athena Scientific, Belmont, MA, fourth edition.
- Bossy and Talay  Bossy, M. and Talay, D. (1997). A stochastic particle method for the McKean-Vlasov and the Burgers equation. Math. Comp., 66(217):157–192.
- Cardaliaguet et al.  Cardaliaguet, P., Delarue, F., Lasry, J.-M., and Lions, P.-L. (2019). The master equation and the convergence problem in mean field games, volume 201 of Annals of Mathematics Studies. Princeton University Press, Princeton, NJ.
- Cardaliaguet and Hadikhanloo  Cardaliaguet, P. and Hadikhanloo, S. (2017). Learning in mean field games: the fictitious play. ESAIM: Control, Optimisation and Calculus of Variations, 23(2):569–591.
- Carmona and Delarue [2018a] Carmona, R. and Delarue, F. (2018a). Probabilistic theory of mean field games with applications. I, volume 83 of Probability Theory and Stochastic Modelling. Springer, Cham. Mean field FBSDEs, control, and games.
- Carmona and Delarue [2018b] Carmona, R. and Delarue, F. (2018b). Probabilistic theory of mean field games with applications. II, volume 84 of Probability Theory and Stochastic Modelling. Springer, Cham. Mean field games with common noise and master equations.
- Carmona et al.  Carmona, R., Fouque, J.-P., and Sun, L.-H. (2015). Mean field games and systemic risk. Commun. Math. Sci., 13(4):911–933.
- Carmona and Laurière [2019a] Carmona, R. and Laurière, M. (2019a). Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: I - the ergodic case. arXiv preprint arXiv:1907.05980.
- Carmona and Laurière [2019b] Carmona, R. and Laurière, M. (2019b). Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: II - the finite horizon case. arXiv preprint arXiv:1908.01613.
- Carmona et al.  Carmona, R., Laurière, M., and Tan, Z. (2019). Linear-quadratic mean-field reinforcement learning: Convergence of policy gradient methods. arXiv preprint arXiv:1910.04295.
- Chassagneux et al.  Chassagneux, J.-F., Crisan, D., and Delarue, F. (2014). A probabilistic approach to classical solutions of the master equation for large population equilibria. arXiv:1411.3009.
- Elie et al.  Elie, R., Pérolat, J., Laurière, M., Geist, M., and Pietquin, O. (2019). Approximate fictitious play for mean field games. arXiv preprint arXiv:1907.02633.
- Even-Dar and Mansour  Even-Dar, E. and Mansour, Y. (2003). Learning rates for Q-learning. J. Mach. Learn. Res., 5:1–25.
- Farahmand  Farahmand, A.-m. (2011). Action-gap phenomenon in reinforcement learning. In Advances in Neural Information Processing Systems, pages 172–180.
- Fu et al.  Fu, Z., Yang, Z., Chen, Y., and Wang, Z. (2019). Actor-critic provably finds nash equilibria of linear-quadratic mean-field games. arXiv preprint arXiv:1910.07498.
- Gao and Pavel  Gao, B. and Pavel, L. (2017). On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805.
- Gast et al.  Gast, N., Gaujal, B., and Le Boudec, J.-Y. (2012). Mean field for markov decision processes: from discrete to continuous optimization. IEEE Transactions on Automatic Control, 57(9):2266–2280.
- Guo et al.  Guo, X., Hu, A., Xu, R., and Zhang, J. (2019). Learning mean-field games. arXiv preprint arXiv:1901.09585.
- Huang et al.  Huang, M., Caines, P. E., and Malhamé, R. P. (2007). Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized -Nash equilibria. IEEE Trans. Automat. Control, 52(9):1560–1571.
- Huang et al.  Huang, M., Malhamé, R. P., and Caines, P. E. (2006). Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Commun. Inf. Syst., 6(3):221–251.
- Iyer et al.  Iyer, K., Johari, R., and Sundararajan, M. (2014). Mean field equilibria of dynamic auctions with learning. Management Science, 60(12):2949–2970.
- Kolokoltsov and Bensoussan  Kolokoltsov, V. N. and Bensoussan, A. (2016). Mean-field-game model for botnet defense in cyber-security. Appl. Math. Optim., 74(3):669–692.
- Lasry and Lions [2006a] Lasry, J.-M. and Lions, P.-L. (2006a). Jeux à champ moyen. I. Le cas stationnaire. C. R. Math. Acad. Sci. Paris, 343(9):619–625.
- Lasry and Lions [2006b] Lasry, J.-M. and Lions, P.-L. (2006b). Jeux à champ moyen. II. Horizon fini et contrôle optimal. C. R. Math. Acad. Sci. Paris, 343(10):679–684.
- Lasry and Lions  Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Jpn. J. Math., 2(1):229–260.
- Laurière and Pironneau  Laurière, M. and Pironneau, O. (2016). Dynamic programming for mean-field type control. J. Optim. Theory Appl., 169(3):902–924.
- Lillicrap et al.  Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR 2016).
- Mguni et al.  Mguni, D., Jennings, J., and de Cote, E. M. (2018). Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Pham and Wei  Pham, H. and Wei, X. (2017). Dynamic programming for optimal control of stochastic McKean-Vlasov dynamics. SIAM J. Control Optim., 55(2):1069–1101.
- Subramanian and Mahajan  Subramanian, J. and Mahajan, A. (2019). Reinforcement learning in stationary mean-field games. In Proceedings. 18th International Conference on Autonomous Agents and Multiagent Systems.
- Tiwari et al.  Tiwari, N., Ghosh, A., and Aggarwal, V. (2019). Reinforcement learning for mean field game. arXiv preprint arXiv:1905.13357.
- Yang et al. [2018a] Yang, J., Ye, X., Trivedi, R., Xu, H., and Zha, H. (2018a). Deep mean field games for learning optimal behavior policy of large populations. In International Conference on Learning Representations.
- Yang et al. [2018b] Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., and Wang, J. (2018b). Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5567–5576.
- Yin et al.  Yin, H., Mehta, P. G., Meyn, S. P., and Shanbhag, U. V. (2013). Learning in mean-field games. IEEE Transactions on Automatic Control, 59(3):629–644.
Appendix A Derivation of Bellman equation
To alleviate the presentation, we assume that the running reward is bounded by a constant and we focus on stationary controls. To avoid measurability issues, we assume that the set is countable. More general settings could be considered at the expense of technicalities which are beyond the scope of the present work. We rewrite the value function as
which leads us to introduce the following dynamic programming operators defined, for a bounded function and , by
The above expression for then rewrites
where represents the result of composed times and applied to the function which is identically on . Thanks to the bound on , one can check that for every and every integer ,
Applying to the above relation yields
By taking , we deduce
which is the desired relation. One can also check that the solution to (7) is unique. Similarly, denoting , we have for any and , is the only fixed point of .
In addition, let us consider an stationary control which is optimal in the sense that, for any , it achieves the supremum over in (4), i.e., . Then, using the Bellman equations for and , we have
Conversely, assume is such that for every ,
Then, since we obtain that . By uniqueness of the fixed point of , we have that and hence is optimal.
Appendix B Link with MDPs arising in Mean Field Games
Here, we clarify the difference between the Mean Field MDPs studied in the present work and the ones arising in the context of finite state Mean Field Games [38, 24]. Although both problems involve the term “mean field”, we argue that in a MFG, the MDP is a rather standard one.
In , the authors make the following point (at the beginning of their Section 4): given the evolution of the population (i.e., the mean-field term), an infinitesimal player solves a (standard) optimal control problem, to which corresponds a (standard) MDP. In other words, the population’s distribution appears as a given parameter in the MDP and not as the state over which the optimization is performed, as in our case. To emphasize, in our notation, the difference between their setting and ours, let us consider the evolution (6) for a finite state space . As in , let us assume that the players control directly their transition probabilities. In other words, the action space is the set of probability distributions on , and all the players in a state control the probability with which they will go to each other state . In this case, each element of can be identified with a vector of length representing a probability distribution on , and each element of can be identified with a matrix such that . In , the initial distribution is fixed (hence we omit to denote explicitly the dependence on ), and there is no common noise. In this case, one can look for feedback controls which are functions of time and only. Then the MFC problem (4) rewrites: Find maximizing
under the constraint: and for ,
where is as defined above. The corresponding MFG, analogous to the one studied in , can be formulated as follows: Find a sequence of distributions and a sequence of controls such that: (1) maximizes
under the constraint: and for ,
and (2) for every , . Step (1) above corresponds to the problem faced by a typical agent, when the evolution of the population is given by . The dynamics (13) can be viewed as an MDP on the space of measures, but the evolution is purely linear, in the sense that at time , the state of the MDP, namely , is not involved in the transition matrix, namely . In contrast, the dynamics (6) is in general non-linear.
In , the authors build the MDP upon the same insight in a different way. To stress the difference with our notion of MDP, let us go back to the MFC problem (1) and consider again a finite state space and no common noise. The corresponding MFG would be: Find a flow of distributions and a policy such that: (1) maximizes
where the state process has initial distribution and dynamics
and (2) For every , is the distribution of . As mentioned above, in the optimization problem of a typical player, the flow of distributions appears as a given parameter. The corresponding MDP is hence parameterized by this but evolves in the finite state space . Furthermore, in , although they consider interactions through the control’s distributions (omitted here to alleviate the notations), the authors focus on a stationary Nash equilibrium. In other words, they look for a solution of the following type of problems: Find and such that, letting , we have: (1) maximizes
where the state process has initial distribution and dynamics
and (2) For every , is the distribution of . In this case, the rewards and dynamics of an infinitesimal player is parameterized by a single distribution and the same remark holds for the corresponding MDP.
Appendix C Proof of the convergence results
Proof of Theorem 2.
Let us denote by and respectively the state value function and the state-action value function of the projected MFC problem. We split the proof into several steps.
Step 1. If is large enough, then for every ,
This comes from standard convergence results on Q-learning for finite state-action spaces. More precisely, under Assumptions 1, 4 and 5, we can apply Theorem 4 and Corollary 34 in  for asynchronous Q-learning and polynomial learning rates, and we obtain that, with probability at least , , given that is of order (11).
Step 2. For every ,
This amounts to say that the projection on realized at each step does not perturb too much the value function. Let us start by noting that, for every and ,
where the last inequality holds by Lipschitz continuity of , see Assumption 2, and because for every ,
Combining the above bounds yields
Proof of Corollary 3.
Appendix D Deep Deterministic Policy Gradient Algorithm
In this section, we recall for the sake of completeness the Deep Deterministic Policy Gradient (DDPG) proposed in , adapted to our mean field control problem. See also  for how the same algorithm can be used for MFG. In the latter case, it is used to compute the (approximate) best response of a single infinitesimal player instead of the optimal control for the whole population as in MFC problems.