Learning Fairness in Multi-Agent Systems

  • 2019-10-31 13:59:37
  • Jiechuan Jiang, Zongqing Lu
  • 13

Abstract

Fairness is essential for human society, contributing to stability andproductivity. Similarly, fairness is also the key for many multi-agent systems.Taking fairness into multi-agent learning could help multi-agent systems becomeboth efficient and stable. However, learning efficiency and fairnesssimultaneously is a complex, multi-objective, joint-policy optimization. Totackle these difficulties, we propose FEN, a novel hierarchical reinforcementlearning model. We first decompose fairness for each agent and proposefair-efficient reward that each agent learns its own policy to optimize. Toavoid multi-objective conflict, we design a hierarchy consisting of acontroller and several sub-policies, where the controller maximizes thefair-efficient reward by switching among the sub-policies that provides diversebehaviors to interact with the environment. FEN can be trained in a fullydecentralized way, making it easy to be deployed in real-world applications.Empirically, we show that FEN easily learns both fairness and efficiency andsignificantly outperforms baselines in a variety of multi-agent scenarios.

 

Quick Read (beta)

Learning Fairness in Multi-Agent Systems

Jiechuan Jiang
Peking University
[email protected]
&Zongqing Lu
Peking University
[email protected]
Corresponding author
Abstract

Fairness is essential for human society, contributing to stability and productivity. Similarly, fairness is also the key for many multi-agent systems. Taking fairness into multi-agent learning could help multi-agent systems become both efficient and stable. However, learning efficiency and fairness simultaneously is a complex, multi-objective, joint-policy optimization. To tackle these difficulties, we propose FEN, a novel hierarchical reinforcement learning model. We first decompose fairness for each agent and propose fair-efficient reward that each agent learns its own policy to optimize. To avoid multi-objective conflict, we design a hierarchy consisting of a controller and several sub-policies, where the controller maximizes the fair-efficient reward by switching among the sub-policies that provides diverse behaviors to interact with the environment. FEN can be trained in a fully decentralized way, making it easy to be deployed in real-world applications. Empirically, we show that FEN easily learns both fairness and efficiency and significantly outperforms baselines in a variety of multi-agent scenarios.

 

Learning Fairness in Multi-Agent Systems


  Jiechuan Jiang Peking University [email protected] Zongqing Luthanks: Corresponding author Peking University [email protected]

\@float

noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]

1 Introduction

Fairness is essential for human society, contributing to stability and productivity. Similarly, fairness is also the key for many multi-agent systems, e.g., routing Jiang et al. (2018), traffic light control Bakker et al. (2010), and cloud computing Minarolli and Freisleben (2013). More specifically, in routing, link bandwidth needs to be fairly allocated to packets to achieve load balance; in traffic light control, resources provided by infrastructure needs to be fairly shared by vehicles; in cloud computing, resource allocation of virtual machines has to be fair to optimize profit.

Many game-theoretic methods de Jong et al. (2007); Procaccia (2009); Chen et al. (2013) have been proposed for fair division in multi-agent systems, which mainly focus on proportional fairness and envy-freeness. Most of them are in static settings, while some Kash et al. (2014); Zhang and Shah (2014); Beynier et al. (2018) consider the dynamic environment. Recently, multi-agent reinforcement learning (RL) has been successfully applied to multi-agent sequential decision-making, such as Lowe et al. (2017); Sukhbaatar et al. (2016); Yang et al. (2018); Jiang and Lu (2018); Sunehag et al. (2017); Rashid et al. (2018); Foerster et al. (2018). However, most of them try to maximize the reward of each individual agent or the shared reward among agents, without taking fairness into account. Only a few methods consider fairness, but are handcrafted for different applications Jain et al. (2017); Cui et al. (2018); Li et al. (2019), which all require domain-specific knowledge and cannot be generalized. Some methods Peysakhovich and Lerer (2018); Hughes et al. (2018); Wang et al. (2019) aim to encourage cooperation in social dilemmas but cannot guarantee fairness.

Taking fairness into multi-agent learning could help multi-agent systems become both efficient and stable. However, learning efficiency (system performance) and fairness simultaneously is a complex, multi-objective, joint-policy optimization. To tackle these difficulties, we propose a novel hierarchical RL model, FEN, to enable agents to easily and effectively learn both efficiency and fairness. First, we decompose fairness for each agent and propose fair-efficient reward that each agent learns its own policy to optimize it. We prove that agents achieve Pareto efficiency and fairness is guaranteed in infinite-horizon sequential decision-making if all agents maximize their own fair-efficient reward and learn the optimal policies. However, the conflicting nature between fairness and efficiency makes it hard for a single policy to learn effectively. To overcome this, we then design a hierarchy, which consists a controller and several sub-policies. The controller maximizes the fair-efficient reward by switching among the sub-policies which directly interact with the environment. One of the sub-policies is designated to maximize the environmental reward, and other sub-policies are guided by information-theoretic reward to explore diverse possible behaviors for fairness. Additionally, average consensus, which is included in the fair-efficient reward, coordinates the policies of agents in fully decentralized multi-agent learning. By saying fully decentralized, we emphasize that there is no centralized controller, agents exchange information locally, and they learn and execute based on only local information.

We evaluate FEN in three classic scenarios, i.e., job scheduling, the Mathew effect, and manufacturing plant. It is empirically demonstrated that FEN obtains both fairness and efficiency and significantly outperforms existing methods. By ablation studies, we confirm the hierarchy indeed helps agents to learn more easily. Benefited from distributed average consensus, FEN can learn and execute in a fully decentralized way, making it easy to be deployed in real-world applications.

2 Related Work

Fairness. There are many existing works on fair division in multi-agent systems. Most of them focus on static settings de Jong et al. (2007); Procaccia (2009); Chen et al. (2013), where the information of entire resources and agents are known and fixed, while some of them work on dynamic settings Kash et al. (2014); Zhang and Shah (2014); Beynier et al. (2018), where resource availability and agents are changing. For multi-agent sequential decision-making, a regularized maximin fairness policy is proposed Zhang and Shah (2014) to maximize the worst performance of agents while considering the overall performance, and the policy is computed by linear programming or game-theoretic approach. However, none of these works are learning approach. Some multi-agent RL methods Jain et al. (2017); Cui et al. (2018); Li et al. (2019) have been proposed and handcrafted for resource allocation in specific applications, such as resource allocation on multi-core systemsJain et al. (2017), sharing network resources among UAVsCui et al. (2018), and balancing various resources in complex logistics networksLi et al. (2019). However, all these methods require domain-specific knowledge and thus cannot be generalized. Some methods Peysakhovich and Lerer (2018); Hughes et al. (2018); Wang et al. (2019) are proposed to improve cooperation in social dilemmas. In Peysakhovich and Lerer (2018), the reward is shaped for two-player Stag Hunt games, which could be agents’ average reward in its multi-player version. In Peysakhovich and Lerer (2018), one agent’s reward is set to be the weighted average reward of two agents in two-player Stag Hunt games to induce prosociality. By extending the inequity aversion model, a shaped reward is designed in Hughes et al. (2018) to model agent’s envy and guilt. A reward network is proposed in Wang et al. (2019) to generate intrinsic reward for each agent, which is evolved based on the group’s collective reward. Although cooperation in social dilemmas helps improve agents’ sum reward, it does not necessarily mean the fairness is guaranteed.

The Matthew effect, summarized as the rich get richer and the poor get poorer, can be witnessed in many aspects of human society Bol et al. (2018), as well as in multi-agent systems, such as preferential attachment in networking Perc (2014); Jiang and Huang (2012) and mining process in blockchain systems Zeng and Zuo (2019). The Matthew effect causes inequality in society and also performance bottleneck in multi-agent systems. Learning fairness could avoid the Matthew effect and help systems become stable and efficient.

Multi-Agent RL. Many multi-agent RL models have been recently proposed, such as Lowe et al. (2017); Sukhbaatar et al. (2016); Yang et al. (2018); Jiang and Lu (2018); Sunehag et al. (2017); Rashid et al. (2018); Foerster et al. (2018), but all of them only consider efficiency. CommNet Sukhbaatar et al. (2016) and ATOC Jiang and Lu (2018) use continuous communication for multi-agent cooperation. Opponent modeling Hong et al. (2018); Rabinowitz et al. (2018) learns to reason about other agents’ behaviors or minds for better cooperation or competition. MADDPG Lowe et al. (2017) is designed for mixed cooperative-competitive environments. In these models, each agent only focuses on optimizing its own local reward. Thus, more capable agents will obtain more rewards and fairness is not considered. VDN Sunehag et al. (2017), QMIX Rashid et al. (2018), and COMA Foerster et al. (2018) are designed for the scenario where all agents jointly maximize a shared reward. The shared reward is not directly related to fairness. Even if the shared reward is defined as the sum of local rewards of all agents, we can easily see that higher reward sum does not mean fairer.

Hierarchical RL. To solve more complex tasks with sparse rewards or long time horizons and to speed up the learning process, hierarchical RL trains multiple levels of policies. The higher level policies give goals or options to the lower level policies and only the lowest level applies actions to the environment. So, the higher levels are able to plan over a longer time horizon or a more complex task. Learning a decomposition of complex tasks into sub-goals are considered in Dayan and Hinton (1993); Vezhnevets et al. (2017); Nachum et al. (2018), while learning options are considered in Sutton et al. (1999); Bacon et al. (2017); Frans et al. (2017). However, none of these hierarchical RL models can be directly applied to learning both fairness and efficiency in multi-agent systems.

3 Methods

We propose Fair-Efficient Network, FEN, to enable agents to learn both efficiency and fairness in multi-agent systems. Unlike existing work, we decompose fairness for each agent and propose fair-efficient reward, and each agent learns its own policy to optimize it. However, optimizing the two conflicting objectives is hard for a single learning policy. To this end, we propose a hierarchy specifically designed for easing this learning difficulty. The hierarchy consists a controller and several sub-policies, where the controller learns to select sub-policies and each sub-policy learns to interact with the environment in a different way. Average consensus, which is included in the fair-efficient reward, coordinates agents’ policies and enables agents to learn in a fully decentralized way.

3.1 Fair-Efficient Reward

In the multi-agent system we consider, there are n agents and limited resources in the environment. The resources are non-excludable and rivalrous (common resources), e.g., CPU, memory, and network bandwidth. At each timestep, the environmental reward r an agent obtains is only related to its occupied resources at that timestep. We define the utility of agent i at timestep t as uti=1tj=0trji, which is the average reward over elapsed timesteps. We use the coefficient of variation (CV) of agents’ utilities 1n-1i=1n(ui-u¯)2u¯2 to measure fairness Jain et al. (1984), where u¯ is average utility of all agents. A system is said to be fairer if and only if the CV is smaller.

In multi-agent sequential decision-making, it is difficult for an individual agent to optimize the CV since it is not just related to the agent’s own policy, but the joint policies of all agents. However, as the resources are limited, the upper bound of u¯ can be easily reached by self-interested agents. Thus, u¯ is hardly affected by an individual agent and the contribution of agent i to the variance could be approximated as (ui-u¯)2/u¯2. We decompose the fairness objective for each agent and propose the fair-efficient reward

r^ti=u¯t/cϵ+|uti/u¯t-1|,

where c is a constant that normalizes the numerator and is set to the maximum environmental reward the agent obtains at a timestep. In the fair-efficient reward, u¯t/c can be seen as the resource utilization of the system, encouraging the agent to improve efficiency; |uti/u¯t-1| measures the agent’s utility deviation from the average, and the agent will be punished no matter it is above or below the average, which leads to low variance; ϵ is a small positive number to avoid zero denominator. Each agent i learns its own policy to maximize the objective Fi=𝔼[t=0γtr^ti], where γ is the discount factor. The fair-efficient reward allows each agent to respond to the behaviors of other agents, which can be summarized by u¯. Therefore, u¯ can actually coordinate agents’ policies in decentralized multi-agent learning.

Proposition 1. The optimal fair-efficient policy set 𝛑* is Pareto efficient in infinite-horizon sequential decision-making.

Proof. We prove by contradiction. We first prove the resources must be fully occupied. Since the decision-making is infinite-horizon, the resources could be allocated in any proportion in the time domain. Assume the resources are not fully used, there must exist another 𝝅, under which each agent could occupy the remaining resources according to the ratio of ui/nu¯. Then, we have |ui/u¯-1|=|ui/u¯-1|, but u¯/c>u¯/c. Thus, for each agent i, Fi>Fi, which contradicts the pre-condition that 𝝅* is optimal. We then prove 𝝅* is Pareto efficient. Assume Pareto efficiency is not achieved, there must exist i,uiuii,ui>ui, so i=1nui>i=1nui, which contradicts the pre-condition that the resources are fully occupied.

Proposition 2. The optimal fair-efficient policy set 𝛑* achieves equal allocation when the resources are fully occupied.

Proof. We prove by contradiction. Assume the allocation is not equal when the resources are fully occupied, i,uiu¯. There must exist another 𝝅, under which agents that have ui>u¯ can give up resources to make ui=u¯. Then, for the remaining resources and other agents, this is an isomorphic subproblem. According to Proposition 1, the resources will be fully occupied by other agents. After that, we have Fi>Fi. This process can be repeated until i,Fi>Fi, which contradicts the pre-condition that 𝝅* is optimal.

3.2 Hierarchy

A learning policy could feel ambiguous while considering both fairness and efficiency since they might conflict in some states. For example, if different behaviors of other agents cause the change of u¯, an agent may need to perform different action at a same state to maximize its fair-efficient reward. However, this is hard for a single learned policy.

Figure 1: FEN architecture.

To overcome this difficulty, we design a hierarchy that consists of a controller and several sub-policies parameterized respectively by θ and ϕ , illustrated in Figure 1. The controller selects one of sub-policies by the sampled index ztπθ(|ot) based on the partial observation ot. The controller receives the fair-efficient reward r^ and acts at a lower temporal resolution than the sub-policies. Every T timesteps, the controller chooses a sub-policy and in the next T timesteps, the chosen sub-policy outputs actions to interact with the environment.

To obtain efficiency, we designate one of the sub-policies parameterized by ϕ1 to maximize the reward r given by the environment. For other sub-policies, we exploit an information-theoretic objective to guide the sub-policies to explore diverse possible behaviors for fairness.

From the perspective of the controller, these sub-policies should be able to be distinguished from each other and thus the controller can have more choices. Obviously, we can not quantify the difference of sub-policies directly. However, the experienced observations under a sub-policy could indirectly reflect the policy. The more differences between sub-policies, the less the uncertainty of z is, given observation o under the policy. That is to say, the mutual information I(Z;O) should be maximized by the sub-policy and we take it as one term of the objective. On the other hand, to explore diverse possible behaviors, the sub-policy should act as randomly as possible, so we also maximize the entropy between the action and observation H(A|O). In summary, the objective of the sub-policy is to maximize

J(ϕ)=I(Z;O)+H(A|O)=H(Z)-H(Z|O)+H(A|O)-H(Z|O)+H(A|O)=𝔼zπθ,oπϕ[logp(z|o)]+𝔼aπϕ[-p(a|o)logp(a|o)].

As H(Z) is only related to θ, H(Z) can be seen a constant and be neglected. The controller just outputs the probability pθ(z|o), and thus the first term of the objective can be interpreted as that each sub-policy tries to maximize the expected probability that it would be selected by the controller. To maximize it, we can give sub-policies a reward r~=logpθ(z|o) at each timestep and use RL to train the sub-policies. The second term can be treated as an entropy regularization, which is differentiable and could be optimized by backpropagation.

The hierarchy reduces the difficulty of learning both efficiency and fairness. The controller focuses on the fair-efficient reward and learns to decide when to optimize efficiency or fairness by selecting the sub-policy, without directly interacting with the environment. Sub-policy ϕ1 learns to optimize the environmental reward, i.e., efficiency. Other sub-policies learn diverse behaviors to meet the controller’s demand of fairness. The fair-efficient reward changes slowly since it is slightly affected by immediate environmental reward the sub-policy obtains. Thus, the controller can plan over a long time horizon to optimize both efficiency and fairness, while the sub-policies only optimize their own objectives within the given time interval T.

3.3 Decentralized Training

The centralized policy has an advantage in coordinating all agents’ behaviors. However, the centralized policy is hard to train, facing the curse of dimensionality as the number of agents increases. FEN is a decentralized policy in both training and execution. Although each agent only focuses on its own fair-efficient reward, they are coordinated by the average consensus on utility.

In the decentralized training, each agent need to perceive the average utility u¯ to know its current utility deviation from the average. When the number of agents is small, it is easy to collect the utility from each agent and compute the average. When the number of agents is large, it may be costly to collect the utility from each agent in real-world applications. To deal with this, we adopt a gossip algorithm for distributed average consensus Xiao et al. (2007). Each agent i maintains the average utility u¯i and iteratively updates it by

u¯i(t+1)=u¯i(t)+j𝒩iwij×(u¯j(t)-u¯i(t)),

where u¯i(0)=ui, 𝒩i is the set of neighboring agents in agent i’s observation, and the weight wij=1/(max{di,dj}+1), where the degree di=|𝒩i|. The gossip algorithm is distributed and requires only limited communication between neighbors to estimate the average.

{algorithm} [H] \[email protected]@algorithmic[1] \STATEInitialize ui, u¯i, the controller θ and sub-policies ϕ\FOR episode = 1,, \STATEThe controller chooses one sub-policy ϕz \FORt=1,,max-episode-length \STATEThe chosen sub-policy ϕz acts to the environment and gets the reward {rtif z=1,logpθ(z|ot)else \IF t%T=0 \STATEUpdate ϕz using PPO \STATEUpdate u¯i (with gossip algorithm) \STATECalculate r^i=u¯i/cϵ+|ui/u¯i-1| \STATEThe controller reselects one sub-policy \ENDIF\ENDFOR\STATEUpdate θ using PPO \ENDFOR
Figure 2: FEN training

The training of FEN is detailed in Algorithm 2. The controller and sub-policies are trained both using PPO Schulman et al. (2017). The controller selects one sub-policy every T to interact with the environment. The selected sub-policy is updated based on the trajectory during T. The controller is updated based on the trajectory of every sub-policy selection and its obtained fair-efficient reward during each episode.

Figure 3: Illustration of experimental scenarios: job scheduling (left), the Matthew effect (mid), manufacturing plant (right).

4 Experiments

For the experiments, we design three scenarios as abstractions of job scheduling, the Matthew effect, and manufacturing plant, which are illustrated in Figure 3. We demonstrate that by each agent decentralizedly optimizing the fair-efficient reward the multi-agent system could obtain a great balance between efficiency and fairness, and that the hierarchy indeed helps to learn both fairness and efficiency more easily. In the experiments, we compare FEN against several baselines which have different optimization objectives and are summarized as follows.

  • Independent agents are fully self-interested and each agent maximizes its expected sum of discounted environmental rewards ψi=𝔼[t=0γtrti].

  • Inequity Aversion agents receive a shaped reward ri-αN-1max(rj-ri,0)-βN-1max(ri-rj,0) to model the envy and guilt Hughes et al. (2018).

  • Avg agents take the average reward of agents as a shared reward and maximize 𝚊𝚟𝚐ψ=ψi/n Peysakhovich and Lerer (2018).

  • Min agents consider the worst performance of agents and maximize 𝚖𝚒𝚗ψ.

  • Min+αAvg agents consider both the worst performance and system performance and maximize the regularized maximin fairness Zhang and Shah (2014), 𝚖𝚒𝚗ψ+α𝚊𝚟𝚐ψ.

Note that in the last three baselines, all agents share the same optimization objective. To ensure the comparison is fair, the basic hyperparameters are all the same for FEN and the baselines, which are summarized in Appendix. The details about the experimental setting of each scenario are also available in Appendix. Moreover, the code of FEN is at https://github.com/PKU-AI-Edge/FEN.

4.1 Job Scheduling

In this scenario of job scheduling, we investigate whether agents can learn to fairly and efficiently share a resource. In a 5×5 grid world, there are 4 agents and 1 resource, illustrated in Figure 3 (left). The resource’s location is randomly initialized in different episodes, but fixed during an episode. Each agent has a local observation that contains a square view with 3×3 grids centered at the agent itself. At each timestep, each agent can move to one of four neighboring grids or stay at current grid. If the agent occupies the resource (move to or stay at the resource’s location), it receives a reward of 1, which could be seen as the job is scheduled, otherwise the reward is 0. Two agents cannot stay at a same grid, making sure the resource can only be occupied by one agent at a timestep. We trained all the methods for five runs with different random seeds. All experimental results are presented with standard deviation (also in other two scenarios). Moreover, as all agents are homogeneous in the task, we let agents share weights for all the methods.

Table 1: Job scheduling
resource utilization CV min utility max utility
Independent 96% ±11% 1.57 ±0.26 0 0.88 ±0.17
Inequity Aversion 72% ±9% 0.69 ±0.17 0.04 ±0.01 0.35 ±0.12
Min 47% ±8% 0.30 ±0.07 0.07 ±0.02 0.16 ±0.05
Avg 84% ±7% 0.75 ±0.13 0.05 ±0.03 0.46 ±0.17
Min+αAvg 63% ±5% 0.39 ±0.03 0.09 ±0.03 0.24 ±0.06
FEN 𝟗𝟎% ±5% 0.17 ±0.05 0.18±0.03 0.28 ±0.07
FEN w/o Hierarchy 57% ±13% 0.22 ±0.06 0.10 ±0.03 0.18 ±0.11
 centralized policy Min 12% ±4% 0.82 ±0.11 0 0.06 ±0.03
Avg 61% ±5% 1.46 ±0.14 0 0.53 ±0.06
Min+αAvg 19% ±5% 0.57 ±0.05 0.02 ±0.01 0.09 ±0.03
 w/ Hierarchy Min 62% ±9% 0.31 ±0.11 0.09 ±0.02 0.21 ±0.05
Avg 84% ±6% 0.61 ±0.14 0.08 ±0.03 0.41 ±0.07
Min+αAvg 71% ±8% 0.28 ±0.09 0.11 ±0.04 0.26 ±0.06

Table 1 shows the performance of FEN and the baselines in terms of resource utilization (sum of utility), coefficient of variation (CV) of utility (fairness), min utility, and max utility. Independent has the highest resource utilization, but also the worst CV. Self-interested Independent agents would not give up the resource for fairness, which is also witnessed by that min utility is 0 and max utility is 0.88, close to resource utilization. FEN has the lowest CV and the highest min utility. As only one agent can use the resource at a time, fairly sharing the resource among agents inevitably incurs the reduction of resource utilization. However, FEN can obtain much better fairness at a subtle cost of resource utilization, and its resource utilization is slightly less than Independent. Maximizing avgψ causes high CV since the average is not directly related to fairness. Its resource utilization is also lower, because avgψ is determined by all the agents, making it hard for individual agents to optimize by decentralized training. For the same reason, minψ is hard to be maximized by individual agents. The regularized maximin fairness reward minψ+αavgψ is designed to obtain a balance between fairness and resource utilization. However, due to the limitations of these two objective terms, Min+αAvg is much worse than FEN. The CV of Inequity Aversion is better than Independent but still worse than FEN, and the resource utilization is much lower, showing modeling envy and guilt is not effective in fairness problems. Moreover, the hyperparameters α and β might greatly affect the performance.

Since minψ, avgψ, and minψ+αavgψ do not depend on individual agents, but all agents, we adopt a centralized policy, which takes all observations and outputs actions for all agents, to optimize each objective. As shown in Table 1, the centralized policies for Min, Avg, and Min+αAvg are even worse than their decentralized versions. Although the centralized policy could coordinate agents’ behaviors, it is hard to train because of the curse of dimensionality. We also tried the centralized policy in the Matthew effect and manufacturing plant, but it did not work and thus is omitted.

Does the hierarchy indeed help the learning of FEN? To verify the effect of the hierarchy, we trained a single policy to maximize the fair-efficient reward directly without the hierarchy. Figure 5 illustrates the learning curves of FEN and FEN w/o Hierarchy, where we can see that FEN converges to a much higher mean fair-efficient reward than FEN w/o Hierarchy. As shown in Table 1, although FEN w/o Hierarchy is fairer than other baselines, the resource utilization is mediocre. This is because it is hard for a single policy to learn efficiency from the fair-efficient reward. However, in FEN, one of the sub-policies explicitly optimizes the environmental reward to improve the efficiency, other sub-policies learn diverse fairness behaviors, and the controller optimizes fair-efficient reward by long time horizon planing. The hierarchy successfully decomposes the complex objective and reduce the learning difficulty.

Figure 4: Learning curves of FEN and FEN w/o Hierarchy in job scheduling.
Figure 5: Probability of selecting different sub-policies in terms of (ui-u¯)/u¯.

To further verify the effectiveness of the hierarchy, we use the hierarchy with other baselines. The controller maximizes each own objective and the sub-policies are the same as FEN. Table 1 shows their performance has a certain degree of improvement, especially the resource utilizations of Min and Min+αAvg raise greatly and the CV of Min+αAvg reduces significantly. That demonstrates the hierarchy we proposed could reduce learning difficulty in many general cases with both global and local objectives. However, these baselines with the hierarchy are still worse than FEN in both resource utilization and CV, verifying the effectiveness of the fair-efficient reward.

In order to analyze the behavior of the controller, in Figure 5 we visualize the probability of selecting sub-policy ϕ1 and other sub-policies in terms of the utility deviation from average, (ui-u¯)/u¯. It shows when the agent’s utility is below average, the controller is more likely to select ϕ1 to occupy the resources, and when the agent’s utility is above average, the controller tends to select other sub-policies to improve fairness. The controller learns the sensible strategy based on the fair-efficient reward.

4.2 The Matthew Effect

In this scenario of the Matthew effect, we investigate whether agents can learn to mitigate/avoid the Matthew effect. In the scenario, there are 10 pac-men (agents) initialized with different positions, sizes, and speeds and also 3 stationary ghosts initialized at random locations, illustrated in Figure 3 (mid). Each pac-man can observe the nearest three other pac-men and the nearest one ghost. It could move to one of four directions or stay at current position. Once the distance between the pac-men and a ghost is less than the agent’s size, the ghost is consumed and the agent gets a reward 1. Then, a new ghost will be generated at a random location. When the agent gets a reward, its size and speed will increase correspondingly until the upper bounds are reached. In this scenario, the pac-man who consumes more ghosts becomes larger and faster, making consume ghosts easier. So, there exists inherent inequality in the setting. We trained all the models for five runs with different random seeds. As pac-men are homogeneous, we let pac-men share weights for all the methods.

Figure 6: The Matthew effect

Figure 6 shows the performance of FEN and the baselines in terms of social welfare (total ghosts consumed by all the pac-men), CV, and min and max income (consumed ghosts) among pac-men, episodes to converge. The detailed results are available in Appendix. Random pac-men take random actions and their CV shows the inherent unfairness of this scenario. Min pac-men cannot learn reasonable policies, because minψ is always closed to 0. Min+αAvg is only a little fairer than Avg since the effect of minψ is very weak. Independent causes the Matthew effect as indicated by the min (close to 0) and max (more than 200) income, where pac-man with initial larger size becomes larger and larger and ghosts are mostly consumed by these larger pac-men. Inequity Aversion is slightly fairer than Independent but lower social welfare.

Although Independent has the largest pac-man which consumes ghosts faster than others, this does not necessarily mean they together consume ghosts fast. FEN is not only fairer than the baselines but also has the highest social welfare, even higher than Independent. FEN pac-men have similar sizes and consume more ghosts than the baselines. This demonstrates FEN is capable of tackling the Matthew effect and helps social welfare increase. FEN w/o Hierarchy focuses more on fairness, neglecting the efficiency as in the scenario of job scheduling. Moreover, learning without hierarchy is much slower than FEN in this scenario, as illustrated in Figure 7. FEN w/o Hierarchy takes about 6000 episodes, while FEN takes only about 300 episodes, confirming that the hierarchy indeed speeds up the training.

Figure 7: Learning curves of FEN, FEN w/ Gossip, and FEN w/o Hierarchy in the Matthew effect.

Does distributed average consensus affect the performance of FEN? Instead of using the centrally computed average utility, we employ the gossip algorithm to estimate the average utility, where each agent only exchanges information with the agents in its observation. As shown in Figure 6, FEN w/ Gossip performs equivalently to FEN with only slight variation on each performance metric. The learning curve of FEN w/ Gossip is also similar to FEN, as illustrated in Figure 7. These confirm that FEN can be trained in a fully decentralized way.

Do sub-policies really learn something useful? To answer this question, after the training of FEN, we keep the learned weights θ and ϕ1 and replace other sub-policies with a random sub-policy. Once the controller chooses other sub-policies instead of ϕ1, the agent will perform random actions. In this FEN w/ random sub-policy, the min income become lower than FEN and CV becomes higher, because the random sub-policy cannot provide fairness behavior the controller requests. To investigate the difference of learned sub-policies and random sub-policy, we fix the three ghosts as a triangle at the center of the field and visualize the distribution of an agent’s positions under each sub-policy, as illustrated in Figure 8. It is clear that the learned sub-policies keep away from the three ghosts for fairness and their distributions are distinct, concentrated at different corners, verifying the effect of the information-theoretic reward.

4.3 Manufacturing Plant

In this scenario of manufacturing plant, we investigate whether agents with different needs can learn to share different types of resources and increase the production in a manufacturing plant. In a 8×8 grid world, there are 5 agents and 8 gems, as illustrated in Figure 3 (right). The gems belong to three types (y, g, b). Each agent has a local observation that contains a square view with 5×5 grids centered at the agent itself, and could move to one of four neighboring grids or stay. When an agent moves to the grid of one gem, the agent collects the gem and gets a reward of 0.01, and then a new gem with random type and random location is generated. The total number of gems is limited, and when all the gems are collected by the agents, the game ends. Each agent has a unique requirement of numbers for the three types of gems to manufacture a unique part of the product and receive a reward 1. Each product is assembled by the five unique parts manufactured by the five agents, respectively. So, the number of manufactured products is determined by the least part production among the agents. Due to the heterogeneity, we let each agent learn its own weights for FEN and the baselines.

Figure 8: Visualization of learned sub-policies (left three) and random sub-policy (right) in the Matthew effect.

Table 2 shows the performance of FEN and the baselines in terms of resource utilization (the ratio of the number of gems consumed to manufacture the products over the total number of gems), CV, number of products (minimum number of manufactured parts among agents), and max number of manufactured parts among agents. In this scenario, agents need to learn to collect the right gems and then to balance the parts manufactured by each agent (i.e., manufacturing similar large number of parts), because the unused collected gems and redundant parts will be wasted. FEN manufactures the most products, more than two times than the baselines. The more products are assembled, the higher the resource utilization is. Thus, FEN also has the highest resource utilization. Moreover, FEN is also the fairest one. Although FEN w/o Hierarchy agents are fairer than other baselines, they all manufacture less parts and hence eventually less products. Avg agents assemble the least products, though one agent manufactures the largest number of parts, resulting in serious waste.

Table 2: Manufacturing plant
resource utilization CV no. products max parts
Independent 28% ±5% 0.38 ±0.08 19 ±3 58 ±8
Inequity Aversion 27% ±6% 0.27 ±0.06 19 ±4 42 ±7
Min 29% ±6% 0.26 ±0.01 20 ±4 41 ±7
Avg 13% ±3% 0.63 ±0.07 9 ±2 71 ±9
Min+αAvg 34% ±6% 0.28 ±0.01 23 ±4 45 ±7
FEN 𝟖𝟐% ±5% 0.10 ±0.03 𝟒𝟖 ±3 63 ±3
FEN w/o Hierarchy 22% ±3% 0.18 ±0.07 15 ±1 24 ±4

5 Conclusion

We have proposed FEN, a novel hierarchical reinforcement learning model to learn both fairness and efficiency, driven by fair-efficient reward, in multi-agent systems. FEN consists of one controller and several sub-policies, where the controller learns to optimize the fair-efficient reward, one sub-policy learns to optimize the environmental reward, and other sub-policies learn to provide diverse fairness behaviors guided by the derived information-theoretic reward. FEN can learn and execute in a fully decentralized way, coordinated by average consensus. It is empirically demonstrated that FEN easily learns both fairness and efficiency and significantly outperforms baselines in a variety of multi-agent scenarios including job scheduling, the Matthew effect, and manufacturing plant.

Acknowledgments

This work was supported in part by Huawei Noah’s Ark Lab, Peng Cheng Lab, and NSF China under grant 61872009.

References

  • [1] P. Bacon, J. Harb, and D. Precup (2017) The option-critic architecture. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2.
  • [2] B. Bakker, S. Whiteson, L. Kester, and F. C. Groen (2010) Traffic light control by multiagent reinforcement learning systems. In Interactive Collaborative Information Systems, pp. 475–510. Cited by: §1.
  • [3] A. Beynier, N. Maudet, and A. Damamme (2018) Fairness in multiagent resource allocation with dynamic and partial observations. In International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), Cited by: §1, §2.
  • [4] T. Bol, M. de Vaan, and A. van de Rijt (2018) The matthew effect in science funding. Proceedings of the National Academy of Sciences 115 (19), pp. 4887–4890. Cited by: §2.
  • [5] Y. Chen, J. K. Lai, D. C. Parkes, and A. D. Procaccia (2013) Truth, justice, and cake cutting. Games and Economic Behavior 77 (1), pp. 284–297. Cited by: §1, §2.
  • [6] J. Cui, Y. Liu, and A. Nallanathan (2018) Multi-agent reinforcement learning based resource allocation for uav networks. arXiv preprint arXiv:1810.10408. Cited by: §1, §2.
  • [7] P. Dayan and G. E. Hinton (1993) Feudal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [8] S. de Jong, K. Tuyls, K. Verbeeck, and N. Roos (2007) Considerations for fairness in multi-agent systems. In ALAMAS, pp. 104–110. Cited by: §1, §2.
  • [9] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §1, §2.
  • [10] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman (2017) Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767. Cited by: §2.
  • [11] Z. Hong, S. Su, T. Shann, Y. Chang, and C. Lee (2018) A deep policy inference q-network for multi-agent systems. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), Cited by: §2.
  • [12] E. Hughes, J. Z. Leibo, M. Phillips, K. Tuyls, E. Dueñez-Guzman, A. G. Castañeda, I. Dunning, T. Zhu, K. McKee, R. Koster, et al. (2018) Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2, 2nd item, Hyperparameters.
  • [13] R. Jain, P. R. Panda, and S. Subramoney (2017) Cooperative multi-agent reinforcement learning-based co-optimization of cores, caches, and on-chip network. ACM Transactions on Architecture and Code Optimization 14 (4), pp. 32. Cited by: §1, §2.
  • [14] R. K. Jain, D. W. Chiu, and W. R. Hawe (1984) A quantitative measure of fairness and discrimination. Technical report Cited by: §3.1.
  • [15] J. Jiang, C. Dun, and Z. Lu (2018) Graph convolutional reinforcement learning for multi-agent cooperation. arXiv preprint arXiv:1810.09202. Cited by: §1.
  • [16] J. Jiang and Z. Lu (2018) Learning attentional communication for multi-agent cooperation. Advances in Neural Information Processing Systems (NeurIPS). Cited by: §1, §2.
  • [17] Y. Jiang and Z. Huang (2012) The rich get richer: preferential attachment in the task allocation of cooperative networked multiagent systems with resource caching. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 42 (5), pp. 1040–1052. Cited by: §2.
  • [18] I. Kash, A. D. Procaccia, and N. Shah (2014) No agent left behind: dynamic fair division of multiple resources. Journal of Artificial Intelligence Research 51, pp. 579–603. Cited by: §1, §2.
  • [19] X. Li, J. Zhang, J. Bian, Y. Tong, and T. Liu (2019) A cooperative multi-agent reinforcement learning framework for resource balancing in complex logistics network. arXiv preprint arXiv:1903.00714. Cited by: §1, §2.
  • [20] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2.
  • [21] D. Minarolli and B. Freisleben (2013) Virtual machine resource allocation in cloud computing via multi-agent fuzzy control. In International Conference on Cloud and Green Computing, Cited by: §1.
  • [22] O. Nachum, S. S. Gu, H. Lee, and S. Levine (2018) Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [23] M. Perc (2014) The matthew effect in empirical data. Journal of The Royal Society Interface 11 (98), pp. 20140378. Cited by: §2.
  • [24] A. Peysakhovich and A. Lerer (2018) Prosocial learning agents solve generalized stag hunts better than selfish ones. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), Cited by: §1, §2, 3rd item.
  • [25] A. D. Procaccia (2009) Thou shalt covet thy neighbor’s cake. In International Joint Conference on Artificial Intelligence (IJCAI), Cited by: §1, §2.
  • [26] N. C. Rabinowitz, F. Perbet, H. F. Song, C. Zhang, S. Eslami, and M. Botvinick (2018) Machine theory of mind. arXiv preprint arXiv:1802.07740. Cited by: §2.
  • [27] T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. Foerster, and S. Whiteson (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485. Cited by: §1, §2.
  • [28] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §3.3.
  • [29] S. Sukhbaatar, R. Fergus, et al. (2016) Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2.
  • [30] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al. (2017) Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296. Cited by: §1, §2.
  • [31] R. S. Sutton, D. Precup, and S. Singh (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1-2), pp. 181–211. Cited by: §2.
  • [32] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu (2017) Feudal networks for hierarchical reinforcement learning. In International Conference on Machine Learning (ICML), Cited by: §2.
  • [33] J. X. Wang, E. Hughes, C. Fernando, W. M. Czarnecki, E. A. Duéñez-Guzmán, and J. Z. Leibo (2019) Evolving intrinsic motivations for altruistic behavior. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), Cited by: §1, §2.
  • [34] L. Xiao, S. Boyd, and S. Kim (2007) Distributed average consensus with least-mean-square deviation. Journal of parallel and distributed computing 67 (1), pp. 33–46. Cited by: §3.3.
  • [35] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang (2018) Mean field multi-agent reinforcement learning. In International Conference on Machine Learning (ICML), Cited by: §1, §2.
  • [36] Y. Zeng and S. Zuo (2019) The matthew effect in computation contests: high difficulty may lead to 51% dominance. arXiv preprint arXiv:1902.09089. Cited by: §2.
  • [37] C. Zhang and J. A. Shah (2014) Fairness in multi-agent sequential decision-making. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2, 5th item.

Appendix

Hyperparameters

In the experiments, we use PPO for every RL agent. The PPO structure keeps same for the controller and sub-policies of FEN, and also for the baselines. The value network and policy network are MLPs with two 256-unit hidden layers and ReLU activation. The learning rates for the value network and policy network of PPO are 10-3 and 3×10-4, respectively. We trained all networks using Adam optimizer. The discounted factor is γ=0.98. For Min+αAvg agents, α=0.01. For Inequity Aversion agents, α=5 and β=0.05, the setting used in [12]. For FEN, the number of the sub-policies is 4, ϵ in the fair-efficient reward is set to 0.1. T is 25, 50, and 50 in job scheduling, the Matthew effect, and manufacturing plant, respectively.

Experimental Settings

In the scenario of job scheduling, we trained all the models for five runs with different random seeds, each episode contains 1000 timesteps.

In the scenario of the Matthew effect, the size and speed of pac-man are randomly initiated between (0.01,0.04) and (0.018,0.042), respectively. Each time after a pac-men consumes a ghost, its size will increase by 0.005 and the speed will increase by 0.004. The max size is 0.15 and the max speed is 0.13. We trained all the models for five runs with different random seeds, where each episode contains 1000 timesteps.

In the scenario of manufacturing plant, the total number of gems is 700. The unique requirements of numbers of three types of gems (Ly,Lg,Lb) are (2,1,0),(1,0,1),(0,1,1),(1,1,0),(0,1,2), respectively, for the five agents. We trained all the models for five runs with different random seeds, where each episode ends after all the gems are collected by agents.

All the experiments were conducted on Dell XPS desktops with i7-8700k 6-Core processors and 1080TI GPUs.

Experimental Results

The details of the experimental results in the Matthew effect are shown in Table 3.

Table 3: The Matthew effect
social welfare CV min income max income episodes
Random 84 ±30 0.93 ±0.25 1 22 ±10 -
Independent 791 ±62 0.86 ±0.11 1 202 ±33 100
Inequity Aversion 702 ±90 0.80 ±0.16 2 152 ±18 1000
Min 18 ±8 2.04 ±0.66 0 7±4 6000
Avg 527±113 0.86 ±0.21 2 126±18 1000
Min+αAvg 441±75 0.85 ±0.18 1±1 103±16 2000
FEN 𝟖𝟑𝟎±22 0.06 ±0.01 𝟕𝟗±2 94±3 300
FEN w/ Gossip 𝟖𝟒𝟏±55 0.07 ±0.01 𝟕𝟔±4 95±4 300
FEN w/o Hierarchy 251±12 0.06 ±0.04 23±2 26±1 6000
FEN w/ Random Sub-policy 834±47 0.08 ±0.02 66±7 99±6 -