Federated Deep Reinforcement Learning

  • 2020-02-09 12:30:36
  • Hankz Hankui Zhuo, Wenfeng Feng, Yufeng Lin, Qian Xu, Qiang Yang
  • 0


In deep reinforcement learning, building policies of high-quality ischallenging when the feature space of states is small and the training data islimited. Despite the success of previous transfer learning approaches in deepreinforcement learning, directly transferring data or models from an agent toanother agent is often not allowed due to the privacy of data and/or models inmany privacy-aware applications. In this paper, we propose a novel deepreinforcement learning framework to federatively build models of high-qualityfor agents with consideration of their privacies, namely Federated deepReinforcement Learning (FedRL). To protect the privacy of data and models, weexploit Gausian differentials on the information shared with each other whenupdating their local models. In the experiment, we evaluate our FedRL frameworkin two diverse domains, Grid-world and Text2Action domains, by comparing tovarious baselines.


Quick Read (beta)

Federated Deep Reinforcement Learning

Hankz Hankui Zhuo    Wenfeng Feng    Yufeng Lin    Qian Xu    Qiang Yang

In deep reinforcement learning, building policies of high-quality is challenging when the feature space of states is small and the training data is limited. Despite the success of previous transfer learning approaches in deep reinforcement learning, directly transferring data or models from an agent to another agent is often not allowed due to the privacy of data and/or models in many privacy-aware applications. In this paper, we propose a novel deep reinforcement learning framework to federatively build models of high-quality for agents with consideration of their privacies, namely Federated deep Reinforcement Learning (FedRL). To protect the privacy of data and models, we exploit Gausian differentials on the information shared with each other when updating their local models. In the experiment, we evaluate our FedRL framework in two diverse domains, Grid-world and Text2Action domains, by comparing to various baselines.

1 Introduction

In deep reinforcement learning, building policies of high-quality is challenging when the feature space of states is small and the training data is limited. In many real-world applications, however, datasets from clients are often privacy sensitive (Duchi et al., 2012) and it is often difficult for such a data center to guarantee building models of high-quality. To deal with the issue, Konecny et al. propose a new learning setting, namely federated learning, whose goal is to train a classification or clustering model with training data involving texts, images or videos distributed over a large number of clients (Konecný et al., 2016; McMahan et al., 2017). Different from previous federated learning setting (c.f. (Yang et al., 2019)), we propose a novel federated learning framework based on reinforcement learning (Sutton and Barto, 1998; Mnih et al., 2015; Co-Reyes et al., 2018), i.e., Federated deep Reinforcement Learning (FedRL), which aims to learn a private Q-network policy for each agent by sharing limited information (i.e., output of the Q-network) among agents. The information is “encoded” when it is sent to others and “decoded” when it is received by others. We assume that some agents have rewards corresponding to states and actions, while others have only observed states without rewards. Without rewards, those agents are unable to build decision policies on their own information. We claim that all agents benefit from joining the federation in building decision policies.

There are many applications regarding federated reinforcement learning. For example, in the manufacturing industry, producing products may involve various factories which produce different components of the products. Factories’ decision policies are private and will not be shared with each other. On the other hand, building individual decision policies of high-quality on their own is often difficult due to their limited businesses and lack of rewards (for some factories). It is thus helpful for them to learn decision polices federatively under the condition that private data is not given away. Another example is building medical treatment policies to patients for hospitals. Patients may be treated in some hospitals and never give feedbacks to the treatments, which indicates these hospitals are unable to collect rewards based on the treatments given to the patients and build treatment decision policies for patients. In addition, data records about patients are private and may not be shared among hospitals. It is thus necessitated to learn treatment policies for hospitals federatively.

Our FedRL framework is different from multi-agent reinforcement learning, which is concerned with a set of autonomous agents that observe global states (or partial states which are directly shared to make “global” states), select an individual action and receive a team reward (or each agent receives an individual reward but shares it with other agents) (Tampuu et al., 2015; Leibo et al., 2017; Foerster et al., 2016). FedRL assumes agents do not share their partial observations and some agents are unable to receive rewards. Our FedRL framework is also different from transfer learning in reinforcement learning, which aims to transfer experience gained in learning to perform one task to help improve learning performance in a related but different task or agent, assuming observations are shared with each other (Taylor and Stone, 2009; Tirinzoni et al., 2018), while FedRL assumes states cannot be shared among agents.

Our FedRL framework functions in three phases. Initially, each agent collects output values of Q-networks from other agents, which are “encrypted” with Gausian differentials. Furthermore, it builds a shared value network, e.g., MLP (multilayer perceptron), to compute a global Q-network output with its own Q-network output and the encrypted values as input. Finally, it updates both the shared value network and its own Q-network based on the global Q-network output. Note that MLP is shared among agents while agents’ own Q-networks are unknown to others and should not be inferred based on the encrypted Q-network output shared in the training process.

In the remainder of the paper, we first review previous work related to our FedRL framework, and then present the problem formulation of FedRL. After that, we introduce our FedRL framework in detail. Finally we evaluate our FedRL framework in the Grid-World domain with various sizes and the Text2Actions domain.

2 Related Work

The nascent field of federated learning considers training statistical models directly on devices (Konecný et al., 2015; McMahan et al., 2017). The aim in federated learning is to fit a model to data generated by distributed nodes. Each node collects data in a non-IID manner across the network, with data on each node being generated by a distinct distribution. There are typically a large number of nodes in the network, and communication is often a significant bottleneck. Different from previous work that train a single global model across the network (Konecný et al., 2015, 2016; McMahan et al., 2017), Smith et al. propose to learn separate models for each node which is naturally captured through a multi-task learning (MTL) framework, where the goal is to consider fitting separate but related models simultaneously (Smith et al., 2017). Different from those federated learning approaches, we consider federated settings in reinforcement learning.

Our work is also related to multi-agent reinforcement learning (MARL) which involves a set of agents in a shared environment. A straightforward way to MARL is to extend the single-agent RL approaches. Q-learning has been extended to cooperative multi-agent settings, namely Independent Q-learning (IQL), in which each agent observes the global state, selects an individual action and receives a team reward (Tampuu et al., 2015; Leibo et al., 2017). One challenging of MARL is that multi-agent domains are non-stationary from agent’s perspectives, due to other agents’ interactions in the shared environment. To address this issue, (Omidshafiei et al., 2017) propose to explore Concurrent Experience Replay Trajectories (CERTs) structures, which store different agents’ histories, and align them together based on the episode indices and time steps. Due to the action space growing exponentially with the number of agents, learning becomes very difficult due to partial observability of limited communication when the number of agents is large. (Lowe et al., 2017) thus propose to solve the MARL problem through a Centralized Critic and a Decentralized Actor, and (Rashid et al., 2018) propose to exploit a linear decomposition of the joint value function across agents. Different from MARL, our FedRL framework assumes agents do not share their partial observations and some agents are unable to receive rewards, instead of assuming observations are sharable and all agents are able to receive rewards.

3 Problem Definition

A Markov Decision Process (MDP) can be defined by S,A,T,r, where S is the state space, and A is the action space. T is the transition function: S×AS, i.e., T(s,a,s)=P(s|s,a), specifying the probability of next state sS given current state sS and aA that applies on s. r is the reward function: S, where is the space of real numbers. Given a policy π:SA, the value function Vπ(s) and the Q-function Qπ(s,a) at step t+1, can be updated by their t step:




for t{0,,K-1}. The solution to an MDP problem is the best policy π* such that Vπ*(s)=maxπVπ(s) or Qπ*(s,π*(s))=maxπQπ(s,π(s)). In DQN (Deep Q-Network), given transition function T is unknown, the Q-function is represented by a Q-network Q(s,a;θ) with θ as parameters of the network, and updated by


as done by (Mnih et al., 2015). To learn the parameters θ, one way is to store transitions s,a,s,r in replay memories Ω and exploit a mini-batch sampling to repeatedly update θ (Mnih et al., 2015). Once θ is learnt, the policy π* can be extracted from Q(s,a;θ),


We define our federated deep reinforcement learning problem by: given transitions Dα={sα,aα,sα,rα} collected by agent α, and pairs of states and actions Dβ={sβ,aβ} collected by agent β, we aim to federatively build policies πα* and πβ* for agents α and β, respectively. Note that in this paper we consider the federation with two members for simplicity. The setting can be extended to many agents by exploiting the same federated mechanism between each two agents. We denote states, actions, Q-functions, and policies with respect to agents α and β, by “sαSα, aαAα, Qα, πα*” and “sβSβ, aβAβ, Qβ, πβ*”, respectively.

In our federated deep reinforcement learning problem, we assume:

A1: The feature spaces of states sα and sβ are different between agents α and β. For example, a state sα denotes a patient’s cardiogram in hospital α, while another state sβ denotes the same patient’s electroencephalogram in hospital β, indicating the feature spaces of sα and sβ are different.

A2: Transitions 𝒟α and 𝒟β cannot be shared directly between α and β when they learning their own models. The correspondences between transitions from 𝒟α and 𝒟β are, however, known to each other. In other words, agent α can send the “ID” of a transition to agent β, and agent β can use that “ID” to find its corresponding transition in 𝒟β. For example, in hospital, “ID” can correspond to a specific patient.

A3: The output of functions Qα and Qβ can be shared with each other under the condition that they are protected by some privacy protection mechanism.

Based on A1-A3, we aim to learn policies πα* and πβ* of high-quality for agents α and β by preserving privacies of their own data and models.

4 Our FedRL Approach

In this section we present our FedRL framework in detail. An overview of FedRL is shown in Figure 1, where the left part is the model of agent α, and the right part is the model of agent β. Each model is composed of a local Q-network with parameters θα for agent α, θβ for agent β, and a global MLP module with parameters θg for both α and β.

Figure 1: The networks of agents α and β

Basic Q-networks

We build two Q-networks for agents α and β, denoted by Qα(sα,aα;θα) and Qβ(sβ,aβ;θβ), respectively, where θα and θg are parameters of the Q-networks. The outputs of these two basic Q-networks are not directly used to predict the actions, but taken as input of the MLP module.

Gaussian differential privacy

To avoid agents “inducing” models of each other according to repeatedly received outputs of other’s Q-network during training, we consider using differential privacy (Dwork and Roth, 2014) to protect the output of agent’s Q-network. There are various mechanisms of differential privacy such as the Gaussian mechanism (Abadi et al., 2016) and Binomial mechanism (Agarwal et al., 2018). In this paper, we exploit Gaussian mechanism since the output of the MLP with Gaussian input is Gaussian itself. In previous federated learning settings, the mechanism is applied to the gradients of agents (clients) before being sent to the server or other agents (clients). In our FedRL framework, we send the output of one agent’s Q-network to another, so the Gaussian noise is added to the output rather than the gradients. The mechanism is defined by

Q^α(sα,aα;θα)=Qα(sα,aα;θα)+N(0,σ2) (1)
Q^β(sβ,aβ;θβ)=Qβ(sβ,aβ;θβ)+N(0,σ2) (2)

where N(0,σ2) is the Gaussian distribution with mean 0 and standard deviation σ.

Federated Q-network

We build a new Q-network, namely federated Q-network denoted by Qf, to leverage the outputs of the two basic Q-networks with Gaussian noise based on MLP, which is defined by

MLP([Q^α(sα,aα;θα)|Q^β(sβ,aβ;θβ)];θg) (3)

where θg is the parameters of MLP and [|] indicates the concatenation operation. Note that parameters of an MLP can be shared between agents. Once MLP is updated, the updated parameters are shared with the other agent.

With respect to agents α and β, we define each agent’s federated Q-networks by viewing the other agent’s basic Q-network (with Gaussian noise) as a fixed constant when updating its own basic Q-network, as shown below,

Qfα(,Cβ;θα,θg)=MLP([Q^α(;θα)|Cβ];θg) (4)
Qfβ(,Cα;θβ,θg)=MLP([Cα|Q^β(;θβ)];θg) (5)

where Cα=Q^α(sα,aα;θα) and Cβ=Q^β(sβ,aβ;θβ) are fixed constants when updating agent β’s basic Q-network and agent α’s basic Q-network, respectively.

The Q-networks are trained by minimizing the square error loss Lαj(θα,θg) and Lβj(θβ,θg) of agents α and β,

Lαj(θα,θg)=𝔼[(Yj-Qfα(sαj,aαj,Cβ;θα,θg))2] (6)
Lβj(θβ,θg)=𝔼[(Yj-Qfβ(sβj,aβj,Cα;θβ,θg))2] (7)

where Yj=rj+γmaxaQfα(sαj,a,Cβ;θα,θg). Note that agent β is unable to compute Yj since it does not have reward r. Yj is computed by agent α and shared with agent β.

Figure 2: The training and testing procedures

Overview of training and testing

The training and testing procedures of agents α and β are shown in Figure 2. When training, agent α computes Yj and sends it to agent β. Agent β updates θβ and θg, computes Cβ, and sends θg and Cβ to agent α. After that, agent α updates θα and θg, computes Cα, and sends θg and Cα to agent β. When testing, both agents compute and send Cα and Cβ to each other for computing Qfα and Qfβ.

The detailed training procedure can be seen from Algorithms 1 and 2. In Steps 1 and 2 of Algorithm 1, we initialize the basic Q-network and replay memory of agent α and the MLP module. In Step 3 of Algorithm 1, we call the function from Algorithm 2 to initialize the Q-network and replay memory of agent β. In Step 6 of Algorithm 1, we obtain observation of agent α’s state sαt. In Step 7 of Algorithm 1, we call the function in Algorithm 2 to calculate the output of basic Q-network of agent β and obtain observation of agent β and select the corresponding action, as shown in Steps from 6 to 10 of Algorithm 2. In Steps from 8 and 11 of Algorithm 1, we perform the ϵ-greedy exploration and exploitation, obtain new observations and store the transitions to the replay memory. In Steps 12 and 13 of Algorithm 1, we sample a record j in the memory and call the function ComputeQBeta(j) of Algorithm 2 to calculate the output of basic Q-network of agent β based on the index j. In Steps 14 and 15 of Algorithm 1, we update parameters θα and θg. In Steps 16 and 17 of Algorithm 1, we compute the output of basic Q-network of agent α and pass it to agent β and call the function UpdateQ of agent β to update basic Q-network of agent β and MLP, as shown in Steps from 19 to 21 of Algorithm 2. Note that Algorithms 1 and 2 are executed by agents α and β separately.

Algorithm 1 FedRL-ALPHA

Input: state space Sα, action space Aα, rewards r
Output: θα,θg \[email protected]@algorithmic[1] \STATEInitialize Qα,Qf with random values for θα,θg \STATEInitialize replay memory Dα \STATECall FedRL-BETA.Init() \FORepisode = 1: M \REPEAT\STATEObserve sαt \STATECall Cβ=FedRL-BETA.ComputeQBeta() \STATESelect action at with probability ϵ \STATEOtherwise at=argmaxaQfα(sαt,a,Cβ;θα,θg) \STATEExecute action at, obtain reward rt and state st+1 \STATEObserve sαt+1, store (sαt,at,rt,sαt+1) in Dα \STATESample (sαj,aj,rj,sαj+1) from Dα


Call Cβ=FedRL-BETA.ComputeQBeta(j) \STATEYj=rj+γmaxaQfα(sαj,a,Cβ;θα,θg) \STATEUpdate θα,θg according to Eq. (4), (6) \STATECα=Q^α(sαj,a;θα) \STATECall θg=FedRL-BETA.UpdateQ(Yj,j,Cα,θg)


terminal t \ENDFOR

Algorithm 2 FedRL-BETA

Input: state space Sβ, action space Aβ
Output: θβ, θg \[email protected]@algorithmic[1] \FUNCTIONInit() \STATEInitialize Qβ with random values for θβ \STATEInitialize replay memory Dβ \ENDFUNCTION\FUNCTIONComputeQBeta() \STATEObserve sβ \STATESelect aβAβ with probability ϵ \STATEOtherwise aβ=argmaxaβQβ(sβ,aβ;θβ) \STATEStore (sβ,aβ) in Dβ \STATELet Cβ=Q^β(sβ,a;θβ) \STATEreturn Cβ \ENDFUNCTION\FUNCTIONComputeQBeta(j) \STATESelect (sβ,aβ) from Dβ based on index j \STATELet Cβ=Q^β(sβ,aβ;θβ) \STATEreturn Cβ \ENDFUNCTION\FUNCTIONUpdateQ(Yj,j,Cα,θg) \STATESelect (oβj,aβj) from Dβ based on index j \STATEUpdate θβ, θg based on Eq. (5), (7) \STATEreturn θg \ENDFUNCTION

5 Experiments

In the experiment, we evaluate FedRL11 1 The source code and datasets are available from https://github.com/FRL2019/FRL in the following two aspects. Firstly, we would like to see if agent α can learn better policies by joining the federation than without joining the federation (agent α can build policies by itself since it has complete transitions {s,a,s,r}, while agent β cannot build policies without joining the federation since it has only pairs of states and actions {s,a}). Secondly, we would like to see if our FedRL approach can learn policies of high-quality which are close to the ones learnt by directly combining data of both agent α and β, neglecting the data-privacy issue between α and β. To do this, we compared our FedRL approach to the following baselines:

  • DQN-alpha: which is a deep Q-network (Mnih et al., 2015) trained with agent α’s data only. It takes observations sα as input and outputs actions corresponding to sα.

  • DQN-full: which is a deep Q-network trained by directly putting data together from both agents α and β, i.e., neglecting data privacy between agents α and β.

  • FCN-alpha: which is a fully convolutional network (FCN) trained with agent α’s data only, similar to DQN-alpha. FCN-alpha aims to build policies via supervised learning (Furuta et al., 2019), i.e., viewing states as input and actions as labels.

  • FCN-full: which is a convolutional network trained with all data of agent α and β put together directly, similar to DQN-full.

In our experiment, we used the kernel of size 3×3 in CNN and two fully-connected layers with the size of 32×4 in MLP. We set the standard deviation in Gaussian differential privacy σ to be 1. We adopted the Adam optimizer with learning rate 0.001 and the ReLU activation for all models. We evaluated our FedRL with comparison to baselines in two domains, i.e., Grid-World and Text2Action.

5.1 Evaluation in Grid-World Domain

We first conducted experiments in Grid-World domain with randomly placed obstacles (Tamar et al., 2016). The two agents α and β were randomly put in the environment as well, as shown in Figure 3. The two agents aim to navigate through optimal pathes (i.e. the shortest pathes without hitting obstacles) to meet with each other. In this domain, we define states, actions and rewords as follows.

Figure 3: A 32×32 grid-world domain with agents α and β. The rectangles in blue are boundaries of observations.
  • States: The domain is represented by a Ng×Ng binary-valued matrix (0 for obstacle, 1 otherwise), where Ng is the size of the domain. We evaluated our approach with respect to different sizes, i.e., Ng=8,16,32. The observed state of agent α, denoted by sα, was set to be a 3×3 matrix with the current position of agent α in the center of the matrix. The observed state of agent β, denoted by sβ, was set to be a 5×5 matrix with the current position of agent β in the center of the matrix.

  • Actions: There are 4 actions for each agent, i.e. going towards 4 directions, denoted by {east,south,west,north}.

  • Rewards: The reward is composed of two parts, i.e., local reward rl and global reward rg, respectively. When an agent hits an obstacle, rl is set to be -10; when an agent meets the other agent, rl is set to be +50; otherwise, rl is set to be -1. Considering the goal of the task is to make the two agents eventually meet each other, we exploited an additional reward regarding the distance between two agents, namely global reward, i.e., rg=c/md(α,β), where md(α,β) is the Manhattan distance between the two agents, and c is a regularization factor which is set to be the dimension of the domain, i.e. c=Ng. The final reward r is calculated by r=rl+rg.

  • Dataset: We generated 8000 different maps (or matrices) for each size of 8×8,16×16 and 32×32. In each map, we randomly chose two positions for the two agents, and computed the optimal path (shortest path) which was compared to the paths predicted by our FedRL and baselines. We randomly split the 8000 maps into 6400 for training, 800 for validation, and 800 for testing.

  • Criteria: In each training episode, we first took the initial state and predicted an action with a model. We then got a new state and took it as the new input of the model at the second time step. We repeated the procedure until the two agents met each other, which is indicated as a successful episode, or it exceeded the maximum time step Tm, which is indicated as a failure episode. We set Tm to be twice the length of the longest optimal path, i.e. Tm=38,86,178 for each size of 8×8,16×16 and 32×32, respectively. We finally computed the measures of successful rate SuccRate, and average reward AvgRwd, i.e.,




    where #SuccessfulEpisodes indicates the number of successful episodes, #TotalEpisodes indicates the total number of episodes, TotalCumulativeReward indicates the total reward of all episodes, respectively.

Table 1: Comparison with baselines in Grid-World
Metric Method Domain w.r.t. various sizes
8×8 16×16 32×32
SuccRate FCN-alpha 69.73% 48.04% 41.73%
DQN-alpha 88.27% 76.20% 71.41%
FedRL-1 92.52% 79.83% 77.88%
FedRL-2 95.06% 84.31% 82.02%
FCN-full 72.16% 56.44% 50.15%
DQN-full 93.69% 83.40% 79.73%
AvgRwd DQN-alpha 13.781 -112.084 -285.946
FedRL-1 18.152 -94.193 -226.583
FedRL-2 19.101 -84.139 -189.756
DQN-full 31.286 -38.114 -52.72

Experimental Results w.r.t. Domain Sizes

We ran our FedRL and baselines five times and calculated an average of SuccRate (as well as AvgRwd). We calculated two results of our FedRL, denoted by FedRL-1 and FedRL-2, which correspond to “adding Gausian differential privacy on the local Q-network” and ”not adding Gausian differential privacy on the local Q-network” in the testing data, respectively. The results are shown in Table 1. From Table 1, we can see that SuccRate of both FedRL-1 and FedRL-2 is much better than DQN-alpha and CNN-alpha in all three different sizes of domains, which indicates agent α can indeed get help from agent β via learning federatively with agent β. Comparing to DQN-full, we can see that the SuccRate of both FedRL-1 and FedRL-2 is close to DQN-full, which indicates our federated learning framework can indeed take advantage of both training data from agents α and β, even though they are protected locally with Gausian differential privacy (the reason why FedRL-2 is slightly better than DQN-full is that the hierarchical structure of our federated learning framework with three components may be more suitable for this domain than a unique DQN framework). In addition, we can also see that the SuccRate generally decreases when the size of the domain increases, which is consistent with our intuition, since the larger the domain is, the more difficult the task is (which requires more training data to build models of high-quality).

From the metric of AvgRwd, we can see that both FedRL-1 and FedRL-2 are better than DQN-alpha, which indicates agent α can indeed get help from agent β in gaining rewards via federated learning with agent β. We can also find that FedRL-2 outperforms FedRL-1 in both AvgRwd and SuccRate. This is because our model is trained based on Gausian differential privacy, indicating SuccRate on testing data with Gausian differential privacy, as done by FedRL-2, should be better than on testing data without Gausian differential privacy, as done by FedRL-1.

Figure 4: Results about the impact of history length.

Experimental Results w.r.t. History Length

To study when FedRL works, we consider the amount of information that an agent uses. The observation input at each time is a sequence of observations, i.e. observation history. Intuitively, the longer the length of the observation history, the more information we have, and the more complicated the neural network is. We fixed the structures of all models and only changed the length of the history observations. We tested the length of history from 2 to 32 for all domains. The results are shown in Figure 4. In the first row of Figure 4, we can observe is that the success rates are improving with the increment of the history length. In 8×8 and 16×16 domains, the results converge when history length is longer than 16, while in 32×32 domain, they have not converged even at the history length of 32. The reason is that 32×32 domain is more complicated than the other two domains, so it needs more information (i.e. H>32) to learn a model of high-quality. We can also find that when the history length is short (i.e. H=2 for 16×16 and H4 for 32×32), DQN-alpha, DQN-full and FedRL-1 perform poorly. FedRL-1 and DQN-full do not show their advantages although they take as input both sα and sβ directly or indirectly. However, FedRL-2, which applies differential privacy to both training and testing samples, shows its great scalability even with limited amount of history. In small domains, such as 8×8, a few steps are enough to explore the whole environment. Therefore, the DQN-alpha which takes only single observation as input can performs as well as DQN-alpha and FedRL models.

5.2 Text2Actions Domain

In this experiment, we evaluated our FedRL in another domain, i.e., Text2Action (Feng et al., 2018), which aims to extract action sequences from texts. For example, consider a text “Cook the rice the day before, or use leftover rice in the refrigerator. The important thing to remember is not to heat up the rice, but keep it cold.”, which addresses the procedure of making egg fired rice. The task is to extract words composing an action sequence “cook(rice), keep(rice, cold)”, or “use(leftover rice), keep(rice, cold)”. We define states, actions and rewords as follows.

Table 2: Comparison with baselines in Text2Action
Metric Method Dataset
F1 FCN-alpha 86.19% 65.44% 55.18%
DQN-alpha 92.11% 74.64% 66.37%
FedRL-1 93.76% 85.05% 75.64%
FedRL-2 94.41% 84.92% 75.85%
FCN-full 91.42% 79.03% 68.93%
DQN-full 94.55% 83.39% 74.63%

DQN-alpha 54.762 47.375 46.510
FedRL-1 55.623 50.472 48.359
FedRL-2 55.894 50.452 48.373
DQN-full 56.192 50.307 48.154
Figure 5: Results about the impact of the number of convolutional kernels.
  • States: sαNw×K1 is real-valued matrix that describes the part-of-speech of words, and sβNw×K2 is real-valued matrix that describes the embedding of words. Nw is the number of words in the text, K1 is the dimension of a part-of-speech vector, and K2 is the dimension of word vectors. In our experiments, part-of-speech vectors are randomly initialized and trained together with the Q-network, while word vectors are generated from the pre-trained word embeddings and will not be changed during the training of the Q-network.

  • Actions: There are two actions for each agent, i.e., {select,neglect}, indicating selecting a word as an action name (or a parameter), or neglecting a word (which means the corresponding word is neither an action name nor a parameter).

  • Rewards: The instant reward includes a basic reward and an additional reward, where the basic reward indicates whether the agent selects a word correctly or not, and the additional reward encodes the priori knowledge of the domain, i.e., the proportion of words that are related to action sequences (Feng et al., 2018). We assume that only agent α knows the rewards, while agent β does not know them.

  • Dataset: We conducted experiments on three datasets, i.e., “Microsoft Windows Help and Support” (WHS) documents (Branavan et al., 2009), and two datasets collected from “WikiHow Home and Garden”22 2 https://www.wikihow.com/Category:Home-and-Garden (WHG) and “CookingTutorial”33 3 http://cookingtutorials.com/ (CT), respectively. In CT, there are 116 labeled texts and 134,000 words, with 10.37% being action names and 7.44% being action arguments. In WHG, there are 150 labeled texts and 34,000,000 words, with 7.61% being action names and 6.3% being action arguments. In WHS, there are 154 labeled texts and 1,500 words, with 19.47% being action names and 15.45% being action arguments.

  • Criteria: For evaluation, we first fed texts to each model to obtain selected words. We then compared the outputs to their corresponding ground truth and calculated #TotalTruth (total ground truth words), #TotalRight (total correctly selected words), and TotalSelected (total selected words). After that we computed metrics precision=#TotalRight#TotalSelected, recall=#TotalRight#TotalTruth, and


    We used F1-metric for all baselines. For reinforcement learning methods, we also computed the average cumulative rewards


    where TotalCumulativeReward indicates the total cumulative reward of all testing texts and #TotalTimeSteps indicates the total number of steps of all testing texts.

We adopted the same TextCNN structure as (Feng et al., 2018). Four convolutional layers corresponding to bigram, tri-gram, four-gram and five-gram with 32 kernels of size n×m, where n refers to the n-gram and m=50. Each convolutional layer is followed by a max-pooling layer with size of (Nw-n+1,1), where Nw-n+1 is the first dimension of the outputs of the n-gram convolutional layer. The max-pooling outputs are concatenated and fed to two fully connected layers with size 128×2, where 128 equals to 4×32, indicating 4 types of n-grams and 32 kernels for each n-grams, and 2 is the size of the action space. The MLP module of FedRL is composed of two fully-connected layers with size 4×2. The standard deviation of Gaussian differential privacy σ was set to be 1. For all models, we adopted the Adam optimizer with learning rate 0.001 and ReLU activation44 4 Detailed setting can be found from the source code: https://github.com/FRL2019/FRL.

Experimental Results

We ran our FedRL and baselines five times to calculate an average of F1 (as well as AvgRwd). The results are shown in Table 2. From Table 2 we can see that both FedRL-1 and FedRL-2 outperform both FCN-alpha and DQN-alpha in all three datasets under both F1 and AverRwd metrics, which indicates agent α can learn better policies via federated learning with agent β than learning by itself. Comparing our FedRL-1 and FedRL-2 with DQN-full in both F1 and AvgRwd, we can see that their performances are close to each other, suggesting our federated learning framework performs as good as the DQN model directly built from all training data from both agents α and β.

To see the impact of the model complexity, we varied the number of convolutional kernels from 2 to 32 with other parameters fixed. The results are shown in Figure 5. We can see that all models generally perform better when the number of convolutional kernels (filters) increases, and become stable after 8 kernels. We can also see that our FedRL models outperform DQN-alpha, FCN-alpha and FCN-full with respect to the number of kernels, which indicates agent α can indeed learns better policies via federated learning with agent β. Both FedRL-1 and FedRL-2 are close to DQN-full with respect to the number of filters, suggesting that our federated learning framework is effective even though the data of agents α and β are not shared with each other.

6 Conclusion

we propose a novel reinforcement learning approach to considering privacies and federatively building Q-network for each agent with the help of other agents, namely Federated deep Reinforcement Learning (FedRL). To protect the privacy of data and models, we exploit Gausian differentials on the information shared with each other when updating their local models. We demonstrate that the proposed FedRL approach is effective in building high-quality policies for agents under the condition that training data are not shared between agents. In the future, it would be interesting to study more members in the federation with background knowledge represented by (probably incomplete) action models (Zhuo et al., 2014; Zhuo and Yang, 2014; Zhuo and Kambhampati, 2017).


  • M. Abadi, A. Chu, I. J. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In ACM SIGSAC, pp. 308–318. Cited by: §4.
  • N. Agarwal, A. T. Suresh, F. X. Yu, S. Kumar, and B. McMahan (2018) CpSGD: communication-efficient and differentially-private distributed SGD. In NeurIPS, pp. 7575–7586. Cited by: §4.
  • S. R. K. Branavan, H. Chen, L. S. Zettlemoyer, and R. Barzilay (2009) Reinforcement learning for mapping instructions to actions. In ACL, Cited by: 4th item.
  • J. Co-Reyes, Y. Liu, A. Gupta, B. Eysenbach, P. Abbeel, and S. Levine (2018) Self-consistent trajectory autoencoder: hierarchical reinforcement learning with trajectory embeddings. In ICML, pp. 1009–1018. Cited by: §1.
  • J. C. Duchi, M. I. Jordan, and M. J. Wainwright (2012) Privacy aware learning. In NIPS, pp. 1439–1447. Cited by: §1.
  • C. Dwork and A. Roth (2014) The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3-4), pp. 211–407. Cited by: §4.
  • W. Feng, H. H. Zhuo, and S. Kambhampati (2018) Extracting action sequences from texts based on deep reinforcement learning. In IJCAI, pp. 4064–4070. Cited by: 3rd item, §5.2, §5.2.
  • J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson (2016) Learning to communicate with deep multi-agent reinforcement learning. In NIPS, pp. 2137–2145. Cited by: §1.
  • R. Furuta, N. Inoue, and T. Yamasaki (2019) Fully convolutional network with multi-step reinforcement learning for image processing. In AAAI, pp. 3598–3605. Cited by: 3rd item.
  • J. Konecný, B. McMahan, and D. Ramage (2015) Federated optimization: distributed optimization beyond the datacenter. CoRR abs/1511.03575. Cited by: §2.
  • J. Konecný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. CoRR abs/1610.05492. Cited by: §1, §2.
  • J. Z. Leibo, V. F. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel (2017) Multi-agent reinforcement learning in sequential social dilemmas. In AAMAS, pp. 464–473. Cited by: §1, §2.
  • R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, pp. 6382–6393. Cited by: §2.
  • B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In AISTATS, pp. 1273–1282. Cited by: §1, §2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski (2015) Human-level control through deep reinforcement learning.. Nature 518 (7540), pp. 529–33. Cited by: §1, §3, 1st item.
  • S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian (2017) Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In ICML, pp. 2681–2690. Cited by: §2.
  • T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. N. Foerster, and S. Whiteson (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, pp. 4292–4301. Cited by: §2.
  • V. Smith, C. Chiang, M. Sanjabi, and A. S. Talwalkar (2017) Federated multi-task learning. In NIPS, pp. 4427–4437. Cited by: §2.
  • R. S. Sutton and A. G. Barto (1998) Reinforcement learning - an introduction. Adaptive computation and machine learning, MIT Press. External Links: ISBN 0262193981 Cited by: §1.
  • A. Tamar, S. Levine, P. Abbeel, Y. Wu, and G. Thomas (2016) Value iteration networks. In NIPS, pp. 2146–2154. Cited by: §5.1.
  • A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente (2015) Multiagent cooperation and competition with deep reinforcement learning. CoRR abs/1511.08779. Cited by: §1, §2.
  • M. E. Taylor and P. Stone (2009) Transfer learning for reinforcement learning domains: a survey. J. Mach. Learn. Res. 10, pp. 1633–1685. External Links: ISSN 1532-4435 Cited by: §1.
  • A. Tirinzoni, R. Rodriguez Sanchez, and M. Restelli (2018) Transfer of value functions via variational methods. In NIPS, pp. 6182–6192. Cited by: §1.
  • Q. Yang, Y. Liu, T. Chen, and Y. Tong (2019) Federated machine learning: concept and applications. ACM TIST 10 (2), pp. 12:1–12:19. Cited by: §1.
  • H. H. Zhuo and S. Kambhampati (2017) Model-lite planning: case-based vs. model-based approaches. Artif. Intell. 246, pp. 1–21. Cited by: §6.
  • H. H. Zhuo, H. Muñoz-Avila, and Q. Yang (2014) Learning hierarchical task network domains from partially observed plan traces. Artif. Intell. 212, pp. 134–157. Cited by: §6.
  • H. H. Zhuo and Q. Yang (2014) Action-model acquisition for planning via transfer learning. Artif. Intell. 212, pp. 80–103. Cited by: §6.