Abstract
In deep reinforcement learning, building policies of highquality ischallenging when the feature space of states is small and the training data islimited. Despite the success of previous transfer learning approaches in deepreinforcement learning, directly transferring data or models from an agent toanother agent is often not allowed due to the privacy of data and/or models inmany privacyaware applications. In this paper, we propose a novel deepreinforcement learning framework to federatively build models of highqualityfor agents with consideration of their privacies, namely Federated deepReinforcement Learning (FedRL). To protect the privacy of data and models, weexploit Gausian differentials on the information shared with each other whenupdating their local models. In the experiment, we evaluate our FedRL frameworkin two diverse domains, Gridworld and Text2Action domains, by comparing tovarious baselines.
Quick Read (beta)
Federated Deep Reinforcement Learning
Abstract
In deep reinforcement learning, building policies of highquality is challenging when the feature space of states is small and the training data is limited. Despite the success of previous transfer learning approaches in deep reinforcement learning, directly transferring data or models from an agent to another agent is often not allowed due to the privacy of data and/or models in many privacyaware applications. In this paper, we propose a novel deep reinforcement learning framework to federatively build models of highquality for agents with consideration of their privacies, namely Federated deep Reinforcement Learning (FedRL). To protect the privacy of data and models, we exploit Gausian differentials on the information shared with each other when updating their local models. In the experiment, we evaluate our FedRL framework in two diverse domains, Gridworld and Text2Action domains, by comparing to various baselines.
1 Introduction
In deep reinforcement learning, building policies of highquality is challenging when the feature space of states is small and the training data is limited. In many realworld applications, however, datasets from clients are often privacy sensitive (Duchi et al., 2012) and it is often difficult for such a data center to guarantee building models of highquality. To deal with the issue, Konecny et al. propose a new learning setting, namely federated learning, whose goal is to train a classification or clustering model with training data involving texts, images or videos distributed over a large number of clients (Konecný et al., 2016; McMahan et al., 2017). Different from previous federated learning setting (c.f. (Yang et al., 2019)), we propose a novel federated learning framework based on reinforcement learning (Sutton and Barto, 1998; Mnih et al., 2015; CoReyes et al., 2018), i.e., Federated deep Reinforcement Learning (FedRL), which aims to learn a private Qnetwork policy for each agent by sharing limited information (i.e., output of the Qnetwork) among agents. The information is “encoded” when it is sent to others and “decoded” when it is received by others. We assume that some agents have rewards corresponding to states and actions, while others have only observed states without rewards. Without rewards, those agents are unable to build decision policies on their own information. We claim that all agents benefit from joining the federation in building decision policies.
There are many applications regarding federated reinforcement learning. For example, in the manufacturing industry, producing products may involve various factories which produce different components of the products. Factories’ decision policies are private and will not be shared with each other. On the other hand, building individual decision policies of highquality on their own is often difficult due to their limited businesses and lack of rewards (for some factories). It is thus helpful for them to learn decision polices federatively under the condition that private data is not given away. Another example is building medical treatment policies to patients for hospitals. Patients may be treated in some hospitals and never give feedbacks to the treatments, which indicates these hospitals are unable to collect rewards based on the treatments given to the patients and build treatment decision policies for patients. In addition, data records about patients are private and may not be shared among hospitals. It is thus necessitated to learn treatment policies for hospitals federatively.
Our FedRL framework is different from multiagent reinforcement learning, which is concerned with a set of autonomous agents that observe global states (or partial states which are directly shared to make “global” states), select an individual action and receive a team reward (or each agent receives an individual reward but shares it with other agents) (Tampuu et al., 2015; Leibo et al., 2017; Foerster et al., 2016). FedRL assumes agents do not share their partial observations and some agents are unable to receive rewards. Our FedRL framework is also different from transfer learning in reinforcement learning, which aims to transfer experience gained in learning to perform one task to help improve learning performance in a related but different task or agent, assuming observations are shared with each other (Taylor and Stone, 2009; Tirinzoni et al., 2018), while FedRL assumes states cannot be shared among agents.
Our FedRL framework functions in three phases. Initially, each agent collects output values of Qnetworks from other agents, which are “encrypted” with Gausian differentials. Furthermore, it builds a shared value network, e.g., MLP (multilayer perceptron), to compute a global Qnetwork output with its own Qnetwork output and the encrypted values as input. Finally, it updates both the shared value network and its own Qnetwork based on the global Qnetwork output. Note that MLP is shared among agents while agents’ own Qnetworks are unknown to others and should not be inferred based on the encrypted Qnetwork output shared in the training process.
In the remainder of the paper, we first review previous work related to our FedRL framework, and then present the problem formulation of FedRL. After that, we introduce our FedRL framework in detail. Finally we evaluate our FedRL framework in the GridWorld domain with various sizes and the Text2Actions domain.
2 Related Work
The nascent field of federated learning considers training statistical models directly on devices (Konecný et al., 2015; McMahan et al., 2017). The aim in federated learning is to fit a model to data generated by distributed nodes. Each node collects data in a nonIID manner across the network, with data on each node being generated by a distinct distribution. There are typically a large number of nodes in the network, and communication is often a significant bottleneck. Different from previous work that train a single global model across the network (Konecný et al., 2015, 2016; McMahan et al., 2017), Smith et al. propose to learn separate models for each node which is naturally captured through a multitask learning (MTL) framework, where the goal is to consider fitting separate but related models simultaneously (Smith et al., 2017). Different from those federated learning approaches, we consider federated settings in reinforcement learning.
Our work is also related to multiagent reinforcement learning (MARL) which involves a set of agents in a shared environment. A straightforward way to MARL is to extend the singleagent RL approaches. Qlearning has been extended to cooperative multiagent settings, namely Independent Qlearning (IQL), in which each agent observes the global state, selects an individual action and receives a team reward (Tampuu et al., 2015; Leibo et al., 2017). One challenging of MARL is that multiagent domains are nonstationary from agent’s perspectives, due to other agents’ interactions in the shared environment. To address this issue, (Omidshafiei et al., 2017) propose to explore Concurrent Experience Replay Trajectories (CERTs) structures, which store different agents’ histories, and align them together based on the episode indices and time steps. Due to the action space growing exponentially with the number of agents, learning becomes very difficult due to partial observability of limited communication when the number of agents is large. (Lowe et al., 2017) thus propose to solve the MARL problem through a Centralized Critic and a Decentralized Actor, and (Rashid et al., 2018) propose to exploit a linear decomposition of the joint value function across agents. Different from MARL, our FedRL framework assumes agents do not share their partial observations and some agents are unable to receive rewards, instead of assuming observations are sharable and all agents are able to receive rewards.
3 Problem Definition
A Markov Decision Process (MDP) can be defined by $\u27e8S,A,T,r\u27e9$, where $S$ is the state space, and $A$ is the action space. $T$ is the transition function: $S\times A\to S$, i.e., $T(s,a,{s}^{\prime})=P({s}^{\prime}s,a)$, specifying the probability of next state ${s}^{\prime}\in S$ given current state $s\in S$ and $a\in A$ that applies on $s$. $r$ is the reward function: $S\to \mathcal{R}$, where $\mathcal{R}$ is the space of real numbers. Given a policy $\pi :S\to A$, the value function ${V}^{\pi}(s)$ and the Qfunction ${Q}^{\pi}(s,a)$ at step $t+1$, can be updated by their $t$ step:
$${V}_{t+1}^{\pi}(s)=r(s)+\sum _{{s}^{\prime}\in S}T(s,\pi (s),{s}^{\prime}){V}_{t}^{\pi}({s}^{\prime}),$$ 
and
$${Q}_{t+1}^{\pi}(s,a)=r(s)+\sum _{{s}^{\prime}\in S}T(s,a,{s}^{\prime}){V}_{t}^{\pi}({s}^{\prime}),$$ 
for $t\in \{0,\mathrm{\dots},K1\}$. The solution to an MDP problem is the best policy ${\pi}^{*}$ such that ${V}^{{\pi}^{*}}(s)={\mathrm{max}}_{\pi}{V}^{\pi}(s)$ or ${Q}^{{\pi}^{*}}(s,{\pi}^{*}(s))={\mathrm{max}}_{\pi}{Q}^{\pi}(s,\pi (s))$. In DQN (Deep QNetwork), given transition function $T$ is unknown, the Qfunction is represented by a Qnetwork $Q(s,a;\theta )$ with $\theta $ as parameters of the network, and updated by
$${Q}_{t+1}(s,a;\theta )={E}_{{s}^{\prime}}\{r(s)+\gamma \underset{{a}^{\prime}\in A}{\mathrm{max}}{Q}_{t}({s}^{\prime},{a}^{\prime};\theta )s,a\},$$ 
as done by (Mnih et al., 2015). To learn the parameters $\theta $, one way is to store transitions $\u27e8s,a,{s}^{\prime},r\u27e9$ in replay memories $\mathrm{\Omega}$ and exploit a minibatch sampling to repeatedly update $\theta $ (Mnih et al., 2015). Once $\theta $ is learnt, the policy ${\pi}^{*}$ can be extracted from $Q(s,a;\theta )$,
$${\pi}^{*}(s)=\mathrm{arg}\underset{a\in A}{\mathrm{max}}Q(s,a;\theta ).$$ 
We define our federated deep reinforcement learning problem by: given transitions ${\mathrm{D}}_{\alpha}\mathrm{=}\mathrm{\{}\mathrm{\u27e8}{s}_{\alpha}\mathrm{,}{a}_{\alpha}\mathrm{,}{s}_{\alpha}^{\mathrm{\prime}}\mathrm{,}{r}_{\alpha}\mathrm{\u27e9}\mathrm{\}}$ collected by agent $\alpha $, and pairs of states and actions ${\mathrm{D}}_{\beta}\mathrm{=}\mathrm{\{}\mathrm{\u27e8}{s}_{\beta}\mathrm{,}{a}_{\beta}\mathrm{\u27e9}\mathrm{\}}$ collected by agent $\beta $, we aim to federatively build policies ${\pi}_{\alpha}^{\mathrm{*}}$ and ${\pi}_{\beta}^{\mathrm{*}}$ for agents $\alpha $ and $\beta $, respectively. Note that in this paper we consider the federation with two members for simplicity. The setting can be extended to many agents by exploiting the same federated mechanism between each two agents. We denote states, actions, Qfunctions, and policies with respect to agents $\alpha $ and $\beta $, by “${s}_{\alpha}\in {S}_{\alpha}$, ${a}_{\alpha}\in {A}_{\alpha}$, ${Q}_{\alpha}$, ${\pi}_{\alpha}^{*}$” and “${s}_{\beta}\in {S}_{\beta}$, ${a}_{\beta}\in {A}_{\beta}$, ${Q}_{\beta}$, ${\pi}_{\beta}^{*}$”, respectively.
In our federated deep reinforcement learning problem, we assume:
A1: The feature spaces of states ${s}_{\alpha}$ and ${s}_{\beta}$ are different between agents $\alpha $ and $\beta $. For example, a state ${s}_{\alpha}$ denotes a patient’s cardiogram in hospital $\alpha $, while another state ${s}_{\beta}$ denotes the same patient’s electroencephalogram in hospital $\beta $, indicating the feature spaces of ${s}_{\alpha}$ and ${s}_{\beta}$ are different.
A2: Transitions ${\mathcal{D}}_{\alpha}$ and ${\mathcal{D}}_{\beta}$ cannot be shared directly between $\alpha $ and $\beta $ when they learning their own models. The correspondences between transitions from ${\mathcal{D}}_{\alpha}$ and ${\mathcal{D}}_{\beta}$ are, however, known to each other. In other words, agent $\alpha $ can send the “ID” of a transition to agent $\beta $, and agent $\beta $ can use that “ID” to find its corresponding transition in ${\mathcal{D}}_{\beta}$. For example, in hospital, “ID” can correspond to a specific patient.
A3: The output of functions ${Q}_{\alpha}$ and ${Q}_{\beta}$ can be shared with each other under the condition that they are protected by some privacy protection mechanism.
Based on A1A3, we aim to learn policies ${\pi}_{\alpha}^{*}$ and ${\pi}_{\beta}^{*}$ of highquality for agents $\alpha $ and $\beta $ by preserving privacies of their own data and models.
4 Our FedRL Approach
In this section we present our FedRL framework in detail. An overview of FedRL is shown in Figure 1, where the left part is the model of agent $\alpha $, and the right part is the model of agent $\beta $. Each model is composed of a local Qnetwork with parameters ${\theta}_{\alpha}$ for agent $\alpha $, ${\theta}_{\beta}$ for agent $\beta $, and a global MLP module with parameters ${\theta}_{g}$ for both $\alpha $ and $\beta $.
Basic Qnetworks
We build two Qnetworks for agents $\alpha $ and $\beta $, denoted by ${Q}_{\alpha}({s}_{\alpha},{a}_{\alpha};{\theta}_{\alpha})$ and ${Q}_{\beta}({s}_{\beta},{a}_{\beta};{\theta}_{\beta})$, respectively, where ${\theta}_{\alpha}$ and ${\theta}_{g}$ are parameters of the Qnetworks. The outputs of these two basic Qnetworks are not directly used to predict the actions, but taken as input of the MLP module.
Gaussian differential privacy
To avoid agents “inducing” models of each other according to repeatedly received outputs of other’s Qnetwork during training, we consider using differential privacy (Dwork and Roth, 2014) to protect the output of agent’s Qnetwork. There are various mechanisms of differential privacy such as the Gaussian mechanism (Abadi et al., 2016) and Binomial mechanism (Agarwal et al., 2018). In this paper, we exploit Gaussian mechanism since the output of the MLP with Gaussian input is Gaussian itself. In previous federated learning settings, the mechanism is applied to the gradients of agents (clients) before being sent to the server or other agents (clients). In our FedRL framework, we send the output of one agent’s Qnetwork to another, so the Gaussian noise is added to the output rather than the gradients. The mechanism is defined by
${\widehat{Q}}_{\alpha}({s}_{\alpha},{a}_{\alpha};{\theta}_{\alpha})={Q}_{\alpha}({s}_{\alpha},{a}_{\alpha};{\theta}_{\alpha})+N(0,{\sigma}^{2})$  (1)  
${\widehat{Q}}_{\beta}({s}_{\beta},{a}_{\beta};{\theta}_{\beta})={Q}_{\beta}({s}_{\beta},{a}_{\beta};{\theta}_{\beta})+N(0,{\sigma}^{2})$  (2) 
where $N(0,{\sigma}^{2})$ is the Gaussian distribution with mean 0 and standard deviation $\sigma $.
Federated Qnetwork
We build a new Qnetwork, namely federated Qnetwork denoted by ${Q}_{f}$, to leverage the outputs of the two basic Qnetworks with Gaussian noise based on MLP, which is defined by
${Q}_{f}(\cdot ;{\theta}_{\alpha},{\theta}_{\beta},{\theta}_{g})=$  
$MLP([{\widehat{Q}}_{\alpha}({s}_{\alpha},{a}_{\alpha};{\theta}_{\alpha}){\widehat{Q}}_{\beta}({s}_{\beta},{a}_{\beta};{\theta}_{\beta})];{\theta}_{g})$  (3) 
where ${\theta}_{g}$ is the parameters of MLP and $[\cdot \cdot ]$ indicates the concatenation operation. Note that parameters of an MLP can be shared between agents. Once MLP is updated, the updated parameters are shared with the other agent.
With respect to agents $\alpha $ and $\beta $, we define each agent’s federated Qnetworks by viewing the other agent’s basic Qnetwork (with Gaussian noise) as a fixed constant when updating its own basic Qnetwork, as shown below,
${Q}_{f}^{\alpha}(\cdot ,{C}_{\beta};{\theta}_{\alpha},{\theta}_{g})=MLP([{\widehat{Q}}_{\alpha}(\cdot ;{\theta}_{\alpha}){C}_{\beta}];{\theta}_{g})$  (4)  
${Q}_{f}^{\beta}(\cdot ,{C}_{\alpha};{\theta}_{\beta},{\theta}_{g})=MLP([{C}_{\alpha}{\widehat{Q}}_{\beta}(\cdot ;{\theta}_{\beta})];{\theta}_{g})$  (5) 
where ${C}_{\alpha}={\widehat{Q}}_{\alpha}({s}_{\alpha},{a}_{\alpha};{\theta}_{\alpha})$ and ${C}_{\beta}={\widehat{Q}}_{\beta}({s}_{\beta},{a}_{\beta};{\theta}_{\beta})$ are fixed constants when updating agent $\beta $’s basic Qnetwork and agent $\alpha $’s basic Qnetwork, respectively.
The Qnetworks are trained by minimizing the square error loss ${L}_{\alpha}^{j}({\theta}_{\alpha},{\theta}_{g})$ and ${L}_{\beta}^{j}({\theta}_{\beta},{\theta}_{g})$ of agents $\alpha $ and $\beta $,
${L}_{\alpha}^{j}({\theta}_{\alpha},{\theta}_{g})=\mathbb{E}\left[{({Y}^{j}{Q}_{f}^{\alpha}({s}_{\alpha}^{j},{a}_{\alpha}^{j},{C}_{\beta};{\theta}_{\alpha},{\theta}_{g}))}^{2}\right]$  (6)  
${L}_{\beta}^{j}({\theta}_{\beta},{\theta}_{g})=\mathbb{E}\left[{({Y}^{j}{Q}_{f}^{\beta}({s}_{\beta}^{j},{a}_{\beta}^{j},{C}_{\alpha};{\theta}_{\beta},{\theta}_{g}))}^{2}\right]$  (7) 
where ${Y}^{j}={r}^{j}+\gamma \underset{a}{\mathrm{max}}{Q}_{f}^{\alpha}({s}_{\alpha}^{j},a,{C}_{\beta};{\theta}_{\alpha},{\theta}_{g})$. Note that agent $\beta $ is unable to compute ${Y}^{j}$ since it does not have reward $r$. ${Y}^{j}$ is computed by agent $\alpha $ and shared with agent $\beta $.
Overview of training and testing
The training and testing procedures of agents $\alpha $ and $\beta $ are shown in Figure 2. When training, agent $\alpha $ computes ${Y}^{j}$ and sends it to agent $\beta $. Agent $\beta $ updates ${\theta}_{\beta}$ and ${\theta}_{g}$, computes ${C}_{\beta}$, and sends ${\theta}_{g}$ and ${C}_{\beta}$ to agent $\alpha $. After that, agent $\alpha $ updates ${\theta}_{\alpha}$ and ${\theta}_{g}$, computes ${C}_{\alpha}$, and sends ${\theta}_{g}$ and ${C}_{\alpha}$ to agent $\beta $. When testing, both agents compute and send ${C}_{\alpha}$ and ${C}_{\beta}$ to each other for computing ${Q}_{f}^{\alpha}$ and ${Q}_{f}^{\beta}$.
The detailed training procedure can be seen from Algorithms 1 and 2. In Steps 1 and 2 of Algorithm 1, we initialize the basic Qnetwork and replay memory of agent $\alpha $ and the MLP module. In Step 3 of Algorithm 1, we call the function from Algorithm 2 to initialize the Qnetwork and replay memory of agent $\beta $. In Step 6 of Algorithm 1, we obtain observation of agent $\alpha $’s state ${s}_{\alpha}^{t}$. In Step 7 of Algorithm 1, we call the function in Algorithm 2 to calculate the output of basic Qnetwork of agent $\beta $ and obtain observation of agent $\beta $ and select the corresponding action, as shown in Steps from 6 to 10 of Algorithm 2. In Steps from 8 and 11 of Algorithm 1, we perform the $\u03f5$greedy exploration and exploitation, obtain new observations and store the transitions to the replay memory. In Steps 12 and 13 of Algorithm 1, we sample a record $j$ in the memory and call the function $ComputeQBeta(j)$ of Algorithm 2 to calculate the output of basic Qnetwork of agent $\beta $ based on the index $j$. In Steps 14 and 15 of Algorithm 1, we update parameters ${\theta}_{\alpha}$ and ${\theta}_{g}$. In Steps 16 and 17 of Algorithm 1, we compute the output of basic Qnetwork of agent $\alpha $ and pass it to agent $\beta $ and call the function $UpdateQ$ of agent $\beta $ to update basic Qnetwork of agent $\beta $ and MLP, as shown in Steps from 19 to 21 of Algorithm 2. Note that Algorithms 1 and 2 are executed by agents $\alpha $ and $\beta $ separately.
5 Experiments
In the experiment, we evaluate FedRL^{1}^{1} 1 The source code and datasets are available from https://github.com/FRL2019/FRL in the following two aspects. Firstly, we would like to see if agent $\alpha $ can learn better policies by joining the federation than without joining the federation (agent $\alpha $ can build policies by itself since it has complete transitions $\{\u27e8s,a,{s}^{\prime},r\u27e9\}$, while agent $\beta $ cannot build policies without joining the federation since it has only pairs of states and actions $\{\u27e8s,a\u27e9\}$). Secondly, we would like to see if our FedRL approach can learn policies of highquality which are close to the ones learnt by directly combining data of both agent $\alpha $ and $\beta $, neglecting the dataprivacy issue between $\alpha $ and $\beta $. To do this, we compared our FedRL approach to the following baselines:

•
DQNalpha: which is a deep Qnetwork (Mnih et al., 2015) trained with agent $\alpha $’s data only. It takes observations ${s}_{\alpha}$ as input and outputs actions corresponding to ${s}_{\alpha}$.

•
DQNfull: which is a deep Qnetwork trained by directly putting data together from both agents $\alpha $ and $\beta $, i.e., neglecting data privacy between agents $\alpha $ and $\beta $.

•
FCNalpha: which is a fully convolutional network (FCN) trained with agent $\alpha $’s data only, similar to DQNalpha. FCNalpha aims to build policies via supervised learning (Furuta et al., 2019), i.e., viewing states as input and actions as labels.

•
FCNfull: which is a convolutional network trained with all data of agent $\alpha $ and $\beta $ put together directly, similar to DQNfull.
In our experiment, we used the kernel of size $3\times 3$ in CNN and two fullyconnected layers with the size of $32\times 4$ in MLP. We set the standard deviation in Gaussian differential privacy $\sigma $ to be 1. We adopted the Adam optimizer with learning rate 0.001 and the ReLU activation for all models. We evaluated our FedRL with comparison to baselines in two domains, i.e., GridWorld and Text2Action.
5.1 Evaluation in GridWorld Domain
We first conducted experiments in GridWorld domain with randomly placed obstacles (Tamar et al., 2016). The two agents $\alpha $ and $\beta $ were randomly put in the environment as well, as shown in Figure 3. The two agents aim to navigate through optimal pathes (i.e. the shortest pathes without hitting obstacles) to meet with each other. In this domain, we define states, actions and rewords as follows.

•
States: The domain is represented by a ${N}_{g}\times {N}_{g}$ binaryvalued matrix (0 for obstacle, 1 otherwise), where ${N}_{g}$ is the size of the domain. We evaluated our approach with respect to different sizes, i.e., ${N}_{g}=8,16,32$. The observed state of agent $\alpha $, denoted by ${s}_{\alpha}$, was set to be a $3\times 3$ matrix with the current position of agent $\alpha $ in the center of the matrix. The observed state of agent $\beta $, denoted by ${s}_{\beta}$, was set to be a $5\times 5$ matrix with the current position of agent $\beta $ in the center of the matrix.

•
Actions: There are 4 actions for each agent, i.e. going towards 4 directions, denoted by $\{east,south,west,north\}$.

•
Rewards: The reward is composed of two parts, i.e., local reward ${r}_{l}$ and global reward ${r}_{g}$, respectively. When an agent hits an obstacle, ${r}_{l}$ is set to be 10; when an agent meets the other agent, ${r}_{l}$ is set to be +50; otherwise, ${r}_{l}$ is set to be 1. Considering the goal of the task is to make the two agents eventually meet each other, we exploited an additional reward regarding the distance between two agents, namely global reward, i.e., ${r}_{g}=c/md(\alpha ,\beta )$, where $md(\alpha ,\beta )$ is the Manhattan distance between the two agents, and $c$ is a regularization factor which is set to be the dimension of the domain, i.e. $c={N}_{g}$. The final reward $r$ is calculated by $r={r}_{l}+{r}_{g}$.

•
Dataset: We generated 8000 different maps (or matrices) for each size of $8\times 8,16\times 16$ and $32\times 32$. In each map, we randomly chose two positions for the two agents, and computed the optimal path (shortest path) which was compared to the paths predicted by our FedRL and baselines. We randomly split the 8000 maps into 6400 for training, 800 for validation, and 800 for testing.

•
Criteria: In each training episode, we first took the initial state and predicted an action with a model. We then got a new state and took it as the new input of the model at the second time step. We repeated the procedure until the two agents met each other, which is indicated as a successful episode, or it exceeded the maximum time step ${T}_{m}$, which is indicated as a failure episode. We set ${T}_{m}$ to be twice the length of the longest optimal path, i.e. ${T}_{m}=38,86,178$ for each size of $8\times 8,16\times 16$ and $32\times 32$, respectively. We finally computed the measures of successful rate $SuccRate$, and average reward $AvgRwd$, i.e.,
$$SuccRate=\frac{\mathrm{\#}SuccessfulEpisodes}{\mathrm{\#}TotalEpisodes},$$ and
$$AvgRwd=\frac{TotalCumulativeReward}{\mathrm{\#}TotalEpisodes},$$ where $\mathrm{\#}SuccessfulEpisodes$ indicates the number of successful episodes, $\mathrm{\#}TotalEpisodes$ indicates the total number of episodes, $TotalCumulativeReward$ indicates the total reward of all episodes, respectively.
Metric  Method  Domain w.r.t. various sizes  

$8\times 8$  $16\times 16$  $32\times 32$  
$SuccRate$  FCNalpha  69.73%  48.04%  41.73% 
DQNalpha  88.27%  76.20%  71.41%  
FedRL1  92.52%  79.83%  77.88%  
FedRL2  95.06%  84.31%  82.02%  
FCNfull  72.16%  56.44%  50.15%  
DQNfull  93.69%  83.40%  79.73%  
$AvgRwd$  DQNalpha  13.781  112.084  285.946 
FedRL1  18.152  94.193  226.583  
FedRL2  19.101  84.139  189.756  
DQNfull  31.286  38.114  52.72 
Experimental Results w.r.t. Domain Sizes
We ran our FedRL and baselines five times and calculated an average of $SuccRate$ (as well as $AvgRwd$). We calculated two results of our FedRL, denoted by FedRL1 and FedRL2, which correspond to “adding Gausian differential privacy on the local Qnetwork” and ”not adding Gausian differential privacy on the local Qnetwork” in the testing data, respectively. The results are shown in Table 1. From Table 1, we can see that $SuccRate$ of both FedRL1 and FedRL2 is much better than DQNalpha and CNNalpha in all three different sizes of domains, which indicates agent $\alpha $ can indeed get help from agent $\beta $ via learning federatively with agent $\beta $. Comparing to DQNfull, we can see that the $SuccRate$ of both FedRL1 and FedRL2 is close to DQNfull, which indicates our federated learning framework can indeed take advantage of both training data from agents $\alpha $ and $\beta $, even though they are protected locally with Gausian differential privacy (the reason why FedRL2 is slightly better than DQNfull is that the hierarchical structure of our federated learning framework with three components may be more suitable for this domain than a unique DQN framework). In addition, we can also see that the $SuccRate$ generally decreases when the size of the domain increases, which is consistent with our intuition, since the larger the domain is, the more difficult the task is (which requires more training data to build models of highquality).
From the metric of $AvgRwd$, we can see that both FedRL1 and FedRL2 are better than DQNalpha, which indicates agent $\alpha $ can indeed get help from agent $\beta $ in gaining rewards via federated learning with agent $\beta $. We can also find that FedRL2 outperforms FedRL1 in both $AvgRwd$ and $SuccRate$. This is because our model is trained based on Gausian differential privacy, indicating $SuccRate$ on testing data with Gausian differential privacy, as done by FedRL2, should be better than on testing data without Gausian differential privacy, as done by FedRL1.
Experimental Results w.r.t. History Length
To study when FedRL works, we consider the amount of information that an agent uses. The observation input at each time is a sequence of observations, i.e. observation history. Intuitively, the longer the length of the observation history, the more information we have, and the more complicated the neural network is. We fixed the structures of all models and only changed the length of the history observations. We tested the length of history from 2 to 32 for all domains. The results are shown in Figure 4. In the first row of Figure 4, we can observe is that the success rates are improving with the increment of the history length. In $8\times 8$ and $16\times 16$ domains, the results converge when history length is longer than 16, while in $32\times 32$ domain, they have not converged even at the history length of 32. The reason is that $32\times 32$ domain is more complicated than the other two domains, so it needs more information (i.e. $H>32$) to learn a model of highquality. We can also find that when the history length is short (i.e. $H=2$ for $16\times 16$ and $H\le 4$ for $32\times 32$), DQNalpha, DQNfull and FedRL1 perform poorly. FedRL1 and DQNfull do not show their advantages although they take as input both ${s}_{\alpha}$ and ${s}_{\beta}$ directly or indirectly. However, FedRL2, which applies differential privacy to both training and testing samples, shows its great scalability even with limited amount of history. In small domains, such as $8\times 8$, a few steps are enough to explore the whole environment. Therefore, the DQNalpha which takes only single observation as input can performs as well as DQNalpha and FedRL models.
5.2 Text2Actions Domain
In this experiment, we evaluated our FedRL in another domain, i.e., Text2Action (Feng et al., 2018), which aims to extract action sequences from texts. For example, consider a text “Cook the rice the day before, or use leftover rice in the refrigerator. The important thing to remember is not to heat up the rice, but keep it cold.”, which addresses the procedure of making egg fired rice. The task is to extract words composing an action sequence “cook(rice), keep(rice, cold)”, or “use(leftover rice), keep(rice, cold)”. We define states, actions and rewords as follows.
Metric  Method  Dataset  
WHS  CT  WHG  
F1  FCNalpha  86.19%  65.44%  55.18% 
DQNalpha  92.11%  74.64%  66.37%  
FedRL1  93.76%  85.05%  75.64%  
FedRL2  94.41%  84.92%  75.85%  
FCNfull  91.42%  79.03%  68.93%  
DQNfull  94.55%  83.39%  74.63%  
$AvgRwd$ 
DQNalpha  54.762  47.375  46.510 
FedRL1  55.623  50.472  48.359  
FedRL2  55.894  50.452  48.373  
DQNfull  56.192  50.307  48.154 

•
States: ${s}_{\alpha}\in {\mathbb{R}}^{{N}_{w}\times {K}_{1}}$ is realvalued matrix that describes the partofspeech of words, and ${s}_{\beta}\in {\mathbb{R}}^{{N}_{w}\times {K}_{2}}$ is realvalued matrix that describes the embedding of words. ${N}_{w}$ is the number of words in the text, ${K}_{1}$ is the dimension of a partofspeech vector, and ${K}_{2}$ is the dimension of word vectors. In our experiments, partofspeech vectors are randomly initialized and trained together with the Qnetwork, while word vectors are generated from the pretrained word embeddings and will not be changed during the training of the Qnetwork.

•
Actions: There are two actions for each agent, i.e., $\{select,neglect\}$, indicating selecting a word as an action name (or a parameter), or neglecting a word (which means the corresponding word is neither an action name nor a parameter).

•
Rewards: The instant reward includes a basic reward and an additional reward, where the basic reward indicates whether the agent selects a word correctly or not, and the additional reward encodes the priori knowledge of the domain, i.e., the proportion of words that are related to action sequences (Feng et al., 2018). We assume that only agent $\alpha $ knows the rewards, while agent $\beta $ does not know them.

•
Dataset: We conducted experiments on three datasets, i.e., “Microsoft Windows Help and Support” (WHS) documents (Branavan et al., 2009), and two datasets collected from “WikiHow Home and Garden”^{2}^{2} 2 https://www.wikihow.com/Category:HomeandGarden (WHG) and “CookingTutorial”^{3}^{3} 3 http://cookingtutorials.com/ (CT), respectively. In CT, there are 116 labeled texts and 134,000 words, with 10.37% being action names and 7.44% being action arguments. In WHG, there are 150 labeled texts and 34,000,000 words, with 7.61% being action names and 6.3% being action arguments. In WHS, there are 154 labeled texts and 1,500 words, with 19.47% being action names and 15.45% being action arguments.

•
Criteria: For evaluation, we first fed texts to each model to obtain selected words. We then compared the outputs to their corresponding ground truth and calculated $\mathrm{\#}TotalTruth$ (total ground truth words), $\mathrm{\#}TotalRight$ (total correctly selected words), and $TotalSelected$ (total selected words). After that we computed metrics $precision=\frac{\mathrm{\#}TotalRight}{\mathrm{\#}TotalSelected}$, $recall=\frac{\mathrm{\#}TotalRight}{\mathrm{\#}TotalTruth}$, and
$$F1=\frac{2\times precision\times recall}{precision+recall}.$$ We used F1metric for all baselines. For reinforcement learning methods, we also computed the average cumulative rewards
$$AvgRwd=\frac{TotalCumulativeReward}{\mathrm{\#}TotalTimeSteps},$$ where $TotalCumulativeReward$ indicates the total cumulative reward of all testing texts and $\mathrm{\#}TotalTimeSteps$ indicates the total number of steps of all testing texts.
We adopted the same TextCNN structure as (Feng et al., 2018). Four convolutional layers corresponding to bigram, trigram, fourgram and fivegram with 32 kernels of size $n\times m$, where $n$ refers to the $n$gram and $m=50$. Each convolutional layer is followed by a maxpooling layer with size of $({N}_{w}n+1,1)$, where ${N}_{w}n+1$ is the first dimension of the outputs of the ngram convolutional layer. The maxpooling outputs are concatenated and fed to two fully connected layers with size $128\times 2$, where $128$ equals to $4\times 32$, indicating 4 types of $n$grams and 32 kernels for each $n$grams, and 2 is the size of the action space. The MLP module of FedRL is composed of two fullyconnected layers with size $4\times 2$. The standard deviation of Gaussian differential privacy $\sigma $ was set to be 1. For all models, we adopted the Adam optimizer with learning rate 0.001 and ReLU activation^{4}^{4} 4 Detailed setting can be found from the source code: https://github.com/FRL2019/FRL.
Experimental Results
We ran our FedRL and baselines five times to calculate an average of F1 (as well as $AvgRwd$). The results are shown in Table 2. From Table 2 we can see that both FedRL1 and FedRL2 outperform both FCNalpha and DQNalpha in all three datasets under both F1 and $AverRwd$ metrics, which indicates agent $\alpha $ can learn better policies via federated learning with agent $\beta $ than learning by itself. Comparing our FedRL1 and FedRL2 with DQNfull in both F1 and $AvgRwd$, we can see that their performances are close to each other, suggesting our federated learning framework performs as good as the DQN model directly built from all training data from both agents $\alpha $ and $\beta $.
To see the impact of the model complexity, we varied the number of convolutional kernels from 2 to 32 with other parameters fixed. The results are shown in Figure 5. We can see that all models generally perform better when the number of convolutional kernels (filters) increases, and become stable after 8 kernels. We can also see that our FedRL models outperform DQNalpha, FCNalpha and FCNfull with respect to the number of kernels, which indicates agent $\alpha $ can indeed learns better policies via federated learning with agent $\beta $. Both FedRL1 and FedRL2 are close to DQNfull with respect to the number of filters, suggesting that our federated learning framework is effective even though the data of agents $\alpha $ and $\beta $ are not shared with each other.
6 Conclusion
we propose a novel reinforcement learning approach to considering privacies and federatively building Qnetwork for each agent with the help of other agents, namely Federated deep Reinforcement Learning (FedRL). To protect the privacy of data and models, we exploit Gausian differentials on the information shared with each other when updating their local models. We demonstrate that the proposed FedRL approach is effective in building highquality policies for agents under the condition that training data are not shared between agents. In the future, it would be interesting to study more members in the federation with background knowledge represented by (probably incomplete) action models (Zhuo et al., 2014; Zhuo and Yang, 2014; Zhuo and Kambhampati, 2017).
References
 Deep learning with differential privacy. In ACM SIGSAC, pp. 308–318. Cited by: §4.
 CpSGD: communicationefficient and differentiallyprivate distributed SGD. In NeurIPS, pp. 7575–7586. Cited by: §4.
 Reinforcement learning for mapping instructions to actions. In ACL, Cited by: 4th item.
 Selfconsistent trajectory autoencoder: hierarchical reinforcement learning with trajectory embeddings. In ICML, pp. 1009–1018. Cited by: §1.
 Privacy aware learning. In NIPS, pp. 1439–1447. Cited by: §1.
 The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (34), pp. 211–407. Cited by: §4.
 Extracting action sequences from texts based on deep reinforcement learning. In IJCAI, pp. 4064–4070. Cited by: 3rd item, §5.2, §5.2.
 Learning to communicate with deep multiagent reinforcement learning. In NIPS, pp. 2137–2145. Cited by: §1.
 Fully convolutional network with multistep reinforcement learning for image processing. In AAAI, pp. 3598–3605. Cited by: 3rd item.
 Federated optimization: distributed optimization beyond the datacenter. CoRR abs/1511.03575. Cited by: §2.
 Federated learning: strategies for improving communication efficiency. CoRR abs/1610.05492. Cited by: §1, §2.
 Multiagent reinforcement learning in sequential social dilemmas. In AAMAS, pp. 464–473. Cited by: §1, §2.
 Multiagent actorcritic for mixed cooperativecompetitive environments. In NIPS, pp. 6382–6393. Cited by: §2.
 Communicationefficient learning of deep networks from decentralized data. In AISTATS, pp. 1273–1282. Cited by: §1, §2.
 Humanlevel control through deep reinforcement learning.. Nature 518 (7540), pp. 529–33. Cited by: §1, §3, 1st item.
 Deep decentralized multitask multiagent reinforcement learning under partial observability. In ICML, pp. 2681–2690. Cited by: §2.
 QMIX: monotonic value function factorisation for deep multiagent reinforcement learning. In ICML, pp. 4292–4301. Cited by: §2.
 Federated multitask learning. In NIPS, pp. 4427–4437. Cited by: §2.
 Reinforcement learning  an introduction. Adaptive computation and machine learning, MIT Press. External Links: ISBN 0262193981 Cited by: §1.
 Value iteration networks. In NIPS, pp. 2146–2154. Cited by: §5.1.
 Multiagent cooperation and competition with deep reinforcement learning. CoRR abs/1511.08779. Cited by: §1, §2.
 Transfer learning for reinforcement learning domains: a survey. J. Mach. Learn. Res. 10, pp. 1633–1685. External Links: ISSN 15324435 Cited by: §1.
 Transfer of value functions via variational methods. In NIPS, pp. 6182–6192. Cited by: §1.
 Federated machine learning: concept and applications. ACM TIST 10 (2), pp. 12:1–12:19. Cited by: §1.
 Modellite planning: casebased vs. modelbased approaches. Artif. Intell. 246, pp. 1–21. Cited by: §6.
 Learning hierarchical task network domains from partially observed plan traces. Artif. Intell. 212, pp. 134–157. Cited by: §6.
 Actionmodel acquisition for planning via transfer learning. Artif. Intell. 212, pp. 80–103. Cited by: §6.