Federated Deep Reinforcement Learning
In deep reinforcement learning, building high-quality policies is challenging when the feature space of states is small and the training data is limited. Despite the success of previous transfer learning approaches in deep reinforcement learning, directly transferring data or models from one agent to another is often not allowed in privacy-aware applications, because the data and/or models are private. In this paper, we propose a novel deep reinforcement learning framework, namely Federated deep Reinforcement Learning (FedRL), to federatively build high-quality models for agents while taking their privacy into account. To protect the privacy of data and models, we exploit Gaussian differentials on the information that agents share with each other when updating their local models. In the experiment, we evaluate our FedRL framework in two diverse domains, Grid-world and Text2Action, by comparing it to various baselines.
1 Introduction
In deep reinforcement learning, building high-quality policies is challenging when the feature space of states is small and the training data is limited. In many real-world applications, however, datasets from clients are often privacy sensitive (Duchi et al., 2012), and it is often difficult for a data center to guarantee building high-quality models. To deal with the issue, Konecný et al. propose a new learning setting, namely federated learning, whose goal is to train a classification or clustering model with training data involving texts, images or videos distributed over a large number of clients (Konecný et al., 2016; McMahan et al., 2017). Different from previous federated learning settings (cf. Yang et al., 2019), we propose a novel federated learning framework based on reinforcement learning (Sutton and Barto, 1998; Mnih et al., 2015; Co-Reyes et al., 2018), i.e., Federated deep Reinforcement Learning (FedRL), which aims to learn a private Q-network policy for each agent by sharing limited information (i.e., the output of the Q-network) among agents. The information is “encoded” when it is sent to others and “decoded” when it is received by others. We assume that some agents have rewards corresponding to states and actions, while others have only observed states without rewards. Without rewards, those agents are unable to build decision policies from their own information. We claim that all agents benefit from joining the federation in building decision policies.
There are many applications of federated reinforcement learning. For example, in the manufacturing industry, producing products may involve various factories which produce different components of the products. Factories’ decision policies are private and will not be shared with each other. On the other hand, building high-quality individual decision policies on their own is often difficult due to their limited businesses and lack of rewards (for some factories). It is thus helpful for them to learn decision policies federatively under the condition that private data is not given away. Another example is building medical treatment policies for hospitals. Patients may be treated in some hospitals and never give feedback on the treatments, which means these hospitals are unable to collect rewards based on the treatments given to the patients and so cannot build treatment decision policies. In addition, data records about patients are private and may not be shared among hospitals. It is thus necessary to learn treatment policies for hospitals federatively.
Our FedRL framework is different from multi-agent reinforcement learning, which is concerned with a set of autonomous agents that observe global states (or partial states which are directly shared to make “global” states), select an individual action and receive a team reward (or each agent receives an individual reward but shares it with other agents) (Tampuu et al., 2015; Leibo et al., 2017; Foerster et al., 2016). FedRL assumes agents do not share their partial observations and some agents are unable to receive rewards. Our FedRL framework is also different from transfer learning in reinforcement learning, which aims to transfer experience gained in learning to perform one task to help improve learning performance in a related but different task or agent, assuming observations are shared with each other (Taylor and Stone, 2009; Tirinzoni et al., 2018), while FedRL assumes states cannot be shared among agents.
Our FedRL framework functions in three phases. First, each agent collects output values of Q-networks from other agents, which are “encrypted” with Gaussian differentials. Second, it builds a shared value network, e.g., an MLP (multilayer perceptron), to compute a global Q-network output with its own Q-network output and the encrypted values as input. Finally, it updates both the shared value network and its own Q-network based on the global Q-network output. Note that the MLP is shared among agents, while agents’ own Q-networks are unknown to others and should not be inferable from the encrypted Q-network output shared in the training process.
In the remainder of the paper, we first review previous work related to our FedRL framework, and then present the problem formulation of FedRL. After that, we introduce our FedRL framework in detail. Finally, we evaluate our FedRL framework in the Grid-World domain with various sizes and in the Text2Action domain.
2 Related Work
The nascent field of federated learning considers training statistical models directly on devices (Konecný et al., 2015; McMahan et al., 2017). The aim in federated learning is to fit a model to data generated by distributed nodes. Each node collects data in a non-IID manner across the network, with data on each node being generated by a distinct distribution. There are typically a large number of nodes in the network, and communication is often a significant bottleneck. Different from previous work that trains a single global model across the network (Konecný et al., 2015, 2016; McMahan et al., 2017), Smith et al. propose to learn a separate model for each node, which is naturally captured through a multi-task learning (MTL) framework, where the goal is to fit separate but related models simultaneously (Smith et al., 2017). Different from those federated learning approaches, we consider federated settings in reinforcement learning.
Our work is also related to multi-agent reinforcement learning (MARL), which involves a set of agents in a shared environment. A straightforward approach to MARL is to extend single-agent RL approaches. Q-learning has been extended to cooperative multi-agent settings, namely Independent Q-learning (IQL), in which each agent observes the global state, selects an individual action and receives a team reward (Tampuu et al., 2015; Leibo et al., 2017). One challenge of MARL is that multi-agent domains are non-stationary from each agent’s perspective, due to other agents’ interactions in the shared environment. To address this issue, Omidshafiei et al. (2017) propose to explore Concurrent Experience Replay Trajectories (CERTs) structures, which store different agents’ histories and align them together based on the episode indices and time steps. Because the action space grows exponentially with the number of agents, and because of partial observability and limited communication, learning becomes very difficult when the number of agents is large. Lowe et al. (2017) thus propose to solve the MARL problem through a Centralized Critic and a Decentralized Actor, and Rashid et al. (2018) propose to exploit a monotonic factorisation of the joint value function across agents. Different from MARL, our FedRL framework assumes agents do not share their partial observations and some agents are unable to receive rewards, instead of assuming observations are sharable and all agents are able to receive rewards.
3 Problem Definition
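A Markov Decision Process (MDP) can be defined by $\langle S, A, T, r \rangle$, where $S$ is the state space and $A$ is the action space. $T$ is the transition function $T: S \times A \times S \rightarrow [0, 1]$, i.e., $T(s, a, s') = P(s' \mid s, a)$, specifying the probability of the next state $s' \in S$ given the current state $s \in S$ and the action $a \in A$ applied in $s$. $r$ is the reward function $r: S \times A \rightarrow \mathbb{R}$, where $\mathbb{R}$ is the space of real numbers. Given a policy $\pi: S \rightarrow A$, the value function $V^{\pi}(s)$ and the Q-function $Q^{\pi}(s, a)$ at step $t+1$ can be updated from their values at step $t$:
$$V^{\pi}_{t+1}(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V^{\pi}_{t}(s'),$$
$$Q^{\pi}_{t+1}(s, a) = r(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, Q^{\pi}_{t}(s', \pi(s')),$$
for a discount factor $\gamma \in [0, 1)$. The solution to an MDP problem is the best policy $\pi^{*}$ such that $V^{\pi^{*}}(s) = \max_{\pi} V^{\pi}(s)$ or $Q^{\pi^{*}}(s, a) = \max_{\pi} Q^{\pi}(s, a)$. In DQN (Deep Q-Network), given that the transition function $T$ is unknown, the Q-function is represented by a Q-network $Q(s, a; \theta)$ with $\theta$ as parameters of the network, and updated by minimizing the squared temporal-difference error
$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right)^{2}\right],$$
as done by Mnih et al. (2015). To learn the parameters $\theta$, one way is to store transitions $\langle s, a, s', r \rangle$ in replay memories and exploit mini-batch sampling to repeatedly update $\theta$ (Mnih et al., 2015). Once $\theta$ is learnt, the policy $\pi^{*}$ can be extracted from $Q(s, a; \theta)$,
$$\pi^{*}(s) = \arg\max_{a \in A} Q(s, a; \theta).$$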
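The DQN quantities described above can be illustrated with a few lines of Python. This is a minimal sketch under the standard one-step Q-learning formulation, not the authors' implementation; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step target r + gamma * max_a' Q(s', a'); just r at a terminal state."""
    if done:
        return float(reward)
    return float(reward + gamma * np.max(next_q_values))

def squared_td_error(q_sa, target):
    """Squared difference between the target and the current estimate Q(s, a)."""
    return float((target - q_sa) ** 2)

def greedy_action(q_values):
    """Greedy policy extraction: the index of the largest Q-value."""
    return int(np.argmax(q_values))

# Illustrative numbers only.
q_next = np.array([1.0, 3.0, 2.0])
y = td_target(reward=-1.0, next_q_values=q_next, gamma=0.9)
loss = squared_td_error(q_sa=1.2, target=y)
```

In a full DQN the loss would be averaged over a mini-batch sampled from the replay memory and minimized with respect to the network parameters.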
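We define our federated deep reinforcement learning problem as follows: given transitions $\mathcal{D}_{\alpha} = \{\langle s_{\alpha}, a_{\alpha}, s'_{\alpha}, r_{\alpha} \rangle\}$ collected by agent $\alpha$, and pairs of states and actions $\mathcal{D}_{\beta} = \{\langle s_{\beta}, a_{\beta} \rangle\}$ collected by agent $\beta$, we aim to federatively build policies $\pi^{*}_{\alpha}$ and $\pi^{*}_{\beta}$ for agents $\alpha$ and $\beta$, respectively. Note that in this paper we consider a federation with two members for simplicity. The setting can be extended to many agents by exploiting the same federated mechanism between each pair of agents. We denote states, actions, Q-functions, and policies with respect to agents $\alpha$ and $\beta$ by “$s_{\alpha}$, $a_{\alpha}$, $Q_{\alpha}$, $\pi^{*}_{\alpha}$” and “$s_{\beta}$, $a_{\beta}$, $Q_{\beta}$, $\pi^{*}_{\beta}$”, respectively.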
In our federated deep reinforcement learning problem, we assume:
A1: The feature spaces of states $s_{\alpha}$ and $s_{\beta}$ are different between agents $\alpha$ and $\beta$. For example, a state $s_{\alpha}$ denotes a patient’s cardiogram in hospital $\alpha$, while another state $s_{\beta}$ denotes the same patient’s electroencephalogram in hospital $\beta$, indicating the feature spaces of $s_{\alpha}$ and $s_{\beta}$ are different.
A2: Transitions $\mathcal{D}_{\alpha}$ and $\mathcal{D}_{\beta}$ cannot be shared directly between $\alpha$ and $\beta$ when they are learning their own models. The correspondences between transitions from $\mathcal{D}_{\alpha}$ and $\mathcal{D}_{\beta}$ are, however, known to each other. In other words, agent $\alpha$ can send the “ID” of a transition to agent $\beta$, and agent $\beta$ can use that “ID” to find its corresponding transition in $\mathcal{D}_{\beta}$. For example, in a hospital, an “ID” can correspond to a specific patient.
A3: The outputs of functions $Q_{\alpha}$ and $Q_{\beta}$ can be shared with each other under the condition that they are protected by some privacy protection mechanism.
Based on A1-A3, we aim to learn high-quality policies $\pi^{*}_{\alpha}$ and $\pi^{*}_{\beta}$ for agents $\alpha$ and $\beta$ while preserving the privacy of their own data and models.
4 Our FedRL Approach
In this section we present our FedRL framework in detail. An overview of FedRL is shown in Figure 1, where the left part is the model of agent $\alpha$, and the right part is the model of agent $\beta$. Each model is composed of a local Q-network with parameters $\theta_{\alpha}$ for agent $\alpha$, $\theta_{\beta}$ for agent $\beta$, and a global MLP module with parameters $\theta_{g}$ for both $\alpha$ and $\beta$.
We build two Q-networks for agents $\alpha$ and $\beta$, denoted by $Q_{\alpha}(s_{\alpha}, a_{\alpha}; \theta_{\alpha})$ and $Q_{\beta}(s_{\beta}, a_{\beta}; \theta_{\beta})$, respectively, where $\theta_{\alpha}$ and $\theta_{\beta}$ are parameters of the Q-networks. The outputs of these two basic Q-networks are not directly used to predict the actions, but taken as input of the MLP module.
Gaussian differential privacy
To prevent agents from “inducing” each other’s models from the repeatedly received outputs of the other’s Q-network during training, we consider using differential privacy (Dwork and Roth, 2014) to protect the output of each agent’s Q-network. There are various mechanisms of differential privacy, such as the Gaussian mechanism (Abadi et al., 2016) and the Binomial mechanism (Agarwal et al., 2018). In this paper, we exploit the Gaussian mechanism since the output of the MLP with Gaussian input is Gaussian itself. In previous federated learning settings, the mechanism is applied to the gradients of agents (clients) before they are sent to the server or other agents (clients). In our FedRL framework, we send the output of one agent’s Q-network to another, so the Gaussian noise is added to the output rather than the gradients. The mechanism is defined by
$$\hat{Q}_{\alpha}(s_{\alpha}, a_{\alpha}; \theta_{\alpha}) = Q_{\alpha}(s_{\alpha}, a_{\alpha}; \theta_{\alpha}) + \mathcal{N}(0, \sigma^{2}), \qquad \hat{Q}_{\beta}(s_{\beta}, a_{\beta}; \theta_{\beta}) = Q_{\beta}(s_{\beta}, a_{\beta}; \theta_{\beta}) + \mathcal{N}(0, \sigma^{2}),$$
where $\mathcal{N}(0, \sigma^{2})$ is the Gaussian distribution with mean 0 and standard deviation $\sigma$.
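As a concrete illustration of adding Gaussian noise to a Q-network output before it is shared, the sketch below perturbs a vector of Q-values with zero-mean noise. It is a minimal sketch only: the function name and the example values are hypothetical, and a formal differential privacy guarantee would additionally require calibrating the standard deviation to a bound on the sensitivity of the output, which is not shown here.

```python
import numpy as np

def add_gaussian_noise(q_values, sigma=1.0, rng=None):
    """Return q_values perturbed by independent N(0, sigma^2) noise per entry."""
    rng = np.random.default_rng() if rng is None else rng
    q_values = np.asarray(q_values, dtype=float)
    return q_values + rng.normal(loc=0.0, scale=sigma, size=q_values.shape)

rng = np.random.default_rng(0)
q_alpha = np.array([0.5, -1.0, 2.0, 0.0])   # illustrative Q-values, one per action
noisy_q_alpha = add_gaussian_noise(q_alpha, sigma=1.0, rng=rng)

# Empirical check that the injected noise has roughly the requested statistics.
noise_sample = add_gaussian_noise(np.zeros(100_000), sigma=1.0, rng=rng)
```

Only `noisy_q_alpha` would leave the agent; the clean `q_alpha` stays local.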
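We build a new Q-network, namely the federated Q-network denoted by $Q_{f}(\cdot; \theta_{\alpha}, \theta_{\beta}, \theta_{g})$, to leverage the outputs of the two basic Q-networks with Gaussian noise, $\hat{Q}_{\alpha}$ and $\hat{Q}_{\beta}$, based on an MLP, which is defined by
$$Q_{f}(\cdot; \theta_{\alpha}, \theta_{\beta}, \theta_{g}) = \mathrm{MLP}\left(\left[\hat{Q}_{\alpha}(s_{\alpha}, a_{\alpha}; \theta_{\alpha}) \mid \hat{Q}_{\beta}(s_{\beta}, a_{\beta}; \theta_{\beta})\right]; \theta_{g}\right),$$
where $\theta_{g}$ denotes the parameters of the MLP and $[\cdot \mid \cdot]$ indicates the concatenation operation. Note that the parameters of the MLP can be shared between agents. Once the MLP is updated, the updated parameters are shared with the other agent.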
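With respect to agents $\alpha$ and $\beta$, we define each agent’s federated Q-network by viewing the other agent’s basic Q-network (with Gaussian noise) as a fixed constant when updating its own basic Q-network, as shown below,
$$Q^{\alpha}_{f}(\cdot; \theta_{\alpha}, \theta_{g}) = \mathrm{MLP}\left(\left[\hat{Q}_{\alpha}(s_{\alpha}, a_{\alpha}; \theta_{\alpha}) \mid C_{\beta}\right]; \theta_{g}\right), \qquad Q^{\beta}_{f}(\cdot; \theta_{\beta}, \theta_{g}) = \mathrm{MLP}\left(\left[C_{\alpha} \mid \hat{Q}_{\beta}(s_{\beta}, a_{\beta}; \theta_{\beta})\right]; \theta_{g}\right),$$
where $C_{\beta} = \hat{Q}_{\beta}(s_{\beta}, a_{\beta}; \theta_{\beta})$ and $C_{\alpha} = \hat{Q}_{\alpha}(s_{\alpha}, a_{\alpha}; \theta_{\alpha})$ are fixed constants when updating agent $\alpha$’s basic Q-network and agent $\beta$’s basic Q-network, respectively.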
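The federated Q-network described above, an MLP applied to the concatenation of the two noisy basic Q-network outputs, can be sketched as a plain NumPy forward pass. This is an illustration under stated assumptions rather than the authors' model: the hidden size, the weight initialisation, the number of actions, and the choice that the MLP outputs one value per action are all hypothetical.

```python
import numpy as np

def init_mlp(in_dim, hidden_dim, out_dim, rng):
    """Randomly initialise a two-layer MLP (all sizes here are illustrative)."""
    return (
        rng.normal(scale=0.1, size=(in_dim, hidden_dim)), np.zeros(hidden_dim),
        rng.normal(scale=0.1, size=(hidden_dim, out_dim)), np.zeros(out_dim),
    )

def mlp_forward(x, params):
    """Two fully-connected layers with a ReLU in between."""
    w1, b1, w2, b2 = params
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

def federated_q(noisy_q_alpha, noisy_q_beta, params):
    """MLP over the concatenation [alpha | beta]; both agents keep this order."""
    return mlp_forward(np.concatenate([noisy_q_alpha, noisy_q_beta]), params)

rng = np.random.default_rng(0)
n_actions = 4
theta_g = init_mlp(2 * n_actions, 16, n_actions, rng)

# Agent alpha's view: its own noisy output plus the constant received from beta.
own_alpha = np.array([0.2, 1.5, -0.3, 0.0])
c_beta = np.array([1.0, 0.1, 0.4, -2.0])
q_f_alpha = federated_q(own_alpha, c_beta, theta_g)
```

When agent alpha trains, gradients would flow through `own_alpha` and `theta_g` only, while `c_beta` is treated as data; agent beta does the symmetric computation with the received constant in the first slot.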
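The Q-networks are trained by minimizing the square error losses $L_{\alpha}(\theta_{\alpha}, \theta_{g})$ and $L_{\beta}(\theta_{\beta}, \theta_{g})$ of agents $\alpha$ and $\beta$,
$$L_{\alpha}(\theta_{\alpha}, \theta_{g}) = \mathbb{E}\left[\left(Y - Q^{\alpha}_{f}(\cdot; \theta_{\alpha}, \theta_{g})\right)^{2}\right], \qquad L_{\beta}(\theta_{\beta}, \theta_{g}) = \mathbb{E}\left[\left(Y - Q^{\beta}_{f}(\cdot; \theta_{\beta}, \theta_{g})\right)^{2}\right],$$
where $Y = r + \gamma \max_{a'} Q^{\alpha}_{f}(s'_{\alpha}, a'; \theta_{\alpha}, \theta_{g})$ with discount factor $\gamma$. Note that agent $\beta$ is unable to compute $Y$ since it does not have the reward $r$. $Y$ is computed by agent $\alpha$ and shared with agent $\beta$.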
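The loss computation can be sketched on a small batch as follows. This is a hedged illustration assuming a standard one-step target built from the reward and the maximum next-step federated Q-value; the terminal-state mask, the function names and all numbers are assumptions made for the example. It also shows why the agent without rewards depends on the other: it reuses the shared targets rather than computing them.

```python
import numpy as np

def compute_targets(rewards, next_q_f, dones, gamma=0.9):
    """Targets r + gamma * max_a' Q_f(s', a'), with no bootstrap at terminal steps."""
    rewards = np.asarray(rewards, dtype=float)
    not_done = 1.0 - np.asarray(dones, dtype=float)
    return rewards + gamma * not_done * np.max(next_q_f, axis=1)

def squared_error_loss(targets, q_f_taken):
    """Mean squared error between targets and federated Q-values of taken actions."""
    return float(np.mean((np.asarray(targets) - np.asarray(q_f_taken)) ** 2))

# The reward-holding agent computes the targets ...
y = compute_targets(
    rewards=[-1.0, 50.0],
    next_q_f=np.array([[0.0, 2.0], [1.0, 1.0]]),
    dones=[False, True],
    gamma=0.5,
)
loss_alpha = squared_error_loss(y, q_f_taken=[1.0, 48.0])
# ... and the other agent, which has no rewards, reuses the shared targets.
loss_beta = squared_error_loss(y, q_f_taken=[0.5, 49.0])
```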
Overview of training and testing
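The training and testing procedures of agents $\alpha$ and $\beta$ are shown in Figure 2. When training, agent $\beta$ computes $C_{\beta}$ and sends it to agent $\alpha$. Agent $\alpha$ updates $\theta_{\alpha}$ and $\theta_{g}$, computes $C_{\alpha}$, and sends $C_{\alpha}$ and $\theta_{g}$ to agent $\beta$. After that, agent $\beta$ updates $\theta_{\beta}$ and $\theta_{g}$, computes $C_{\beta}$, and sends $C_{\beta}$ and $\theta_{g}$ to agent $\alpha$. When testing, both agents compute and send $C_{\alpha}$ and $C_{\beta}$ to each other for computing $Q^{\alpha}_{f}$ and $Q^{\beta}_{f}$.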
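The message flow of one training round can be written down as data to make explicit what crosses the agent boundary. The sketch below is illustrative only: it assumes, based on the surrounding text, that the exchanged items are the noisy Q-network outputs, the shared MLP parameters, and the target computed by the agent that holds rewards. The agent labels, payload keys and placeholder strings are hypothetical.

```python
# Items that, under the stated privacy assumptions, must never be sent.
PRIVATE_KEYS = {"state", "action", "reward", "theta_alpha", "theta_beta"}

def training_round(record_id):
    """List the messages of one training round as (sender, receiver, payload)."""
    return [
        # beta evaluates its basic Q-network on the aligned record and adds noise.
        ("beta", "alpha", {"record_id": record_id, "noisy_q": "C_beta"}),
        # alpha updates its Q-network and the MLP, then shares its noisy output,
        # the target value, and the updated MLP parameters.
        ("alpha", "beta", {"record_id": record_id, "noisy_q": "C_alpha",
                           "target": "Y", "mlp_params": "theta_g"}),
        # beta updates its Q-network and the MLP, then returns a fresh noisy
        # output together with the updated MLP parameters.
        ("beta", "alpha", {"record_id": record_id, "noisy_q": "C_beta",
                           "mlp_params": "theta_g"}),
    ]

def leaks_private_data(messages):
    """True if any payload carries a key that should stay local."""
    return any(PRIVATE_KEYS.intersection(payload) for _, _, payload in messages)

messages = training_round(record_id=7)
```

A check such as `leaks_private_data` is one simple way to assert, in tests, that raw states, rewards and local network parameters are never placed in a message.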
The detailed training procedure can be seen from Algorithms 1 and 2. In Steps 1 and 2 of Algorithm 1, we initialize the basic Q-network and replay memory of agent $\alpha$ and the MLP module. In Step 3 of Algorithm 1, we call the function from Algorithm 2 to initialize the Q-network and replay memory of agent $\beta$. In Step 6 of Algorithm 1, we obtain the observation of agent $\alpha$’s state $s_{\alpha}$. In Step 7 of Algorithm 1, we call the function in Algorithm 2 to calculate the output of the basic Q-network of agent $\beta$, obtain the observation of agent $\beta$ and select the corresponding action, as shown in Steps 6 to 10 of Algorithm 2. In Steps 8 to 11 of Algorithm 1, we perform $\epsilon$-greedy exploration and exploitation, obtain new observations and store the transitions in the replay memory. In Steps 12 and 13 of Algorithm 1, we sample a record from the memory and call the function of Algorithm 2 to calculate the output of the basic Q-network of agent $\beta$ based on the index of the sampled record. In Steps 14 and 15 of Algorithm 1, we update parameters $\theta_{\alpha}$ and $\theta_{g}$. In Steps 16 and 17 of Algorithm 1, we compute the output of the basic Q-network of agent $\alpha$, pass it to agent $\beta$, and call the function of agent $\beta$ to update the basic Q-network of agent $\beta$ and the MLP, as shown in Steps 19 to 21 of Algorithm 2. Note that Algorithms 1 and 2 are executed by agents $\alpha$ and $\beta$ separately.
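Two building blocks mentioned in this procedure, epsilon-greedy action selection and a replay memory whose records carry an index that the other agent can use to look up its corresponding record, can be sketched as below. This is not Algorithm 1 or 2 from the paper; the class, its method names and the capacity are hypothetical and only illustrate the idea.

```python
import random
from collections import deque

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

class ReplayMemory:
    """Fixed-capacity memory; each stored record gets an increasing index."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
        self.next_id = 0

    def store(self, transition):
        record_id = self.next_id
        self.buffer.append((record_id, transition))
        self.next_id += 1
        return record_id

    def sample(self, rng):
        """Return one (record_id, transition) pair chosen uniformly at random."""
        return rng.choice(list(self.buffer))

rng = random.Random(0)
memory = ReplayMemory(capacity=2)
for name in ("t0", "t1", "t2"):
    memory.store(name)            # the oldest record is evicted at capacity
sampled_id, sampled_transition = memory.sample(rng)
```

Only `sampled_id` would need to be sent to the other agent, which can then evaluate its own basic Q-network on its matching record.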
5 Experiments
In the experiment, we evaluate FedRL (the source code and datasets are available from https://github.com/FRL2019/FRL) in the following two aspects. Firstly, we would like to see if agent $\alpha$ can learn better policies by joining the federation than without joining the federation (agent $\alpha$ can build policies by itself since it has complete transitions $\mathcal{D}_{\alpha}$, while agent $\beta$ cannot build policies without joining the federation since it has only pairs of states and actions $\mathcal{D}_{\beta}$). Secondly, we would like to see if our FedRL approach can learn high-quality policies which are close to the ones learnt by directly combining the data of both agents $\alpha$ and $\beta$, neglecting the data-privacy issue between $\alpha$ and $\beta$. To do this, we compared our FedRL approach to the following baselines:
DQN-alpha: a deep Q-network (Mnih et al., 2015) trained with agent $\alpha$’s data only. It takes observations $s_{\alpha}$ as input and outputs actions corresponding to $s_{\alpha}$.
DQN-full: a deep Q-network trained by directly putting together data from both agents $\alpha$ and $\beta$, i.e., neglecting data privacy between agents $\alpha$ and $\beta$.
FCN-alpha: a fully convolutional network (FCN) trained with agent $\alpha$’s data only, similar to DQN-alpha. FCN-alpha aims to build policies via supervised learning (Furuta et al., 2019), i.e., viewing states as input and actions as labels.
FCN-full: a fully convolutional network trained with all data of agents $\alpha$ and $\beta$ put together directly, similar to DQN-full.
In our experiment, we used the kernel of size in CNN and two fully-connected layers with the size of in MLP. We set the standard deviation in Gaussian differential privacy to be 1. We adopted the Adam optimizer with learning rate 0.001 and the ReLU activation for all models. We evaluated our FedRL with comparison to baselines in two domains, i.e., Grid-World and Text2Action.
5.1 Evaluation in Grid-World Domain
We first conducted experiments in the Grid-World domain with randomly placed obstacles (Tamar et al., 2016). The two agents $\alpha$ and $\beta$ were randomly put in the environment as well, as shown in Figure 3. The two agents aim to navigate along optimal paths (i.e., the shortest paths without hitting obstacles) to meet each other. In this domain, we define states, actions and rewards as follows.
States: The domain is represented by a binary-valued matrix (0 for obstacle, 1 otherwise) whose dimension gives the size of the domain. We evaluated our approach on three different domain sizes. The observed state of agent $\alpha$, denoted by $s_{\alpha}$, was set to be a matrix with the current position of agent $\alpha$ in the center of the matrix. The observed state of agent $\beta$, denoted by $s_{\beta}$, was set to be a matrix with the current position of agent $\beta$ in the center of the matrix.
Actions: There are 4 actions for each agent, i.e., moving in one of 4 directions.
Rewards: The reward is composed of two parts, i.e., a local reward and a global reward. When an agent hits an obstacle, the local reward is set to -10; when an agent meets the other agent, it is set to +50; otherwise, it is set to -1. Considering that the goal of the task is to make the two agents eventually meet each other, we exploited an additional reward regarding the distance between the two agents, namely the global reward, which is computed from the Manhattan distance between the two agents and a regularization factor set to the dimension of the domain. The final reward is calculated from the local and global rewards.
Dataset: We generated 8000 different maps (or matrices) for each domain size. In each map, we randomly chose two positions for the two agents, and computed the optimal path (shortest path), which was compared to the paths predicted by our FedRL and baselines. We randomly split the 8000 maps into 6400 for training, 800 for validation, and 800 for testing.
Criteria: In each training episode, we first took the initial state and predicted an action with a model. We then got a new state and took it as the new input of the model at the second time step. We repeated the procedure until the two agents met each other, which is counted as a successful episode, or until it exceeded the maximum number of time steps, which is counted as a failed episode. We set the maximum number of time steps to twice the length of the longest optimal path for each domain size. We finally computed the success rate $\mathit{SuccRate}$ and the average reward $\mathit{AvgRwd}$, i.e.,
$$\mathit{SuccRate} = \frac{N_{\mathrm{succ}}}{N_{\mathrm{total}}}, \qquad \mathit{AvgRwd} = \frac{R_{\mathrm{total}}}{N_{\mathrm{total}}},$$
where $N_{\mathrm{succ}}$ indicates the number of successful episodes, $N_{\mathrm{total}}$ indicates the total number of episodes, and $R_{\mathrm{total}}$ indicates the total reward of all episodes.
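The Grid-World setup above can be made concrete with a short sketch that computes a shortest path on a binary map by breadth-first search, encodes the stated local reward values, and aggregates the two evaluation measures. It is an illustration only: the function names and the tiny example map are hypothetical, and the global reward is left out because its exact formula is not given in the text; only the Manhattan distance it relies on is shown.

```python
from collections import deque

def shortest_path_length(grid, start, goal):
    """BFS over free cells (1 = free, 0 = obstacle); None if goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 1 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

def local_reward(hit_obstacle, met_other_agent):
    """Local reward values as stated in the text: -10, +50, otherwise -1."""
    if hit_obstacle:
        return -10
    return 50 if met_other_agent else -1

def manhattan(p, q):
    """Manhattan distance between two grid positions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def summarize(episodes):
    """episodes: list of (succeeded, total_reward) -> (success rate, avg reward)."""
    total = len(episodes)
    success_rate = sum(1 for ok, _ in episodes if ok) / total
    return success_rate, sum(r for _, r in episodes) / total

grid = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
detour = shortest_path_length(grid, (0, 1), (2, 1))   # must go around the wall
rate, avg = summarize([(True, 40.0), (False, -20.0), (True, 30.0), (True, 10.0)])
```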
[Table 1: Metric | Method | Domain w.r.t. various sizes]
Experimental Results w.r.t. Domain Sizes
We ran our FedRL and baselines five times and calculated the average success rate (as well as the average reward). We calculated two results for our FedRL, denoted by FedRL-1 and FedRL-2, which correspond to “not adding Gaussian differential privacy on the local Q-network” and “adding Gaussian differential privacy on the local Q-network” in the testing data, respectively. The results are shown in Table 1. From Table 1, we can see that the success rate of both FedRL-1 and FedRL-2 is much better than that of DQN-alpha and FCN-alpha in all three domain sizes, which indicates agent $\alpha$ can indeed get help from agent $\beta$ via learning federatively with agent $\beta$. Comparing to DQN-full, we can see that the success rate of both FedRL-1 and FedRL-2 is close to that of DQN-full, which indicates our federated learning framework can indeed take advantage of the training data from both agents $\alpha$ and $\beta$, even though they are protected locally with Gaussian differential privacy (the reason why FedRL-2 is slightly better than DQN-full is that the hierarchical structure of our federated learning framework with three components may be more suitable for this domain than a single DQN framework). In addition, we can also see that the success rate generally decreases when the size of the domain increases, which is consistent with our intuition, since the larger the domain is, the more difficult the task is (which requires more training data to build high-quality models).
From the average reward metric, we can see that both FedRL-1 and FedRL-2 are better than DQN-alpha, which indicates agent $\alpha$ can indeed get help from agent $\beta$ in gaining rewards via federated learning with agent $\beta$. We can also find that FedRL-2 outperforms FedRL-1 in both success rate and average reward. This is because our model is trained with Gaussian differential privacy, so evaluating on testing data with Gaussian differential privacy, as done by FedRL-2, should be better than evaluating on testing data without it, as done by FedRL-1.
Experimental Results w.r.t. History Length
To study when FedRL works, we consider the amount of information that an agent uses. The observation input at each time step is a sequence of observations, i.e., the observation history. Intuitively, the longer the observation history, the more information we have, and the more complicated the neural network is. We fixed the structures of all models and only changed the length of the observation history. We tested history lengths from 2 to 32 for all domains. The results are shown in Figure 4. In the first row of Figure 4, we can observe that the success rates improve as the history length increases. In two of the domains, the results converge when the history length is longer than 16, while in the remaining domain they have not converged even at a history length of 32. The reason is that this domain is more complicated than the other two domains, so it needs more information to learn a high-quality model. We can also find that when the history length is short, DQN-alpha, DQN-full and FedRL-1 perform poorly. FedRL-1 and DQN-full do not show their advantages although they take both $s_{\alpha}$ and $s_{\beta}$ as input, directly or indirectly. However, FedRL-2, which applies differential privacy to both training and testing samples, shows its great scalability even with a limited amount of history. In small domains, a few steps are enough to explore the whole environment. Therefore, DQN-alpha, which takes only a single agent’s observation as input, can perform as well as the DQN-full and FedRL models.
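An observation history of fixed length can be maintained with a bounded queue, as in the sketch below. This is a minimal illustration, assuming the history is zero-padded before enough observations have been seen; the class name, the padding choice and the observation shape are hypothetical.

```python
from collections import deque
import numpy as np

class ObservationHistory:
    """Keeps the most recent `length` observations, zero-padded at the start."""

    def __init__(self, length, obs_shape):
        self.frames = deque(
            [np.zeros(obs_shape) for _ in range(length)], maxlen=length
        )

    def push(self, obs):
        """Add a new observation and return the stacked history (oldest first)."""
        self.frames.append(np.asarray(obs, dtype=float))
        return np.stack(self.frames)

history = ObservationHistory(length=4, obs_shape=(3, 3))
first = history.push(np.ones((3, 3)))
for value in (2, 3, 4, 5):
    stacked = history.push(np.full((3, 3), value))
```

The stacked array is what a model with a given history length would receive as input at each time step.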
5.2 Text2Action Domain
In this experiment, we evaluated our FedRL in another domain, i.e., Text2Action (Feng et al., 2018), which aims to extract action sequences from texts. For example, consider a text “Cook the rice the day before, or use leftover rice in the refrigerator. The important thing to remember is not to heat up the rice, but keep it cold.”, which describes the procedure of making egg fried rice. The task is to extract words composing an action sequence “cook(rice), keep(rice, cold)”, or “use(leftover rice), keep(rice, cold)”. We define states, actions and rewards as follows.
States: $s_{\alpha}$ is a real-valued matrix that describes the part-of-speech of words, and $s_{\beta}$ is a real-valued matrix that describes the embedding of words. Their sizes are determined by the number of words in the text together with, respectively, the dimension of a part-of-speech vector and the dimension of word vectors. In our experiments, part-of-speech vectors are randomly initialized and trained together with the Q-network, while word vectors are generated from pre-trained word embeddings and are not changed during the training of the Q-network.
Actions: There are two actions for each agent, i.e., selecting a word as an action name (or a parameter), or neglecting a word (which means the corresponding word is neither an action name nor a parameter).
Rewards: The instant reward includes a basic reward and an additional reward, where the basic reward indicates whether the agent selects a word correctly or not, and the additional reward encodes prior knowledge of the domain, i.e., the proportion of words that are related to action sequences (Feng et al., 2018). We assume that only agent $\alpha$ knows the rewards, while agent $\beta$ does not know them.
Dataset: We conducted experiments on three datasets, i.e., “Microsoft Windows Help and Support” (WHS) documents (Branavan et al., 2009), and two datasets collected from “WikiHow Home and Garden” (WHG, https://www.wikihow.com/Category:Home-and-Garden) and “CookingTutorial” (CT, http://cookingtutorials.com/), respectively. In CT, there are 116 labeled texts and 134,000 words, with 10.37% being action names and 7.44% being action arguments. In WHG, there are 150 labeled texts and 34,000,000 words, with 7.61% being action names and 6.3% being action arguments. In WHS, there are 154 labeled texts and 1,500 words, with 19.47% being action names and 15.45% being action arguments.
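Criteria: For evaluation, we first fed texts to each model to obtain selected words. We then compared the outputs to their corresponding ground truth and calculated $N_{\mathrm{truth}}$ (total ground truth words), $N_{\mathrm{right}}$ (total correctly selected words), and $N_{\mathrm{tagged}}$ (total selected words). After that we computed the metrics
$$\mathit{precision} = \frac{N_{\mathrm{right}}}{N_{\mathrm{tagged}}}, \qquad \mathit{recall} = \frac{N_{\mathrm{right}}}{N_{\mathrm{truth}}}, \qquad F1 = \frac{2 \times \mathit{precision} \times \mathit{recall}}{\mathit{precision} + \mathit{recall}}.$$
We used the F1 metric for all baselines. For reinforcement learning methods, we also computed the average cumulative reward
$$\mathit{AvgRwd} = \frac{R_{\mathrm{total}}}{N_{\mathrm{steps}}},$$
where $R_{\mathrm{total}}$ indicates the total cumulative reward of all testing texts and $N_{\mathrm{steps}}$ indicates the total number of steps of all testing texts.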
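The evaluation measures above can be computed from sets of word positions as in the following sketch. The function names and the example sets are hypothetical; the zero-division guards are an added convention, not something stated in the text.

```python
def count_words(selected, truth):
    """Return (total truth, total correctly selected, total selected) word counts."""
    return len(truth), len(selected & truth), len(selected)

def precision_recall_f1(total_truth, total_right, total_tagged):
    """Standard precision, recall and F1; each is 0.0 when its denominator is 0."""
    precision = total_right / total_tagged if total_tagged else 0.0
    recall = total_right / total_truth if total_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def average_cumulative_reward(total_reward, total_steps):
    """Total cumulative reward divided by the total number of steps."""
    return total_reward / total_steps

# Word positions, for illustration only.
truth = {0, 3, 5, 8}
selected = {0, 3, 4}
counts = count_words(selected, truth)
precision, recall, f1 = precision_recall_f1(*counts)
```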
We adopted the same TextCNN structure as Feng et al. (2018). It has four convolutional layers corresponding to bi-gram, tri-gram, four-gram and five-gram, each with 32 kernels whose size depends on $n$, where $n \in \{2, 3, 4, 5\}$ refers to the $n$-gram. Each convolutional layer is followed by a max-pooling layer that pools over the first dimension of the outputs of the $n$-gram convolutional layer. The max-pooling outputs are concatenated and fed to two fully connected layers whose sizes are determined by 128, which equals $4 \times 32$ (4 types of $n$-grams and 32 kernels for each $n$-gram), and 2, the size of the action space. The MLP module of FedRL is composed of two fully-connected layers. The standard deviation of Gaussian differential privacy was set to be 1. For all models, we adopted the Adam optimizer with learning rate 0.001 and ReLU activation (detailed settings can be found in the source code: https://github.com/FRL2019/FRL).
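The arithmetic behind the 128-dimensional feature vector (4 n-gram sizes times 32 kernels) can be checked with a small NumPy sketch of n-gram convolution followed by max pooling over word positions. This is an illustration under assumptions, not the authors' TextCNN: it assumes each kernel spans the full embedding width, applies ReLU before pooling, and uses random weights; the text length and embedding dimension are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ngram_features(text_matrix, kernels):
    """Convolve n-word windows with each kernel, apply ReLU, max-pool over positions."""
    n = kernels.shape[1]
    # windows has shape (num_positions, n, embed_dim)
    windows = sliding_window_view(text_matrix, (n, text_matrix.shape[1]))[:, 0]
    conv = np.maximum(0.0, np.einsum("tnd,knd->tk", windows, kernels))
    return conv.max(axis=0)          # one value per kernel

rng = np.random.default_rng(0)
num_words, embed_dim, num_kernels = 20, 50, 32   # illustrative sizes
text_matrix = rng.normal(size=(num_words, embed_dim))

features = np.concatenate([
    ngram_features(text_matrix, rng.normal(size=(num_kernels, n, embed_dim)))
    for n in (2, 3, 4, 5)
])
```

The concatenated `features` vector is what the fully connected layers would map to the two action values.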
We ran our FedRL and baselines five times to calculate the average F1 (as well as the average cumulative reward). The results are shown in Table 2. From Table 2 we can see that both FedRL-1 and FedRL-2 outperform both FCN-alpha and DQN-alpha in all three datasets under both the F1 and average reward metrics, which indicates agent $\alpha$ can learn better policies via federated learning with agent $\beta$ than by learning on its own. Comparing our FedRL-1 and FedRL-2 with DQN-full in both F1 and average reward, we can see that their performances are close to each other, suggesting our federated learning framework performs as well as the DQN model built directly from all training data of both agents $\alpha$ and $\beta$.
To see the impact of the model complexity, we varied the number of convolutional kernels from 2 to 32 with other parameters fixed. The results are shown in Figure 5. We can see that all models generally perform better when the number of convolutional kernels (filters) increases, and become stable after 8 kernels. We can also see that our FedRL models outperform DQN-alpha, FCN-alpha and FCN-full across the numbers of kernels, which indicates agent $\alpha$ can indeed learn better policies via federated learning with agent $\beta$. Both FedRL-1 and FedRL-2 are close to DQN-full across the numbers of filters, suggesting that our federated learning framework is effective even though the data of agents $\alpha$ and $\beta$ are not shared with each other.
6 Conclusion
We propose a novel reinforcement learning approach, namely Federated deep Reinforcement Learning (FedRL), which takes privacy into account and federatively builds a Q-network for each agent with the help of other agents. To protect the privacy of data and models, we exploit Gaussian differentials on the information that agents share with each other when updating their local models. We demonstrate that the proposed FedRL approach is effective in building high-quality policies for agents under the condition that training data are not shared between agents. In the future, it would be interesting to study federations with more members and with background knowledge represented by (probably incomplete) action models (Zhuo et al., 2014; Zhuo and Yang, 2014; Zhuo and Kambhampati, 2017).
References
- Abadi et al. (2016). Deep learning with differential privacy. In ACM SIGSAC, pp. 308–318.
- Agarwal et al. (2018). CpSGD: communication-efficient and differentially-private distributed SGD. In NeurIPS, pp. 7575–7586.
- Branavan et al. (2009). Reinforcement learning for mapping instructions to actions. In ACL.
- Co-Reyes et al. (2018). Self-consistent trajectory autoencoder: hierarchical reinforcement learning with trajectory embeddings. In ICML, pp. 1009–1018.
- Duchi et al. (2012). Privacy aware learning. In NIPS, pp. 1439–1447.
- Dwork and Roth (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3-4), pp. 211–407.
- Feng et al. (2018). Extracting action sequences from texts based on deep reinforcement learning. In IJCAI, pp. 4064–4070.
- Foerster et al. (2016). Learning to communicate with deep multi-agent reinforcement learning. In NIPS, pp. 2137–2145.
- Furuta et al. (2019). Fully convolutional network with multi-step reinforcement learning for image processing. In AAAI, pp. 3598–3605.
- Konecný et al. (2015). Federated optimization: distributed optimization beyond the datacenter. CoRR abs/1511.03575.
- Konecný et al. (2016). Federated learning: strategies for improving communication efficiency. CoRR abs/1610.05492.
- Leibo et al. (2017). Multi-agent reinforcement learning in sequential social dilemmas. In AAMAS, pp. 464–473.
- Lowe et al. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, pp. 6382–6393.
- McMahan et al. (2017). Communication-efficient learning of deep networks from decentralized data. In AISTATS, pp. 1273–1282.
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- Omidshafiei et al. (2017). Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In ICML, pp. 2681–2690.
- Rashid et al. (2018). QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, pp. 4292–4301.
- Smith et al. (2017). Federated multi-task learning. In NIPS, pp. 4427–4437.
- Sutton and Barto (1998). Reinforcement learning - an introduction. Adaptive computation and machine learning, MIT Press.
- Tamar et al. (2016). Value iteration networks. In NIPS, pp. 2146–2154.
- Tampuu et al. (2015). Multiagent cooperation and competition with deep reinforcement learning. CoRR abs/1511.08779.
- Taylor and Stone (2009). Transfer learning for reinforcement learning domains: a survey. J. Mach. Learn. Res. 10, pp. 1633–1685.
- Tirinzoni et al. (2018). Transfer of value functions via variational methods. In NIPS, pp. 6182–6192.
- Yang et al. (2019). Federated machine learning: concept and applications. ACM TIST 10 (2), pp. 12:1–12:19.
- Zhuo and Kambhampati (2017). Model-lite planning: case-based vs. model-based approaches. Artif. Intell. 246, pp. 1–21.
- Zhuo et al. (2014). Learning hierarchical task network domains from partially observed plan traces. Artif. Intell. 212, pp. 134–157.
- Zhuo and Yang (2014). Action-model acquisition for planning via transfer learning. Artif. Intell. 212, pp. 80–103.