Driving Reinforcement Learning with Models

  • 2019-11-11 17:14:56
  • Pietro Ferraro, Meghana Rathi, Giovanni Russo


Over the years, Reinforcement Learning (RL) has established itself as a convenient paradigm to learn optimal policies from data. However, most RL algorithms achieve optimal policies by exploring all the possible actions and this, in real-world scenarios, is often infeasible or impractical due to e.g. safety constraints. Motivated by this, in this paper we propose to augment RL with Model Predictive Control (MPC), a popular model-based control algorithm that makes it possible to optimally control a system while satisfying a set of constraints. The result is an algorithm, the MPC-augmented RL algorithm (MPCaRL), that makes use of MPC both to drive how RL explores the actions and to modify the corresponding rewards. We demonstrate the effectiveness of MPCaRL by letting it play against the Atari game Pong. The results obtained highlight the ability of the algorithm to learn general tasks with essentially no training.



Pietro Ferraro
Dyson School of Design Engineering
Imperial College London
South Kensington, London
[email protected]

Meghana Rathi
School of Electrical & Electronic Eng.
University College Dublin
Belfield, Dublin, 4
[email protected]

Giovanni Russo
School of Electrical & Electronic Eng.
University College Dublin
Belfield, Dublin, 4
[email protected]



Preprint. Under review.

1 Introduction

Driven by its recent impressive results, see e.g. [2],[3], reinforcement learning (RL) has become a popular paradigm to make agents learn optimal policies [1]. On an intuitive level, RL finds the optimal policy by exploring the state space and by learning which actions are best in each scenario, on the basis of a function, called the reward function, which rewards or punishes the agent for its behaviour. Although it can be proven that in the long term an RL algorithm reaches optimal control, the trade-off is that during the learning process the agent might end up in unsafe situations [4]. Especially in applications where agents need to physically interact with their environment, this might be infeasible due to safety requirements. Take autonomous driving as an example: a key requirement would be that, while learning a policy for certain critical functionalities, safety needs to be guaranteed at all times (e.g. to avoid crashes). On this note, we point out that it has recently been found that even a dataset with 30 million driving examples is inadequate to train policies that are safe in any operational condition: no dataset exists that can exhaustively cover all situations that might arise when driving [5].

Motivated by this, we propose in this paper an algorithm that complements the capabilities of RL with those of a well-known and established model-based control method, Model Predictive Control (MPC). In so doing, we present the MPC-augmented RL algorithm (MPCaRL), which: (i) for certain critical functionalities, for which a dynamical model can be identified, computes a control action optimizing a given cost function; (ii) for non-critical functionalities, uses RL to compute a policy from data; (iii) makes use, when possible, of MPC both to guide the state-space exploration and to fine-tune the rewards obtained by the RL.

1.1 Related Work

This work is related to a number of threads of recent work in RL. We now briefly review some related works on physics simulation, model-based and safe RL.

Physics simulation. The idea of using models and simulation environments to develop intelligent reinforcement learning agents has recently been attracting a lot of attention from academia. For example, in [6] it is shown how a physical simulator can be embedded in a deep network, enabling agents to both learn parameters of the environment and improve their control performance. Essentially, this is done by simulating rigid body dynamics via a linear complementarity problem (LCP) technique [7][8]. LCP techniques are also used within other simulation environments such as MuJoCo [9], Bullet [10], and DART [11]. As remarked in [6], in these environments gradients are computed through finite differences and this, in turn, implies evaluating the results of the simulations multiple times. In order to reduce the computational burden of this approach, LCP techniques that involve analytic differentiation have also been developed. See, for example, [12] and, more recently, [13]. Furthermore, a complementary body of literature investigates the possibility of integrating into networks mechanisms inspired by the intuitive human ability to understand physics. See e.g. [14][15][16], which leverage the ideas of [17][18][19].

Model-based RL. Although model-free methods have achieved considerable successes in recent years (see for example [2]), many works suggest that a model-based approach can potentially achieve better performance. Among those, the interested reader can refer to [4], [Atkeson97acomparison], [kurutach2018modelensemble]. Model-free approaches (see e.g. [2]), in fact, tend to provide optimal policies only in the long term, whereas employing prior knowledge of the domain, such as a mathematical model, can considerably decrease the learning time of a model-based RL algorithm with respect to a model-free one [4].
Based on these premises, research in model-based RL is a very active area and it is focused on two main settings. The first one makes use of neural networks and a suitable loss function to simulate the dynamics of interest [20][21], whereas the second makes use of more classical mathematical models and closely resembles classic system identification [23]. Recently, an orthogonal stream of literature (see e.g. [HERZALLAH2015199, doi:10.1002/acs.742, KARNY19961719]) has also emerged, which focuses on the design of optimal decision makers based on purely probabilistic knowledge (in the form of a probability density function) of the system of interest and on dynamic programming [22].

Safe RL. The design of safe RL algorithms is a key research topic and different definitions of safety have been proposed in the literature (see e.g. [24], [25] and references therein). As an example, in [DBLP:journals/corr/abs-1109-2147] the authors relate safety to a set of error states associated with dangerous or risky situations, while in [26] risk aversion is specified in the reward. Other works define safety by using an accurate probabilistic model of the system [27]-[30], or define a notion of safety for closed-loop probabilistic systems [31]. A complementary approach is that of model-based RL, where safety is formalized via state-space constraints. Examples of this approach include [2, 4], where Lyapunov functions are used to show forward invariance of the safety set defined by the constraints. Finally, we note that other approaches include [Garcia:2012:SES:2444851.2444864], where a priori knowledge of the system is used in order to craft safe backup policies, and [safe], in which the authors consider uncertain systems and enforce probabilistic guarantees on their performance.

1.2 Contributions

Our key contributions can be summarized as follows. We introduce an algorithm, the MPCaRL, that combines RL and MPC in a novel way so that they can augment each other’s strengths. The MPCaRL is inspired by the fact that, often, complex tasks can be fulfilled by combining a set of functionalities and, in turn, each functionality can be either critical or non-critical. For example, in automated driving, critical functionalities are those directly associated with the prevention of crashes. At each iteration, MPCaRL identifies a mathematical model for the critical functionalities. Once the model is identified, a control action is computed via MPC so as to optimize a given cost function. For all the other, non-critical, functionalities the algorithm makes use of an RL algorithm (in particular, we will use Q-Learning) to compute a policy from data. Moreover, MPC and RL interact within the MPCaRL and, in particular, MPC both drives the state-space exploration and tunes the RL rewards in order to speed up the learning process. Finally, in order to illustrate the effectiveness of the MPCaRL, we show its performance at playing the Atari game Pong. In this context, the critical functionality is the defense (i.e., bouncing the ball back when it comes towards the player) which, if not performed correctly, leads to the loss of the round.

Interestingly, the experiments highlight the ability of the algorithm to learn tasks with essentially no training.

2 Background

We start by outlining the two building blocks composing MPCaRL, i.e. the Model Predictive Control algorithm and Q-Learning. We refer the reader to [GARCIA1989335] and [1] for a detailed description of MPC and Q-Learning, respectively.

Model Predictive Control. Model Predictive Control (MPC) is a model-based control technique. Essentially, at each time-step, the algorithm computes a control action by solving an optimization problem having as constraint the (discrete-time) dynamics of the system being controlled. In addition to the dynamics, other system requirements (e.g. safety or feasibility requirements) can also be formalized as constraints of the optimization problem [GARCIA1989335]. Let $x_k \in \mathbb{R}^n$ be the state variable of the system at time $k$, $u_k \in \mathbb{R}^m$ be its control input and $\eta_k$ be some noise. In this paper we consider discrete-time dynamical systems of the form $x_{k+1} = A_k x_k + D_k + C_k u_k + \eta_k$ with initial condition $x_{\text{initial}}$ and where the time-varying matrices have appropriate dimensions. For this system, formally the MPC algorithm generates the control input $u_k$ by solving the problem

$$\underset{x_{0:T} \in \mathcal{X},\; u_{0:T} \in \mathcal{A}}{\arg\min} \; \sum_{t=0}^{T} J_t(x_t, u_t) \quad \text{subject to} \quad x_{t+1} = A_t x_t + D_t + C_t u_t, \quad x_0 = x_{\text{initial}}, \tag{1}$$

with $x_{0:T}$ ($u_{0:T}$) denoting the sequence $\{x_0, \ldots, x_T\}$ (resp. $\{u_0, \ldots, u_T\}$) and where: (i) $\mathcal{A}$ and $\mathcal{X}$ are sets modelling the constraints for the valid control actions and states; (ii) $\sum_{t=0}^{T} J_t(x_t, u_t)$ is the cost function being optimized. See [doi:10.1080/00207170600593901], [1624479] for more details on this subject.
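To make the receding-horizon idea in (1) concrete, the following is a minimal Python sketch under simplifying assumptions that are ours and not part of the paper: the matrices are time-invariant, the input is scalar and takes values in a small finite set, the cost is accumulated on the states reached after each input, and the problem is solved by brute-force enumeration rather than by a dedicated solver. The function name `mpc_first_action` and the toy system are hypothetical.

```python
from itertools import product

import numpy as np


def mpc_first_action(x0, A, D, C, actions, horizon, stage_cost):
    """Return the first input of the cheapest input sequence of length `horizon`.

    The dynamics x_{t+1} = A x_t + D + C u_t are taken as time-invariant with a
    scalar input (a simplifying assumption), and every candidate sequence from
    the finite action set is enumerated.
    """
    best_cost, best_first = float("inf"), None
    for seq in product(actions, repeat=horizon):
        x, cost = np.asarray(x0, dtype=float), 0.0
        for u in seq:
            x = A @ x + D + C * u      # propagate the model constraint in (1)
            cost += stage_cost(x, u)   # accumulate the stage cost J_t
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


# Toy example: a one-dimensional integrator that must reach the value 3.
A, D, C = np.eye(1), np.zeros(1), np.ones(1)
cost = lambda x, u: float((x[0] - 3.0) ** 2)
first = mpc_first_action([0.0], A, D, C, actions=(-1, 0, 1), horizon=3, stage_cost=cost)
```

Only the first input of the optimal sequence is applied; at the next time-step the problem would be solved again from the newly measured state.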

Q-Learning and Markov Decision Processes. Q-Learning (Q-L) is a model-free RL algorithm, whose aim is to find an optimal policy with respect to a finite Markov Decision Process (MDP). We adopt the standard formalism for MDPs. An MDP [1] is a discrete stochastic model defined by a tuple $\langle \mathcal{S}, \mathcal{A}, P, \gamma, \mathcal{R} \rangle$, where: (i) $\mathcal{S}$ is the set of states $s \in \mathcal{S}$; (ii) $\mathcal{A}$ is the set of actions $a \in \mathcal{A}$; (iii) $P(s' \mid s, a)$ is the probability of transitioning from state $s$ to state $s'$ under action $a$; (iv) $\gamma \in [0,1)$ is the discount factor; (v) $\mathcal{R}(s,a)$ is the reward of choosing the action $a$ in the state $s$. Upon performing an action, the agent receives the reward $\mathcal{R}(s_t, a_t)$. A policy, $\pi$, specifies (for each state) the action that the agent will take, and the goal of the agent is to find the policy that maximizes the expected discounted total reward. The value $Q^{\pi}(s,a)$, named Q-function, corresponding to the pair $(s,a)$, represents the estimated expected future reward that can be obtained from $(s,a)$ when using policy $\pi$. The objective of Q-learning is to estimate the Q-function for the optimal policy $\pi^*$, $Q^{\pi^*}(s,a)$. Define the estimate as $Q(s,a)$. The Q-learning algorithm then works as follows: after setting the initial values for the Q-function, at each time step, observe the current state $s_t$ and select the action $a_t$ according to the policy $\pi(s_t)$. After receiving the reward $\mathcal{R}(s_t, a_t)$, update the corresponding value of the Q-function as follows:

$$Q(s_t, a_t) \leftarrow (1-\alpha)\, Q(s_t, a_t) + \alpha \left[ \mathcal{R}(s_t, a_t) + \gamma \max_{a} Q(s_{t+1}, a) \right], \tag{2}$$

where $\alpha \in [0,1]$ is called the learning rate. Notice that the Q-learning algorithm does not specify which policy $\pi(\cdot)$ should be considered. In theory, to converge the agent should try every possible action for every possible state many times. For practical reasons, a popular choice for the Q-learning policy is the $\epsilon$-greedy policy, which selects its highest valued (greedy) action, $\pi_{\epsilon}(s_t) = \arg\max_{a_t} Q(s_t, a_t)$, with probability $1 - \epsilon(k-1)/k$ and selects each of the other $k-1$ actions with probability $\epsilon/k$, where $k$ here denotes the number of available actions [wunder2010classes]. However, in this paper, due to the nature of the proposed algorithm and the way rewards are provided, we implemented a greedy policy, with $\epsilon = 0$. The interested reader can refer to [peng1994incremental] and [wunder2010classes] for further information on this topic.
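As an illustration of the update rule (2) with a greedy policy, a minimal tabular sketch in Python is given below. The dictionary-based Q-table, the function names and the string-valued states are assumptions made for the example; the default values of `alpha` and `gamma` are those reported later in Section 4.4.

```python
from collections import defaultdict

ACTIONS = (-1, 0, 1)  # the three paddle actions used later in the paper


def q_update(Q, s, a, reward, s_next, alpha=0.7, gamma=0.7):
    """Apply the Q-learning update (2) to the pair (s, a) and return the new value."""
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + gamma * best_next)
    return Q[(s, a)]


def greedy_action(Q, s):
    """Greedy policy (epsilon = 0): the action with the highest current estimate."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])


Q = defaultdict(float)                        # Q-table initialized to 0
v1 = q_update(Q, "s0", 1, 1.0, "s1")          # 0.7 * (1 + 0.7 * 0)
v2 = q_update(Q, "s1", 0, 0.0, "s0")          # 0.7 * (0 + 0.7 * v1)
```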

3 The MPCaRL Algorithm

We are now ready to introduce the MPCaRL algorithm, the key steps of which are summarized as pseudo-code in Algorithm 3. The main intuition behind this algorithm is that, in many real-world systems, tasks can be broken down into a set of critical and non-critical functionalities. Often, critical functionalities are associated with safety constraints and, for these functionalities, it often makes sense for the designer to build a mathematical model. Given this intuition, the MPCaRL aims at combining the strengths of MPC and Q-L. Indeed: (i) within MPCaRL, MPC can directly control the agent in performing critical functionalities and, at the same time, it drives the state exploration of Q-L. This can be used e.g. to prevent the agent from entering states that might be unsafe; (ii) on the other hand, Q-L handles non-critical functionalities, for which e.g. no mathematical model is available and hence for which a classic control algorithm could not be used.

The MPCaRL takes as input the following design parameters: (i) the set of allowed actions, $\mathcal{A}$; (ii) the time horizon and cost function used in (1); (iii) a recording window used to identify the dynamics in (1); (iv) a positive and a non-positive reward, $\bar{r}$ and $\underline{r}$, used by MPC to drive the exploration of Q-L; (v) an initial matrix $Q(s,a)$. Given this set-up, the following steps are performed, in accordance with Algorithm 3:

Step 1:

at each time-step, MPCaRL determines whether the task functionality is critical or non-critical. As we will see in Section 4, the critical functionality in the Pong game is associated with the agent’s defense strategy;

Step 2a:

if the functionality is critical, then the agent’s action is computed via MPC. Even if the action applied by the agent is given by MPC, the action that would have been obtained via Q-L is also computed. If the actions from MPC and Q-L are the same, then the $Q(s,a)$ matrix is updated by using the positive reward $\bar{r}$. On the other hand, if the actions from MPC and Q-L differ from one another, the $Q(s,a)$ matrix is updated with the non-positive reward $\underline{r}$;

Step 2b:

if, instead, the functionality is non-critical (or a model cannot be identified), then the agent’s action is generated by Q-L;

Step 3:

all relevant quantities are saved within the main algorithm loop.


MPCaRL Algorithm

  Inputs:
    Allowed actions, $\mathcal{A}$
    Time horizon, $T$, and cost function $\sum_{t=0}^{T} J_t(x_t, u_t)$
    Recording window $T_w$
    Rewards $\bar{r}$ and $\underline{r}$
    Initial matrix $Q(s,a)$
  Main loop:
    for $k = 0, 1, \ldots$ do
      Determine functionality (and whether model in (1) can be identified)
      if functionality is critical then
        Get $s_k$ and $x_k$
        if $k \geq 1$ then
          if $a_{k-1} = u_{k-1}$ then
            $Q(s_{k-1}, u_{k-1}) \leftarrow (1-\alpha) Q(s_{k-1}, u_{k-1}) + \alpha [\bar{r} + \gamma \max_a \{Q(s_k, a)\}]$
          else
            $Q(s_{k-1}, u_{k-1}) \leftarrow (1-\alpha) Q(s_{k-1}, u_{k-1}) + \alpha [\underline{r} + \gamma \max_a \{Q(s_k, a)\}]$
          end if
        end if
        if $k - T_w > 0$ then
          Given $x_{k-T_w:k}$ identify the model in (1) and compute $u_k$ with $x_{\text{init}} = x_k$
        else
          Given $x_{0:k}$ identify the model in (1) and compute $u_k$ with $x_{\text{init}} = x_k$
        end if
        Compute $a_k$ using Q-L (Section 2)
        Apply $u_k$
        Save $x_k$, $s_k$, $a_k$, $u_k$
      else
        Apply $a_k$ computed via Q-L
        Save $s_k$, $a_k$
      end if
      $k \leftarrow k+1$
    end for
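The main loop of the algorithm above can be sketched in Python as a single step function. This is only an illustrative reading of the pseudo-code, not the authors' implementation: the MPC action is assumed to be computed elsewhere and passed in as `mpc_action`, the record of the previous critical step is passed as the hypothetical tuple `prev`, and `r_pos` and `r_neg` stand for the positive and non-positive rewards.

```python
from collections import defaultdict

ACTIONS = (-1, 0, 1)


def greedy(Q, s):
    """Action that Q-L would select in state s (greedy policy)."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])


def mpcarl_step(Q, s, critical, mpc_action, prev=None,
                r_pos=0.1, r_neg=0.0, alpha=0.7, gamma=0.7):
    """One pass of the main loop.

    `prev` is (s_prev, a_prev, u_prev) saved at the previous critical step, or None.
    Returns the applied action and the record to pass to the next step.
    """
    if not critical:
        return greedy(Q, s), None          # non-critical: Q-L acts
    if prev is not None:
        s_prev, a_prev, u_prev = prev
        # Reward agreement between the Q-L action and the MPC action.
        r = r_pos if a_prev == u_prev else r_neg
        target = r + gamma * max(Q[(s, b)] for b in ACTIONS)
        Q[(s_prev, u_prev)] = (1 - alpha) * Q[(s_prev, u_prev)] + alpha * target
    a = greedy(Q, s)                       # computed but not applied
    return mpc_action, (s, a, mpc_action)  # critical: MPC acts


Q = defaultdict(float)
u0, rec0 = mpcarl_step(Q, "s0", True, mpc_action=1)
u1, rec1 = mpcarl_step(Q, "s1", True, mpc_action=0, prev=("s0", 1, 1))
```

In the second call the previous Q-L and MPC actions agree, so the entry for the previous state and MPC action is pulled towards the positive reward.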

4 Simulations

We now illustrate the effectiveness of MPCaRL by making it play against the Atari game Pong. Pong is a two-player game where each player moves a paddle in order to bounce a ball to the opponent. A player scores a point when the opponent fails to bounce the ball back, and the game ends when a player scores 21 points. In what follows, we give a thorough description of how Algorithm 3 has been implemented in order to allow MPCaRL to play against Pong.

4.1 The environment and data gathering

The environment of the game was set up using the OpenAI gym library in Python. In particular, we used the PongDeterministic-v4 (with 4 frame skips) configuration of the environment, which is the one used to assess Deep Q-Networks, see e.g. [EndtoEnd] and [mnih-atari-2013]. The configuration used has, as observation space, Box(210,160,3) (see http://gym.openai.com/docs/ for documentation on the environment observation space). Within our simulations, we first removed the part of the images that displays the points earned by the players, which yielded an observation space of Box(160,160,3), and we then down-sampled the resulting image to get a reduced observation space of Box(80,80,3), so that each frame consists of a matrix of 80×80 pixels. A coordinate system is defined within the environment, with the origin of the x and y axes being in the bottom-right corner of the image.
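The cropping and down-sampling described above can be sketched with NumPy as follows. The exact rows that were removed are not stated in the text, so the values `top=34` and `bottom=194` are an assumption chosen only so that 160 rows remain; down-sampling by keeping every second pixel is likewise an assumed choice, and `preprocess` is a hypothetical name.

```python
import numpy as np


def preprocess(frame, top=34, bottom=194):
    """Crop a (210, 160, 3) frame to (160, 160, 3), then down-sample to (80, 80, 3)."""
    cropped = frame[top:bottom]   # drop the score area (assumed row range)
    return cropped[::2, ::2]      # keep every second row and column


frame = np.zeros((210, 160, 3), dtype=np.uint8)  # stand-in for an environment frame
small = preprocess(frame)
```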

Given the above observation space, both the position of the ball and the vertical position of the paddle moved by MPCaRL were extracted from each frame (see also Figure 1, left panel). In particular:

  • At the beginning of the game, the centroid of the ball is found by iterating through the frame to find the location of all pixels with a value of 236 (this corresponds to the white color, i.e. the color of the ball in Pong). Then, once the ball is found the first time, the frame is only scanned in a window around the position of the ball previously found (namely, we used a window of 80×12 pixels, see Figure 1, right panel);

  • Similarly, the paddle’s centroid position is found by scanning the frame for pixels having value 92 (this corresponds to the green color, i.e. the color of the MPCaRL paddle).
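The two extraction steps above can be sketched as a single helper. The sketch assumes that the pixel values 236 and 92 are read from one channel of the reduced frame, passed in as a 2-D array, and it returns positions as array indices (row, column) rather than in the coordinate system of the environment; the function name `centroid` and the window format are hypothetical.

```python
import numpy as np

BALL_VALUE, PADDLE_VALUE = 236, 92  # pixel values given in the text


def centroid(channel, value, window=None):
    """Mean (row, col) of the pixels equal to `value`, or None if there are none.

    `window` = (r0, r1, c0, c1) optionally restricts the scan to a sub-region,
    e.g. a window around the previously found ball position.
    """
    r0, r1, c0, c1 = window if window else (0, channel.shape[0], 0, channel.shape[1])
    rows, cols = np.nonzero(channel[r0:r1, c0:c1] == value)
    if rows.size == 0:
        return None
    return (r0 + rows.mean(), c0 + cols.mean())


img = np.zeros((80, 80), dtype=np.uint8)
img[10:12, 20] = BALL_VALUE          # a 2-pixel ball
img[30:38, 70:72] = PADDLE_VALUE     # an 8x2 paddle
ball = centroid(img, BALL_VALUE)
paddle = centroid(img, PADDLE_VALUE)
ball_again = centroid(img, BALL_VALUE, window=(0, 80, 14, 26))  # 80x12 window
```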

Figure 1: Left panel: a typical frame from the reduced observation space used in the experiments and consisting of 80×80 pixels. The paddle moved by MPCaRL is the green one. Right panel: a zoom illustrating the 80×12 pixels window used to extract the new position of the ball, given its previous position.

4.2 Definition of the task and its functionalities

The agent’s task is that of winning the game, which essentially consists of two phases: (i) a defense phase, where the agent needs to move the paddle to bounce the ball back in order to prevent the opponent from scoring a point; (ii) an attack phase, where the agent needs instead to properly bounce the ball in order to score a point. When implementing MPCaRL to play Pong, we associated defense with critical functionalities, while the attack functionalities were defined as non-critical. Explicit conditions identifying critical and non-critical functionalities are given in the next sub-sections.

4.3 Implementing MPC

We describe the MPC implementation used within MPCaRL to play Pong by first introducing the mathematical model serving as the constraint in (1). This model describes both the dynamics of the ball and of the paddle moved by MPCaRL.

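Ball dynamics. We denote by $x_t^{(b)}$ and $y_t^{(b)}$ the $x$ and $y$ coordinates of the centroid of the ball at time $t$. The mathematical model describing the dynamics of the ball is then:

$$x_{t+1}^{(b)} = x_t^{(b)} + v_{t,x}, \qquad y_{t+1}^{(b)} = y_t^{(b)} + v_{t,y}, \tag{3}$$

where $x_{t+1}^{(b)}$ and $y_{t+1}^{(b)}$ are the predicted next coordinates at time $t+1$ and where the speeds at time $t$, i.e. $v_{t,x}$ and $v_{t,y}$, are computed from the positions extracted from the current and the two previous frames, i.e. $x_{t-2}^{(b)}, x_{t-1}^{(b)}, x_t^{(b)}$ and $y_{t-2}^{(b)}, y_{t-1}^{(b)}, y_t^{(b)}$ (that is, $T_w = 2$ in Algorithm 3). In particular, this is done by first computing the quantities $v_{x1}, v_{x2}$ and $v_{y1}, v_{y2}$ as follows:

$$\begin{bmatrix} v_{x1} \\ v_{y1} \end{bmatrix} = \begin{bmatrix} x_t^{(b)} \\ y_t^{(b)} \end{bmatrix} - \begin{bmatrix} x_{t-1}^{(b)} \\ y_{t-1}^{(b)} \end{bmatrix}, \qquad \begin{bmatrix} v_{x2} \\ v_{y2} \end{bmatrix} = \begin{bmatrix} x_{t-1}^{(b)} \\ y_{t-1}^{(b)} \end{bmatrix} - \begin{bmatrix} x_{t-2}^{(b)} \\ y_{t-2}^{(b)} \end{bmatrix}. \tag{4}$$

Consider now the speed along the $x$ axis. If there is no impact between the ball and the paddle, we set $v_{t,x} = 0.5(v_{x1} + v_{x2}) + \mathrm{var}([v_{x1}, v_{x2}]) = \bar{v}_{t,x} + \eta_{t,x}$ (where $\mathrm{var}(a)$ denotes the variance of the generic vector $a$). Similarly, along the $y$ axis, we have $v_{t,y} = 0.5(v_{y1} + v_{y2}) + \mathrm{var}([v_{y1}, v_{y2}]) = \bar{v}_{t,y} + \eta_{t,y}$ if there has been no impact and $v_{t,y} = v_{y1}$ if there has been an impact of the ball with one of the walls.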
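A minimal sketch of the velocity estimate in (4) and of the one-step prediction in (3) is given below. It follows the formulas literally, adding the variance of the two finite differences to their mean; the use of the population variance (NumPy's default) is an assumption, since the text does not specify which variance is meant, and the function names are hypothetical.

```python
import numpy as np


def estimate_velocity(p_t, p_t1, p_t2, wall_impact=False):
    """Ball velocity (v_x, v_y) from the positions at times t, t-1 and t-2."""
    p_t, p_t1, p_t2 = (np.asarray(p, dtype=float) for p in (p_t, p_t1, p_t2))
    v1, v2 = p_t - p_t1, p_t1 - p_t2              # finite differences, eq. (4)
    diffs = np.stack([v1, v2])
    v = diffs.mean(axis=0) + diffs.var(axis=0)    # mean plus variance per axis
    if wall_impact:
        v[1] = v1[1]                              # after a wall impact use v_y1 only
    return v


def predict_ball(p_t, v):
    """One-step prediction of the ball position, eq. (3)."""
    return np.asarray(p_t, dtype=float) + v


v = estimate_velocity((6, 7), (8, 6), (10, 5))
nxt = predict_ball((6, 7), v)
v_wall = estimate_velocity((6, 1), (8, 0), (10, 1), wall_impact=True)
```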

Paddle dynamics. In the gym environment used for the experiments, the only control action that can be applied by MPCaRL at time $t$, i.e. $u_t$, is that of moving its paddle. In particular, the agent can either move the paddle up ($u_t = 1$) or down ($u_t = -1$), or simply leave the paddle where it is ($u_t = 0$). It follows that, given the vertical position of the centroid of the paddle at time $t$, say $y_t^{(p)}$, its dynamics can be modeled by

$$y_{t+1}^{(p)} = y_t^{(p)} + u_t. \tag{5}$$

The MPC model and the cost function. Combining the models in (3)-(5) yields the dynamical system serving as constraint in (1). Note that the resulting model can be formally written as the system in (1) once the state $x_t$ is defined as $x_t = [x_t^{(b)}, y_t^{(b)}, y_t^{(p)}]^T$. Finally, in the implementation of the MPC algorithm, we used as cost function $\sum_{t=0}^{T} J_t = \left| y_{t+T}^{(b)} - y_{t+T}^{(p)} \right|^2$. That is, with this choice of cost function the algorithm seeks to regulate the paddle’s position so that the distance between the predicted position of the ball at time $t+T$ and the position of the MPCaRL paddle at time $t+T$ is minimised. Note that the time horizon, $T$, used in the above cost function is obtained by propagating the ball model (3) in order to estimate after how many iterates the ball will hit the border protected by the MPCaRL paddle.
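For the simple paddle dynamics (5), the horizon estimate and the resulting action can be sketched as follows. This is a shortcut and not a generic MPC solver: with unit steps the paddle can shift by any integer between $-T$ and $T$ over the horizon, so moving in the direction of the best achievable shift is one minimiser of the terminal cost. The border coordinate `x_border`, the function names, and the omission of wall bounces when predicting the ball are assumptions made for brevity.

```python
import math


def steps_to_border(x_ball, vx, x_border):
    """Iterates of the ball model needed to reach x_border, or None if not approaching."""
    if vx == 0:
        return None
    steps = (x_border - x_ball) / vx
    return max(1, math.ceil(steps)) if steps > 0 else None


def mpc_paddle_action(y_paddle, y_ball_pred, horizon):
    """First input of a sequence minimising |y_ball_pred - y_paddle(t+T)|^2."""
    # With y_{t+1} = y_t + u_t and u_t in {-1, 0, 1}, the reachable shifts are the
    # integers in [-T, T]; pick the one closest to the gap and move that way.
    shift = max(-horizon, min(horizon, round(y_ball_pred - y_paddle)))
    return (shift > 0) - (shift < 0)


T = steps_to_border(x_ball=60, vx=-4, x_border=8)   # horizon in iterates
y_pred = 30 + T * 1                                  # ball model with v_y = 1, no bounces
u = mpc_paddle_action(y_paddle=40, y_ball_pred=y_pred, horizon=T)
```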

4.4 Implementing Q-L

We implemented the Q-L algorithm outlined in Section 2. The set of actions available to the agent was $a_t \in \{-1, 0, +1\}$, while the state at time $t$, $s_t$, was defined as the 5-dimensional vector containing: (i) the coordinates of the position of the ball at time $t$ (in pixels); (ii) the velocity of the ball (rounded to the closest integer) along the $x$ and $y$ axes; (iii) the position of the paddle moved by MPCaRL. The reward was obtained from the game environment: our agent was given a reward of +1 each time the opponent failed to hit the ball, and -1 each time our agent failed to hit the ball. Finally, the values of the Q-table were initialized to 0. In the experiments, the state-action pair was updated whenever a point was scored and a greedy policy was used to select the action. Also, in the experiments we set both $\alpha$ and $\gamma$ in (2) to 0.7. Moreover, following Algorithm 3, the Q-function was also updated when critical functionalities were performed by the agent. In particular, within the experiments we assigned: (i) a positive reward, $\bar{r}$, whenever the actions from Q-L and MPC were the same; (ii) a non-positive reward, $\underline{r}$, whenever the actions were not the same. In this way, within MPCaRL, the Q-L component is driven to learn to take actions similar to those of MPC.

Finally, we describe the conditions that we used in the experiments to allow MPCaRL to discriminate between critical and non-critical functionalities. Intuitively, critical functionalities were associated with the defense phase of the game. Therefore, all the functionalities performed when the ball was coming towards the MPCaRL paddle and, at the same time, the future vertical position of the ball (predicted via the model) was far from the actual position of the MPCaRL paddle were defined as critical. That is, the states that are defined as critical are the ones that satisfy the conditions $v_{t,x}^{(b)} < 0$ and $\left| y_{t+T}^{(b)} - y_t^{(p)} \right| > H_y$ (whereas all the others are considered non-critical). Note that: (i) $v_t^{(b)}$ is estimated from the game frames as described above (in the environment, negative velocities along the $x$ axis mean that the ball is coming towards the green paddle); (ii) the computation of $y_{t+T}^{(b)}$ relies on simulating the model describing the ball’s dynamics (3); (iii) $H_y$ is a threshold and is a design parameter.
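The test discriminating critical from non-critical states can be written in a couple of lines. In this sketch the distance between the predicted ball position and the paddle is taken in absolute value, which is our reading of "far from", and the default threshold is the value of the design parameter used in the experiments of Section 4.5; the function name is hypothetical.

```python
def is_critical(vx_ball, y_ball_pred, y_paddle, h_y=5):
    """True when the ball approaches the paddle and is predicted to land far from it."""
    return vx_ball < 0 and abs(y_ball_pred - y_paddle) > h_y


approaching_far = is_critical(-2, 50, 40)    # ball incoming, 10 pixels away
moving_away = is_critical(2, 50, 40)         # ball heading to the opponent
approaching_near = is_critical(-2, 43, 40)   # ball incoming but within the threshold
```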

4.5 Results

We are now ready to present the results obtained by letting MPCaRL play Pong (larger versions of the figures of this Section are provided in the Supplementary Materials). The results are quantified by plotting the game reward as a function of the number of episodes played by MPCaRL. An episode consists of as many rounds of Pong as it takes for one of the players to reach 21 points, while the game reward is defined to be the difference between the points scored by MPCaRL within the episode and the points scored by the opponent within the episode. Essentially, a negative game reward means that MPCaRL lost that episode; the lowest possible value that can be attained is -21 and this happens when MPCaRL is not able to score any point. Vice versa, a positive game reward means that the agent was able to beat the opponent; the maximum value that can be obtained is +21, when the opponent does not score any point.

As a first experiment, we implemented an agent that would use only MPC or only the untrained Q-L algorithm (see e.g. the implementation in [andrej]). We let this agent play Pong for 50 episodes and, as the left panel in Figure 2 shows, the Q-L agent, as expected, did not obtain good rewards in the first 50 episodes, consistently losing games with a difference in the scores of about 20 points. Instead, when the agent used the MPC described in Section 4, better performance was obtained. This performance, however, was not comparable with that obtained via a trained Q-L agent; the reason is that, while MPC allows the agent to defend, it does not allow for the learning of an attack strategy to consistently obtain points. Using MPCaRL made it possible to overcome the shortcomings of the MPC and Q-L agents. In particular, as shown in the right panel of Figure 2, when the MPCaRL agent played against Pong, it was able to consistently beat the game (note that the agent never lost a game) while quickly learning an attack strategy to obtain high game rewards. Indeed, note how the agent is able to consistently obtain rewards of about 20 within 50 episodes.

Figure 2: Left panel: episode rewards as a function of the number of episodes played when either MPC or Q-L is used. Right panel: episode rewards as a function of the number of episodes played by MPCaRL. Parameters of MPCaRL were set as follows: $\bar{r} = 0.1$, $\underline{r} = 0$, $H_y = 5$.

In order to further investigate the performance of MPCaRL, we also evaluated its sensitivity to the parameters $\bar{r}$, $\underline{r}$ and $H_y$. We first consider $H_y$, which directly affects the definition of critical functionalities. As Figure 3 (left panel) illustrates, MPCaRL is still able to consistently obtain high rewards when $H_y$ is perturbed, i.e. $H_y \in \{4, 5, 6\}$. Notice that when $H_y = 4$, while MPCaRL is still able to beat the game, it also experiences some drops in the rewards. This phenomenon, which will be further investigated in future studies, might be due to the fact that restricting the space of movements of the Q-L component of the algorithm also restricts its learning capabilities. Instead, when the manoeuvre space of the Q-L is bigger ($H_y = 6$), the algorithm, due to the increased flexibility in the moves, is able to learn faster. As a further experiment, we fixed $H_y = 5$ and perturbed the algorithm parameter $\bar{r}$ so that $\bar{r} \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$. The results of this experiment are shown in the middle panel of Figure 3. It can be noted that smaller values of $\bar{r}$ (i.e. $\bar{r} = 0.1$ and $\bar{r} = 0.3$) lead to a more consistent performance as compared to the higher values of $\bar{r}$, which lead to negative spikes in the performance. This behaviour might be due to the fact that higher values of $\bar{r}$ essentially imply that MPCaRL trusts the MPC actions more than the Q-L actions, hence penalizing the ability of MPCaRL to quickly learn the attack strategy. Intuitively, simulations show that, while the MPC component is important for enhancing the agent’s defense, too much influence of this component on the Q-L can reduce the attack performance of the agent (which, in turn, is essential in order to score more points). Consistently, the same behaviour can be observed when $\underline{r}$ is perturbed ($H_y = 5$, $\bar{r} = 0.1$ and $\underline{r} \in \{-0.1, -0.3, -0.5, -0.7, -0.9\}$), as shown in Figure 3 (right panel), where it can be observed that the most negative values of $\underline{r}$ lead to dips in performance.

Figure 3: Left panel: rewards obtained by MPCaRL when $H_y$ is perturbed. All the other parameters were kept unchanged (i.e. $\bar{r} = 0.1$, $\underline{r} = 0$). Middle panel: rewards obtained by MPCaRL when $\bar{r}$ is perturbed. In this experiment, $H_y = 5$ and $\underline{r} = 0$. Right panel: rewards obtained by MPCaRL when $\underline{r}$ is perturbed. In this experiment, $H_y = 5$ and $\bar{r} = 0.1$.

5 Conclusions and Future Work

We investigated the possibility of combining Q-L and MPC so that they can augment each other’s capabilities. In doing so, we introduced a novel algorithm, the MPCaRL, that: (i) leverages MPC and mathematical modelling for certain, critical, functionalities; (ii) makes use of Q-L to perform non-critical functionalities; (iii) uses MPC to drive the state-space exploration of Q-L. The effectiveness of our algorithm has been illustrated by letting it play against Pong, and its performance has been analysed when the parameters are perturbed. Interestingly, the experiments highlight the ability of the algorithm to quickly learn general tasks with essentially no training. In future work, we will focus on implementing the MPCaRL in different settings (e.g., Hide and Seek with multiple agents might be a good candidate) and we will explore its convergence and stability properties.