Abstract
Over the years, Reinforcement Learning (RL) established itself as aconvenient paradigm to learn optimal policies from data. However, most RLalgorithms achieve optimal policies by exploring all the possible actions andthis, in realworld scenarios, is often infeasible or impractical due to e.g.safety constraints. Motivated by this, in this paper we propose to augment RLwith Model Predictive Control (MPC), a popular modelbased control algorithmthat allows to optimally control a system while satisfying a set ofconstraints. The result is an algorithm, the MPCaugmented RL algorithm(MPCaRL) that makes use of MPC to both drive how RL explores the actions and tomodify the corresponding rewards. We demonstrate the effectiveness of theMPCaRL by letting it play against the Atari game Pong. The results obtainedhighlight the ability of the algorithm to learn general tasks with essentiallyno training.
Quick Read (beta)
Driving Reinforcement Learning with Models
Abstract
Over the years, Reinforcement Learning (RL) established itself as a convenient paradigm to learn optimal policies from data. However, most RL algorithms achieve optimal policies by exploring all the possible actions and this, in realworld scenarios, is often infeasible or impractical due to e.g. safety constraints. Motivated by this, in this paper we propose to augment RL with Model Predictive Control (MPC), a popular modelbased control algorithm that allows to optimally control a system while satisfying a set of constraints. The result is an algorithm, the MPCaugmented RL algorithm (MPCaRL) that makes use of MPC to both drive how RL explores the actions and to modify the corresponding rewards. We demonstrate the effectiveness of the MPCaRL by letting it play against the Atari game Pong. The results obtained highlight the ability of the algorithm to learn general tasks with essentially no training.
Driving Reinforcement Learning with Models
Pietro Ferraro Dyson School of Design Engineering Imperial College London South Kensington, London [email protected] Meghana Rathi School of Electrical & Electronic Eng. University College Dublin Belfield, Dublin, 4 [email protected] Giovanni Russo School of Electrical & Electronic Eng. University College Dublin Belfield, Dublin, 4 [email protected]
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
Driven by its recent impressive results, see e.g. [2],[3], reinforcement learning (RL) has become a popular paradigm to make agent achieve optimal policies [1]. On an intuitive level RL manages to find the optimal policy by exploring the state space and by learning which actions are best for each scenario, on the basis of function called reward function, which rewards or punishes the agent for its behaviour. Although, it can be proven that in the long term a RL algorithm manages to reach optimal control, the tradeoff is that during the learning process the agent might end up in unsafe situations [4]; this, especially in applications where agents need to physically interact with their environment, might end up being infeasible due to safety requirements. Take autonomous driving, as an example: a key requirement would be that, while learning a policy for certain critical functionalities, safety needs to be guaranteed at all times (to e.g. avoid crashes). Furthermore, on this note we point out that it has been recently found that even using a dataset with 30 million driving examples is inadequate to train policies that are safe in any operational condition: no dataset exists that can exhaustively cover all situations that might arise when driving [5].
Motivated by this, we propose in this paper an algorithm that complements the capabilities of RL with those of a well known and established modelbased control method such as the Model Predictive Control (MPC). In so doing, we present the MPCaugmentedRL (MPCaRL) that: (i) for certain critical functionalities, for which a dynamical model can be identified, it computes a control action optimizing a given cost function; (ii) for noncritical functionalities it uses RL to compute a policy from data; (iii) it makes use, when possible, of the MPC to both provide a guidance for the statespace exploration and to finetune the rewards obtained by the RL.
1.1 Related Work
This work is related to a number of threads of recent work in RL. We now briefly review some related works on physics simulation, modelbased and safe RL.
Physics simulation. The idea of using models and simulation environments to develop intelligent reinforcement learning agents has recently been attracting a lot of attention from academia. For example, in [6] it is shown how a physical simulator can be embedded in a deep network, enabling agents to both learn parameters of the environment and improve the control performance of the agent. Essentially, this is done by simulating rigid body dynamics via a linear complementarity problem (LCP) technique [7][8]. LCP techniques are also used within other simulation environments such as MuJoCo [9], Bullet [10], and DART [11]. As remarked in [6], in these environments gradients are computed through finite differences and this, in turn, implies evaluating the results of the simulations multiple times. In order to improve the computational burden of this approach, LCP techniques have also been developed which involve analytic differentiation. See, for example [12] and, more recently, [13]. Furthermore, a complementary body of literature investigates the possibility of integrating into networks the mechanisms inspired by the intuitive human ability to understand physics. See e.g. [14][15][16], which leverage the ideas of[17][18][19].
Modelbased RL. Although modelfree methods, have achieved considerable successes in the recent years (see for example [2]), many works suggest that a modelbased approach can potentially achieve better performances. Among those, the interested reader can refer to [4], [Atkeson97acomparison], [kurutach2018modelensemble]. Modelfree approaches (see e.g. [2]), in fact, tend to provide optimal policies only on the long term and employing previous knowledge of domain, such as a mathematical model, can considerably decrease the learning time of a modelbased RL algorithm (with respect to a modelfree one).[4]
Based on these premises, the research in modelbased RL is a very active area and it is focused on two main settings. The first one makes use of neural networks and a suitable loss function to simulate the dynamics of interest [20][21], whereas another approach makes use of more classical mathematical models and closely resembles classic system identification [23]. Recently, an orthogonal stream of literature (see e.g. [HERZALLAH2015199, doi:10.1002/acs.742, KARNY19961719]) has also emerged and focuses on designing of optimal decision makers based on a pure probabilistic knowledge (in the form of a probability density function) of the system of interest and dynamic programming [22].
Safe RL. The design of safe RL algorithms is a key research topic and different definitions of safety have been proposed in the literature( e.g. [24] [25] and references therein). As an example, in [DBLP:journals/corr/abs11092147] the authors relate safety to a set of error states associated to dangerous/risky situations while in [26] risk adversion is specified in the reward. Other works, [27][30] define safety, by using accurate probabilistic model of the system or [31] where a notion of safety for closedloop probabilistic systems is defined. A complementary approach is the one of modelbased RL where safety is formalized via state space constraints. Examples of this approach include [2, 4], where Lyapunov functions are used to show forward invariance of the safety set defined by the constraints. Finally, we note that other approaches include [Garcia:2012:SES:2444851.2444864], where a priori knowledge of the system is used, in order to craft safe backup policies or [safe] in which authors consider uncertain systems and enforce probabilistic guarantees on their performance.
1.2 Contributions
Our key contributions can be summarized as follows. We introduce an algorithm, the MPCaRL that combines RL and MPC in a novel way so that they can augment each other’s strengths. The MPCaRL is inspired by the fact that, often, complex tasks can be fulfilled by combining together a set of functionalities and, in turn, each functionality can be either critical or noncritical. For example, in automated driving, critical functionalities are those directly associated to the prevention of crashes. At each iteration, MPCaRL identifies for critical functionalities, a mathematical model. Once the model is identified, then a control action is computed via MPC so as to optimize a given cost function. For all the other noncritical functionalities the algorithm makes use of a RL algorithm (in particular, we will use QLearning) to compute a policy from data. Moreover, MPC and RL interact within the MPCaRL and, in particular, MPC both drives the statespace exploration and tunes the RL rewards in order to speed the learning process. Finally, in order to illustrate the effectiveness of the MPCaRL, we show its performances at playing the Atari game Pong. In these contexts, the critical functionality is the defense (i.e., when the ball is bounced back to the player) which, if not performed correctly, leads to the loss of the round.
Interestingly the experiments highlight the ability of the algorithm to learn tasks with essentially no training.
2 Background
We start with outlining the two building blocks composing MPCaRL, i.e. the Model Predictive Control algorithm and the QLearning. We refer the reader to [GARCIA1989335] and [1] for a detailed description of MPC and QLearning.
Model Predictive Control. Model Predictive Control (MPC) is a modelbased control technique. Essentially, at each timestep, the algorithm computes a control action by solving an optimization problem having as constraint the (discretetime) dynamics of the system being controlled. In addition to the dynamics, other system requirements (e.g. safety or feasibility requirements) can also be formalized as constraints of the optimization problem [GARCIA1989335]. Let ${x}_{k}\in {\mathbb{R}}^{n}$ be the state variable of the system at time $k$, ${u}_{k}\in {\mathbb{R}}^{m}$ be its control input and ${\eta}_{k}$ be some noise. In this paper we consider discretetime dynamical systems of the form ${x}_{k+1}={A}_{k}{x}_{k}+{D}_{k}+{C}_{k}{u}_{k}+{\eta}_{k}$ with initial condition ${x}_{initial}$ and where the timevarying matrices have appropriate dimensions. For this system, formally the MPC algorithm generates the control input ${u}_{k}$ by solving the problem
$$\underset{{x}_{0:T}\in \mathcal{X},{u}_{0:T}\in \mathcal{A}}{\text{argmin}}\sum _{t=0}^{T}{J}_{t}({x}_{t},{u}_{t})\mathit{\hspace{1em}}\text{subject to}\mathit{\hspace{1em}\hspace{1em}\u2006}{x}_{t+1}={A}_{t}{x}_{t}+{D}_{t}+{C}_{t}{u}_{t},{x}_{0}={x}_{initial},$$  (1) 
with ${x}_{0:T}$ (${u}_{0:T}$) denoting the sequence $\{{x}_{0},\mathrm{\dots},{x}_{T}\}$ (resp. $\{{u}_{0},\mathrm{\dots},{u}_{T}\}$) and where: (i) $\mathcal{A}$ and $\mathcal{X}$ are sets modelling the constraints for the valid control actions and states; (ii) ${\sum}_{t=0}^{T}{J}_{t}({x}_{t},{u}_{t})$ is the cost function being optimized. See [doi:10.1080/00207170600593901], [1624479] for more details on this subject.
QLearning and Markov Decision Processes. QLearning (QL) is a model free RL algorithm, whose aim is to find an optimal policy with respect to a finite Markov Decision Process (MDP). We adopt the standard formalism for MDPs. A MDP [1] is a discrete stochastic model defined by a tuple $\u27e8\mathcal{S},\mathcal{A},P,\gamma ,\mathcal{R}\u27e9$, where: (i) $\mathcal{S}$ is the set of states $s\in \mathcal{S}$; (ii) $\mathcal{A}$ is the set of actions $a\in \mathcal{A}$; (iii) $\mathrm{P}({s}^{\prime}s,a)$ is the probability of transitioning from state $s$ to state ${s}^{\prime}$ under action $a$ ; (iv) $\gamma \in [0,1)$ is the discount factor; (v) $\mathcal{R}(s,a)$ is the reward of choosing the action $a$ in the state $s$. Upon performing an action, the agent receives the reward $\mathcal{R}({s}_{t},{a}_{t})$. A policy, $\pi $, specifies (for each state) the action that the agent will take and the goal of the agent is that of finding the policy that maximizes the expected discounted total reward. The value ${Q}^{\pi}(s,a)$, named Qfunction, corresponding to the pair $(s,a)$ represents the estimated expected future reward that can be obtained from $(s,a)$ when using policy $\pi $. The objective of Qlearning is to estimate the Qfunction for the optimal policy ${\pi}^{*}$, ${Q}^{{\pi}^{*}}(s,a)$. Define the estimate as $Q(s,a)$. The Qlearning algorithm works then as follows: after setting the initial values for the Qfunction, at each time step, observe current state ${s}_{t}$ and select action ${a}_{t}$, according to policy $\pi ({s}_{t})$. After receiving the reward $R({s}_{t},{a}_{t})$ update the corresponding value of the Qfunction as follows:
$$Q({s}_{t},{a}_{t})\leftarrow (1\alpha )Q({s}_{t},{a}_{t})+\alpha \left[\mathcal{R}({s}_{t},{a}_{t})+\gamma \underset{a}{\mathrm{max}}Q({s}_{t+1},a)\right],$$  (2) 
where $\alpha \in [0,1]$ is called the learning rate. Notice that the Qlearning algorithm does not specify which policy $\pi (\cdot )$ should be considered. In theory, to converge the agent should try every possible action for every possible state many times. For practical reasons, a popular choice for the Qlearning policy is the $\u03f5$greedy policy, which selects its highest valued (greedy) action, ${\pi}_{\u03f5}({s}_{t})={argmax}_{{a}_{t}}Q({s}_{t},{a}_{t})$, with probability $1\u03f5(k1)/k$ and randomly selects among all other $k$ actions with probability $\u03f5/k$ [wunder2010classes]. However, in this paper, due to the nature of the proposed algorithm and the way rewards are provided, we implemented a greedy policy, with $\u03f5=0$. The interested reader can refer to [peng1994incremental] and [wunder2010classes] for further information on this topic.
3 The MPCaRL Algorithm
We are now ready to introduce the MPCaRL algorithm, the key steps of which are summarized as pseudocode in Algorithm 3. The main intuition behind this algorithm is that, in many real world systems, tasks can be broken down into a set of critical and noncritical functionalities. Often, critical functionalities are associated to safety constraints and, for these functionalities, it often makes sense for the designer to build a mathematical model. Given this intuition, the MPCaRL aims at combining the strengths of MPC and QL. Indeed: (i) within MPCaRL, MPC can directly control the agent in performing critical functionalities and, at the same time, it drives the state exploration of QL. This can be used to e.g. prevent the agent to enter in states that might be unsafe; (ii) on the other hand, QL handles noncritical functionalities, for which e.g. no mathematical model is available and hence for which a classic control algorithm could not be used.
The MPCaRL takes as input the following design parameters: (i) the set of allowed actions, $\mathcal{A}$; (ii) the time horizon and cost function used is (1); (iii) a recording window used to identify the dynamics in (1); (iv) a positive and a negative reward, $\overline{r}$ and $\underset{\xaf}{r}$ used by MPC to drive the exploration of QL; (v) an initial matrix $Q(s,a)$. Given this setup, following Algorithm 3 the following steps are performed:
 Step 1:

at each timestep, MPCaRL determine whether the task functionality is a critical or a noncritical functionality. As we will see in Section 4, the critical functionality in the Pong game is associated to the agent’s defense strategy;
 Step 2a:

if the functionality is critical, then the agent’s action is computed via MPC. Even if the action applied by the agent is given by MPC, the action that would have been obtained via QL is also computed. If the action from MPC and QL are the same, then the $Q(s,a)$ matrix is updated by using the positive reward $\overline{r}$. On the other hand, if the actions from MPC and QL differ from one another the $Q(s,a)$ matrix is updated with a nonpositive reward $\underset{\xaf}{r}$;
 Step 2b:

if, instead, the functionality is noncritical (or a model cannot be identified) then, the agent’s action is generated by QL;
 Step 3:

all relevant quantities are saved within the main algorithm loop
{algorithmic}[1] \StateInputs: \StateAllowed actions, $\mathcal{A}$ \StateTime horizon, $T$, and cost function ${\sum}_{t=0}^{T}{J}_{t}({x}_{t},{u}_{t})$ \StateRecording window ${T}_{w}$ \StateRewards $\overline{r}$ and $\underset{\xaf}{r}$ \StateInitial matrix $Q(s,a)$ \StateMain loop: \For$k=0,\mathrm{\dots}$ \StateDetermine functionality (and whether model in (1) can be identified) \IfFunctionality critical \StateGet ${s}_{k}$ and ${x}_{k}$ \If$k\ge 1$ \If${a}_{k1}={u}_{k1}$ \State$Q({s}_{k1},{u}_{k1})\leftarrow (1\alpha )Q({s}_{k1},{u}_{k1})+\alpha \left[\overline{r}+\gamma {\mathrm{max}}_{a}\{Q({s}_{k},a)\}\right]$ \Else\State$Q({s}_{k1},{u}_{k1})\leftarrow (1\alpha )Q({s}_{k1},{u}_{k1})+\alpha \left[\underset{\xaf}{r}+\gamma {\mathrm{max}}_{a}\{Q({s}_{k},a)\}\right]$ \EndIf\EndIf\If$k{T}_{w}>0$ \StateGiven ${x}_{k{T}_{w}:k}$ identify the model in (1) and compute ${u}_{k}$ with ${x}_{init}={x}_{k}$ \Else\StateGiven ${x}_{0:k}$ identify the model in (1) and compute ${u}_{k}$ with ${x}_{init}={x}_{k}$ \EndIf\StateCompute ${a}_{k}$ using QL (Section 2) \StateApply ${u}_{k}$ \StateSave ${x}_{k}$, ${s}_{k}$, ${a}_{k}$, ${u}_{k}$ \Else\StateApply ${a}_{k}$ computed via QL \StateSave ${s}_{k}$, ${a}_{k}$ \EndIf\State$k\leftarrow k+1$ \EndFor
4 Simulations
We now illustrate the effectiveness of MPCaRL by making it play against the Atari game Pong. Pong is a 2 player game where each player moves a paddle in order to bounce a ball to the opponent. A player scores a point when the opponent fails to bounce the ball back and the game ends when a player scores $21$ points. In what follows, we give a thorough description of how Algorithm 3 has been implemented in order to allow MPCaRL to play against Pong.
4.1 The environment and data gathering
The environment of the game was setup using the OpenAI gym library in Python. In particular, we used the PongDeterministicv4 (with $4$ frame skips) configuration of the environment, which is the one used to assess Deep QNetworks, see e.g. [EndtoEnd] and [mnihatari2013]. The configuration used has, as observation space, $\text{Box}(210,160,3)$ ^{1}^{1} 1 See http://gym.openai.com/docs/ for documentation on the environment observation space. Within our simulations, we first removed part of the images that display the points earned by the players and this yielded an observation space of $\text{Box}(160,160,3)$ and then we downsampled the resulting image to get a reduced observation space of $\text{Box}(80,80,3)$ so that each frame consists of a matrix of $80\times 80$ pixels. Within the image, a coordinate system is defined within the environment, with the origin of the $x$ and $y$ axes being in the bottomright corner.
Given the above observation space, both the position of the ball and the vertical position of the paddle moved by MPCaRL were extracted from each frame (see also Figure 1, left panel). In particular:

•
At the beginning of the game, the centroid of the ball is found by iterating through the frame to find the location of all pixels with a value of $236$ (this corresponds to the white color, i.e. the color of the ball in Pong). Then, once the ball is found the first time, the frame is only scanned in a window around the position of the ball previously found (namely, we used a window of $80\times 12$ pixels, see Figure 1, right panel);

•
Similarly the paddle’s centroid position is found by scanning the frame for pixels having value $92$ (this corresponds to the green color, i.e. the color of the MPCaRL paddle).
4.2 Definition of the task and its functionalities
The agent’s task is that of winning the game, which essentially consists of two phases: (i) defense phase, where the agent needs to move the paddle to bounce the ball in order to avoid that the opponent makes a point; (ii) an attack phase, where the agent needs instead to properly bounce the ball in order to make the point. When implementing MPCaRL to play Pong, we associated defense to critical functionalities, while the attack functionalities were defined as noncritical. Explicit conditions identifying critical and noncritical functionalities are given in the next subsections.
4.3 Implementing MPC
We describe the MPC implementation used within MPCaRL to play Pong by first introducing the mathematical model serving as the constraint in (1). This model describes both the dynamics of the ball and of the paddle moved by MPCaRL.
Ball dynamics. We denote by ${x}_{t}^{(b)}$ and ${y}_{t}^{(b)}$ the $x$ and $y$ coordinates of the centroid of the ball at time $t$. The mathematical model describing the dynamics of the ball is then:
$$\begin{array}{c}\hfill {x}_{t+1}^{(b)}={x}_{t}^{(b)}+{v}_{t,x},{y}_{t+1}^{(b)}={y}_{t}^{(b)}+{v}_{t,y},\end{array}$$  (3) 
where ${x}_{t+1}^{(b)}$ and ${y}_{t+1}^{(b)}$ are the predicted next coordinates at time $t+1$ and where the speeds at time $t$, i.e. ${v}_{t,x}$ and ${v}_{t,y}$ are computed from the positions extracted from the current and the two previous frames, i.e. ${x}_{t2}^{(b)},{x}_{t1}^{(b)},{x}_{t}^{(b)}$ and ${y}_{t2}^{(b)},{y}_{t1}^{(b)},{y}_{t}^{(b)}$ (that is, ${T}_{w}=2$ in Algorithm 3). In particular, this is done by first computing the quantities ${v}_{x1},{v}_{x2}$ and ${v}_{y1},{v}_{y2}$ as follows:
$$\left[\begin{array}{c}\hfill {v}_{{x}_{1}}\hfill \\ \hfill {v}_{{y}_{1}}\hfill \end{array}\right]=\left[\begin{array}{c}\hfill {x}_{t}^{(b)}\hfill \\ \hfill {y}_{t}^{(b)}\hfill \end{array}\right]\left[\begin{array}{c}\hfill {x}_{t1}^{(b)}\hfill \\ \hfill {y}_{t1}^{(b)}\hfill \end{array}\right],\left[\begin{array}{c}\hfill {v}_{{x}_{2}}\hfill \\ \hfill {v}_{{y}_{2}}\hfill \end{array}\right]=\left[\begin{array}{c}\hfill {x}_{t1}^{(b)}\hfill \\ \hfill {y}_{t1}^{(b)}\hfill \end{array}\right]\left[\begin{array}{c}\hfill {x}_{t2}^{(b)}\hfill \\ \hfill {y}_{t2}^{(b)}\hfill \end{array}\right].$$  (4) 
Consider now the speed along the $x$ axis. If there is no impact between the ball and the paddle, we set ${v}_{t,x}=0.5({v}_{x1}+{v}_{x2})+var\left([{v}_{x1},{v}_{x2}]\right)={\overline{v}}_{t,x}+{\eta}_{t,x}$ (where $var(a)$ denotes the variance of the generic vector $a$). Instead, along the $y$ axis, we have ${v}_{t,y}=0.5({v}_{y1}+{v}_{y2})+var\left([{v}_{y1},{v}_{y2}]\right)={\overline{v}}_{t,x}+{\eta}_{t,x}$ if there has been no impact and ${v}_{t,x}={v}_{x1}$ if there has been an impact of the ball with one of the walls.
Paddle dynamics. In the gym environment used for the experiments, the only control action that can be applied by MPCaRL at time $t$, i.e. ${u}_{t}$ is that of moving its paddle. In particular, the agent can either move the paddle up (${u}_{t}=1$) or down (${u}_{t}=1$) or simply not moving the paddle (${u}_{t}=0$). It follows that, given the vertical position of the centroid of the paddle at time $t$, say ${y}_{t}^{(p)}$, its dynamics can be modeled by
$$\begin{array}{c}\hfill {y}_{t+1}^{(p)}={y}_{t}^{(p)}+{u}_{t}.\end{array}$$  (5) 
The MPC model and the cost function. Combining the models in (3)  (5) yields the dynamical system serving as constraint in (1). Note that the resulting model can be formally written as the system in (1) once the state ${x}_{t}$ is defined as ${x}_{t}={[{x}_{t}^{(b)},{y}_{t}^{(b)},{y}_{t}^{(p)}]}^{T}$. Finally, in the implementation of the MPC algorithm, we used as cost function ${\sum}_{t=0}^{T}{J}_{t}={\parallel {y}_{t+T}^{(b)}{y}_{t+T}^{(p)}\parallel}^{2}$. That is, with this choice of cost function the algorithm seeks to regulate the paddle’s position so that the distance between the position of the ball at time $t$ and the position of the MPCaRL paddle at time $t+T$ is minimised. Note that the time horizon, $T$, used in the above cost function is obtained by propagating the ball model (3) in order to estimate after how many iterates the ball will hit the border protected by the MPCaRL paddle.
4.4 Implementing QL
We implemented the QL algorithm outlined in Section 4. The set of actions available to the agent were ${a}_{t}\in \{1,0,+1\}$, while the state at time ${s}_{t}$ was defined as the $5$dimensional vector containing: (i) the coordinates of the position at time time $t$ of the ball (in pixels); (ii) the velocity of the ball (rounded to the closest integer) across the $x$ and $y$ axes; (iii) the position of the paddle moved by MPCaRL. The reward was obtained from the game environment: our agent was given a reward of $+1$, each time the opponent missed to hit the ball, and $1$ each time our agent missed to hit the ball. Finally, the values of the Qtable were initialized to $0$. In the experiments, the stateaction pair was updated whenever a point was scored and a greedy policy was used to select the action. Also, in the experiments we set both $\alpha $ and $\gamma $ in (2) to $0.7$. Moreover, following Algorithm 3, the Qfunction was also updated when noncritical functionalities were performed by the agent. In particular, within the experiments we assigned: (i) a positive reward, $\overline{r}$, whenever the action from QL and MPC were the same; (ii) a nonpositive reward, $\underset{\xaf}{r}$, whenever the actions were not the same. In this way, within MPCaRL, the QL component is driven to learn to take actions similar to MPC.
Finally, we now describe the conditions that we used in the experiments to allow MPCaRL to discriminate between critical and noncritical functionalities. Intuitively, critical functionalities were associated to the defense phase of the game. Therefore, all the functionalities performed when the ball was coming towards the MPCaRL paddle and, at the same time, the future vertical position of the ball (predicted via the model) was far from the actual position of the MPCaRL paddle were defined as critical. That is, the states that are defined as critical are the ones that satisfy the conditions $$ (whereas all the others are considered noncritical). Note that: (i) ${v}_{t}^{(b)}$ is estimated from the game frames as described above (in the environment negative velocities along the $x$ axis mean that the ball is coming towards the green paddle); (ii) the computation of ${y}_{t+T}^{(b)}$ relies on simulating the model describing the ball’s dynamics (3); (iii) ${H}_{y}$ is a threshold and this is a design parameter".
4.5 Results
We are now ready to present the results obtained by letting MPCaRL play Pong (larger versions of the figures of this Section are provided in the Supplementary Materials). The results are quantified by plotting the game reward as a function of the number of episodes played by MPCaRL. An episode consists of as many rounds of pong it takes for one of the players to reach $21$ points, while the game reward is defined to be the difference between the points scored by MPCaRL within the episode and the points scored by the opponent within the episode. Essentially, a negative game reward means that MPCaRL lost that episode; the lowest possible value that can be attained is $21$ and this happens when MPCaRL is not able to score any point. Viceversa, a positive game reward means that the agent was able to beat the opponent; the maximum value that can be obtained is $+21$, when the opponent did not score any point.
As a first experiment, we implemented an agent that would only use MPC or the untrained QL algorithm (see e.g. the implementation in [andrej]). We let this agent play Pong for $50$ episodes and, as the left panel in Figure 2 shows, as expected, the QL agent did not obtain good rewards in the first $50$ episodes, consistently loosing games with a difference in the scores of about $20$ points. Instead, when the agent used the MPC described in Section 4 better performance were obtained. These performance, however, were not comparable with those obtained via a trained QL agent and the reason for this is that, while MPC allows the agent to defend, it does not allow for the learning of an attack strategy to consistently obtain points. Using MPCaRL allowed to overcome the shortcomings of the MPC and QL agents. In particular, as shown in the left panel of Figure 2, when the MPCaRL agent played against Pong, it was able to consistently beat the game (note the agent never lost a game) while quickly learning an attack strategy to obtain high game rewards. Indeed, note how the agent is able to consistently obtain rewards of about $20$ within $50$ episodes.
In order to further investigate the performance of MPCaRL, we also evaluate its performance sensitivity to the parameters $\overline{r}$, $\underset{\xaf}{r}$ and ${H}_{y}$. We first take ${H}_{y}$ in consideration; ${H}_{y}$ directly affects the definition of critical functionalities. As figure 3 (left panel) illustrates, MPCaRL is still able to consistently obtain high rewards when ${H}_{y}$ is perturbed, i.e. ${H}_{y}\in \{4,5,6\}$. Notice that when ${H}_{y}=4$, while MPCaRL is still able to beat the game, it also experiences some drops in the rewards. This phenomenon, which will be further investigated in future studies, might be due to the fact that restricting the space of movements of the QL component of the algorithm also restricts its learning capabilities. Instead, when the manoeuvre space of the QL is bigger (${H}_{y}=6$), the algorithm, due to the increased flexibility in the moves is able to learn faster. As a further experiment, we fixed ${H}_{y}=5$ and perturbed the algorithm parameter $\overline{r}$ so that $\overline{r}\in \{0.1,0.3,0.5,0.7,0.9\}$. The results of this experiment are shown in the middle panel of Figure 3. It can be noted that smaller values of $\overline{r}$ (i.e. $\overline{r}=0.1$ and $\overline{r}=0.3$) lead to a more consistent performance as compared to the higher values of $\overline{r}$ which lead to negative spikes in the performance. This behaviour might be due to the fact that higher values of $\overline{r}$ essentially imply that MPCaRL trusts more the MPC actions than QL actions, hence penalizing the ability of MPCaRL to quickly learn the attack strategy. Intuitively, simulations show that, while the MPC component is important for enhancing the agent’s defense, too much influence of this component on the QL can reduce the attack performance of the agent (this, in turn, is essential in order to score higher points). Consistently, the same behaviour can be observed when $\underset{\xaf}{r}$ is perturbed (${H}_{y}=5$, $\overline{r}=0.1$ and $\underset{\xaf}{r}\in \{0.1,0.3,0.5,0.7,0.9\}$), as shown in Figure 3 (right panel), where it can be observed that the most negative values $\underset{\xaf}{r}$ lead to dips in performance.
5 Conclusions and Future Work
We investigated the possibility of combining QL and MPC so that they can augment each other’s capabilities. In doing so, we introduced a novel algorithm, the MPCaRL that: (i) leverages MPC and mathematical modelling for certain, critical, functionalities; (ii) makes use of QL to both perform noncritical functionalities; (iii) uses MPC to drive the statespace exploration of QL. The effectiveness of our algorithm has been illustrated by letting it play against Pong. Its performance are analysed when the parameters are perturbed. Interestingly, the experiments highlight the ability of the algorithm to quickly learn general tasks with essentially no training. In future works, we will focus on implementing the MPCaRL in different settings (e.g., Hide and Seek with multiple agents might be a good candidate) and we will explore its convergence and stability properties.