Policy Poisoning in Batch Reinforcement Learning and Control

  • 2019-10-31 01:58:12
  • Yuzhe Ma, Xuezhou Zhang, Wen Sun, Xiaojin Zhu
  • 0

Abstract

We study a security threat to batch reinforcement learning and control wherethe attacker aims to poison the learned policy. The victim is a reinforcementlearner / controller which first estimates the dynamics and the rewards from abatch data set, and then solves for the optimal policy with respect to theestimates. The attacker can modify the data set slightly before learninghappens, and wants to force the learner into learning a target policy chosen bythe attacker. We present a unified framework for solving batch policy poisoningattacks, and instantiate the attack on two standard victims: tabular certaintyequivalence learner in reinforcement learning and linear quadratic regulator incontrol. We show that both instantiation result in a convex optimizationproblem on which global optimality is guaranteed, and provide analysis onattack feasibility and attack cost. Experiments show the effectiveness ofpolicy poisoning attacks.

 

Quick Read (beta)

Policy Poisoning
in Batch Reinforcement Learning and Control

Yuzhe Ma
University of Wisconsin–Madison
[email protected]
&Xuezhou Zhang
University of Wisconsin–Madison
[email protected]
\ANDWen Sun
Microsoft Research New York
[email protected]
&Xiaojin Zhu
University of Wisconsin–Madison
[email protected]
Abstract

We study a security threat to batch reinforcement learning and control where the attacker aims to poison the learned policy. The victim is a reinforcement learner / controller which first estimates the dynamics and the rewards from a batch data set, and then solves for the optimal policy with respect to the estimates. The attacker can modify the data set slightly before learning happens, and wants to force the learner into learning a target policy chosen by the attacker. We present a unified framework for solving batch policy poisoning attacks, and instantiate the attack on two standard victims: tabular certainty equivalence learner in reinforcement learning and linear quadratic regulator in control. We show that both instantiation result in a convex optimization problem on which global optimality is guaranteed, and provide analysis on attack feasibility and attack cost. Experiments show the effectiveness of policy poisoning attacks.

\usetikzlibrary

arrows,calc,shapes,intersections \newfloatcommandcapbtabboxtable[][0.28]

 

Policy Poisoning
in Batch Reinforcement Learning and Control


  Yuzhe Ma University of Wisconsin–Madison [email protected] Xuezhou Zhang University of Wisconsin–Madison [email protected] Wen Sun Microsoft Research New York [email protected] Xiaojin Zhu University of Wisconsin–Madison [email protected]

\@float

noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

With the increasing adoption of machine learning, it is critical to study security threats to learning algorithms and design effective defense mechanisms against those threats. There has been significant work on adversarial attacks Biggio and Roli (2018); Huang et al. (2011). We focus on the subarea of data poisoning attacks where the adversary manipulates the training data so that the learner learns a wrong model. Prior work on data poisoning targeted victims in supervised learning Mei and Zhu (2015); Koh et al. (2018); Wang and Chaudhuri (2018); Zhang and Zhu (2019) and multi-armed bandits Jun et al. (2018); Ma et al. (2018); Liu and Shroff (2019). We take a step further and study data poisoning attacks on reinforcement learning (RL). Given RL’s prominent applications in robotics, games and so on, an intentionally and adversarially planted bad policy could be devastating.

While there has been some related work in test-time attack on RL, reward shaping, and teaching inverse reinforcement learning (IRL), little is understood on how to training-set poison a reinforcement learner. We take the first step and focus on batch reinforcement learner and controller as the victims. These victims learn their policy from a batch training set. We assume that the attacker can modify the rewards in the training set, which we show is sufficient for policy poisoning. The attacker’s goal is to force the victim to learn a particular target policy (hence the name policy poisoning), while minimizing the reward modifications. Our main contribution is to characterize batch policy poisoning with a unified optimization framework, and to study two instances against tabular certainty-equivalence (TCE) victim and linear quadratic regulator (LQR) victim, respectively.

2 Related Work

Of particular interest is the work on test-time attacks against RL Huang et al. (2017). Unlike policy poisoning, there the RL agent carries out an already-learned and fixed policy π to e.g. play the Pong Game. The attacker perturbs pixels in a game board image, which is part of the state s. This essentially changes the RL agent’s perceived state into some s. The RL agent then chooses the action a:=π(s) (e.g. move down) which may differ from a:=π(s) (e.g. move up). The attacker’s goal is to force some specific a on the RL agent. Note π itself stays the same through the attack. In contrast, ours is a data-poisoning attack which happens at training time and aims to change π.

Data-poisoning attacks were previously limited to supervised learning victims, either in batch mode Biggio et al. (2012); Xiao et al. (2015); Li et al. (2016); Mei and Zhu (2015) or online mode Wang and Chaudhuri (2018); Zhang and Zhu (2019). Recently data-poisoning attacks have been extended to multi-armed bandit victims Jun et al. (2018); Ma et al. (2018); Liu and Shroff (2019), but not yet to RL victims.

There are two related but distinct concepts in RL research. One concept is reward shaping Ng et al. (1999); Asmuth et al. (2008); Devlin and Kudenko (2012); Wiewiora (2003) which also modifies rewards to affect an RL agent. However, the goal of reward shaping is fundamentally different from ours. Reward shaping aims to speed up convergence to the same optimal policy as without shaping. Note the differences in both the target (same vs. different policies) and the optimality measure (speed to converge vs. magnitude of reward change).

The other concept is teaching IRL Cakmak and Lopes (2012); Brown and Niekum (2019); Kamalaruban et al. (2019). Teaching and attacking are mathematically equivalent. However, the main difference to our work is the victim. They require an IRL agent, which is a specialized algorithm that estimates a reward function from demonstrations of (state, action) trajectories alone (i.e. no reward given). In contrast, our attacks target more prevalent RL agents and are thus potentially more applicable. Due to the difference in the input to IRL vs. RL victims, our attack framework is completely different.

3 Preliminaries

A Markov Decision Process (MDP) is defined as a tuple (𝒮,𝒜,P,R,γ), where 𝒮 is the state space, 𝒜 is the action space, P:𝒮×𝒜Δ𝒮 is the transition kernel where Δ𝒮 denotes the space of probability distributions on 𝒮, R:𝒮×𝒜 is the reward function, and γ[0,1) is a discounting factor. We define a policy π:𝒮𝒜 as a function that maps a state to an action. We denote the Q function of a policy π as Qπ(s,a)=𝔼[τ=0γτR(sτ,aτ)s0=s,a0=a,π], where the expectation is over the randomness in both transitions and rewards. The Q function that corresponds to the optimal policy can be characterized by the following Bellman optimality equation:

Q*(s,a)=R(s,a)+γs𝒮P(s|s,a)maxa𝒜Q*(s,a), (1)

and the optimal policy is defined as π*(s)argmaxa𝒜Q*(s,a).

We focus on RL victims who perform batch reinforcement learning. A training item is a tuple (s,a,r,s)𝒮×𝒜××𝒮, where s is the current state, a is the action taken, r is the received reward, and s is the next state. A training set is a batch of T training items denoted by D=(st,at,rt,st)t=0:T-1. Given training set D, a model-based learner performs learning in two steps:

Step 1. The learner estimates an MDP M^=(𝒮,𝒜,P^,R^,γ) from D. In particular, we assume the learner uses maximum likelihood estimate for the transition kernel P^:𝒮×𝒜Δ𝒮

P^ argmaxPt=0T-1logP(st|st,at), (2)

and least-squares estimate for the reward function R^:𝒮×𝒜

R^ = argminRt=0T-1(rt-R(st,at))2. (3)

Note that we do not require (2) to have a unique maximizer P^. When multiple maximizers exist, we assume the learner arbitrarily picks one of them as the estimate. We assume the minimizer R^ is always unique. We will discuss the conditions to guarantee the uniqueness of R^ for two learners later.

Step 2. The learner finds the optimal policy π^ that maximizes the expected discounted cumulative reward on the estimated environment M^, i.e.,

π^argmaxπ:𝒮𝒜𝔼P^τ=0γτR^(sτ,π(sτ)), (4)

where s0 is a specified or random initial state. Note that there could be multiple optimal policies, thus we use in (4). Later we will specialize (4) to two specific victim learners: the tabular certainty equivalence learner (TCE) and the certainty-equivalent linear quadratic regulator (LQR).

4 Policy Poisoning

We study policy poisoning attacks on model-based batch RL learners. Our threat model is as follows:

Knowledge of the attacker. The attacker has access to the original training set D0=(st,at,rt0,st)t=0:T-1. The attacker knows the model-based RL learner’s algorithm. Importantly, the attacker knows how the learner estimates the environment, i.e., (2) and (3). In the case (2) has multiple maximizers, we assume the attacker knows exactly the P^ that the learner picks.

Available actions of the attacker. The attacker is allowed to arbitrarily modify the rewards 𝐫0=(r00,,rT-10) in D0 into 𝐫=(r0,,rT-1). As we show later, changing r’s but not s,a,s is sufficient for policy poisoning.

Attacker’s goals. The attacker has a pre-specified target policy π. The attack goals are to (1) force the learner to learn π, (2) minimize attack cost 𝐫-𝐫0α under an α-norm chosen by the attacker.

Given the threat model, we can formulate policy poisoning as a bi-level optimization problem11 1 As we will show, the constraint (7) could lead to an open feasible set (e.g., in (10)) for the attack optimization (5)-(7), on which the minimum of the objective function (5) may not be well-defined. In the case (7) induces an open set, we will consider instead a closed subset of it, and optimize over the subset. How to construct the closed subset will be made clear for concrete learners later.:

min𝐫,R^ 𝐫-𝐫0α (5)
s.t. R^=argminRt=0T-1(rt-R(st,at))2 (7)
{π}=argmaxπ:𝒮𝒜𝔼P^τ=0γτR^(sτ,π(sτ)).

The P^ in (7) does not involve 𝐫 and is precomputed from D0. The singleton set {π} on the LHS of (7) ensures that the target policy is learned uniquely, i.e., there are no other optimal policies tied with π. Next, we instantiate this attack formulation to two representative model-based RL victims.

4.1 Poisoning a Tabular Certainty Equivalence (TCE) Victim

In tabular certainty equivalence (TCE), the environment is a Markov Decision Process (MDP) with finite state and action space. Given original data D0=(st,at,rt0,st)0:T-1, let Ts,a={tst=s,at=a}, the time indexes of all training items for which action a is taken at state s. We assume Ts,a1, s,a, i.e., each state-action pair appears at least once in D0. This condition is needed to ensure that the learner’s estimate P^ and R^ exist. Remember that we require (3) to have a unique solution. For the TCE learner, R^ is unique as long as it exists. Therefore, Ts,a1, s,a is sufficient to guarantee a unique solution to (3). Let the poisoned data be D=(st,at,rt,st)0:T-1. Instantiating model estimation (2), (3) for TCE, we have

P^(ss,a)=1|Ts,a|tTs,a𝟙[st=s], (8)

where 𝟙[] is the indicator function, and

R^(s,a)=1|Ts,a|tTs,art. (9)

The TCE learner uses P^,R^ to form an estimated MDP M^, then solves for the optimal policy π^ with respect to M^ using the Bellman equation (1). The attack goal (7) can be naively characterized by

Q(s,π(s))>Q(s,a),s𝒮,aπ(s). (10)

However, due to the strict inequality, (10) induces an open set in the Q space, on which the minimum of (5) may not be well-defined. Instead, we require a stronger attack goal which leads to a closed subset in the Q space. This is defined as the following ε-robust target Q polytope.

Definition 1.

(ε-robust target Q polytope) The set of ε-robust Q functions induced by a target policy π is the polytope

𝒬ε(π)={Q:Q(s,π(s))Q(s,a)+ε,s𝒮,aπ(s)} (11)

for a fixed ε>0.

The margin parameter ε ensures that π is the unique optimal policy for any Q in the polytope. We now have a solvable attack problem, where the attacker wants to force the victim’s Q function into the ε-robust target Q polytope 𝒬ε(π):

min𝐫T,R^,Q|𝒮|×|𝒜| 𝐫-𝐫0α (12)
s.t. R^(s,a)=1|Ts,a|tTs,art (15)
Q(s,a)=R^(s,a)+γsP^(s|s,a)Q(s,π(s)),s,a
Q(s,π(s))Q(s,a)+ε,s𝒮,aπ(s).

The constraint (15) enforces Bellman optimality on the value function Q, in which maxa𝒜Q(s,a) is replaced by Q(s,π(s)), since the target policy is guaranteed to be optimal by (15). Note that problem (12)-(15) is a convex program with linear constraints given that α1, thus could be solved to global optimality. However, we point out that (12)-(15) is a more stringent formulation than (5)-(7) due to the additional margin parameter ε we introduced. The feasible set of (12)-(15) is a subset of (5)-(7). Therefore, the optimal solution to (12)-(15) could in general be a sub-optimal solution to (5)-(7) with potentially larger objective value. We now study a few theoretical properties of policy poisoning on TCE. All proofs are in the appendix. First of all, the attack is always feasible.

Proposition 1.

The attack problem (12)-(15) is always feasible for any target policy π.

Proposition 1 states that for any target policy π, there exists a perturbation on the rewards that teaches the learner that policy. Therefore, the attacker changing r’s but not s,a,s is already sufficient for policy poisoning.

We next bound the attack cost. Let the MDP estimated on the clean data be M^0=(𝒮,𝒜,P^,R^0,γ). Let Q0 be the Q function that satisfies the Bellman optimality equation on M^0. Define Δ(ε)=maxs𝒮[maxaπ(s)Q0(s,a)-Q0(s,π(s))+ε]+, where []+ takes the maximum over 0. Intuitively, Δ(ε) measures how suboptimal the target policy π is compared to the clean optimal policy π0 learned on M^0, up to a margin parameter ε.

Theorem 2.

Assume α1 in (12). Let r*, R^* and Q* be an optimal solution to (12)-(15), then

12(1-γ)Δ(ε)(mins,a|Ts,a|)1α𝐫*-𝐫0α12(1+γ)Δ(ε)T1α. (16)
Corollary 3.

If α=1, then the optimal attack cost is O(Δ(ε)T). If α=2, then the optimal attack cost is O(Δ(ε)T). If α=, then the optimal attack cost is O(Δ(ε)).

Note that both the upper and lower bounds on the attack cost are linear with respect to Δ(ε), which can be estimated directly from the clean training set D0. This allows the attacker to easily estimate its attack cost before actually solving the attack problem.

4.2 Poisoning a Linear Quadratic Regulator (LQR) Victim

As the second example, we study an LQR victim that performs system identification from a batch training set Dean et al. (2017). Let the linear dynamical system be

st+1=Ast+Bat+wt,t0, (17)

where An×n,Bn×m, stn is the state, atm is the control signal, and wt𝒩(𝟎,σ2I) is a Gaussian noise. When the agent takes action a at state s, it suffers a quadratic loss of the general form

L(s,a)=12sQs+qs+aRa+c (18)

for some Q0, R0, qn and c. Here we have redefined the symbols Q and R in order to conform with the notation convention in LQR: now we use Q for the quadratic loss matrix associated with state, not the action-value function; we use R for the quadratic loss matrix associated with action, not the reward function. The previous reward function R(s,a) in general MDP (section 3) is now equivalent to the negative loss -L(s,a). This form of loss captures various LQR control problems. Note that the above linear dynamical system can be viewed as an MDP with transition kernel P(ss,a)=𝒩(As+Ba,σ2I) and reward function -L(s,a). The environment is thus characterized by matrices A, B (for transition kernel) and Q, R, q, c (for reward function), which are all unknown to the learner.

We assume the clean training data D0=(st,at,rt0,st+1)0:T-1 was generated by running the linear system for multiple episodes following some random policy Dean et al. (2017). Let the poisoned data be D=(st,at,rt,st+1)0:T-1. Instantiating model estimation (2), (3), the learner performs system identification on the poisoned data:

(A^,B^)argmin(A,B)12t=0T-1Ast+Bat-st+122 (19)
(Q^,R^,q^,c^)=argmin(Q0,RεI,q,c)12t=0T-112stQst+qst+atRat+c+rt22. (20)

Note that in (20), the learner uses a stronger constraint RεI than the original constraint R0, which guarantees that the minimizer can be achieved. The conditions to further guarantee (20) having a unique solution depend on the property of certain matrices formed by the clean training set D0, which we defer to appendix D.

The learner then computes the optimal control policy with respect to A^, B^, Q^, R^, q^ and c^. We assume the learner solves a discounted version of LQR control

maxπ:𝒮𝒜 -𝔼[τ=0γτ(12sτQ^sτ+q^sτ+π(sτ)R^π(sτ)+c^)] (21)
s.t. sτ+1=A^sτ+B^π(sτ)+wτ,τ0. (22)

where the expectation is over wτ. It is known that the control problem has a closed-form solution a^τ=π^(sτ)=Ksτ+k, where

K=-γ(R^+γB^XB^)-1B^XA^,k=-γ(R^+γB^XB^)-1B^x. (23)

Here X0 is the unique solution of the Algebraic Riccati Equation,

X=γA^XA^-γ2A^XB^(R^+γB^XB^)-1B^XA^+Q^, (24)

and x is a vector that satisfies

x=q^+γ(A^+B^K)x. (25)

The attacker aims to force the victim into taking target action π(s),sn. Note that in LQR, the attacker cannot arbitrarily choose π, as the optimal control policy K and k enforce a linear structural constraint between π(s) and s. One can easily see that the target action must obey π(s)=Ks+k for some (K,k) in order to achieve successful attack. Therefore we must assume instead that the attacker has a target policy specified by a pair (K,k). However, an arbitrarily linear policy may still not be feasible. A target policy (K,k) is feasible if and only if it is produced by solving some Riccati equation, namely, it must lie in the following set:

{(K,k):Q0,RεI,qn,c,such that(23),(24), and(25) are satisfied}. (26)

Therefore, to guarantee feasibility, we assume the attacker always picks the target policy (K,k) by solving an LQR problem with some attacker-defined loss function. We can now pose the policy poisoning attack problem:

min𝐫,Q^,R^,q^,c^,X,x 𝐫-𝐫0α (27)
s.t. -γ(R^+γB^XB^)-1B^XA^=K (33)
-γ(R^+γB^XB^)-1B^x=k
X=γA^XA^-γ2A^XB^(R^+γB^XB^)-1B^XA^+Q^
x=q^+γ(A^+B^K)x
(Q^,R^,q^,c^)=argmin(Q0,RεI,q,c)t=0T-112stQst+qst+atRat+c+rt22
X0.

Note that the estimated transition matrices A^, B^ are not optimization variables because the attacker can only modify the rewards, which will not change the learner’s estimate on A^ and B^. The attack optimization (27)-(33) is hard to solve due to the constraint (33) itself being a semi-definite program (SDP). To overcome the difficulty, we pull all the positive semi-definite constraints out of the lower-level optimization. This leads to a more stringent surrogate attack optimization (see appendix C). Solving the surrogate attack problem, whose feasible region is a subset of the original problem, in general gives a suboptimal solution to (27)-(33). But it comes with one advantage: convexity.

5 Experiments

Throughout the experiments, we use CVXPY Diamond and Boyd (2016) to implement the optimization. All code can be found in https://github.com/myzwisc/PPRL_NeurIPS19.

5.1 Policy Poisoning Attack on TCE Victim

Experiment 1. We consider a simple MDP with two states A,B and two actions: stay in the same state or move to the other state, shown in figure 0(a). The discounting factor is γ=0.9. The MDP’s Q values are shown in table 0(b). Note that the optimal policy will always pick action stay.

{tikzpicture}\tikzstyle

n = [very thick,circle,inner sep=0mm,minimum width=6mm] \tikzstylea = [thick,>=latex,->] \node[n,C1,draw=C1] (2) at (-1.2,0) A; \node[n,C2,draw=C2] (1) at (1.2,0) B; a](2)edgeloop below] node +1(2) (1) edge [loop below] node +1(1) (2) edge [bend right=20] node[below] 0(1) (1) edge [bend right=20] node[above] 0(2);

(a) A toy MDP with two states.
stay move
A 10 9
B 10 9
(b) Original Q values.
stay move
A 9 10
B 9 10
(c) Poisoned Q values.
(d) Trajectory for the Q values of state A during value iteration.
Figure 1: Poisoning TCE in a two-state MDP.

The clean training data D0 reflects this underlying MDP, and consists of 4 tuples:

(A,𝑠𝑡𝑎𝑦,1,A)(A,𝑚𝑜𝑣𝑒,0,B)(B,𝑠𝑡𝑎𝑦,1,B)(B,𝑚𝑜𝑣𝑒,0,A)

Let the attacker’s target policy be π(s)=move, for any state s. The attacker sets ε=1 and uses α=2, i.e. 𝐫-𝐫02 as the attack cost. Solving the policy poisoning attack optimization problem (12)-(15) produces the poisoned data:

(A,𝑠𝑡𝑎𝑦,0,A)(A,𝑚𝑜𝑣𝑒,1,B)(B,𝑠𝑡𝑎𝑦,0,B)(B,𝑚𝑜𝑣𝑒,1,A)

with attack cost 𝐫-𝐫02=2. The resulting poisoned Q values are shown in table 0(c). To verify this attack, we run TCE learner on both clean data and poisoned data. Specifically, we estimate the transition kernel and the reward function as in (8) and (9) on each data set, and then run value iteration until the Q values converge. In Figure 0(d), we show the trajectory of Q values for state A, where the x and y axes denote Q(A,stay) and Q(A,move) respectively. All trajectories start at (0,0). The dots on the trajectory correspond to each step of value iteration, while the star denotes the converged Q values. The diagonal dashed line is the (zero margin) policy boundary, while the gray region is the ε-robust target Q polytope with an offset ε=1 to the policy boundary. The trajectory of clean data converges to a point below the policy boundary, where the action stay is optimal. With the poisoned data, the trajectory of Q values converge to a point exactly on the boundary of the ε-robust target Q polytope, where the action move becomes optimal. This validates our attack.

We also compare our attack with reward shaping Ng et al. (1999). We let the potential function ϕ(s) be the optimal value function V(s) for all s to shape the clean dataset. The dataset after shaping is

(A,𝑠𝑡𝑎𝑦,0,A)(A,𝑚𝑜𝑣𝑒,-1,B)(B,𝑠𝑡𝑎𝑦,0,B)(B,𝑚𝑜𝑣𝑒,-1,A)

In Figure 0(d), we show the trajectory of Q values after reward shaping. Note that same as on clean dataset, the trajectory after shaping converges to a point also below the policy boundary. This means reward shaping can not make the learner learn a different policy from the original optimal policy. Also note that after reward shaping, value iteration converges much faster (in only one iteration), which matches the benefits of reward shaping shown in Ng et al. (1999). More importantly, this illustrates the difference between our attack and reward shaping.

{tikzpicture}\draw

[step=2cm,black,thin,opacity=0.5] (0,0) grid (14,12); [black] (0,2) rectangle (2,12); [black] (4,0) rectangle (14,2); [black] (4,4) rectangle (10,8); [black] (4,10) rectangle (8,12); [black] (12,0) rectangle (14,12); [black] (10,10) rectangle (12,12); [gray,opacity=0.8] (2,4) rectangle (4,8); [blue,opacity=0.5] (2,10) rectangle (4,12); \nodeat (1,1) S; \nodeat (3,11) G;

\draw

(1,-0.5) – (1,-2.5) – (3,-2.5) – (3,-0.5) – (1,-0.5); [blue,opacity=0.5] (1,-2.5) rectangle (3,-0.5); \nodeat (2,-1.5) G; \node[text width=1cm] at (3.6,-1.5) :2;

\draw

(6,-0.5) – (6,-2.5) – (8,-2.5) – (8,-0.5) – (8,-0.5); [gray,opacity=0.8] (6,-2.5) rectangle (8,-0.5); \node[text width=1cm] at (8.8,-1.5) :-10;

\draw

(10.6,-0.5) – (10.6,-2.5) – (12.6,-2.5) – (12.6,-0.5) – (10.6,-0.5); [white] (10.6,-2.5) rectangle (12.6,-0.5); \node[text width=1cm] at (13.2,-1.5) :-1;

\draw

[->=stealth,line width=0.5mm, opacity=0.5, blue] (1.25 , 0.75) –(3, 0.75) (3,0.75) – (3,3) (3,3) – (11,3) (11,3) – (11,9) (11,9) – (3,9) (3,9) – (3,10); \draw[->=stealth,line width=0.5mm,opacity=0.5, red] (1.25 , 1.25) –(2.75, 1.25) (2.75,1.25) – (2.75,10);

\draw

[orange,->,>=stealth] (0.5,1) – (-0.5,1); \draw[orange,->,>=stealth] (1,0.5) – (1,-0.5); \draw[orange,->,>=stealth] (1.5,1.2) – (2.5,1.2); \draw[orange,->,>=stealth] (2.5,0.8) – (1.5,0.8); \draw[orange,->,>=stealth] (1,1.5) – (1,2.5);

\draw

[orange,->,>=stealth] (3,0.5) – (3,-0.5); \draw[orange,->,>=stealth] (3.2,2.5) – (3.2,1.5); \draw[orange,->,>=stealth] (3.5,1) – (4.5,1); \draw[orange,->,>=stealth] (2.8,1.5) – (2.8,2.5);

\draw

[orange,->,>=stealth] (3.2,4.5) – (3.2,3.5); \draw[orange,->,>=stealth] (3.5,3.2) – (4.5,3.2); \draw[orange,->,>=stealth] (4.5,2.8) – (3.5,2.8); \draw[orange,->,>=stealth] (2.8,3.5) – (2.8,4.5); \draw[orange,->,>=stealth] (2.5,3) – (1.5,3);

\draw

[orange,->,>=stealth] (5.5,3.2) – (6.5,3.2); \draw[orange,->,>=stealth] (6.5,2.8) – (5.5,2.8); \draw[orange,->,>=stealth] (5,3.5) – (5,4.5); \draw[orange,->,>=stealth] (5,2.5) – (5,1.5);

\draw

[orange,->,>=stealth] (7.5,3.2) – (8.5,3.2); \draw[orange,->,>=stealth] (8.5,2.8) – (7.5,2.8); \draw[orange,->,>=stealth] (7,3.5) – (7,4.5); \draw[orange,->,>=stealth] (7,2.5) – (7,1.5);

\draw

[orange,->,>=stealth] (9.5,3.2) – (10.5,3.2); \draw[orange,->,>=stealth] (10.5,2.8) – (9.5,2.8); \draw[orange,->,>=stealth] (9,3.5) – (9,4.5); \draw[orange,->,>=stealth] (9,2.5) – (9,1.5);

\draw

[orange,->,>=stealth] (11.2,4.5) – (11.2,3.5); \draw[orange,->,>=stealth] (11.5,3) – (12.5,3); \draw[orange,->,>=stealth] (10.8,3.5) – (10.8,4.5); \draw[orange,->,>=stealth] (11,2.5) – (11,1.5);

\draw

[orange,->,>=stealth] (11.2,6.5) – (11.2,5.5); \draw[orange,->,>=stealth] (11.5,5) – (12.5,5); \draw[orange,->,>=stealth] (10.8,5.5) – (10.8,6.5); \draw[orange,->,>=stealth] (10.5,5) – (9.5,5);

\draw

[orange,->,>=stealth] (11.2,8.5) – (11.2,7.5); \draw[orange,->,>=stealth] (11.5,7) – (12.5,7); \draw[orange,->,>=stealth] (10.8,7.5) – (10.8,8.5); \draw[orange,->,>=stealth] (10.5,7) – (9.5,7);

\draw

[orange,->,>=stealth] (11.5,9) – (12.5,9); \draw[orange,->,>=stealth] (11,9.5) – (11,10.5); \draw[orange,->,>=stealth] (9.5,9.2) – (10.5,9.2); \draw[orange,->,>=stealth] (10.5,8.8) – (9.5,8.8);

\draw

[orange,->,>=stealth] (8.8,9.5) – (8.8,10.5); \draw[orange,->,>=stealth] (9.2,10.5) – (9.2,9.5); \draw[orange,->,>=stealth] (7.5,9.2) – (8.5,9.2); \draw[orange,->,>=stealth] (8.5,8.8) – (7.5,8.8); \draw[orange,->,>=stealth] (9,8.5) – (9,7.5);

\draw

[orange,->,>=stealth] (7,9.5) – (7,10.5); \draw[orange,->,>=stealth] (5.5,9.2) – (6.5,9.2); \draw[orange,->,>=stealth] (6.5,8.8) – (5.5,8.8); \draw[orange,->,>=stealth] (7,8.5) – (7,7.5);

\draw

[orange,->,>=stealth] (5,9.5) – (5,10.5); \draw[orange,->,>=stealth] (3.5,9.2) – (4.5,9.2); \draw[orange,->,>=stealth] (4.5,8.8) – (3.5,8.8); \draw[orange,->,>=stealth] (5,8.5) – (5,7.5);

\draw

[orange,->,>=stealth] (2.8,9.5) – (2.8,10.5); \draw[orange,->,>=stealth] (3.2,10.5) – (3.2,9.5); \draw[orange,->,>=stealth] (2.5,9) – (1.5,9); \draw[orange,->,>=stealth] (2.8,7.5) – (2.8,8.5); \draw[orange,->,>=stealth] (3.2,8.5) – (3.2,7.5);

\draw

[orange,->,>=stealth] (2.8,5.5) – (2.8,6.5); \draw[orange,->,>=stealth] (3.2,6.5) – (3.2,5.5); \draw[orange,->,>=stealth] (2.5,7) – (1.5,7); \draw[orange,->,>=stealth] (3.5,7) – (4.5,7);

\draw

[orange,->,>=stealth] (2.5,5) – (1.5,5); \draw[orange,->,>=stealth] (3.5,5) – (4.5,5);

\draw

[orange,->,>=stealth] (2.5,11) – (1.5,11); \draw[orange,->,>=stealth] (3.5,11) – (4.5,11); \draw[orange,->,>=stealth] (3,11.5) – (3,12.5);

\draw

[orange,->,>=stealth] (8.5,11) – (7.5,11); \draw[orange,->,>=stealth] (9.5,11) – (10.5,11); \draw[orange,->,>=stealth] (9,11.5) – (9,12.5); \nodeat (4,3.2) -0.572; \nodeat (2.5,4) +0.572; \nodeat (6,3.2) -0.515; \nodeat (8,3.2) -0.464; \nodeat (10,3.2) -0.417; \nodeat (10.6,4) -0.376; \nodeat (10.6,6) -0.338; \nodeat (10.6,8) -0.304; \nodeat (10,8.8) -0.274; \nodeat (8,8.8) -0.246; \nodeat (6,8.8) -0.221; \nodeat (4,8.8) -0.200; \nodeat (2.5,10) +0.238; \nodeat (3,12) +2.139; \nodeat (2.5,8) +0.464; \nodeat (2.5,6) +0.515;

(a) Grid world with a single terminal state G.
{tikzpicture}\draw

(1,-0.5) – (1,-2) – (2.5,-2) – (2.5,-0.5) – (1,-0.5); [blue,opacity=0.5] (1,-2) rectangle (2.5,-0.5); \nodeat (1.75,-1.25) G1; \node[text width=0.5cm] at (2.9,-1.25) :1;

\draw

(5.2,-0.5) – (5.2,-2) – (6.7,-2) – (6.7,-0.5) – (5.2,-0.5); [blue,opacity=0.5] (5.2,-2) rectangle (6.7,-0.5); \nodeat (5.95,-1.25) G2; \node[text width=0.7cm] at (7.15,-1.25) :2;

\draw

(9,-0.5) – (9,-2) – (10.5,-2) – (10.5,-0.5) – (9,-0.5); [white] (9,-2) rectangle (10.5,-0.5); \node[text width=0.7cm] at (11,-1.25) :-1;

\draw

[step=2cm,black,thin,opacity=0.5] (0,0) grid (12,12); [black] (2,8) rectangle (4,12); [black] (8,8) rectangle (10,12); \nodeat (1,11) G1; \nodeat (11,11) G2; [blue,opacity=0.5] (0,10) rectangle (2,12); [blue,opacity=0.5] (10,10) rectangle (12,12); \nodeat (3,7) S;

\tikzset

>=latex \draw[->=stealth,line width=0.5mm, opacity=0.5,blue] (2.75 , 7) –(1, 7) (1 , 7) –(1, 10); \draw[->=stealth,line width=0.5mm, opacity=0.5,red] (3.25, 7) –(11,7) (11, 7) – (11, 10);

\draw

[orange,->,>=stealth] (0.5,1) – (-0.5,1); \draw[orange,->,>=stealth] (1,0.5) – (1,-0.5); \draw[orange,->,>=stealth] (1.5,1.2) – (2.5,1.2); \draw[orange,->,>=stealth] (2.5,0.8) – (1.5,0.8); \draw[orange,->,>=stealth] (0.8,1.5) – (0.8,2.5); \draw[orange,->,>=stealth] (1.2,2.5) – (1.2,1.5);

\draw

[orange,->,>=stealth] (0.5,3) – (-0.5,3); \draw[orange,->,>=stealth] (1.5,3.2) – (2.5,3.2); \draw[orange,->,>=stealth] (2.5,2.8) – (1.5,2.8); \draw[orange,->,>=stealth] (0.8,3.5) – (0.8,4.5); \draw[orange,->,>=stealth] (1.2,4.5) – (1.2,3.5);

\draw

[orange,->,>=stealth] (0.5,5) – (-0.5,5); \draw[orange,->,>=stealth] (1.5,5.2) – (2.5,5.2); \draw[orange,->,>=stealth] (2.5,4.8) – (1.5,4.8); \draw[orange,->,>=stealth] (0.8,5.5) – (0.8,6.5); \draw[orange,->,>=stealth] (1.2,6.5) – (1.2,5.5);

\draw

[orange,->,>=stealth] (0.5,7) – (-0.5,7); \draw[orange,->,>=stealth] (1.5,7.2) – (2.5,7.2); \draw[orange,->,>=stealth] (2.5,6.8) – (1.5,6.8); \draw[orange,->,>=stealth] (0.8,7.5) – (0.8,8.5); \draw[orange,->,>=stealth] (1.2,8.5) – (1.2,7.5);

\draw

[orange,->,>=stealth] (0.5,9) – (-0.5,9); \draw[orange,->,>=stealth] (1.5,9) – (2.5,9); \draw[orange,->,>=stealth] (0.8,9.5) – (0.8,10.5); \draw[orange,->,>=stealth] (1.2,10.5) – (1.2,9.5);

\draw

[orange,->,>=stealth] (0.5,11) – (-0.5,11); \draw[orange,->,>=stealth] (1.5,11) – (2.5,11); \draw[orange,->,>=stealth] (1,11.5) – (1,12.5);

\draw

[orange,->,>=stealth] (3,0.5) – (3,-0.5); \draw[orange,->,>=stealth] (3.5,1.2) – (4.5,1.2); \draw[orange,->,>=stealth] (4.5,0.8) – (3.5,0.8); \draw[orange,->,>=stealth] (2.8,1.5) – (2.8,2.5); \draw[orange,->,>=stealth] (3.2,2.5) – (3.2,1.5);

\draw

[orange,->,>=stealth] (5,0.5) – (5,-0.5); \draw[orange,->,>=stealth] (5.5,1.2) – (6.5,1.2); \draw[orange,->,>=stealth] (6.5,0.8) – (5.5,0.8); \draw[orange,->,>=stealth] (4.8,1.5) – (4.8,2.5); \draw[orange,->,>=stealth] (5.2,2.5) – (5.2,1.5);

\draw

[orange,->,>=stealth] (7,0.5) – (7,-0.5); \draw[orange,->,>=stealth] (7.5,1.2) – (8.5,1.2); \draw[orange,->,>=stealth] (8.5,0.8) – (7.5,0.8); \draw[orange,->,>=stealth] (6.8,1.5) – (6.8,2.5); \draw[orange,->,>=stealth] (7.2,2.5) – (7.2,1.5);

\draw

[orange,->,>=stealth] (9,0.5) – (9,-0.5); \draw[orange,->,>=stealth] (9.5,1.2) – (10.5,1.2); \draw[orange,->,>=stealth] (10.5,0.8) – (9.5,0.8); \draw[orange,->,>=stealth] (8.8,1.5) – (8.8,2.5); \draw[orange,->,>=stealth] (9.2,2.5) – (9.2,1.5);

\draw

[orange,->,>=stealth] (11,0.5) – (11,-0.5); \draw[orange,->,>=stealth] (11.5,1) – (12.5,1); \draw[orange,->,>=stealth] (10.8,1.5) – (10.8,2.5); \draw[orange,->,>=stealth] (11.2,2.5) – (11.2,1.5);

\draw

[orange,->,>=stealth] (3.5,3.2) – (4.5,3.2); \draw[orange,->,>=stealth] (4.5,2.8) – (3.5,2.8); \draw[orange,->,>=stealth] (2.8,3.5) – (2.8,4.5); \draw[orange,->,>=stealth] (3.2,4.5) – (3.2,3.5);

\draw

[orange,->,>=stealth] (3.5,5.2) – (4.5,5.2); \draw[orange,->,>=stealth] (4.5,4.8) – (3.5,4.8); \draw[orange,->,>=stealth] (2.8,5.5) – (2.8,6.5); \draw[orange,->,>=stealth] (3.2,6.5) – (3.2,5.5);

\draw

[orange,->,>=stealth] (3.5,7.2) – (4.5,7.2); \draw[orange,->,>=stealth] (4.5,6.8) – (3.5,6.8); \draw[orange,->,>=stealth] (3,7.5) – (3,8.5);

\draw

[orange,->,>=stealth] (5.5,3.2) – (6.5,3.2); \draw[orange,->,>=stealth] (6.5,2.8) – (5.5,2.8); \draw[orange,->,>=stealth] (4.8,3.5) – (4.8,4.5); \draw[orange,->,>=stealth] (5.2,4.5) – (5.2,3.5);

\draw

[orange,->,>=stealth] (5.5,5.2) – (6.5,5.2); \draw[orange,->,>=stealth] (6.5,4.8) – (5.5,4.8); \draw[orange,->,>=stealth] (4.8,5.5) – (4.8,6.5); \draw[orange,->,>=stealth] (5.2,6.5) – (5.2,5.5);

\draw

[orange,->,>=stealth] (5.5,7.2) – (6.5,7.2); \draw[orange,->,>=stealth] (6.5,6.8) – (5.5,6.8); \draw[orange,->,>=stealth] (4.8,7.5) – (4.8,8.5); \draw[orange,->,>=stealth] (5.2,8.5) – (5.2,7.5);

\draw

[orange,->,>=stealth] (5.5,9.2) – (6.5,9.2); \draw[orange,->,>=stealth] (6.5,8.8) – (5.5,8.8); \draw[orange,->,>=stealth] (4.8,9.5) – (4.8,10.5); \draw[orange,->,>=stealth] (5.2,10.5) – (5.2,9.5); \draw[orange,->,>=stealth] (4.5,9) – (3.5,9);

\draw

[orange,->,>=stealth] (5.5,11.2) – (6.5,11.2); \draw[orange,->,>=stealth] (6.5,10.8) – (5.5,10.8); \draw[orange,->,>=stealth] (5,11.5) – (5,12.5); \draw[orange,->,>=stealth] (4.5,11) – (3.5,11);

\draw

[orange,->,>=stealth] (7.5,3.2) – (8.5,3.2); \draw[orange,->,>=stealth] (8.5,2.8) – (7.5,2.8); \draw[orange,->,>=stealth] (6.8,3.5) – (6.8,4.5); \draw[orange,->,>=stealth] (7.2,4.5) – (7.2,3.5);

\draw

[orange,->,>=stealth] (7.5,5.2) – (8.5,5.2); \draw[orange,->,>=stealth] (8.5,4.8) – (7.5,4.8); \draw[orange,->,>=stealth] (6.8,5.5) – (6.8,6.5); \draw[orange,->,>=stealth] (7.2,6.5) – (7.2,5.5);

\draw

[orange,->,>=stealth] (7.5,7.2) – (8.5,7.2); \draw[orange,->,>=stealth] (8.5,6.8) – (7.5,6.8); \draw[orange,->,>=stealth] (6.8,7.5) – (6.8,8.5); \draw[orange,->,>=stealth] (7.2,8.5) – (7.2,7.5);

\draw

[orange,->,>=stealth] (7.5,9) – (8.5,9); \draw[orange,->,>=stealth] (6.8,9.5) – (6.8,10.5); \draw[orange,->,>=stealth] (7.2,10.5) – (7.2,9.5);

\draw

[orange,->,>=stealth] (7.5,11) – (8.5,11); \draw[orange,->,>=stealth] (7,11.5) – (7,12.5);

\draw

[orange,->,>=stealth] (9.5,3.2) – (10.5,3.2); \draw[orange,->,>=stealth] (10.5,2.8) – (9.5,2.8); \draw[orange,->,>=stealth] (8.8,3.5) – (8.8,4.5); \draw[orange,->,>=stealth] (9.2,4.5) – (9.2,3.5);

\draw

[orange,->,>=stealth] (9.5,5.2) – (10.5,5.2); \draw[orange,->,>=stealth] (10.5,4.8) – (9.5,4.8); \draw[orange,->,>=stealth] (8.8,5.5) – (8.8,6.5); \draw[orange,->,>=stealth] (9.2,6.5) – (9.2,5.5);

\draw

[orange,->,>=stealth] (9.5,7.2) – (10.5,7.2); \draw[orange,->,>=stealth] (10.5,6.8) – (9.5,6.8); \draw[orange,->,>=stealth] (9,7.5) – (9,8.5);

\draw

[orange,->,>=stealth] (11.5,3) – (12.5,3); \draw[orange,->,>=stealth] (10.8,3.5) – (10.8,4.5); \draw[orange,->,>=stealth] (11.2,4.5) – (11.2,3.5);

\draw

[orange,->,>=stealth] (11.5,5) – (12.5,5); \draw[orange,->,>=stealth] (10.8,5.5) – (10.8,6.5); \draw[orange,->,>=stealth] (11.2,6.5) – (11.2,5.5);

\draw

[orange,->,>=stealth] (11.5,7) – (12.5,7); \draw[orange,->,>=stealth] (10.8,7.5) – (10.8,8.5); \draw[orange,->,>=stealth] (11.2,8.5) – (11.2,7.5);

\draw

[orange,->,>=stealth] (11.5,9) – (12.5,9); \draw[orange,->,>=stealth] (10.5,9) – (9.5,9); \draw[orange,->,>=stealth] (10.8,9.5) – (10.8,10.5); \draw[orange,->,>=stealth] (11.2,10.5) – (11.2,9.5);

\draw

[orange,->,>=stealth] (11.5,11) – (12.5,11); \draw[orange,->,>=stealth] (10.5,11) – (9.5,11); \draw[orange,->,>=stealth] (11,11.5) – (11,12.5);

\node

at (2.6,2) +0.012; \nodeat (4,1.2) -0.012;

\node

at (4.6,2) +0.007; \nodeat (6,1.2) -0.018;

\node

at (2.6,4) +0.020; \nodeat (4,3.2) -0.009;

\node

at (0.6,6) -0.006;

\node

at (6.6,2) +0.002; \nodeat (8,1.2) -0.018;

\node

at (4.6,4) +0.012; \nodeat (6,3.2) -0.014;

\node

at (2.6,6) +0.036; \nodeat (4,5.2) -0.010; \nodeat (2,4.8) -0.007;

\node

at (0.6,8) -0.044;

\node

at (8.6,2) -0.003; \nodeat (10,1.2) -0.013;

\node

at (6.6,4) +0.004; \nodeat (8,3.2) -0.015;

\node

at (4.6,6) +0.015; \nodeat (6,5.2) -0.013;

\node

at (2,6.8) -0.043; \nodeat (4,7.2) +0.075;

\node

at (0.5,10) -0.040;

\node

at (10.6,2) -0.012;

\node

at (8.6,4) -0.006; \nodeat (10,3.2) -0.011;

\node

at (6.6,6) +0.008; \nodeat (8,5.2) -0.016;

\node

at (6,7.2) +0.088;

\node

at (0,11) -0.115; \nodeat (2,11) -0.115; \nodeat (1,12) -0.015; \nodeat (1.7,10) -0.115;

\node

at (10.6,4) -0.020;

\node

at (8.6,6) -0.005; \nodeat (10,5.2) -0.015;

\node

at (8,7.2) +0.080;

\node

at (5.4,8) +0.009; \nodeat (6,9.2) -0.004;

\node

at (10.6,6) -0.032;

\node

at (10,7.2) +0.068;

\node

at (7.4,8) -0.008;

\node

at (5.4,10) +0.005; \nodeat (6,11.2) -0.005;

\node

at (10.6,8) +0.032;

\node

at (7.4,10) -0.004;

\node

at (10.6,10) +0.029;

\node

at (11,12) +0.262;

(b) Grid world with two terminal states G1 and G2.
Figure 2: Poisoning TCE in grid-world tasks.

Experiment 2. As another example, we consider the grid world tasks in Cakmak and Lopes (2012). In particular, we focus on two tasks shown in figure 1(a) and 1(b). In figure 1(a), the agent starts from S and aims to arrive at the terminal cell G. The black regions are walls, thus the agent can only choose to go through the white or gray regions. The agent can take four actions in every state: go left, right, up or down, and stays if the action takes it into the wall. Reaching a gray, white, or the terminal state results in rewards -10, -1, 2, respectively. After the agent arrives at the terminal state G, it will stay there forever and always receive reward 0 regardless of the following actions. The original optimal policy is to follow the blue trajectory. The attacker’s goal is to force the agent to follow the red trajectory. Correspondingly, we set the target actions for those states on the red trajectory as along the trajectory. We set the target actions for the remaining states to be the same as the original optimal policy learned on clean data.

The clean training data contains a single item for every state-action pair. We run the attack with ε=0.1 and α=2. Our attack is successful: with the poisoned data, TCE generates a policy that produces the red trajectory in Figure 1(a), which is the desired behavior. The attack cost is 𝐫-𝐫022.64, which is small compared to 𝐫02=21.61. In Figure 1(a), we show the poisoning on rewards. Each state-action pair is denoted by an orange arrow. The value tagged to each arrow is the modification to that reward, where red value means the reward is increased and blue means decreased. An arrow without any tagged value means the corresponding reward is not changed by attack. Note that rewards along the red trajectory are increased, while those along the blue trajectory are reduced, resulting in the red trajectory being preferred by the agent. Furthermore, rewards closer to the starting state S suffer larger poisoning since they contribute more to the Q values. For the large attack +2.139 happening at terminal state, we provide an explanation in appendix E.

Experiment 3. In Figure 1(b) there are two terminal states G1 and G2 with reward 1 and 2, respectively. The agent starts from S. Although G2 is more profitable, the path is longer and each step has a -1 reward. Therefore, the original optimal policy is the blue trajectory to G1. The attacker’s target policy is to force the agent along the red trajectory to G2. We set the target actions for states as in experiment 2. The clean training data contains a single item for every state-action pair. We run our attack with ε=0.1 and α=2. Again, after the attack, TCE on the poisoned dataset produces the red trajectory in figure 1(b), with attack cost 𝐫-𝐫020.38, compared to 𝐫02=11.09. The reward poisoning follows a similar pattern to experiment 2.

5.2 Policy Poisoning Attack on LQR Victim

(a) Clean and poisoned vehicle trajectory.
(b) Clean and poisoned rewards.
Figure 3: Poisoning a vehicle running LQR in 4D state space.

Experiment 4. We now demonstrate our attack on LQR. We consider a linear dynamical system that approximately models a vehicle. The state of the vehicle consists of its 2D position and 2D velocity: st=(xt,yt,vtx,vty)4. The control signal at time t is the force at2 which will be applied on the vehicle for h seconds. We assume there is a friction parameter η such that the friction force is -ηvt. Let m be the mass of the vehicle. Given small enough h, the transition matrices can be approximated by (17) where

A=[10h0010h001-hη/m00001-hη/m],B=[0000h/m00h/m]. (34)

In this example, we let h=0.1, m=1, η=0.5, and wt𝒩(0,σ2I) with σ=0.01. The vehicle starts from initial position (1,1) with velocity (1,-0.5), i.e., s0=(1,1,1,-0.5). The true loss function is L(s,a)=12sQs+aRa with Q=I and R=0.1I (i.e., Q=I,R=0.1I,q=0,c=0 in (18)). Throughout the experiment, we let γ=0.9 for solving the optimal control policy in (21). With the true dynamics and loss function, the computed optimal control policy is

K*=[-1.320-2.3900-1.320-2.39],k*=[00], (35)

which will drive the vehicle to the origin.

The batch LQR learner estimates the dynamics and the loss function from a batch training data. To produce the training data, we let the vehicle start from state s0 and simulate its trajectory with a random control policy. Specifically, in each time step, we uniformly sample a control signal at in a unit sphere. The vehicle then takes action at to transit from current state st to the next state st+1, and receives a reward rt=-L(st,at). This gives us one training item (st,at,rt,st+1). We simulate a total of 400 time steps to obtain a batch data that contains 400 items, on which the learner estimates the dynamics and the loss function. With the learner’s estimate, the computed clean optimal policy is

K^0=[-1.311.00e-2-2.412.03e-3-1.97e-2-1.35-1.14e-2-2.42],k^0=[-4.88e-54.95e-6]. (36)

The clean optimal policy differs slightly from the true optimal policy due to the inaccuracy of the learner’s estimate. The attacker has a target policy (K,k) that can drive the vehicle close to its target destination (x,y)=(0,1) with terminal velocity (0,0), which can be represented as a target state s=(0,1,0,0). To ensure feasibility, we assume that the attacker starts with the loss function 12(s-s)Q(s-s)+aRa where Q=I,R=0.1I. Due to the offset this corresponds to setting Q=I,R=0.1I,q=-s,c=12sQs=0.5 in (18). The attacker then solves the Riccati equation with its own loss function and the learner’s estimates A^ and B^ to arrive at the target policy

K=[-1.319.99e-3-2.412.02e-3-1.97e-2-1.35-1.14e-2-2.42],k=[-0.011.35]. (37)

We run our attack (27)-(33) with α=2 and ε=0.01 in (33). Figure 3 shows the result of our attack. In Figure 2(a), we plot the trajectory of the vehicle with policy learned on clean data and poisoned data respectively. Our attack successfully forces LQR into a policy that drives the vehicle close to the target destination. The wiggle on the trajectory is due to the noise wt of the dynamical system. On the poisoned data, the LQR victim learns the policy

K^=[-1.319.99e-3-2.412.02e-3-1.97e-2-1.35-1.14e-2-2.42],k^=[-0.011.35], (38)

which matches exactly the target policy K, k. In Figure 2(b), we show the poisoning on rewards. Our attack leads to very small modification on each reward, thus the attack is efficient. The total attack cost over all 400 items is only 𝐫-𝐫02=0.73, which is tiny small compared to 𝐫02=112.94. The results here demonstrate that our attack can dramatically change the behavior of LQR by only slightly modifying the rewards in the dataset.

Finally, for both attacks on TCE and LQR, we note that by setting the attack cost norm α=1 in (5), the attacker is able to obtain a sparse attack, meaning that only a small fraction of the batch data needs to be poisoned. Such sparse attacks have profound implications in adversarial machine learning as they can be easier to carry out and harder to detect. We show detailed results in appendix E.

6 Conclusion

We presented a policy poisoning framework against batch reinforcement learning and control. We showed the attack problem can be formulated as convex optimization. We provided theoretical analysis on attack feasibility and cost. Experiments show the attack can force the learner into an attacker-chosen target policy while incurring only a small attack cost.

Acknowledgments. This work is supported in part by NSF 1545481, 1561512, 1623605, 1704117, 1836978 and the MADLab AF Center of Excellence FA9550-18-1-0166.

References

  • Asmuth et al. (2008) John Asmuth, Michael L Littman, and Robert Zinkov. Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd national conference on Artificial intelligence-Volume 2, pages 604–609. AAAI Press, 2008.
  • Biggio and Roli (2018) Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317–331, 2018.
  • Biggio et al. (2012) Battista Biggio, B Nelson, and P Laskov. Poisoning attacks against support vector machines. In 29th International Conference on Machine Learning, pages 1807–1814. ArXiv e-prints, 2012.
  • Brown and Niekum (2019) Daniel S Brown and Scott Niekum. Machine teaching for inverse reinforcement learning: Algorithms and applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7749–7758, 2019.
  • Cakmak and Lopes (2012) Maya Cakmak and Manuel Lopes. Algorithmic and human teaching of sequential decision tasks. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
  • Dean et al. (2017) Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.
  • Devlin and Kudenko (2012) Sam Michael Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pages 433–440. IFAAMAS, 2012.
  • Diamond and Boyd (2016) Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
  • Huang et al. (2011) Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. Adversarial machine learning. In Proceedings of the 4th ACM workshop on Security and artificial intelligence, pages 43–58. ACM, 2011.
  • Huang et al. (2017) Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
  • Jun et al. (2018) Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pages 3640–3649, 2018.
  • Kamalaruban et al. (2019) P Kamalaruban, R Devidze, V Cevher, and A Singla. Interactive teaching algorithms for inverse reinforcement learning. In 28th International Joint Conference on Artificial Intelligence, pages 604–609, 2019.
  • Koh et al. (2018) Pang Wei Koh, Jacob Steinhardt, and Percy Liang. Stronger data poisoning attacks break data sanitization defenses. arXiv preprint arXiv:1811.00741, 2018.
  • Li et al. (2016) Bo Li, Yining Wang, Aarti Singh, and Yevgeniy Vorobeychik. Data poisoning attacks on factorization-based collaborative filtering. In Advances in neural information processing systems, pages 1885–1893, 2016.
  • Liu and Shroff (2019) Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, pages 4042–4050, 2019.
  • Ma et al. (2018) Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pages 186–204. Springer, 2018.
  • Mei and Zhu (2015) Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Ng et al. (1999) Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
  • Wang and Chaudhuri (2018) Yizhen Wang and Kamalika Chaudhuri. Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994, 2018.
  • Wiewiora (2003) Eric Wiewiora. Potential-based shaping and q-value initialization are equivalent. Journal of Artificial Intelligence Research, 19:205–208, 2003.
  • Xiao et al. (2015) Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. Is feature selection secure against training data poisoning? In International Conference on Machine Learning, pages 1689–1698, 2015.
  • Zhang and Zhu (2019) Xuezhou Zhang and Xiaojin Zhu. Online data poisoning attack. arXiv preprint arXiv:1903.01666, 2019.

Supplementary Material

Appendix A Proof of Proposition 1

The proof of feasibility relies on the following result, which states that there is a bijection mapping between reward space and value function space.

Proposition 4.

Given an MDP with transition probability function P and discounting factor γ[0,1), let R={R:S×AR} denote the set of all possible reward functions, and let Q={Q:S×AR} denote the set of all possible Q tables. Then, there exists a bijection mapping between R and Q, induced by Bellman optimality equation.

Proof.

Given any reward function R(s,a), define the Bellman operator as

HR(Q)(s,a)=R(s,a)+γsP(ss,a)maxaQ(s,a). (39)

Since γ<1, HR(Q) is a contraction mapping, i.e., HR(Q1)-HR(Q2)γQ1-Q2, Q1,Q2𝒬. Then by Banach Fixed Point Theorem, there is a unique Q𝒬 that satisfies Q=HR(Q), which is the Q that R maps to.

Given any Q𝒬, one can define the corresponding R by

R(s,a)=Q(s,a)-γsP(ss,a)maxaQ(s,a). (40)

Thus the mapping is one-to-one.  

See 1

Proof.

For any target policy π:𝒮𝒜, we construct the following Q:

Q(s,a)={εs𝒮,a=π(s),0, otherwise. (41)

The Q values in (41) satisfy the constraint (15). Note that we construct the Q values so that for all s𝒮, maxaQ(s,a)=ε. By proposition 4, the corresponding reward function induced by Bellman optimality equation is

R^(s,a)={(1-γ)εs𝒮,a=π(s),-γε, otherwise. (42)

Then one can let rt=R^(st,at) so that 𝐫=(r0,,rT-1), R^ in (42), together with Q in (41) is a feasible solution to (12)-(15).  

Appendix B Proof of Theorem 2

The proof of Theorem 2 relies on a few lemmas. We first prove the following result, which shows that given two vectors that have equal element summation, the vector whose elements are smoother will have smaller α norm for any α1. This result is used later to prove Lemma 6.

Lemma 5.

Let x,yRT be two vectors. Let I{0,1,,T-1} be a subset of indexes such that

i).xi=1||jyj,i,ii).xi=yi,i. (43)

Then for any α1, we have xαyα.

Proof.

Note that the conditions i) and ii) suggest the summation of elements in x and y are equal, and only elements in differ for the two vectors. However, the elements in of x are smoother than that of y, thus x has smaller norm. To prove the result, we consider three cases separately.

Case 1: α=1. Then we have

xα-yα=i|xi|-j|yj|=i|xi|-j|yj|=|jyj|-j|yj|0. (44)

Case 2: 1<α<. We show xααyαα. Note that

xαα-yαα =i|xi|α-j|yj|α=i|xi|α-j|yj|α (45)
=1||α-1|jyj|α-j|yj|α1||α-1(j|yj|)α-j|yj|α.

Let β=αα-1. By Holder’s inequality, we have

j|yj|(j|yj|α)1α(j1β)1β=(j|yj|α)1α||1-1α. (46)

Plugging (46) into (45), we have

xαα-yαα1||α-1(j|yj|α)||α-1-j|yj|α=0. (47)

Case 3: α=. We have

xα =maxi|xi|=max{1|||jyj|,maxi|xi|}max{1||j|yj|,maxi|xi|} (48)
max{maxj|yj|,maxi|xi|}=max{maxj|yj|,maxj|yj|}=maxj|yj|=yα.

Therefore α1, we have xαyα.  

Next we prove Lemma 6, which shows that one possible optimal attack solution to (12)-(15) takes the following form: shift all the clean rewards in Ts,a by the same amount ψ(s,a). Here ψ(s,a) is a function of state s and action a. That means, rewards belonging to different Ts,a might be shifted a different amount, but those corresponding to the same (s,a) pair will be identically shifted.

Lemma 6.

There exists a function ψ(s,a) such that rt=rt0+ψ(st,at), together with some R^ and Q, is an optimal solution to our attack problem (12)-(15).

We point out that although there exists an optimal attack taking the above form, it is not necessarily the only optimal solution. However, all those optimal solutions must have exactly the same objective value (attack cost), thus it suffices to consider the solution in Lemma 6.

Proof.

Let 𝐫*=(r0*,,rT-1*), R^* and Q* be any optimal solution to (12)-(15). Fix a particular state-action pair (s,a), we have

R^*(s,a)=1|Ts,a|tTs,art*. (49)

Let R^0(s,a)=1|Ts,a|tTs,art0 be the reward function for the (s,a) pair estimated from clean data 𝐫0. We then define a different poisoned reward vector 𝐫=(r0,,rT-1), where

rt={rt0+R^*(s,a)-R^0(s,a),tTs,a,rt*,tTs,a. (50)

Now we show 𝐫, R^* and Q* is another optimal solution to (12)-(15). We first verify that 𝐫, R^*, and Q* satisfy constraints (15)-(15). To verify (15), we only need to check R^*(s,a)=1|Ts,a|tTs,art, since 𝐫 and 𝐫* only differ on those rewards in Ts,a. We have

1|Ts,a|tTs,art =1|Ts,a|tTs,a(rt0+R^*(s,a)-R^0(s,a)) (51)
=R^0(s,a)+R^*(s,a)-R^0(s,a)=R^*(s,a),

Thus 𝐫 and R^* satisfy constraint (15). R^* and Q* obviously satisfy constraints (15) and (15) because 𝐫*, R^* and Q* is an optimal solution.

Let δ=𝐫-𝐫0 and δ*=𝐫*-𝐫0, then one can easily show that δ and δ* satisfy the conditions in Lemma 5 with =Ts,a. Therefore by Lemma 5, we have

𝐫-𝐫0α=δαδ*α=𝐫*-𝐫0α. (52)

But note that by our assumption, 𝐫* is an optimal solution, thus 𝐫*-𝐫0α𝐫-𝐫0α, which gives 𝐫-𝐫0α=𝐫*-𝐫0α. This suggests 𝐫, R^*, and Q* is another optimal solution. Compared to 𝐫*, 𝐫 differs in that rt-rt0 now becomes identical for all tTs,a for a particular (s,a) pair. Reusing the above argument iteratively, one can make rt-rt0 identical for all tTs,a for all (s,a) pairs, while guaranteeing the solution is still optimal. Therefore, we have

rt=rt0+R^*(s,a)-R^0(s,a),tTs,a,s,a, (53)

together with R^* and Q* is an optimal solution to (12)-(15). Let ψ(s,a)=R^*(s,a)-R^0(s,a) conclude the proof.  

Finally, Lemma 7 provides a sensitive analysis on the value function Q as the reward function changes.

Lemma 7.

Let M^=(S,A,P^,R^,γ) and M^0=(S,A,P^,R^0,γ) be two MDPs, where only the reward function differs. Let Q and Q0 be action values satisfying the Bellman optimality equation on M^ and M^0 respectively, then

(1-γ)Q-Q0R^-R^0(1+γ)Q-Q0. (54)
Proof.

Define the Bellman operator as

HR^(Q)(s,a)=R^(s,a)+γsP^(ss,a)maxaQ(s,a). (55)

From now on we suppress variables s and a for convenience. Note that due to the Bellman optimality, we have HR^0(Q0)=Q0 and HR^(Q)=Q, thus

Q-Q0 =HR^(Q)-HR^0(Q0) (56)
=HR^(Q)-HR^(Q0)+HR^(Q0)-HR^0(Q0)
HR^(Q)-HR^(Q0)+HR^(Q0)-HR^0(Q0)
γQ-Q0+HR^(Q0)-HR^0(Q0) (by contraction of HR^())
=γQ-Q0+R^-R^0 (by HR^(Q0)-HR^0(Q0)=R^-R^0)

Rearranging we have (1-γ)Q-Q0R^-R^0. Similarly we have

Q-Q0 =HR^(Q)-HR^0(Q0) (57)
=HR^(Q0)-HR^0(Q0)+HR^(Q)-HR^(Q0)
HR^(Q0)-HR^0(Q0)-HR^(Q)-HR^(Q0)
HR^(Q0)-HR^0(Q0)-γQ-Q0
=R^-R^0-γQ-Q0

Rearranging we have R^-R^0(1+γ)Q-Q0, concluding the proof.  

Now we are ready to prove our main result. See 2

Proof.

We construct the following value function Q.

Q(s,a)={Q0(s,a)+Δ(ε)2,s𝒮,a=π(s),Q0(s,a)-Δ(ε)2,s𝒮,aπ(s). (58)

Note that s𝒮 and aπ(s), we have

Δ(ε) =maxs𝒮[maxaπ(s)Q0(s,a)-Q0(s,π(s))+ε]+ (59)
maxaπ(s)Q0(s,a)-Q0(s,π(s))+εQ0(s,a)-Q0(s,π(s))+ε,

which leads to

Q0(s,a)-Q0(s,π(s))-Δ(ε)-ε, (60)

thus we have s𝒮 and aπ(s),

Q(s,π(s)) =Q0(s,π(s))+Δ(ε)2 (61)
=Q0(s,a)-[Q0(s,a)-Q0(s,π(s))-Δ(ε)]-Δ(ε)2
Q0(s,a)+ε-Δ(ε)2=Q(s,a)+ε.

Therefore Q satisfies the constraint (15). By proposition 4, there exists a unique function R such that Q satisfies the Bellman optimality equation of MDP M^=(𝒮,𝒜,P^,R,γ). We then construct the following reward vector 𝐫=(r0,,rT-1) such that (s,a) and tTs,a, rt=rt0+R(s,a)-R^0(s,a), where R^0(s,a) is the reward function estimated from 𝐫0. The reward function estimated on 𝐫 is then

R^(s,a) =1|Ts,a|tTs,art=1|Ts,a|tTs,a(rt0+R(s,a)-R^0(s,a)) (62)
=R^0(s,a)+R(s,a)-R^0(s,a)=R(s,a).

Thus 𝐫, R^ and Q is a feasible solution to (12)-(15). Now we analyze the attack cost for 𝐫, which gives us a natural upper bound on the attack cost of the optimal solution 𝐫*. Note that Q and Q0 satisfy the Bellman optimality equation for reward function R^ and R^0 respectively, and

Q-Q0=Δ(ε)2, (63)

thus by Lemma 7, we have t,

|rt-rt0| =|R^(st,at)-R^0(st,at)|maxs,a|R^(s,a)-R^0(s,a)|=R^-R^0 (64)
(1+γ)Q-Q0=12(1+γ)Δ(ε).

Therefore, we have

𝐫*-𝐫0α𝐫-𝐫0α=(t=0T-1|rt-rt0|α)1α12(1+γ)Δ(ε)T1α. (65)

Now we prove the lower bound. We consider two cases separately.

Case 1: Δ(ε)=0. We must have Q0(s,π(s))Q0(s,a)+ε, s𝒮,aπ(s). In this case no attack is needed and therefore the optimal solution is 𝐫*=𝐫0. The lower bound holds trivially.

Case 2: Δ(ε)>0. Let s and a (aπ(s)) be a state-action pair such that

Δ(ε)=Q0(s,a)-Q0(s,π(s))+ε. (66)

Let 𝐫*, R^* and Q* be an optimal solution to (12)-(15) that takes the form in Lemma 6, i.e.,

rt*=rt0+R^*(s,a)-R^0(s,a),tTs,a,s,a. (67)

Constraint (15) ensures that Q*(s,π(s))Q*(s,a)+ε, in which case either one of the following two conditions must hold:

i).Q*(s,π(s))-Q0(s,π(s))Δ(ε)2,ii).Q0(s,a)-Q*(s,a)Δ(ε)2, (68)

since otherwise we have

Q*(s,π(s))<Q0(s,π(s))+Δ(ε)2=Q0(s,π(s))+12[Q0(s,a)-Q0(s,π(s))+ε] (69)
=12Q0(s,a)+12Q0(s,π(s))+ε2=Q0(s,a)-12[Q0(s,a)-Q0(s,π(s))+ε]+ε
=Q0(s,a)-Δ(ε)2+ε<Q*(s,a)+ε.

Next note that if either i) or ii) holds, we have Q*-Q0Δ(ε)2. By Lemma 7, we have

maxs,a|R^*(s,a)-R^0(s,a)|=R^*-R^0(1-γ)Q*-Q012(1-γ)Δ(ε). (70)

Let s*,a*argmaxs,a|R^*(s,a)-R^0(s,a)|, then we have

|R^*(s*,a*)-R^0(s*,a*)|12(1-γ)Δ(ε). (71)

Therefore, we have

𝐫*-𝐫0αα =t=0T-1|rt*-rt0|α=s,atTs,a|rt*-rt0|αtTs*,a*|rt*-rt0|α (72)
=tTs*,a*|R^*(s*,a*)-R^0(s*,a*)|α(12(1-γ)Δ(ε))α|Ts*,a*|
(12(1-γ)Δ(ε))αmins,a|Ts,a|.

Therefore 𝐫*-𝐫0α12(1-γ)Δ(ε)(mins,a|Ts,a|)1α.

We finally point out that while an optimal solution 𝐫* may not necessarily take the form in Lemma 6, it suffices to bound the cost of an optimal attack which indeed takes this form (as we did in the proof) since all optimal attacks have exactly the same objective value.  

Appendix C Convex Surrogate for LQR Attack Optimization

By pulling the positive semi-definite constraints on Q and R out of the lower level optimization (33), one can turn the original attack optimization (27)-(33) into the following surrogate optimization:

min𝐫,Q^,R^,q^,c^,X,x 𝐫-𝐫0α (73)
s.t. -γ(R^+γB^XB^)-1B^XA^=K, (79)
-γ(R^+γB^XB^)-1B^x=k,
X=γA^XA^-γ2A^XB^(R^+γB^XB^)-1B^XA^+Q^
x=q^+γ(A^+B^K)x
(Q^,R^,q^,c^)=argmint=0T-112stQst+qst+atRat+c+rt22
Q^0,R^εI,X0.

The feasible set of (73)-(79) is a subset of the original problem, thus the surrogate attack optimization is a more stringent formulation than the original attack optimization, that is, successfully solving the surrogate optimization gives us a (potentially) sub-optimal solution to the original problem. To see why the surrogate optimization is more stringent, we illustrate with a much simpler example as below. A formal proof is straight forward, thus we omit it here. The original problem is (80)-(81). The feasible set for a^ is a singleton set {0}, and the optimal objective value is 0.

mina^ 0 (80)
s.t. a^=argmina0(a+3)2, (81)

Once we pull the constraint out of the lower-level optimization (81), we end up with a surrogate optimization (82)-(84). Note that (84) requires a^=-3, which does not satisfy (84). Therefore the feasible set of the surrogate optimization is , meaning it is more stringent than (80)-(81).

mina^ 0 (82)
s.t. a^=argmin(a+3)2, (84)
a^0

Back to our attack optimization (73)-(79), this surrogate attack optimization comes with the advantage of being convex, thus can be solved to global optimality.

Proposition 8.

The surrogate attack optimization (73)-(79) is convex.

Proof.

First note that the sub-level optimization (79) is itself a convex problem, thus is equivalent to the corresponding KKT condition. We write out the KKT condition of (79) to derive an explicit form of our attack formulation as below:

min𝐫,Q^,R^,q^,c^,X,x 𝐫-𝐫0α (85)
s.t. -γ(R^+γB^XB^)-1B^XA^=K, (94)
-γ(R^+γB^XB^)-1B^x=k,
X=γA^XA^-γ2A^XB^(R^+γB^XB^)-1B^XA^+Q^
x=q^+γ(A^+B^K)x
t=0T-1(12stQ^st+q^st+atR^at+c^+rt)stst=0,
t=0T-1(12stQ^st+q^st+atR^at+c^+rt)atat=0,
t=0T-1(12stQ^st+q^st+atR^at+c^+rt)st=0,
t=0T-1(12stQ^st+q^st+atR^at+c^+rt)=0,
Q^0,R^εI,X0.

The objective is obviously convex. (94)-(94) are equivalent to

-γB^XA^=(R^+γB^XB^)K. (95)
-γB^x=(R^+γB^XB^)k. (96)
X=γA^X(A^+B^K)+Q^, (97)

Note that these three equality constraints are all linear in X, R^, x, and Q^. (94) is linear in x and q^. (94)-(94) are also linear in Q^, R^, q^, c^ and 𝐫. Finally, (94) contains convex constraints on Q^, R^, and X. Given all above, the attack problem is convex.  

Next we analyze the feasibility of the surrogate attack optimization.

Proposition 9.

Let A^, B^ be the learner’s estimated transition kernel. Let

L(s,a)=12sQs+(q)s+aRa+c (98)

be the attacker-defined loss function. Assume RεI. If the target policy K, k is the optimal control policy induced by the LQR with transition kernel A^, B^, and loss function L(s,a), then the surrogate attack optimization (73)-(79) is feasible. Furthermore, the optimal solution can be achieved.

Proof.

To prove feasibility, it suffices to construct a feasible solution to optimization (73)-(79). Let

rt=12stQst+qst+atRat+c (99)

and 𝐫 be the vector whose t-th element is rt. We next show that 𝐫, Q, R, q, c, together with some X and x is a feasible solution. Note that since K, k is induced by the LQR with transition kernel A^, B^ and cost function L(s,a), constraints (79)-(79) must be satisfied with some X and x. The poisoned reward vector 𝐫 obviously satisfies (79) since it is constructed exactly as the minimizer. By our assumption, RεI, thus (79) is satisfied. Therefore, 𝐫, Q, R, q, c, together with the corresponding X, x is a feasible solution, and the optimization (73)-(79) is feasible. Furthermore, since the feasible set is closed, the optimal solution can be achieved.  

Appendix D Conditions for The LQR Learner to Have Unique Estimate

The LQR learner estimates the cost function by

(Q^,R^,q^,c^)=argmin(Q0,RεI,q,c)12t=0T-112stQst+qst+atRat+c+rt22. (100)

We want to find a condition that guarantees the uniqueness of the solution.

Let ψT be a vector, whose t-th element is

ψt=12stQst+qst+atRat+c,0tT-1. (101)

Note that we can view ψ as a function of D, Q, R, q, and c, thus we can also denote ψ(D,Q,R,q,c). Define Ψ(D)={ψ(D,Q,R,q,c)Q0,RεI,q,c}, i.e., all possible vectors that are achievable with form (101) if we vary Q, R, q and c subject to positive semi-definite constraints on Q and R. We can prove that Ψ is a closed convex set.

Proposition 10.

D, Ψ(D)={ψ(D,Q,R,q,c)Q0,RεI,q,c} is a closed convex set.

Proof.

Let ψ1,ψ2Ψ(D). We use ψi,t to denote the t-th element of vector ψi. Then we have

ψ1,t=12stQ1st+q1st+atR1at+c1 (102)

for some Q10, R1εI, q1 and c1, and

ψ2,t=12stQ2st+q2st+atR2at+c2 (103)

for some Q20, R2εI, q2 and c2. k[0,1], let ψ3=kψ1+(1-k)ψ2. Then the t-th element of ψ3 is

ψ3,t= 12st[kQ1+(1-k)Q2]st+[kq1+(1-k)q2]st (104)
+at[kR1+(1-k)R2]at+kc1+(1-k)c2

Since kQ1+(1-k)Q20 and kR1+(1-k)R2εI, ψ3Ψ(D), concluding the proof.  

The optimization (100) is intrinsically a least-squares problem with positive semi-definite constraints on Q and R, and is equivalent to solving the following linear equation:

12stQ^st+q^st+atR^at+c^=ψt*,t, (105)

where ψ*=argminψΨ(D)ψ+𝐫22 is the projection of the negative reward vector -𝐫 onto the set Ψ(D). The solution to (105) is unique if and only if the following two conditions both hold

  • i).

    The projection ψ* is unique.

  • ii).

    (105) has a unique solution for ψ*.

Condition i) is satisfied because Ψ(D) is convex, and any projection (in 2 norm) onto a convex set exists and is always unique (see Hilbert Projection Theorem). We next analyze when condition ii) holds. (105) is a linear function in Q^, R^, q^, and c^, thus one can vectorize Q^ and R^ to obtain a problem in the form of linear regression. Then the uniqueness is guaranteed if and only if the design matrix has full column rank. Specifically, let Q^n×n, R^m×m, and q^n. Let st,i and at,i denote the i-th element of st and at respectively. Define

𝐀=[s0,122s0,is0,j2s0,n22a0,12a0,ia0,ja0,m2s01s1,122s1,is1,j2s1,n22a1,12a1,ia2,ja1,m2s11st,122st,ist,j2st,n22at,12at,iat,jat,m2st1sT-1,122sT-1,isT-1,j2sT-1,n22aT-1,12aT-1,iaT-1,jaT-1,m2sT-11],
𝐱=[Q^11Q^ijQ^nnR^11R^ijR^mmq^1q^iq^nc^],

then (105) is equivalent to 𝐀𝐱=ψ*, where 𝐱 contains the vectorized variables Q^, R^, q^ and c^. 𝐀𝐱=ψ* has a unique solution if and only if 𝐀 has full column rank.

Appendix E Sparse Attacks on TCE and LQR

In this section, we present experimental details for both TCE and LQR victims when the attacker uses 1 norm to measure the attack cost, i.e. α=1. The other experimental parameters are set exactly the same as in the main text.

We first show the result for MDP experiment 2 with α=1, see Figure 4. The attack cost is 𝐫-𝐫01=3.27, which is small compared to 𝐫01=105. We note that the reward poisoning is extremely sparse: only the reward corresponding to action “go up” at the terminal state G is increased by 3.27, and all other rewards remain unchanged. To explain this attack, first note that we set the target action for the terminal state to “go up”, thus the corresponding reward must be increased. Next note that after the attack, the terminal state becomes a sweet spot, where the agent can keep taking action “go up” to gain large amount of discounted future reward. However, such future reward is discounted more if the agent reaches the terminal state via a longer path. Therefore, the agent will choose to go along the red trajectory to get into the terminal state earlier, though at a price of two discounted -10 rewards.

{tikzpicture}\draw

[step=2cm,black,thin,opacity=0.5] (0,0) grid (14,12); [black] (0,2) rectangle (2,12); [black] (4,0) rectangle (14,2); [black] (4,4) rectangle (10,8); [black] (4,10) rectangle (8,12); [black] (12,0) rectangle (14,12); [black] (10,10) rectangle (12,12); [gray,opacity=0.8] (2,4) rectangle (4,8); [blue,opacity=0.5] (2,10) rectangle (4,12); \nodeat (1,1) S; \nodeat (3,11) G;

\draw

[orange,->,>=stealth] (0.5,1) – (-0.5,1); \draw[orange,->,>=stealth] (1,0.5) – (1,-0.5); \draw[orange,->,>=stealth] (1.5,1.2) – (2.5,1.2); \draw[orange,->,>=stealth] (2.5,0.8) – (1.5,0.8); \draw[orange,->,>=stealth] (1,1.5) – (1,2.5);

\draw

[orange,->,>=stealth] (3,0.5) – (3,-0.5); \draw[orange,->,>=stealth] (3.2,2.5) – (3.2,1.5); \draw[orange,->,>=stealth] (3.5,1) – (4.5,1); \draw[orange,->,>=stealth] (2.8,1.5) – (2.8,2.5);

\draw

[orange,->,>=stealth] (3.2,4.5) – (3.2,3.5); \draw[orange,->,>=stealth] (3.5,3.2) – (4.5,3.2); \draw[orange,->,>=stealth] (4.5,2.8) – (3.5,2.8); \draw[orange,->,>=stealth] (2.8,3.5) – (2.8,4.5); \draw[orange,->,>=stealth] (2.5,3) – (1.5,3);

\draw

[orange,->,>=stealth] (5.5,3.2) – (6.5,3.2); \draw[orange,->,>=stealth] (6.5,2.8) – (5.5,2.8); \draw[orange,->,>=stealth] (5,3.5) – (5,4.5); \draw[orange,->,>=stealth] (5,2.5) – (5,1.5);

\draw

[orange,->,>=stealth] (7.5,3.2) – (8.5,3.2); \draw[orange,->,>=stealth] (8.5,2.8) – (7.5,2.8); \draw[orange,->,>=stealth] (7,3.5) – (7,4.5); \draw[orange,->,>=stealth] (7,2.5) – (7,1.5);

\draw

[orange,->,>=stealth] (9.5,3.2) – (10.5,3.2); \draw[orange,->,>=stealth] (10.5,2.8) – (9.5,2.8); \draw[orange,->,>=stealth] (9,3.5) – (9,4.5); \draw[orange,->,>=stealth] (9,2.5) – (9,1.5);

\draw

[orange,->,>=stealth] (11.2,4.5) – (11.2,3.5); \draw[orange,->,>=stealth] (11.5,3) – (12.5,3); \draw[orange,->,>=stealth] (10.8,3.5) – (10.8,4.5); \draw[orange,->,>=stealth] (11,2.5) – (11,1.5);

\draw

[orange,->,>=stealth] (11.2,6.5) – (11.2,5.5); \draw[orange,->,>=stealth] (11.5,5) – (12.5,5); \draw[orange,->,>=stealth] (10.8,5.5) – (10.8,6.5); \draw[orange,->,>=stealth] (10.5,5) – (9.5,5);

\draw

[orange,->,>=stealth] (11.2,8.5) – (11.2,7.5); \draw[orange,->,>=stealth] (11.5,7) – (12.5,7); \draw[orange,->,>=stealth] (10.8,7.5) – (10.8,8.5); \draw[orange,->,>=stealth] (10.5,7) – (9.5,7);

\draw

[orange,->,>=stealth] (11.5,9) – (12.5,9); \draw[orange,->,>=stealth] (11,9.5) – (11,10.5); \draw[orange,->,>=stealth] (9.5,9.2) – (10.5,9.2); \draw[orange,->,>=stealth] (10.5,8.8) – (9.5,8.8);

\draw

[orange,->,>=stealth] (8.8,9.5) – (8.8,10.5); \draw[orange,->,>=stealth] (9.2,10.5) – (9.2,9.5); \draw[orange,->,>=stealth] (7.5,9.2) – (8.5,9.2); \draw[orange,->,>=stealth] (8.5,8.8) – (7.5,8.8); \draw[orange,->,>=stealth] (9,8.5) – (9,7.5);

\draw

[orange,->,>=stealth] (7,9.5) – (7,10.5); \draw[orange,->,>=stealth] (5.5,9.2) – (6.5,9.2); \draw[orange,->,>=stealth] (6.5,8.8) – (5.5,8.8); \draw[orange,->,>=stealth] (7,8.5) – (7,7.5);

\draw

[orange,->,>=stealth] (5,9.5) – (5,10.5); \draw[orange,->,>=stealth] (3.5,9.2) – (4.5,9.2); \draw[orange,->,>=stealth] (4.5,8.8) – (3.5,8.8); \draw[orange,->,>=stealth] (5,8.5) – (5,7.5);

\draw

[orange,->,>=stealth] (2.8,9.5) – (2.8,10.5); \draw[orange,->,>=stealth] (3.2,10.5) – (3.2,9.5); \draw[orange,->,>=stealth] (2.5,9) – (1.5,9); \draw[orange,->,>=stealth] (2.8,7.5) – (2.8,8.5); \draw[orange,->,>=stealth] (3.2,8.5) – (3.2,7.5);

\draw

[orange,->,>=stealth] (2.8,5.5) – (2.8,6.5); \draw[orange,->,>=stealth] (3.2,6.5) – (3.2,5.5); \draw[orange,->,>=stealth] (2.5,7) – (1.5,7); \draw[orange,->,>=stealth] (3.5,7) – (4.5,7);

\draw

[orange,->,>=stealth] (2.5,5) – (1.5,5); \draw[orange,->,>=stealth] (3.5,5) – (4.5,5);

\draw

[orange,->,>=stealth] (2.5,11) – (1.5,11); \draw[orange,->,>=stealth] (3.5,11) – (4.5,11); \draw[orange,->,>=stealth] (3,11.5) – (3,12.5);

\draw

[orange,->,>=stealth] (8.5,11) – (7.5,11); \draw[orange,->,>=stealth] (9.5,11) – (10.5,11); \draw[orange,->,>=stealth] (9,11.5) – (9,12.5); \nodeat (3,12) +3.270;

Figure 4: Sparse reward modification for MDP experiment 2.

The result is similar for MDP experiment 3. The attack cost is 𝐫-𝐫01=1.05, compared to 𝐫01=121. In Figure 5, we show the reward modification for each state action pair. Again, the attack is very sparse: only rewards of 12 state-action pairs are modified out of a total of 124.

{tikzpicture}\draw

[step=2cm,black,thin,opacity=0.5] (0,0) grid (12,12); [black] (2,8) rectangle (4,12); [black] (8,8) rectangle (10,12); \nodeat (1,11) G1; \nodeat (11,11) G2; [blue,opacity=0.5] (0,10) rectangle (2,12); [blue,opacity=0.5] (10,10) rectangle (12,12); \nodeat (3,7) S;

\draw

[orange,->,>=stealth] (0.5,1) – (-0.5,1); \draw[orange,->,>=stealth] (1,0.5) – (1,-0.5); \draw[orange,->,>=stealth] (1.5,1.2) – (2.5,1.2); \draw[orange,->,>=stealth] (2.5,0.8) – (1.5,0.8); \draw[orange,->,>=stealth] (0.8,1.5) – (0.8,2.5); \draw[orange,->,>=stealth] (1.2,2.5) – (1.2,1.5);

\draw

[orange,->,>=stealth] (0.5,3) – (-0.5,3); \draw[orange,->,>=stealth] (1.5,3.2) – (2.5,3.2); \draw[orange,->,>=stealth] (2.5,2.8) – (1.5,2.8); \draw[orange,->,>=stealth] (0.8,3.5) – (0.8,4.5); \draw[orange,->,>=stealth] (1.2,4.5) – (1.2,3.5);

\draw

[orange,->,>=stealth] (0.5,5) – (-0.5,5); \draw[orange,->,>=stealth] (1.5,5.2) – (2.5,5.2); \draw[orange,->,>=stealth] (2.5,4.8) – (1.5,4.8); \draw[orange,->,>=stealth] (0.8,5.5) – (0.8,6.5); \draw[orange,->,>=stealth] (1.2,6.5) – (1.2,5.5);

\draw

[orange,->,>=stealth] (0.5,7) – (-0.5,7); \draw[orange,->,>=stealth] (1.5,7.2) – (2.5,7.2); \draw[orange,->,>=stealth] (2.5,6.8) – (1.5,6.8); \draw[orange,->,>=stealth] (0.8,7.5) – (0.8,8.5); \draw[orange,->,>=stealth] (1.2,8.5) – (1.2,7.5);

\draw

[orange,->,>=stealth] (0.5,9) – (-0.5,9); \draw[orange,->,>=stealth] (1.5,9) – (2.5,9); \draw[orange,->,>=stealth] (0.8,9.5) – (0.8,10.5); \draw[orange,->,>=stealth] (1.2,10.5) – (1.2,9.5);

\draw

[orange,->,>=stealth] (0.5,11) – (-0.5,11); \draw[orange,->,>=stealth] (1.5,11) – (2.5,11); \draw[orange,->,>=stealth] (1,11.5) – (1,12.5);

\draw

[orange,->,>=stealth] (3,0.5) – (3,-0.5); \draw[orange,->,>=stealth] (3.5,1.2) – (4.5,1.2); \draw[orange,->,>=stealth] (4.5,0.8) – (3.5,0.8); \draw[orange,->,>=stealth] (2.8,1.5) – (2.8,2.5); \draw[orange,->,>=stealth] (3.2,2.5) – (3.2,1.5);

\draw

[orange,->,>=stealth] (5,0.5) – (5,-0.5); \draw[orange,->,>=stealth] (5.5,1.2) – (6.5,1.2); \draw[orange,->,>=stealth] (6.5,0.8) – (5.5,0.8); \draw[orange,->,>=stealth] (4.8,1.5) – (4.8,2.5); \draw[orange,->,>=stealth] (5.2,2.5) – (5.2,1.5);

\draw

[orange,->,>=stealth] (7,0.5) – (7,-0.5); \draw[orange,->,>=stealth] (7.5,1.2) – (8.5,1.2); \draw[orange,->,>=stealth] (8.5,0.8) – (7.5,0.8); \draw[orange,->,>=stealth] (6.8,1.5) – (6.8,2.5); \draw[orange,->,>=stealth] (7.2,2.5) – (7.2,1.5);

\draw

[orange,->,>=stealth] (9,0.5) – (9,-0.5); \draw[orange,->,>=stealth] (9.5,1.2) – (10.5,1.2); \draw[orange,->,>=stealth] (10.5,0.8) – (9.5,0.8); \draw[orange,->,>=stealth] (8.8,1.5) – (8.8,2.5); \draw[orange,->,>=stealth] (9.2,2.5) – (9.2,1.5);

\draw

[orange,->,>=stealth] (11,0.5) – (11,-0.5); \draw[orange,->,>=stealth] (11.5,1) – (12.5,1); \draw[orange,->,>=stealth] (10.8,1.5) – (10.8,2.5); \draw[orange,->,>=stealth] (11.2,2.5) – (11.2,1.5);

\draw

[orange,->,>=stealth] (3.5,3.2) – (4.5,3.2); \draw[orange,->,>=stealth] (4.5,2.8) – (3.5,2.8); \draw[orange,->,>=stealth] (2.8,3.5) – (2.8,4.5); \draw[orange,->,>=stealth] (3.2,4.5) – (3.2,3.5);

\draw

[orange,->,>=stealth] (3.5,5.2) – (4.5,5.2); \draw[orange,->,>=stealth] (4.5,4.8) – (3.5,4.8); \draw[orange,->,>=stealth] (2.8,5.5) – (2.8,6.5); \draw[orange,->,>=stealth] (3.2,6.5) – (3.2,5.5);

\draw

[orange,->,>=stealth] (3.5,7.2) – (4.5,7.2); \draw[orange,->,>=stealth] (4.5,6.8) – (3.5,6.8); \draw[orange,->,>=stealth] (3,7.5) – (3,8.5);

\draw

[orange,->,>=stealth] (5.5,3.2) – (6.5,3.2); \draw[orange,->,>=stealth] (6.5,2.8) – (5.5,2.8); \draw[orange,->,>=stealth] (4.8,3.5) – (4.8,4.5); \draw[orange,->,>=stealth] (5.2,4.5) – (5.2,3.5);

\draw

[orange,->,>=stealth] (5.5,5.2) – (6.5,5.2); \draw[orange,->,>=stealth] (6.5,4.8) – (5.5,4.8); \draw[orange,->,>=stealth] (4.8,5.5) – (4.8,6.5); \draw[orange,->,>=stealth] (5.2,6.5) – (5.2,5.5);

\draw

[orange,->,>=stealth] (5.5,7.2) – (6.5,7.2); \draw[orange,->,>=stealth] (6.5,6.8) – (5.5,6.8); \draw[orange,->,>=stealth] (4.8,7.5) – (4.8,8.5); \draw[orange,->,>=stealth] (5.2,8.5) – (5.2,7.5);

\draw

[orange,->,>=stealth] (5.5,9.2) – (6.5,9.2); \draw[orange,->,>=stealth] (6.5,8.8) – (5.5,8.8); \draw[orange,->,>=stealth] (4.8,9.5) – (4.8,10.5); \draw[orange,->,>=stealth] (5.2,10.5) – (5.2,9.5); \draw[orange,->,>=stealth] (4.5,9) – (3.5,9);

\draw

[orange,->,>=stealth] (5.5,11.2) – (6.5,11.2); \draw[orange,->,>=stealth] (6.5,10.8) – (5.5,10.8); \draw[orange,->,>=stealth] (5,11.5) – (5,12.5); \draw[orange,->,>=stealth] (4.5,11) – (3.5,11);

\draw

[orange,->,>=stealth] (7.5,3.2) – (8.5,3.2); \draw[orange,->,>=stealth] (8.5,2.8) – (7.5,2.8); \draw[orange,->,>=stealth] (6.8,3.5) – (6.8,4.5); \draw[orange,->,>=stealth] (7.2,4.5) – (7.2,3.5);

\draw

[orange,->,>=stealth] (7.5,5.2) – (8.5,5.2); \draw[orange,->,>=stealth] (8.5,4.8) – (7.5,4.8); \draw[orange,->,>=stealth] (6.8,5.5) – (6.8,6.5); \draw[orange,->,>=stealth] (7.2,6.5) – (7.2,5.5);

\draw

[orange,->,>=stealth] (7.5,7.2) – (8.5,7.2); \draw[orange,->,>=stealth] (8.5,6.8) – (7.5,6.8); \draw[orange,->,>=stealth] (6.8,7.5) – (6.8,8.5); \draw[orange,->,>=stealth] (7.2,8.5) – (7.2,7.5);

\draw

[orange,->,>=stealth] (7.5,9) – (8.5,9); \draw[orange,->,>=stealth] (6.8,9.5) – (6.8,10.5); \draw[orange,->,>=stealth] (7.2,10.5) – (7.2,9.5);

\draw

[orange,->,>=stealth] (7.5,11) – (8.5,11); \draw[orange,->,>=stealth] (7,11.5) – (7,12.5);

\draw

[orange,->,>=stealth] (9.5,3.2) – (10.5,3.2); \draw[orange,->,>=stealth] (10.5,2.8) – (9.5,2.8); \draw[orange,->,>=stealth] (8.8,3.5) – (8.8,4.5); \draw[orange,->,>=stealth] (9.2,4.5) – (9.2,3.5);

\draw

[orange,->,>=stealth] (9.5,5.2) – (10.5,5.2); \draw[orange,->,>=stealth] (10.5,4.8) – (9.5,4.8); \draw[orange,->,>=stealth] (8.8,5.5) – (8.8,6.5); \draw[orange,->,>=stealth] (9.2,6.5) – (9.2,5.5);

\draw

[orange,->,>=stealth] (9.5,7.2) – (10.5,7.2); \draw[orange,->,>=stealth] (10.5,6.8) – (9.5,6.8); \draw[orange,->,>=stealth] (9,7.5) – (9,8.5);

\draw

[orange,->,>=stealth] (11.5,3) – (12.5,3); \draw[orange,->,>=stealth] (10.8,3.5) – (10.8,4.5); \draw[orange,->,>=stealth] (11.2,4.5) – (11.2,3.5);

\draw

[orange,->,>=stealth] (11.5,5) – (12.5,5); \draw[orange,->,>=stealth] (10.8,5.5) – (10.8,6.5); \draw[orange,->,>=stealth] (11.2,6.5) – (11.2,5.5);

\draw

[orange,->,>=stealth] (11.5,7) – (12.5,7); \draw[orange,->,>=stealth] (10.8,7.5) – (10.8,8.5); \draw[orange,->,>=stealth] (11.2,8.5) – (11.2,7.5);

\draw

[orange,->,>=stealth] (11.5,9) – (12.5,9); \draw[orange,->,>=stealth] (10.5,9) – (9.5,9); \draw[orange,->,>=stealth] (10.8,9.5) – (10.8,10.5); \draw[orange,->,>=stealth] (11.2,10.5) – (11.2,9.5);

\draw

[orange,->,>=stealth] (11.5,11) – (12.5,11); \draw[orange,->,>=stealth] (10.5,11) – (9.5,11); \draw[orange,->,>=stealth] (11,11.5) – (11,12.5);

\node

at (2.6,2) +0.010;

\node

at (6,1.2) -0.010;

\node

at (2.6,4) +0.010;

\node

at (8,1.2) -0.010;

\node

at (2.6,6) +0.010;

\node

at (10,1.2) -0.010;

\node

at (4,7.2) +0.100;

\node

at (6,7.2) +0.123;

\node

at (1,12) +0.100;

\node

at (8,7.2) +0.123;

\node

at (10,7.2) +0.123;

\node

at (11,12) +0.424;

Figure 5: Sparse reward modification for MDP experiment 3.

Finally, we show the result on attacking LQR with α=1. The attack cost is 𝐫-𝐫01=5.44, compared to 𝐫01=2088.57. In Figure 6, we plot the clean and poisoned trajectory of the vehicle, together with the reward modification in each time step. The attack is as effective as with a dense 2-norm attack in Figure 3. However, the poisoning is highly sparse: only 10 out of 400 rewards are changed.

(a) Clean and poisoned vehicle trajectory.
(b) Clean and poisoned rewards.
Figure 6: Sparse-poisoning a vehicle running LQR in 4D state space.

Appendix F Derivation of Discounted Discrete-time Algebraic Riccati Equation

We provide a derivation for the discounted Discrete-time Algebraic Riccati Equation. For simplicity, we consider the noiseless case, but the derivation easily generalizes to noisy case. We consider the loss function is a general quadratic function w.r.t. s as follows:

L(s,a)=12sQs+qs+c+aRa. (106)

When q=0,c=0, we recover the classic LQR setting. Assume the general value function takes the form V(s)=12sXs+sx+v. Let Q(s,a) (note that this is different notation from the Q matrix in L(s,a)) be the corresponding action value function. We perform dynamics programming as follows:

Q(s,a) =12sQs+qs+c+aRa+γV(As+Ba) (107)
=12sQs+qs+c+aRa+γ(12(As+Ba)X(As+Ba)+(As+Ba)x+v)
=12s(Q+γAXA)s+12a(R+γBXB)a+s(γAXB)a
+s(q+γAx)+a(γBx)+(c+γv).

We minimize a above:

(R+γBXB)a+γBXAs+γBx=0 (108)
a=-γ(R+γBXB)-1BXAs-γ(R+γBXB)-1BxKs+k.

Now we substitute it back to Q(s,a) and regroup terms, we get:

V(s)= 12s(Q+γAXA+K(R+γBXB)K+2γAXBK)s (109)
+s(K(R+γBXB)k+γAXBk+q+γAx+γKBx)+C

for some constant C, which gives us the following recursion:

X=γAXA-γ2AXB(R+γBXB)-1BXA+Q, (110)
x=q+γ(A+BK)x.