We study a security threat to batch reinforcement learning and control wherethe attacker aims to poison the learned policy. The victim is a reinforcementlearner / controller which first estimates the dynamics and the rewards from abatch data set, and then solves for the optimal policy with respect to theestimates. The attacker can modify the data set slightly before learninghappens, and wants to force the learner into learning a target policy chosen bythe attacker. We present a unified framework for solving batch policy poisoningattacks, and instantiate the attack on two standard victims: tabular certaintyequivalence learner in reinforcement learning and linear quadratic regulator incontrol. We show that both instantiation result in a convex optimizationproblem on which global optimality is guaranteed, and provide analysis onattack feasibility and attack cost. Experiments show the effectiveness ofpolicy poisoning attacks.
Quick Read (beta)
in Batch Reinforcement Learning and Control
We study a security threat to batch reinforcement learning and control where the attacker aims to poison the learned policy. The victim is a reinforcement learner / controller which first estimates the dynamics and the rewards from a batch data set, and then solves for the optimal policy with respect to the estimates. The attacker can modify the data set slightly before learning happens, and wants to force the learner into learning a target policy chosen by the attacker. We present a unified framework for solving batch policy poisoning attacks, and instantiate the attack on two standard victims: tabular certainty equivalence learner in reinforcement learning and linear quadratic regulator in control. We show that both instantiation result in a convex optimization problem on which global optimality is guaranteed, and provide analysis on attack feasibility and attack cost. Experiments show the effectiveness of policy poisoning attacks.
in Batch Reinforcement Learning and Control
Yuzhe Ma University of Wisconsin–Madison [email protected] Xuezhou Zhang University of Wisconsin–Madison [email protected] Wen Sun Microsoft Research New York [email protected] Xiaojin Zhu University of Wisconsin–Madison [email protected]
noticebox[b]Preprint. Under review.\[email protected]
With the increasing adoption of machine learning, it is critical to study security threats to learning algorithms and design effective defense mechanisms against those threats. There has been significant work on adversarial attacks Biggio and Roli (2018); Huang et al. (2011). We focus on the subarea of data poisoning attacks where the adversary manipulates the training data so that the learner learns a wrong model. Prior work on data poisoning targeted victims in supervised learning Mei and Zhu (2015); Koh et al. (2018); Wang and Chaudhuri (2018); Zhang and Zhu (2019) and multi-armed bandits Jun et al. (2018); Ma et al. (2018); Liu and Shroff (2019). We take a step further and study data poisoning attacks on reinforcement learning (RL). Given RL’s prominent applications in robotics, games and so on, an intentionally and adversarially planted bad policy could be devastating.
While there has been some related work in test-time attack on RL, reward shaping, and teaching inverse reinforcement learning (IRL), little is understood on how to training-set poison a reinforcement learner. We take the first step and focus on batch reinforcement learner and controller as the victims. These victims learn their policy from a batch training set. We assume that the attacker can modify the rewards in the training set, which we show is sufficient for policy poisoning. The attacker’s goal is to force the victim to learn a particular target policy (hence the name policy poisoning), while minimizing the reward modifications. Our main contribution is to characterize batch policy poisoning with a unified optimization framework, and to study two instances against tabular certainty-equivalence (TCE) victim and linear quadratic regulator (LQR) victim, respectively.
2 Related Work
Of particular interest is the work on test-time attacks against RL Huang et al. (2017). Unlike policy poisoning, there the RL agent carries out an already-learned and fixed policy to e.g. play the Pong Game. The attacker perturbs pixels in a game board image, which is part of the state . This essentially changes the RL agent’s perceived state into some . The RL agent then chooses the action (e.g. move down) which may differ from (e.g. move up). The attacker’s goal is to force some specific on the RL agent. Note itself stays the same through the attack. In contrast, ours is a data-poisoning attack which happens at training time and aims to change .
Data-poisoning attacks were previously limited to supervised learning victims, either in batch mode Biggio et al. (2012); Xiao et al. (2015); Li et al. (2016); Mei and Zhu (2015) or online mode Wang and Chaudhuri (2018); Zhang and Zhu (2019). Recently data-poisoning attacks have been extended to multi-armed bandit victims Jun et al. (2018); Ma et al. (2018); Liu and Shroff (2019), but not yet to RL victims.
There are two related but distinct concepts in RL research. One concept is reward shaping Ng et al. (1999); Asmuth et al. (2008); Devlin and Kudenko (2012); Wiewiora (2003) which also modifies rewards to affect an RL agent. However, the goal of reward shaping is fundamentally different from ours. Reward shaping aims to speed up convergence to the same optimal policy as without shaping. Note the differences in both the target (same vs. different policies) and the optimality measure (speed to converge vs. magnitude of reward change).
The other concept is teaching IRL Cakmak and Lopes (2012); Brown and Niekum (2019); Kamalaruban et al. (2019). Teaching and attacking are mathematically equivalent. However, the main difference to our work is the victim. They require an IRL agent, which is a specialized algorithm that estimates a reward function from demonstrations of (state, action) trajectories alone (i.e. no reward given). In contrast, our attacks target more prevalent RL agents and are thus potentially more applicable. Due to the difference in the input to IRL vs. RL victims, our attack framework is completely different.
A Markov Decision Process (MDP) is defined as a tuple , where is the state space, is the action space, is the transition kernel where denotes the space of probability distributions on , is the reward function, and is a discounting factor. We define a policy as a function that maps a state to an action. We denote the function of a policy as , where the expectation is over the randomness in both transitions and rewards. The function that corresponds to the optimal policy can be characterized by the following Bellman optimality equation:
and the optimal policy is defined as .
We focus on RL victims who perform batch reinforcement learning. A training item is a tuple , where is the current state, is the action taken, is the received reward, and is the next state. A training set is a batch of training items denoted by . Given training set , a model-based learner performs learning in two steps:
Step 1. The learner estimates an MDP from . In particular, we assume the learner uses maximum likelihood estimate for the transition kernel
and least-squares estimate for the reward function
Note that we do not require (2) to have a unique maximizer . When multiple maximizers exist, we assume the learner arbitrarily picks one of them as the estimate. We assume the minimizer is always unique. We will discuss the conditions to guarantee the uniqueness of for two learners later.
Step 2. The learner finds the optimal policy that maximizes the expected discounted cumulative reward on the estimated environment , i.e.,
where is a specified or random initial state. Note that there could be multiple optimal policies, thus we use in (4). Later we will specialize (4) to two specific victim learners: the tabular certainty equivalence learner (TCE) and the certainty-equivalent linear quadratic regulator (LQR).
4 Policy Poisoning
We study policy poisoning attacks on model-based batch RL learners. Our threat model is as follows:
Knowledge of the attacker. The attacker has access to the original training set . The attacker knows the model-based RL learner’s algorithm. Importantly, the attacker knows how the learner estimates the environment, i.e., (2) and (3). In the case (2) has multiple maximizers, we assume the attacker knows exactly the that the learner picks.
Available actions of the attacker. The attacker is allowed to arbitrarily modify the rewards in into . As we show later, changing ’s but not is sufficient for policy poisoning.
Attacker’s goals. The attacker has a pre-specified target policy . The attack goals are to (1) force the learner to learn , (2) minimize attack cost under an -norm chosen by the attacker.
Given the threat model, we can formulate policy poisoning as a bi-level optimization problem11 1 As we will show, the constraint (7) could lead to an open feasible set (e.g., in (10)) for the attack optimization (5)-(7), on which the minimum of the objective function (5) may not be well-defined. In the case (7) induces an open set, we will consider instead a closed subset of it, and optimize over the subset. How to construct the closed subset will be made clear for concrete learners later.:
The in (7) does not involve and is precomputed from . The singleton set on the LHS of (7) ensures that the target policy is learned uniquely, i.e., there are no other optimal policies tied with . Next, we instantiate this attack formulation to two representative model-based RL victims.
4.1 Poisoning a Tabular Certainty Equivalence (TCE) Victim
In tabular certainty equivalence (TCE), the environment is a Markov Decision Process (MDP) with finite state and action space. Given original data , let , the time indexes of all training items for which action is taken at state . We assume , , i.e., each state-action pair appears at least once in . This condition is needed to ensure that the learner’s estimate and exist. Remember that we require (3) to have a unique solution. For the TCE learner, is unique as long as it exists. Therefore, , is sufficient to guarantee a unique solution to (3). Let the poisoned data be . Instantiating model estimation (2), (3) for TCE, we have
where is the indicator function, and
However, due to the strict inequality, (10) induces an open set in the space, on which the minimum of (5) may not be well-defined. Instead, we require a stronger attack goal which leads to a closed subset in the space. This is defined as the following -robust target polytope.
(-robust target polytope) The set of -robust functions induced by a target policy is the polytope
for a fixed .
The margin parameter ensures that is the unique optimal policy for any in the polytope. We now have a solvable attack problem, where the attacker wants to force the victim’s function into the -robust target polytope :
The constraint (15) enforces Bellman optimality on the value function , in which is replaced by , since the target policy is guaranteed to be optimal by (15). Note that problem (12)-(15) is a convex program with linear constraints given that , thus could be solved to global optimality. However, we point out that (12)-(15) is a more stringent formulation than (5)-(7) due to the additional margin parameter we introduced. The feasible set of (12)-(15) is a subset of (5)-(7). Therefore, the optimal solution to (12)-(15) could in general be a sub-optimal solution to (5)-(7) with potentially larger objective value. We now study a few theoretical properties of policy poisoning on TCE. All proofs are in the appendix. First of all, the attack is always feasible.
Proposition 1 states that for any target policy , there exists a perturbation on the rewards that teaches the learner that policy. Therefore, the attacker changing ’s but not is already sufficient for policy poisoning.
We next bound the attack cost. Let the MDP estimated on the clean data be . Let be the function that satisfies the Bellman optimality equation on . Define , where takes the maximum over 0. Intuitively, measures how suboptimal the target policy is compared to the clean optimal policy learned on , up to a margin parameter .
If , then the optimal attack cost is . If , then the optimal attack cost is . If , then the optimal attack cost is .
Note that both the upper and lower bounds on the attack cost are linear with respect to , which can be estimated directly from the clean training set . This allows the attacker to easily estimate its attack cost before actually solving the attack problem.
4.2 Poisoning a Linear Quadratic Regulator (LQR) Victim
As the second example, we study an LQR victim that performs system identification from a batch training set Dean et al. (2017). Let the linear dynamical system be
where , is the state, is the control signal, and is a Gaussian noise. When the agent takes action at state , it suffers a quadratic loss of the general form
for some , , and . Here we have redefined the symbols and in order to conform with the notation convention in LQR: now we use for the quadratic loss matrix associated with state, not the action-value function; we use for the quadratic loss matrix associated with action, not the reward function. The previous reward function in general MDP (section 3) is now equivalent to the negative loss . This form of loss captures various LQR control problems. Note that the above linear dynamical system can be viewed as an MDP with transition kernel and reward function . The environment is thus characterized by matrices , (for transition kernel) and , , , (for reward function), which are all unknown to the learner.
We assume the clean training data was generated by running the linear system for multiple episodes following some random policy Dean et al. (2017). Let the poisoned data be . Instantiating model estimation (2), (3), the learner performs system identification on the poisoned data:
Note that in (20), the learner uses a stronger constraint than the original constraint , which guarantees that the minimizer can be achieved. The conditions to further guarantee (20) having a unique solution depend on the property of certain matrices formed by the clean training set , which we defer to appendix D.
The learner then computes the optimal control policy with respect to , , , , and . We assume the learner solves a discounted version of LQR control
where the expectation is over . It is known that the control problem has a closed-form solution , where
Here is the unique solution of the Algebraic Riccati Equation,
and is a vector that satisfies
The attacker aims to force the victim into taking target action . Note that in LQR, the attacker cannot arbitrarily choose , as the optimal control policy and enforce a linear structural constraint between and . One can easily see that the target action must obey for some in order to achieve successful attack. Therefore we must assume instead that the attacker has a target policy specified by a pair . However, an arbitrarily linear policy may still not be feasible. A target policy is feasible if and only if it is produced by solving some Riccati equation, namely, it must lie in the following set:
Therefore, to guarantee feasibility, we assume the attacker always picks the target policy by solving an LQR problem with some attacker-defined loss function. We can now pose the policy poisoning attack problem:
Note that the estimated transition matrices , are not optimization variables because the attacker can only modify the rewards, which will not change the learner’s estimate on and . The attack optimization (27)-(33) is hard to solve due to the constraint (33) itself being a semi-definite program (SDP). To overcome the difficulty, we pull all the positive semi-definite constraints out of the lower-level optimization. This leads to a more stringent surrogate attack optimization (see appendix C). Solving the surrogate attack problem, whose feasible region is a subset of the original problem, in general gives a suboptimal solution to (27)-(33). But it comes with one advantage: convexity.
5.1 Policy Poisoning Attack on TCE Victim
Experiment 1. We consider a simple MDP with two states and two actions: stay in the same state or move to the other state, shown in figure 0(a). The discounting factor is . The MDP’s values are shown in table 0(b). Note that the optimal policy will always pick action stay.
The clean training data reflects this underlying MDP, and consists of 4 tuples:
Let the attacker’s target policy be move, for any state . The attacker sets and uses , i.e. as the attack cost. Solving the policy poisoning attack optimization problem (12)-(15) produces the poisoned data:
with attack cost . The resulting poisoned values are shown in table 0(c). To verify this attack, we run TCE learner on both clean data and poisoned data. Specifically, we estimate the transition kernel and the reward function as in (8) and (9) on each data set, and then run value iteration until the values converge. In Figure 0(d), we show the trajectory of values for state , where the and axes denote and respectively. All trajectories start at . The dots on the trajectory correspond to each step of value iteration, while the star denotes the converged values. The diagonal dashed line is the (zero margin) policy boundary, while the gray region is the -robust target polytope with an offset to the policy boundary. The trajectory of clean data converges to a point below the policy boundary, where the action is optimal. With the poisoned data, the trajectory of values converge to a point exactly on the boundary of the -robust target polytope, where the action becomes optimal. This validates our attack.
We also compare our attack with reward shaping Ng et al. (1999). We let the potential function be the optimal value function for all to shape the clean dataset. The dataset after shaping is
In Figure 0(d), we show the trajectory of values after reward shaping. Note that same as on clean dataset, the trajectory after shaping converges to a point also below the policy boundary. This means reward shaping can not make the learner learn a different policy from the original optimal policy. Also note that after reward shaping, value iteration converges much faster (in only one iteration), which matches the benefits of reward shaping shown in Ng et al. (1999). More importantly, this illustrates the difference between our attack and reward shaping.
Experiment 2. As another example, we consider the grid world tasks in Cakmak and Lopes (2012). In particular, we focus on two tasks shown in figure 1(a) and 1(b). In figure 1(a), the agent starts from S and aims to arrive at the terminal cell G. The black regions are walls, thus the agent can only choose to go through the white or gray regions. The agent can take four actions in every state: go left, right, up or down, and stays if the action takes it into the wall. Reaching a gray, white, or the terminal state results in rewards , , 2, respectively. After the agent arrives at the terminal state G, it will stay there forever and always receive reward 0 regardless of the following actions. The original optimal policy is to follow the blue trajectory. The attacker’s goal is to force the agent to follow the red trajectory. Correspondingly, we set the target actions for those states on the red trajectory as along the trajectory. We set the target actions for the remaining states to be the same as the original optimal policy learned on clean data.
The clean training data contains a single item for every state-action pair. We run the attack with and . Our attack is successful: with the poisoned data, TCE generates a policy that produces the red trajectory in Figure 1(a), which is the desired behavior. The attack cost is , which is small compared to . In Figure 1(a), we show the poisoning on rewards. Each state-action pair is denoted by an orange arrow. The value tagged to each arrow is the modification to that reward, where red value means the reward is increased and blue means decreased. An arrow without any tagged value means the corresponding reward is not changed by attack. Note that rewards along the red trajectory are increased, while those along the blue trajectory are reduced, resulting in the red trajectory being preferred by the agent. Furthermore, rewards closer to the starting state S suffer larger poisoning since they contribute more to the values. For the large attack +2.139 happening at terminal state, we provide an explanation in appendix E.
Experiment 3. In Figure 1(b) there are two terminal states G1 and G2 with reward 1 and 2, respectively. The agent starts from S. Although G2 is more profitable, the path is longer and each step has a reward. Therefore, the original optimal policy is the blue trajectory to G1. The attacker’s target policy is to force the agent along the red trajectory to G2. We set the target actions for states as in experiment 2. The clean training data contains a single item for every state-action pair. We run our attack with and . Again, after the attack, TCE on the poisoned dataset produces the red trajectory in figure 1(b), with attack cost , compared to . The reward poisoning follows a similar pattern to experiment 2.
5.2 Policy Poisoning Attack on LQR Victim
Experiment 4. We now demonstrate our attack on LQR. We consider a linear dynamical system that approximately models a vehicle. The state of the vehicle consists of its 2D position and 2D velocity: . The control signal at time is the force which will be applied on the vehicle for seconds. We assume there is a friction parameter such that the friction force is . Let be the mass of the vehicle. Given small enough , the transition matrices can be approximated by (17) where
In this example, we let , , , and with . The vehicle starts from initial position with velocity , i.e., . The true loss function is with and (i.e., in (18)). Throughout the experiment, we let for solving the optimal control policy in (21). With the true dynamics and loss function, the computed optimal control policy is
which will drive the vehicle to the origin.
The batch LQR learner estimates the dynamics and the loss function from a batch training data. To produce the training data, we let the vehicle start from state and simulate its trajectory with a random control policy. Specifically, in each time step, we uniformly sample a control signal in a unit sphere. The vehicle then takes action to transit from current state to the next state , and receives a reward . This gives us one training item . We simulate a total of 400 time steps to obtain a batch data that contains 400 items, on which the learner estimates the dynamics and the loss function. With the learner’s estimate, the computed clean optimal policy is
The clean optimal policy differs slightly from the true optimal policy due to the inaccuracy of the learner’s estimate. The attacker has a target policy that can drive the vehicle close to its target destination with terminal velocity , which can be represented as a target state . To ensure feasibility, we assume that the attacker starts with the loss function where . Due to the offset this corresponds to setting in (18). The attacker then solves the Riccati equation with its own loss function and the learner’s estimates and to arrive at the target policy
We run our attack (27)-(33) with and in (33). Figure 3 shows the result of our attack. In Figure 2(a), we plot the trajectory of the vehicle with policy learned on clean data and poisoned data respectively. Our attack successfully forces LQR into a policy that drives the vehicle close to the target destination. The wiggle on the trajectory is due to the noise of the dynamical system. On the poisoned data, the LQR victim learns the policy
which matches exactly the target policy , . In Figure 2(b), we show the poisoning on rewards. Our attack leads to very small modification on each reward, thus the attack is efficient. The total attack cost over all 400 items is only , which is tiny small compared to . The results here demonstrate that our attack can dramatically change the behavior of LQR by only slightly modifying the rewards in the dataset.
Finally, for both attacks on TCE and LQR, we note that by setting the attack cost norm in (5), the attacker is able to obtain a sparse attack, meaning that only a small fraction of the batch data needs to be poisoned. Such sparse attacks have profound implications in adversarial machine learning as they can be easier to carry out and harder to detect. We show detailed results in appendix E.
We presented a policy poisoning framework against batch reinforcement learning and control. We showed the attack problem can be formulated as convex optimization. We provided theoretical analysis on attack feasibility and cost. Experiments show the attack can force the learner into an attacker-chosen target policy while incurring only a small attack cost.
Acknowledgments. This work is supported in part by NSF 1545481, 1561512, 1623605, 1704117, 1836978 and the MADLab AF Center of Excellence FA9550-18-1-0166.
- Asmuth et al. (2008) John Asmuth, Michael L Littman, and Robert Zinkov. Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd national conference on Artificial intelligence-Volume 2, pages 604–609. AAAI Press, 2008.
- Biggio and Roli (2018) Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317–331, 2018.
- Biggio et al. (2012) Battista Biggio, B Nelson, and P Laskov. Poisoning attacks against support vector machines. In 29th International Conference on Machine Learning, pages 1807–1814. ArXiv e-prints, 2012.
- Brown and Niekum (2019) Daniel S Brown and Scott Niekum. Machine teaching for inverse reinforcement learning: Algorithms and applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7749–7758, 2019.
- Cakmak and Lopes (2012) Maya Cakmak and Manuel Lopes. Algorithmic and human teaching of sequential decision tasks. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
- Dean et al. (2017) Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.
- Devlin and Kudenko (2012) Sam Michael Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pages 433–440. IFAAMAS, 2012.
- Diamond and Boyd (2016) Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
- Huang et al. (2011) Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. Adversarial machine learning. In Proceedings of the 4th ACM workshop on Security and artificial intelligence, pages 43–58. ACM, 2011.
- Huang et al. (2017) Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
- Jun et al. (2018) Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pages 3640–3649, 2018.
- Kamalaruban et al. (2019) P Kamalaruban, R Devidze, V Cevher, and A Singla. Interactive teaching algorithms for inverse reinforcement learning. In 28th International Joint Conference on Artificial Intelligence, pages 604–609, 2019.
- Koh et al. (2018) Pang Wei Koh, Jacob Steinhardt, and Percy Liang. Stronger data poisoning attacks break data sanitization defenses. arXiv preprint arXiv:1811.00741, 2018.
- Li et al. (2016) Bo Li, Yining Wang, Aarti Singh, and Yevgeniy Vorobeychik. Data poisoning attacks on factorization-based collaborative filtering. In Advances in neural information processing systems, pages 1885–1893, 2016.
- Liu and Shroff (2019) Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, pages 4042–4050, 2019.
- Ma et al. (2018) Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pages 186–204. Springer, 2018.
- Mei and Zhu (2015) Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- Ng et al. (1999) Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
- Wang and Chaudhuri (2018) Yizhen Wang and Kamalika Chaudhuri. Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994, 2018.
- Wiewiora (2003) Eric Wiewiora. Potential-based shaping and q-value initialization are equivalent. Journal of Artificial Intelligence Research, 19:205–208, 2003.
- Xiao et al. (2015) Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. Is feature selection secure against training data poisoning? In International Conference on Machine Learning, pages 1689–1698, 2015.
- Zhang and Zhu (2019) Xuezhou Zhang and Xiaojin Zhu. Online data poisoning attack. arXiv preprint arXiv:1903.01666, 2019.
Appendix A Proof of Proposition 1
The proof of feasibility relies on the following result, which states that there is a bijection mapping between reward space and value function space.
Given an MDP with transition probability function and discounting factor , let denote the set of all possible reward functions, and let denote the set of all possible Q tables. Then, there exists a bijection mapping between and , induced by Bellman optimality equation.
Given any reward function , define the Bellman operator as
Since , is a contraction mapping, i.e., , . Then by Banach Fixed Point Theorem, there is a unique that satisfies , which is the that maps to.
Given any , one can define the corresponding by
Thus the mapping is one-to-one.
For any target policy , we construct the following :
Appendix B Proof of Theorem 2
The proof of Theorem 2 relies on a few lemmas. We first prove the following result, which shows that given two vectors that have equal element summation, the vector whose elements are smoother will have smaller norm for any . This result is used later to prove Lemma 6.
Let be two vectors. Let be a subset of indexes such that
Then for any , we have .
Note that the conditions and suggest the summation of elements in and are equal, and only elements in differ for the two vectors. However, the elements in of are smoother than that of , thus has smaller norm. To prove the result, we consider three cases separately.
Case 1: . Then we have
Case 2: . We show . Note that
Let . By Holder’s inequality, we have
Case 3: . We have
Therefore , we have .
Next we prove Lemma 6, which shows that one possible optimal attack solution to (12)-(15) takes the following form: shift all the clean rewards in by the same amount . Here is a function of state and action . That means, rewards belonging to different might be shifted a different amount, but those corresponding to the same pair will be identically shifted.
We point out that although there exists an optimal attack taking the above form, it is not necessarily the only optimal solution. However, all those optimal solutions must have exactly the same objective value (attack cost), thus it suffices to consider the solution in Lemma 6.
Let be the reward function for the pair estimated from clean data . We then define a different poisoned reward vector , where
Now we show , and is another optimal solution to (12)-(15). We first verify that , , and satisfy constraints (15)-(15). To verify (15), we only need to check , since and only differ on those rewards in . We have
But note that by our assumption, is an optimal solution, thus , which gives . This suggests , , and is another optimal solution. Compared to , differs in that now becomes identical for all for a particular pair. Reusing the above argument iteratively, one can make identical for all for all pairs, while guaranteeing the solution is still optimal. Therefore, we have
Finally, Lemma 7 provides a sensitive analysis on the value function as the reward function changes.
Let and be two MDPs, where only the reward function differs. Let and be action values satisfying the Bellman optimality equation on and respectively, then
Define the Bellman operator as
From now on we suppress variables and for convenience. Note that due to the Bellman optimality, we have and , thus
Rearranging we have . Similarly we have
Rearranging we have , concluding the proof.
Now we are ready to prove our main result. See 2
We construct the following value function .
Note that and , we have
which leads to
thus we have and ,
Therefore satisfies the constraint (15). By proposition 4, there exists a unique function such that satisfies the Bellman optimality equation of MDP . We then construct the following reward vector such that and , , where is the reward function estimated from . The reward function estimated on is then
Thus , and is a feasible solution to (12)-(15). Now we analyze the attack cost for , which gives us a natural upper bound on the attack cost of the optimal solution . Note that and satisfy the Bellman optimality equation for reward function and respectively, and
thus by Lemma 7, we have ,
Therefore, we have
Now we prove the lower bound. We consider two cases separately.
Case 1: . We must have , . In this case no attack is needed and therefore the optimal solution is . The lower bound holds trivially.
Case 2: . Let and () be a state-action pair such that
Constraint (15) ensures that , in which case either one of the following two conditions must hold:
since otherwise we have
Next note that if either or holds, we have . By Lemma 7, we have
Let , then we have
Therefore, we have
We finally point out that while an optimal solution may not necessarily take the form in Lemma 6, it suffices to bound the cost of an optimal attack which indeed takes this form (as we did in the proof) since all optimal attacks have exactly the same objective value.
Appendix C Convex Surrogate for LQR Attack Optimization
The feasible set of (73)-(79) is a subset of the original problem, thus the surrogate attack optimization is a more stringent formulation than the original attack optimization, that is, successfully solving the surrogate optimization gives us a (potentially) sub-optimal solution to the original problem. To see why the surrogate optimization is more stringent, we illustrate with a much simpler example as below. A formal proof is straight forward, thus we omit it here. The original problem is (80)-(81). The feasible set for is a singleton set , and the optimal objective value is 0.
Once we pull the constraint out of the lower-level optimization (81), we end up with a surrogate optimization (82)-(84). Note that (84) requires , which does not satisfy (84). Therefore the feasible set of the surrogate optimization is , meaning it is more stringent than (80)-(81).
First note that the sub-level optimization (79) is itself a convex problem, thus is equivalent to the corresponding KKT condition. We write out the KKT condition of (79) to derive an explicit form of our attack formulation as below:
Note that these three equality constraints are all linear in , , , and . (94) is linear in and . (94)-(94) are also linear in , , , and . Finally, (94) contains convex constraints on , , and . Given all above, the attack problem is convex.
Next we analyze the feasibility of the surrogate attack optimization.
Let , be the learner’s estimated transition kernel. Let
be the attacker-defined loss function. Assume . If the target policy , is the optimal control policy induced by the LQR with transition kernel , , and loss function , then the surrogate attack optimization (73)-(79) is feasible. Furthermore, the optimal solution can be achieved.
and be the vector whose -th element is . We next show that , , , , , together with some and is a feasible solution. Note that since , is induced by the LQR with transition kernel , and cost function , constraints (79)-(79) must be satisfied with some and . The poisoned reward vector obviously satisfies (79) since it is constructed exactly as the minimizer. By our assumption, , thus (79) is satisfied. Therefore, , , , , , together with the corresponding , is a feasible solution, and the optimization (73)-(79) is feasible. Furthermore, since the feasible set is closed, the optimal solution can be achieved.
Appendix D Conditions for The LQR Learner to Have Unique Estimate
The LQR learner estimates the cost function by
We want to find a condition that guarantees the uniqueness of the solution.
Let be a vector, whose -th element is
Note that we can view as a function of , , , , and , thus we can also denote . Define , i.e., all possible vectors that are achievable with form (101) if we vary , , and subject to positive semi-definite constraints on and . We can prove that is a closed convex set.
, is a closed convex set.
Let . We use to denote the -th element of vector . Then we have
for some , , and , and
for some , , and . , let . Then the -th element of is
Since and , , concluding the proof.
The optimization (100) is intrinsically a least-squares problem with positive semi-definite constraints on and , and is equivalent to solving the following linear equation:
where is the projection of the negative reward vector onto the set . The solution to (105) is unique if and only if the following two conditions both hold
The projection is unique.
(105) has a unique solution for .
Condition is satisfied because is convex, and any projection (in norm) onto a convex set exists and is always unique (see Hilbert Projection Theorem). We next analyze when condition holds. (105) is a linear function in , , , and , thus one can vectorize and to obtain a problem in the form of linear regression. Then the uniqueness is guaranteed if and only if the design matrix has full column rank. Specifically, let , , and . Let and denote the -th element of and respectively. Define
then (105) is equivalent to , where contains the vectorized variables , , and . has a unique solution if and only if has full column rank.
Appendix E Sparse Attacks on TCE and LQR
In this section, we present experimental details for both TCE and LQR victims when the attacker uses norm to measure the attack cost, i.e. . The other experimental parameters are set exactly the same as in the main text.
We first show the result for MDP experiment 2 with , see Figure 4. The attack cost is , which is small compared to . We note that the reward poisoning is extremely sparse: only the reward corresponding to action “go up” at the terminal state is increased by 3.27, and all other rewards remain unchanged. To explain this attack, first note that we set the target action for the terminal state to “go up”, thus the corresponding reward must be increased. Next note that after the attack, the terminal state becomes a sweet spot, where the agent can keep taking action “go up” to gain large amount of discounted future reward. However, such future reward is discounted more if the agent reaches the terminal state via a longer path. Therefore, the agent will choose to go along the red trajectory to get into the terminal state earlier, though at a price of two discounted rewards.
The result is similar for MDP experiment 3. The attack cost is , compared to . In Figure 5, we show the reward modification for each state action pair. Again, the attack is very sparse: only rewards of 12 state-action pairs are modified out of a total of 124.
Finally, we show the result on attacking LQR with . The attack cost is , compared to . In Figure 6, we plot the clean and poisoned trajectory of the vehicle, together with the reward modification in each time step. The attack is as effective as with a dense 2-norm attack in Figure 3. However, the poisoning is highly sparse: only 10 out of 400 rewards are changed.
Appendix F Derivation of Discounted Discrete-time Algebraic Riccati Equation
We provide a derivation for the discounted Discrete-time Algebraic Riccati Equation. For simplicity, we consider the noiseless case, but the derivation easily generalizes to noisy case. We consider the loss function is a general quadratic function w.r.t. as follows:
When , we recover the classic LQR setting. Assume the general value function takes the form . Let (note that this is different notation from the matrix in ) be the corresponding action value function. We perform dynamics programming as follows:
We minimize above:
Now we substitute it back to and regroup terms, we get:
for some constant , which gives us the following recursion: