Variable Impedance Control in End-Effector Space:
An Action Space for Reinforcement Learning in Contact-Rich Tasks
Abstract
Reinforcement Learning (RL) of contact-rich manipulation tasks has yielded impressive results in recent years. While many studies in RL focus on varying the observation space or reward model, few efforts have focused on the choice of action space (e.g. joint or end-effector space, position, velocity, etc.). However, studies in robot motion control indicate that choosing an action space that conforms to the characteristics of the task can simplify exploration and improve robustness to disturbances. This paper studies the effect of different action spaces in deep RL and advocates for variable impedance control in end-effector space (VICES) as an advantageous action space for constrained and contact-rich tasks. We evaluate multiple action spaces on three prototypical manipulation tasks: Path Following (task with no contact), Door Opening (task with kinematic constraints), and Surface Wiping (task with continuous contact). We show that VICES improves sample efficiency, maintains low energy consumption, and ensures safety across all three experimental setups. Further, RL policies learned with VICES can transfer across different robot models in simulation, and from simulation to real for the same robot. Further information is available at https://stanfordvl.github.io/vices.
I Introduction
Diverse control tasks in robot manipulation are naturally addressed in different action spaces: for a specific task, one action space might simplify learning and control more than another. For a walking robot, it is important to directly control contact interactions to avoid slippage [Ludo_Contact_2013]. In contrast, for a tennis swing, it is important to track and control position, velocity, and at times the acceleration of the end-effector [DMP_Initial_2002]. For a surface-to-surface alignment task, minimizing the moment around a contact is important for robustness [khansari2016adaptive]. Tackling all these tasks requires solving two subproblems: (1) the generation of reference signals (desired contacts, trajectories, moments, etc.), and (2) the tracking of these signals.
Control systems for robots can be structured with two feedback control loops that address these subproblems: an outer loop controller that generates a time-varying reference trajectory, and an inner loop controller that tracks this trajectory. Let us refer to the outer loop as a functional map from observations to reference signals $g(o):\mathrm{O}\to \mathrm{A}$, and the inner loop as a map from reference signals to actuation commands $f(a):\mathrm{A}\to \mathrm{U}$. The combined control law becomes $u=f\circ g(o)$, where $o\in \mathrm{O}$ is some observation, $a\in \mathrm{A}$ is an abstract action providing a reference signal in some space, and $u\in \mathrm{U}$ is the control command sent to the robot's actuators to track this reference. While control theory provides a vast repertoire of strategies to map from reference signals to actuation commands ($f(\cdot )$ implementations), a key problem in robotics is to generate a viable reference in a suitable space given raw sensory observations, i.e. modelling the function $g(\cdot )$ and deciding on the interface to $f(\cdot )$. Once the interface has been defined (e.g. forces, positions, contact points, etc.), the generation of reference signals by $g(\cdot )$ can be addressed in multiple ways, ranging from hand-tuned state machines [sen2016automating, khansari2016adaptive], trajectory optimization [18toussaintRSS, 2017_rss_system], and imitation learning [schaal2003computational, Ijspeert:2013:DMP, Calinon08IROS, kroemer2015towards] to reinforcement learning (RL) [lee2019making, harrison2017adapt].
Recent research in RL has focused on "observations-to-torques" [levine2016end], which is akin to merging $f(\cdot )$ and $g(\cdot )$ into a single learned model. Other methods use higher-level action spaces, such as joint space commands (e.g. position or velocity) [gu2017deep, haarnoja2018soft, vevcerik2017leveraging, zhu2018reinforcement] or task space commands (e.g. end-effector poses, force or fixed impedance) [lee2019making, Mrinal:2011]. These works typically focus on the effects of choosing an observation space $\mathrm{O}$ on the learning process and rarely justify the choice of action space $\mathrm{A}$. However, the choice of action space defines the quantity around which the inner control loop is closed, and by extension the space wherein tracking error is minimized. This critically impacts robustness and task performance as well as learning efficiency and exploration.
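The two-loop structure $u=f\circ g(o)$ described above can be sketched in a few lines of Python. This is only an illustration of the interface between the loops: `outer_loop`, `inner_loop`, the offset and the gain are hypothetical placeholders, not the controllers or policies studied in the paper.

```python
from typing import Callable, List

Vector = List[float]


def compose(f: Callable[[Vector], Vector],
            g: Callable[[Vector], Vector]) -> Callable[[Vector], Vector]:
    """Return the combined control law u = f(g(o))."""
    def control_law(o: Vector) -> Vector:
        a = g(o)      # outer loop: observation -> reference signal (action)
        return f(a)   # inner loop: reference signal -> actuation command
    return control_law


def outer_loop(o: Vector) -> Vector:
    # Hypothetical g: the reference is the observed value plus a fixed offset.
    return [x + 0.5 for x in o]


def inner_loop(a: Vector) -> Vector:
    # Hypothetical f: a pure proportional map with an assumed gain of 2.
    return [2.0 * x for x in a]


control = compose(inner_loop, outer_loop)
u = control([0.0, 1.0])
print(u)
```

In an RL setting, `outer_loop` is the part that is learned, while the choice of what `a` means (joint torque, joint position, end-effector pose, ...) fixes which `inner_loop` is needed.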
Moreover, previously proposed action spaces do not necessarily create suitable references for some contact-rich tasks. Consider the task of wiping a board, wherein a robot must control forces in some directions (to keep pressing against the board) and motion in others. The physical constraints of the task dictate along which axes the robot should be stiff and along which it should be compliant, and this can often be time-varying. Manual specification of the task constraints is not a scalable solution for the variety of contact-rich tasks the robot may need to perform, and for some tasks, manual specification may be nontrivial.
This paper studies how the selection of an interface between $g(\cdot )$ and $f(\cdot )$ (the space $\mathrm{A}$) affects RL as a method to learn the mapping from observations to reference signals, $g(\cdot )$, and presents the first empirical study comparing the most common choices for $\mathrm{A}$ in contact-rich manipulation. We argue that an action space that captures motion and impedance in end-effector space can enable efficient learning of such tasks. We evaluate joint position, joint velocity, joint torque, joint variable impedance, as well as fixed and variable impedance in end-effector space. The choice of action space should be guided not only by the robot model but also by prior knowledge of the task. Hence, we compare action spaces across tasks with varying degrees of task-space constraints, i.e., free-space motion with no contact (path following), manipulation of constrained mechanisms (door opening), and continuous unconstrained contact (surface wiping). Moreover, we introduce variable impedance control in end-effector space (VICES) and advocate this action space for deep RL algorithms applied to contact-rich manipulation. We show that policies defined in VICES improve sample efficiency for exploration in RL, improve energy efficiency, and reduce applied forces. 
Thanks to the classical dynamically consistent operational space formalism [khatib1987unified], we observe that policies learned in end-effector space are also more robust to transfer across robots with significant differences in dynamics, whether in simulation or the real world.
II Related Work
Robot Motion Control: Compliant control of a robot manipulator enables adaptation to uncertainty in the environment (e.g. exact shape of a surface, or kinematic constraints of mechanisms) during contact-rich tasks. However, certain tasks require direct control of the contact interactions (e.g. not applying too much force when wiping a window). Previous abstractions have divided dimensions of the task into those that are controlled kinematically (through position and velocity) and dynamically (through force and torques) [4308708, khatib1987unified, kroger2004compliant]. However, in practice, the hard decoupling requires a level of knowledge about the task that is not always available [508440]. Impedance control [part1985impedance] allows safe robot contact manipulation with an unknown environment by explicitly controlling the amount of force the robot exerts when it deviates from a given kinematic goal. Therefore, it alleviates the need for perfect knowledge or hard separation between the dynamic and kinematic task dimensions. However, different phases of a manipulation task may require a dynamic balancing between kinematic and dynamic control. Existing methods address this by scheduling variable impedance gains to maintain stability or safety for a given kinematic trajectory [Mitrovic2011, Li2018, ruckert2013learned]. However, these methods assume that a reference trajectory is given. Instead, we propose to directly predict both end-effector displacement (reference) and variable impedance gains based on observations.
Action Spaces in Learning from Demonstrations: Learning from demonstrations (LfD) derives a task policy based on demonstrations provided by some other agent(s) [Argall:2009:SRL]. If the demonstrations do not perfectly overlap, a possible approach is to derive a policy that imitates the mean motion of the demonstration set, and varies the stiffness according to the coherence of the trials [6636303, 5648931] or according to the force sensed during kinesthetic replay [AbuDakka2018]. 
Similar use of variable impedance as an action representation for LfD has been demonstrated to be successful for adaptive grasping [6907861], manipulation of deformable objects [lee2015learning], and co-adaptation to human workers [rozo2016learning]. However, the specification of impedance in the demonstration only reflects variability in the demonstrated trajectories, not the underlying task constraints imposed by the environment nor the force profiles required for the task. This approach is also restricted to tasks where expert demonstrations are feasible, and hence is limited in application to kinematic tasks with phases that require different levels of precision.
Reinforcement Learning: In the field of model-based reinforcement learning, Kim et al. [kim2010impedance] proposed a method to learn the parameters of a variable impedance position controller in end-effector space based on the equilibrium point formalism. They demonstrated the convergence, robustness and energy efficiency of their method on a simulated manipulation task with a two-DoF planar arm. However, their method requires an initial trajectory, which is not always available. Similar to our method, Buchli et al. [buchli2011learning] apply policy improvements with path integrals (PI${}^{2}$) [theodorou2010generalized] to refine initial trajectories and learn variable scheduling for the joint impedance parameters. They demonstrate that energy consumption can be optimized while achieving a task using variable impedance. However, they use joint space as their action space, which limits the transferability of the learned behaviors to different robots and the optimality of the trajectories in the space of the task. Also, their method requires an initial estimate of the solution to start the iteration. Rey et al. [Rey2018] propose an approach to simultaneously learn kinematic trajectories from demonstrations and variable impedance in task space from exploration. 
They use Gaussian Mixture Regression as the representation for the policy and demonstrate their method in simulation and in one real-world planar task with a single stiffness parameter. The authors of [Mrinal:2011] proposed a method to refine given trajectories with additional force profiles using PI${}^{2}$ [theodorou2010generalized] and readings from a force-torque sensor. We aim to achieve dynamic behavior without direct force loop control by using impedance to learn both trajectories and variable stiffness profiles. It is worth noting that all previous approaches have bootstrapped learning with initial demonstrations, while we explore learning from scratch to better understand how fast policies converge.
Viereck et al. [Viereck2018] studied how to incorporate control structure to learn hopping policies for a one-legged robot with RL. They use an optimal controller for fixed task conditions and learn to imitate its policy with neural networks to generalize to new task conditions. Interestingly, they compare two network architectures outputting signals in different action spaces: either desired torques directly, or full feedback parameters and a desired configuration, which are transformed into torques with an analytic function. Their experiments show that this second action space is best suited for the hopping task with intermittent contact and adds interpretability to the network output. We study a set of analytic functions (controllers) that map policy actions to low-level robot commands for robot manipulation in three tasks with different contact properties. With a motivation related to ours, Peng et al. [Peng:2017:LLS:3099564.3099567] studied the importance of different action representations in RL for the task of locomotion. Similar to us, they aimed to shed light on the best action space to be used, but they focused on imitation learning in bipedal motion of simulated agents. 
We would like to provide similar insights in the more complex contact-rich robot manipulation domain and include preliminary studies of transfer to the real world.
III Reinforcement Learning
The goal in reinforcement learning is to find a policy $\pi $ that selects actions based on current observations so as to maximize the expected reward obtained from interactions with the environment [sutton2018reinforcement]. We assume that the underlying problem can be modelled as a discrete-time continuous Markov decision process ($\mathrm{S}$, $\mathrm{A}$, $\mathrm{P}$, $\mathrm{R}$, $\gamma $, $\rho $), where $\mathrm{S}$ is a continuous state space, $\mathrm{A}$ is a continuous action space, $\mathrm{P}$ is a Markovian transition model defining the probability of transitioning between states for a given action, $P({s}^{\prime}\mid s,a)$, $\mathrm{R}$ is a reward function $\mathrm{R}(s,a)=r\in \mathbb{R}$, $\gamma \in [0,1)$ is a discount factor (for infinite horizon problems) and $\rho $ is the initial state distribution. When $\pi $ is probabilistic it represents the probability of the action $a$ given the state $s$: $\pi (a\mid s)=p(a\mid s)$, and $\pi (\cdot \mid s)$ is the probability density over $\mathrm{A}$ in state $s$. Alternatively, we can assume the state is not directly observable and learn a policy $\pi (a\mid o)$ conditioned on observations $o\in \mathrm{O}$ instead of latent states $s\in \mathrm{S}$. Herein, the agent following the policy $\pi $ obtains an observation ${o}_{t}$ at time $t$ and performs an action ${a}_{t}$, receiving from the environment an immediate reward ${r}_{t}$ and a new observation ${o}_{t+1}$. Assuming the policy is parameterized by $\theta $, a policy gradient algorithm optimizes $\theta $ to maximize the expected future return:
$${\theta}^{*}=\underset{\theta}{\mathrm{arg}\mathrm{max}}J(\theta ,\rho )=\underset{\theta}{\mathrm{arg}\mathrm{max}}E\left[\sum _{t}{\gamma}^{t}{r}_{t}\right]$$  (1) 
These algorithms are based on the policy gradient theorem, which states: $\frac{\delta J}{\delta \theta}={\sum}_{s}{\mu}_{{\pi}_{\theta}}(s,\rho ){\sum}_{a}\frac{\delta {\pi}_{\theta}(a\mid s)}{\delta \theta}{Q}_{{\pi}_{\theta}}(s,a)$, where ${\mu}_{{\pi}_{\theta}}(s,\rho )={\sum}_{t}{\gamma}^{t}P({s}_{t}=s\mid \rho )$ and ${Q}_{{\pi}_{\theta}}$ is the action-value function associated with the current policy ${\pi}_{\theta}$. There are multiple algorithmic solutions based on the policy gradient theorem that allow us to represent the policy with a deep neural network, e.g. Trust Region Policy Optimization (TRPO) [schulman2015trust], Deep Deterministic Policy Gradients (DDPG) [lillicrap2015continuous], or Advantage Actor-Critic (A2C) [mnih2016asynchronous]. In our evaluation of different action spaces for policies, we will use Proximal Policy Optimization (PPO) [schulman2017proximal]. The evaluation of the sensitivity of different algorithms to the action space is deferred to future work.
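As a small illustration of the objective in Eq. (1), the discounted return $\sum_t \gamma^t r_t$ of a single episode can be computed as below. This is a generic sketch, not part of PPO or of the authors' implementation; the function name and the example rewards are made up for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute sum_t gamma**t * r_t for one episode.

    Iterating backwards uses the recursion G_t = r_t + gamma * G_{t+1},
    which avoids computing explicit powers of gamma.
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps of unit reward with gamma = 0.5: 1 + 0.5 + 0.25
episode_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(episode_return)
```

Averaging this quantity over episodes sampled with initial states from $\rho$ gives a Monte Carlo estimate of $J(\theta ,\rho )$ for the current policy.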
IV Action Spaces in RL for Robot Manipulation
Relating the formalisms we introduced in Sec. I and III, $\pi $ corresponds to $g(\cdot )$, the function that maps observations to reference actions in some space $\mathrm{A}$, assuming that another function $f(\cdot )$ will map these actions to low-level control commands, $u\in \mathrm{U}$. We note that the RL algorithms in Sec. III are agnostic to the choice of action space $\mathrm{A}$. In practice, the most common choices of $\mathrm{A}$ in RL for robot manipulation are a) joint torques [levine2016end], b) joint velocities [gu2017deep, vevcerik2017leveraging, zhu2018reinforcement], c) joint positions [haarnoja2018soft], and d) end-effector position [lee2019making, thananjeyan2017multilateral], possibly with orientation. The most common lowest-level control commands (and the ones we assume for our underlying physical agent) are joint torques, $u=\tau \in \mathrm{T}$. Joint torques are safer than positions and velocities for contact-rich tasks in unstructured environments, because the forces the robot will apply on the environment are limited by the specified desired torques.
Manipulation tasks can seldom be solved by controlling motion alone, since many tasks involve contact and force constraints (e.g. the adaptation required to manipulate an articulated object or the minimum force to press while we wipe a surface) [4308708, khatib1987unified, 508440, kroger2004compliant]. To succeed in these tasks, the robot needs to dynamically modulate the force exerted on the environment through the torques on each joint. To map between action space $\mathrm{A}$ and actuation space $\mathrm{U}$, we can define analytic parameterized functions (i.e. controllers), ${f}_{\kappa}(a,{s}_{\text{robot}}):\mathrm{A}\times {\mathrm{S}}_{\text{robot}}\to \mathrm{U}$, that transform the output of the policy from the action space to the space of control commands depending on the current state of the robot, ${s}_{\text{robot}}\in {\mathrm{S}}_{\text{robot}}$. The parameters of these functions, $\kappa $, can be made part of the policy action space so that the agent has full control over the manipulation behavior [buchli2011learning]. In the following, we introduce the different choices of analytic controllers ${f}_{\kappa}$ that we use to map policy actions from commonly used policy action spaces into joint torques.
Joint Torques: When the policy directly outputs desired joint torques, i.e. $a={\tau}_{\text{des}}$, the function that transforms to robot commands is simply ($\text{jt}=\mathit{\text{joint torques}}$):
$$u={f}_{\kappa}^{\text{jt}}(a={\tau}_{\text{des}})={\tau}_{\text{des}}$$  (1) 
Joint Velocities: When the policy outputs reference joint velocities $a={\dot{q}}_{\text{des}}$, the function to map to joint torques is ($\text{jv}=\mathit{\text{joint velocities}}$):
$$u={f}_{\kappa}^{\text{jv}}(a={\dot{q}}_{\text{des}},{s}_{t}=\dot{q})={k}_{\text{v}}({\dot{q}}_{\text{des}}-\dot{q})$$  (2) 
where we close the loop around $\dot{q}$, the current joint velocity (state), and ${k}_{\text{v}}$ is a vector of proportional gains (parameter $\kappa $).
Joint Positions: For policies that output reference joint positions, $a={q}_{\text{des}}$, it is most straightforward to use a proportional-derivative (PD) controller that generates torques that increase with the joint position error and decrease with the current joint velocity. We also remove the dynamic effects of the mechanism by scaling the torques with the inertia matrix, $M$ [khatib1987unified]. The function to transform reference joint positions to joint torques is thus ($\text{jp}=\mathit{\text{joint positions}}$):
$$u={f}_{\kappa}^{\text{jp}}({q}_{\text{des}},q,\dot{q})=M\left[{k}_{\text{p}}\mathrm{\Delta}q-{k}_{\text{v}}\dot{q}\right]$$  (3) 
where $\mathrm{\Delta}q={q}_{\text{des}}-q$ is the difference between desired and current joint configurations, which can be used as an alternative policy action space. ${k}_{\text{p}}$ and ${k}_{\text{v}}$ are vectors of proportional and derivative gains (parameters $\kappa $).
End-Effector Pose: In the cases where the policy outputs the desired 6D pose $\in \mathbb{S}\mathbb{E}(3)$ of the robot in end-effector space, ${\mathbf{x}}_{\text{des}}$, we can use an impedance-based PD controller to first derive an end-effector space acceleration to move towards the goal. To do that, ${\mathbf{x}}_{\text{des}}$ can be decomposed into desired position, ${p}_{\text{des}}\in {\mathbb{R}}^{3}$, and desired orientation, ${R}_{\text{des}}\in \mathbb{S}\mathbb{O}(3)$. In the impedance-based PD controller, the end-effector acceleration increases with the difference between the desired end-effector pose and the current pose, $p$ and $R$, and decreases with the current end-effector velocity, $v$ and $\omega $. We then compute the robot actuations (joint torques) to achieve the desired end-effector space accelerations, leveraging the kinematic and dynamic models of the robot with the dynamically-consistent operational space formulation [Khatib1995a]. First, we compute the wrenches at the end-effector that correspond to the desired accelerations, $f\in {\mathbb{R}}^{6}$. Then, we map the wrenches in end-effector space $f$ to joint torque commands with the end-effector Jacobian at the current joint configuration $J=J(q)$: $u={J}^{T}f$. Thus, the function that maps end-effector space position and orientation to low-level robot commands is ($\text{ee}=\mathit{\text{end-effector space}}$):
$$u={f}_{\kappa}^{\text{ee}}({p}_{\text{des}},{R}_{\text{des}},p,R,v,\omega )={J}_{\text{pos}}^{T}\left[{\mathrm{\Lambda}}^{\text{pos}}\left[{k}_{\text{p}}^{\text{pos}}({p}_{\text{des}}-p)-{k}_{\text{v}}^{\text{pos}}v\right]\right]+{J}_{\text{ori}}^{T}\left[{\mathrm{\Lambda}}^{\text{ori}}\left[{k}_{\text{p}}^{\text{ori}}({R}_{\text{des}}\ominus R)-{k}_{\text{v}}^{\text{ori}}\omega \right]\right]$$  (4) 
where ${\mathrm{\Lambda}}^{\text{pos}}$ and ${\mathrm{\Lambda}}^{\text{ori}}$ are the parts corresponding to position and orientation in $\mathrm{\Lambda}\in {\mathbb{R}}^{6\times 6}$, the inertia matrix in the end-effector frame that decouples the end-effector motions, ${J}_{\text{pos}}$ and ${J}_{\text{ori}}$ are the position and orientation parts of the end-effector Jacobian, and $\ominus $ corresponds to the subtraction in $\mathbb{S}\mathbb{O}(3)$. The difference between desired and current position (${\mathrm{\Delta}}^{\text{pos}}={p}_{\text{des}}-p$) and between desired and current orientation (${\mathrm{\Delta}}^{\text{ori}}={R}_{\text{des}}\ominus R$) can be used as an alternative policy action space, $\mathrm{A}$. ${k}_{\text{p}}^{\text{pos}}$, ${k}_{\text{v}}^{\text{pos}}$, ${k}_{\text{p}}^{\text{ori}}$, and ${k}_{\text{v}}^{\text{ori}}$ are vectors of proportional and derivative gains for position and orientation (parameters $\kappa $), respectively.
Variable Impedance End-Effector Space (VICES): Thus far, we have defined transformations between policy actions $a$ and robot commands $u$ that are parameterized by $\kappa $. In these cases, the parameters $\kappa $ are manually specified. We observe that it is beneficial to augment the action space with these parameters to give the agent full control of the behavior. As discussed in Sec. II, this idea has been previously explored in joint space for ${k}_{\text{p}}$ and ${k}_{\text{v}}$ [buchli2011learning]. In this paper, we propose to also turn the parameters of the end-effector space function (${k}_{\text{p}}^{\text{pos}}$, ${k}_{\text{v}}^{\text{pos}}$, ${k}_{\text{p}}^{\text{ori}}$, and ${k}_{\text{v}}^{\text{ori}}$) into policy outputs. We term this action space Variable Impedance End-Effector Space (VICES). 
It enables the policy to learn both to predict the end-effector pose as a trajectory reference and to dynamically adapt the impedance gains along each of the six axes (rotation and translation) according to the phase of the task.
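The controllers of this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, only the position part of Eq. (4) is shown (the orientation term is analogous but needs an $\mathbb{S}\mathbb{O}(3)$ difference), and the VICES action layout `[delta_p, k_p, k_v]` is an assumed encoding. The matrices $M$, ${\mathrm{\Lambda}}^{\text{pos}}$ and ${J}_{\text{pos}}$ would come from a robot model and are supplied by the caller.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def mat_vec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Matrix-vector product for small dense matrices."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def joint_velocity_torques(qd_des, qd, kv):
    """Eq. (2): u = k_v (qd_des - qd), with element-wise gains."""
    return [k * (d - c) for k, d, c in zip(kv, qd_des, qd)]


def joint_position_torques(q_des, q, qd, kp, kv, M):
    """Eq. (3): u = M [k_p (q_des - q) - k_v qd]."""
    acc = [p * (d - c) - v * w
           for p, v, d, c, w in zip(kp, kv, q_des, q, qd)]
    return mat_vec(M, acc)


def vices_position_torques(action, v, J_pos, Lambda_pos):
    """Position part of Eq. (4) with gains supplied by the policy.

    Assumed action layout: [delta_p (3), k_p (3), k_v (3)], where
    delta_p = p_des - p is the commanded end-effector displacement.
    """
    delta_p, kp, kv = action[0:3], action[3:6], action[6:9]
    acc = [kp[i] * delta_p[i] - kv[i] * v[i] for i in range(3)]
    wrench = mat_vec(Lambda_pos, acc)          # desired force at the end-effector
    return mat_vec(transpose(J_pos), wrench)   # u = J_pos^T f


# Toy 2-joint example with placeholder model matrices (not a real robot).
J_pos = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
Lambda_pos = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
action = [0.5, 0.0, 0.0, 20.0, 20.0, 20.0, 2.0, 2.0, 2.0]
tau = vices_position_torques(action, [0.0, 0.5, 0.0], J_pos, Lambda_pos)
print(tau)
```

With fixed impedance, the last six entries of `action` would be constants chosen by the designer; making them policy outputs is what distinguishes the variable impedance action space.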
V Experiments
We conduct experiments in three application domains: a) free-space path following [buchli2011learning], b) manipulation of articulated mechanisms [kim2010impedance] and c) surface wiping [ott2008cartesian, Leidner2016RoboticAR]. These tasks are not only relevant applications in robotics, but also span different levels of task constraints from free motion to highly constrained contact-rich manipulation, which allows us to evaluate and compare the characteristics of the different action spaces for policy learning. For these three tasks and for each of the evaluated action spaces we aim to answer the following questions: Is the action space suitable for model-free RL? Is the learned policy physically efficient? Does the policy learned with a simulated robot transfer to a different simulated robot? Does a policy learned in simulation transfer to a real robot? To answer these questions we will use the following metrics and tests:

1. Sample efficiency and task completion: samples required for the policy to succeed in the task and/or converge.
2. Physical efficiency: energy consumed by the robot when using the trained policy. We assume a proportional relationship between joint torques and electric power.
3. Physical effort: wrenches applied to the environment by the trained policy during contact-rich manipulation tasks.
4. Transferability between robots: does a robot achieve the task using a policy trained on a different robot?
5. Sim-to-real transfer of contact-rich policies: does a real robot achieve the task using a policy trained in simulation?
Our control framework is outlined in Fig. 2. In all experiments our policies output actions ($a\in \mathrm{A}$) at $20\,\mathrm{Hz}$, while we send joint torque commands ($u\in \mathrm{U}=\mathrm{T}$) to the robot at $500\,\mathrm{Hz}$. To generate torque commands at a higher frequency, the controllers use the constant desired goal from the policy while updating the current state of the robot, ${s}_{\text{robot}}$. In order to ensure smooth robot commands and generated motions, in all of our controllers we interpolate linearly between policy commands at consecutive time steps.
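The rate mismatch between policy and controller can be illustrated with a short sketch: at $20\,\mathrm{Hz}$ and $500\,\mathrm{Hz}$ there are 25 controller steps per policy action. The text does not specify how the linear interpolation is scheduled, so the version below, which ramps from the previous goal to the new one over those 25 steps, is an assumption, and `interpolate_goals` is a hypothetical helper name.

```python
from typing import Iterator, List, Sequence

POLICY_HZ = 20      # policy action rate (from the text)
CONTROL_HZ = 500    # torque command rate (from the text)
STEPS_PER_ACTION = CONTROL_HZ // POLICY_HZ  # 25 controller steps per action


def interpolate_goals(prev_goal: Sequence[float],
                      new_goal: Sequence[float],
                      steps: int = STEPS_PER_ACTION) -> Iterator[List[float]]:
    """Yield one intermediate goal per controller step.

    Assumed schedule: linear ramp from prev_goal to new_goal, reaching
    new_goal exactly on the last controller step of the policy interval.
    """
    for k in range(1, steps + 1):
        alpha = k / steps
        yield [(1.0 - alpha) * a + alpha * b
               for a, b in zip(prev_goal, new_goal)]


goals = list(interpolate_goals([0.0, 0.0], [1.0, -2.0]))
print(len(goals), goals[0], goals[-1])
```

Each yielded goal would be passed to the chosen controller together with the freshly measured robot state to produce one torque command.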
V-A Free-Space Motion - Path Following
Setup. In this experiment we aim to measure the properties of different action spaces for tasks that do not involve any contact with the environment. The agent's goal is to follow a trajectory in free space passing through four via-points. The via-points are placed on a virtual plane in front of the agent at a constant distance along the x axis. The order and location of the via-points are fixed. We measure success as the fraction of the four via-points that are passed through. This setup is a more complex version of the one via-point trajectory of Buchli et al. [buchli2011learning]. This task can be solved kinematically without impedance control. However, we found that controlling the compliance of the robot could still offer benefits in this setup.
Reward Model. This task is trained in two phases: a first phase of task completion and a second phase of energy optimization. In the first phase, the agent is rewarded only for completing the task, i.e. passing through the four via-points. In the second phase, the trained models from the previous phase are further trained with the additional objective of optimizing their motion to reduce energy consumption. In the first phase of the experiment, the agent is rewarded when it hits a via-point (it gets closer than ${d}_{\text{th}}=5\,\mathrm{cm}$). To help guide exploration, we also provide a small dense reward inversely proportional to the distance to the next via-point in the trajectory. Since the episodes continue after the task is completed, a task-completion bonus proportional to the remaining time steps was introduced to discourage the robot from unnecessarily extending the duration of the task. We train policies with this reward using the different action spaces to evaluate if they can learn to follow the free-space trajectory. 
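The first-phase reward described above can be sketched as follows. Only the $5\,\mathrm{cm}$ threshold comes from the text; the reward magnitudes (`hit_reward`, `dense_scale`, `bonus_per_step`) and the bounded form `dense_scale / (1 + distance)` of the dense term are assumptions chosen for illustration, since the exact values and functional form are not given.

```python
import math
from typing import Sequence, Tuple

D_TH = 0.05  # via-point threshold in metres (5 cm, from the text)


def phase_one_reward(ee_pos: Sequence[float],
                     via_points: Sequence[Sequence[float]],
                     next_idx: int,
                     steps_left: int,
                     hit_reward: float = 1.0,       # assumed value
                     dense_scale: float = 0.01,     # assumed value
                     bonus_per_step: float = 0.1,   # assumed value
                     ) -> Tuple[float, int]:
    """Return (reward, updated index of the next via-point)."""
    if next_idx >= len(via_points):
        return 0.0, next_idx  # all via-points already checked

    target = via_points[next_idx]
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(ee_pos, target)))

    # Small dense term that grows as the end-effector approaches the target.
    reward = dense_scale / (1.0 + dist)

    if dist < D_TH:
        reward += hit_reward
        next_idx += 1
        if next_idx == len(via_points):
            # Completion bonus proportional to the remaining time steps.
            reward += bonus_per_step * steps_left
    return reward, next_idx


via = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
r_far, idx_far = phase_one_reward((0.5, 0.0, 0.0), via, 0, steps_left=10)
r_last, idx_last = phase_one_reward((1.0, 0.0, 0.0), via, 1, steps_left=10)
print(r_far, idx_far, r_last, idx_last)
```

Because the bonus scales with `steps_left`, finishing earlier yields a larger return, which matches the stated purpose of discouraging unnecessarily long episodes.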
In the second phase of the experiment, we explore whether the action spaces can optimize for the additional objective of minimizing energy consumption without decreasing the quality of the first objective (passing through the via-points). We add an energy consumption penalty to the previously defined reward function. To evaluate energy consumption we assume that the torques from the motors are proportional to electric current and that the voltage is constant; thus the electric power scales proportionally to the torque, and the energy is its time integral.
Observations. We use as observations the pose and velocity of the end-effector in the robot reference frame, as well as the location of the via-points (and whether each one has been checked).
Evaluation. We first evaluate each of the different action spaces in simulation, using a simulated Panda robot agent with five different random seeds. In the first phase of the experiment, we measure sample efficiency (reward as a function of the iterations) and level of completion of the task (number of via-points crossed). In the second phase of the experiment, we also measure the total energy consumption and task success. In both phases we evaluate how the original trained policies transfer between robots.
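The energy model stated above (power proportional to torque, energy as its time integral) can be written down directly. The sketch below returns a proxy in torque-seconds, since the proportionality constant between torque and electric power is not given; the function names and the penalty weight are hypothetical.

```python
from typing import Sequence


def energy_proxy(torque_log: Sequence[Sequence[float]], dt: float) -> float:
    """Approximate the time integral of sum_j |tau_j| with a Riemann sum.

    torque_log holds one list of joint torques per controller step and
    dt is the controller period (1/500 s at the 500 Hz rate in the text).
    """
    return sum(sum(abs(t) for t in torques) for torques in torque_log) * dt


def penalized_reward(task_reward: float,
                     torques: Sequence[float],
                     weight: float) -> float:
    """Second-phase reward: task reward minus an assumed torque penalty."""
    return task_reward - weight * sum(abs(t) for t in torques)


log = [[1.0, -1.0], [2.0, 0.0]]
total = energy_proxy(log, dt=0.5)
print(total, penalized_reward(1.0, [1.0, -1.0], weight=0.25))
```

Taking absolute values reflects the assumption that a motor draws current regardless of the sign of the torque it produces.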
Sample Efficiency and Task Completion. Fig. 3 (a) shows the training curves for policies in each of the action spaces. All policies except the ones that output reference joint torques and end-effector poses resolved with a fixed low impedance controller were able to achieve the goal of the task: checking all four via-points (see Fig. 4(a)). For the policies that achieve the task, the differences in reward value after convergence are simply a consequence of the termination bonus: some action spaces (e.g. desired end-effector poses resolved with high fixed impedance) allow for faster motion and thus faster completion of the task. We gain insights into how an RL policy exploits VICES for this task by observing the stiffness and damping over the course of an episode. Fig. 4 depicts the commands (the desired stiffness and the product of desired stiffness and delta position) from the policy trained with VICES for one episode after the first stage of training (before applying the energy penalty). The policy exploits the impedance (stiffness) to reach each via-point in the different portions of the trajectory. As it checks each via-point (indicated by the vertical bars in the figure), the impedance changes in the appropriate dimension to move quickly to the next via-point with enough stiffness to avoid overshooting.
Physical Efficiency. We evaluate the physical efficiency of policies in different action spaces by comparing the total energy consumption of the agents at the end of the first phase and of the second phase of our experiment, where we add the energy penalty. We found that the policies using variable impedance in end-effector space as action space were the only end-effector space policies that consistently improved energy efficiency while maintaining task performance. Both the medium and high fixed impedance models became unstable, since the action space does not have sufficient degrees of freedom to optimize the motion to reduce energy consumption while still achieving the trajectory task. Note that since the low impedance model never achieved the task, it was not evaluated with energy penalties. In joint space, the policies outputting actions in variable impedance space were also able to reduce the energy consumption significantly more than the controllers with fixed impedance, as expected [buchli2011learning]. They also reduced energy consumption more than policies outputting variable impedance in end-effector space. This reflects that the joint space policies resulting from the first phase of our experiment solved the task much faster (with higher energy consumption) than their end-effector counterparts and therefore had much more room for improvement when optimizing for energy efficiency. The difference in absolute energy reduction between policies in joint and in end-effector space is thus an artifact of the difference in magnitude between end-effector space delta position limits and joint space delta angle limits (i.e., the joint space agents were originally allowed to move more at each time step).
Transferability. We also evaluate how policies using different action spaces transfer in simulation from one robot to another through zero-shot transfer from the Panda robot to the Sawyer robot. The results are depicted in Fig. 4(a). 
As expected, we observed that after convergence only the policies using fixed and variable impedance in end-effector space could transfer directly between robots. The joint-space policies were not able to transfer due to the very different kinematics and dynamics of the two robot platforms. By using end-effector space control, we factor out the effects of the embodiment from the policy learning problem.
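To make concrete what a policy in this action space commands, the following is a minimal sketch of an end-effector impedance law with policy-supplied gains. It is an illustration under simplifying assumptions, not the controller used in the experiments: it handles position only and omits orientation, inertia shaping, and gravity compensation. The function name `vices_torques`, the toy Jacobian, and the gain values are hypothetical. The only robot-specific input is the Jacobian, which suggests why the same end-effector command can be resolved on a different arm.

```python
import numpy as np

def vices_torques(jacobian, delta_pos, ee_vel, stiffness, damping):
    """Map an end-effector impedance command to joint torques (sketch).

    jacobian  : (3, n) positional Jacobian of the arm (robot-specific)
    delta_pos : (3,) desired end-effector displacement output by the policy
    ee_vel    : (3,) current end-effector linear velocity
    stiffness : (3,) per-axis stiffness output by the policy
    damping   : (3,) per-axis damping output by the policy
    """
    # Spring-damper law in task space: pull toward the displaced target
    # and resist the current end-effector velocity.
    force = np.asarray(stiffness) * np.asarray(delta_pos) \
        - np.asarray(damping) * np.asarray(ee_vel)
    # Resolve the task-space force into joint torques via the Jacobian transpose.
    return np.asarray(jacobian).T @ force

# Hypothetical two-joint example with a toy Jacobian.
J = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
tau = vices_torques(J,
                    delta_pos=[0.01, 0.0, 0.0],
                    ee_vel=[0.0, 0.0, 0.0],
                    stiffness=[100.0, 100.0, 100.0],
                    damping=[20.0, 20.0, 20.0])
print(tau)
```

In this sketch, a fixed-impedance action space corresponds to holding `stiffness` and `damping` constant and letting the policy output only `delta_pos`.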
V-B Manipulation of Constrained Mechanisms: Door Opening
Setup. In this task, the robot has to learn how to manipulate a one-DoF constrained mechanism, a door, to a specific configuration. The agent is equipped with a two-finger gripper it can use to hold the door handle. The door handle is a bar attached vertically on the door leaf. The gripper is closed, leaving a space between the fingers to cage the door handle while still allowing for rotation between handle and gripper. We ensure that the agent learns to interact in a controlled and safe manner. Hence, instead of maximally opening the door, we set the goal to manipulate the door into a desired joint configuration (${\theta}_{goal}$ = $60^{\circ}$). We measure success as the fraction of the total desired joint state achieved by the robot: $1-\frac{|{\theta}_{door}-{\theta}_{goal}|}{{\theta}_{goal}}$.
Reward Model. We reward the agent when the door joint gets closer to the desired configuration. We provide an additional constant reward if the configuration of the door is very close to the desired value (less than $5^{\circ}$ away). We penalize forces and torques exerted on the environment that go beyond the physical payload of the robot ($40\mathrm{N}$). We also penalize the agent for colliding with the environment with links other than the gripper and for going beyond its joint limits. For safety, the episode terminates when joint limits are violated.
Observations. We use as observation the pose and velocity of the robot’s end-effector in the robot reference frame, as well as the door’s angle and angular velocity.
Evaluation. We evaluate the different action spaces in simulation. We train an agent with a Panda robot embodiment for each action space with five different random seeds.
Sample Efficiency and Task Completion. We first evaluate the different action spaces on their sample efficiency in learning the door manipulation task. The training results are depicted in Fig. 3, middle. The task success results for the door task can be found in Fig. 4(b).
We observe that policies that output end-effector space actions (with medium, variable, and high impedance) outperform policies in all other action spaces, achieving close to 100% task success rate and higher rewards. In end-effector space, the policy resolved with an impedance controller with fixed medium stiffness and damping learns the task at a faster rate than the variable impedance controller, as it is initialized with a suitable impedance to operate the door with the defined friction. However, policies outputting actions in both of the aforementioned spaces reach similar rewards and task success rates at the end of training, as the policies that can vary impedance end up learning a suitable impedance for the task. While the policies resolved with an impedance controller with fixed high impedance parameters also achieve on average 100% task success rate, their rewards are lower because they exert higher forces on the environment, which are penalized. The policies resolved with an impedance controller with fixed low impedance parameters are not able to learn the door opening task because they cannot exert high enough forces to overcome the friction of the door and move it. The policies outputting joint velocity actions can reach up to 75% task success, but their rewards are much lower than those of policies in VICES, as they often reach joint limits while opening the door. Policies outputting other joint-space actions (torques, positions) are unable to learn to exert enough force to open the door without reaching joint limits.
Transferability. We also evaluate the ability of policies in different action spaces to transfer from the Panda robot to the Sawyer robot. The results are shown in Fig. 4(b), in lighter colors. Transferring policies for the door opening task is more complex than for the free-space Path Following task because the different robots’ kinematics lead to very different task-space limitations, as well as very different joint limit constraints.
Similar to results in the other tasks, policies trained in joint space are unable to transfer, since the kinematics of the robots differ substantially. The end-effector space policies transfer much more successfully, as they abstract away the dynamics and kinematics of each specific robot model. There is still a performance drop, as the policies in end-effector space do not learn to account for the robots’ different kinematic constraints (i.e. joint limits).
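As a small worked example of the door-opening success measure from the Setup paragraph, the sketch below computes $1-|\theta_{door}-\theta_{goal}|/\theta_{goal}$ for a goal of 60 degrees. The absolute value and the clipping to the range [0, 1] are assumptions added here so that overshooting the goal is also counted as a loss; the function name `door_success` is hypothetical.

```python
def door_success(theta_door_deg, theta_goal_deg=60.0):
    """Fraction of the desired door angle achieved (sketch).

    Angles are in degrees. Clipping to [0, 1] is an assumption of this
    sketch, not something stated in the text.
    """
    frac = 1.0 - abs(theta_door_deg - theta_goal_deg) / theta_goal_deg
    return max(0.0, min(1.0, frac))

# A closed door scores 0, the goal angle scores 1, and a door opened
# to 45 degrees has achieved three quarters of the desired angle.
for angle in (0.0, 45.0, 60.0):
    print(angle, door_success(angle))
```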
V-C Contact-Rich Manipulation: Surface Wiping
Setup. In this experiment, the goal is to wipe a table whose surface location is unknown. The agents are equipped with a wiping tool, resembling a scrubber or a whiteboard eraser (see Fig. 1). In the simulator, the tool is modeled as a soft material that creates contact forces that increase proportionally to the penetration into the tool’s surface. The material to wipe is modeled as a set of small elements of a color different from the table. The elements are placed randomly on the table surface to form a continuous “stain” and are marked initially as unwiped. They become wiped if the wiping tool passes through them, which also causes them to disappear visually. Note that since the elements are modeled as very thin ($1\mathrm{mm}$ height) cylinders resting on the table’s surface, the agent needs to press the tool against the surface in order to wipe the elements. Success rate is measured as the fraction of the elements wiped. The sliding coefficient of friction of the table acting along both axes of the tangent plane is sampled uniformly between $0.1$ and $1.0$. The initial location of the agent above the table is randomized.
Reward Model. The main reward comes from wiping off elements. We also provide additional reward for wiping off all the elements. Additionally, to help during the initial phases of exploration, we give the agent a small reward for maintaining contact with the table. Finally, since we aim to generate safe solutions that can directly be tested on the real robot, we slightly penalize the agent for applying forces over the payload of the real robot ($40\mathrm{N}$), and harshly penalize the agent for reaching joint limits or colliding with the table with parts other than the wiping tool. If such collisions occur, the episode ends and the task restarts.
Observations. There is no straightforward way to represent the state of a wiping task.
Instead, we directly use visual observations: $48\times 48\times 3$ RGB images of the wiping scene generated in our simulator, and obtained from a camera on our real robot platform for the simulation-to-real transfer experiments. As in previous experiments, we also provide the pose and velocity of the end-effector.
Evaluation. We first evaluate the different action spaces in simulation. We train an agent with a Panda robot embodiment for each action space with five different random seeds.
Sample Efficiency and Task Completion. Fig. 3, right, shows the convergence of the agents with different action spaces. We observe that the agents with variable impedance in end-effector space converge faster and achieve higher reward. The higher reward results from a lower penalty for applying excessive force on the table, since the agents can learn to appropriately adapt their stiffness. The mean force applied by the policies with variable impedance in end-effector space is $28\mathrm{N}$, less than the robot’s payload. Agents using other action spaces either apply a higher mean force or do not apply enough force to wipe the table. In terms of task completion, the results are depicted in Fig. 4(c), dark colors. Policies outputting actions in VICES achieve the highest ratio of wiped units.
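For concreteness, the terms of the wiping Reward Model described above could be combined as in the following sketch. The structure (reward for wiped elements, completion bonus, small contact reward, slight penalty above the 40 N payload, harsh penalty for joint limits or unwanted collisions, termination on such collisions) follows the text. All weights and the function name `wiping_reward` are hypothetical placeholders, since the text does not give numerical values.

```python
PAYLOAD_N = 40.0  # payload limit of the real robot, as stated in the text

def wiping_reward(newly_wiped, all_wiped, in_contact, force_n,
                  joint_limit_hit, bad_collision,
                  w_wipe=1.0, w_done=10.0, w_contact=0.05,
                  w_force=0.1, w_fail=10.0):
    """Return (reward, done) for one step. All weights are illustrative."""
    reward = w_wipe * newly_wiped          # main term: elements wiped this step
    if all_wiped:
        reward += w_done                   # bonus for wiping every element
    if in_contact:
        reward += w_contact                # small shaping term for table contact
    if force_n > PAYLOAD_N:
        reward -= w_force                  # slight penalty for excessive force
    if joint_limit_hit or bad_collision:
        reward -= w_fail                   # harsh safety penalty
    done = bool(bad_collision)             # unwanted collisions end the episode
    return reward, done

# Wiping three elements while pressing with 28 N incurs no force penalty.
print(wiping_reward(3, False, True, 28.0, False, False))
```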
Transferability. We also evaluate whether the policies learned with the Panda robot embodiment transfer directly to the Sawyer robot in simulation. Fig. 4(c) depicts the results of the policy transfer between robots. Policies trained in variable impedance in end-effector space transfer better than policies in any other space since the policy is independent of the robot embodiment. However, there is a significant drop in performance due to the different forces generated by the different embodiments.
Simulation-to-real transfer. In a final experiment, we evaluate whether the policies trained in simulation can be used on the real robot without any retraining. We use the best performing simulation policy in our experiment. The goal in the real world is to wipe a whiteboard painted with a marker. Since our focus is on the evaluation of the action space and not on learning a representation of the image, we convert the real images into simulated-looking images by superimposing the results of a color segmentation for the colored parts of the table on an image from the simulator where the robot configuration is set to track the real robot. As a safety precaution, we stop the robot if the payload is exceeded. We note that the robot does not use any direct force sensing during the experiments. We initialize the robot to the same location and run ten trials, each with a different part of the whiteboard painted. An example run can be seen in Fig. 6, and more runs are shown in the video attachment. We consider a trial successful when the robot wipes more than 3/4 of the painted line. The robot successfully wipes the board in 8 of the 10 trials. In one of the failed trials, the robot moved abruptly and triggered the safety mechanism. In the other, the robot did not wipe the mark entirely. These results indicate that the policies trained with VICES can transfer to the real world without retraining by exploiting the knowledge of the dynamics model of the robot.
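The image conversion used for the simulation-to-real experiment can be illustrated with a short sketch. The text only states that a color segmentation of the real image is superimposed on a simulator image, so the per-channel RGB thresholding below, the threshold values, and the function name `overlay_stain` are assumptions for illustration. The sketch also assumes the real and simulated images are already aligned and have the same $48\times 48\times 3$ shape.

```python
import numpy as np

def overlay_stain(sim_image, real_image, lower, upper, stain_color):
    """Paint pixels segmented in the real image onto the simulator image.

    sim_image, real_image : (H, W, 3) uint8 RGB arrays, assumed aligned
    lower, upper          : per-channel RGB bounds of the marker color
    stain_color           : RGB color used for stain elements in simulation
    """
    lower = np.asarray(lower)
    upper = np.asarray(upper)
    # Color segmentation: a pixel is "stain" if every channel is within bounds.
    mask = np.all((real_image >= lower) & (real_image <= upper), axis=-1)
    out = sim_image.copy()          # leave the simulator render untouched
    out[mask] = stain_color         # superimpose the segmented region
    return out, mask

# Toy example: a white board with a small red marker stroke.
sim = np.zeros((48, 48, 3), dtype=np.uint8)
real = np.full((48, 48, 3), 255, dtype=np.uint8)
real[10:12, 20:25] = (200, 30, 30)
fake, mask = overlay_stain(sim, real, (150, 0, 0), (255, 80, 80), (0, 128, 0))
print(int(mask.sum()))  # number of pixels classified as stain
```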
VI Conclusion
Reinforcement Learning (RL) as a family of algorithms has ushered in impressive results in generalization, yet a principled evaluation of how to choose action spaces to learn control policies is missing. We presented a thorough evaluation of the effect of the choice of action space on learning policies in RL for tasks without contact, tasks with kinematic constraints, and contact-rich manipulation tasks. We also presented variable impedance in end-effector space (VICES) as an efficient choice of action space for RL and showed empirically that, even when contact conditions are dynamically variable during the task, this model outperforms other action space choices on sample efficiency, energy consumption, and safety. We also showed that, because variable impedance in end-effector space factors out the dynamic effects of the embodiment, we can transfer policies learned in simulation to other simulated robots and to a real robot without fine-tuning.