Abstract
We present a new approach for transferring dynamic robot control policies, such as biped locomotion, from simulation to real hardware. Key to our approach is performing system identification of the hardware's model parameters {\mu} (e.g. friction, center-of-mass) in two distinct stages: before policy learning (pre-sysID) and after policy learning (post-sysID). Pre-sysID begins by collecting trajectories from the physical hardware based on a set of generic motion sequences. Because these trajectories may not be related to the task of interest, pre-sysID does not attempt to accurately identify the true value of {\mu}, but only to approximate the range of {\mu} to guide policy learning. Next, a Projected Universal Policy (PUP) is created by simultaneously training a network that projects {\mu} to a low-dimensional latent variable {\eta} and a family of policies that are conditioned on {\eta}. The second round of system identification (post-sysID) is then carried out by deploying the PUP on the robot hardware using task-relevant trajectories. We use Bayesian Optimization to determine the value of {\eta} that optimizes the performance of the PUP on the real hardware. We have used this approach to create three successful biped locomotion controllers (walk forward, walk backward, walk sideways) on the Darwin OP2 robot.
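The post-sysID step described above can be illustrated with a minimal sketch of Bayesian Optimization over the latent variable {\eta}. Everything specific here is an assumption made for illustration and is not taken from the abstract: the two-dimensional {\eta}, the search bounds, the Gaussian-process surrogate with an RBF kernel, the upper-confidence-bound acquisition rule, and the function names. In particular, `rollout_return` is a hypothetical toy stand-in for deploying the PUP on the robot and measuring task performance.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean, unit-variance GP."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_qo = rbf_kernel(x_query, x_obs)
    # Solve for K^{-1} y and K^{-1} k_qo^T in a single call.
    sol = np.linalg.solve(k_oo, np.column_stack([y_obs, k_qo.T]))
    mean = k_qo @ sol[:, 0]
    var = 1.0 - np.einsum("ij,ji->i", k_qo, sol[:, 1:])
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def rollout_return(eta):
    """Hypothetical stand-in for one hardware trial of the policy given eta.

    On a real robot this would be a measured quantity such as distance walked;
    here it is a toy function peaked at an arbitrary point.
    """
    peak = np.array([0.6, -0.2])
    return float(np.exp(-4.0 * np.sum((eta - peak) ** 2)))

def post_sysid(n_init=5, n_iter=20, dim=2, seed=0):
    """Search for the eta with the highest observed return using GP-UCB."""
    rng = np.random.default_rng(seed)
    etas = rng.uniform(-1.0, 1.0, size=(n_init, dim))  # assumed bounds
    returns = np.array([rollout_return(e) for e in etas])
    for _ in range(n_iter):
        candidates = rng.uniform(-1.0, 1.0, size=(512, dim))
        # Standardize returns so the unit-variance GP prior is reasonable.
        y = (returns - returns.mean()) / (returns.std() + 1e-8)
        mean, std = gp_posterior(etas, y, candidates)
        next_eta = candidates[np.argmax(mean + 2.0 * std)]  # UCB acquisition
        etas = np.vstack([etas, next_eta])
        returns = np.append(returns, rollout_return(next_eta))
    best = int(np.argmax(returns))
    return etas[best], float(returns[best])

if __name__ == "__main__":
    best_eta, best_return = post_sysid()
    print("best eta:", best_eta, "return:", best_return)
```

Searching over a low-dimensional {\eta} rather than the full parameter vector {\mu} is what keeps the number of hardware trials small in a scheme like this; the trial budget used here (25 evaluations) is an arbitrary choice for the sketch.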