Hybrid Zero Dynamics Inspired Feedback Control Policy Design for 3D Bipedal Locomotion using Reinforcement Learning

  • 2019-10-03 22:20:04
  • Guillermo A. Castillo, Bowen Weng, Wei Zhang, Ayonga Hereid


This paper presents a novel model-free reinforcement learning (RL) framework to design feedback control policies for 3D bipedal walking. Existing RL algorithms are often trained in an end-to-end manner or rely on prior knowledge of some reference joint trajectories. In contrast to these studies, we propose a novel policy structure that incorporates physical insights gained from the hybrid nature of the walking dynamics and the well-established hybrid zero dynamics approach for 3D bipedal walking. As a result, the overall RL framework has several key advantages, including a lightweight network structure, short training time, and less dependence on prior knowledge. We demonstrate the effectiveness of the proposed method on Cassie, a challenging 3D bipedal robot. The proposed solution produces stable walking limit cycles that can track various walking speeds in different directions. Surprisingly, without being specifically trained with disturbances to achieve robustness, it also performs robustly against various adversarial forces applied to the torso in both the forward and the backward directions.



Guillermo A. Castillo¹, Bowen Weng¹, Wei Zhang², and Ayonga Hereid³

*This work was supported in part by the National Science Foundation under grant CNS-1552838 and the OSU M&MS Discovery Theme Initiative.
¹Electrical and Computer Engineering, Ohio State University, Columbus, OH, USA; {castillomartinez.2, weng.172}@osu.edu.
²SUSTech Institute of Robotics, Southern University of Science and Technology, China; [email protected]
³Mechanical and Aerospace Engineering, Ohio State University, Columbus, OH, USA; [email protected]


I Introduction

3D bipedal walking is a challenging problem due to the multi-phase and hybrid nature of legged locomotion. Properties like underactuation, unilateral ground contacts, nonlinear dynamics, and high degrees of freedom significantly increase the model complexity. Existing approaches to bipedal walking can be roughly grouped into two categories: model-based and model-free methods. In [Grizzle2014Models], the authors provide a comprehensive review of model-based methods, feedback control, and open problems of 3D bipedal walking. One of the main challenges for model-based methods is that mathematical models cannot fully capture the complex dynamics of a 3D robot in the real world. This results in non-robust controllers that require additional heuristic compensations and tuning processes, which can be time-consuming and require experience.

Reduced order models, such as Linear Inverted Pendulum and its variants [Kajita2001invertedpendulum], have been studied extensively in the literature. For these simple models, stable walking conditions can be stated in terms of the ZMP (zero moment point) [Vukobratovic2004ZMP, Yoshida2008, Stephens2010] or CP (capture point) [Pratt2006, Pratt2012], which can significantly simplify the control design. However, these approaches rely on some strong assumptions that often lead to quasi-static and unrealistic walking behaviors. Optimization-based methods such as Linear Quadratic Regulator (LQR) [Posa2016Optimization], Model Predictive Control (MPC) [Erez2013Integrated, Koenemann2015Whole], and Hybrid Zero Dynamics (HZD) [Westervelt2007Feedback, Ames2014Human] use the full order model of the robot to capture the underlying dynamics more accurately, which yields more natural dynamic walking behaviors. In particular, HZD is a formal framework for the control of bipedal robots with or without underactuation through the design of nonlinear feedback controllers and a set of virtual constraints. It has been successfully implemented in several physical robots, including many underactuated robots [Chevallereau2003RABBIT, Sreenath2011compliant, Zhao2017Multi, Hereid2018Dynamic]. Nonetheless, these methods are computationally expensive and sensitive to model parameters and environmental changes. Particularly for 3D walking, additional feedback regulation controllers are required to stabilize the system [Rezazadeh2015Spring, Reher2016Algorithmic, Gong2019Feedback]. Notably, recent work has successfully realized robust 3D bipedal locomotion by combining Supervised Learning with HZD [Da2019Combining].

Fig. 1: Cassie in simulation: gait recovering from the backward adversarial force with our proposed method.

With recent progress in deep learning, Reinforcement Learning (RL) has become a popular tool for solving challenging control problems in robotics. Existing RL methods often rely on end-to-end training without considering the underlying physics of the particular robot. An NN function is trained with policy gradient methods to directly map the state space to a set of continuous actions [Lillicrap2015Continuous, schulman2017proximal]. Despite their empirical success, such methods are often sample-inefficient (millions of data samples) and usually over-parameterized (thousands of tunable parameters). They may also lead to non-smooth control signals and unnatural motions that are not applicable to real robots. By incorporating HZD into RL training, [Castillo2019Reinforcement] generated feasible trajectories that are tracked by PD controllers to produce sustainable walking gaits at different speeds. However, this method only works for a simple 2D robot model. For the more complex 3D bipedal walking, [xie2018feedback] adopts RL methods as part of the feedback control. The method relies on prior knowledge of a good joint reference trajectory and only learns small compensations added to the known reference trajectory, which does not provide the overall control solution. An imitation-learning-inspired method is proposed for a 3D robot in [Xie2019Iterative], which also requires a known walking policy and gradually improves the policy through learning.

In this paper, we propose a novel hybrid control structure for robust and stable 3D bipedal locomotion. It harnesses the advantages of parameterized policies obtained through RL while exploiting the structure of the intuitive yet powerful additional regulations commonly used in 3D bipedal walking. The regulation terms embedded in the design of policy structure and reward functions differentiate our proposed method from the previous work of [Castillo2019Reinforcement] that is also inspired by HZD. We evaluate the performance of our method on Cassie, which is also a much more challenging robot than the 2D Rabbit [Castillo2019Reinforcement]. We further summarize the primary contributions of the present paper as follows:

  • Model-free: The trained policy naturally learns a feasible walking gait from scratch without the need for a given reference trajectory or the model dynamics. To the best of our knowledge, this is the first time that a variable speed controller for a 3D robot is learned without using previously known reference trajectories or training separate policies for different speeds.

  • Efficiency: By incorporating the physical insight of bipedal walking, such as its hybrid nature, symmetric motion, and heuristic compensation, into the control structure and learning process, we significantly simplify the design to a shallow neural network (NN) with only 5069 trainable parameters. To the best of our knowledge, this is the smallest NN ever reported for Cassie. As a result, the NN policy is easy to train. It is also fast enough for real-time control at 1 kHz on a single CPU process (the low-level PD controller runs at 2 kHz).

  • Robustness: The learned controller is capable of stably tracking a wide range of walking speeds in both longitudinal and lateral directions with just one trained policy. The robustness of the controller is also evaluated by several disturbance rejection tests in simulation.


In this section, we will review the classic HZD based feedback controllers for dynamic 3D walking robots. Inspired by the HZD framework, we then propose to study a model-free control problem using RL techniques.

II-A Existing Challenges of the HZD Framework

One of the main challenges of 3D bipedal walking is to find feasible trajectories that render stable and robust limit walking cycles while keeping certain desired behaviors, such as walking speeds of the system. In the HZD framework, to obtain such trajectories, an offline optimization problem is solved using the full-order model, and virtual constraints are introduced as a means to synthesize feedback controllers that realize stable and dynamic locomotion. By designing virtual constraints that are invariant through impact, an invariant sub-manifold is created—termed the hybrid zero dynamics surface—wherein the evolution of the system is dictated by the reduced-dimensional dynamics of the under-actuated degrees of freedom of the system [Westervelt2007Feedback, Ames2014Human].

However, the mathematical models used in this optimization cannot completely capture the complex dynamics of a 3D bipedal robot. Consequently, additional heuristic compensation controllers or regulators are often required on top of the PD tracking controllers to stabilize the robot [Gong2019Feedback, Reher2016Algorithmic]. An example of such a control structure with feedback regulations for a 3D walking robot is shown in Fig. 2. The compensations δq either modify the original reference trajectories 𝐪d or exert extra feed-forward torques based on additional feedback information, such as the lateral hip velocity or torso orientation, to improve the stability and robustness of the walking gaits in experiments. The new regulated reference trajectory 𝐪reg is then tracked by a PD controller through the control action 𝐮.

Fig. 2: An example of 3D bipedal walking controller with heuristic feedback regulations.

II-B Structure of HZD-Based Feedback Controller

From a high-level abstraction, the problem of 3D bipedal walking can be divided into two stages: trajectory planning and feedback control.

Virtual Constraints. Let 𝐪 be the vector of joint coordinates of a general 3D bipedal robot, and let τ(t) ∈ [0,1] be a time-based phase variable (see (3) for the explicit definition). The virtual constraints are then defined as the difference between the actual and desired outputs of the robot [Ames2013Human]:

$$\mathbf{y}_2 := \mathbf{y}_2^a(\mathbf{q}) - \mathbf{y}_2^d(\tau(t), \alpha), \tag{1}$$

where 𝐲2d is a vector of desired outputs defined in terms of 5th order Bézier polynomials parameterized by the coefficients α, given as:

$$\mathbf{y}_2^d(\tau(t), \alpha) := \sum_{k=0}^{5} \alpha[k] \, \frac{M!}{k!\,(M-k)!} \, \tau(t)^k \, (1-\tau(t))^{M-k}. \tag{2}$$

In this paper, we choose τ(t) to be the scaled relative time with respect to step time interval, i.e.,

$$\tau(t) = \frac{t - t^-}{t_{\text{step}}}, \tag{3}$$

where tstep is the duration of one walking step, and t- is the time at the beginning of the step. It is important to note that by properly choosing the coefficients of these Bézier polynomials, one can achieve different walking motions.
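As an illustration of (2) and (3), the following minimal Python sketch evaluates a desired joint position from a set of Bézier coefficients and the phase variable. The coefficient values, the clipping of τ to [0, 1], and the function names are assumptions made for this example; the step time of 0.35 s follows the approximate step duration reported later in the paper.

```python
from math import comb


def phase(t, t_start, t_step):
    """Scaled relative time of Eq. (3). Clipping to [0, 1] is an assumption."""
    tau = (t - t_start) / t_step
    return min(max(tau, 0.0), 1.0)


def bezier(alpha, tau):
    """Evaluate a Bezier polynomial of degree M = len(alpha) - 1, as in Eq. (2)."""
    M = len(alpha) - 1
    return sum(
        a * comb(M, k) * tau**k * (1.0 - tau) ** (M - k)
        for k, a in enumerate(alpha)
    )


# Hypothetical coefficients for a single joint (radians); not taken from the paper.
alpha_knee = [-1.0, -1.1, -1.4, -1.5, -1.2, -1.0]

tau_mid = phase(t=0.175, t_start=0.0, t_step=0.35)  # middle of the step
q_des_start = bezier(alpha_knee, 0.0)  # equals the first coefficient
q_des_mid = bezier(alpha_knee, tau_mid)
q_des_end = bezier(alpha_knee, 1.0)  # equals the last coefficient
```

The endpoint behavior shown here (the curve starts at the first coefficient and ends at the last one) is what later allows boundary conditions on a step to be imposed directly on individual coefficients.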

In the HZD framework, the Bézier coefficients are obtained from the solution of an optimization problem whose cost function and constraints are determined by the desired behavior of the robot. Then, the Bézier Polynomials define the desired trajectories to be tracked in order to drive the virtual constraints to zero. However, for 3D bipedal walking robots, simply tracking desired trajectories is not enough to achieve a stable walking motion. Therefore, the following regulations are added to the controller: foot placement, torso regulation, and ankle regulation.

Foot placement controllers have been widely used in 3D bipedal walking robots to improve speed tracking as well as the stability and robustness of the walking gait [Da2016, Rezazadeh2015Spring, Gong2019Feedback]. Longitudinal speed regulation, defined by (4), sets a target offset in the swing hip pitch joint, whereas lateral speed regulation (5) does the same for the swing hip roll angle. In these equations, vx[k] and vy[k] are the average longitudinal and lateral speeds of the robot at the middle of step k, vxd and vyd are the reference speeds, and Kpx, Kdx, Kpy, Kdy are manually tuned gains.

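$$\delta h_{\text{pitch}}^{\text{sw}}[k] = K_{px}\,(v_x[k] - v_x^d) + K_{dx}\,(v_x[k] - v_x[k-1]), \tag{4}$$
$$\delta h_{\text{roll}}^{\text{sw}}[k] = K_{py}\,(v_y[k] - v_y^d) + K_{dy}\,(v_y[k] - v_y[k-1]). \tag{5}$$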
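The step-to-step structure of (4) and (5) can be sketched as a small stateful regulator that is updated once per walking step. This is an illustrative sketch only: the class name, the gain values, and the choice to use a zero difference term on the first step are assumptions, not details taken from the paper.

```python
class FootPlacementRegulator:
    """Discrete PD law of Eq. (4)/(5), updated once per walking step."""

    def __init__(self, kp, kd):
        self.kp = kp
        self.kd = kd
        self.v_prev = None  # average speed of the previous step, v[k-1]

    def update(self, v_avg, v_des):
        """Return the swing hip offset for step k given the average speed v[k]."""
        if self.v_prev is None:
            # Assumption: no difference term is applied on the very first step.
            self.v_prev = v_avg
        offset = self.kp * (v_avg - v_des) + self.kd * (v_avg - self.v_prev)
        self.v_prev = v_avg
        return offset


# Hypothetical gains and speeds (m/s) for the longitudinal direction.
pitch_regulator = FootPlacementRegulator(kp=0.2, kd=0.1)
offset_step_1 = pitch_regulator.update(v_avg=0.55, v_des=0.5)
offset_step_2 = pitch_regulator.update(v_avg=0.60, v_des=0.5)
```

A second instance with its own gains would play the same role for the lateral direction and the swing hip roll angle.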

Torso regulation is applied to keep the torso in an upright position, which is desired for a stable walking gait. Assuming that the robot has a rigid body torso, simple PD controllers defined by (6) and (7) can be applied respectively to the hip roll and hip pitch angle of the stance leg:

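$$u_{h_{\text{roll}}}^{\text{st}} = K_{pt}^{\text{roll}}\,(\phi - \phi^d) + K_{dt}^{\text{roll}}\,(\dot{\phi} - \dot{\phi}^d), \tag{6}$$
$$u_{h_{\text{pitch}}}^{\text{st}} = K_{pt}^{\text{pitch}}\,(\theta - \theta^d) + K_{dt}^{\text{pitch}}\,(\dot{\theta} - \dot{\theta}^d), \tag{7}$$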

where ϕ and θ are the torso roll and pitch angles, and Kptroll,Kdtroll,Kptpitch,Kdtpitch are manually tuned gains.

Ankle regulation is applied to keep the swing foot flat during the whole swinging phase, including the landing moment. For Cassie, this can be done by using forward kinematics for the reference trajectory of the swing ankle joint, given as

$$\gamma^{\text{sw}} = \theta - 13^\circ - 50^\circ, \tag{8}$$

where γsw is the ankle joint corresponding to the pitch angle of the swing foot. In addition, to stabilize the walking gait, especially when walking on soft surfaces [Gong2019Feedback], the stance foot pitch angle of the robot will be set to be passive.

It is important to note that the speed and torso regulations presented above are fixed, intuitive, and applicable to any general 3D bipedal walking robot. However, given the decoupled structure of the controller used for the different regulations, several gains need to be manually tuned to achieve improved stability, which is time-consuming and requires experience. This tuning process can be easily automated within an RL framework.

II-C Tackling the Problem with Reinforcement Learning

Inspired by the nice properties of HZD (low-dimensional space, accounting for the hybrid nature of walking, virtual constraints), we propose an RL framework that incorporates those properties in the learning process to solve the complex problem of 3D walking. The main objective is to create a unified policy that can handle both problems presented in Fig. 2: trajectory planning and feedback control.

By this, we aim to address two specific challenges generally present in current methods for 3D bipedal walking: (i) eliminating the tedious task of manually tuning the gains of the feedback regulations by including them in the learning process, and (ii) improving the data efficiency of the RL method by significantly reducing its number of parameters.

To validate the proposed approach, we use the Cassie-series bipedal robot, designed by Agility Robotics, as our test-bed in this paper. This underactuated biped has 20 degrees of freedom (DOF) in total. Each leg has seven joints, of which five are directly actuated by electrical motors and the other two are connected via specially designed leaf-spring four-bar linkages for additional compliance. When supported on one foot during walking, the robot is underactuated due to its narrow feet. Agility Robotics has released dynamic simulation models of the robot in MuJoCo [AgilitySimsJune2018], which will be used later in this paper.


In this section, we propose a non-conventional RL framework that combines a learning structure inspired by the HZD with the foot placement, torso and ankle regulation introduced in section II. We incorporate useful insights from traditional control framework into the learning process of the control policy. By HZD-inspired, we refer to the fact that the learning structure uses a low dimensional representation of the state and a time-based phase variable to command the behavior of the whole system while enforcing invariance of the virtual constraints through impact and symmetry conditions of the walking gait.

III-A Overall Framework

We formally present a non-conventional RL framework that uses a low-dimensional state of the robot to learn a robust control policy able to track different walking speeds while maintaining the stability of the walking limit cycle. The proposed framework first establishes an NN function that maps a reduced-order representation of the robot's state to (i) a set of coefficients of the Bézier polynomials that define the trajectory of the actuated joints, and (ii) a set of gains comprising the derivative gains of the joint PD controllers, as well as the gains for the foot placement and torso regulations described in Section II-B. Independent low-level PD controllers are then used to track the desired output for each joint, which enforces compliance with the HZD virtual constraints.

A diagram of the overall RL framework is presented in Fig. 3. At each time step, the trained policy maps the inputs (the desired walking velocity, the average actual velocity, the average velocity tracking error, and the torso orientation and angular velocity) to the set of coefficients α, the derivative gains of the joint PD controller, and the compensation gains for the foot placement and torso regulation. A detailed explanation of the NN structure will be given in Section III-B. Then, α is used jointly with the phase variable τ(t) to compute each joint's desired position and velocity. The compensation gains are used in the foot placement regulation and torso regulation to compute the trajectory compensation for the hip roll and pitch angles of the stance and swing legs of the robot. In addition, the ankle regulation computes the trajectory for the swing and stance leg ankle joints. The PD controller uses the tracking error between the desired and actual values of the output to compute the torque of each actuated joint, which is the input of the dynamic system that represents the walking motion of the robot. To close the control loop, measurements of the robot's states are used as feedback for the inner and outer control loops.
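The inner tracking loop described above can be sketched as follows: the desired position comes from the Bézier polynomial, the desired velocity from its derivative with respect to τ scaled by 1/tstep (chain rule applied to (3)), and a PD law turns the tracking error into a joint torque. This is a minimal sketch under stated assumptions: the function names, gains, and numerical values are hypothetical, and the proportional gain is passed in explicitly because only the derivative gains are described as policy outputs.

```python
from math import comb


def bezier(alpha, tau):
    """Bezier polynomial of degree M = len(alpha) - 1 evaluated at tau."""
    M = len(alpha) - 1
    return sum(
        a * comb(M, k) * tau**k * (1.0 - tau) ** (M - k)
        for k, a in enumerate(alpha)
    )


def bezier_derivative(alpha, tau):
    """Derivative with respect to tau: a degree M-1 Bezier curve built from
    the scaled differences of consecutive coefficients."""
    M = len(alpha) - 1
    diffs = [M * (alpha[k + 1] - alpha[k]) for k in range(M)]
    return bezier(diffs, tau)


def pd_torque(alpha, tau, t_step, q, dq, kp, kd):
    """Joint torque from the position and velocity tracking errors."""
    q_des = bezier(alpha, tau)
    dq_des = bezier_derivative(alpha, tau) / t_step  # d(tau)/dt = 1 / t_step
    return kp * (q_des - q) + kd * (dq_des - dq)


# Sanity case: equally spaced coefficients give the straight line q_des = tau.
alpha_line = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
t_step = 0.35
slope = bezier_derivative(alpha_line, 0.3)
torque_on_track = pd_torque(
    alpha_line, 0.3, t_step, q=0.3, dq=1.0 / t_step, kp=100.0, kd=5.0
)
torque_lagging = pd_torque(
    alpha_line, 0.3, t_step, q=0.25, dq=1.0 / t_step, kp=100.0, kd=5.0
)
```

When the joint is exactly on the reference, the torque vanishes; when it lags behind, the proportional term pushes it forward.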

Fig. 3: Overall structure of the RL framework.

It is worth mentioning that the reference trajectories are learned from scratch and naturally obtained by the proposed RL framework. This is in contrast to some existing RL studies [Peng2017DeepLoco, xie2018feedback, Xie2019Iterative], which rely on a given working policy to provide the joint reference trajectories.

III-B Neural Network Structure

Fig. 4 shows the structure of the NN implemented for the learning process. Cassie's dynamics model contains 40 states: the robot's pelvis position, velocity, orientation, and angular velocity, plus the angles and angular velocities of all the active and passive joints of the robot. However, the proposed NN takes only a 12-dimensional reduced-order state as input: desired longitudinal and lateral velocity (vxd, vyd), average longitudinal and lateral velocity (vx¯, vy¯), average longitudinal and lateral velocity error (evx¯, evy¯), roll, pitch and yaw angles (ϕ,θ,ψ), and roll, pitch and yaw angular velocities (ϕ˙,θ˙,ψ˙). Here, we consider the average speed as the speed during one walking step of the robot, which takes about 350 ms. All the inputs are normalized to the interval [-0.5,0.5]. The value of the desired velocity is uniformly sampled from the continuous interval from -0.5 to 1.0 m/s.

Fig. 4: Structure of the neural network.

The output of the NN corresponds to the coefficients of the Bézier polynomials, denoted by α, and the sets of gains of the PD controller, foot placement and torso compensations, denoted by 𝐊d, 𝐊fp and 𝐊t, respectively. Initially, since the robot has ten actuated joints and each Bézier polynomial is of degree 5 (six coefficients), the total size of the set of parameters α should be 60 for the right stance and 60 for the left stance. In addition, the same derivative gains 𝐊d are used for the corresponding joints of the left and right legs, which is possible because of the symmetric nature of the walking gait. Finally, by equations (4)-(7), 𝐊fp and 𝐊t are of the form

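$$\mathbf{K}_{fp} = [K_{px},\, K_{dx},\, K_{py},\, K_{dy}], \qquad \mathbf{K}_{t} = [K_{pt}^{\text{roll}},\, K_{dt}^{\text{roll}},\, K_{pt}^{\text{pitch}},\, K_{dt}^{\text{pitch}}]. \tag{9}$$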

Then 𝐊d is of dimension 5, and both 𝐊fp and 𝐊t are of dimension 4. This results in a total of 133 outputs (120 Bézier coefficients plus 13 gains). Nonetheless, by considering the physical insight of dynamic walking, we can significantly reduce the number of outputs of the NN from 133 to 45. This will be explained in detail in Section III-C. The NN has 4 hidden layers, each with 32 neurons. ReLU activation functions are used between hidden layers, whereas the final layer employs a sigmoid function to limit the range of the outputs. Compared with other methods in the literature [Lillicrap2015Continuous, xie2018feedback], the proposed NN is much smaller in size, making the overall RL method sample efficient and easy to implement.
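As a quick consistency check on the reported network size, the parameter count of a fully connected network with 12 inputs, four hidden layers of 32 neurons, and 45 outputs can be computed directly. The sketch assumes dense layers with one bias per neuron, which is an assumption about the architecture; under it, the count matches the 5069 trainable parameters stated in the paper.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# 12 inputs, four hidden layers of 32 neurons, 45 outputs.
layer_sizes = [12, 32, 32, 32, 32, 45]
n_params = mlp_parameter_count(layer_sizes)
```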

Finally, due to the properties of the family of polynomials used to parameterize the joint trajectories, the set of Bézier coefficients bounds the desired output trajectories from above and below. That is, for each set of Bézier coefficients αi and desired trajectory qid associated with the ith joint, we have

$$q_i^{\min} < \alpha_i^{\min} < q_i^d < \alpha_i^{\max} < q_i^{\max}. \tag{10}$$

Therefore, the output range of the set of parameters can be limited by the physical constraints of each actuated joint, or even further, by the expected behavior of the robot (qimin,qimax). This critical feature significantly reduces the continuous interval of the output, which decreases the complexity of the RL problem and improves the efficiency of the learning process.
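The bounding argument can be illustrated with a short sketch: a sigmoid squashes each raw network output into (0, 1), an affine map places it inside a chosen joint range, and the resulting Bézier curve then stays between the smallest and largest coefficient. The joint range and raw values below are hypothetical, and the affine rescaling after the sigmoid is an assumption about how the range limiting could be implemented. The curve is evaluated with De Casteljau's algorithm, which makes the bound evident because every intermediate value is a convex combination of two earlier ones.

```python
import math


def scale_to_range(raw, lo, hi):
    """Squash a raw value with a sigmoid, then map (0, 1) onto (lo, hi)."""
    return lo + (hi - lo) / (1.0 + math.exp(-raw))


def de_casteljau(alpha, tau):
    """Evaluate a Bezier curve by repeated linear interpolation."""
    points = list(alpha)
    while len(points) > 1:
        points = [(1.0 - tau) * a + tau * b for a, b in zip(points, points[1:])]
    return points[0]


# Hypothetical joint range (rad) and hypothetical pre-activation outputs.
Q_MIN, Q_MAX = -2.0, -0.6
raw_outputs = [-3.0, -0.5, 0.0, 1.2, 4.0, 0.7]

alpha = [scale_to_range(x, Q_MIN, Q_MAX) for x in raw_outputs]
samples = [de_casteljau(alpha, i / 20) for i in range(21)]
```

Every sampled point of the desired trajectory lies between min(alpha) and max(alpha), which in turn lie inside the joint range.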

III-C Reduction of Output Dimension

For a general walking pattern, there exists a symmetry between the right and left stance. Therefore, given the set of coefficients for the right stance αR ∈ ℝ^{6×10}, where each column represents the Bézier coefficients for a desired joint trajectory, we can easily obtain the set of coefficients for the left stance αL ∈ ℝ^{6×10} by

$$\alpha_L = \alpha_R \mathbf{T}, \tag{11}$$

where 𝐓 ∈ ℝ^{10×10} is a very sparse transformation matrix that represents the symmetry between the joints of the right and left legs of the robot.
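One plausible form of such a transformation is a signed permutation that swaps the columns of corresponding left and right joints and flips the sign of joints whose angles change sign under mirroring. The sketch below is illustrative only: the joint ordering (five left joints followed by five right joints) and the sign convention (roll and yaw flipped, pitch, knee, and toe unchanged) are assumptions and may differ from the matrix used by the authors.

```python
# Assumed per-leg joint order; columns 0-4 are the left leg, 5-9 the right leg.
JOINTS = ["hip_roll", "hip_yaw", "hip_pitch", "knee", "toe"]
# Assumed mirroring signs: roll and yaw change sign, sagittal joints do not.
MIRROR_SIGN = {"hip_roll": -1.0, "hip_yaw": -1.0,
               "hip_pitch": 1.0, "knee": 1.0, "toe": 1.0}


def symmetry_matrix():
    """Build a sparse 10x10 signed permutation that swaps left and right joints."""
    n = len(JOINTS)
    T = [[0.0] * (2 * n) for _ in range(2 * n)]
    for j, name in enumerate(JOINTS):
        sign = MIRROR_SIGN[name]
        T[j + n][j] = sign  # left column j is taken from right column j
        T[j][j + n] = sign  # right column j is taken from left column j
    return T


def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical 6x10 coefficient matrix for the right stance.
alpha_R = [[0.01 * (r + 1) * (c + 1) for c in range(10)] for r in range(6)]
T = symmetry_matrix()
alpha_L = matmul(alpha_R, T)
```

Under these assumptions the matrix has only ten nonzero entries and applying it twice recovers the original coefficients, as expected of a mirror symmetry.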

To encourage the smoothness of the control actions after the ground impact, we enforce that at the beginning of every step the initial point of the Bézier polynomial coincides with the current position of the robot's joints. That is, for each joint i with Bézier coefficients αi (the ith column of αR ∈ ℝ^{6×10}), we have

$$\alpha_i[0] = q_i(\tau(0)). \tag{12}$$

In addition, to encourage the invariance of the virtual constraints through impact, we enforce the position of the hip joints and knee joints to be equal at the beginning and end of the step (τ(t)=0 and τ(t)=1 respectively).

Finally, the ankle regulation enforces the stance ankle to be passive and the trajectory of the non-stance ankle to be defined by forward kinematics according to (8). Therefore, we do not need to find the Bézier coefficients for the stance and swing ankle joints.
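The reductions described in this section can be tallied as simple arithmetic. The paper states only the totals (133 outputs before the reduction and 45 after); the step-by-step breakdown below is one accounting that is consistent with those totals and with the constraints listed above, and it should be read as an interpretation rather than the authors' own derivation.

```python
N_JOINTS = 10          # actuated joints
N_COEFFS = 6           # coefficients of a degree-5 Bezier polynomial
N_GAINS = 5 + 4 + 4    # K_d, K_fp, K_t

# Both stances parameterized independently.
full = 2 * N_JOINTS * N_COEFFS + N_GAINS

# Symmetry (11): only the right-stance coefficients are learned.
after_symmetry = N_JOINTS * N_COEFFS + N_GAINS

# Ankle regulation: the two ankle joints need no Bezier coefficients,
# leaving the hip and knee joints of both legs.
n_learned_joints = N_JOINTS - 2

# Eq. (12) fixes the first coefficient; equal start and end positions
# for hip and knee joints fix the last one.
coeffs_per_joint = N_COEFFS - 2

reduced = n_learned_joints * coeffs_per_joint + N_GAINS
```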

III-D Learning Procedure

Given the NN policy structure and the reduced set of output actions, the network is then trained with evolution strategies (ES) [salimans2017evolution]. Note that our proposed method is not limited to a particular training method. It can be trained using any RL algorithm that can handle continuous action spaces, including ES, proximal policy optimization (PPO) [schulman2017proximal], and deterministic policy gradient methods [silver2014deterministic].
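To give a feel for how ES updates a parameter vector, the following toy sketch perturbs the parameters with Gaussian noise, scores each perturbation, and moves the parameters along the return-weighted average of the noise. It is a simplified variant with a mean baseline, applied to a hypothetical quadratic objective that stands in for the episode return; the hyperparameters are arbitrary, and this is not the authors' training code or the full algorithm of [salimans2017evolution].

```python
import random


def evolution_strategies(objective, theta, sigma=0.1, lr=0.05,
                         population=50, iterations=200, seed=0):
    """Basic ES: estimate the search gradient from Gaussian perturbations."""
    rng = random.Random(seed)
    theta = list(theta)
    n = len(theta)
    for _ in range(iterations):
        noises = [[rng.gauss(0.0, 1.0) for _ in range(n)]
                  for _ in range(population)]
        returns = [objective([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noises]
        baseline = sum(returns) / population  # mean baseline reduces variance
        for j in range(n):
            grad_j = sum((r - baseline) * eps[j]
                         for r, eps in zip(returns, noises))
            theta[j] += lr * grad_j / (population * sigma)
    return theta


# Hypothetical stand-in for the episode return: maximal at TARGET.
TARGET = [1.0, -2.0, 0.5]


def toy_return(params):
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


theta0 = [0.0, 0.0, 0.0]
theta_final = evolution_strategies(toy_return, theta0)
```

In the setting of this paper, the parameter vector would hold the NN weights and the objective would be the accumulated reward of a simulated episode; only function evaluations are needed, so no gradients of the simulator are required.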

In this paper, we adopt the following reward function in training for Cassie:

$$r = \mathbf{w}^T \mathbf{r}, \tag{13}$$

with a vector of 8 customized rewards 𝐫 and the weights 𝐰. Specifically,

$$\mathbf{r} = [r_{v_x},\, r_{v_y},\, r_h,\, r_u,\, r_{COM},\, r_{ang},\, r_{angvel},\, r_{fd}]^T. \tag{14}$$

This encourages better velocity tracking (through rvx,rvy), height maintenance (rh), energy efficiency (ru) and natural walking gaits (rCOM,rang,rangvel,rfd). Starting from a random initial state that is close to a "stand-up" position with zero velocity and a uniformly sampled desired velocity, we collect a trajectory of states, actions and rewards, referred to as an episode. The episode length is 10000 simulation steps, and the episode terminates early if any of the following conditions is violated:

$$|\psi| < 0.5, \quad |\theta| < 0.5, \quad |\phi| < 0.5, \tag{15}$$

where pz is the height of the robot’s pelvis and Δf is the distance between the feet.
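The weighted reward (13) and the orientation part of the termination test (15) can be sketched as follows. The weights and reward values are hypothetical placeholders, since the paper does not list them here, and only the orientation conditions are checked because the bounds on pz and Δf are not recoverable from the text.

```python
REWARD_NAMES = ["vx", "vy", "h", "u", "COM", "ang", "angvel", "fd"]


def total_reward(weights, rewards):
    """Scalar reward r = w^T r of Eq. (13)."""
    if len(weights) != len(rewards):
        raise ValueError("weights and rewards must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def orientation_within_limits(roll, pitch, yaw, limit=0.5):
    """Orientation conditions of Eq. (15); False means early termination."""
    return all(abs(angle) < limit for angle in (roll, pitch, yaw))


# Hypothetical weights and per-term rewards, one per entry of REWARD_NAMES.
weights = [1.0] * len(REWARD_NAMES)
rewards = [0.5] * len(REWARD_NAMES)

r_total = total_reward(weights, rewards)
keep_running = orientation_within_limits(roll=0.1, pitch=-0.2, yaw=0.3)
terminate = not orientation_within_limits(roll=0.1, pitch=0.6, yaw=0.0)
```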


To validate the proposed method, a customized environment for Cassie was built in the MuJoCo physics engine [Todorov2012MuJoCo]. We used the model information of the Cassie robot provided by Agility Robotics and Oregon State University, which is publicly available [AgilitySimsJune2018]. The number of trainable parameters for the NN is 5069, and the training time is about 10 hours using a single 12-core CPU machine. Visualized results of the learning process and evaluation of the policy in simulation can be seen in the accompanying video submission [video_link]. This section presents the performance of the control policy obtained from the training in terms of (i) speed tracking, (ii) disturbance rejection, and (iii) the convergence of stable periodic limit cycles.

IV-A Speed Tracking

Due to the decoupled structure, the learned controller can effectively track a wide range of desired walking speeds in both longitudinal and lateral directions. The performance of tracking a fixed desired speed of 0.5 m/s in the forward direction is shown in Fig. 5. From Fig. 5(c), one can see the controller is capable of keeping the upright position of the torso while walking. This particular behavior is encouraged by the reward function during the training process, and it also contributes to the stability of the walking gait.

Fig. 5: Performance of the learned policy while tracking a fixed desired longitudinal walking speed.
Fig. 6: Performance of the learned policy while tracking varying desired longitudinal and lateral walking speeds.

The performance of continuously tracking various desired speeds is shown in Fig. 6, in an interval from -0.5 to 1.0 m/s longitudinally (vx), and an interval from -0.3 to 0.3 m/s in the lateral direction (vy). Note that walking at negative vx means the robot is walking backward, whereas moving at positive vy implies the robot is taking side steps to the left. In both cases, the controller is able to handle any speed change without falling or losing track of the reference, even after steep changes in the desired velocity.

IV-B Disturbance Rejection

To evaluate the robustness of our controller, we applied an adversarial force directly at the robot’s pelvis in both the forward and the backward directions. It is worth emphasizing that we do not inject any torso disturbance throughout the training process. The robustness of the policy is achieved naturally through the constant updates of the Bézier coefficients, the derivative gains of the adaptive PD controller, and the gains of the additional foot placement, torso and ankle regulators. In the results shown in Fig. 7 and Fig. 8, we adopt the adversarial force with the same magnitude of 25 N in both directions. It is applied 2 seconds after starting the test and lasts for 0.1 seconds. Throughout our adversarial tests, the robot can handle up to 40 N in the forward direction and 45 N in the backward direction without falling, but the speed tracking may take a long time to recover with an external force of high magnitude.

Fig. 7: Robustness of the controller when an adversarial force is applied in the forward direction.
Fig. 8: Robustness of the controller when an adversarial force is applied in the backward direction.

Fig. 7 illustrates the response of the controller for a forward adversarial force when the robot is walking in place and walking forward at 0.5 m/s. Fig. 8 illustrates the response of the controller when the same force of 25 N is applied in the backward direction while the robot is walking forward at 0 and 0.8 m/s. Throughout the four tests, the robot never falls and always closely recovers to the desired velocity.

IV-C Periodic Stability of the Walking Gaits

Periodic stability is one of the most important metrics for assessing the stability of walking gaits. In this paper, we evaluate the stability only empirically, by observing the joint limit cycles of a periodic walking gait. Fig. 9 shows the convergence of several representative actuated joints to periodic limit cycles during fixed-speed walking. Moreover, the orbits described by the left and right joints demonstrate the symmetry of the walking gait. This is due to the symmetry we enforced in the design of the control policy.

Fig. 9: Walking limit cycle of the learned policy with the desired longitudinal velocity of 0.5 m/s.

V Conclusion

This paper presents a novel model-free RL approach for the design of feedback controllers for 3D bipedal robots. The unique decoupled structure of the learned control policy incorporates the physical insights of dynamic walking and heuristic compensations from classic 3D walking controllers. The result is a data-efficient RL method with a reduced number of parameters in the NN that can learn stable and robust dynamic walking gaits from scratch, without any reference motion or expert guidance. The learned policy demonstrates good velocity tracking and disturbance rejection performance on a 3D bipedal robot.