### Abstract

This paper addresses two challenges facing sampling-based kinodynamic motionplanning: a way to identify good candidate states for local transitions and thesubsequent computationally intractable steering between these candidate states.Through the combination of sampling-based planning, a Rapidly ExploringRandomized Tree (RRT) and an efficient kinodynamic motion planner throughmachine learning, we propose an efficient solution to long-range planning forkinodynamic motion planning. First, we use deep reinforcement learning to learnan obstacle-avoiding policy that maps a robot's sensor observations to actions,which is used as a local planner during planning and as a controller duringexecution. Second, we train a reachability estimator in a supervised manner,which predicts the RL policy's time to reach a state in the presence ofobstacles. Lastly, we introduce RL-RRT that uses the RL policy as a localplanner, and the reachability estimator as the distance function to biastree-growth towards promising regions. We evaluate our method on threekinodynamic systems, including physical robot experiments. Results across allthree robots tested indicate that RL-RRT outperforms state of the artkinodynamic planners in efficiency, and also provides a shorter path finishtime than a steering function free method. The learned local planner policy andaccompanying reachability estimator demonstrate transferability to thepreviously unseen experimental environments, making RL-RRT fast because theexpensive computations are replaced with simple neural network inference.

### Quick Read (beta)

# RL-RRT: Kinodynamic Motion Planning via Learning Reachability Estimators from RL Policies

###### Abstract

This paper addresses two challenges facing sampling-based kinodynamic motion planning: a way to identify good candidate states for local transitions and the subsequent computationally intractable steering between these candidate states. Through the combination of sampling-based planning, a Rapidly Exploring Randomized Tree (RRT) and an efficient kinodynamic motion planner through machine learning, we propose an efficient solution to long-range planning for kinodynamic motion planning. First, we use deep reinforcement learning to learn an obstacle-avoiding policy that maps a robot’s sensor observations to actions, which is used as a local planner during planning and as a controller during execution. Second, we train a reachability estimator in a supervised manner, which predicts the RL policy’s time to reach a state in the presence of obstacles. Lastly, we introduce RL-RRT that uses the RL policy as a local planner, and the reachability estimator as the distance function to bias tree-growth towards promising regions. We evaluate our method on three kinodynamic systems, including physical robot experiments. Results across all three robots tested indicate that RL-RRT outperforms state of the art kinodynamic planners in efficiency, and also provides a shorter path finish time than a steering function free method. The learned local planner policy and accompanying reachability estimator demonstrate transferability to the previously unseen experimental environments, making RL-RRT fast because the expensive computations are replaced with simple neural network inference. The following is a video demonstrating the tree growth and deployment on a physical Fetch robot: https://youtu.be/dDMVMTOI8KY

## I INTRODUCTION

Consider motion planning for robots such as UAVs [17], autonomous ships [3], and spacecrafts [23]. The planning solution needs to satisfy two criteria. First, the solution path must be feasible, meaning that the path must be collision-free and satisfy kinodynamic constraints, e.g. velocity and acceleration bounds even in the presence of sensor noise. Second, the path needs to be efficient, i.e. near optimal with respect to objectives such as time to reach the goal. For example, a motion plan for a car-like robot should avoid obstacles, reach the goal promptly, not make impossibly sharp turns, and maintain enough clearance to compensate for sensor noise.

Current state of the art kinodynamic motion planners search the robot’s feasible state space by building a tree data structure of possible robot motions rooted at the robot’s current state. The methods iteratively use a local planner to attempt to grow the tree until the goal is reached. While some tree-based methods grow the tree by randomly propagating actions, others guide the tree growth to focus state space expansion thus requiring the local planner to be a steering function, a control policy that guides a robot to a specific goal in obstacle-free space, while satisfying the kinodynamic constraints. For example, consider a car-like robot needing to translate a small distance to the left, a motion resembling parallel parking. This motion plan is difficult, even in the absence of obstacles, and requires a steering function to steer the car to the goal. Computing the steering function requires solving an optimal control problem, and is generally NP-Hard [28]. To date, only very limited robot dynamics such as linear [27] and differential drive [20] systems have optimal steering functions.

Besides the existence of steering functions, there are two additional difficulties facing efficient tree-based kinodynamic motion planning. First, tree-based methods that use steering functions require identifying the state in the tree from which to grow. This requires a function that compare the distance between states and return those that are expected to be easily solved by the steering function. An effective distance function for kinodynamic planning is the Time To Reach (TTR) between states using an optimal steering function [20]. TTR, however, is often expensive to compute as it involves numerically integrating the steering function [20]. Second, neither the steering functions nor the related TTR are informed by sensors, and, as a result, do not account for potential obstacles. For example, if a goal is occluded by a wall, the steering function is not able to see the wall due to the lack of sensory input, and TTR would return a value as if an agent could go through the wall.

Recently, deep Reinforcement Learning (RL) emerged as a promising near optimal steering function for kinodynamic systems [13]. In addition, deep RL algorithms can learn policies that map noisy lidar or camera observations directly to robot actions, thus enabling obstacle avoidance while navigating between states for differential drive robots [4, 5]. With the recent development of AutoRL [4], which uses evolutionary algorithms to largely eliminate the need to hand-tune hyper-parameters, network structure and reward functions. This combination offers the promise of deep RL being employed for local planning, i.e., providing both steering function and obstacle avoidance. However, RL policies often lack long-term planning capabilities [18] and get trapped in environments with complex obstacles [6].

To address the lack of available steering functions, good distance functions for aiding tree growth, and obstacle-awareness facing kinodynamic motion planning, we propose RL-RRT, which combines RL and sampling-based planning. It works in three steps. First, we learn an obstacle-avoiding point-to-point (P2P) policy with AutoRL. This is a mapless, goal-conditioned policy that maps sensor readings to control. These P2P policies generalize to new environments without re-training [4]. Second, we train a supervised obstacle-aware reachability estimator that predicts the time it takes the P2P policy to guide the robot from a start to goal state in presence of obstacles, using local observations such as lidar. The key insight is that the AutoRL policy and the estimator implicitly learn the topology of the obstacles. This allows reasonably accurate estimates of time to reach in new environments. Lastly, presented with a motion planning problem in a new envrionment, in a RRT setting, we use the RL policy as a local planner and the reachability estimator as the distance function. The combination of these two learning solutions offers two primary advantages. First, by using RL policies as an obstacle avoiding local planner, RL-RRT can be applied to a variety of kinodynamic systems without optimal steering functions. Second, by using the obstacle-aware reachability estimator, RL-RRT can prune out randomly sampled states that are un-reachable from the tree, e.g., the policy is expected to be unsuccessful, and identify nodes with small TTR to the sampled state. In the example of a car in front of a wall, the RL policy will go around the wall, and the estimator will predict that the time to reach will be longer because of the wall.

We evaluate RL-RRT in two environments with three kinodynamic robots. Results indicate that AutoRL policies are effective obstacle-avoiding local planners. The obstacle-aware reachability estimators, one for each robot, are 74-80% accurate in identifying if a goal state is reachable. Compared to a state of the art steering function free method, SST [15], RL-RRT is up to 2.3 times more likely to identify a path within a fixed time budget and the identified path is up to 4.5 times shorter. RL-RRT typically identifies dynamically-feasible paths in very few iterations – 11 in this case – thanks to intelligent node selection and the obstacle-avoiding local planner (Figure (a)a). The enclosed video demonstrates RL-RRT tree construction and trajectory execution on a physical differential drive robot.

## II RELATED WORK

Steering function-based kinodynamic planners, such as kinodynamic RRT* [27] and D-FMT [24] rely on a steering function to “pull” the tree to achieve rapid exploration [22] and a proper distance function [27, 20, 28]. RL-RRT uses AutoRL [4] to learn steering functions, thus bypassing the challenging two-point boundary value problem.

Steering function free-based approaches, such as EST [22] and SST [15], propagate random actions from a selected node. These methods can be applied to a variety of robot dynamics, although they tend to “wander” [1], thus they can take a long time to identify a solution.

Recent research has offered several solutions for P2P obstacle-avoidance policies on a differential drive robot from raw sensory input, including learning from demonstration [21], curriculum learning [29], and reinforcement learning [26, 4]. Other research offers hierarchical solutions to navigation, where the RL agent executes a path identified by another planner, e.g., from a grid [5], PRMs [6, 8], or manually selected waypoints [12]. However, none of those methods are designed for kinodynamic robots, leading to failures at milestones due to dynamic constraints [8].

Designing an effective distance function for sampling-based kinodynamnic motion planning is challenging [20]. The commonly used Euclidean and weighted Euclidean distance for configuration space planning is inefficient as kinodynamic robot states have limited reachability [14]. The minimum TTR between states is a highly effective distance function [20, 28] but is often too computationally-expensive to be used as a distance function [20]. While learned TTR of a near-optimal differential drive steering function [20] can overcome the computational complexity, this approach still requires a near-optimal steering function. Indirect optimal control has also been used to generate training samples composed of minimum TTR and optimal control actions along trajectories [28]. However, this approach currently only works for low dimensional systems such as inverted pendulum and does not handle limited action bounds. Our approach addresses these challenges by bypassing the necessity of a near-optimal steering function via RL. Unlike previous methods, we also take into account obstacle avoidance, which can significantly change the minimum TTR.

## III METHODS

RL-RRT is a kinodynamic motion planner that learns local planner and distance function w.r.t the individual robot dynamics. It has three main steps. First, we learn an obstacle-avoiding point to point policy with AutoRL [4]. Next, since the RL policy avoids obstacles, we can use the policy to generate obstacle-aware reachability training samples by repeatedly rolling out the learned policy. An obstacle-aware reachability estimator is trained to predict the time to reach between two robot states in the presence of obstacles. Policy and estimator training is done once per robot in training environments. Third, during planning, RL-RRT creates dynamically-feasible motion plans using the RL policy as the local planner and the reachablity estimator as the distance function. Note, that the training and planning simulators require simulated depth measurements (e.g. lidar) around the robot, which can be thought of as analogous to motion planning with information about clearance.

### III-A AutoRL Local Planner

We train a RL agent to perform a P2P task that avoids obstacles. The output of the training is a policy that is used as a local planner, an execution policy, and a data generation source for the obstacle-aware reachability estimator. Using one RL policy for both local planning and path execution is inspired by [9]. This allows the planner to account for potential noise during path execution. To train a policy robust against noise, we model the RL policy is a solution for a continuous state, continuous action, partially observable Markov decision process (POMDP) given as a tuple ${(}{\mathrm{\Omega}}{,}{S}{,}{A}{,}{D}{,}{R}{,}{\gamma}{,}{O}{)}$ of observations, state, actions, dynamics, reward, scalar discount, ${\gamma}{\in}{(}{0}{,}{1}{)}$, and observation probability. The observations are the last three measurements of the noisy robot lidar and potentially noisy relative goal position and robot velocity. We define states as the true robot configuration and its derivative. A black-box robot dynamics simulator, which maps states-action pairs to states, is an input to the RL training environment. Another black-box simulator maps the robot state to noisy lidar observations w.r.t. obstacles. The goal is to train the agent to reach a goal state, $G$, within radius, ${d}_{G}.$ Note that AutoRL identifies a policy that maps noisy sensor and state observations to action. We explore simulated lidar measurement noise in this work and left state estimation and process noise to future work. AutoRL training is required only once for a given robot.

AutoRL [4] over DDPG [16], used for learning the RL agent policy, takes as input: observations, actions, dynamics, goal definition, $(G,r)$, and a parametrized reward, $R:O\times {\theta}_{r}\to \mathbb{R},$. The agent is trained to maximize the probability of reaching the goal without collision. This is achieved by using evolutionary algorithms over populations of agents to find a dense reward that maximizes successful goal reaching. Each generation of agents is trained with a new reward, selected based on the previous experience. At the end, the fittest agent that performs P2P tasks best, is selected as the P2P policy. In this work, all three agents use the same observations, goal definitions, and neural network architectures, but differ in the robot dynamics and reward features used.

As an example, we explain the training of the Asteroid robot here (details of the robot are in the Appendix). Details for the Differential Drive and Car robot can be found in [4] and [8]. The observation is a vector of ${3}{}{{N}}_{{\text{beams}}}$ noisy lidar returns concatenated with the relative planar position of the goal, the robot velocity and orientation (${3}{}{{N}}_{{\text{beams}}}{+}{5}$ dimensional vector). The state is the planar position, velocity and orientation of the robot. The action is the amount of forward thrust and turn rate. The parameterized reward includes

$${R}_{{\bm{\theta}}_{{\bm{r}}_{\text{DD}}}}={\bm{\theta}}^{T}[{r}_{\text{goal}}{r}_{\text{goalDist}}{r}_{\text{collision}}{r}_{\text{clearance}}{r}_{\text{speed}}{r}_{\text{step}}{r}_{\text{disp}}],$$ |

where ${r}_{\text{goal}}$ is 1 when the agent reaches the goal and 0 otherwise, ${r}_{\text{goalDist}}$ is the negative Euclidean distance to the goal, ${r}_{\text{collision}}$ is -1 when the agent collides with obstacles and 0 otherwise, ${r}_{\text{clearance}}$ is the distance to the closest obstacle, ${r}_{\text{speed}}$ is the agent speed when the clearane is below 0.25m, ${r}_{\text{step}}$ is a constant penalty step with value 1, and ${r}_{\text{disp}}$ is sum of displacement between the current and positions 3, 6 and 9 steps before. ${\bm{\theta}}$ is the weight vector tuned by AutoRL.

### III-B Obstacle-Aware Reachablity Estimator

We further improve upon work in [20] by learning the TTR of an obstacle-avoiding P2P RL policy learned in Section III-A. Our obstacle-aware reachability estimator provides the following benefits: 1) It does not need an engineered near-optimal steering function for each robot dynamics. This allows TTR learning for robot systems without near-optimal steering functions. 2) Due to the presence of obstacles, the minimum TTR between states is a function of both robot dynamics and obstacles. Since RL policies can also learn to avoid obstacles, the obstacle-aware reachability estimator can provide additional benefit over TTR estimators that consider only obstacle dynamics such as [20].

#### III-B1 Training data collection

{algorithm}[tb] \[email protected]@algorithmic[1] \[email protected] \[email protected]$\bm{\pi}\left(\bm{o}\right)$: Obstacle avoiding P2P RL policy, ${N}_{\text{episode}}$: Number of episodes, $\mathrm{\Delta}t$: Time step size, ${T}_{\text{horizon}}$: Reachability horizon \[email protected] \[email protected]$\text{trainingData}=({\bm{o}}_{1},{y}_{1}),({\bm{o}}_{2},{y}_{2}),\mathrm{\cdots},({\bm{o}}_{N},{y}_{N})$. \FOR$i=1,\mathrm{\cdots}{N}_{\text{episode}}$ \STATE$\bm{s},\bm{g}$ = sampleStartAndGoal() \STATEelapsedTime = 0 \WHILEisDone is False \STATEelapsedTime += $\mathrm{\Delta}t$ \STATE$\bm{o}$ = makeObservation() \STATEexecutePolicy($\bm{\pi}\left(\bm{o}\right)$, $\mathrm{\Delta}t$) \STATEobsHistory.append($\bm{o}$) \STATE${c}$, isDone = getTTRCost(elapsedTime, ${T}_{\text{horizon}}$) \STATEcostHistory.append(${c}$) \ENDWHILE\STATEcfc = computeCumulativeFutureCost(costHistory) \FORj=0, len(obsHistory) \STATEtrainingData.append(($\bm{o}=$ obsHistory[j], $y=$ cfc[j])) \ENDFOR\STATEobsHistory.clear(); costHistory.clear() \ENDFOR\RETURNtrainingData Algorithm III-B1 summarizes the training data collection. First, for each episode, we initialize the robot with randomly chosen start and goal states (Alg. III-B1 line 2). Next, we execute the policy until the episode terminates (lines 4-11) due to reaching the goal, collision, or reaching a time horizon ${{T}}_{{\text{horizon}}}$. During execution, we record the robot observation at each time step (line 8) and compute and record the TTR cost (lines 9-10). The TTR cost is set to $\mathrm{\Delta}t$ at every time step. To classify whether the robot can reach the goal, we use a simple heuristic that penalizes trajectories that do not reach the goal. If the robot is in collision or the time horizon is reached (elapsedTime equals to ${{T}}_{{\text{horizon}}}$), the TTR cost of that time step is set to ${\mathrm{\Delta}}{}{t}{+}{{T}}_{{\text{horizon}}}$, and the episode is terminated immediately by setting isDone to true. After an episode terminates, we compute the cumulative future TTR cost for all states along the trajectory, i.e., remaining cost-to-go to the end of the trajectory (line 12). The observation and cumulative future cost of each time step form a training sample and is recorded (line 14). The process repeats for ${N}_{\text{episode}}=1000$ episodes. We designed the TTR cost heuristic such that if the robot reaches the goal, the cumulative future cost of each state along the trajectory is the TTR between that state and the goal. Conversely, if the robot failed to reached the goal due to collision or the episode reaches time horizon, all cumulative future cost along the trajectory will be larger than ${T}_{\text{horizon}}$. By employing a common machine learning technique that uses a regressor and a threshold value as a classifier [11], we can quickly classify whether a goal state can be reached during planning.

#### III-B2 Reachability Estimator Network

We train the obstacle-aware reachability estimator network with the training data collected above. The network input is the robot observation $\bm{o}$ and the output is the estimated TTR. We use a simple three-layer fully-connected network with [500, 200, 100] hidden neurons with each a dropout probability of 0.5. We use the L2 loss between estimated TTR and the V-value label from the training data.### III-C RL-RRT

Alg. 0 describes RL-RRT. While the standard RRT algorithm was utilized, modifications were made to efficiently utilize the obstacle-aware reachability estimator and the obstacle-avoiding RL local planner. {algorithm} [tb] \[email protected]@algorithmic[1]## IV Evaluation

To demonstrate RL-RRT, we evaluate our method on three kinodynamic robots in two environments unseen during training, and we experimentally verify the method on a physical differential drive Fetch robot from Fetch Robotics.### IV-A Setup

The three robots we evaluate are: Car, Asteroid, and Fetch. Car is a kinematic car with inertia [19] with a maximum steering angle ${30}^{\circ},$ and a 1.0 $m/{s}^{2}$ maximum acceleration and speed of 1.0 $m/s$. Asteroid has similar dynamics to those found in the popular video game Asteroid, and we chose it since it is highly kinodynamic, unintuitive for a human to control, and has no known optimal steering function. The details are available in the supplemental materials. The Fetch robot has a radius of $0.3\mathrm{m}$, 1.0 m/s maximum speed and 2.0 rad/s turn rate. The sensor noise is simulated by a zero mean Gaussian with a standard deviation of 0.1 m. We use the Fetch robot as a differential drive platform for on-robot experiments. All point-to-point policies are trained in the environment depicted in Figure (a)a. We evaluate RL policies and plan in two office building environments, Map 1 (Figure (a)a) and Map 2 (Figure (c)c), which are roughly 15 and 81 times larger than the training environment, respectively. Map 1 is is generated from a floor plan, while Map 2 is generated using a noisy SLAM of the Fetch physical testbed where we ran the experiments. These environments include parts that are cluttered, as seen in Map 1, and very narrow corridors, such seen in Map 2. We compare RL-RRT to SST [15], a state of the art steering function free kinodynamic motion planner. For Fetch robot, we also compare to RRT with Dynamic Window Approach (DWA) [7] as local planner (denoted RRT-DW). Additionally, we test disabling the clearance term of DWA, essentially turning it into a MPC-based steering function (denoted RRT-S). All experiment are repeated 50 times. Besides AutoRL training, all computation was done on an Intel Xeon E5-1650 @ 3.6GHz using TensorFlow 1.x (Google release) and Python 2.7. AutoRL policies were implemented with Google Vizier [10] and TFAgents [25].### IV-B AutoRL Policy Performance

We use pre-trained P2P policies for Fetch [4] and Car [8] robots. Their short description is available in the Appendix. The Asteroid P2P policy is original to this paper. All agents are trained with AutoRL over DDPG [4]. The goals are randomly placed within $10\mathrm{m}$. We train 100 agents in parallel over 10 generations as in [4]. The training took roughly 7 days. Figure 2 shows the success rate of the P2P agents compared to goal distance. Notice that when the goal distance is $10\mathrm{m}$ or farther than the trained policy, the performance degrades. We also notice that the Car policy is best performing, while the Asteroid policy is the most challenging. These results show that AutoRL produces, without hand-tuning, effective local planners, i.e., both a steering function and an obstacle avoidance policy for a variety of robot dynamics.### IV-C Reachability Estimator Performance

The obstacle-aware reachability estimator is trained in the training environment with goals sampled within $20\mathrm{m}$ from the initial states, twice the distance used for P2P training. The estimator network was trained on 1000 episodes with about 100,000 samples. Data generation takes about 10 minutes. The reachability thresholds are 20 seconds for differential drive and Asteroid, and 40 seconds for Car. Each estimator was trained over 500 epochs and took about 30 minutes. Robot Confusion Matrix Prec. Recall Accur. True (%) (%) (%) (%) Fetch Predicted 42.7 21.6 66.4 92.2 74.8 (%) 3.6 32.1 Car Predicted 44.5 14.2 75.8 90.2 81.0 (%) 4.8 36.5 Asteroid Predicted 26.5 16.3 61.9 73.4 74.1 (%) 9.6 47.6 Accuracy of the models is between 70% and 80% (Table I). Notice that a high recall means that the estimator misses fewer nodes, and suggests that the paths RL-RRT produces should be near-optimal. On the other hand, relatively low precision implies that RL-RRT will explore samples that end up not being useful. This means that we can speed-up RL-RRT further by learning a more precise predictor. The reachability estimator overestimates the TTR of reachable states across all robots (Fig. 3). However, overestimation disappears when trained and evaluated only on reachable states (see Fig. 1 in Appendix for more detail). This suggests that the overestimation of TTR is likely due to the TTR cost heuristic uses a penalty for states unreachable within ${{T}}_{{\text{horizon}}}$. We leave identifying better TTR cost heuristics and estimator network architectures for future work.### IV-D Planning Results

### IV-E Physical Robot Experiments

In order to verify that the RL-RRT produces motion plans that can be used on real robots, we executed the motion plans on the Fetch robot (Figure. (b)b) in Map 2 environment. We ran 10 different motion plans, repeated 3 times. Figure (c)c presents one such trajectory. The straight line distance between the start and goal is 20.8 m. In green are tree nodes for a path, and the blue line is the executed robot path with the P2P AutoRL policy. We notice two things. First, the path is similar to the one humans would take. The shortest path leads through cubicle space, which is cluttered. Because the P2P policy does not consistently navigate the cubicle space, the TTR estimates are high in that region and the tree progress slowly in that area. At the same time, in the uncluttered space near the start position (left and right) the tree grows quickly. The executed trajectory (in blue) stays close to the planned path. Enclosed video contains the footage of the robot traversing the path.## V DISCUSSION

## VI CONCLUSIONS

This paper contributes RL-RRT, a kinodynamic planner which works in three steps: 1) learning obstacle-avoiding local planner; 2) training an obstacle-aware reachability estimator for the learned local planner; and 3) using the estimator as the distance function and to bias sampling in RRT. Unlike traditional kinodynamic motion planners, RL-RRT learns a suitable steering and distance function. The robot is trained once, and the policy and estimator transfer to the new envrionments. We evaluated the method on three kinodynmic robots in two simulated environments. Compared to the baselines, RRT plans faster and produces shorter paths. We also verified RL-RRT on a physical differential drive robot. For future work, following PRM-RL, we plan to improve the noise robustness of RL-RRT by Monte Carlo roll-outs during tree extensions. We also plan to identify better TTR cost heuristics, network architectures and online adaptation of the reachability estimator.## ACKNOWLEDGMENT

We thank Tsang-Wei Edward Lee for assisting with robot experiments, and Brian Ichter for the helpful feedback. Tapia and Chiang are partially supported by the National Science Foundation under Grant Numbers IIS-1528047 and IIS-1553266 (Tapia, CAREER). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.## References

- [1] R. Allen and M. Pavone. A real-time framework for kinodynamic planning with application to quadrotor obstacle avoidance. In AIAA Guidance, Navigation, and Control Conference, page 1374, 2016.
- [2] H.-T. L. Chiang, A. Faust, S. Satomi, and L. Tapia. Fast swept volume estimation with deep learning. In Proc. Int. Workshop on Algorithmic Foundations of Robotics (WAFR), page To appear, 2018.
- [3] H.-T. L. Chiang and L. Tapia. Colreg-rrt: An rrt-based colregs-compliant motion planner for surface vehicle navigation. Robotics and Automat. Lett., 3(3):2024–2031, 2018.
- [4] L. Chiang, A. Faust, M. Fiser, and A. Francis. Learning navigation behaviors end-to-end with autorl. IEEE Robotics and Automation Letters (RA-L), 2019.
- [5] T. Fan, X. Cheng, J. Pan, P. Long, W. Liu, R. Yang, and D. Manocha. Getting robots unfrozen and unlost in dense pedestrian crowds. Robotics and Automat. Lett., 2019.
- [6] A. Faust, O. Ramirez, M. Fiser, K. Oslund, A. Francis, J. Davidson, and L. Tapia. PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 5113–5120, Brisbane, Australia, 2018.
- [7] D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robot. & Automation Mag., 4(1):23–33, 1997.
- [8] A. Francis, A. Faust, H.-T. L. Chiang, J. Hsu, J. C. Kew, M. Fiser, and T.-W. E. Lee. Long-range indoor navigation with prm-rl. arXiv preprint arXiv:1902.09458, 2019.
- [9] A. Francis, A. Faust, H.-T. L. Chiang, J. Hsu, J. C. Kew, M. Fiser, and T.-W. E. Lee. Long-range indoor navigation with prm-rl. volume 1, 2019.
- [10] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for black-box optimization. In Proc. of ACM Intl. Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.
- [11] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
- [12] Y. Kato, K. Kamiyama, and K. Morioka. Autonomous robot navigation system with learning based on deep q-network and topological maps. In 2017 IEEE/SICE International Symposium on System Integration (SII), pages 1040–1046, Dec 2017.
- [13] A. Layek, N. A. Vien, T. Chung, et al. Deep reinforcement learning algorithms for steering an underactuated ship. In IEEE Int. Conf. on Multisensor Fusion and Integration for Intell. Sys. (MFI), pages 602–607. IEEE, 2017.
- [14] Y. Li and K. E. Bekris. Learning approximate cost-to-go metrics to improve sampling-based motion planning. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 4196–4201. IEEE, 2011.
- [15] Y. Li, Z. Littlefield, and K. E. Bekris. Asymptotically optimal sampling-based kinodynamic planning. Int. J. Robot. Res., 35(5):528–564, 2016.
- [16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- [17] S. Liu, M. Watterson, K. Mohta, K. Sun, S. Bhattacharya, C. J. Taylor, and V. Kumar. Planning dynamically feasible trajectories for quadrotors using safe flight corridors in 3-d complex environments. Robotics and Automat. Lett., 2(3):1688–1695, 2017.
- [18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- [19] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Trans. on intel. vehicles, 1(1):33–55, 2016.
- [20] L. Palmieri and K. O. Arras. Distance metric learning for rrt-based motion planning with constant-time inference. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 637–643. IEEE, 2015.
- [21] M. Pfeiffer, M. Schaeuble, J. I. Nieto, R. Siegwart, and C. Cadena. From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots. Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 1527–1533, 2017.
- [22] J. M. Phillips, N. Bedrossian, and L. E. Kavraki. Guided expansive spaces trees: A search strategy for motion-and cost-constrained state spaces. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), volume 4, pages 3968–3973. IEEE, 2004.
- [23] A. Richards, T. Schouwenaars, J. P. How, and E. Feron. Spacecraft trajectory planning with avoidance constraints using mixed-integer linear programming. J. of Guidance, Control, and Dynamics, 25(4):755–764, 2002.
- [24] E. Schmerling, L. Janson, and M. Pavone. Optimal sampling-based motion planning under differential constraints: the drift case with linear affine dynamics. In IEEE Conf. on Decision and Control (CDC), pages 2574–2581. IEEE, 2015.
- [25] Sergio Guadarrama, Anoop Korattikara, Oscar Ramirez, Pablo Castro, Ethan Holly, Sam Fishman, Ke Wang, Ekaterina Gonina, Chris Harris, Vincent Vanhoucke, Eugene Brevdo. TF-Agents: A library for reinforcement learning in tensorflow. https://github.com/tensorflow/agents, 2018.
- [26] L. Tai, G. Paolo, and M. Liu. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 31–36, Sept 2017.
- [27] D. J. Webb and J. Van Den Berg. Kinodynamic rrt*: Asymptotically optimal motion planning for robots with linear dynamics. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 5054–5061. IEEE, 2013.
- [28] W. J. Wolfslag, M. Bharatheesha, T. M. Moerland, and M. Wisse. RRT-colearn: towards kinodynamic planning without numerical trajectory optimization. Robotics and Automat. Lett., 3(3):1655–1662, 2018.
- [29] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard. Deep reinforcement learning with successor features for navigation across similar environments. In Proc. IEEE Int. Conf. Intel. Rob. Syst. (IROS), pages 2371–2378. IEEE, 2017.