Abstract
We investigate the problem of learning an $\epsilon$-approximate solution for the discrete-time Linear Quadratic Regulator (LQR) problem via a Stochastic Variance-Reduced Policy Gradient (SVRPG) approach. Whilst policy gradient methods have been proven to converge linearly to the optimal solution of the model-free LQR problem, their substantial requirement for two-point cost queries in gradient estimation may be intractable, particularly in applications where obtaining cost function evaluations at two distinct control input configurations is exceptionally costly. To this end, we propose an oracle-efficient approach. Our method combines one-point and two-point estimations in a dual-loop variance-reduced algorithm. It achieves an $\epsilon$-approximate optimal solution with only $O\left(\log\left(1/\epsilon\right)^{\beta}\right)$ two-point cost queries, for $\beta \in (0,1)$.
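
For orientation, the one-point and two-point estimations referred to above are commonly written as the following zeroth-order gradient estimators. This is a sketch in standard notation and is not taken from the abstract: the cost $C(K)$ of a feedback gain $K$, the smoothing radius $r > 0$, the random perturbation $U$ drawn uniformly from the unit Frobenius-norm sphere, and the number $d$ of entries of $K$ are all assumptions introduced here for illustration.

$$
\widehat{\nabla} C_{1}(K) = \frac{d}{r}\, C(K + rU)\, U,
\qquad
\widehat{\nabla} C_{2}(K) = \frac{d}{2r}\,\bigl(C(K + rU) - C(K - rU)\bigr)\, U .
$$

Under these assumptions, the one-point estimator needs a single cost evaluation per sample but typically has higher variance, while the two-point estimator needs two evaluations at perturbed gains $K + rU$ and $K - rU$ and typically has lower variance. This is the trade-off that motivates using two-point queries sparingly.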