Reinforcement learning, mathematically described by Markov Decision Problems,may be approached either through dynamic programming or policy search.Actor-critic algorithms combine the merits of both approaches by alternatingbetween steps to estimate the value function and policy gradient updates. Dueto the fact that the updates exhibit correlated noise and biased gradientupdates, only the asymptotic behavior of actor-critic is known by connectingits behavior to dynamical systems. This work puts forth a new variant ofactor-critic that employs Monte Carlo rollouts during the policy searchupdates, which results in controllable bias that depends on the number ofcritic evaluations. As a result, we are able to provide for the first time theconvergence rate of actor-critic algorithms when the policy search step employspolicy gradient, agnostic to the choice of policy evaluation technique. Inparticular, we establish conditions under which the sample complexity iscomparable to stochastic gradient method for non-convex problems or slower as aresult of the critic estimation error, which is the main complexity bottleneck.These results hold in continuous state and action spaces with linear functionapproximation for the value function. We then specialize these conceptualresults to the case where the critic is estimated by Temporal Difference,Gradient Temporal Difference, and Accelerated Gradient Temporal Difference.These learning rates are then corroborated on a navigation problem involving anobstacle, providing insight into the interplay between optimization andgeneralization in reinforcement learning.