Despite an ever-growing literature on reinforcement learning algorithms and applications, much less is known about their statistical inference. In this paper, we investigate the large-sample behavior of Q-value estimates and derive closed-form characterizations of their asymptotic variances. This allows us to efficiently construct confidence regions for the Q-value and optimal value functions, and to develop policies that minimize their estimation errors. It also leads to a policy exploration strategy that relies on estimating the relative discrepancies among the Q estimates. Numerical experiments show that our exploration strategy outperforms other benchmark approaches.