Reinforcement learning (RL) has shown great success in estimating sequential treatment strategies that take patient heterogeneity into account. However, health-outcome information, which serves as the reward for reinforcement learning methods, is often not well coded but rather embedded in clinical notes. Extracting precise outcome information is a resource-intensive task, so most of the available well-annotated cohorts are small. To address this issue, we propose a semi-supervised learning (SSL) approach that efficiently leverages a small labeled data set with the true outcome observed and a large unlabeled data set with outcome surrogates. In particular, we propose a semi-supervised, efficient approach to Q-learning and doubly robust off-policy value estimation. Generalizing SSL to sequential treatment regimes brings interesting challenges: 1) the feature distribution for Q-learning is unknown, as it includes previous outcomes; 2) the surrogate variables we leverage in the modified SSL framework are predictive of the outcome but not informative about the optimal policy or value function. We provide theoretical results for our Q-function and value function estimators to characterize the degree to which efficiency can be gained from SSL. Our method is at least as efficient as the supervised approach and, moreover, safe, as it is robust to misspecification of the imputation models.