Abstract
Reinforcement learning (RL) is concerned with sequential decision making and is traditionally contrasted with supervised learning (SL) and unsupervised learning (USL). In RL, given the current state, the agent makes a decision that may influence the next state, whereas in SL (and USL) the next state remains the same regardless of the decisions taken, in both batch and online learning. Although this difference between SL and RL is fundamental, there are connections that have been overlooked. In particular, we prove in this paper that policy gradient methods can be cast as a supervised learning problem in which the true labels are replaced with discounted rewards. We provide a new proof of policy gradient methods (PGM) that emphasizes the tight link with cross entropy and supervised learning. We provide a simple experiment in which we interchange labels and pseudo rewards. We conclude that other relationships with SL could be established if the reward functions are modified wisely.
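
The link between policy gradients and cross entropy stated above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes a softmax policy over discrete actions, and the function names (`discounted_returns`, `weighted_cross_entropy`, `weighted_cross_entropy_grad`) are hypothetical names chosen for illustration. The sketch treats the action actually taken as the "label" and the discounted return as a per-sample weight on the usual cross-entropy loss.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discounted_returns(rewards, gamma):
    # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def weighted_cross_entropy(logits, targets, weights):
    # Cross entropy between the policy and one-hot targets,
    # with each sample scaled by its weight.
    # With weights = 1 this is the ordinary supervised loss;
    # with weights = discounted returns and targets = actions taken,
    # it is the REINFORCE-style surrogate loss (assumption of this sketch).
    probs = softmax(logits)
    picked = probs[np.arange(len(targets)), targets]
    return -np.sum(weights * np.log(picked))

def weighted_cross_entropy_grad(logits, targets, weights):
    # Gradient of the loss above with respect to the logits:
    # weight * (softmax(logits) - one_hot(target)).
    probs = softmax(logits)
    one_hot = np.eye(logits.shape[1])[targets]
    return weights[:, None] * (probs - one_hot)

# Hypothetical three-step episode with three possible actions.
rewards = [1.0, 1.0, 1.0]
actions = np.array([2, 0, 1])
logits = np.array([[0.2, -0.1, 0.5],
                   [1.0, 0.0, -1.0],
                   [0.0, 0.0, 0.0]])

returns = discounted_returns(rewards, gamma=0.5)
pg_grad = weighted_cross_entropy_grad(logits, actions, returns)
sl_grad = weighted_cross_entropy_grad(logits, actions, np.ones(3))
```

In this sketch, `pg_grad` and `sl_grad` differ only by the per-step scaling with the discounted return, which is the sense in which the reward plays the role of the label weight.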