Abstract
Reinforcement learning (RL) aims to estimate the action to take given a (time-varying) state, with the goal of maximizing a cumulative reward function. Predominantly, there are two families of algorithms to solve RL problems: value-based and policy-based methods, with the latter designed to learn a probabilistic parametric policy from states to actions. Most contemporary approaches implement this policy using a neural network (NN). However, NNs usually face issues related to convergence, architectural suitability, hyper-parameter selection, and underutilization of the redundancies of the state-action representations (e.g., locally similar states). This paper postulates multi-linear mappings to efficiently estimate the parameters of the RL policy. More precisely, we leverage the PARAFAC decomposition to design tensor low-rank policies. The key idea involves collecting the policy parameters into a tensor and leveraging tensor-completion techniques to enforce low rank. We establish theoretical guarantees of the proposed methods for various policy classes and validate their efficacy through numerical experiments. Specifically, we demonstrate that tensor low-rank policy models reduce computational and sample complexities in comparison to NN models while achieving similar rewards.
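To make the key idea concrete, the following minimal sketch shows one way policy parameters could be collected into a tensor and represented through PARAFAC factors. It rests on assumptions that are not taken from the abstract: a state space discretized along two dimensions, a softmax policy over a finite action set, and illustrative sizes and rank. The function names (`init_factors`, `logits`, `policy`) are hypothetical, and the sketch covers only the parameterization, not the training procedure or the tensor-completion step.

```python
import math
import random

def init_factors(dims, rank, seed=0):
    """One factor matrix (dim x rank) per tensor mode, with small random entries."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(n)]
            for n in dims]

def logits(factors, state_idx):
    """Entries theta[s_1, ..., s_D, a] of the rank-K tensor, for every action a."""
    *state_factors, action_factor = factors
    rank = len(action_factor[0])
    # Element-wise product, across state modes, of the factor rows
    # selected by the (discretized) state indices.
    w = [1.0] * rank
    for F, i in zip(state_factors, state_idx):
        w = [wk * F[i][k] for k, wk in enumerate(w)]
    # Inner product with each row of the action-mode factor matrix.
    return [sum(wk * row[k] for k, wk in enumerate(w)) for row in action_factor]

def policy(factors, state_idx):
    """Softmax over the action mode (assumed policy class for this sketch)."""
    z = logits(factors, state_idx)
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

# Illustrative sizes: two state dimensions with 10 bins each, 4 actions, rank 3.
dims = (10, 10, 4)
rank = 3
factors = init_factors(dims, rank)
probs = policy(factors, (2, 7))

n_full = math.prod(dims)       # parameters in the full tensor: 400
n_low_rank = rank * sum(dims)  # parameters in the PARAFAC factors: 72
```

Under these assumed sizes, the factorized model stores 72 parameters instead of 400, and the gap widens as state dimensions are added, since the full tensor grows multiplicatively while the factors grow additively. Because states that share an index along one dimension reuse the same factor row, the parameterization also couples similar states.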