Deep reinforcement learning requires a heavy price in terms of sampleefficiency and overparameterization in the neural networks used for functionapproximation. In this work, we use tensor factorization in order to learn morecompact representation for reinforcement learning policies. We show empiricallythat in the low-data regime, it is possible to learn online policies with 2 to10 times less total coefficients, with little to no loss of performance. Wealso leverage progress in second order optimization, and use the theory ofwavelet scattering to further reduce the number of learned coefficients, byforegoing learning the topmost convolutional layer filters altogether. Weevaluate our results on the Atari suite against recent baseline algorithms thatrepresent the state-of-the-art in data efficiency, and get comparable resultswith an order of magnitude gain in weight parsimony.
Quick Read (beta)
Biologically inspired architectures for sample-efficient deep reinforcement learning
Deep reinforcement learning requires a heavy price in terms of sample efficiency and overparameterization in the neural networks used for function approximation. In this work, we use tensor factorization in order to learn more compact representation for reinforcement learning policies. We show empirically that in the low-data regime, it is possible to learn online policies with 2 to 10 times less total coefficients, with little to no loss of performance. We also leverage progress in second order optimization, and use the theory of wavelet scattering to further reduce the number of learned coefficients, by foregoing learning the topmost convolutional layer filters altogether. We evaluate our results on the Atari suite against recent baseline algorithms that represent the state-of-the-art in data efficiency, and get comparable results with an order of magnitude gain in weight parsimony.
Biologically inspired architectures for sample-efficient deep reinforcement learning
noticebox[b]Deep Reinforcement Learning Workshop, NeurIPS 2019, Vancouver, Canada. \[email protected]
1 Introduction & Related Work
The successes of deep reinforcement learning (thereafter ’RL’) come at a heavy computational price. It is well known that achieving human-level performance in domains such as Atari [36, 28, 17] requires hundreds of millions of frames of environment interaction. As such, the problem of sample efficiency in reinforcement learning is of critical importance. Several tracks of concurrent research are being investigated, and have reduced by orders of magnitude the number of environment interactions required for good performance beyond the previous benchmark of biologically-inspired episodic control methods [5, 34] to a couple hours of human gameplay time [38, 20].
However, while the data-efficiency of RL methods has seen recent drastic performance, their function approximators still use millions of learned weights, potentially still leaving them heavily overparameterized. Independently motivated by biological facts like the behavioural readiness of newborn animals, several authors [15, 12, 39] have recently looked at doing away with learning so many weights for RL tasks. Smaller networks not only train faster, but may yet offer another avenue for gains in the form of better generalization . Very recent work from  studies the effect of inductive bias of neural architectures in reinforcement learning ; they forego training altogether, but transfer networks that only obtain ’better than chance performance on MNIST’. In similar fashion,  investigate the effect of random projections in the restricted setting of imitation learning. Finally,  manage human-level performance on the Atari suite using a separate dictionary learning procedure for their features, bypassing the usual end-to-end learning paradigm. The perspective of neural architecture search applied to RL appears difficult, if not computationally inextricable.
Concurrently, the study of biologically-inspired models of learning has exhibited two mathematical characterizations that might be critical in explaining how biological learning takes place so efficiently. First, the low-rank properties of learned perceptual manifolds [9, 10] are giving rise to a rich theory borrowing from statistical physics. Second, another well known line of work has identified Gabor filters (and more generally wavelet filter-like structures) in the actual visual cortex of animals , and linked those to sparsity-promoting methods and dictionary learning [29, 30, 18]. But these breakthroughs have not, so far, been reflected as inductive priors in the shape of modifications in deep RL neural networks architectures, which remain fairly fixed on the Atari domain.
Therefore the following questions remain: how parsimonious do function approximators in reinforcement learning need to be, in order to maintain good performance? And can we be at once sample-efficient and weight-efficient ? In this work, we turn to the mathematical theories of tensor factorization , second-order optimization [1, 27] and wavelet scattering  to answer this question positively and empirically, in a model-free setting. To the best of our knowledge, this is the first time those fields have been combined together in this context, and that tensor factorization is applied to deep RL.
2.1 Deep Reinforcement Learning
We consider the standard Markov Decision Process framework as in ). This setting is characterised by a tuple , where is a set of states, a set of actions, a reward function that is the immediate, intrinsic desirability of a certain state, a transition dynamics and a discount factor. The purpose of the RL problem is to to find a policy , which represents a mapping from states to a probability distribution over actions, that is optimal, i.e., that maximizes the expected cumulative discounted return at each state . In Q-learning, the policy is given implicitly by acting greedily or -greedily with respect to learned action-value functions , that are learned following the Bellman equation. In deep Q-learning, becomes parameterized by the weights of a neural network and one minimizes the expected Bellman loss :
In practice, this is implemented stochastically via uniform sampling of transitions in an experience replay buffer, as is done in the seminal paper . Several algorithmic refinements to that approach exist. First, Double Q-learning  proposes to decouple learning between two networks in order to alleviate the Q-value overestimation problem. Second, dueling Q-networks  explicitly decompose the learning of an action-value function as the sum of an action-independent state-value, much like what is traditionally done in policy gradient methods , implemented via a two-headed neural network architecture. Finally, prioritized RL  proposes to replace the uniform sampling of transitions in the experience replay buffer with importance sampling, by prioritizing those transitions that present the most Bellman error (those transitions that are deemed the most ’surprising’ by the agent).  uses extra weights to learn the variance of the exploration noise in a granular fashion, while  proposes to learn a full distribution of action-values for each action and state. Combined, those methods form the basis of the Rainbow algorithm in .
2.2 Tensor factorization
Here we introduce notations and concepts from the tensor factorization literature. An intuition is that the two main decompositions below, CP and Tucker decompositions, can be understood as multilinear algebra analogues of SVD or eigendecomposition.
CP decomposition. A tensor , can be decomposed into a sum of rank-1 tensors, known as the Canonical-Polyadic decomposition, where is as the rank of the decomposition. The objective is to find the vectors , for , as well as a vector of weights such that:
Tucker decomposition. A tensor , can be decomposed into a low rank approximation consisting of a core and a set of projection factors , with that, when projected along the corresponding dimension of the core, reconstruct the full tensor . The tensor in its decomposed form can then be written:
Tensor regression layer. For two tensors and , we denote by the contraction of by along their last modes; their generalized inner product is
This enables us to define a tensor regression layer  that is differentiable and learnable end-to-end by gradient descent. Let us denote by the input activation tensor for a sample and the label vector. A tensor regression layer estimates the regression weight tensor under a low-rank decomposition. In the case of a Tucker decomposition (as per our experiments) with ranks , we have :
as , for each in and .
2.3 Wavelet scattering
The wavelet scattering transform was originally introduced by  and  as a non-linear extension to the classical wavelet filter bank decomposition . Its principle is as follows. Denoting by the 2-dimensional, circular convolution of two signals and , let us assume that we have pre-defined two wavelet filter banks available , with and two frequency indices. These wavelet filters correspond to high frequencies, so we also give ourselves the data of a lowpass filter . Finally, and by opposition to traditional linear wavelet transforms, we also assume a given nonlinearity . Then the scattering transform is given by coefficients of order 0,1, and 2, respectively :
This can effectively be understood and implemented as a two-layer convolutional neural network whose weights are not learned but rather frozen and given by the coefficients of wavelets and (with Gabor filters as a special case ). The difference with traditional filter banks comes from the iterated modulus/nonlinear activation function applied at each stage, much like in traditional deep learning convolutional neural networks. In practice, the potential of scattering transforms to accelerate deep learning by providing ready-made convolutional layers weights has been investigated in [31, 32, 3].
2.4 Second order optimization with K-FAC
While stochastic gradient descent is usually performed purely from gradient observations derived from auto-differentiation, faster, second order optimization methods first multiply the weights’ gradient vector by a preconditioning matrix, yielding the weight update . In the case of second order methods, the matrix is chosen to act as a tractable iterative approximation to the inverse Hessian or Empirical Fisher Information Matrix  of the neural network model in question. Kronecker-factored approximate curvature or K-FAC  enforces a Kronecker decomposition of the type , with and being smaller, architecture-dependent matrices. Unlike the above methods, K-FAC has been applied as a plug-in in the RL literature and been shown to promote convergence .
3 Methods & Experimental Results
We do take as a baseline method the data-efficient Rainbow of . However, we change the architecture of the neural network function approximators used, in accordance with the principles described above, combining them to reflect inductive biases promoting fewer learnable parameters:
When possible, we replace the first convolutional layer in the approximating neural network with a scattering layer for further gains in terms of learnable weights.
For all our Atari experiments, we used OpenAI Gym , and a combination of PyTorch , TensorLy  and Kymatio  for auto-differentiation. We evaluated our agents in the low-data regime of 100,000 steps, on half the games, with 3 different random seeds for reproducibility . Our specific hyperparameters are described in appendix. We report our results in tables 1 and 2.
|Game||SimPLe||Rainbow||Denoised||TRL 2.5x||TRL 5x||TRL 10x|
|Average (vs. Rainbow)||100%||118%||96%||90%||71%|
Table 1 shows proof of concept of the online learning of low-rank policies, with a loss of final performance varying in proportion to the compression in the low-rank linear layers, very much like in the deep learning literature [22, 23]. The number of coefficients in the original data-efficient Rainbow is of the order of magnitude of 1M and varies depending on the environment and its action-space size. The corresponding tensor regression layer ranks are in appendix, and chosen to target 400k, 200k and 100k coefficients respectively. While individual game results tend to decrease monotonously with increasing compression, we observe that they are noisy as per the nature of exploration in RL, and average scores reported correspond with the intuition that performance seems to decrease fast after a certain overparameterization threshold is crossed. To take this noisy character into account, we take care to be conservative and report the average of the final three episodes of the learned policy after 80000, 90000 and 100000 steps, respectively. Also, so as to not muddy the discussion and provide fair baselines, we do report on the NoisyNet  ablation of Rainbow (’Denoised’ columns), as the NoisyLinear layer doubles up the number of coefficients required and actually performs worse in our experiments. Interestingly, the approximation error in tensor factorization seems to play a role akin to promoting exploration noise.
We then proceed to assess the impact of second-order optimization to our architecture by substituting ADAM optimization for K-FAC, and introducing scattering, in table 2 (only a handful results being available with scattering, due to computational limitations). In spite of our conservative reporting, the efficiency boost from using a second order scheme more than makes up for low-rank approximation error with five times less coefficients than , suggesting that learning with a full order of magnitude less coefficients is well within reach of our techniques.
|Average (vs. Rainbow)||114%||109%||98%||56%|
We have demonstrated that in the low-data regime, it is possible to leverage biologically plausible characterizations of experience data (namely low-rank properties and wavelet scattering separability) to exhibit architectures that learn policies with many times less weights than current baselines, in an online fashion. We do hope that this will lead to even further progress towards sample efficiency and speedy exploration methods. Further work will first focus on thorough evaluation and research of scattering architectures in order to achieve further gains, and second investigate additional, orthogonal biologically-friendly research directions such as promoting sparsity.
-  (1998) Natural gradient works efficiently in learning. Neural Computation 10, pp. 251–276. Cited by: §1, §2.4.
-  (2018-12) Kymatio: Scattering Transforms in Python. arXiv e-prints. External Links: Cited by: §3, Table 3.
-  (2018-05) Generative networks as inverse problems with Scattering transforms. arXiv e-prints. External Links: Cited by: §2.3.
-  (2017-07) A Distributional Perspective on Reinforcement Learning. arXiv e-prints. External Links: Cited by: §2.1.
-  (2016-06) Model-Free Episodic Control. arXiv e-prints. External Links: Cited by: §1.
-  (2016-06) OpenAI Gym. arXiv e-prints. External Links: Cited by: §3.
-  (2012-03) Invariant Scattering Convolution Networks. arXiv e-prints. External Links: Cited by: §2.3.
-  (2017-12) Tensor Regression Networks with various Low-Rank Tensor Approximations. arXiv e-prints. External Links: Cited by: §2.2.
-  (2018-07) Classification and Geometry of General Perceptual Manifolds. Physical Review X 8 (3), pp. 031003. External Links: Cited by: §1.
-  (2016) Linear readout of object manifolds.. Physical review. E 93 6, pp. 060301. Cited by: §1.
-  (2009) Nonnegative matrix and tensor factorizations - applications to exploratory multi-way data analysis and blind source separation. Wiley. Cited by: §1.
-  (2018-06) Playing Atari with Six Neurons. arXiv e-prints. External Links: Cited by: §1.
-  (2018) Solid harmonic wavelet scattering for predictions of molecule properties. The Journal of chemical physics 148 24, pp. 241732. Cited by: Table 3.
-  (2017-06) Noisy Networks for Exploration. arXiv e-prints. External Links: Cited by: §2.1, §3.
-  (2019-06) Weight Agnostic Neural Networks. arXiv e-prints. External Links: Cited by: §1.
-  (2017-09) Deep Reinforcement Learning that Matters. arXiv e-prints. External Links: Cited by: §3.
-  (2017-10) Rainbow: Combining Improvements in Deep Reinforcement Learning. arXiv e-prints. External Links: Cited by: §1, §2.1, 1st item.
-  (2001) A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research 41, pp. 2413–2423. Cited by: §1.
-  (1987) An evaluation of the two-dimensional gabor filter model of simple receptive fields in cat striate cortex.. Journal of neurophysiology 58 6, pp. 1233–58. Cited by: §1.
-  (2019-03) Model-Based Reinforcement Learning for Atari. arXiv e-prints. External Links: Cited by: §1, 3rd item, Table 1.
-  (2014-12) Adam: A Method for Stochastic Optimization. arXiv e-prints. External Links: Cited by: 2nd item.
-  (2017-06) Tensor Contraction Layers for Parsimonious Deep Nets. arXiv e-prints. External Links: Cited by: §2.2, §3.
-  (2017-07) Tensor Regression Networks. arXiv e-prints. External Links: Cited by: §2.2, §2.2, 1st item, §3.
-  (2016-10) TensorLy: Tensor Learning in Python. arXiv e-prints. External Links: Cited by: §3.
-  (2011-01) Group Invariant Scattering. arXiv e-prints. External Links: Cited by: §1, §2.3.
-  (1998) A wavelet tour of signal processing. Academic Press. Cited by: §2.3, §2.3.
-  (2015) Optimizing neural networks with kronecker-factored approximate curvature. ArXiv abs/1503.05671. Cited by: §1, §2.4, 2nd item.
-  (2013-12) Playing Atari with Deep Reinforcement Learning. arXiv e-prints. External Links: Cited by: §1, §2.1.
-  (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, pp. 607–609. Cited by: §1.
-  (1997) Sparse coding with an overcomplete basis set: a strategy employed by v1?. Vision Research 37, pp. 3311–3325. Cited by: §1.
-  (2013-12) Generic Deep Networks with Wavelet Scattering. arXiv e-prints. External Links: Cited by: §2.3.
-  (2018-09) Scattering Networks for Hybrid Representation Learning. arXiv e-prints. External Links: Cited by: §2.3.
-  (2017) Automatic differentiation in pytorch. In OpenReview, Cited by: §3.
-  (2017-03) Neural Episodic Control. arXiv e-prints. External Links: Cited by: §1.
-  (2015-11) Prioritized Experience Replay. arXiv e-prints. External Links: Cited by: §2.1, Further results and learning curves.
-  (2018) Reinforcement learning: an introduction. Second edition, The MIT Press. External Links: Cited by: §1, §2.1.
-  (2015-09) Deep Reinforcement Learning with Double Q-learning. arXiv e-prints. External Links: Cited by: §2.1, Further results and learning curves.
-  (2019-06) When to use parametric models in reinforcement learning?. arXiv e-prints. External Links: Cited by: §1, 1st item, 3rd item, Table 1, §3, §3, Hyperparameters and Reproducibility, Table 4.
-  (2019-05) Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation. arXiv e-prints. External Links: Cited by: §1.
-  (2015-11) Dueling Network Architectures for Deep Reinforcement Learning. arXiv e-prints. External Links: Cited by: §2.1.
-  (2017-08) Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. arXiv e-prints. External Links: Cited by: §2.4.
-  (2016-11) Understanding deep learning requires rethinking generalization. arXiv e-prints. External Links: Cited by: §1.
Hyperparameters and Reproducibility
Our codebase is available on request. Hyperparameters are as follows. First, our specific architecture-modified hyperparameters:
|Specific architecture hyperparameters||Value|
|Scattering maximum log-scale||3|
|Scattering volume width||1|
|Scattering tensor input shape||(1,4,84,84)|
|Scattering tensor output shape||(1,16,11,11)|
|Scattering type||Harmonic 3D, see [2, 13]|
|Hidden linear layer rank constraint, 2.5x compression||128|
|Final linear layer rank constraint, 2.5x compression||48|
|Hidden linear layer rank constraint, 5x compression||32|
|Final linear layer rank constraint, 5x compression||48|
|Hidden linear layer rank constraint, 10x compression||16|
|Final linear layer rank constraint, 10x compression||10|
|KFAC Tikhonov regularization parameter||0.1|
|KFAC Update frequency for inverses||100|
Furthermore, we mirror the Data-Efficient Rainbow  baseline:
|Data-efficient Rainbow hyperparameters||Value|
|Observation down-sampling||(84, 84)|
|Reward clipping||[-1, 1]|
|Terminal on loss of life||True|
|Max frames per episode||108K|
|Update||Distributional Double Q|
|Target network update period||every 2000 updates|
|Support of Q-distribution||51 bins|
|Optimizer: first moment decay||0.9|
|Optimizer: second moment decay||0.999|
|Max gradient norm||10|
|Priority correction||0.4 1|
|Hardware||NVidia 1080Ti GPU|
|Noisy nets parameter||0.1|
|Min replay size for sampling||1600|
|Replay period every||1 steps|
|Multi-step return length||20|
|Q network: channels||32, 64|
|Q network: filter size||5 x 5, 5 x 5|
|Q network: stride||5, 5|
|Q network: hidden units||256|
|Optimizer: learning rate||0.0001|
Standard deviations for score runs
|Game||Denoised||TRL 2.5x||TRL 10x|
|alien||684 7||688 123||566 38|
|amidar||154 21||118 12||84 15|
|assault||321 224||543 94||513 64|
|asterix||500 124||459 91||363 66|
|bank_heist||77 23||59 22||42 2|
|battle_zone||9378 2042||14466 2845||5744 575|
|boxing||1 2||-2 1||-5 1|
|breakout||3 1.5||2 1||4 0.3|
|chopper_command||1293 445||1255 215||1106 124|
|crazy_climber||9977 3744||3928 221||2340 595|
|demon_attack||450 49||362 147||175 7|
|freeway||28 0.6||26 0||24 0.5|
|frostbite||1101 355||659 523||231 1|
|gopher||391 46||278 39||396 24|
|hero||3013 90||5351 1948||3321 598|
|jamesbond||295 57||215 42||218 22|
|kangaroo||1002 587||804 289||400 278|
|krull||2656 180||2333 309||2308 268|
|kung_fu_master||4037 2962||9392 6289||4031 3068|
|ms_pacman||1053 193||818 94||517 38|
|pong||-20 0.4||-20 0||-21 0.1|
|private_eye||100 0||51 59||1128 1067|
|qbert||672 144||697 78||733 291|
|road_runner||5426 2830||6965 6569||1319 216|
|seaquest||387 24||345 40||287 87|
|up_n_down||5123 3146||2197 231||2179 178|
|alien||996 180||643 51||441 90|
|amidar||163 15||98 26||84 11|
|assault||501 85||496 129||434 304|
|asterix||537 96||526 64||502 91|
|bank_heist||100 14||57 36||29 13|
|battle_zone||8622 5358||6156 1951||4311 1517|
|boxing||0 2||-1 3||-9 12|
|breakout||3 1||2 2||2 0|
|chopper_command||692 81||1302 328||441 80|
|crazy_climber||14242 2936||3546 1231||740 291|
|demon_attack||582 130||318 168||692 232|
|freeway||26 0||24 0||19 1|
|frostbite||1760 448||1483 466||654 709|
|gopher||363 4||265 67||172 3|
|hero||4188 1635||4206 1862||4127 1074|
|jamesbond||263 22||217 68||48 10|
|kangaroo||2085 2055||588 5||391 52|
|krull||2855 156||3392 2205||772 560|
|kung_fu_master||8481 8270||7357 9200||233 205|
|ms_pacman||1137 180||867 128||613 159|
|pong||-19 0.6||-19 1||-20 0|
|private_eye||56 42||100 0||0 0|
|qbert||731 256||538 114||475 161|
|road_runner||4516 2869||7224 4598||1278 463|
|seaquest||349 63||520 97||213 96|
|up_n_down||2557 641||2108 298||993 244|
Further results and learning curves
Learning curves As a simpler version of the experiments in the main text body, we show basic proof of concept on the simple Pong Atari game. Our experimental setup consists in using our own implementation of prioritized double DQN as a baseline, which combines algorithmic advances from  and . We replaced the densely connected layer of the original DQN architecture with a tensor regression layer implementing Tucker decomposition for different Tucker ranks, yielding different network compression factors. (These curves average three different random seeds).
Qualitative behaviour. First results, both in terms of learning performance and compression factor, can be seen in figure 1. The two main findings of this experiment are that first, and overall, the final performance of the agent remains unaffected by the tensor factorization, even with high compression rates nearing 10 times. Second, this obviously comes at the expense of stability during training - in tough compression regimes, learning curves are slightly delayed, and their plateauing phases contain occasional noisy drawdowns illustrating the increased difficulty of learning, as seen in figure 2. The extra pathwise noise, however, can be seen as promoting exploration.