Abstract
Deep reinforcement learning requires a heavy price in terms of sampleefficiency and overparameterization in the neural networks used for functionapproximation. In this work, we use tensor factorization in order to learn morecompact representation for reinforcement learning policies. We show empiricallythat in the lowdata regime, it is possible to learn online policies with 2 to10 times less total coefficients, with little to no loss of performance. Wealso leverage progress in second order optimization, and use the theory ofwavelet scattering to further reduce the number of learned coefficients, byforegoing learning the topmost convolutional layer filters altogether. Weevaluate our results on the Atari suite against recent baseline algorithms thatrepresent the stateoftheart in data efficiency, and get comparable resultswith an order of magnitude gain in weight parsimony.
Quick Read (beta)
Biologically inspired architectures for sampleefficient deep reinforcement learning
Abstract
Deep reinforcement learning requires a heavy price in terms of sample efficiency and overparameterization in the neural networks used for function approximation. In this work, we use tensor factorization in order to learn more compact representation for reinforcement learning policies. We show empirically that in the lowdata regime, it is possible to learn online policies with 2 to 10 times less total coefficients, with little to no loss of performance. We also leverage progress in second order optimization, and use the theory of wavelet scattering to further reduce the number of learned coefficients, by foregoing learning the topmost convolutional layer filters altogether. We evaluate our results on the Atari suite against recent baseline algorithms that represent the stateoftheart in data efficiency, and get comparable results with an order of magnitude gain in weight parsimony.
Biologically inspired architectures for sampleefficient deep reinforcement learning
Pierre H. Richemond Imperial College London [email protected] Arinbjörn Kolbeinsson Imperial College London [email protected] Yike Guo Imperial College London [email protected]
noticebox[b]Deep Reinforcement Learning Workshop, NeurIPS 2019, Vancouver, Canada. \[email protected]
1 Introduction & Related Work
The successes of deep reinforcement learning (thereafter ’RL’) come at a heavy computational price. It is well known that achieving humanlevel performance in domains such as Atari [36, 28, 17] requires hundreds of millions of frames of environment interaction. As such, the problem of sample efficiency in reinforcement learning is of critical importance. Several tracks of concurrent research are being investigated, and have reduced by orders of magnitude the number of environment interactions required for good performance beyond the previous benchmark of biologicallyinspired episodic control methods [5, 34] to a couple hours of human gameplay time [38, 20].
However, while the dataefficiency of RL methods has seen recent drastic performance, their function approximators still use millions of learned weights, potentially still leaving them heavily overparameterized. Independently motivated by biological facts like the behavioural readiness of newborn animals, several authors [15, 12, 39] have recently looked at doing away with learning so many weights for RL tasks. Smaller networks not only train faster, but may yet offer another avenue for gains in the form of better generalization [42]. Very recent work from [15] studies the effect of inductive bias of neural architectures in reinforcement learning ; they forego training altogether, but transfer networks that only obtain ’better than chance performance on MNIST’. In similar fashion, [39] investigate the effect of random projections in the restricted setting of imitation learning. Finally, [12] manage humanlevel performance on the Atari suite using a separate dictionary learning procedure for their features, bypassing the usual endtoend learning paradigm. The perspective of neural architecture search applied to RL appears difficult, if not computationally inextricable.
Concurrently, the study of biologicallyinspired models of learning has exhibited two mathematical characterizations that might be critical in explaining how biological learning takes place so efficiently. First, the lowrank properties of learned perceptual manifolds [9, 10] are giving rise to a rich theory borrowing from statistical physics. Second, another well known line of work has identified Gabor filters (and more generally wavelet filterlike structures) in the actual visual cortex of animals [19], and linked those to sparsitypromoting methods and dictionary learning [29, 30, 18]. But these breakthroughs have not, so far, been reflected as inductive priors in the shape of modifications in deep RL neural networks architectures, which remain fairly fixed on the Atari domain.
Therefore the following questions remain: how parsimonious do function approximators in reinforcement learning need to be, in order to maintain good performance? And can we be at once sampleefficient and weightefficient ? In this work, we turn to the mathematical theories of tensor factorization [11], secondorder optimization [1, 27] and wavelet scattering [25] to answer this question positively and empirically, in a modelfree setting. To the best of our knowledge, this is the first time those fields have been combined together in this context, and that tensor factorization is applied to deep RL.
2 Background
2.1 Deep Reinforcement Learning
We consider the standard Markov Decision Process framework as in [36]). This setting is characterised by a tuple $\u27e8S,A,T,R,\gamma \u27e9$, where $S$ is a set of states, $A$ a set of actions, $R$ a reward function that is the immediate, intrinsic desirability of a certain state, $T$ a transition dynamics and $\gamma $ $\in [0,1]$ a discount factor. The purpose of the RL problem is to to find a policy $\pi $, which represents a mapping from states to a probability distribution over actions, that is optimal, i.e., that maximizes the expected cumulative discounted return ${\sum}_{k=0}^{\mathrm{\infty}}{\gamma}^{k}{R}_{t+k+1}$ at each state ${s}_{t}\in S$. In Qlearning, the policy is given implicitly by acting greedily or $\u03f5$greedily with respect to learned actionvalue functions ${q}^{\pi}(s,a)$, that are learned following the Bellman equation. In deep Qlearning, ${q}_{\theta}$ becomes parameterized by the weights $\theta $ of a neural network and one minimizes the expected Bellman loss :
$$\mathbb{E}{\left({R}_{t+1}+{\gamma}_{t+1}\underset{{a}^{\prime}}{\mathrm{max}}{q}_{\theta}({S}_{t+1},{a}^{\prime}){q}_{\theta}({S}_{t},{A}_{t})\right)}^{2}$$ 
In practice, this is implemented stochastically via uniform sampling of transitions in an experience replay buffer, as is done in the seminal paper [28]. Several algorithmic refinements to that approach exist. First, Double Qlearning [37] proposes to decouple learning between two networks in order to alleviate the Qvalue overestimation problem. Second, dueling Qnetworks [40] explicitly decompose the learning of an actionvalue function ${q}_{\theta}(s,a)$ as the sum of an actionindependent statevalue, much like what is traditionally done in policy gradient methods [36], implemented via a twoheaded neural network architecture. Finally, prioritized RL [35] proposes to replace the uniform sampling of transitions in the experience replay buffer with importance sampling, by prioritizing those transitions that present the most Bellman error (those transitions that are deemed the most ’surprising’ by the agent). [14] uses extra weights to learn the variance of the exploration noise in a granular fashion, while [4] proposes to learn a full distribution of actionvalues for each action and state. Combined, those methods form the basis of the Rainbow algorithm in [17].
2.2 Tensor factorization
Here we introduce notations and concepts from the tensor factorization literature. An intuition is that the two main decompositions below, CP and Tucker decompositions, can be understood as multilinear algebra analogues of SVD or eigendecomposition.
CP decomposition. A tensor $\mathcal{X}\in {\mathbb{R}}^{{I}_{1}\times {I}_{2}\times \mathrm{\cdots}\times {I}_{N}}$, can be decomposed into a sum of $R$ rank1 tensors, known as the CanonicalPolyadic decomposition, where $R$ is as the rank of the decomposition. The objective is to find the vectors ${\mathbf{u}}_{k}^{(1)},{\mathbf{u}}_{k}^{(2)},\mathrm{\cdots},{\mathbf{u}}_{k}^{(N)}$, for $k=[1\mathrm{\dots}R]$, as well as a vector of weights $\bm{\lambda}\in {\mathbb{R}}^{R}$ such that:
$$\mathcal{X}=\sum _{k=1}^{R}\underset{\text{rank1}\text{components}}{\underset{\u23df}{{\lambda}_{k}{\mathbf{u}}_{k}^{(1)}\circ {\mathbf{u}}_{k}^{(2)}\circ \mathrm{\cdots}\circ {\mathbf{u}}_{k}^{(N)}}}$$ 
Tucker decomposition. A tensor $\mathcal{X}\in {\mathbb{R}}^{{I}_{1}\times {I}_{2}\times \mathrm{\cdots}\times {I}_{N}}$, can be decomposed into a low rank approximation consisting of a core $\mathcal{G}\in {\mathbb{R}}^{{R}_{1}\times {R}_{2}\times \mathrm{\cdots}\times {R}_{N}}$ and a set of projection factors $({\mathbf{U}}^{(0)},\mathrm{\cdots},{\mathbf{U}}^{(N1)})$, with ${\mathbf{U}}^{(k)}\in {\mathbb{R}}^{{R}_{k},{\widehat{I}}_{k}},k\in (0,\mathrm{\cdots},N1)$ that, when projected along the corresponding dimension of the core, reconstruct the full tensor $\mathcal{X}$. The tensor in its decomposed form can then be written:
$\mathcal{X}$  $=\mathcal{G}{\times}_{1}{\mathbf{U}}^{(1)}{\times}_{2}{\mathbf{U}}^{(2)}\times \mathrm{\cdots}{\times}_{N}{\mathbf{U}}^{(N)}=[\mathcal{G};{\mathbf{U}}^{(1)},\mathrm{\cdots},{\mathbf{U}}^{(N)}]$ 
Tensor regression layer. For two tensors $\mathcal{X}\mathit{\hspace{1em}}\in \mathit{\hspace{1em}}{\mathbb{R}}^{{K}_{1}\times \mathrm{\cdots}\times {K}_{x}\times {I}_{1}\times \mathrm{\cdots}\times {I}_{N}}$ and $\mathcal{Y}\mathit{\hspace{1em}}\in \mathit{\hspace{1em}}{\mathbb{R}}^{{I}_{1}\times \mathrm{\cdots}\times {I}_{N}\times {L}_{1}\times \mathrm{\cdots}\times {L}_{y}}$, we denote by ${\u27e8\mathcal{X},\mathcal{Y}\u27e9}_{N}\mathit{\hspace{1em}}\in \mathit{\hspace{1em}}{\mathbb{R}}^{{K}_{1}\times \mathrm{\cdots}\times {K}_{x}\times {L}_{1}\times \mathrm{\cdots}\times {L}_{y}}$ the contraction of $\mathcal{X}$ by $\mathcal{Y}$ along their $N$ last modes; their generalized inner product is
$${\u27e8\mathcal{X},\mathcal{Y}\u27e9}_{N}=\sum _{{i}_{1}=1}^{{I}_{1}}\sum _{{i}_{2}=1}^{{I}_{2}}\mathrm{\cdots}\sum _{{i}_{n}=1}^{{I}_{N}}{\mathcal{X}}_{\mathrm{\dots},{i}_{1},{i}_{2},\mathrm{\dots},{i}_{n}}{\mathcal{Y}}_{{i}_{1},{i}_{2},\mathrm{\dots},{i}_{n},\mathrm{\dots}}$$ 
This enables us to define a tensor regression layer [23] that is differentiable and learnable endtoend by gradient descent. Let us denote by $\mathcal{X}\in {\mathbb{R}}^{{I}_{1}\times {I}_{2}\times \mathrm{\cdots}\times {I}_{N}}$ the input activation tensor for a sample and $\mathbf{y}\in {\mathbb{R}}^{{I}_{N}}$ the label vector. A tensor regression layer estimates the regression weight tensor $\mathcal{W}\in {\mathbb{R}}^{{I}_{1}\times {I}_{2}\times \mathrm{\cdots}\times {I}_{N}}$ under a lowrank decomposition. In the case of a Tucker decomposition (as per our experiments) with ranks $({R}_{1},\mathrm{\cdots},{R}_{N})$, we have :
$$\begin{array}{c}\hfill \mathbf{y}={\u27e8\mathcal{X},\mathcal{W}\u27e9}_{N}+\mathbf{b}\mathit{\hspace{1em}\hspace{1em}}\text{with}\mathcal{W}=\mathcal{G}{\times}_{1}{\mathbf{U}}^{(1)}{\times}_{2}{\mathbf{U}}^{(2)}\mathrm{\cdots}{\times}_{N}{\mathbf{U}}^{(N)}\hfill \end{array}$$ 
as $\mathcal{G}\in {\mathbb{R}}^{{R}_{1}\times \mathrm{\cdots}\times {R}_{N}}$, ${\mathbf{U}}^{(k)}\in {\mathbb{R}}^{{I}_{k}\times {R}_{k}}$ for each $k$ in $[1\mathrm{\dots}N]$ and ${\mathbf{U}}^{(N)}\in {\mathbb{R}}^{1\times {R}_{N}}$.
2.3 Wavelet scattering
The wavelet scattering transform was originally introduced by [25] and [7] as a nonlinear extension to the classical wavelet filter bank decomposition [26]. Its principle is as follows. Denoting by $x\u229by[n]$ the 2dimensional, circular convolution of two signals $x[n]$ and $y[n]$, let us assume that we have predefined two wavelet filter banks available ${\left\{{\psi}_{{\lambda}_{1}}^{(1)}[n]\right\}}_{{\lambda}_{1}\in {\mathrm{\Lambda}}_{1}}$ ${\left\{{\psi}_{{\lambda}_{2}}^{(2)}[n]\right\}}_{{\lambda}_{2}\in {\mathrm{\Lambda}}_{2}}$ , with ${\lambda}_{1}$ and ${\lambda}_{2}$ two frequency indices. These wavelet filters correspond to high frequencies, so we also give ourselves the data of a lowpass filter ${\varphi}_{J}[n]$. Finally, and by opposition to traditional linear wavelet transforms, we also assume a given nonlinearity $\rho (t)$. Then the scattering transform is given by coefficients of order 0,1, and 2, respectively :
$${S}_{0}x[n]=x\u229b{\varphi}_{J}[n]$$ 
$${S}_{1}x[n,{\lambda}_{1}]=\rho \left(x\u229b{\psi}_{{\lambda}_{1}}^{(1)}\right)\u229b{\varphi}_{J}[n]\mathit{\hspace{1em}}{\lambda}_{1}\in {\mathrm{\Lambda}}_{1}$$ 
$${S}_{2}x[n,{\lambda}_{1},{\lambda}_{2}]=\rho \left(\rho \left(x\u229b{\psi}_{{\lambda}_{1}}^{(1)}\right)\u229b{\psi}_{{\lambda}_{2}}^{(2)}\right)\u229b{\varphi}_{J}[n]\mathit{\hspace{1em}}{\lambda}_{1}\in {\mathrm{\Lambda}}_{1},{\lambda}_{2}\in {\mathrm{\Lambda}}_{2}\left({\lambda}_{1}\right)$$ 
This can effectively be understood and implemented as a twolayer convolutional neural network whose weights are not learned but rather frozen and given by the coefficients of wavelets $\psi $ and $\varphi $ (with Gabor filters as a special case [26]). The difference with traditional filter banks comes from the iterated modulus/nonlinear activation function applied at each stage, much like in traditional deep learning convolutional neural networks. In practice, the potential of scattering transforms to accelerate deep learning by providing readymade convolutional layers weights has been investigated in [31, 32, 3].
2.4 Second order optimization with KFAC
While stochastic gradient descent is usually performed purely from gradient observations derived from autodifferentiation, faster, second order optimization methods first multiply the weights’ $\theta $ gradient vector ${\nabla}_{\theta}$ by a preconditioning matrix, yielding the weight update $\theta \leftarrow \theta \eta {G}^{1}{\nabla}_{\theta}$. In the case of second order methods, the matrix ${G}^{1}$ is chosen to act as a tractable iterative approximation to the inverse Hessian or Empirical Fisher Information Matrix [1] of the neural network model in question. Kroneckerfactored approximate curvature or KFAC [27] enforces a Kronecker decomposition of the type $G=A\otimes B$, with $A$ and $B$ being smaller, architecturedependent matrices. Unlike the above methods, KFAC has been applied as a plugin in the RL literature and been shown to promote convergence [41].
3 Methods & Experimental Results
We do take as a baseline method the dataefficient Rainbow of [38]. However, we change the architecture of the neural network function approximators used, in accordance with the principles described above, combining them to reflect inductive biases promoting fewer learnable parameters:
 •
 •
 •

•
When possible, we replace the first convolutional layer in the approximating neural network with a scattering layer for further gains in terms of learnable weights.
For all our Atari experiments, we used OpenAI Gym [6], and a combination of PyTorch [33], TensorLy [24] and Kymatio [2] for autodifferentiation. We evaluated our agents in the lowdata regime of 100,000 steps, on half the games, with 3 different random seeds for reproducibility [16]. Our specific hyperparameters are described in appendix. We report our results in tables 1 and 2.
Game  SimPLe  Rainbow  Denoised  TRL 2.5x  TRL 5x  TRL 10x 

alien  405  740  684  688  454  566 
amidar  88  189  154  118  86  84 
assault  369  431  321  543  521  513 
asterix  1090  471  500  459  554  363 
bank_heist  8  51  77  59  134  42 
battle_zone  5184  10125  9378  14466  13466  5744 
boxing  9  0.2  1  2  2  5 
breakout  13  2  3  2  2  4 
chopper_command  1247  862  1293  1255  1243  1106 
crazy_climber  39828  16185  9977  3928  4225  2340 
demon_attack  170  508  450  362  263  175 
freeway  20  28  28  26  25  24 
frostbite  255  867  1101  659  912  231 
gopher  771  349  391  278  255  396 
hero  1295  6857  3013  5351  3732  3321 
jamesbond  125  302  295  215  213  218 
kangaroo  323  779  1002  804  715  400 
krull  4540  2852  2656  2333  2275  2308 
kung_fu_master  17257  14346  4037  9392  4764  4031 
ms_pacman  763  1204  1053  818  838  517 
pong  5  19  20  20  19  21 
private_eye  58  98  100  51  100  1128 
qbert  560  1153  672  697  581  733 
road_runner  5169.4  9600  5426  6965  3914  1319 
seaquest  371  354  387  345  350  287 
up_n_down  2153  2877  5123  2197  2302  2179 
Average (vs. Rainbow)  100%  118%  96%  90%  71% 
Table 1 shows proof of concept of the online learning of lowrank policies, with a loss of final performance varying in proportion to the compression in the lowrank linear layers, very much like in the deep learning literature [22, 23]. The number of coefficients in the original dataefficient Rainbow is of the order of magnitude of 1M and varies depending on the environment and its actionspace size. The corresponding tensor regression layer ranks are in appendix, and chosen to target 400k, 200k and 100k coefficients respectively. While individual game results tend to decrease monotonously with increasing compression, we observe that they are noisy as per the nature of exploration in RL, and average scores reported correspond with the intuition that performance seems to decrease fast after a certain overparameterization threshold is crossed. To take this noisy character into account, we take care to be conservative and report the average of the final three episodes of the learned policy after 80000, 90000 and 100000 steps, respectively. Also, so as to not muddy the discussion and provide fair baselines, we do report on the NoisyNet [14] ablation of Rainbow (’Denoised’ columns), as the NoisyLinear layer doubles up the number of coefficients required and actually performs worse in our experiments. Interestingly, the approximation error in tensor factorization seems to play a role akin to promoting exploration noise.
We then proceed to assess the impact of secondorder optimization to our architecture by substituting ADAM optimization for KFAC, and introducing scattering, in table 2 (only a handful results being available with scattering, due to computational limitations). In spite of our conservative reporting, the efficiency boost from using a second order scheme more than makes up for lowrank approximation error with five times less coefficients than [38], suggesting that learning with a full order of magnitude less coefficients is well within reach of our techniques.
Game  KFAC+Denoised  KFAC+TRL5x  KFAC+TRL10x  Scattering 

alien  996  734  643  441 
amidar  163  101  98  84 
assault  501  491  496  434 
asterix  537  549  526  502 
bank_heist  100  73  57  29 
battle_zone  8622  15178  6156  4311 
boxing  0  4  1  9 
breakout  3  3  2  2 
chopper_command  692  611  1302  441 
crazy_climber  14242  12377  3546  740 
demon_attack  582  434  318  692 
freeway  26  26  24  19 
frostbite  1760  718  1483  654 
gopher  363  341  265  172 
hero  4188  6284  4206  4127 
jamesbond  263  327  217  48 
kangaroo  2085  613  588  391 
krull  2855  3441  3392  772 
kung_fu_master  8481  10738  7357  233 
ms_pacman  1137  920  867  613 
pong  19.3  19  19  20 
private_eye  56  100  100  0 
qbert  731  520  538  475 
road_runner  4516  8493  7224  1278 
seaquest  349  317  520  213 
up_n_down  2557  2291  2108  993 
Average (vs. Rainbow)  114%  109%  98%  56% 
4 Conclusion
We have demonstrated that in the lowdata regime, it is possible to leverage biologically plausible characterizations of experience data (namely lowrank properties and wavelet scattering separability) to exhibit architectures that learn policies with many times less weights than current baselines, in an online fashion. We do hope that this will lead to even further progress towards sample efficiency and speedy exploration methods. Further work will first focus on thorough evaluation and research of scattering architectures in order to achieve further gains, and second investigate additional, orthogonal biologicallyfriendly research directions such as promoting sparsity.
References
 [1] (1998) Natural gradient works efficiently in learning. Neural Computation 10, pp. 251–276. Cited by: §1, §2.4.
 [2] (201812) Kymatio: Scattering Transforms in Python. arXiv eprints. External Links: 1812.11214 Cited by: §3, Table 3.
 [3] (201805) Generative networks as inverse problems with Scattering transforms. arXiv eprints. External Links: 1805.06621 Cited by: §2.3.
 [4] (201707) A Distributional Perspective on Reinforcement Learning. arXiv eprints. External Links: 1707.06887 Cited by: §2.1.
 [5] (201606) ModelFree Episodic Control. arXiv eprints. External Links: 1606.04460 Cited by: §1.
 [6] (201606) OpenAI Gym. arXiv eprints. External Links: 1606.01540 Cited by: §3.
 [7] (201203) Invariant Scattering Convolution Networks. arXiv eprints. External Links: 1203.1513 Cited by: §2.3.
 [8] (201712) Tensor Regression Networks with various LowRank Tensor Approximations. arXiv eprints. External Links: 1712.09520 Cited by: §2.2.
 [9] (201807) Classification and Geometry of General Perceptual Manifolds. Physical Review X 8 (3), pp. 031003. External Links: 1710.06487, Document Cited by: §1.
 [10] (2016) Linear readout of object manifolds.. Physical review. E 93 6, pp. 060301. Cited by: §1.
 [11] (2009) Nonnegative matrix and tensor factorizations  applications to exploratory multiway data analysis and blind source separation. Wiley. Cited by: §1.
 [12] (201806) Playing Atari with Six Neurons. arXiv eprints. External Links: 1806.01363 Cited by: §1.
 [13] (2018) Solid harmonic wavelet scattering for predictions of molecule properties. The Journal of chemical physics 148 24, pp. 241732. Cited by: Table 3.
 [14] (201706) Noisy Networks for Exploration. arXiv eprints. External Links: 1706.10295 Cited by: §2.1, §3.
 [15] (201906) Weight Agnostic Neural Networks. arXiv eprints. External Links: 1906.04358 Cited by: §1.
 [16] (201709) Deep Reinforcement Learning that Matters. arXiv eprints. External Links: 1709.06560 Cited by: §3.
 [17] (201710) Rainbow: Combining Improvements in Deep Reinforcement Learning. arXiv eprints. External Links: 1710.02298 Cited by: §1, §2.1, 1st item.
 [18] (2001) A twolayer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research 41, pp. 2413–2423. Cited by: §1.
 [19] (1987) An evaluation of the twodimensional gabor filter model of simple receptive fields in cat striate cortex.. Journal of neurophysiology 58 6, pp. 1233–58. Cited by: §1.
 [20] (201903) ModelBased Reinforcement Learning for Atari. arXiv eprints. External Links: 1903.00374 Cited by: §1, 3rd item, Table 1.
 [21] (201412) Adam: A Method for Stochastic Optimization. arXiv eprints. External Links: 1412.6980 Cited by: 2nd item.
 [22] (201706) Tensor Contraction Layers for Parsimonious Deep Nets. arXiv eprints. External Links: 1706.00439 Cited by: §2.2, §3.
 [23] (201707) Tensor Regression Networks. arXiv eprints. External Links: 1707.08308 Cited by: §2.2, §2.2, 1st item, §3.
 [24] (201610) TensorLy: Tensor Learning in Python. arXiv eprints. External Links: 1610.09555 Cited by: §3.
 [25] (201101) Group Invariant Scattering. arXiv eprints. External Links: 1101.2286 Cited by: §1, §2.3.
 [26] (1998) A wavelet tour of signal processing. Academic Press. Cited by: §2.3, §2.3.
 [27] (2015) Optimizing neural networks with kroneckerfactored approximate curvature. ArXiv abs/1503.05671. Cited by: §1, §2.4, 2nd item.
 [28] (201312) Playing Atari with Deep Reinforcement Learning. arXiv eprints. External Links: 1312.5602 Cited by: §1, §2.1.
 [29] (1996) Emergence of simplecell receptive field properties by learning a sparse code for natural images. Nature 381, pp. 607–609. Cited by: §1.
 [30] (1997) Sparse coding with an overcomplete basis set: a strategy employed by v1?. Vision Research 37, pp. 3311–3325. Cited by: §1.
 [31] (201312) Generic Deep Networks with Wavelet Scattering. arXiv eprints. External Links: 1312.5940 Cited by: §2.3.
 [32] (201809) Scattering Networks for Hybrid Representation Learning. arXiv eprints. External Links: 1809.06367 Cited by: §2.3.
 [33] (2017) Automatic differentiation in pytorch. In OpenReview, Cited by: §3.
 [34] (201703) Neural Episodic Control. arXiv eprints. External Links: 1703.01988 Cited by: §1.
 [35] (201511) Prioritized Experience Replay. arXiv eprints. External Links: 1511.05952 Cited by: §2.1, Further results and learning curves.
 [36] (2018) Reinforcement learning: an introduction. Second edition, The MIT Press. External Links: Link Cited by: §1, §2.1.
 [37] (201509) Deep Reinforcement Learning with Double Qlearning. arXiv eprints. External Links: 1509.06461 Cited by: §2.1, Further results and learning curves.
 [38] (201906) When to use parametric models in reinforcement learning?. arXiv eprints. External Links: 1906.05243 Cited by: §1, 1st item, 3rd item, Table 1, §3, §3, Hyperparameters and Reproducibility, Table 4.
 [39] (201905) Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation. arXiv eprints. External Links: 1905.06750 Cited by: §1.
 [40] (201511) Dueling Network Architectures for Deep Reinforcement Learning. arXiv eprints. External Links: 1511.06581 Cited by: §2.1.
 [41] (201708) Scalable trustregion method for deep reinforcement learning using Kroneckerfactored approximation. arXiv eprints. External Links: 1708.05144 Cited by: §2.4.
 [42] (201611) Understanding deep learning requires rethinking generalization. arXiv eprints. External Links: 1611.03530 Cited by: §1.
Appendix
Hyperparameters and Reproducibility
Our codebase is available on request. Hyperparameters are as follows. First, our specific architecturemodified hyperparameters:
Specific architecture hyperparameters  Value  
Scattering maximum logscale $J$  3  
Scattering volume width $M$  1  
Scattering tensor input shape  (1,4,84,84)  
Scattering tensor output shape  (1,16,11,11)  
Scattering type  Harmonic 3D, see [2, 13]  
Hidden linear layer rank constraint, 2.5x compression  128  
Final linear layer rank constraint, 2.5x compression  48  
Hidden linear layer rank constraint, 5x compression  32  
Final linear layer rank constraint, 5x compression  48  
Hidden linear layer rank constraint, 10x compression  16  
Final linear layer rank constraint, 10x compression  10  
KFAC Tikhonov regularization parameter  0.1  
KFAC Update frequency for inverses  100 
Furthermore, we mirror the DataEfficient Rainbow [38] baseline:
Dataefficient Rainbow hyperparameters  Value  

Greyscaling  True  
Observation downsampling  (84, 84)  
Frames stacked  4  
Action repetitions  4  
Reward clipping  [1, 1]  
Terminal on loss of life  True  
Max frames per episode  108K  
Update  Distributional Double Q  
Target network update period${}^{*}$  every 2000 updates  
Support of Qdistribution  51 bins  
Discount factor  0.99  
Minibatch size  32  
Optimizer  Adam  
Optimizer: first moment decay  0.9  
Optimizer: second moment decay  0.999  
Optimizer: $\u03f5$  $0.00015$  
Max gradient norm  10  
Priority exponent  0.5  
Priority correction${}^{**}$  0.4 $\to $ 1  
Hardware  NVidia 1080Ti GPU  
Noisy nets parameter  0.1  
Training frames  400,000  
Min replay size for sampling  1600  
Memory size  unbounded  
Replay period every  1 steps  
Multistep return length  20  
Q network: channels  32, 64  
Q network: filter size  5 x 5, 5 x 5  
Q network: stride  5, 5  
Q network: hidden units  256  
Optimizer: learning rate  0.0001 
Standard deviations for score runs
Game  Denoised  TRL 2.5x  TRL 10x 

alien  684 $\pm $ 7  688 $\pm $ 123  566$\pm $ 38 
amidar  154 $\pm $ 21  118 $\pm $ 12  84$\pm $ 15 
assault  321 $\pm $ 224  543 $\pm $ 94  513$\pm $ 64 
asterix  500 $\pm $ 124  459 $\pm $ 91  363 $\pm $ 66 
bank_heist  77 $\pm $ 23  59 $\pm $ 22  42 $\pm $ 2 
battle_zone  9378 $\pm $ 2042  14466 $\pm $ 2845  5744 $\pm $ 575 
boxing  1 $\pm $ 2  2 $\pm $ 1  5 $\pm $ 1 
breakout  3 $\pm $ 1.5  2 $\pm $ 1  4 $\pm $ 0.3 
chopper_command  1293 $\pm $ 445  1255 $\pm $ 215  1106 $\pm $ 124 
crazy_climber  9977 $\pm $ 3744  3928 $\pm $ 221  2340 $\pm $ 595 
demon_attack  450 $\pm $ 49  362 $\pm $ 147  175 $\pm $ 7 
freeway  28 $\pm $ 0.6  26 $\pm $ 0  24 $\pm $ 0.5 
frostbite  1101 $\pm $ 355  659 $\pm $ 523  231 $\pm $ 1 
gopher  391 $\pm $ 46  278 $\pm $ 39  396 $\pm $ 24 
hero  3013 $\pm $ 90  5351 $\pm $ 1948  3321 $\pm $ 598 
jamesbond  295 $\pm $ 57  215 $\pm $ 42  218 $\pm $ 22 
kangaroo  1002 $\pm $ 587  804 $\pm $ 289  400 $\pm $ 278 
krull  2656 $\pm $ 180  2333 $\pm $ 309  2308 $\pm $ 268 
kung_fu_master  4037 $\pm $ 2962  9392 $\pm $ 6289  4031 $\pm $ 3068 
ms_pacman  1053 $\pm $ 193  818 $\pm $ 94  517 $\pm $ 38 
pong  20 $\pm $ 0.4  20 $\pm $ 0  21 $\pm $ 0.1 
private_eye  100 $\pm $ 0  51 $\pm $ 59  1128 $\pm $ 1067 
qbert  672 $\pm $ 144  697 $\pm $ 78  733 $\pm $ 291 
road_runner  5426 $\pm $ 2830  6965 $\pm $ 6569  1319$\pm $ 216 
seaquest  387 $\pm $ 24  345 $\pm $ 40  287 $\pm $ 87 
up_n_down  5123 $\pm $ 3146  2197 $\pm $ 231  2179 $\pm $ 178 
Game  KFAC+Denoised  KFAC+TRL10x  Scattering 

alien  996 $\pm $ 180  643 $\pm $ 51  441 $\pm $ 90 
amidar  163 $\pm $ 15  98 $\pm $ 26  84 $\pm $ 11 
assault  501 $\pm $ 85  496 $\pm $ 129  434 $\pm $ 304 
asterix  537 $\pm $ 96  526 $\pm $ 64  502 $\pm $ 91 
bank_heist  100 $\pm $ 14  57 $\pm $ 36  29 $\pm $ 13 
battle_zone  8622 $\pm $ 5358  6156 $\pm $ 1951  4311 $\pm $ 1517 
boxing  0 $\pm $ 2  1 $\pm $ 3  9 $\pm $ 12 
breakout  3 $\pm $ 1  2 $\pm $ 2  2 $\pm $ 0 
chopper_command  692 $\pm $ 81  1302 $\pm $ 328  441 $\pm $ 80 
crazy_climber  14242 $\pm $ 2936  3546 $\pm $ 1231  740 $\pm $ 291 
demon_attack  582 $\pm $ 130  318 $\pm $ 168  692 $\pm $ 232 
freeway  26 $\pm $ 0  24 $\pm $ 0  19 $\pm $ 1 
frostbite  1760 $\pm $ 448  1483 $\pm $ 466  654 $\pm $ 709 
gopher  363 $\pm $ 4  265 $\pm $ 67  172 $\pm $ 3 
hero  4188 $\pm $ 1635  4206 $\pm $ 1862  4127 $\pm $ 1074 
jamesbond  263 $\pm $ 22  217 $\pm $ 68  48 $\pm $ 10 
kangaroo  2085 $\pm $ 2055  588 $\pm $ 5  391 $\pm $ 52 
krull  2855 $\pm $ 156  3392 $\pm $ 2205  772$\pm $ 560 
kung_fu_master  8481 $\pm $ 8270  7357 $\pm $ 9200  233$\pm $ 205 
ms_pacman  1137 $\pm $ 180  867 $\pm $ 128  613 $\pm $ 159 
pong  19 $\pm $ 0.6  19 $\pm $ 1  20 $\pm $ 0 
private_eye  56 $\pm $ 42  100 $\pm $ 0  0 $\pm $ 0 
qbert  731 $\pm $ 256  538 $\pm $ 114  475 $\pm $ 161 
road_runner  4516 $\pm $ 2869  7224 $\pm $ 4598  1278 $\pm $463 
seaquest  349 $\pm $ 63  520 $\pm $ 97  213 $\pm $ 96 
up_n_down  2557 $\pm $ 641  2108 $\pm $ 298  993 $\pm $ 244 
Further results and learning curves
Learning curves As a simpler version of the experiments in the main text body, we show basic proof of concept on the simple Pong Atari game. Our experimental setup consists in using our own implementation of prioritized double DQN as a baseline, which combines algorithmic advances from [35] and [37]. We replaced the densely connected layer of the original DQN architecture with a tensor regression layer implementing Tucker decomposition for different Tucker ranks, yielding different network compression factors. (These curves average three different random seeds).
Qualitative behaviour. First results, both in terms of learning performance and compression factor, can be seen in figure 1. The two main findings of this experiment are that first, and overall, the final performance of the agent remains unaffected by the tensor factorization, even with high compression rates nearing 10 times. Second, this obviously comes at the expense of stability during training  in tough compression regimes, learning curves are slightly delayed, and their plateauing phases contain occasional noisy drawdowns illustrating the increased difficulty of learning, as seen in figure 2. The extra pathwise noise, however, can be seen as promoting exploration.