Biologically inspired architectures for sample-efficient deep reinforcement learning

  • 2019-11-25 23:59:22
  • Pierre H. Richemond, Arinbjörn Kolbeinsson, Yike Guo
  • 5

Abstract

Deep reinforcement learning requires a heavy price in terms of sampleefficiency and overparameterization in the neural networks used for functionapproximation. In this work, we use tensor factorization in order to learn morecompact representation for reinforcement learning policies. We show empiricallythat in the low-data regime, it is possible to learn online policies with 2 to10 times less total coefficients, with little to no loss of performance. Wealso leverage progress in second order optimization, and use the theory ofwavelet scattering to further reduce the number of learned coefficients, byforegoing learning the topmost convolutional layer filters altogether. Weevaluate our results on the Atari suite against recent baseline algorithms thatrepresent the state-of-the-art in data efficiency, and get comparable resultswith an order of magnitude gain in weight parsimony.

 

Quick Read (beta)

Biologically inspired architectures for sample-efficient deep reinforcement learning

Pierre H. Richemond
Imperial College London
[email protected]
&Arinbjörn Kolbeinsson
Imperial College London
[email protected]
&Yike Guo
Imperial College London
[email protected]
Abstract

Deep reinforcement learning requires a heavy price in terms of sample efficiency and overparameterization in the neural networks used for function approximation. In this work, we use tensor factorization in order to learn more compact representation for reinforcement learning policies. We show empirically that in the low-data regime, it is possible to learn online policies with 2 to 10 times less total coefficients, with little to no loss of performance. We also leverage progress in second order optimization, and use the theory of wavelet scattering to further reduce the number of learned coefficients, by foregoing learning the topmost convolutional layer filters altogether. We evaluate our results on the Atari suite against recent baseline algorithms that represent the state-of-the-art in data efficiency, and get comparable results with an order of magnitude gain in weight parsimony.

 

Biologically inspired architectures for sample-efficient deep reinforcement learning


  Pierre H. Richemond Imperial College London [email protected] Arinbjörn Kolbeinsson Imperial College London [email protected] Yike Guo Imperial College London [email protected]

\@float

noticebox[b]Deep Reinforcement Learning Workshop, NeurIPS 2019, Vancouver, Canada. \[email protected]

1 Introduction & Related Work

The successes of deep reinforcement learning (thereafter ’RL’) come at a heavy computational price. It is well known that achieving human-level performance in domains such as Atari [36, 28, 17] requires hundreds of millions of frames of environment interaction. As such, the problem of sample efficiency in reinforcement learning is of critical importance. Several tracks of concurrent research are being investigated, and have reduced by orders of magnitude the number of environment interactions required for good performance beyond the previous benchmark of biologically-inspired episodic control methods [5, 34] to a couple hours of human gameplay time [38, 20].

However, while the data-efficiency of RL methods has seen recent drastic performance, their function approximators still use millions of learned weights, potentially still leaving them heavily overparameterized. Independently motivated by biological facts like the behavioural readiness of newborn animals, several authors [15, 12, 39] have recently looked at doing away with learning so many weights for RL tasks. Smaller networks not only train faster, but may yet offer another avenue for gains in the form of better generalization [42]. Very recent work from [15] studies the effect of inductive bias of neural architectures in reinforcement learning ; they forego training altogether, but transfer networks that only obtain ’better than chance performance on MNIST’. In similar fashion, [39] investigate the effect of random projections in the restricted setting of imitation learning. Finally, [12] manage human-level performance on the Atari suite using a separate dictionary learning procedure for their features, bypassing the usual end-to-end learning paradigm. The perspective of neural architecture search applied to RL appears difficult, if not computationally inextricable.

Concurrently, the study of biologically-inspired models of learning has exhibited two mathematical characterizations that might be critical in explaining how biological learning takes place so efficiently. First, the low-rank properties of learned perceptual manifolds [9, 10] are giving rise to a rich theory borrowing from statistical physics. Second, another well known line of work has identified Gabor filters (and more generally wavelet filter-like structures) in the actual visual cortex of animals [19], and linked those to sparsity-promoting methods and dictionary learning [29, 30, 18]. But these breakthroughs have not, so far, been reflected as inductive priors in the shape of modifications in deep RL neural networks architectures, which remain fairly fixed on the Atari domain.

Therefore the following questions remain: how parsimonious do function approximators in reinforcement learning need to be, in order to maintain good performance? And can we be at once sample-efficient and weight-efficient ? In this work, we turn to the mathematical theories of tensor factorization [11], second-order optimization [1, 27] and wavelet scattering [25] to answer this question positively and empirically, in a model-free setting. To the best of our knowledge, this is the first time those fields have been combined together in this context, and that tensor factorization is applied to deep RL.

2 Background

2.1 Deep Reinforcement Learning

We consider the standard Markov Decision Process framework as in [36]). This setting is characterised by a tuple S,A,T,R,γ, where S is a set of states, A a set of actions, R a reward function that is the immediate, intrinsic desirability of a certain state, T a transition dynamics and γ [0,1] a discount factor. The purpose of the RL problem is to to find a policy π, which represents a mapping from states to a probability distribution over actions, that is optimal, i.e., that maximizes the expected cumulative discounted return k=0γkRt+k+1 at each state stS. In Q-learning, the policy is given implicitly by acting greedily or ϵ-greedily with respect to learned action-value functions qπ(s,a), that are learned following the Bellman equation. In deep Q-learning, qθ becomes parameterized by the weights θ of a neural network and one minimizes the expected Bellman loss :

𝔼(Rt+1+γt+1maxaqθ(St+1,a)-qθ(St,At))2

In practice, this is implemented stochastically via uniform sampling of transitions in an experience replay buffer, as is done in the seminal paper [28]. Several algorithmic refinements to that approach exist. First, Double Q-learning [37] proposes to decouple learning between two networks in order to alleviate the Q-value overestimation problem. Second, dueling Q-networks [40] explicitly decompose the learning of an action-value function qθ(s,a) as the sum of an action-independent state-value, much like what is traditionally done in policy gradient methods [36], implemented via a two-headed neural network architecture. Finally, prioritized RL [35] proposes to replace the uniform sampling of transitions in the experience replay buffer with importance sampling, by prioritizing those transitions that present the most Bellman error (those transitions that are deemed the most ’surprising’ by the agent). [14] uses extra weights to learn the variance of the exploration noise in a granular fashion, while [4] proposes to learn a full distribution of action-values for each action and state. Combined, those methods form the basis of the Rainbow algorithm in [17].

2.2 Tensor factorization

Here we introduce notations and concepts from the tensor factorization literature. An intuition is that the two main decompositions below, CP and Tucker decompositions, can be understood as multilinear algebra analogues of SVD or eigendecomposition.

CP decomposition. A tensor 𝒳I1×I2××IN, can be decomposed into a sum of R rank-1 tensors, known as the Canonical-Polyadic decomposition, where R is as the rank of the decomposition. The objective is to find the vectors 𝐮k(1),𝐮k(2),,𝐮k(N), for k=[1R], as well as a vector of weights 𝝀R such that:

𝒳=k=1Rλk𝐮k(1)𝐮k(2)𝐮k(N) rank-1  components 

Tucker decomposition. A tensor 𝒳I1×I2××IN, can be decomposed into a low rank approximation consisting of a core 𝒢R1×R2××RN and a set of projection factors (𝐔(0),,𝐔(N-1)), with 𝐔(k)Rk,I^k,k(0,,N-1) that, when projected along the corresponding dimension of the core, reconstruct the full tensor 𝒳. The tensor in its decomposed form can then be written:

𝒳 =𝒢×1𝐔(1)×2𝐔(2)××N𝐔(N)=[𝒢;𝐔(1),,𝐔(N)]

Tensor regression layer. For two tensors 𝒳K1××Kx×I1××IN and 𝒴I1××IN×L1××Ly, we denote by 𝒳,𝒴NK1××Kx×L1××Ly the contraction of 𝒳 by 𝒴 along their N last modes; their generalized inner product is

𝒳,𝒴N=i1=1I1i2=1I2in=1IN𝒳,i1,i2,,in𝒴i1,i2,,in,

This enables us to define a tensor regression layer [23] that is differentiable and learnable end-to-end by gradient descent. Let us denote by 𝒳I1×I2××IN the input activation tensor for a sample and 𝐲IN the label vector. A tensor regression layer estimates the regression weight tensor 𝒲I1×I2××IN under a low-rank decomposition. In the case of a Tucker decomposition (as per our experiments) with ranks (R1,,RN), we have :

𝐲=𝒳,𝒲N+𝐛   with 𝒲=𝒢×1𝐔(1)×2𝐔(2)×N𝐔(N)

as 𝒢R1××RN, 𝐔(k)Ik×Rk for each k in [1N] and 𝐔(N)1×RN.

[22, 23, 8] learn parsimonious deep learning fully-connected layers thanks to low-rank constraints.

2.3 Wavelet scattering

The wavelet scattering transform was originally introduced by [25] and [7] as a non-linear extension to the classical wavelet filter bank decomposition [26]. Its principle is as follows. Denoting by xy[n] the 2-dimensional, circular convolution of two signals x[n] and y[n], let us assume that we have pre-defined two wavelet filter banks available {ψλ1(1)[n]}λ1Λ1 {ψλ2(2)[n]}λ2Λ2 , with λ1 and λ2 two frequency indices. These wavelet filters correspond to high frequencies, so we also give ourselves the data of a lowpass filter ϕJ[n]. Finally, and by opposition to traditional linear wavelet transforms, we also assume a given nonlinearity ρ(t). Then the scattering transform is given by coefficients of order 0,1, and 2, respectively :

S0x[n]=xϕJ[n]
S1x[n,λ1]=ρ(xψλ1(1))ϕJ[n]λ1Λ1
S2x[n,λ1,λ2]=ρ(ρ(xψλ1(1))ψλ2(2))ϕJ[n]λ1Λ1,λ2Λ2(λ1)

This can effectively be understood and implemented as a two-layer convolutional neural network whose weights are not learned but rather frozen and given by the coefficients of wavelets ψ and ϕ (with Gabor filters as a special case [26]). The difference with traditional filter banks comes from the iterated modulus/nonlinear activation function applied at each stage, much like in traditional deep learning convolutional neural networks. In practice, the potential of scattering transforms to accelerate deep learning by providing ready-made convolutional layers weights has been investigated in [31, 32, 3].

2.4 Second order optimization with K-FAC

While stochastic gradient descent is usually performed purely from gradient observations derived from auto-differentiation, faster, second order optimization methods first multiply the weights’ θ gradient vector θ by a preconditioning matrix, yielding the weight update θθ-ηG-1θ. In the case of second order methods, the matrix G-1 is chosen to act as a tractable iterative approximation to the inverse Hessian or Empirical Fisher Information Matrix [1] of the neural network model in question. Kronecker-factored approximate curvature or K-FAC [27] enforces a Kronecker decomposition of the type G=AB, with A and B being smaller, architecture-dependent matrices. Unlike the above methods, K-FAC has been applied as a plug-in in the RL literature and been shown to promote convergence [41].

3 Methods & Experimental Results

We do take as a baseline method the data-efficient Rainbow of [38]. However, we change the architecture of the neural network function approximators used, in accordance with the principles described above, combining them to reflect inductive biases promoting fewer learnable parameters:

  • We replace the fully-connected, linear layers used in the Rainbow [17] and data-efficient Rainbow [38] by tensor regression layers [23] in order to learn low-rank policies (ranks in appendix).

  • We use either the K-FAC [27] second order stochastic optimizer, or ADAM [21].

  • We combine the two methods with various rank and therefore weight compression ratios and evaluate those on the same subset of Atari games as [38, 20].

  • When possible, we replace the first convolutional layer in the approximating neural network with a scattering layer for further gains in terms of learnable weights.

For all our Atari experiments, we used OpenAI Gym [6], and a combination of PyTorch [33], TensorLy [24] and Kymatio [2] for auto-differentiation. We evaluated our agents in the low-data regime of 100,000 steps, on half the games, with 3 different random seeds for reproducibility [16]. Our specific hyperparameters are described in appendix. We report our results in tables 1 and 2.

Game SimPLe Rainbow Denoised TRL 2.5x TRL 5x TRL 10x
alien 405 740 684 688 454 566
amidar 88 189 154 118 86 84
assault 369 431 321 543 521 513
asterix 1090 471 500 459 554 363
bank_heist 8 51 77 59 134 42
battle_zone 5184 10125 9378 14466 13466 5744
boxing 9 0.2 1 -2 -2 -5
breakout 13 2 3 2 2 4
chopper_command 1247 862 1293 1255 1243 1106
crazy_climber 39828 16185 9977 3928 4225 2340
demon_attack 170 508 450 362 263 175
freeway 20 28 28 26 25 24
frostbite 255 867 1101 659 912 231
gopher 771 349 391 278 255 396
hero 1295 6857 3013 5351 3732 3321
jamesbond 125 302 295 215 213 218
kangaroo 323 779 1002 804 715 400
krull 4540 2852 2656 2333 2275 2308
kung_fu_master 17257 14346 4037 9392 4764 4031
ms_pacman 763 1204 1053 818 838 517
pong 5 -19 -20 -20 -19 -21
private_eye 58 98 100 51 100 1128
qbert 560 1153 672 697 581 733
road_runner 5169.4 9600 5426 6965 3914 1319
seaquest 371 354 387 345 350 287
up_n_down 2153 2877 5123 2197 2302 2179
Average (vs. Rainbow) 100% 118% 96% 90% 71%
Table 1: Mean episode returns as reported in SimPLe [20] and data-efficient Rainbow [38], versus our agents, on 26 Atari games. ’Denoised’ is the NoisyNet ablation of Rainbow; ’TRL’ shows the performance of the data-efficient Rainbow with tensor regression layers substituted for linear ones.

Table 1 shows proof of concept of the online learning of low-rank policies, with a loss of final performance varying in proportion to the compression in the low-rank linear layers, very much like in the deep learning literature [22, 23]. The number of coefficients in the original data-efficient Rainbow is of the order of magnitude of 1M and varies depending on the environment and its action-space size. The corresponding tensor regression layer ranks are in appendix, and chosen to target 400k, 200k and 100k coefficients respectively. While individual game results tend to decrease monotonously with increasing compression, we observe that they are noisy as per the nature of exploration in RL, and average scores reported correspond with the intuition that performance seems to decrease fast after a certain overparameterization threshold is crossed. To take this noisy character into account, we take care to be conservative and report the average of the final three episodes of the learned policy after 80000, 90000 and 100000 steps, respectively. Also, so as to not muddy the discussion and provide fair baselines, we do report on the NoisyNet [14] ablation of Rainbow (’Denoised’ columns), as the NoisyLinear layer doubles up the number of coefficients required and actually performs worse in our experiments. Interestingly, the approximation error in tensor factorization seems to play a role akin to promoting exploration noise.

We then proceed to assess the impact of second-order optimization to our architecture by substituting ADAM optimization for K-FAC, and introducing scattering, in table 2 (only a handful results being available with scattering, due to computational limitations). In spite of our conservative reporting, the efficiency boost from using a second order scheme more than makes up for low-rank approximation error with five times less coefficients than [38], suggesting that learning with a full order of magnitude less coefficients is well within reach of our techniques.

Game KFAC+Denoised KFAC+TRL5x KFAC+TRL10x Scattering
alien 996 734 643 441
amidar 163 101 98 84
assault 501 491 496 434
asterix 537 549 526 502
bank_heist 100 73 57 29
battle_zone 8622 15178 6156 4311
boxing 0 -4 -1 -9
breakout 3 3 2 2
chopper_command 692 611 1302 441
crazy_climber 14242 12377 3546 740
demon_attack 582 434 318 692
freeway 26 26 24 19
frostbite 1760 718 1483 654
gopher 363 341 265 172
hero 4188 6284 4206 4127
jamesbond 263 327 217 48
kangaroo 2085 613 588 391
krull 2855 3441 3392 772
kung_fu_master 8481 10738 7357 233
ms_pacman 1137 920 867 613
pong -19.3 -19 -19 -20
private_eye 56 100 100 0
qbert 731 520 538 475
road_runner 4516 8493 7224 1278
seaquest 349 317 520 213
up_n_down 2557 2291 2108 993
Average (vs. Rainbow) 114% 109% 98% 56%
Table 2: Mean episode returns of our low-rank agents with second-order optimization and scattering.

4 Conclusion

We have demonstrated that in the low-data regime, it is possible to leverage biologically plausible characterizations of experience data (namely low-rank properties and wavelet scattering separability) to exhibit architectures that learn policies with many times less weights than current baselines, in an online fashion. We do hope that this will lead to even further progress towards sample efficiency and speedy exploration methods. Further work will first focus on thorough evaluation and research of scattering architectures in order to achieve further gains, and second investigate additional, orthogonal biologically-friendly research directions such as promoting sparsity.

References

  • [1] S. Amari (1998) Natural gradient works efficiently in learning. Neural Computation 10, pp. 251–276. Cited by: §1, §2.4.
  • [2] M. Andreux, T. Angles, G. Exarchakis, R. Leonarduzzi, G. Rochette, L. Thiry, J. Zarka, S. Mallat, J. andén, E. Belilovsky, J. Bruna, V. Lostanlen, M. J. Hirn, E. Oyallon, S. Zhang, C. Cella, and M. Eickenberg (2018-12) Kymatio: Scattering Transforms in Python. arXiv e-prints. External Links: 1812.11214 Cited by: §3, Table 3.
  • [3] T. Angles and S. Mallat (2018-05) Generative networks as inverse problems with Scattering transforms. arXiv e-prints. External Links: 1805.06621 Cited by: §2.3.
  • [4] M. G. Bellemare, W. Dabney, and R. Munos (2017-07) A Distributional Perspective on Reinforcement Learning. arXiv e-prints. External Links: 1707.06887 Cited by: §2.1.
  • [5] C. Blundell, B. Uria, A. Pritzel, Y. Li, A. Ruderman, J. Z. Leibo, J. Rae, D. Wierstra, and D. Hassabis (2016-06) Model-Free Episodic Control. arXiv e-prints. External Links: 1606.04460 Cited by: §1.
  • [6] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016-06) OpenAI Gym. arXiv e-prints. External Links: 1606.01540 Cited by: §3.
  • [7] J. Bruna and S. Mallat (2012-03) Invariant Scattering Convolution Networks. arXiv e-prints. External Links: 1203.1513 Cited by: §2.3.
  • [8] X. Cao and G. Rabusseau (2017-12) Tensor Regression Networks with various Low-Rank Tensor Approximations. arXiv e-prints. External Links: 1712.09520 Cited by: §2.2.
  • [9] S. Chung, D. D. Lee, and H. Sompolinsky (2018-07) Classification and Geometry of General Perceptual Manifolds. Physical Review X 8 (3), pp. 031003. External Links: 1710.06487, Document Cited by: §1.
  • [10] S. Chung, D. D. Lee, and H. Sompolinsky (2016) Linear readout of object manifolds.. Physical review. E 93 6, pp. 060301. Cited by: §1.
  • [11] A. Cichocki, R. Zdunek, A. H. Phan, and Sh. Amari (2009) Nonnegative matrix and tensor factorizations - applications to exploratory multi-way data analysis and blind source separation. Wiley. Cited by: §1.
  • [12] G. Cuccu, J. Togelius, and P. Cudre-Mauroux (2018-06) Playing Atari with Six Neurons. arXiv e-prints. External Links: 1806.01363 Cited by: §1.
  • [13] M. Eickenberg, G. Exarchakis, M. J. Hirn, S. Mallat, and L. Thiry (2018) Solid harmonic wavelet scattering for predictions of molecule properties. The Journal of chemical physics 148 24, pp. 241732. Cited by: Table 3.
  • [14] M. Fortunato, M. Gheshlaghi Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg (2017-06) Noisy Networks for Exploration. arXiv e-prints. External Links: 1706.10295 Cited by: §2.1, §3.
  • [15] A. Gaier and D. Ha (2019-06) Weight Agnostic Neural Networks. arXiv e-prints. External Links: 1906.04358 Cited by: §1.
  • [16] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2017-09) Deep Reinforcement Learning that Matters. arXiv e-prints. External Links: 1709.06560 Cited by: §3.
  • [17] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2017-10) Rainbow: Combining Improvements in Deep Reinforcement Learning. arXiv e-prints. External Links: 1710.02298 Cited by: §1, §2.1, 1st item.
  • [18] A. Hyvärinen and P. O. Hoyer (2001) A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research 41, pp. 2413–2423. Cited by: §1.
  • [19] J. P. Jones and L. A. Palmer (1987) An evaluation of the two-dimensional gabor filter model of simple receptive fields in cat striate cortex.. Journal of neurophysiology 58 6, pp. 1233–58. Cited by: §1.
  • [20] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, A. Mohiuddin, R. Sepassi, G. Tucker, and H. Michalewski (2019-03) Model-Based Reinforcement Learning for Atari. arXiv e-prints. External Links: 1903.00374 Cited by: §1, 3rd item, Table 1.
  • [21] D. P. Kingma and J. Ba (2014-12) Adam: A Method for Stochastic Optimization. arXiv e-prints. External Links: 1412.6980 Cited by: 2nd item.
  • [22] J. Kossaifi, A. Khanna, Z. C. Lipton, T. Furlanello, and A. Anandkumar (2017-06) Tensor Contraction Layers for Parsimonious Deep Nets. arXiv e-prints. External Links: 1706.00439 Cited by: §2.2, §3.
  • [23] J. Kossaifi, Z. C. Lipton, A. Khanna, T. Furlanello, and A. Anandkumar (2017-07) Tensor Regression Networks. arXiv e-prints. External Links: 1707.08308 Cited by: §2.2, §2.2, 1st item, §3.
  • [24] J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic (2016-10) TensorLy: Tensor Learning in Python. arXiv e-prints. External Links: 1610.09555 Cited by: §3.
  • [25] S. Mallat (2011-01) Group Invariant Scattering. arXiv e-prints. External Links: 1101.2286 Cited by: §1, §2.3.
  • [26] S. Mallat (1998) A wavelet tour of signal processing. Academic Press. Cited by: §2.3, §2.3.
  • [27] J. Martens and R. B. Grosse (2015) Optimizing neural networks with kronecker-factored approximate curvature. ArXiv abs/1503.05671. Cited by: §1, §2.4, 2nd item.
  • [28] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013-12) Playing Atari with Deep Reinforcement Learning. arXiv e-prints. External Links: 1312.5602 Cited by: §1, §2.1.
  • [29] B. A. Olshausen and D. J. Field (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, pp. 607–609. Cited by: §1.
  • [30] B. A. Olshausen and D. J. Field (1997) Sparse coding with an overcomplete basis set: a strategy employed by v1?. Vision Research 37, pp. 3311–3325. Cited by: §1.
  • [31] E. Oyallon, S. Mallat, and L. Sifre (2013-12) Generic Deep Networks with Wavelet Scattering. arXiv e-prints. External Links: 1312.5940 Cited by: §2.3.
  • [32] E. Oyallon, S. Zagoruyko, G. Huang, N. Komodakis, S. Lacoste-Julien, M. Blaschko, and E. Belilovsky (2018-09) Scattering Networks for Hybrid Representation Learning. arXiv e-prints. External Links: 1809.06367 Cited by: §2.3.
  • [33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In OpenReview, Cited by: §3.
  • [34] A. Pritzel, B. Uria, S. Srinivasan, A. Puigdomènech, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell (2017-03) Neural Episodic Control. arXiv e-prints. External Links: 1703.01988 Cited by: §1.
  • [35] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2015-11) Prioritized Experience Replay. arXiv e-prints. External Links: 1511.05952 Cited by: §2.1, Further results and learning curves.
  • [36] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. Second edition, The MIT Press. External Links: Link Cited by: §1, §2.1.
  • [37] H. van Hasselt, A. Guez, and D. Silver (2015-09) Deep Reinforcement Learning with Double Q-learning. arXiv e-prints. External Links: 1509.06461 Cited by: §2.1, Further results and learning curves.
  • [38] H. van Hasselt, M. Hessel, and J. Aslanides (2019-06) When to use parametric models in reinforcement learning?. arXiv e-prints. External Links: 1906.05243 Cited by: §1, 1st item, 3rd item, Table 1, §3, §3, Hyperparameters and Reproducibility, Table 4.
  • [39] R. Wang, C. Ciliberto, P. Amadori, and Y. Demiris (2019-05) Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation. arXiv e-prints. External Links: 1905.06750 Cited by: §1.
  • [40] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas (2015-11) Dueling Network Architectures for Deep Reinforcement Learning. arXiv e-prints. External Links: 1511.06581 Cited by: §2.1.
  • [41] Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba (2017-08) Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. arXiv e-prints. External Links: 1708.05144 Cited by: §2.4.
  • [42] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016-11) Understanding deep learning requires rethinking generalization. arXiv e-prints. External Links: 1611.03530 Cited by: §1.

Appendix

Hyperparameters and Reproducibility

Our codebase is available on request. Hyperparameters are as follows. First, our specific architecture-modified hyperparameters:

Specific architecture hyperparameters Value
Scattering maximum log-scale J 3
Scattering volume width M 1
Scattering tensor input shape (1,4,84,84)
Scattering tensor output shape (1,16,11,11)
Scattering type Harmonic 3D, see [2, 13]
Hidden linear layer rank constraint, 2.5x compression 128
Final linear layer rank constraint, 2.5x compression 48
Hidden linear layer rank constraint, 5x compression 32
Final linear layer rank constraint, 5x compression 48
Hidden linear layer rank constraint, 10x compression 16
Final linear layer rank constraint, 10x compression 10
KFAC Tikhonov regularization parameter 0.1
KFAC Update frequency for inverses 100
Table 3: Our additional, architecture-specific hyperparameters.

Furthermore, we mirror the Data-Efficient Rainbow [38] baseline:

Data-efficient Rainbow hyperparameters Value
Grey-scaling True
Observation down-sampling (84, 84)
Frames stacked 4
Action repetitions 4
Reward clipping [-1, 1]
Terminal on loss of life True
Max frames per episode 108K
Update Distributional Double Q
Target network update period* every 2000 updates
Support of Q-distribution 51 bins
Discount factor 0.99
Minibatch size 32
Optimizer Adam
Optimizer: first moment decay 0.9
Optimizer: second moment decay 0.999
Optimizer: ϵ 0.00015
Max gradient norm 10
Priority exponent 0.5
Priority correction** 0.4 1
Hardware NVidia 1080Ti GPU
Noisy nets parameter 0.1
Training frames 400,000
Min replay size for sampling 1600
Memory size unbounded
Replay period every 1 steps
Multi-step return length 20
Q network: channels 32, 64
Q network: filter size 5 x 5, 5 x 5
Q network: stride 5, 5
Q network: hidden units 256
Optimizer: learning rate 0.0001
Table 4: Data-efficient Rainbow agent hyperparameters, as per [38].

Standard deviations for score runs

Game Denoised TRL 2.5x TRL 10x
alien 684 ± 7 688 ± 123 566± 38
amidar 154 ± 21 118 ± 12 84± 15
assault 321 ± 224 543 ± 94 513± 64
asterix 500 ± 124 459 ± 91 363 ± 66
bank_heist 77 ± 23 59 ± 22 42 ± 2
battle_zone 9378 ± 2042 14466 ± 2845 5744 ± 575
boxing 1 ± 2 -2 ± 1 -5 ± 1
breakout 3 ± 1.5 2 ± 1 4 ± 0.3
chopper_command 1293 ± 445 1255 ± 215 1106 ± 124
crazy_climber 9977 ± 3744 3928 ± 221 2340 ± 595
demon_attack 450 ± 49 362 ± 147 175 ± 7
freeway 28 ± 0.6 26 ± 0 24 ± 0.5
frostbite 1101 ± 355 659 ± 523 231 ± 1
gopher 391 ± 46 278 ± 39 396 ± 24
hero 3013 ± 90 5351 ± 1948 3321 ± 598
jamesbond 295 ± 57 215 ± 42 218 ± 22
kangaroo 1002 ± 587 804 ± 289 400 ± 278
krull 2656 ± 180 2333 ± 309 2308 ± 268
kung_fu_master 4037 ± 2962 9392 ± 6289 4031 ± 3068
ms_pacman 1053 ± 193 818 ± 94 517 ± 38
pong -20 ± 0.4 -20 ± 0 -21 ± 0.1
private_eye 100 ± 0 51 ± 59 1128 ± 1067
qbert 672 ± 144 697 ± 78 733 ± 291
road_runner 5426 ± 2830 6965 ± 6569 1319± 216
seaquest 387 ± 24 345 ± 40 287 ± 87
up_n_down 5123 ± 3146 2197 ± 231 2179 ± 178
Table 5: Standard deviations across seeds for runs presented Table 1.
Game KFAC+Denoised KFAC+TRL10x Scattering
alien 996 ± 180 643 ± 51 441 ± 90
amidar 163 ± 15 98 ± 26 84 ± 11
assault 501 ± 85 496 ± 129 434 ± 304
asterix 537 ± 96 526 ± 64 502 ± 91
bank_heist 100 ± 14 57 ± 36 29 ± 13
battle_zone 8622 ± 5358 6156 ± 1951 4311 ± 1517
boxing 0 ± 2 -1 ± 3 -9 ± 12
breakout 3 ± 1 2 ± 2 2 ± 0
chopper_command 692 ± 81 1302 ± 328 441 ± 80
crazy_climber 14242 ± 2936 3546 ± 1231 740 ± 291
demon_attack 582 ± 130 318 ± 168 692 ± 232
freeway 26 ± 0 24 ± 0 19 ± 1
frostbite 1760 ± 448 1483 ± 466 654 ± 709
gopher 363 ± 4 265 ± 67 172 ± 3
hero 4188 ± 1635 4206 ± 1862 4127 ± 1074
jamesbond 263 ± 22 217 ± 68 48 ± 10
kangaroo 2085 ± 2055 588 ± 5 391 ± 52
krull 2855 ± 156 3392 ± 2205 772± 560
kung_fu_master 8481 ± 8270 7357 ± 9200 233± 205
ms_pacman 1137 ± 180 867 ± 128 613 ± 159
pong -19 ± 0.6 -19 ± 1 -20 ± 0
private_eye 56 ± 42 100 ± 0 0 ± 0
qbert 731 ± 256 538 ± 114 475 ± 161
road_runner 4516 ± 2869 7224 ± 4598 1278 ±463
seaquest 349 ± 63 520 ± 97 213 ± 96
up_n_down 2557 ± 641 2108 ± 298 993 ± 244
Table 6: Standard deviations across seeds for runs presented Table 2.

Further results and learning curves

Learning curves As a simpler version of the experiments in the main text body, we show basic proof of concept on the simple Pong Atari game. Our experimental setup consists in using our own implementation of prioritized double DQN as a baseline, which combines algorithmic advances from [35] and [37]. We replaced the densely connected layer of the original DQN architecture with a tensor regression layer implementing Tucker decomposition for different Tucker ranks, yielding different network compression factors. (These curves average three different random seeds).

Qualitative behaviour. First results, both in terms of learning performance and compression factor, can be seen in figure 1. The two main findings of this experiment are that first, and overall, the final performance of the agent remains unaffected by the tensor factorization, even with high compression rates nearing 10 times. Second, this obviously comes at the expense of stability during training - in tough compression regimes, learning curves are slightly delayed, and their plateauing phases contain occasional noisy drawdowns illustrating the increased difficulty of learning, as seen in figure 2. The extra pathwise noise, however, can be seen as promoting exploration.

Figure 1: Prioritized tensorized DQN on Atari Pong. Original learning curve versus several learning curves for five different Tucker ranks factorizations and therefore parameter compression rates (3 different random seeds each, with a 30 episodes moving average for legibility). Best viewed in colour.
Figure 2: Focus on a typical single run of the tensorized DQN learning. The overall shape of the typical learning curve is preserved, but drawdowns in the plateauing phase do appear.