Inferring Dynamical Systems with Long-Range Dependencies through Line Attractor Regularization

  • 2019-10-08 15:41:50
  • Dominik Schmidt, Georgia Koppe, Max Beutelspacher, Daniel Durstewitz
  • 1

Abstract

Vanilla RNN with ReLU activation have a simple structure that is amenable tosystematic dynamical systems analysis and interpretation, but they suffer fromthe exploding vs. vanishing gradients problem. Recent attempts to retain thissimplicity while alleviating the gradient problem are based on properinitialization schemes or orthogonality/unitary constraints on the RNN'srecurrence matrix, which, however, comes with limitations to its expressivepower with regards to dynamical systems phenomena like chaos ormulti-stability. Here, we instead suggest a regularization scheme that pushespart of the RNN's latent subspace toward a line attractor configuration thatenables long short-term memory and arbitrarily slow time scales. We show thatour approach excels on a number of benchmarks like the sequential MNIST ormultiplication problems, and enables reconstruction of dynamical systems whichharbor widely different time scales.

 

Quick Read (beta)

Inferring dynamical systems with long-range dependencies through line attractor regularization

Dominik Schmidt
equal contribution
   Georgia Koppe11footnotemark: 1    Max Beutelspacher    Daniel Durstewitz
Abstract

Vanilla RNN with ReLU activation have a simple structure that is amenable to systematic dynamical systems analysis and interpretation, but they suffer from the exploding vs. vanishing gradients problem. Recent attempts to retain this simplicity while alleviating the gradient problem are based on proper initialization schemes or orthogonality/unitary constraints on the RNN’s recurrence matrix, which, however, comes with limitations to its expressive power with regards to dynamical systems phenomena like chaos or multi-stability. Here, we instead suggest a regularization scheme that pushes part of the RNN’s latent subspace toward a line attractor configuration that enables long short-term memory and arbitrarily slow time scales. We show that our approach excels on a number of benchmarks like the sequential MNIST or multiplication problems, and enables reconstruction of dynamical systems which harbor widely different time scales.

 

Inferring dynamical systems with long-range dependencies through line attractor regularization

 

Dominik Schmidt1*, Georgia Koppe1*, Max Beutelspacher1,22 2 Current affiliation: Fraunhofer Institute for Manufacturing Engineering and Automation IPA, Department for Robot and Assistive Systems, Group Industrial and commercial Service robots, Stuttgart, Germany, Daniel Durstewitz1,3
1
Department of Theoretical Neuroscience, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
3Faculty of Physics and Astronomy, Heidelberg University
*These authors contributed equally
contact: {dominik.schmidt,georgia.koppe,daniel.durstewitz}@zi-mannheim.de

1 Introduction

Theories of complex systems in biology and physics are often formulated in terms of sets of stochastic differential or difference equations, i.e. as stochastic dynamical systems (DS). A long-standing desire is to retrieve these generating dynamical equations directly from observed time series data (Kantz & Schreiber, 2004). A variety of machine and deep learning methodologies toward this goal have been introduced in recent years (Chen et al., 2017; Champion et al., 2019; Jordan et al., 2019; Duncker et al., 2019; Ayed et al., 2019; Durstewitz, 2017; Koppe et al., 2019), many of them based on recurrent neural networks (RNN) which can universally approximate any DS (i.e., its flow field) under some mild conditions (Funahashi & Nakamura, 1993; Kimura & Nakano, 1998). However, vanilla RNN as often used in this context are well known for their problems in capturing long-term dependencies and slow time scales in the data (Hochreiter & Schmidhuber, 1997; Bengio et al., 1994). In DS terms, this is generally due to the fact that flexible information maintenance over long periods requires precise fine-tuning of model parameters toward ’line attractor’ configurations (Fig. 1), a concept first propagated in computational neuroscience for addressing animal performance in parametric working memory tasks (Seung, 1996; Seung et al., 2000; Durstewitz, 2003). Line attractors introduce directions of zero-flow into the model’s state space that enable long-term maintenance of arbitrary values (Fig. 1). Specially designed RNN architectures equipped with gating mechanisms and (linear) memory cells have been suggested for solving this issue (Hochreiter & Schmidhuber, 1997; Cho et al., 2014). However, from a DS perspective, simpler models that can more easily be analyzed and interpreted in DS terms, and for which more efficient inference algorithms exist that emphasize approximation of the true underlying DS would be preferable.

Recent solutions to the vanishing vs. exploding gradient problem attempt to retain the simplicity of vanilla RNN by initializing or constraining the recurrent weight matrix to be the identity (Le et al., 2015), orthogonal (Henaff et al., 2016; Helfrich et al., 2018) or unitary (Arjovsky et al., 2016). In this way, in a system including piecewise linear (PL) components like rectified-linear units (ReLU), line attractor dimensions are established from the start by construction or ensured throughout training by a specifically parameterized matrix decomposition. However, for many DS problems, line attractors instantiated by mere initialization procedures may be unstable and quickly dissolve during training. On the other hand, orthogonal or unitary constraints are too restrictive for reconstructing DS, and more generally from a computational perspective as well (Kerg et al., 2019): For instance, neither chaotic behavior (that requires diverging directions) nor settings with multiple isolated fixed point or limit cycle attractors are possible.

Here we therefore suggest a different solution to the problem, by pushing (but not strictly enforcing) ReLU-based, piecewise-linear RNN (PLRNN) toward line attractor configurations along some (but not all) directions in state space. We achieve this by adding special regularization terms for a subset of RNN units to the loss function that promote such a configuration. We demonstrate that our approach outperforms, or is en par with, LSTM and other, initialization-based, methods on a number of ’classical’ machine learning benchmarks (Hochreiter & Schmidhuber, 1997). More importantly, we demonstrate that while with previous methods it was difficult to capture slow behavior in a DS that exhibits widely different time scales, our new regularization-supported inference efficiently captures all relevant time scales.

Fig. 1: Line attractors for solving long-time-scale problems. A)–B): Illustration of the state space of a 2-unit RNN (converted into a continuous time ODE, see Suppl. 7.1.2) with flow field (grey) and nullclines (set of points at which the flow of one of the variables vanishes, in blue and red). Insets: Time graphs of z1 for T=30 000. A) Perfect line attractor. The flow converges to the line attractor from all directions and is exactly zero on the line, thus retaining states indefinitely in the absence of perturbations, as illustrated for 3 example trajectories (green) started from different initial conditions. B) Slightly detuned line attractor (cf. Durstewitz (2003)). The system’s state still converges toward the ’line attractor ghost’ (Strogatz, 2015), but then very slowly crawls up within the ’attractor tunnel’ (green trajectory) until it hits the stable fixed point at the intersection of nullclines. Within the tunnel, flow velocity is smoothly regulated by the gap between nullclines, thus enabling arbitrary time constants. Note that along other, not illustrated dimensions of the system’s state space the flow may still evolve freely in all directions. C) Simple 2-unit solution to the addition problem exploiting the line attractor properties of ReLUs in the positive quadrant. The output unit serves as a perfect integrator, while the input unit will only convey those input values to the output unit that are accompanied by a ’1’ in the second input stream (see 7.1.1 for complete parameters).

2 Related work

Long-range dependency problems in RNN. Error gradients in vanilla RNN trained by some form of gradient descent, like back-propagation through time (BPTT, Rumelhart et al. (1986)), tend to either explode or vanish due to the large product of derivative terms that results from recursive application of the chain rule over time steps (Hochreiter, 1991; Bengio et al., 1994; Hochreiter & Schmidhuber, 1997). Formally, RNN 𝒛t=F𝜽(𝒛t-1,𝒔t) are discrete time dynamical systems that tend to either converge, e.g. to fixed point or limit cycle attractors, or diverge (to infinity or as in chaotic systems) over time, unless parameters of the system are precisely tuned to create directions of zero-flow in the system’s state space (Fig. 1), called line attractors (Seung, 1996; Seung et al., 2000; Durstewitz, 2003). Convergence of the RNN in general implies vanishing and global divergence exploding gradients. To address this issue, RNN with gated memory cells have been specifically designed (Hochreiter & Schmidhuber, 1997; Cho et al., 2014), but these are complicated and tedious to analyze from a DS perspective. Recently, Le et al. (2015) observed that initialization of the recurrent weight matrix 𝑾 to the identity in ReLU-based RNN may yield performance en par with LSTMs on standard machine learning benchmarks. For a ReLU with activity 𝒛t0, zero bias and unit slope, this results in the identity mapping, hence a line attractor configuration. Talathi & Vartak (2016) expanded on this idea by initializing the recurrence matrix such that its largest absolute eigenvalue is 1, arguing that this would leave other directions in the system’s state space free for computations other than memory maintenance. Later work enforced orthogonal (Henaff et al., 2016; Helfrich et al., 2018; Jing et al., 2019) or unitary (Arjovsky et al., 2016) constraints on the recurrent weight matrix during training. While this appears to yield long-term memory performance superior to that of LSTMs, these networks are limited in their computational power (Kerg et al., 2019). This may be a consequence of the fact that RNN with orthogonal recurrence matrix are quite restricted in the range of dynamical phenomena they can produce, e.g. chaotic attractors are not possible since diverging eigen-directions are disabled. Our approach therefore is to establish line attractors only along some but not all directions in state space, and to only push the RNN toward these configurations but not strictly enforce them, such that convergence or divergence of RNN dynamics is still possible. We furthermore implement these concepts through regularization terms in the loss functions, such that they are encouraged throughout training unlike when only established through initialization.

Dynamical systems reconstruction. From a natural science perspective, the goal of reconstructing the underlying DS fundamentally differs from building a system that ’merely’ yields good ahead predictions, as in DS reconstruction we require that the inferred model can freely reproduce (when no longer guided by the data) the underlying attractor geometries and state space properties (see section 3.5, Fig. S2; Kantz & Schreiber (2004)). Earlier work using RNN for DS identification (Roweis & Ghahramani, 2002; Yu et al., 2006) mainly focused on inferring the posterior over latent trajectories 𝒁={𝒛1,,𝒛T} given time series data 𝑿={𝒙1,,𝒙T}, p(𝒁|𝑿), and on ahead predictions (Lu et al., 2017), hence did not show that inferred models can generate the underlying attractor geometries on their own. Others (Trischler & D’Eleuterio, 2016; Brunton et al., 2016) attempt to approximate the flow field, obtained e.g. by numerical differentiation, directly through basis expansions or neural networks, but numerical derivatives are problematic for their high variance and other numerical issues (Raissi, 2018; Baydin et al., 2018; Chen et al., 2017). Some approaches assume the form of the DS equations basically to be given (Raissi, 2018; Gorbach et al., 2017) and focus on estimating the system’s latent states and parameters, rather than approximating an unknown DS based on the observed time series information alone. In many biological systems like the brain the intrinsic dynamics are highly stochastic with many noise sources, like probabilistic synaptic release (Stevens, 2003), such that models that do not explicitly account for dynamical process noise (Champion et al., 2019; Rudy et al., 2019) may be less suitable. Finally, some fully probabilistic models for DS reconstruction based on GRU (Fraccaro et al. (2016), cf. Jordan et al. (2019)), LSTM (Zheng et al., 2017), or radial basis function (Zhao & Park, 2017) networks are not easily interpretable and amenable to DS analysis. Most importantly, none of these previous approaches considers the long-range dependency problem within more easily tractable RNN for DS reconstruction.

3 Model formulation and optimization approaches

3.1 Model and preliminaries

Assume we are given two multivariate time series 𝑺={𝒔t} and 𝑿={𝒙t}, one we will denote as ’inputs’ (𝑺) and the other as ’outputs’ (𝑿). We will first consider the ’classical’ (supervised) machine learning setting where we wish to map 𝑺 on 𝑿 through a RNN with latent state equation 𝒛t=Fθ(𝒛t-1,𝒔t), as for instance in the ’addition problem’ (Hochreiter & Schmidhuber, 1997). In DS reconstruction, in contrast, we usually have a dense time series 𝑿 from which we wish to infer (unsupervised) the underlying DS, where 𝑺 may provide an additional forcing function or sparse experimental inputs or perturbations.

The latent RNN we consider here takes the specific form

𝒛t=𝑨𝒛t-1+𝑾ϕ(𝒛t-1)+𝑪𝒔t+𝒉+𝜺t,𝜺t𝒩(0,𝚺), (1)

where 𝒛tM×1 is the hidden state (column) vector of dimension M, 𝑨M×M a diagonal and 𝑾M×M an off-diagonal matrix, 𝒔tK×1 the external input of dimension K, 𝑪M×K the input mapping, 𝒉M×1 a bias, and 𝜺t a Gaussian noise term with diagonal covariance matrix diag(𝚺)+M. The nonlinearity ϕ(𝒛) is a ReLU, ϕ(𝒛)i=max(0,zi),i{1,,M}. This specific formulation is originally motivated by firing rate (population) models in computational neuroscience (Song et al., 2016; Durstewitz, 2017), where latent states 𝒛t may represent membrane voltages or currents, 𝑨 the neurons’ passive time constants, 𝑾 the synaptic coupling among neurons, and ϕ() the voltage-to-rate transfer function. However, for a RNN in the form 𝒛t=𝑾ϕ(𝒛t-1)+𝒉, note that the simple change of variables 𝒚t𝑾-1(𝒛t-𝒉) will yield the more familiar form 𝒚t=ϕ(𝑾𝒚t-1+𝒉) (Beer, 2006).

Besides its neuroscience motivation, note that by letting 𝑨=𝑰, 𝑾=𝟎, 𝒉=𝟎, we get a strict line attractor system across the variables’ whole support which we conjecture will be of advantage for establishing long short-term memory properties. Also we can solve for all of the system’s fixed points analytically by solving the equations 𝒛*=(𝑰-𝑨-𝑾𝑫Ω)-1𝒉, with 𝑫Ω as defined in Suppl. 7.1.2, and can determine their stability from the eigenvalues of matrix 𝑨+𝑾𝑫Ω. We could do the same for limit cycles, in principle, which are fixed points of the r-times iterated map F𝜽r, although practically the number of configurations to consider increases exponentially as 2Mr. Finally, we remark that a discrete piecewise-linear system can, under certain conditions, be transformed into an equivalent continuous-time (ODE) piecewise-linear system 𝜻˙=GΩ(𝜻(t),𝒔(t)) (Suppl. 7.1.2, Ozaki (2012)), in the sense that if 𝜻(t)=𝒛t, then 𝜻(t+Δt)=𝒛t+1 after a defined time step Δt. These are among the properties that make PLRNNs more amenable to rigorous DS analysis than other RNN formulations.

We will assume that the latent RNN states 𝒛t are coupled to the actual observations 𝒙t through a simple observation model of the form

𝒙t=𝑩g(𝒛t)+𝜼t,𝜼t𝒩(0,𝚪) (2)

in the case of real-valued observations 𝒙tN×1, where 𝑩N×M is a factor loading matrix and diag(𝚪)+N the diagonal covariance matrix of the Gaussian observation noise, or

p^i,tp^t(xi,t=1)=(e𝑩i,:𝒛t)(j=1Ne𝑩j,:𝒛t)-1, (3)

in the case of multi-categorical observations xi,t{0,1},ixi,t=1.

3.2 Regularization approach

We start from a similar idea as Le et al. (2015), who initialized RNN parameters such that it performs an identity mapping for zi,t0. However, 1) we use a neuroscientifically motivated network architecture (eq. 1) that enables the identity mapping across the variables whole support, zi,t[-,+], 2) we encourage this mapping only for a subset MregM of units (Fig. S1), leaving others free to perform arbitrary computations, and 3) we stabilize this configuration throughout training by introducing a specific L2 regularization for parameters 𝑨, 𝑾, and 𝒉 in eq. 1.

That way, we divide the units into two types, where the regularized units serve as a memory that tends to decay very slowly (depending on the size of the regularization term), while the remaining units maintain the flexibility to approximate any underlying DS, yet retaining the simplicity of the original RNN model (eq. 1). Specifically, the following penalty is added to the loss function (Fig. S1):

Lreg=τAi=1Mreg(Ai,i-1)2+τWi=1Mregj=1jiMWi,j2+τhi=1Mreghi2 (4)

While this formulation allows us to trade off, for instance, the tendency toward a line attractor (𝑨𝑰, 𝒉𝟎) vs. the sensitivity to other units’ inputs (𝑾𝟎), for all experiments performed here a common value, τA=τW=τh=τ, was assumed for the three regularization factors.

3.3 Optimization procedure for machine learning benchmarks

For comparability with other approaches like LSTMs (Hochreiter & Schmidhuber, 1997) or iRNN (Le et al., 2015), we will assume that the latent state dynamics eq. 1 are deterministic (i.e., 𝚺=𝟎), will take g(𝒛t)=𝒛t and 𝚪=𝑰N in eq. 2 (leading to an implicit Gaussian assumption with identity covariance matrix), and will use stochastic gradient descent (SGD) for training to minimize the squared-error loss across R samples, =n=1R(𝒙^T(n)-𝒙T(n))2, between estimated and actual outputs for the addition and multiplication problems, and the cross entropy loss =n=1R(-i=110xi,T(n)log(p^i,T(n))) for sequential MNIST, to which penalty eq. 4 was added for the regularized PLRNN (rPLRNN). We used the Adam algorithm (Kingma & Ba, 2014) from the PyTorch package (Paszke et al., 2017) with a learning rate of 0.001, a gradient clip parameter of 10, and batch size of 16. In all cases, SGD is stopped after 100 epochs and the fit with the lowest loss across all epochs is chosen.

3.4 Optimization procedure for dynamical systems reconstruction

For DS reconstruction we request that the latent RNN approximates the true generating system of equations, which is a taller order than learning the mapping 𝑺𝑿 or predicting future values in a time series (cf. sect. 3.5). This point has important implications for the design of models, inference algorithms and performance metrics if the primary goal is DS reconstruction rather than ’mere’ time series forecasting. In this context we consider the fully probabilistic, generative RNN eq. 1.

Together with eq. 2 (where we take g(𝒛t)=ϕ(𝒛t)) this gives the typical form of a nonlinear state space model (Durbin & Koopman, 2012) with observation and process noise. We solve for the parameters 𝜽={𝑨,𝑾,𝑪,𝒉,𝚺,𝑩,𝚪} by maximum likelihood, for which an efficient Expectation-Maximization (EM) algorithm has recently been suggested (Durstewitz, 2017; Koppe et al., 2019), which we will briefly summarize here. Since the involved integrals are not tractable, we start off from the evidence-lower bound (ELBO) to the log-likelihood which can be rewritten in various useful ways:

logp(𝑿|𝜽) 𝔼𝒁q[logp𝜽(𝑿,𝒁)]+H(q(𝒁|𝑿))
=logp(𝑿|𝜽)-DKL(q(𝒁|𝑿)p𝜽(𝒁|𝑿))(𝜽,q) (5)

In the E-step, given a current estimate 𝜽* for the parameters, we seek to determine the posterior p𝜽(𝒁|𝑿) which we approximate by a global Gaussian q(𝒁|𝑿) instantiated by the maximizer (mode) 𝒁* of p𝜽(𝒁|𝑿) as an estimator of the mean, and the negative inverse Hessian around this maximizer as an estimator of the state covariance, i.e.

𝔼[𝒁|𝑿]𝒁* =argmax𝒁logp𝜽(𝒁|𝑿)=argmax𝒁[logp𝜽(𝑿|𝒁)+logp𝜽(𝒁)-logp𝜽(𝑿)]
=argmax𝒁[logp𝜽(𝑿|𝒁)+logp𝜽(𝒁)], (6)

since 𝒁 integrates out in p𝜽(𝑿) (equivalently, this result can be derived from a Laplace approximation to the log-likelihood, logp(𝑿|𝜽)logp𝜽(𝑿|𝒁*)+logp𝜽(𝒁*)-12log|-𝑳*|+const, where 𝑳* is the Hessian evaluated at the maximizer). We solve this optimization problem by a fixed-point iteration scheme that efficiently exploits the model’s piecewise linear structure (see Suppl. 7.1.3, Durstewitz (2017); Koppe et al. (2019)).

Using this approximate posterior for p𝜽(𝒁|𝑿), based on the model’s piecewise-linear structure most of the expectation values 𝔼𝒛q[ϕ(𝒛)], 𝔼𝒛q[ϕ(𝒛)𝒛], and 𝔼𝒛q[ϕ(𝒛)ϕ(𝒛)], could be solved for (semi-)analytically (where 𝒛 is the concatenated vector form of 𝒁, as in Suppl. 7.1.3). In the M-step, we seek 𝜽*argmax𝜽(𝜽,q*), assuming proposal density q* to be given from the E-step, which for a Gaussian observation model amounts to a simple linear regression problem (see Suppl. eq. 23). To force the PLRNN to really capture the underlying DS in its governing equations, we use a previously suggested (Koppe et al. 2019) stepwise annealing protocol that gradually shifts the burden of fitting the observations 𝑿 from the observation model eq. 2 to the latent RNN model eq. 1 during training, the idea of which is to establish a mapping from latent states 𝒁 to observations 𝑿 first, fixing this, and then enforcing the temporal consistency constraints implied by eq. 1 while accounting for the actual observations.

3.5 Performance measures

Measures of prediction error. For the machine learning benchmarks we employed the same criteria as used for optimization (MSE or cross-entropy, sect. 3.3) as performance metrics, evaluated across left-out test sets. In addition, we report the relative frequency Pcorrect of correctly predicted trials across the test set. A correct trial in the addition and multiplication task is defined as an absolute prediction error smaller than 0.04 (analogous to Talathi & Vartak (2016)), while a correct trial in the sequential MNIST data set is defined as one for which the largest probability p^i*=max𝑖p^i,T indicated the correct class xi*,T=1.

Agreement in attractor geometries. From a DS perspective, it is not sufficient or even sensible to judge a method’s ability to infer the underlying DS purely based on some form of (ahead-)prediction error like the MSE defined on the time series itself (Ch.12 in Kantz & Schreiber (2004)). Rather, we require that the inferred model can freely reproduce (when no longer guided by the data) the underlying attractor geometries and state space properties. This is not automatically guaranteed for a model that yields agreeable ahead predictions on a time series. Vice versa, if the underlying attractor is chaotic, with a tiny bit of noise even trajectories starting from the same initial condition will quickly diverge and ahead-prediction errors are not even meaningful as a performance metric (Fig. S2A). To quantify how well an inferred PLRNN captured the underlying dynamics we therefore followed Koppe et al. (2019) and used the Kullback-Leibler divergence between the true and reproduced probability distributions across states in state space, thus assessing the agreement in attractor geometries (cf. Takens (1981); Sauer et al. (1991)) rather than in precise matching of time series,

DKL(ptrue(𝒙)pgen(𝒙|𝒛))k=1Kp^true(k)(𝒙)log(p^true(k)(𝒙)p^gen(k)(𝒙|𝒛)), (7)

where ptrue(𝒙) is the true distribution of observations across state space, pgen(𝒙|𝒛) is the distribution of observations generated by running the inferred PLRNN, and the sum indicates a spatial discretization (binning) of the observed state space (see Suppl. 7.1.4 for more details). We emphasize that p^gen(k)(𝒙|𝒛) is obtained from freely simulated trajectories p^(𝒛), not from the inferred posteriors p^(𝒛|𝒙train). In addition, to assess reproduction of time scales by the inferred PLRNN, we computed the average correlation between the power spectra of the true and generated time series.

Fig. 2: Comparison of rPLRNN (τ=5,MregM=0.5, cf. Fig. S3) to other methods for A) addition problem, B) multiplication problem and C) sequential MNIST. Top row gives loss as a function of time series length T (error bars = SEM), bottom row shows relative frequency of correct trials. Dashed lines indicate chance level, black dots in C indicate individual repetitions of the experiment.

4 Numerical experiments

4.1 Machine learning benchmarks

We compared the performance of our rPLRNN to other models on the following three benchmarks requiring long short-term maintenance of information (as in Talathi & Vartak (2016) and Hochreiter & Schmidhuber (1997)): 1) The addition problem of time length T consists of 100 000 training and 10 000 test samples of 2×T input series 𝑺={𝒔1,,𝒔T}, where entries 𝒔1,:[0,1] are drawn from a uniform random distribution and 𝒔2,:{0,1} contains zeros except for two indicator bits placed randomly at times t1<10 and t2<T/2. Constraints on t1 and t2 are chosen such that every trial requires a long memory of at least T/2 time steps. At the last time step T, the target output of the network is the sum of the two inputs in 𝒔1,: indicated by the 1-entries in 𝒔2,:, 𝒙Ttarget=s1,t1+s1,t2. 2) The multiplication problem is the same as the addition problem, only that the product instead of the sum has to be produced by the RNN as an output at time T, 𝒙Ttarget=s1,t1s1,t2. 3) The MNIST dataset (LeCun & Cortes, 2010) consists of 60 000 training and 10 000 28×28 test images of hand written digits. To make this a time series problem, in sequential MNIST the images are presented sequentially, pixel-by-pixel, scanning lines from upper left to bottom-right, resulting in time series of fixed length T=784.

On all three benchmarks we compare the performance of the rPLRNN (eq. 1) to several other models summarized in Table 1. To achieve a meaningful comparison, all models have the same number of hidden states M, except for the LSTM, which requires three additional parameters for each hidden state and hence has only M/4 hidden states, yielding the overall same number of trainable parameters as for the other models. In all cases, M=40, which initial numerical exploration suggested to be a good compromise between model complexity (bias) and data fit (variance) (Fig. S3).

Table 1: Overview over the different models used for comparison
NAME DESCRIPTION
RNN Vanilla ReLU based RNN
iRNN RNN with initialization 𝑾0=𝑰 and 𝒉0=𝟎 (Le et al., 2015)
npRNN RNN with weights initialized to a normalized positive definite matrix with largest eigenvalue of 1 and biases initialized to zero (Talathi & Vartak, 2016)
PLRNN PLRNN as given in eq. 1 (Koppe et al., 2019)
iPLRNN PLRNN with initialization 𝑨0=𝑰, 𝑾0=𝟎 and 𝒉0=𝟎
rPLRNN PLRNN initialized as illustrated in Fig. S1, with additional regularization term (eq. 4) during training
LSTM Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)

Fig. 2 summarizes the results for the machine learning benchmarks. As can be seen, on the addition and multiplication tasks, and in terms of either the MSE or percentage correct, our rPLRNN outperforms all other tested methods, including LSTMs. Indeed, the LSTM performs even significantly worse than the iRNN and the iPLRNN. The large error bars in Fig. 2 result from the fact that the networks mostly learn these tasks in an all-or-none fashion, i.e. either learn the task and succeed in almost 100 percent of the cases or fail completely. The results for the sequential MNIST problem are summarized in Fig. 2C. While in this case the LSTM outperforms all other methods, the rPLRNN is almost en par with it. In addition, the iPLRNN outperforms the iRNN.

4.2 Numerical experiments on a dynamical system with different time scales

Here our goal was to examine whether our regularization approach would also help with the identification of DS that harbor widely different time scales. By tuning systems in the vicinity of line attractors, multiple arbitrary time scales can be realized in theory (Durstewitz, 2003). To test this, we used a biophysically motivated (highly nonlinear) bursting cortical neuron model with one voltage and two conductance recovery variables (see Durstewitz (2009)), one slow and one fast (Suppl. 7.1.5). Reproduction of this DS is challenging since it produces very fast spikes on top of a slow nonlinear oscillation (Fig. 3D). Time series of standardized variables of length T=1500 were generated from this model and provided as observations to the rPLRNN inference algorithm. rPLRNNs with M={818} states were estimated, with the regularization factor varied within τ{0,101,102,103,104,105}/1500.

Fig. 3A confirms our intuition that stronger regularization leads to better DS reconstruction as assessed by the KL divergence between true and generated state distributions. This decrease in DKL is accompanied by a likewise decrease in the MSE between the power spectra of true (Suppl. eq. 7.1.5) and generated (rPLRNN) voltage traces as τ increased (Fig. 3B). Fig. 3D gives an example of voltage traces and gating variables freely simulated (i.e., sampled) from the generative rPLRNN trained with τ=23, illustrating that our model is in principle capable of capturing both the stiff spike dynamics and the slower oscillations in the second gating variable at the same time. Fig. 3C provides more insight into how the regularization worked: While the high frequency components (>50Hz) related to the repetitive spiking activity hardly benefitted from increasing τ, there was a strong reduction in the MSE computed on the power spectrum for the lower frequency range (50Hz), suggesting that increased regularization helps to map slowly evolving components of the dynamics.

Fig. 3: Reconstruction of a 2-time scale DS (biophysical bursting neuron model) in limit cycle regime. A) KL divergence (DKL) between true and generated state space distributions as a function of τ. Unstable (globally diverging) system estimates were removed. B) Average MSE between power spectra of true and reconstructed DS. C) Average normalized MSE between power spectra of true and reconstructed DS split according to low (50Hz) and high (>50Hz) frequency components. Error bars = SEM in all graphs. D) Example of (best) generated time series (red=reconstruction with τ=23).

5 Conclusions

In this work we have introduced a simple solution to the long short-term memory problem in RNN that on the one hand retains the simplicity and tractability of vanilla RNN, yet on the other hand does not curtail the universal computational capabilities of RNN (Koiran et al., 1994; Siegelmann & Sontag, 1995) and their ability to approximate arbitrary DS (Funahashi & Nakamura, 1993; Kimura & Nakano, 1998; Trischler & D’Eleuterio, 2016). We achieved this by adding regularization terms to the loss function that encourage the system to form a ’memory subspace’, that is, line attractor dimensions (Seung, 1996; Durstewitz, 2003) which would store arbitrary values for, if unperturbed, arbitrarily long periods. At the same time we did not rigorously enforce this constraint which has important implications for capturing slow time scales in the data: It allows the RNN to slightly depart from a perfect line attractor, which has been shown to constitute a general dynamical mechanism for regulating the speed of flow and thus the learning of arbitrary time constants that are not naturally included qua RNN design (Durstewitz, 2003; 2004). This is because as we come infinitesimally close to a line attractor and thus a bifurcation in the system’s parameter space, the flow along this direction becomes arbitrarily slow until it vanishes completely in the line attractor configuration (Fig. 1). Moreover, part of the RNN’s latent space was not regularized at all, leaving the system enough degrees of freedom for realizing arbitrary computations or dynamics. We showed that the rPLRNN is en par with or outperforms initialization-based approaches and LSTMs on a number of classical benchmarks, and, more importantly, that the regularization strongly facilitates the identification of challenging DS with widely different time scales in PLRNN-based algorithms for DS reconstruction. Future work will explore a wider range of DS models and empirical data with diverse temporal and dynamical phenomena.

6 Acknowledgements

This work was funded by grants from the German Research Foundation (DFG) to DD (Du 354/10-1, Du 354/8-2 within SPP 1665) and to GK (TRR265: A06 & B08). We would like to cordially thank Dr. Zahra Monfared for her careful reading of the manuscript and her thoughtful suggestions.

References

  • Arjovsky et al. (2016) Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In Proceedings of The 33rd International Conference on Machine Learning, 2016. URL http://proceedings.mlr.press/v48/arjovsky16.html.
  • Ayed et al. (2019) Ibrahim Ayed, Emmanuel de Bézenac, Arthur Pajot, Julien Brajard, and Patrick Gallinari. Learning Dynamical Systems from Partial Observations. arXiv preprint, 2019. URL http://arxiv.org/abs/1902.11136.
  • Baydin et al. (2018) Atılım Güneş Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 18:1–43, feb 2018. ISSN 15337928. URL http://arxiv.org/abs/1502.05767.
  • Beer (2006) Randall D. Beer. Parameter space structure of continuous-time recurrent neural networks. Neural computation, 18(12):3009–51, 2006. doi: 10.1162/neco.2006.18.12.3009. URL http://www.ncbi.nlm.nih.gov/pubmed/17052157.
  • Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994. doi: 10.1109/72.279181. URL https://doi.org/10.1109/72.279181.
  • Brunton et al. (2016) Steven L. Brunton, Joshua L. Proctor, J. Nathan Kutz, and William Bialek. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences of the United States of America, 113(15):3932–3937, 2016. doi: 10.1073/pnas.1517384113. URL https://www.pnas.org/content/113/15/3932.
  • Champion et al. (2019) Kathleen Champion, Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Data-driven discovery of coordinates and governing equations. arXiv preprint, 2019. URL http://arxiv.org/abs/1904.02107.
  • Chen et al. (2017) Shizhe Chen, Ali Shojaie, and Daniela M. Witten. Network Reconstruction From High-Dimensional Ordinary Differential Equations. Journal of the American Statistical Association, 112(520):1697–1707, 2017. doi: 10.1080/01621459.2016.1229197. URL http://doi.org/10.1080/01621459.2016.1229197.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734. Association for Computational Linguistics (ACL), 2014.
  • Duncker et al. (2019) Lea Duncker, Gergo Bohner, Julien Boussard, and Maneesh Sahani. Learning interpretable continuous-time models of latent stochastic dynamical systems. arXiv preprint, 2019. URL http://arxiv.org/abs/1902.04420.
  • Durbin & Koopman (2012) James Durbin and Siem Jan Koopman. Time Series Analysis by State Space Methods. Oxford University Press, May 2012. doi: 10.1093/acprof:oso/9780199641178.001.0001. URL https://doi.org/10.1093/acprof:oso/9780199641178.001.0001.
  • Durstewitz (2003) Daniel Durstewitz. Self-Organizing Neural Integrator Predicts Interval Times. Journal of Neuroscience, 23(12):5342–5353, 2003. URL https://www.jneurosci.org/content/23/12/5342.
  • Durstewitz (2004) Daniel Durstewitz. Neural representation of interval time. NeuroReport, 15(5):745–749, April 2004. doi: 10.1097/00001756-200404090-00001. URL https://doi.org/10.1097/00001756-200404090-00001.
  • Durstewitz (2009) Daniel Durstewitz. Implications of synaptic biophysics for recurrent network dynamics and active memory. Neural Networks, 22(8):1189–1200, oct 2009. doi: 10.1016/j.neunet.2009.07.016. URL https://doi.org/10.1016/j.neunet.2009.07.016.
  • Durstewitz (2017) Daniel Durstewitz. A State Space Approach for Piecewise‐Linear Recurrent Neural Networks for Reconstructing Nonlinear Dynamics from Neural Measurements. PLoS Computational Biology, 13(6):e1005542, 2017. doi: 10.1371/journal.pcbi.1005542. URL https://doi.org/10.1371/journal.pcbi.1005542.
  • Fraccaro et al. (2016) Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. Advances in Neural Information Processing Systems, 2016.
  • Funahashi & Nakamura (1993) Ken-ichi Funahashi and Yuichi Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6(6):801–806, 1993. doi: 10.1016/s0893-6080(05)80125-x. URL https://doi.org/10.1016/s0893-6080(05)80125-x.
  • Gorbach et al. (2017) Nico S. Gorbach, Stefan Bauer, and Joachim M. Buhmann. Scalable variational inference for dynamical systems. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 4806–4815. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7066-scalable-variational-inference-for-dynamical-systems.pdf.
  • Helfrich et al. (2018) Kyle E. Helfrich, Devin Whimott, and Qiang Ye. Orthogonal recurrent neural networks with scaled Cayley transform. 35th International Conference on Machine Learning, ICML 2018, 5, 2018.
  • Henaff et al. (2016) Mikael Henaff, Arthur Szlam, and Yann LeCun. Recurrent orthogonal networks and long-memory tasks. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2034–2042, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/henaff16.html.
  • Hershey & Olsen (2007) John R. Hershey and Peder A. Olsen. Approximating the Kullback Leibler divergence between gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07, volume 4, pp. IV–317–IV–320, April 2007. doi: 10.1109/ICASSP.2007.366913. URL https://doi.org/10.1109/ICASSP.2007.366913.
  • Hochreiter (1991) Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–80, 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://www.ncbi.nlm.nih.gov/pubmed/9377276.
  • Jing et al. (2019) Li Jing, Caglar Gulcehre, John Peurifoy, Yichen Shen, Max Tegmark, Marin Soljacic, and Yoshua Bengio. Gated Orthogonal Recurrent Units: On Learning to Forget. Neural Computation, 31(4):765–783, apr 2019. ISSN 1530888X. doi: 10.1162/neco˙a˙01174.
  • Jordan et al. (2019) Ian D. Jordan, Piotr A. Sokol, and Il Memming Park. Gated recurrent units viewed through the lens of continuous time dynamical systems. arXiv preprint, abs/1906.01005, 2019. URL http://arxiv.org/abs/1906.01005.
  • Kantz & Schreiber (2004) Holger Kantz and Thomas Schreiber. Nonlinear Time Series Analysis. Cambridge University Press, 2. edition, 2004. doi: 10.1017/CBO9780511755798. URL https://doi.org/10.1017/CBO9780511755798.
  • Kerg et al. (2019) Giancarlo Kerg, Kyle Goyette, Maximilian P. Touzel, Gauthier Gidel, Eugene Vorontsov, Yoshua Bengio, and Guillaume Lajoie. Non-normal Recurrent Neural Network (nnRNN): learning long time dependencies while improving expressivity with transient dynamics. arXiv preprint, 2019. URL http://arxiv.org/abs/1905.12080.
  • Kimura & Nakano (1998) M. Kimura and R. Nakano. Learning dynamical systems by recurrent neural networks from orbits. Neural Networks, 11(9):1589–1599, 1998. doi: 10.1016/s0893-6080(98)00098-7. URL https://doi.org/10.1016/s0893-6080(98)00098-7.
  • Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv preprint, 2014. URL http://arxiv.org/abs/1412.6980.
  • Koiran et al. (1994) Pascal Koiran, Michel Cosnard, and Max H. Garzon. Computability with low-dimensional dynamical systems. Theoretical Computer Science, 132:113–128, 1994.
  • Koppe et al. (2019) Georgia Koppe, Hazem Toutounji, Peter Kirsch, Stefanie Lis, and Daniel Durstewitz. Identifying nonlinear dynamical systems via generative recurrent neural networks with applications to fMRI. PLOS Computational Biology, 15(8):e1007263, 2019. doi: 10.1371/journal.pcbi.1007263. URL https://doi.org/10.1371/journal.pcbi.1007263.
  • Le et al. (2015) Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv preprint, 2015. URL http://arxiv.org/abs/1504.00941.
  • LeCun & Cortes (2010) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
  • Lu et al. (2017) Zhixin Lu, Jaideep Pathak, Brian Hunt, Michelle Girvan, Roger Brockett, and Edward Ott. Reservoir observers: Model-free inference of unmeasured variables in chaotic systems. Chaos, 27(4), 2017. doi: 10.1063/1.4979665.
  • Ozaki (2012) Tohru Ozaki. Time Series Modeling of Neuroscience Data. CRC Press, 2012. doi: 10.1201/b11527. URL https://doi.org/10.1201/b11527.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
  • Raissi (2018) Maziar Raissi. Deep hidden physics models: Deep learning of nonlinear partial differential equations. Journal of Machine Learning Research, 19:1–24, 2018. URL http://jmlr.org/papers/v19/18-046.html.
  • Roweis & Ghahramani (2002) Sam Roweis and Zoubin Ghahramani. Learning Nonlinear Dynamical Systems Using the Expectation-Maximization Algorithm. In Kalman Filtering and Neural Networks, chapter 6, pp. 175–220. Wiley-Blackwell, 2002. ISBN 9780471221548. doi: 10.1002/0471221546.ch6. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/0471221546.ch6.
  • Rudy et al. (2019) Samuel H. Rudy, Steven L. Brunton, and J. Nathan Kutz. Smoothing and parameter estimation by soft-adherence to governing equations. Journal of Computational Physics, 398:108860, December 2019. doi: 10.1016/j.jcp.2019.108860. URL https://doi.org/10.1016/j.jcp.2019.108860.
  • Rumelhart et al. (1986) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986. doi: 10.1038/323533a0. URL https://doi.org/10.1038/323533a0.
  • Sauer et al. (1991) Tim Sauer, James A. Yorke, and Martin Casdagli. Embedology. Journal of Statistical Physics, 65(3-4):579–616, November 1991. doi: 10.1007/bf01053745. URL https://doi.org/10.1007/bf01053745.
  • Seung (1996) H. Sebastian Seung. How the brain keeps the eyes still. Proceedings of the National Academy of Sciences, 93(23):13339–13344, 1996. doi: 10.1073/pnas.93.23.13339. URL https://doi.org/10.1073/pnas.93.23.13339.
  • Seung et al. (2000) H. Sebastian Seung, Daniel D. Lee, Ben Y. Reis, and David W. Tank. Stability of the memory of eye position in a recurrent network of conductance-based model neurons. Neuron, 2000. doi: 10.1016/S0896-6273(00)81155-1. URL https://doi.org/10.1016/S0896-6273(00)81155-1.
  • Siegelmann & Sontag (1995) Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, February 1995. doi: 10.1006/jcss.1995.1013. URL https://doi.org/10.1006/jcss.1995.1013.
  • Song et al. (2016) H. Francis Song, Guangyu R. Yang, and Xiao Jing Wang. Training Excitatory-Inhibitory Recurrent Neural Networks for Cognitive Tasks: A Simple and Flexible Framework. PLoS Computational Biology, 12(2):1–30, 2016. doi: 10.1371/journal.pcbi.1004792. URL https://doi.org/10.1371/journal.pcbi.1004792.
  • Stevens (2003) Charles F Stevens. Neurotransmitter release at central synapses. Neuron, 40(2):381–388, October 2003. doi: 10.1016/s0896-6273(03)00643-3. URL https://doi.org/10.1016/s0896-6273(03)00643-3.
  • Strogatz (2015) Steven H. Strogatz. Nonlinear Dynamics and Chaos: Applications to Physics, Biology, Chemistry, and Engineering: With Applications to Physics, Biology, Chemistry and Engineering. CRC Press, 2015. doi: 10.1201/9780429492563. URL https://doi.org/10.1201/9780429492563.
  • Takens (1981) Floris Takens. Detecting strange attractors in turbulence. In Lecture Notes in Mathematics, pp. 366–381. Springer Berlin Heidelberg, 1981. doi: 10.1007/bfb0091924. URL https://doi.org/10.1007/bfb0091924.
  • Talathi & Vartak (2016) Sachin S. Talathi and Aniket Vartak. Improving performance of recurrent neural network with ReLU nonlinearity. ICLR Workshop submission, 2016. URL https://arxiv.org/abs/1511.03771.
  • Trischler & D’Eleuterio (2016) Adam P. Trischler and Gabriele M.T. D’Eleuterio. Synthesis of recurrent neural networks for dynamical system simulation. Neural Networks, 80:67–78, 2016. doi: 10.1016/j.neunet.2016.04.001. URL http://doi.org/10.1016/j.neunet.2016.04.001.
  • Yu et al. (2006) Byron M. Yu, Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, and Maneesh Sahani. Extracting dynamical structure embedded in neural activity. In Y. Weiss, B. Schölkopf, and J. C. Platt (eds.), Advances in Neural Information Processing Systems 18, pp. 1545–1552. MIT Press, 2006. URL http://papers.nips.cc/paper/2823-extracting-dynamical-structure-embedded-in-neural-activity.pdf.
  • Zhao & Park (2017) Yuan Zhao and Il Memming Park. Variational Joint Filtering. arXiv preprint, 2017. URL http://arxiv.org/abs/1707.09049.
  • Zheng et al. (2017) Xun Zheng, Manzil Zaheer, Amr Ahmed, Yuan Wang, Eric P. Xing, and Alexander J. Smola. State Space LSTM Models with Particle MCMC Inference. arXiv preprint, 2017. URL http://arxiv.org/abs/1711.11179.

7 Supplementary Material

7.1 Supplementary text

7.1.1 Simple exact PLRNN solution for addition problem

The exact PLRNN parameter settings (cf. eq. 1) for solving the addition problem with 2 units (cf. Fig. 1C) are as follows:

𝑨=(1000),𝑾=(0100),𝒉=(0-1),𝑪=(0011),𝑩=(10) (8)

7.1.2 Conversion from discrete to continuous time PLRNN

Under some conditions we can translate the discrete into an equivalent continuous time PLRNN. Using 𝑫Ω(t) as defined below (7.1.3) for a single time step t, we can rewrite (ignoring the noise term and inputs) PLRNN eq. 1 in the form

𝒛t+1=F(𝒛t)=𝑾Ω(t)𝒛t+𝒉,with𝑾Ω(t):=𝑨+𝑾𝑫Ω(t), (9)

where Ω(t){m|zm,t>0} is the set of all unit indices with activation larger 0 at time t. To convert this into an equivalent (in the sense defined in eq. 11) system of (piecewise) ordinary differential equations (ODE), we need to find parameters 𝑾~Ω and 𝒉~,

𝜻˙=G(𝜻)=𝑾~Ω𝜻(t)+𝒉~, (10)

such that

𝒛0=𝜻(0)𝒛1=F(𝒛0)=𝜻(Δt), (11)

where Δt is the time step with which the empirically observed time series 𝑿 was sampled. From these conditions it follows that for each of the s{1,,2M} subregions (orthants) defined by fixed index sets Ωs{1,,M} we must have

(𝑨+𝑾𝑫Ωs-𝑰)𝒛0+𝒉=0Δt𝑾~Ωs𝜻(t)+𝒉~dt, (12)

where we assume that 𝑫Ωs is constant for one time step, i.e. between 0 and Δt. We approach this by first solving the homogeneous system using the general ansatz for systems of linear ODEs,

(𝑨+𝑾𝑫Ωs-𝑰)𝒛0 =!0Δt𝑾~Ωskckeλ~kt𝒗~kdt (13)
=kck𝑾~Ωs𝒗k0Δteλ~kt𝑑t (14)
=kckλ~k𝒗k1λ~k(eλ~kΔt-1) (15)
𝑾Ωs𝒛0 =!kck𝒗keλ~kΔt (16)
=𝑽(eλ~1Δt𝟎𝟎eλ~MΔt)𝑬𝒄 (17)
𝑾Ωs =𝑽𝑬𝑽-1. (18)

where we have used 𝒛0=kck𝒗k on lines 15 and 16. Hence we can infer matrix 𝑾~Ωs from the eigendecomposition of matrix 𝑾Ωs, by letting λ~k=1Δtlogλk, where λk are the eigenvalues of 𝑾Ωs, and reassembling

𝑾~Ωs =𝑽1Δtlog(𝚲)𝑽-1. (19)

We obtain the general solution for the inhomogeneous case by requiring that for all fixed points 𝒛*=F(𝒛*) of the map eq. 9 we have G(𝒛*)=0. Using this we obtain

𝒉~ =-𝑾~Ωs(𝑰-𝑾Ωs)-1𝒉 (20)

Assuming inputs 𝒔t to be constant across time step Δt, we can apply the same transformation to input matrix 𝑪. Fig. S5 illustrates the discrete to continuous PLRNN conversion for a nonlinear oscillator.

Note that in the above derivations we have assumed that matrix 𝑾Ωs can be diagonalized with distinct eigenvalues λiλjij, for all Ωs, and that all eigenvalues are nonzero (in fact, 𝑾Ωs should not have any negative real eigenvalues; problems also arise when solutions 𝒛t fall exactly onto the coordinates (boundaries between regions)). In general, not every discrete-time PLRNN can be converted into a continuous-time ODE system in the sense defined above. For instance, we can have chaos in a 1d nonlinear map, while we need at least a 3d ODE system to create chaos (Strogatz, 2015).

7.1.3 More details on EM algorithm

Here we briefly outline the fixed-point-iteration algorithm for solving the maximization problem in eq. 6 (for more details see Durstewitz (2017); Koppe et al. (2019)). Given a Gaussian latent PLRNN and a Gaussian observation model, the joint density p(𝑿,𝒁) will be piecewise Gaussian, hence eq. 6 piecewise quadratic in 𝒁. Let us concatenate all state variables across m and t into one long column vector 𝒛=(z1,1,,zM,1,,z1,T,,zM,T), arrange matrices 𝑨, 𝑾 into large MT×MT block tri-diagonal matrices, define 𝒅Ω(𝟏z1,1>0,𝟏z2,1>0,,𝟏zM,T>0) as an indicator vector with a 1 for all states zm,t>0 and zeros otherwise, and 𝑫Ωdiag(𝒅Ω) as the diagonal matrix formed from this vector. Collecting all terms quadratic, linear, or constant in 𝒛, we can then write down the optimization criterion in the form

QΩ*(𝒛) =-12[𝒛(𝑼0+𝑫Ω𝑼1+𝑼1𝑫Ω+𝑫Ω𝑼2𝑫Ω)𝒛-𝒛(𝒗0+𝑫Ω𝒗1)
-(𝒗0+𝑫Ω𝒗1)𝒛]+const. (21)

In essence, the algorithm now iterates between the two steps:

  1. 1.

    Given fixed 𝑫Ω, solve 𝒛*=(𝑼0+𝑫Ω𝑼1+𝑼1𝑫Ω+𝑫Ω𝑼2𝑫Ω)-1(𝒗0+𝑫Ω𝒗1)

  2. 2.

    Given fixed 𝒛*, recompute 𝑫Ω

until either convergence or one of several stopping criteria (partly likelihood-based, partly to avoid loops) is reached. The solution may afterwards be refined by one quadratic programming step. Numerical experiments showed this algorithm to be very fast and efficient (Durstewitz, 2017; Koppe et al., 2019). At 𝒛*, an estimate of the state covariance is then obtained as the inverse negative Hessian,

𝑽=(𝑼0+𝑫Ω𝑼1+𝑼1𝑫Ω+𝑫Ω𝑼2𝑫Ω)-1. (22)

In the M-step, using the proposal density q* from the E-step, the solution to the maximization problem 𝜽*argmax𝜽(𝜽,q*), can generally be expressed in the form

𝜽*=(t𝔼[𝜶t𝜷t])(t𝔼[𝜷t𝜷t])-1, (23)

where, for the latent model, eq. 1, 𝜶t=𝒛t and 𝜷t[𝒛t-1,ϕ(𝒛t-1),𝒔t,1]2M+K+1, and for the observation model, eq. 2, 𝜶t=𝒙t and 𝜷t=g(𝒛t).

7.1.4 More details on DS performance measure

The measure DKL introduced in the main text for assessing the agreement in attractor geometries only works for situations where the ground truth ptrue(𝑿) is known. Following Koppe et al. (2019), here we would like to briefly indicate how a proxy for DKL may be obtained in empirical situations where no ground truth is available. Reasoning that for a well reconstructed DS the inferred posterior pinf(𝒛|𝒙) given the observations should be a good representative of the prior generative dynamics pgen(𝒛), one may use the Kullback-Leibler divergence between the distribution over latent states, obtained by sampling from the prior density pgen(𝒛), and the (data-constrained) posterior distribution pinf(𝒛|𝒙) (where 𝒛M×1 and 𝒙N×1), taken across the system’s state space:

DKL(pinf(𝒛|𝒙)pgen(𝒛))=𝒛M×1pinf(𝒛|𝒙)logpinf(𝒛|𝒙)pgen(𝒛)d𝒛 (24)

As evaluating this integral is difficult, one could further approximate pinf(𝒛|𝒙) and pgen(𝒛) by Gaussian mixtures across trajectories, i.e. pinf(𝒛|𝒙)1Tt=1Tp(𝒛t|𝒙1:T) and pgen(𝒛)1Ll=1Lp(𝒛l|𝒛l-1), where the mean and covariance of p(𝒛t|𝒙1:T) and p(𝒛l|𝒛l-1) are obtained by marginalizing over the multivariate distributions p(𝒁|𝑿) and pgen(𝒁), respectively, yielding 𝔼[𝒛t|𝒙1:T], 𝔼[𝒛l|𝒛l-1], and covariance matrices Var(𝒛t|𝒙1:T) and Var(𝒛l|𝒛l-1). Supplementary eq. 24 may then be numerically approximated through Monte Carlo sampling (Hershey & Olsen, 2007) by

DKL(pinf(𝒛|𝒙)pgen(𝒛))1ni=1nlogpinf(𝒛(i)|𝒙)pgen(𝒛(i)),𝒛(i)pinf(𝒛|𝒙) (25)

For high-dimensional state spaces, for which MC sampling becomes challenging, there is luckily a variational approximation of eq. 24 available (Hershey & Olsen, 2007):

DKLvariational(pinf(𝒛|𝒙)pgen(𝒛))1Tt=1Tlogj=1Te-DKL(p(𝒛t|𝒙1:T)p(𝒛j|𝒙1:T))k=1Te-DKL(p(𝒛t|𝒙1:T)p(𝒛k|𝒛k-1)), (26)

where the KL divergences in the exponentials are among Gaussians for which we have an analytical expression.

7.1.5 More details on single neuron model

The neuron model used in section 4.2 is described by

-CmV˙ =gL(V-EL)+gNam(V)(V-ENa)+gKn(V-EK)
+gMh(V-EK)+gNMDAσ(V)(V-ENMDA) (27)
h˙ =h(V)-hτh (28)
n˙ =n(V)-nτn (29)
σ(V) =[1+.33e-.0625V]-1 (30)

where Cm refers to the neuron’s membrane capacitance, the g to different membrane conductances, E to the respective reversal potentials, and m, h, and n are gating variables with limiting values given by

{m,n,h} =[1+e({VhNa,VhK,VhM}-V)/{kNa,kK,kM}]-1 (31)

Different parameter settings in this model lead to different dynamical phenomena, including regular spiking, slow bursting or chaos (see Durstewitz (2009) for details). Parameter settings used here were: Cm=6µF, gL=8mS, EL=-80mV, gNa=20mS, ENa=60mV, VhNa=-20mV, kNa=15, gK=10mS, EK=-90mV, VhK=-25mV, kK=5, τn=1ms, gM=25mS, VhM=-15mV, kM=5, τh=200ms, gNMDA=10.2mS.

7.2 Supplementary Figures

Fig. S1: Illustration of the L2-regularization for the PLRNN’s auto-regression matrix 𝑨, coupling matrix 𝑾, and bias terms 𝒉. Regularized values are indicated in red, crosses mark arbitrary values (all other values set to 0 as indicated).
Fig. S2: MSE evaluated between time series is not a good measure for DS reconstruction. A) Time graph (top) and state space (bottom) for the single neuron model (see section 4.2 and Suppl. 7.1.5) with parameters in the chaotic regime (blue curves) and with simple fixed point dynamics in the limit (red line). Although the system has vastly different limiting behaviors (attractor geometries) in these two cases, as visualized in the state space, the agreement in time series initially seems to indicate a perfect fit. B) Same as in A) for two trajectories drawn from exactly the same DS (i.e., same parameters) with slightly different initial conditions. Despite identical dynamics, the trajectories immediately diverge, resulting in a high MSE. Dash-dotted grey lines in top graphs indicate the point from which onward the state space trajectories were depicted.
Fig. S3: Performance of the rPLRNN on the addition problem for different A) numbers of latent states M, B) values of τ and C) proportions Mreg/M. Dashed lines denote the values used for the results reported in section 4.1
Fig. S4: Effect of regularization strength τ on rPLRNN network parameters (cf. eq. 1) (regularized parameters for states mMreg, eq. 1, in red). Note that some of the non-regularized network parameters (in blue) appear to systematically change as well as τ is varied.
Fig. S5: Illustration of conversion of discrete into continuous time PLRNN for a PLRNN emulation of the nonlinear van-der-Pol oscillator. Shown are the first two latent dimensions. Red lines: continuous solution; blue circles: discrete solution; black bars: perturbations (external inputs).