Abstract
Vanilla RNN with ReLU activation have a simple structure that is amenable tosystematic dynamical systems analysis and interpretation, but they suffer fromthe exploding vs. vanishing gradients problem. Recent attempts to retain thissimplicity while alleviating the gradient problem are based on properinitialization schemes or orthogonality/unitary constraints on the RNN'srecurrence matrix, which, however, comes with limitations to its expressivepower with regards to dynamical systems phenomena like chaos ormultistability. Here, we instead suggest a regularization scheme that pushespart of the RNN's latent subspace toward a line attractor configuration thatenables long shortterm memory and arbitrarily slow time scales. We show thatour approach excels on a number of benchmarks like the sequential MNIST ormultiplication problems, and enables reconstruction of dynamical systems whichharbor widely different time scales.
Quick Read (beta)
Inferring dynamical systems with longrange dependencies through line attractor regularization
Abstract
Vanilla RNN with ReLU activation have a simple structure that is amenable to systematic dynamical systems analysis and interpretation, but they suffer from the exploding vs. vanishing gradients problem. Recent attempts to retain this simplicity while alleviating the gradient problem are based on proper initialization schemes or orthogonality/unitary constraints on the RNN’s recurrence matrix, which, however, comes with limitations to its expressive power with regards to dynamical systems phenomena like chaos or multistability. Here, we instead suggest a regularization scheme that pushes part of the RNN’s latent subspace toward a line attractor configuration that enables long shortterm memory and arbitrarily slow time scales. We show that our approach excels on a number of benchmarks like the sequential MNIST or multiplication problems, and enables reconstruction of dynamical systems which harbor widely different time scales.
Inferring dynamical systems with longrange dependencies through line attractor regularization
Dominik Schmidt^{1*}, Georgia Koppe^{1*}, Max Beutelspacher^{1,}^{2}^{2}
2
Current affiliation: Fraunhofer Institute for Manufacturing Engineering and Automation IPA, Department for Robot and Assistive Systems, Group Industrial and commercial Service robots, Stuttgart, Germany, Daniel Durstewitz^{1,3}
^{1}Department of Theoretical Neuroscience,
Central Institute of Mental Health,
Medical Faculty Mannheim,
Heidelberg University,
Mannheim, Germany
^{3}Faculty of Physics and Astronomy, Heidelberg University
^{*}These authors contributed equally
contact: {dominik.schmidt,georgia.koppe,daniel.durstewitz}@zimannheim.de
1 Introduction
Theories of complex systems in biology and physics are often formulated in terms of sets of stochastic differential or difference equations, i.e. as stochastic dynamical systems (DS). A longstanding desire is to retrieve these generating dynamical equations directly from observed time series data (Kantz & Schreiber, 2004). A variety of machine and deep learning methodologies toward this goal have been introduced in recent years (Chen et al., 2017; Champion et al., 2019; Jordan et al., 2019; Duncker et al., 2019; Ayed et al., 2019; Durstewitz, 2017; Koppe et al., 2019), many of them based on recurrent neural networks (RNN) which can universally approximate any DS (i.e., its flow field) under some mild conditions (Funahashi & Nakamura, 1993; Kimura & Nakano, 1998). However, vanilla RNN as often used in this context are well known for their problems in capturing longterm dependencies and slow time scales in the data (Hochreiter & Schmidhuber, 1997; Bengio et al., 1994). In DS terms, this is generally due to the fact that flexible information maintenance over long periods requires precise finetuning of model parameters toward ’line attractor’ configurations (Fig. 1), a concept first propagated in computational neuroscience for addressing animal performance in parametric working memory tasks (Seung, 1996; Seung et al., 2000; Durstewitz, 2003). Line attractors introduce directions of zeroflow into the model’s state space that enable longterm maintenance of arbitrary values (Fig. 1). Specially designed RNN architectures equipped with gating mechanisms and (linear) memory cells have been suggested for solving this issue (Hochreiter & Schmidhuber, 1997; Cho et al., 2014). However, from a DS perspective, simpler models that can more easily be analyzed and interpreted in DS terms, and for which more efficient inference algorithms exist that emphasize approximation of the true underlying DS would be preferable.
Recent solutions to the vanishing vs. exploding gradient problem attempt to retain the simplicity of vanilla RNN by initializing or constraining the recurrent weight matrix to be the identity (Le et al., 2015), orthogonal (Henaff et al., 2016; Helfrich et al., 2018) or unitary (Arjovsky et al., 2016). In this way, in a system including piecewise linear (PL) components like rectifiedlinear units (ReLU), line attractor dimensions are established from the start by construction or ensured throughout training by a specifically parameterized matrix decomposition. However, for many DS problems, line attractors instantiated by mere initialization procedures may be unstable and quickly dissolve during training. On the other hand, orthogonal or unitary constraints are too restrictive for reconstructing DS, and more generally from a computational perspective as well (Kerg et al., 2019): For instance, neither chaotic behavior (that requires diverging directions) nor settings with multiple isolated fixed point or limit cycle attractors are possible.
Here we therefore suggest a different solution to the problem, by pushing (but not strictly enforcing) ReLUbased, piecewiselinear RNN (PLRNN) toward line attractor configurations along some (but not all) directions in state space. We achieve this by adding special regularization terms for a subset of RNN units to the loss function that promote such a configuration. We demonstrate that our approach outperforms, or is en par with, LSTM and other, initializationbased, methods on a number of ’classical’ machine learning benchmarks (Hochreiter & Schmidhuber, 1997). More importantly, we demonstrate that while with previous methods it was difficult to capture slow behavior in a DS that exhibits widely different time scales, our new regularizationsupported inference efficiently captures all relevant time scales.
2 Related work
Longrange dependency problems in RNN. Error gradients in vanilla RNN trained by some form of gradient descent, like backpropagation through time (BPTT, Rumelhart et al. (1986)), tend to either explode or vanish due to the large product of derivative terms that results from recursive application of the chain rule over time steps (Hochreiter, 1991; Bengio et al., 1994; Hochreiter & Schmidhuber, 1997). Formally, RNN ${\bm{z}}_{t}={F}_{\bm{\theta}}({\bm{z}}_{t1},{\bm{s}}_{t})$ are discrete time dynamical systems that tend to either converge, e.g. to fixed point or limit cycle attractors, or diverge (to infinity or as in chaotic systems) over time, unless parameters of the system are precisely tuned to create directions of zeroflow in the system’s state space (Fig. 1), called line attractors (Seung, 1996; Seung et al., 2000; Durstewitz, 2003). Convergence of the RNN in general implies vanishing and global divergence exploding gradients. To address this issue, RNN with gated memory cells have been specifically designed (Hochreiter & Schmidhuber, 1997; Cho et al., 2014), but these are complicated and tedious to analyze from a DS perspective. Recently, Le et al. (2015) observed that initialization of the recurrent weight matrix $\bm{W}$ to the identity in ReLUbased RNN may yield performance en par with LSTMs on standard machine learning benchmarks. For a ReLU with activity ${\bm{z}}_{t}\ge 0$, zero bias and unit slope, this results in the identity mapping, hence a line attractor configuration. Talathi & Vartak (2016) expanded on this idea by initializing the recurrence matrix such that its largest absolute eigenvalue is 1, arguing that this would leave other directions in the system’s state space free for computations other than memory maintenance. Later work enforced orthogonal (Henaff et al., 2016; Helfrich et al., 2018; Jing et al., 2019) or unitary (Arjovsky et al., 2016) constraints on the recurrent weight matrix during training. While this appears to yield longterm memory performance superior to that of LSTMs, these networks are limited in their computational power (Kerg et al., 2019). This may be a consequence of the fact that RNN with orthogonal recurrence matrix are quite restricted in the range of dynamical phenomena they can produce, e.g. chaotic attractors are not possible since diverging eigendirections are disabled. Our approach therefore is to establish line attractors only along some but not all directions in state space, and to only push the RNN toward these configurations but not strictly enforce them, such that convergence or divergence of RNN dynamics is still possible. We furthermore implement these concepts through regularization terms in the loss functions, such that they are encouraged throughout training unlike when only established through initialization.
Dynamical systems reconstruction. From a natural science perspective, the goal of reconstructing the underlying DS fundamentally differs from building a system that ’merely’ yields good ahead predictions, as in DS reconstruction we require that the inferred model can freely reproduce (when no longer guided by the data) the underlying attractor geometries and state space properties (see section 3.5, Fig. S2; Kantz & Schreiber (2004)). Earlier work using RNN for DS identification (Roweis & Ghahramani, 2002; Yu et al., 2006) mainly focused on inferring the posterior over latent trajectories $\bm{Z}=\{{\bm{z}}_{1},\mathrm{\dots},{\bm{z}}_{T}\}$ given time series data $\bm{X}=\{{\bm{x}}_{1},\mathrm{\dots},{\bm{x}}_{T}\}$, $p(\bm{Z}\bm{X})$, and on ahead predictions (Lu et al., 2017), hence did not show that inferred models can generate the underlying attractor geometries on their own. Others (Trischler & D’Eleuterio, 2016; Brunton et al., 2016) attempt to approximate the flow field, obtained e.g. by numerical differentiation, directly through basis expansions or neural networks, but numerical derivatives are problematic for their high variance and other numerical issues (Raissi, 2018; Baydin et al., 2018; Chen et al., 2017). Some approaches assume the form of the DS equations basically to be given (Raissi, 2018; Gorbach et al., 2017) and focus on estimating the system’s latent states and parameters, rather than approximating an unknown DS based on the observed time series information alone. In many biological systems like the brain the intrinsic dynamics are highly stochastic with many noise sources, like probabilistic synaptic release (Stevens, 2003), such that models that do not explicitly account for dynamical process noise (Champion et al., 2019; Rudy et al., 2019) may be less suitable. Finally, some fully probabilistic models for DS reconstruction based on GRU (Fraccaro et al. (2016), cf. Jordan et al. (2019)), LSTM (Zheng et al., 2017), or radial basis function (Zhao & Park, 2017) networks are not easily interpretable and amenable to DS analysis. Most importantly, none of these previous approaches considers the longrange dependency problem within more easily tractable RNN for DS reconstruction.
3 Model formulation and optimization approaches
3.1 Model and preliminaries
Assume we are given two multivariate time series $\bm{S}=\{{\bm{s}}_{t}\}$ and $\bm{X}=\{{\bm{x}}_{t}\}$, one we will denote as ’inputs’ ($\bm{S}$) and the other as ’outputs’ ($\bm{X}$). We will first consider the ’classical’ (supervised) machine learning setting where we wish to map $\bm{S}$ on $\bm{X}$ through a RNN with latent state equation ${\bm{z}}_{t}={F}_{\theta}({\bm{z}}_{t1},{\bm{s}}_{t})$, as for instance in the ’addition problem’ (Hochreiter & Schmidhuber, 1997). In DS reconstruction, in contrast, we usually have a dense time series $\bm{X}$ from which we wish to infer (unsupervised) the underlying DS, where $\bm{S}$ may provide an additional forcing function or sparse experimental inputs or perturbations.
The latent RNN we consider here takes the specific form
$${\bm{z}}_{t}=\bm{A}{\bm{z}}_{t1}+\bm{W}\varphi ({\bm{z}}_{t1})+\bm{C}{\bm{s}}_{t}+\bm{h}+{\bm{\epsilon}}_{t},{\bm{\epsilon}}_{t}\sim \mathcal{N}(0,\mathbf{\Sigma}),$$  (1) 
where ${\bm{z}}_{t}\in {\mathbb{R}}^{M\times 1}$ is the hidden state (column) vector of dimension $M$, $\bm{A}\in {\mathbb{R}}^{M\times M}$ a diagonal and $\bm{W}\in {\mathbb{R}}^{M\times M}$ an offdiagonal matrix, ${\bm{s}}_{t}\in {\mathbb{R}}^{K\times 1}$ the external input of dimension $K$, $\bm{C}\in {\mathbb{R}}^{M\times K}$ the input mapping, $\bm{h}\in {\mathbb{R}}^{M\times 1}$ a bias, and ${\bm{\epsilon}}_{t}$ a Gaussian noise term with diagonal covariance matrix $\mathrm{diag}(\mathbf{\Sigma})\in {\mathbb{R}}_{+}^{M}$. The nonlinearity $\varphi (\bm{z})$ is a ReLU, $\varphi {(\bm{z})}_{i}=\mathrm{max}(0,{z}_{i}),i\in \{1,\mathrm{\dots},M\}$. This specific formulation is originally motivated by firing rate (population) models in computational neuroscience (Song et al., 2016; Durstewitz, 2017), where latent states ${\bm{z}}_{t}$ may represent membrane voltages or currents, $\bm{A}$ the neurons’ passive time constants, $\bm{W}$ the synaptic coupling among neurons, and $\varphi (\cdot )$ the voltagetorate transfer function. However, for a RNN in the form ${\bm{z}}_{t}=\bm{W}\varphi \left({\bm{z}}_{t1}\right)+\bm{h}$, note that the simple change of variables ${\bm{y}}_{t}\to {\bm{W}}^{1}({\bm{z}}_{t}\bm{h})$ will yield the more familiar form ${\bm{y}}_{t}=\varphi \left(\bm{W}{\bm{y}}_{t1}+\bm{h}\right)$ (Beer, 2006).
Besides its neuroscience motivation, note that by letting $\bm{A}=\bm{I}$, $\bm{W}=\mathrm{\U0001d7ce}$, $\bm{h}=\mathrm{\U0001d7ce}$, we get a strict line attractor system across the variables’ whole support which we conjecture will be of advantage for establishing long shortterm memory properties. Also we can solve for all of the system’s fixed points analytically by solving the equations ${\bm{z}}^{*}={\left(\bm{I}\bm{A}\bm{W}{\bm{D}}_{\mathrm{\Omega}}\right)}^{1}\bm{h}$, with ${\bm{D}}_{\mathrm{\Omega}}$ as defined in Suppl. 7.1.2, and can determine their stability from the eigenvalues of matrix $\bm{A}+\bm{W}{\bm{D}}_{\mathrm{\Omega}}$. We could do the same for limit cycles, in principle, which are fixed points of the $r$times iterated map ${F}_{\bm{\theta}}^{r}$, although practically the number of configurations to consider increases exponentially as ${2}^{M\cdot r}$. Finally, we remark that a discrete piecewiselinear system can, under certain conditions, be transformed into an equivalent continuoustime (ODE) piecewiselinear system $\dot{\bm{\zeta}}={G}_{\mathrm{\Omega}}(\bm{\zeta}(t),\bm{s}(t))$ (Suppl. 7.1.2, Ozaki (2012)), in the sense that if $\bm{\zeta}(t)={\bm{z}}_{t}$, then $\bm{\zeta}(t+\mathrm{\Delta}t)={\bm{z}}_{t+1}$ after a defined time step $\mathrm{\Delta}t$. These are among the properties that make PLRNNs more amenable to rigorous DS analysis than other RNN formulations.
We will assume that the latent RNN states ${\bm{z}}_{t}$ are coupled to the actual observations ${\bm{x}}_{t}$ through a simple observation model of the form
$${\bm{x}}_{t}=\bm{B}g({\bm{z}}_{t})+{\bm{\eta}}_{t},{\bm{\eta}}_{t}\sim \mathcal{N}(0,\mathbf{\Gamma})$$  (2) 
in the case of realvalued observations ${\bm{x}}_{t}\in {\mathbb{R}}^{N\times 1}$, where $\bm{B}\in {\mathbb{R}}^{N\times M}$ is a factor loading matrix and $\mathrm{diag}(\mathbf{\Gamma})\in {\mathbb{R}}_{+}^{N}$ the diagonal covariance matrix of the Gaussian observation noise, or
$${\widehat{p}}_{i,t}\u2254{\widehat{p}}_{t}({x}_{i,t}=1)=\left({e}^{{\bm{B}}_{i,:}{\bm{z}}_{t}}\right){\left(\sum _{j=1}^{N}{e}^{{\bm{B}}_{j,:}{\bm{z}}_{t}}\right)}^{1},$$  (3) 
in the case of multicategorical observations ${x}_{i,t}\in \{0,1\},{\sum}_{i}{x}_{i,t}=1$.
3.2 Regularization approach
We start from a similar idea as Le et al. (2015), who initialized RNN parameters such that it performs an identity mapping for ${z}_{i,t}\ge 0$. However, 1) we use a neuroscientifically motivated network architecture (eq. 1) that enables the identity mapping across the variables whole support, ${z}_{i,t}\in [\mathrm{\infty},+\mathrm{\infty}]$, 2) we encourage this mapping only for a subset ${M}_{\mathrm{reg}}\le M$ of units (Fig. S1), leaving others free to perform arbitrary computations, and 3) we stabilize this configuration throughout training by introducing a specific ${L}_{2}$ regularization for parameters $\bm{A}$, $\bm{W}$, and $\bm{h}$ in eq. 1.
That way, we divide the units into two types, where the regularized units serve as a memory that tends to decay very slowly (depending on the size of the regularization term), while the remaining units maintain the flexibility to approximate any underlying DS, yet retaining the simplicity of the original RNN model (eq. 1). Specifically, the following penalty is added to the loss function (Fig. S1):
$${\mathrm{L}}_{\mathrm{reg}}={\tau}_{A}\sum _{i=1}^{{M}_{\mathrm{reg}}}{\left({A}_{i,i}1\right)}^{2}+{\tau}_{W}\sum _{i=1}^{{M}_{\mathrm{reg}}}\sum _{\begin{array}{c}j=1\\ j\ne i\end{array}}^{M}{W}_{i,j}^{2}+{\tau}_{h}\sum _{i=1}^{{M}_{\mathrm{reg}}}{h}_{i}^{2}$$  (4) 
While this formulation allows us to trade off, for instance, the tendency toward a line attractor ($\bm{A}\to \bm{I}$, $\bm{h}\to \mathrm{\U0001d7ce}$) vs. the sensitivity to other units’ inputs ($\bm{W}\to \mathrm{\U0001d7ce}$), for all experiments performed here a common value, ${\tau}_{A}={\tau}_{W}={\tau}_{h}=\tau $, was assumed for the three regularization factors.
3.3 Optimization procedure for machine learning benchmarks
For comparability with other approaches like LSTMs (Hochreiter & Schmidhuber, 1997) or iRNN (Le et al., 2015), we will assume that the latent state dynamics eq. 1 are deterministic (i.e., $\mathbf{\Sigma}=\mathrm{\U0001d7ce}$), will take $g({\bm{z}}_{t})={\bm{z}}_{t}$ and $\mathbf{\Gamma}={\bm{I}}_{N}$ in eq. 2 (leading to an implicit Gaussian assumption with identity covariance matrix), and will use stochastic gradient descent (SGD) for training to minimize the squarederror loss across $R$ samples, $\mathcal{L}={\sum}_{n=1}^{R}{\left({\widehat{\bm{x}}}_{T}^{(n)}{\bm{x}}_{T}^{(n)}\right)}^{2}$, between estimated and actual outputs for the addition and multiplication problems, and the cross entropy loss $\mathcal{L}={\sum}_{n=1}^{R}\left({\sum}_{i=1}^{10}{x}_{i,T}^{(n)}\mathrm{log}({\widehat{p}}_{i,T}^{(n)})\right)$ for sequential MNIST, to which penalty eq. 4 was added for the regularized PLRNN (rPLRNN). We used the Adam algorithm (Kingma & Ba, 2014) from the PyTorch package (Paszke et al., 2017) with a learning rate of 0.001, a gradient clip parameter of 10, and batch size of 16. In all cases, SGD is stopped after 100 epochs and the fit with the lowest loss across all epochs is chosen.
3.4 Optimization procedure for dynamical systems reconstruction
For DS reconstruction we request that the latent RNN approximates the true generating system of equations, which is a taller order than learning the mapping $\bm{S}\to \bm{X}$ or predicting future values in a time series (cf. sect. 3.5). This point has important implications for the design of models, inference algorithms and performance metrics if the primary goal is DS reconstruction rather than ’mere’ time series forecasting. In this context we consider the fully probabilistic, generative RNN eq. 1.
Together with eq. 2 (where we take $g({\bm{z}}_{t})=\varphi ({\bm{z}}_{t}$)) this gives the typical form of a nonlinear state space model (Durbin & Koopman, 2012) with observation and process noise. We solve for the parameters $\bm{\theta}=\{\bm{A},\bm{W},\bm{C},\bm{h},\mathbf{\Sigma},\bm{B},\mathbf{\Gamma}\}$ by maximum likelihood, for which an efficient ExpectationMaximization (EM) algorithm has recently been suggested (Durstewitz, 2017; Koppe et al., 2019), which we will briefly summarize here. Since the involved integrals are not tractable, we start off from the evidencelower bound (ELBO) to the loglikelihood which can be rewritten in various useful ways:
$\mathrm{log}p(\bm{X}\bm{\theta})$  $\ge {\mathbb{E}}_{\bm{Z}\sim q}[\mathrm{log}{p}_{\bm{\theta}}(\bm{X},\bm{Z})]+H\left(q(\bm{Z}\bm{X})\right)$  
$=\mathrm{log}p(\bm{X}\bm{\theta}){D}_{\mathrm{KL}}(q(\bm{Z}\bm{X})\parallel {p}_{\bm{\theta}}(\bm{Z}\bm{X}))\u2255\mathcal{L}(\bm{\theta},q)$  (5) 
In the Estep, given a current estimate ${\bm{\theta}}^{*}$ for the parameters, we seek to determine the posterior ${p}_{\bm{\theta}}\left(\bm{Z}\bm{X}\right)$ which we approximate by a global Gaussian $q(\bm{Z}\bm{X})$ instantiated by the maximizer (mode) ${\bm{Z}}^{*}$ of ${p}_{\bm{\theta}}(\bm{Z}\bm{X})$ as an estimator of the mean, and the negative inverse Hessian around this maximizer as an estimator of the state covariance, i.e.
$\mathbb{E}[\bm{Z}\bm{X}]\approx {\bm{Z}}^{*}$  $=\underset{\bm{Z}}{\mathrm{arg}\mathrm{max}}\mathrm{log}{p}_{\bm{\theta}}(\bm{Z}\bm{X})=\underset{\bm{Z}}{\mathrm{arg}\mathrm{max}}\left[\mathrm{log}{p}_{\bm{\theta}}(\bm{X}\bm{Z})+\mathrm{log}{p}_{\bm{\theta}}(\bm{Z})\mathrm{log}{p}_{\bm{\theta}}(\bm{X})\right]$  
$=\underset{\bm{Z}}{\mathrm{arg}\mathrm{max}}\left[\mathrm{log}{p}_{\bm{\theta}}(\bm{X}\bm{Z})+\mathrm{log}{p}_{\bm{\theta}}(\bm{Z})\right],$  (6) 
since $\bm{Z}$ integrates out in ${p}_{\bm{\theta}}(\bm{X})$ (equivalently, this result can be derived from a Laplace approximation to the loglikelihood, $\mathrm{log}p(\bm{X}\bm{\theta})\approx \mathrm{log}{p}_{\bm{\theta}}(\bm{X}{\bm{Z}}^{*})+\mathrm{log}{p}_{\bm{\theta}}({\bm{Z}}^{*})\frac{1}{2}\mathrm{log}{\bm{L}}^{*}+\mathrm{const}$, where ${\bm{L}}^{*}$ is the Hessian evaluated at the maximizer). We solve this optimization problem by a fixedpoint iteration scheme that efficiently exploits the model’s piecewise linear structure (see Suppl. 7.1.3, Durstewitz (2017); Koppe et al. (2019)).
Using this approximate posterior for ${p}_{\bm{\theta}}(\bm{Z}\bm{X})$, based on the model’s piecewiselinear structure most of the expectation values ${\mathbb{E}}_{\bm{z}\sim q}\left[\varphi (\bm{z})\right]$, ${\mathbb{E}}_{\bm{z}\sim q}\left[\varphi (\bm{z}){\bm{z}}^{\u22ba}\right]$, and ${\mathbb{E}}_{\bm{z}\sim q}\left[\varphi (\bm{z})\varphi {(\bm{z})}^{\u22ba}\right]$, could be solved for (semi)analytically (where $\bm{z}$ is the concatenated vector form of $\bm{Z}$, as in Suppl. 7.1.3). In the Mstep, we seek ${\bm{\theta}}^{*}\u2254\underset{\bm{\theta}}{\mathrm{arg}\mathrm{max}}\mathcal{L}(\bm{\theta},{q}^{*})$, assuming proposal density ${q}^{*}$ to be given from the Estep, which for a Gaussian observation model amounts to a simple linear regression problem (see Suppl. eq. 23). To force the PLRNN to really capture the underlying DS in its governing equations, we use a previously suggested (Koppe et al. 2019) stepwise annealing protocol that gradually shifts the burden of fitting the observations $\bm{X}$ from the observation model eq. 2 to the latent RNN model eq. 1 during training, the idea of which is to establish a mapping from latent states $\bm{Z}$ to observations $\bm{X}$ first, fixing this, and then enforcing the temporal consistency constraints implied by eq. 1 while accounting for the actual observations.
3.5 Performance measures
Measures of prediction error. For the machine learning benchmarks we employed the same criteria as used for optimization (MSE or crossentropy, sect. 3.3) as performance metrics, evaluated across leftout test sets. In addition, we report the relative frequency ${P}_{\mathrm{correct}}$ of correctly predicted trials across the test set. A correct trial in the addition and multiplication task is defined as an absolute prediction error smaller than 0.04 (analogous to Talathi & Vartak (2016)), while a correct trial in the sequential MNIST data set is defined as one for which the largest probability ${\widehat{p}}_{{i}^{*}}=\underset{\mathit{i}}{\mathrm{max}}{\widehat{p}}_{i,T}$ indicated the correct class ${x}_{{i}^{*},T}=1.$
Agreement in attractor geometries. From a DS perspective, it is not sufficient or even sensible to judge a method’s ability to infer the underlying DS purely based on some form of (ahead)prediction error like the MSE defined on the time series itself (Ch.12 in Kantz & Schreiber (2004)). Rather, we require that the inferred model can freely reproduce (when no longer guided by the data) the underlying attractor geometries and state space properties. This is not automatically guaranteed for a model that yields agreeable ahead predictions on a time series. Vice versa, if the underlying attractor is chaotic, with a tiny bit of noise even trajectories starting from the same initial condition will quickly diverge and aheadprediction errors are not even meaningful as a performance metric (Fig. S2A). To quantify how well an inferred PLRNN captured the underlying dynamics we therefore followed Koppe et al. (2019) and used the KullbackLeibler divergence between the true and reproduced probability distributions across states in state space, thus assessing the agreement in attractor geometries (cf. Takens (1981); Sauer et al. (1991)) rather than in precise matching of time series,
$${D}_{\mathrm{KL}}({p}_{\mathrm{true}}(\bm{x})\parallel {p}_{\mathrm{gen}}(\bm{x}\bm{z}))\approx \sum _{k=1}^{K}{\widehat{p}}_{\mathrm{true}}^{(k)}(\bm{x})\mathrm{log}\left(\frac{{\widehat{p}}_{\mathrm{true}}^{(k)}(\bm{x})}{{\widehat{p}}_{\mathrm{gen}}^{(k)}(\bm{x}\bm{z})}\right),$$  (7) 
where ${p}_{\mathrm{true}}(\bm{x})$ is the true distribution of observations across state space, ${p}_{\mathrm{gen}}(\bm{x}\bm{z})$ is the distribution of observations generated by running the inferred PLRNN, and the sum indicates a spatial discretization (binning) of the observed state space (see Suppl. 7.1.4 for more details). We emphasize that ${\widehat{p}}_{\mathrm{gen}}^{(k)}(\bm{x}\bm{z})$ is obtained from freely simulated trajectories $\widehat{p}(\bm{z})$, not from the inferred posteriors $\widehat{p}(\bm{z}{\bm{x}}_{\mathrm{train}})$. In addition, to assess reproduction of time scales by the inferred PLRNN, we computed the average correlation between the power spectra of the true and generated time series.
4 Numerical experiments
4.1 Machine learning benchmarks
We compared the performance of our rPLRNN to other models on the following three benchmarks requiring long shortterm maintenance of information (as in Talathi & Vartak (2016) and Hochreiter & Schmidhuber (1997)): 1) The addition problem of time length $T$ consists of $\mathrm{100\hspace{0.17em}000}$ training and $\mathrm{10\hspace{0.17em}000}$ test samples of $2\times T$ input series $\bm{S}=\{{\bm{s}}_{1},\mathrm{\dots},{\bm{s}}_{T}\}$, where entries ${\bm{s}}_{1,:}\in [0,1]$ are drawn from a uniform random distribution and ${\bm{s}}_{2,:}\in \{0,1\}$ contains zeros except for two indicator bits placed randomly at times $$ and $$. Constraints on ${t}_{1}$ and ${t}_{2}$ are chosen such that every trial requires a long memory of at least $T/2$ time steps. At the last time step $T$, the target output of the network is the sum of the two inputs in ${\bm{s}}_{1,:}$ indicated by the 1entries in ${\bm{s}}_{2,:}$, ${\bm{x}}_{T}^{\mathrm{target}}={s}_{1,{t}_{1}}+{s}_{1,{t}_{2}}$. 2) The multiplication problem is the same as the addition problem, only that the product instead of the sum has to be produced by the RNN as an output at time $T$, ${\bm{x}}_{T}^{\mathrm{target}}={s}_{1,{t}_{1}}\cdot {s}_{1,{t}_{2}}$. 3) The MNIST dataset (LeCun & Cortes, 2010) consists of $\mathrm{60\hspace{0.17em}000}$ training and $\mathrm{10\hspace{0.17em}000}$ $28\times 28$ test images of hand written digits. To make this a time series problem, in sequential MNIST the images are presented sequentially, pixelbypixel, scanning lines from upper left to bottomright, resulting in time series of fixed length $T=784$.
On all three benchmarks we compare the performance of the rPLRNN (eq. 1) to several other models summarized in Table 1. To achieve a meaningful comparison, all models have the same number of hidden states $M$, except for the LSTM, which requires three additional parameters for each hidden state and hence has only $M/4$ hidden states, yielding the overall same number of trainable parameters as for the other models. In all cases, $M=40$, which initial numerical exploration suggested to be a good compromise between model complexity (bias) and data fit (variance) (Fig. S3).
NAME  DESCRIPTION 

RNN  Vanilla ReLU based RNN 
iRNN  RNN with initialization ${\bm{W}}_{0}=\bm{I}$ and ${\bm{h}}_{0}=\mathrm{\U0001d7ce}$ (Le et al., 2015) 
npRNN  RNN with weights initialized to a normalized positive definite matrix with largest eigenvalue of 1 and biases initialized to zero (Talathi & Vartak, 2016) 
PLRNN  PLRNN as given in eq. 1 (Koppe et al., 2019) 
iPLRNN  PLRNN with initialization ${\bm{A}}_{0}=\bm{I}$, ${\bm{W}}_{0}=\mathrm{\U0001d7ce}$ and ${\bm{h}}_{0}=\mathrm{\U0001d7ce}$ 
rPLRNN  PLRNN initialized as illustrated in Fig. S1, with additional regularization term (eq. 4) during training 
LSTM  Long ShortTerm Memory (Hochreiter & Schmidhuber, 1997) 
Fig. 2 summarizes the results for the machine learning benchmarks. As can be seen, on the addition and multiplication tasks, and in terms of either the MSE or percentage correct, our rPLRNN outperforms all other tested methods, including LSTMs. Indeed, the LSTM performs even significantly worse than the iRNN and the iPLRNN. The large error bars in Fig. 2 result from the fact that the networks mostly learn these tasks in an allornone fashion, i.e. either learn the task and succeed in almost 100 percent of the cases or fail completely. The results for the sequential MNIST problem are summarized in Fig. 2C. While in this case the LSTM outperforms all other methods, the rPLRNN is almost en par with it. In addition, the iPLRNN outperforms the iRNN.
4.2 Numerical experiments on a dynamical system with different time scales
Here our goal was to examine whether our regularization approach would also help with the identification of DS that harbor widely different time scales. By tuning systems in the vicinity of line attractors, multiple arbitrary time scales can be realized in theory (Durstewitz, 2003). To test this, we used a biophysically motivated (highly nonlinear) bursting cortical neuron model with one voltage and two conductance recovery variables (see Durstewitz (2009)), one slow and one fast (Suppl. 7.1.5). Reproduction of this DS is challenging since it produces very fast spikes on top of a slow nonlinear oscillation (Fig. 3D). Time series of standardized variables of length $T=1500$ were generated from this model and provided as observations to the rPLRNN inference algorithm. rPLRNNs with $M=\{8\mathrm{\dots}18\}$ states were estimated, with the regularization factor varied within $\tau \in \{0,{10}^{1},{10}^{2},{10}^{3},{10}^{4},{10}^{5}\}/1500$.
Fig. 3A confirms our intuition that stronger regularization leads to better DS reconstruction as assessed by the KL divergence between true and generated state distributions. This decrease in ${D}_{\mathrm{KL}}$ is accompanied by a likewise decrease in the MSE between the power spectra of true (Suppl. eq. 7.1.5) and generated (rPLRNN) voltage traces as $\tau $ increased (Fig. 3B). Fig. 3D gives an example of voltage traces and gating variables freely simulated (i.e., sampled) from the generative rPLRNN trained with $\tau =\frac{2}{3}$, illustrating that our model is in principle capable of capturing both the stiff spike dynamics and the slower oscillations in the second gating variable at the same time. Fig. 3C provides more insight into how the regularization worked: While the high frequency components ($>50\mathrm{Hz}$) related to the repetitive spiking activity hardly benefitted from increasing $\tau $, there was a strong reduction in the MSE computed on the power spectrum for the lower frequency range ($\le 50\mathrm{Hz}$), suggesting that increased regularization helps to map slowly evolving components of the dynamics.
5 Conclusions
In this work we have introduced a simple solution to the long shortterm memory problem in RNN that on the one hand retains the simplicity and tractability of vanilla RNN, yet on the other hand does not curtail the universal computational capabilities of RNN (Koiran et al., 1994; Siegelmann & Sontag, 1995) and their ability to approximate arbitrary DS (Funahashi & Nakamura, 1993; Kimura & Nakano, 1998; Trischler & D’Eleuterio, 2016). We achieved this by adding regularization terms to the loss function that encourage the system to form a ’memory subspace’, that is, line attractor dimensions (Seung, 1996; Durstewitz, 2003) which would store arbitrary values for, if unperturbed, arbitrarily long periods. At the same time we did not rigorously enforce this constraint which has important implications for capturing slow time scales in the data: It allows the RNN to slightly depart from a perfect line attractor, which has been shown to constitute a general dynamical mechanism for regulating the speed of flow and thus the learning of arbitrary time constants that are not naturally included qua RNN design (Durstewitz, 2003; 2004). This is because as we come infinitesimally close to a line attractor and thus a bifurcation in the system’s parameter space, the flow along this direction becomes arbitrarily slow until it vanishes completely in the line attractor configuration (Fig. 1). Moreover, part of the RNN’s latent space was not regularized at all, leaving the system enough degrees of freedom for realizing arbitrary computations or dynamics. We showed that the rPLRNN is en par with or outperforms initializationbased approaches and LSTMs on a number of classical benchmarks, and, more importantly, that the regularization strongly facilitates the identification of challenging DS with widely different time scales in PLRNNbased algorithms for DS reconstruction. Future work will explore a wider range of DS models and empirical data with diverse temporal and dynamical phenomena.
6 Acknowledgements
This work was funded by grants from the German Research Foundation (DFG) to DD (Du 354/101, Du 354/82 within SPP 1665) and to GK (TRR265: A06 & B08). We would like to cordially thank Dr. Zahra Monfared for her careful reading of the manuscript and her thoughtful suggestions.
References
 Arjovsky et al. (2016) Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In Proceedings of The 33rd International Conference on Machine Learning, 2016. URL http://proceedings.mlr.press/v48/arjovsky16.html.
 Ayed et al. (2019) Ibrahim Ayed, Emmanuel de Bézenac, Arthur Pajot, Julien Brajard, and Patrick Gallinari. Learning Dynamical Systems from Partial Observations. arXiv preprint, 2019. URL http://arxiv.org/abs/1902.11136.
 Baydin et al. (2018) Atılım Güneş Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 18:1–43, feb 2018. ISSN 15337928. URL http://arxiv.org/abs/1502.05767.
 Beer (2006) Randall D. Beer. Parameter space structure of continuoustime recurrent neural networks. Neural computation, 18(12):3009–51, 2006. doi: 10.1162/neco.2006.18.12.3009. URL http://www.ncbi.nlm.nih.gov/pubmed/17052157.
 Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning longterm dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994. doi: 10.1109/72.279181. URL https://doi.org/10.1109/72.279181.
 Brunton et al. (2016) Steven L. Brunton, Joshua L. Proctor, J. Nathan Kutz, and William Bialek. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences of the United States of America, 113(15):3932–3937, 2016. doi: 10.1073/pnas.1517384113. URL https://www.pnas.org/content/113/15/3932.
 Champion et al. (2019) Kathleen Champion, Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Datadriven discovery of coordinates and governing equations. arXiv preprint, 2019. URL http://arxiv.org/abs/1904.02107.
 Chen et al. (2017) Shizhe Chen, Ali Shojaie, and Daniela M. Witten. Network Reconstruction From HighDimensional Ordinary Differential Equations. Journal of the American Statistical Association, 112(520):1697–1707, 2017. doi: 10.1080/01621459.2016.1229197. URL http://doi.org/10.1080/01621459.2016.1229197.
 Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoderdecoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734. Association for Computational Linguistics (ACL), 2014.
 Duncker et al. (2019) Lea Duncker, Gergo Bohner, Julien Boussard, and Maneesh Sahani. Learning interpretable continuoustime models of latent stochastic dynamical systems. arXiv preprint, 2019. URL http://arxiv.org/abs/1902.04420.
 Durbin & Koopman (2012) James Durbin and Siem Jan Koopman. Time Series Analysis by State Space Methods. Oxford University Press, May 2012. doi: 10.1093/acprof:oso/9780199641178.001.0001. URL https://doi.org/10.1093/acprof:oso/9780199641178.001.0001.
 Durstewitz (2003) Daniel Durstewitz. SelfOrganizing Neural Integrator Predicts Interval Times. Journal of Neuroscience, 23(12):5342–5353, 2003. URL https://www.jneurosci.org/content/23/12/5342.
 Durstewitz (2004) Daniel Durstewitz. Neural representation of interval time. NeuroReport, 15(5):745–749, April 2004. doi: 10.1097/0000175620040409000001. URL https://doi.org/10.1097/0000175620040409000001.
 Durstewitz (2009) Daniel Durstewitz. Implications of synaptic biophysics for recurrent network dynamics and active memory. Neural Networks, 22(8):1189–1200, oct 2009. doi: 10.1016/j.neunet.2009.07.016. URL https://doi.org/10.1016/j.neunet.2009.07.016.
 Durstewitz (2017) Daniel Durstewitz. A State Space Approach for Piecewise‐Linear Recurrent Neural Networks for Reconstructing Nonlinear Dynamics from Neural Measurements. PLoS Computational Biology, 13(6):e1005542, 2017. doi: 10.1371/journal.pcbi.1005542. URL https://doi.org/10.1371/journal.pcbi.1005542.
 Fraccaro et al. (2016) Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. Advances in Neural Information Processing Systems, 2016.
 Funahashi & Nakamura (1993) Kenichi Funahashi and Yuichi Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6(6):801–806, 1993. doi: 10.1016/s08936080(05)80125x. URL https://doi.org/10.1016/s08936080(05)80125x.
 Gorbach et al. (2017) Nico S. Gorbach, Stefan Bauer, and Joachim M. Buhmann. Scalable variational inference for dynamical systems. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 4806–4815. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7066scalablevariationalinferencefordynamicalsystems.pdf.
 Helfrich et al. (2018) Kyle E. Helfrich, Devin Whimott, and Qiang Ye. Orthogonal recurrent neural networks with scaled Cayley transform. 35th International Conference on Machine Learning, ICML 2018, 5, 2018.
 Henaff et al. (2016) Mikael Henaff, Arthur Szlam, and Yann LeCun. Recurrent orthogonal networks and longmemory tasks. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2034–2042, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/henaff16.html.
 Hershey & Olsen (2007) John R. Hershey and Peder A. Olsen. Approximating the Kullback Leibler divergence between gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing  ICASSP ’07, volume 4, pp. IV–317–IV–320, April 2007. doi: 10.1109/ICASSP.2007.366913. URL https://doi.org/10.1109/ICASSP.2007.366913.
 Hochreiter (1991) Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–80, 1997. ISSN 08997667. doi: 10.1162/neco.1997.9.8.1735. URL http://www.ncbi.nlm.nih.gov/pubmed/9377276.
 Jing et al. (2019) Li Jing, Caglar Gulcehre, John Peurifoy, Yichen Shen, Max Tegmark, Marin Soljacic, and Yoshua Bengio. Gated Orthogonal Recurrent Units: On Learning to Forget. Neural Computation, 31(4):765–783, apr 2019. ISSN 1530888X. doi: 10.1162/neco˙a˙01174.
 Jordan et al. (2019) Ian D. Jordan, Piotr A. Sokol, and Il Memming Park. Gated recurrent units viewed through the lens of continuous time dynamical systems. arXiv preprint, abs/1906.01005, 2019. URL http://arxiv.org/abs/1906.01005.
 Kantz & Schreiber (2004) Holger Kantz and Thomas Schreiber. Nonlinear Time Series Analysis. Cambridge University Press, 2. edition, 2004. doi: 10.1017/CBO9780511755798. URL https://doi.org/10.1017/CBO9780511755798.
 Kerg et al. (2019) Giancarlo Kerg, Kyle Goyette, Maximilian P. Touzel, Gauthier Gidel, Eugene Vorontsov, Yoshua Bengio, and Guillaume Lajoie. Nonnormal Recurrent Neural Network (nnRNN): learning long time dependencies while improving expressivity with transient dynamics. arXiv preprint, 2019. URL http://arxiv.org/abs/1905.12080.
 Kimura & Nakano (1998) M. Kimura and R. Nakano. Learning dynamical systems by recurrent neural networks from orbits. Neural Networks, 11(9):1589–1599, 1998. doi: 10.1016/s08936080(98)000987. URL https://doi.org/10.1016/s08936080(98)000987.
 Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv preprint, 2014. URL http://arxiv.org/abs/1412.6980.
 Koiran et al. (1994) Pascal Koiran, Michel Cosnard, and Max H. Garzon. Computability with lowdimensional dynamical systems. Theoretical Computer Science, 132:113–128, 1994.
 Koppe et al. (2019) Georgia Koppe, Hazem Toutounji, Peter Kirsch, Stefanie Lis, and Daniel Durstewitz. Identifying nonlinear dynamical systems via generative recurrent neural networks with applications to fMRI. PLOS Computational Biology, 15(8):e1007263, 2019. doi: 10.1371/journal.pcbi.1007263. URL https://doi.org/10.1371/journal.pcbi.1007263.
 Le et al. (2015) Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv preprint, 2015. URL http://arxiv.org/abs/1504.00941.
 LeCun & Cortes (2010) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
 Lu et al. (2017) Zhixin Lu, Jaideep Pathak, Brian Hunt, Michelle Girvan, Roger Brockett, and Edward Ott. Reservoir observers: Modelfree inference of unmeasured variables in chaotic systems. Chaos, 27(4), 2017. doi: 10.1063/1.4979665.
 Ozaki (2012) Tohru Ozaki. Time Series Modeling of Neuroscience Data. CRC Press, 2012. doi: 10.1201/b11527. URL https://doi.org/10.1201/b11527.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
 Raissi (2018) Maziar Raissi. Deep hidden physics models: Deep learning of nonlinear partial differential equations. Journal of Machine Learning Research, 19:1–24, 2018. URL http://jmlr.org/papers/v19/18046.html.
 Roweis & Ghahramani (2002) Sam Roweis and Zoubin Ghahramani. Learning Nonlinear Dynamical Systems Using the ExpectationMaximization Algorithm. In Kalman Filtering and Neural Networks, chapter 6, pp. 175–220. WileyBlackwell, 2002. ISBN 9780471221548. doi: 10.1002/0471221546.ch6. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/0471221546.ch6.
 Rudy et al. (2019) Samuel H. Rudy, Steven L. Brunton, and J. Nathan Kutz. Smoothing and parameter estimation by softadherence to governing equations. Journal of Computational Physics, 398:108860, December 2019. doi: 10.1016/j.jcp.2019.108860. URL https://doi.org/10.1016/j.jcp.2019.108860.
 Rumelhart et al. (1986) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by backpropagating errors. Nature, 323:533–536, 1986. doi: 10.1038/323533a0. URL https://doi.org/10.1038/323533a0.
 Sauer et al. (1991) Tim Sauer, James A. Yorke, and Martin Casdagli. Embedology. Journal of Statistical Physics, 65(34):579–616, November 1991. doi: 10.1007/bf01053745. URL https://doi.org/10.1007/bf01053745.
 Seung (1996) H. Sebastian Seung. How the brain keeps the eyes still. Proceedings of the National Academy of Sciences, 93(23):13339–13344, 1996. doi: 10.1073/pnas.93.23.13339. URL https://doi.org/10.1073/pnas.93.23.13339.
 Seung et al. (2000) H. Sebastian Seung, Daniel D. Lee, Ben Y. Reis, and David W. Tank. Stability of the memory of eye position in a recurrent network of conductancebased model neurons. Neuron, 2000. doi: 10.1016/S08966273(00)811551. URL https://doi.org/10.1016/S08966273(00)811551.
 Siegelmann & Sontag (1995) Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, February 1995. doi: 10.1006/jcss.1995.1013. URL https://doi.org/10.1006/jcss.1995.1013.
 Song et al. (2016) H. Francis Song, Guangyu R. Yang, and Xiao Jing Wang. Training ExcitatoryInhibitory Recurrent Neural Networks for Cognitive Tasks: A Simple and Flexible Framework. PLoS Computational Biology, 12(2):1–30, 2016. doi: 10.1371/journal.pcbi.1004792. URL https://doi.org/10.1371/journal.pcbi.1004792.
 Stevens (2003) Charles F Stevens. Neurotransmitter release at central synapses. Neuron, 40(2):381–388, October 2003. doi: 10.1016/s08966273(03)006433. URL https://doi.org/10.1016/s08966273(03)006433.
 Strogatz (2015) Steven H. Strogatz. Nonlinear Dynamics and Chaos: Applications to Physics, Biology, Chemistry, and Engineering: With Applications to Physics, Biology, Chemistry and Engineering. CRC Press, 2015. doi: 10.1201/9780429492563. URL https://doi.org/10.1201/9780429492563.
 Takens (1981) Floris Takens. Detecting strange attractors in turbulence. In Lecture Notes in Mathematics, pp. 366–381. Springer Berlin Heidelberg, 1981. doi: 10.1007/bfb0091924. URL https://doi.org/10.1007/bfb0091924.
 Talathi & Vartak (2016) Sachin S. Talathi and Aniket Vartak. Improving performance of recurrent neural network with ReLU nonlinearity. ICLR Workshop submission, 2016. URL https://arxiv.org/abs/1511.03771.
 Trischler & D’Eleuterio (2016) Adam P. Trischler and Gabriele M.T. D’Eleuterio. Synthesis of recurrent neural networks for dynamical system simulation. Neural Networks, 80:67–78, 2016. doi: 10.1016/j.neunet.2016.04.001. URL http://doi.org/10.1016/j.neunet.2016.04.001.
 Yu et al. (2006) Byron M. Yu, Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, and Maneesh Sahani. Extracting dynamical structure embedded in neural activity. In Y. Weiss, B. Schölkopf, and J. C. Platt (eds.), Advances in Neural Information Processing Systems 18, pp. 1545–1552. MIT Press, 2006. URL http://papers.nips.cc/paper/2823extractingdynamicalstructureembeddedinneuralactivity.pdf.
 Zhao & Park (2017) Yuan Zhao and Il Memming Park. Variational Joint Filtering. arXiv preprint, 2017. URL http://arxiv.org/abs/1707.09049.
 Zheng et al. (2017) Xun Zheng, Manzil Zaheer, Amr Ahmed, Yuan Wang, Eric P. Xing, and Alexander J. Smola. State Space LSTM Models with Particle MCMC Inference. arXiv preprint, 2017. URL http://arxiv.org/abs/1711.11179.
7 Supplementary Material
7.1 Supplementary text
7.1.1 Simple exact PLRNN solution for addition problem
The exact PLRNN parameter settings (cf. eq. 1) for solving the addition problem with 2 units (cf. Fig. 1C) are as follows:
$$\bm{A}=\left(\begin{array}{cc}\hfill 1\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill 0\hfill \end{array}\right),\bm{W}=\left(\begin{array}{cc}\hfill 0\hfill & \hfill 1\hfill \\ \hfill 0\hfill & \hfill 0\hfill \end{array}\right),\bm{h}=\left(\begin{array}{c}\hfill 0\hfill \\ \hfill 1\hfill \end{array}\right),\bm{C}=\left(\begin{array}{cc}\hfill 0\hfill & \hfill 0\hfill \\ \hfill 1\hfill & \hfill 1\hfill \end{array}\right),\bm{B}=\left(\begin{array}{cc}\hfill 1\hfill & \hfill 0\hfill \end{array}\right)$$  (8) 
7.1.2 Conversion from discrete to continuous time PLRNN
Under some conditions we can translate the discrete into an equivalent continuous time PLRNN. Using ${\bm{D}}_{\mathrm{\Omega}(t)}$ as defined below (7.1.3) for a single time step $t$, we can rewrite (ignoring the noise term and inputs) PLRNN eq. 1 in the form
$${\bm{z}}_{t+1}=F({\bm{z}}_{t})={\bm{W}}_{\mathrm{\Omega}(t)}{\bm{z}}_{t}+\bm{h},\text{with}{\bm{W}}_{\mathrm{\Omega}(t)}:=\bm{A}+\bm{W}{\bm{D}}_{\mathrm{\Omega}(t)},$$  (9) 
where $\mathrm{\Omega}(t)\u2254\{m{z}_{m,t}>0\}$ is the set of all unit indices with activation larger 0 at time $t$. To convert this into an equivalent (in the sense defined in eq. 11) system of (piecewise) ordinary differential equations (ODE), we need to find parameters ${\stackrel{~}{\bm{W}}}_{\mathrm{\Omega}}$ and $\stackrel{~}{\bm{h}}$,
$\dot{\bm{\zeta}}=G(\bm{\zeta})={\stackrel{~}{\bm{W}}}_{\mathrm{\Omega}}\bm{\zeta}(t)+\stackrel{~}{\bm{h}},$  (10) 
such that
$${\bm{z}}_{0}=\bm{\zeta}(0)\Rightarrow {\bm{z}}_{1}=F({\bm{z}}_{0})=\bm{\zeta}(\mathrm{\Delta}t),$$  (11) 
where $\mathrm{\Delta}t$ is the time step with which the empirically observed time series $\bm{X}$ was sampled. From these conditions it follows that for each of the $s\in \{1,\mathrm{\dots},{2}^{M}\}$ subregions (orthants) defined by fixed index sets ${\mathrm{\Omega}}^{s}\subseteq \{1,\mathrm{\dots},M\}$ we must have
$$\left(\bm{A}+\bm{W}{\bm{D}}_{{\mathrm{\Omega}}^{s}}\bm{I}\right){\bm{z}}_{0}+\bm{h}={\int}_{0}^{\mathrm{\Delta}t}{\stackrel{~}{\bm{W}}}_{{\mathrm{\Omega}}^{s}}\bm{\zeta}(t)+\stackrel{~}{\bm{h}}dt,$$  (12) 
where we assume that ${\bm{D}}_{{\mathrm{\Omega}}^{s}}$ is constant for one time step, i.e. between $0$ and $\mathrm{\Delta}t$. We approach this by first solving the homogeneous system using the general ansatz for systems of linear ODEs,
$\left(\bm{A}+\bm{W}{\bm{D}}_{{\mathrm{\Omega}}^{s}}\bm{I}\right){\bm{z}}_{0}$  $\stackrel{!}{=}{\displaystyle {\int}_{0}^{\mathrm{\Delta}t}}{\stackrel{~}{\bm{W}}}_{{\mathrm{\Omega}}^{s}}{\displaystyle \sum _{k}}{c}_{k}{e}^{{\stackrel{~}{\lambda}}_{k}t}{\stackrel{~}{\bm{v}}}_{k}dt$  (13)  
$={\displaystyle \sum _{k}}{c}_{k}{\stackrel{~}{\bm{W}}}_{{\mathrm{\Omega}}^{s}}{\bm{v}}_{k}{\displaystyle {\int}_{0}^{\mathrm{\Delta}t}}{e}^{{\stackrel{~}{\lambda}}_{k}t}\mathit{d}t$  (14)  
$={\displaystyle \sum _{k}}{c}_{k}{\stackrel{~}{\lambda}}_{k}{\bm{v}}_{k}{\displaystyle \frac{1}{{\stackrel{~}{\lambda}}_{k}}}\left({e}^{{\stackrel{~}{\lambda}}_{k}\mathrm{\Delta}t}1\right)$  (15)  
$\Rightarrow {\bm{W}}_{{\mathrm{\Omega}}^{s}}{\bm{z}}_{0}$  $\stackrel{!}{=}{\displaystyle \sum _{k}}{c}_{k}{\bm{v}}_{k}{e}^{{\stackrel{~}{\lambda}}_{k}\mathrm{\Delta}t}$  (16)  
$=\bm{V}\underset{\u2254\bm{E}}{\underset{\u23df}{\left(\begin{array}{ccc}\hfill {e}^{{\stackrel{~}{\lambda}}_{1}\mathrm{\Delta}t}\hfill & \hfill \mathrm{\dots}\hfill & \hfill \mathrm{\U0001d7ce}\hfill \\ \hfill \mathrm{\vdots}\hfill & \hfill \mathrm{\ddots}\hfill & \hfill \mathrm{\vdots}\hfill \\ \hfill \mathrm{\U0001d7ce}\hfill & \hfill \mathrm{\dots}\hfill & \hfill {e}^{{\stackrel{~}{\lambda}}_{M}\mathrm{\Delta}t}\hfill \end{array}\right)}}\bm{c}$  (17)  
$\Rightarrow {\bm{W}}_{{\mathrm{\Omega}}^{s}}$  $=\bm{V}\bm{E}{\bm{V}}^{1}.$  (18) 
where we have used ${\bm{z}}_{0}={\sum}_{k}{c}_{k}{\bm{v}}_{k}$ on lines 15 and 16. Hence we can infer matrix ${\stackrel{~}{\bm{W}}}_{{\mathrm{\Omega}}^{s}}$ from the eigendecomposition of matrix ${\bm{W}}_{{\mathrm{\Omega}}^{s}}$, by letting ${\stackrel{~}{\lambda}}_{k}=\frac{1}{\mathrm{\Delta}t}\mathrm{log}{\lambda}_{k}$, where ${\lambda}_{k}$ are the eigenvalues of ${\bm{W}}_{{\mathrm{\Omega}}^{s}}$, and reassembling
${\stackrel{~}{\bm{W}}}_{{\mathrm{\Omega}}^{s}}$  $=\bm{V}{\displaystyle \frac{1}{\mathrm{\Delta}t}}\mathrm{log}(\mathbf{\Lambda}){\bm{V}}^{1}.$  (19) 
We obtain the general solution for the inhomogeneous case by requiring that for all fixed points ${\bm{z}}^{*}=F({\bm{z}}^{*})$ of the map eq. 9 we have $G({\bm{z}}^{*})=0$. Using this we obtain
$\stackrel{~}{\bm{h}}$  $={\stackrel{~}{\bm{W}}}_{{\mathrm{\Omega}}^{s}}{\left(\bm{I}{\bm{W}}_{{\mathrm{\Omega}}^{s}}\right)}^{1}\bm{h}$  (20) 
Assuming inputs ${\bm{s}}_{t}$ to be constant across time step $\mathrm{\Delta}t$, we can apply the same transformation to input matrix $\bm{C}$. Fig. S5 illustrates the discrete to continuous PLRNN conversion for a nonlinear oscillator.
Note that in the above derivations we have assumed that matrix ${\bm{W}}_{{\mathrm{\Omega}}^{s}}$ can be diagonalized with distinct eigenvalues ${\lambda}_{i}\ne {\lambda}_{j}\forall i\ne j$, for all ${\mathrm{\Omega}}^{s}$, and that all eigenvalues are nonzero (in fact, ${\bm{W}}_{{\mathrm{\Omega}}^{s}}$ should not have any negative real eigenvalues; problems also arise when solutions ${\bm{z}}_{t}$ fall exactly onto the coordinates (boundaries between regions)). In general, not every discretetime PLRNN can be converted into a continuoustime ODE system in the sense defined above. For instance, we can have chaos in a 1d nonlinear map, while we need at least a 3d ODE system to create chaos (Strogatz, 2015).
7.1.3 More details on EM algorithm
Here we briefly outline the fixedpointiteration algorithm for solving the maximization problem in eq. 6 (for more details see Durstewitz (2017); Koppe et al. (2019)). Given a Gaussian latent PLRNN and a Gaussian observation model, the joint density $p(\bm{X},\bm{Z})$ will be piecewise Gaussian, hence eq. 6 piecewise quadratic in $\bm{Z}$. Let us concatenate all state variables across $m$ and $t$ into one long column vector $\bm{z}={({z}_{1,1},\mathrm{\dots},{z}_{M,1},\mathrm{\dots},{z}_{1,T},\mathrm{\dots},{z}_{M,T})}^{\u22ba}$, arrange matrices $\bm{A}$, $\bm{W}$ into large $MT\times MT$ block tridiagonal matrices, define ${\bm{d}}_{\mathrm{\Omega}}\u2254{({\mathrm{\U0001d7cf}}_{{z}_{1,1}>0},{\mathrm{\U0001d7cf}}_{{z}_{2,1}>0},\mathrm{\dots},{\mathrm{\U0001d7cf}}_{{z}_{M,T}>0})}^{\u22ba}$ as an indicator vector with a 1 for all states ${z}_{m,t}>0$ and zeros otherwise, and ${\bm{D}}_{\mathrm{\Omega}}\u2254\mathrm{diag}({\bm{d}}_{\mathrm{\Omega}})$ as the diagonal matrix formed from this vector. Collecting all terms quadratic, linear, or constant in $\bm{z}$, we can then write down the optimization criterion in the form
${Q}_{\mathrm{\Omega}}^{*}(\bm{z})$  $={\displaystyle \frac{1}{2}}[{\bm{z}}^{\u22ba}({\bm{U}}_{0}+{\bm{D}}_{\mathrm{\Omega}}{\bm{U}}_{1}+{\bm{U}}_{1}^{\u22ba}{\bm{D}}_{\mathrm{\Omega}}+{\bm{D}}_{\mathrm{\Omega}}{\bm{U}}_{2}{\bm{D}}_{\mathrm{\Omega}})\bm{z}{\bm{z}}^{\u22ba}({\bm{v}}_{0}+{\bm{D}}_{\mathrm{\Omega}}{\bm{v}}_{1})$  
${({\bm{v}}_{0}+{\bm{D}}_{\mathrm{\Omega}}{\bm{v}}_{1})}^{\u22ba}\bm{z}]+\mathrm{const}.$  (21) 
In essence, the algorithm now iterates between the two steps:

1.
Given fixed ${\bm{D}}_{\mathrm{\Omega}}$, solve ${\bm{z}}^{*}={\left({\bm{U}}_{0}+{\bm{D}}_{\mathrm{\Omega}}{\bm{U}}_{1}+{\bm{U}}_{1}^{\u22ba}{\bm{D}}_{\mathrm{\Omega}}+{\bm{D}}_{\mathrm{\Omega}}{\bm{U}}_{2}{\bm{D}}_{\mathrm{\Omega}}\right)}^{1}\left({\bm{v}}_{0}+{\bm{D}}_{\mathrm{\Omega}}{\bm{v}}_{1}\right)$

2.
Given fixed ${\bm{z}}^{*}$, recompute ${\bm{D}}_{\mathrm{\Omega}}$
until either convergence or one of several stopping criteria (partly likelihoodbased, partly to avoid loops) is reached. The solution may afterwards be refined by one quadratic programming step. Numerical experiments showed this algorithm to be very fast and efficient (Durstewitz, 2017; Koppe et al., 2019). At ${\bm{z}}^{*}$, an estimate of the state covariance is then obtained as the inverse negative Hessian,
$$\bm{V}={\left({\bm{U}}_{0}+{\bm{D}}_{\mathrm{\Omega}}{\bm{U}}_{1}+{\bm{U}}_{1}^{\u22ba}{\bm{D}}_{\mathrm{\Omega}}+{\bm{D}}_{\mathrm{\Omega}}{\bm{U}}_{2}{\bm{D}}_{\mathrm{\Omega}}\right)}^{1}.$$  (22) 
In the Mstep, using the proposal density ${q}^{*}$ from the Estep, the solution to the maximization problem ${\bm{\theta}}^{*}\u2254\underset{\bm{\theta}}{\mathrm{arg}\mathrm{max}}\mathcal{L}(\bm{\theta},{q}^{*})$, can generally be expressed in the form
$${\bm{\theta}}^{*}=\left(\sum _{t}\mathbb{E}\left[{\bm{\alpha}}_{t}{\bm{\beta}}_{t}^{\u22ba}\right]\right){\left(\sum _{t}\mathbb{E}\left[{\bm{\beta}}_{t}{\bm{\beta}}_{t}^{\u22ba}\right]\right)}^{1},$$  (23) 
where, for the latent model, eq. 1, ${\bm{\alpha}}_{t}={\bm{z}}_{t}$ and ${\bm{\beta}}_{t}\u2254{[{\bm{z}}_{t1}^{\u22ba},\varphi {({\bm{z}}_{t1})}^{\u22ba},{\bm{s}}_{t}^{\u22ba},1]}^{\u22ba}\in {\mathbb{R}}^{2M+K+1}$, and for the observation model, eq. 2, ${\bm{\alpha}}_{t}={\bm{x}}_{t}$ and ${\bm{\beta}}_{t}=g\left({\bm{z}}_{t}\right)$.
7.1.4 More details on DS performance measure
The measure ${D}_{\mathrm{KL}}$ introduced in the main text for assessing the agreement in attractor geometries only works for situations where the ground truth ${p}_{\mathrm{true}}(\bm{X})$ is known. Following Koppe et al. (2019), here we would like to briefly indicate how a proxy for ${D}_{\mathrm{KL}}$ may be obtained in empirical situations where no ground truth is available. Reasoning that for a well reconstructed DS the inferred posterior ${p}_{\mathrm{inf}}(\bm{z}\bm{x})$ given the observations should be a good representative of the prior generative dynamics ${p}_{\mathrm{gen}}(\bm{z})$, one may use the KullbackLeibler divergence between the distribution over latent states, obtained by sampling from the prior density ${p}_{\mathrm{gen}}(\bm{z})$, and the (dataconstrained) posterior distribution ${p}_{\mathrm{inf}}(\bm{z}\bm{x})$ (where $\bm{z}\in {\mathbb{R}}^{M\times 1}$ and $\bm{x}\in {\mathbb{R}}^{N\times 1}$), taken across the system’s state space:
$${D}_{\mathrm{KL}}\left({p}_{\mathrm{inf}}(\bm{z}\bm{x})\parallel {p}_{\mathrm{gen}}(\bm{z})\right)={\int}_{\bm{z}\in {\mathbb{R}}^{M\times 1}}{p}_{\mathrm{inf}}(\bm{z}\bm{x})\mathrm{log}\frac{{p}_{\mathrm{inf}}(\bm{z}\bm{x})}{{p}_{\mathrm{gen}}(\bm{z})}d\bm{z}$$  (24) 
As evaluating this integral is difficult, one could further approximate ${p}_{\mathrm{inf}}(\bm{z}\bm{x})$ and ${p}_{\mathrm{gen}}(\bm{z})$ by Gaussian mixtures across trajectories, i.e. ${p}_{\mathrm{inf}}(\bm{z}\bm{x})\approx \frac{1}{T}{\sum}_{t=1}^{T}p({\bm{z}}_{t}{\bm{x}}_{1:T})$ and ${p}_{\mathrm{gen}}(\bm{z})\approx \frac{1}{L}{\sum}_{l=1}^{L}p({\bm{z}}_{l}{\bm{z}}_{l1})$, where the mean and covariance of $p({\bm{z}}_{t}{\bm{x}}_{1:T})$ and $p({\bm{z}}_{l}{\bm{z}}_{l1})$ are obtained by marginalizing over the multivariate distributions $p(\bm{Z}\bm{X})$ and ${p}_{\mathrm{gen}}(\bm{Z})$, respectively, yielding $\mathbb{E}[{\bm{z}}_{t}{\bm{x}}_{1:T}]$, $\mathbb{E}[{\bm{z}}_{l}{\bm{z}}_{l1}]$, and covariance matrices $\mathrm{Var}({\bm{z}}_{t}{\bm{x}}_{1:T})$ and $\mathrm{Var}({\bm{z}}_{l}{\bm{z}}_{l1})$. Supplementary eq. 24 may then be numerically approximated through Monte Carlo sampling (Hershey & Olsen, 2007) by
$${D}_{\mathrm{KL}}\left({p}_{\mathrm{inf}}(\bm{z}\bm{x})\parallel {p}_{\mathrm{gen}}(\bm{z})\right)\approx \frac{1}{n}\sum _{i=1}^{n}\mathrm{log}\frac{{p}_{\mathrm{inf}}({\bm{z}}^{(i)}\bm{x})}{{p}_{\mathrm{gen}}({\bm{z}}^{(i)})},{\bm{z}}^{(i)}\sim {p}_{\mathrm{inf}}(\bm{z}\bm{x})$$  (25) 
For highdimensional state spaces, for which MC sampling becomes challenging, there is luckily a variational approximation of eq. 24 available (Hershey & Olsen, 2007):
$${D}_{\mathrm{KL}}^{\mathrm{variational}}\left({p}_{\mathrm{inf}}(\bm{z}\bm{x})\parallel {p}_{\mathrm{gen}}(\bm{z})\right)\approx \frac{1}{T}\sum _{t=1}^{T}\mathrm{log}\frac{{\sum}_{j=1}^{T}{e}^{{D}_{\mathrm{KL}}(p({\bm{z}}_{t}{\bm{x}}_{1:T})\parallel p({\bm{z}}_{j}{\bm{x}}_{1:T}))}}{{\sum}_{k=1}^{T}{e}^{{D}_{\mathrm{KL}}(p({\bm{z}}_{t}{\bm{x}}_{1:T})\parallel p({\bm{z}}_{k}{\bm{z}}_{k1}))}},$$  (26) 
where the KL divergences in the exponentials are among Gaussians for which we have an analytical expression.
7.1.5 More details on single neuron model
The neuron model used in section 4.2 is described by
${C}_{m}\dot{V}$  $={g}_{L}(V{E}_{L})+{g}_{Na}{m}_{\mathrm{\infty}}(V)(V{E}_{Na})+{g}_{K}n(V{E}_{K})$  
$+{g}_{M}h(V{E}_{K})+{g}_{NMDA}\sigma (V)(V{E}_{NMDA})$  (27)  
$\dot{h}$  $={\displaystyle \frac{{h}_{\mathrm{\infty}}(V)h}{{\tau}_{h}}}$  (28)  
$\dot{n}$  $={\displaystyle \frac{{n}_{\mathrm{\infty}}(V)n}{{\tau}_{n}}}$  (29)  
$\sigma (V)$  $={\left[1+.33{e}^{.0625V}\right]}^{1}$  (30) 
where ${C}_{m}$ refers to the neuron’s membrane capacitance, the ${g}_{\bullet}$ to different membrane conductances, ${E}_{\bullet}$ to the respective reversal potentials, and $m$, $h$, and $n$ are gating variables with limiting values given by
$\mathrm{\{}{m}_{\mathrm{\infty}},{n}_{\mathrm{\infty}},{h}_{\mathrm{\infty}}\}$  $={\left[1+{e}^{(\{{V}_{hNa},{V}_{hK},{V}_{hM}\}V)/\{{k}_{Na},{k}_{K},{k}_{M}\}}\right]}^{1}$  (31) 
Different parameter settings in this model lead to different dynamical phenomena, including regular spiking, slow bursting or chaos (see Durstewitz (2009) for details). Parameter settings used here were: ${C}_{m}=6\mathrm{\mu}\mathrm{F}$, ${g}_{L}=8\mathrm{mS}$, ${E}_{L}=80\mathrm{mV}$, ${g}_{Na}=20\mathrm{mS}$, ${E}_{Na}=60\mathrm{mV}$, ${V}_{hNa}=20\mathrm{mV}$, ${k}_{Na}=15$, ${g}_{K}=10\mathrm{mS}$, ${E}_{K}=90\mathrm{mV}$, ${V}_{hK}=25\mathrm{mV}$, ${k}_{K}=5$, ${\tau}_{n}=1\mathrm{ms}$, ${g}_{M}=25\mathrm{mS}$, ${V}_{hM}=15\mathrm{mV}$, ${k}_{M}=5$, ${\tau}_{h}=200\mathrm{ms}$, ${g}_{NMDA}=10.2\mathrm{mS}$.