Reinforcement Learning with Low-Complexity Liquid State Machines
Abstract
We propose reinforcement learning on simple networks consisting of random connections of spiking neurons (both recurrent and feedforward) that can learn complex tasks with very few trainable parameters. Such sparse and randomly interconnected recurrent spiking networks exhibit highly nonlinear dynamics that transform the inputs into rich high-dimensional representations based on past context. The random input representations can be efficiently interpreted by an output (or readout) layer with trainable parameters. Systematic initialization of the random connections and training of the readout layer using the Q-learning algorithm enable such small random spiking networks to learn optimally and achieve the same learning efficiency as humans on complex reinforcement learning tasks like Atari games. The spike-based approach using small random recurrent networks provides a computationally efficient alternative to state-of-the-art deep reinforcement learning networks with several layers of trainable parameters. The low-complexity spiking networks can lead to improved energy efficiency in event-driven neuromorphic hardware for complex reinforcement learning tasks.
1 Introduction
A high degree of recurrent connectivity among neuronal populations is a key attribute of neural microcircuits in the cerebral cortex and many different brain regions [1, 2, 3]. Such common structure suggests the existence of a general principle for information processing. However, the principle underlying information processing in such recurrent populations of spiking neurons is still largely elusive due to the complexity of training large recurrent Spiking Neural Networks (SNNs). In this regard, reservoir computing architectures [4, 5, 6] were proposed to minimize the training complexity of large recurrent neuronal populations. The Liquid State Machine (LSM) [4, 5] is a recurrent SNN consisting of an input layer sparsely connected to a randomly interlinked reservoir (or liquid) of excitatory and inhibitory spiking neurons whose activations are passed on to a readout (or output) layer, trained using supervised algorithms, for inference. The key attribute of an LSM is that the input-to-liquid and the recurrent excitatory $\leftrightarrow$ inhibitory synaptic connectivity matrices and weights are fixed a priori. The LSM effectively utilizes the rich nonlinear dynamics of Leaky-Integrate-and-Fire spiking neurons [7] and the sparse random input-to-liquid and recurrent-liquid synaptic connectivity for processing spatiotemporal inputs. At any time instant, the spatiotemporal inputs are transformed into a high-dimensional representation, referred to as the liquid states (or spike patterns), which evolves dynamically based on decaying memory of the past inputs. The memory capacity of the liquid is dictated by its size and degree of recurrent connectivity.
Although the LSM, by construction, does not have stable instantaneous internal states like Turing machines [8] or attractor neural networks [9], prior studies have successfully trained the readout layer using liquid activations, estimated by integrating the liquid states (spikes) over time, for speech recognition [4, 10, 11, 12], image recognition [13], gesture recognition [14, 15], and sequence generation tasks [16, 17, 18].
In this work, we propose such sparse randomly-interlinked low-complexity LSMs for solving complex Reinforcement Learning (RL) tasks, which involve an autonomous agent (modeled using the LSM) trained to select actions in a manner that maximizes the expected future rewards received from the environment. For instance, a robot (agent) learning to navigate a maze (environment) based on the reward and punishment received from the environment is an example RL task. At any given time, the environment state (converted to spike trains) is fed to the liquid, which produces a high-dimensional liquid state (spike pattern) based on decaying memory of the past environment states. We present an optimal initialization strategy for the fixed input-to-liquid and recurrent-liquid synaptic connectivity matrices and weights to enable the liquid to produce high-dimensional representations that lead to efficient training of the liquid-to-readout weights. Artificial rate-based neurons in the readout layer take the liquid activations and produce action-values to guide action selection for a given environment state. The liquid-to-readout weights are trained using the Q-learning RL algorithm proposed for deep learning networks [19]. In RL theory [20], the Q-value, also known as the action-value, estimates the expected future rewards for a state-action pair and thus specifies how good the action is for the current environment state. The readout layer of the LSM contains as many neurons as the number of possible actions for a particular RL task. At any given time, the readout neurons predict the Q-values for all possible actions based on the high-dimensional state representation provided by the liquid. The liquid-to-readout weights are then trained using backpropagation [21] to minimize the error between the Q-values predicted by the LSM and the target Q-values estimated from RL theory [22] as described in subsection 3.2.
We adopt the $\epsilon$-greedy policy (explained in subsection 3.2) to select the appropriate action based on the predicted Q-values during training and evaluation. Under the $\epsilon$-greedy policy, many random actions are picked at the beginning of the training phase to better explore the environment. Towards the end of training and during inference, the action corresponding to the maximum Q-value is selected with higher probability to exploit the learnt experiences. We first demonstrate results for training the readout weights based on the high-dimensional representations provided by the liquid, as a result of the sparse recurrent-liquid connectivity, on the simple Cartpole-balancing RL task [20]. We then comprehensively validate the capability of the LSM and the presented training methodology on complex RL tasks like Pacman [23] and Atari games [24]. We note that the LSM has been previously trained using Q-learning for RL tasks pertaining to robotic motion control [25, 26, 27]. We demonstrate and benchmark the efficacy of an appropriately initialized LSM for solving RL tasks commonly used to evaluate deep reinforcement learning networks. In essence, this work provides a promising step towards incorporating bio-plausible low-complexity recurrent SNNs like LSMs for complex RL tasks, which can potentially lead to much improved energy efficiency in event-driven asynchronous neuromorphic hardware implementations [28, 29].
3 Materials and Methods
3.1 Liquid State Machine: Architecture and Initialization
The Liquid State Machine (LSM) consists of an input layer sparsely connected via fixed synaptic weights to a randomly interlinked liquid of excitatory and inhibitory spiking neurons followed by a readout layer as depicted in Figure 1. The input layer (denoted by $P$) is modeled as a group of excitatory neurons that spike based on the input environment state following a Poisson process. The sparse input-to-liquid connections are initialized such that each excitatory neuron in the liquid receives synaptic connections from approximately $K$ random input neurons. This guarantees uniform excitation of the liquid-excitatory neurons by the external input spikes. The fixed input-to-liquid synaptic weights are chosen from a uniform distribution between $0$ and $\alpha$ as shown in Table 3, where $\alpha$ is the maximum bound imposed on the weights. The liquid consists of excitatory neurons (denoted by $E$) and inhibitory neurons (denoted by $I$) recurrently connected in a sparse random manner as illustrated in Figure 1. The number of excitatory neurons is chosen to be $4\times$ the number of inhibitory neurons, as observed in the cortical circuits [30]. We use the Leaky-Integrate-and-Fire (LIF) model [7] to mimic the dynamics of both excitatory and inhibitory spiking neurons as described by the following differential equations:
$$\frac{dV_{i}}{dt}=\frac{V_{rest}-V_{i}}{\tau}+I_{i}(t)$$  (1)
$$I_{i}(t)=\sum_{l\in N_{P}}W_{li}\cdot \delta (t-t_{l})+\sum_{j\in N_{E}}W_{ji}\cdot \delta (t-t_{j})-\sum_{k\in N_{I}}W_{ki}\cdot \delta (t-t_{k})$$  (2)
where $V_{i}$ is the membrane potential of the $i$-th neuron in the liquid, $V_{rest}$ is the resting potential to which $V_{i}$ decays, with time constant $\tau$, in the absence of input current, $I_{i}(t)$ is the instantaneous current projecting into the $i$-th neuron, and $N_{P}$, $N_{E}$, and $N_{I}$ are the number of input, excitatory, and inhibitory neurons, respectively. The instantaneous current is a sum of three terms: current from input neurons, current from excitatory neurons, and current from inhibitory neurons. The first term integrates the sum of presynaptic spikes, denoted by $\delta (t-t_{l})$ where $t_{l}$ is the time instant of pre-spikes, with the corresponding synaptic weights ($W_{li}$ in Equation 2). Likewise, the second (third) term integrates the sum of presynaptic spikes from the excitatory (inhibitory) neurons, denoted by $\delta (t-t_{j})$ ($\delta (t-t_{k})$), with the respective weights $W_{ji}$ ($W_{ki}$) in Equation 2. The neuronal membrane potential is updated with the sum of the input, excitatory, and negative inhibitory currents as shown in Equation 1. When the membrane potential reaches a certain threshold $V_{thres}$, the neuron fires an output spike. The membrane potential is thereafter reset to $V_{reset}$ and the neuron is restrained from spiking for an ensuing refractory period by holding its membrane potential constant. The LIF model parameters for the excitatory and inhibitory neurons are listed in Table 4.
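As an illustration of Equations 1 and 2, the following is a minimal forward-Euler sketch of a single LIF neuron. It is not the authors' simulator: the function name `simulate_lif` and all default parameter values (`tau`, `v_thres`, `refractory_steps`, and so on) are placeholders chosen for this example, not the values from Table 4. Each entry of `input_current` is assumed to be the signed sum of synaptic weights of the spikes arriving in that timestep, which is what integrating the delta functions in Equation 2 over one step yields.

```python
import numpy as np

def simulate_lif(input_current, v_rest=0.0, v_reset=0.0, v_thres=1.0,
                 tau=20.0, dt=1.0, refractory_steps=2):
    """Forward-Euler simulation of one LIF neuron.

    input_current[t] is the net weighted spike input in timestep t
    (input + excitatory - inhibitory). All defaults are illustrative.
    Returns (membrane potentials, boolean spike train).
    """
    v = v_rest
    refractory_left = 0
    potentials, spikes = [], []
    for current in input_current:
        if refractory_left > 0:
            # Refractory period: hold the membrane potential constant.
            refractory_left -= 1
            fired = False
        else:
            # Leak towards v_rest (Equation 1) plus the spike-driven jump.
            v += dt * (v_rest - v) / tau + current
            fired = v >= v_thres
            if fired:
                v = v_reset
                refractory_left = refractory_steps
        potentials.append(v)
        spikes.append(fired)
    return np.array(potentials), np.array(spikes, dtype=bool)
```

With a constant drive, the neuron charges, fires, is held at `v_reset` for the refractory steps, and then repeats the cycle; with zero drive it never leaves `v_rest`.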
There are four types of recurrent synaptic connections in the liquid, namely, $E\to E$, $E\to I$, $I\to E$, and $I\to I$. We express each connection in the form of a matrix that is initialized to be sparse and random, which causes the spiking dynamics of a particular neuron to be independent of most other neurons and maintains separability in the neuronal spiking activity. However, the degree of sparsity needs to be tuned to achieve rich network dynamics. We find that excessive sparsity (reduced connectivity) leads to weakened interaction between the liquid neurons and renders the liquid memoryless. On the contrary, lower sparsity (increased connectivity) results in chaotic spiking activity, which eliminates the separability in neuronal spiking activity. We initialize the connectivity matrices such that each excitatory neuron receives approximately $C$ synaptic connections from inhibitory neurons, and vice versa. The hyperparameter $C$ is tuned empirically as discussed in subsection 4.1 to avoid common chaotic spiking activity problems that occur when (1) excitatory neurons connect to each other and form a loop that always leads to positive drift in membrane potential, and when (2) an excitatory neuron connects to itself and repeatedly gets excited from its own activity. Specifically, for the first situation, we have non-zero elements in the connectivity matrix $E\to E$ (denoted by $W_{EE}$) only at locations where elements in the product of connectivity matrices $E\to I$ and $I\to E$ (denoted by $W_{EI}$ and $W_{IE}$, respectively) are non-zero. This ensures that excitatory synaptic connections are created only for those neurons that also receive inhibitory synaptic connections, which mitigates the possibility of continuous positive drift in the respective membrane potentials. To circumvent the second situation, we force the diagonal elements of $W_{EE}$ to be zero and eliminate the possibility of repeated self-excitation.
Throughout this work, we create a recurrent connectivity matrix for a liquid with $m$ excitatory neurons and $n$ inhibitory neurons by forming an $m\times n$ matrix whose values are randomly drawn from a uniform distribution between $0$ and $1$. A connection is formed between those pairs of neurons where the corresponding matrix entries are less than the target connection probability ($=C/m$). For illustration, consider a liquid with $m=1000$ excitatory and $n=250$ inhibitory neurons. In order to create the $E\to I$ connectivity matrix such that each inhibitory neuron receives a synaptic connection from a single excitatory neuron on average ($C=1$), we first form a $1000\times 250$ random matrix whose values are drawn from a uniform distribution between $0$ and $1$. We then create a connection between those pairs of neurons where the matrix entries are less than 0.1% (1/1000). A similar process is repeated for the connection $I\to E$. We then initialize the connection $E\to E$ based on the product of $W_{EI}$ and $W_{IE}$. Similarly, the connectivity matrix for $I\to I$ (denoted by $W_{II}$) is initialized based on the product of $W_{IE}$ and $W_{EI}$. The connection weights are initialized from a uniform distribution between $0$ and $\beta$ as shown in Table 3 for the different recurrent connectivity matrices. Note that the weights of the synaptic connections from inhibitory neurons are greater than those of the synaptic connections from excitatory neurons to account for the lower number of inhibitory neurons relative to excitatory neurons. Stronger inhibitory connection weights help ensure that every neuron receives a similar amount of excitatory and inhibitory input current, which improves the stability of the liquid as experimentally validated in subsection 4.1.
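The initialization procedure described above can be sketched in NumPy as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, masks are stored with rows as presynaptic and columns as postsynaptic neurons, and the $I\to E$ connection probability is assumed to be $C/n$ so that each excitatory neuron receives about $C$ inhibitory connections (the text only states the $E\to I$ case explicitly).

```python
import numpy as np

def init_liquid_connectivity(m, n, c, rng):
    """Boolean masks (EE, EI, IE, II) for m excitatory, n inhibitory neurons.

    mask[i, j] is True when presynaptic neuron i projects to neuron j.
    """
    # E->I: threshold a uniform random m x n matrix at probability c/m,
    # so each inhibitory neuron receives about c excitatory connections.
    ei = rng.uniform(size=(m, n)) < c / m
    # I->E: same process; probability c/n is an assumption of this sketch.
    ie = rng.uniform(size=(n, m)) < c / n
    # E->E only where the product of the E->I and I->E masks is non-zero.
    ee = (ei.astype(int) @ ie.astype(int)) > 0
    np.fill_diagonal(ee, False)  # no self-excitation
    # I->I from the product of the I->E and E->I masks.
    ii = (ie.astype(int) @ ei.astype(int)) > 0
    return ee, ei, ie, ii

def init_weights(mask, beta, rng):
    """Uniform weights in [0, beta) on connected pairs, zero elsewhere."""
    return mask * rng.uniform(0.0, beta, size=mask.shape)
```

For example, `init_liquid_connectivity(1000, 250, 1, np.random.default_rng(0))` reproduces the $m=1000$, $n=250$, $C=1$ illustration, where the $E\to I$ threshold is $1/1000$.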
The liquid-excitatory neurons are fully-connected to artificial rate-based neurons in the readout layer for inference. The readout layer, which consists of as many output neurons as the number of actions for a given RL task, uses the average firing rate/activation of the excitatory neurons to predict the Q-value for every state-action pair. We translate the liquid spiking activity to an average rate by accumulating the excitatory neuronal spikes over the time period for which the input (current environment state) is presented. We then normalize the spike counts with the maximum possible spike count over the LSM-simulation period, which is computed as the LSM-simulation period divided by the simulation timestep, to obtain the average firing rates of the excitatory neurons that are fed to the readout layer. Since the number of excitatory neurons is larger than the number of output neurons in the readout layer, we gradually reduce the dimension by introducing an additional fully-connected hidden layer between the liquid and the output layer. We use ReLU nonlinearity [31] after the first hidden layer but none after the final output layer since the Q-values are unbounded and can assume positive or negative values. We train the synaptic weights constituting the fully-connected readout layer using the Q-learning based training methodology that is described in the following subsection 3.2.
3.2 Q-Learning Based LSM Training Methodology
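To make the readout path concrete, here is a small sketch of the spike-count normalization and the two-layer readout described above. The function names and the weight arguments are hypothetical; the spike array is assumed to have one row per LSM-simulation timestep, so the number of rows equals the maximum possible spike count per neuron.

```python
import numpy as np

def liquid_activation(spikes):
    """Average firing rate per excitatory neuron.

    spikes: boolean array of shape (n_timesteps, n_excitatory).
    The maximum possible spike count is one spike per timestep.
    """
    return spikes.sum(axis=0) / spikes.shape[0]

def readout_q_values(activation, w_hidden, b_hidden, w_out, b_out):
    """Hidden layer with ReLU, then a linear output layer (one Q-value per action)."""
    hidden = np.maximum(0.0, activation @ w_hidden + b_hidden)
    # No nonlinearity on the output: Q-values may be negative or unbounded.
    return hidden @ w_out + b_out
```

The output has one entry per action, and leaving the last layer linear is what allows negative Q-values to be represented.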
Reinforcement Learning (RL) tasks fundamentally involve an agent (for instance, a robot) that is trained to navigate a certain environment (for instance, a maze) in a manner that maximizes the total rewards in the future. Formally, at any time instant $t$, the agent receives the environment state ${s}_{t}$ and picks action ${a}_{t}$ from the set of all possible actions. After the environment receives the action ${a}_{t}$, it transitions to the next state based on the chosen action and feeds back an immediate reward ${r}_{t+1}$ and the new environment state ${s}_{t+1}$. As mentioned in the beginning, the goal of the agent is to maximize the accumulated reward in the future, which is mathematically expressed as
$${R}_{t}=\sum _{t=1}^{\mathrm{\infty}}{\gamma}^{t}{r}_{t}$$  (3) 
where $\gamma \in [0,1]$ is the discount factor that determines the relative significance attributed to immediate and future rewards. If $\gamma$ is chosen to be $0$, the agent maximizes only the immediate reward. However, as $\gamma$ approaches unity, the agent learns to maximize the accumulated reward in the future. Q-learning [22] is a widely used RL algorithm that enables the agent to achieve this objective by computing the state-action value function (commonly known as the Q-function), which is the expected future reward for a state-action pair that is specified by
$$Q_{\pi}(s,a)=\mathrm{E}[R_{t}\mid s_{t}=s,a_{t}=a,\pi ]$$  (4)
where $Q_{\pi}(s,a)$ measures the value of choosing an action $a$ when in state $s$ following a policy $\pi$. If the agent follows the optimal policy (denoted by $\pi_{\ast}$) such that $Q_{\pi_{\ast}}(s,a)=\underset{\pi}{\mathrm{max}}\,Q_{\pi}(s,a)$, the Q-function can be estimated recursively using the Bellman optimality equation that is described by
$$Q_{\pi_{\ast}}(s,a)=\mathrm{E}[r_{t+1}+\gamma \underset{a_{t+1}}{\mathrm{max}}\,Q_{\pi_{\ast}}(s_{t+1},a_{t+1})\mid s,a]$$  (5)
where $Q_{\pi_{\ast}}(s,a)$ is the Q-value for choosing action $a$ from state $s$ following the optimal policy $\pi_{\ast}$, $r_{t+1}$ is the immediate reward received from the environment, and $Q_{\pi_{\ast}}(s_{t+1},a_{t+1})$ is the Q-value for selecting action $a_{t+1}$ from the next environment state $s_{t+1}$. Learning the Q-values for all possible state-action pairs is intractable for practical RL applications. Popular approaches approximate the Q-function using deep convolutional neural networks [19, 32, 33, 34].
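For intuition about the Bellman optimality equation (Equation 5), the sketch below applies it repeatedly to a tiny tabular problem where the expectation is trivial because transitions are deterministic. This is a toy illustration only: the two-state environment, the function name `q_value_iteration`, and the tabular representation are assumptions of this example and are not part of the method in this work, which approximates the Q-function with an LSM precisely because tables do not scale.

```python
import numpy as np

def q_value_iteration(next_state, reward, gamma, n_iters=200):
    """Iterate Q(s, a) <- r(s, a) + gamma * max_a' Q(s', a').

    next_state[s, a] and reward[s, a] define a deterministic toy MDP.
    """
    q = np.zeros(reward.shape)
    for _ in range(n_iters):
        # q[next_state] has shape (S, A, A); the max runs over next actions.
        q = reward + gamma * q[next_state].max(axis=-1)
    return q

# Hypothetical 2-state, 2-action environment.
next_state = np.array([[0, 1],    # state 0: action 0 stays, action 1 moves to state 1
                       [1, 0]])   # state 1: action 0 stays, action 1 moves to state 0
reward = np.array([[0.0, 1.0],
                   [2.0, 0.0]])
q_star = q_value_iteration(next_state, reward, gamma=0.5)
```

Here staying in state 1 yields reward 2 forever, so its value is $2/(1-0.5)=4$, and the other entries follow from one application of Equation 5 each.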
In this work, we model the agent using an LSM, wherein the liquid-to-readout weights are trained to approximate the Q-function as described below. At any time instant $t$, we map the current environment state vector $s_{t}$ to input neurons firing at a rate constrained between $0$ and $\varphi$ $\mathrm{Hz}$ over a certain time period (denoted by $T_{LSM}$) following a Poisson process. The maximum Poisson firing rate $\varphi$ is tuned to ensure sufficient input spiking activity for a given RL task. We follow the method outlined in [35] to generate the Poisson spike trains as explained below. For a particular input neuron in the state vector, we first compute the probability of generating a spike at every LSM-simulation timestep based on the corresponding Poisson firing rate. Note that the timesteps in the RL task are orthogonal to the timesteps used for the numerical simulation of the liquid. Specifically, in between successive timesteps $t$ and $t+1$ in the RL task, the liquid is simulated for a time period of $T_{LSM}$ with $1$ $\mathrm{ms}$ separation between consecutive LSM-simulation timesteps. The probability of producing a spike at any LSM-simulation timestep is obtained by dividing the corresponding firing rate (in $\mathrm{Hz}$) by $1,000$, since each timestep spans $1$ $\mathrm{ms}$. We generate a random number drawn from a uniform distribution between $0$ and $1$, and produce a spike if the random number is less than the neuronal spiking probability. At every LSM-simulation timestep, we feed the spike map of the current environment state and record the spiking outputs of the liquid-excitatory neurons. We accumulate the excitatory neuronal spikes and normalize the individual neuronal spike counts with the maximum possible spike count over the LSM-simulation period to obtain the high-dimensional representation (activation) of the environment state as discussed in the previous subsection 3.1.
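The Poisson spike generation described above can be sketched as follows. The function name and argument names are hypothetical; the only modelling choice beyond the text is exposing the timestep as a parameter (`dt_ms`, defaulting to the 1 ms used here).

```python
import numpy as np

def poisson_spike_train(rates_hz, t_lsm_ms, rng, dt_ms=1.0):
    """Boolean spike map of shape (n_timesteps, n_input_neurons).

    Each neuron spikes in a timestep with probability rate * dt,
    i.e. rate_hz / 1000 for a 1 ms timestep.
    """
    rates_hz = np.asarray(rates_hz, dtype=float)
    n_steps = int(round(t_lsm_ms / dt_ms))
    p_spike = rates_hz * dt_ms / 1000.0
    # Spike wherever a uniform random draw falls below the spiking probability.
    return rng.uniform(size=(n_steps, rates_hz.size)) < p_spike
```

A neuron driven at 100 Hz therefore spikes in roughly 10% of the 1 ms timesteps, while a neuron at 0 Hz never spikes.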
It is important to note that appropriate initialization of the LSM (detailed in subsection 3.1) is necessary to obtain a useful high-dimensional representation for efficient training of the liquid-to-readout weights as experimentally validated in section 4.
The high-dimensional liquid activations are fed to the readout layer that is trained using backpropagation to approximate the Q-function by minimizing the mean square error between the Q-values predicted by the readout layer and the target Q-values following [19] as described by the following equations:
$$\theta_{t+1}=\theta_{t}+\eta \left(Y_{t}-Q(s_{t},a_{t}\mid \theta_{t})\right)\nabla_{\theta_{t}}Q(s_{t},a_{t}\mid \theta_{t})$$  (6)
$$Y_{t}=r_{t+1}+\gamma \underset{a_{t+1}}{\mathrm{max}}\,Q(s_{t+1},a_{t+1}\mid \theta_{t})$$  (7)
where $\theta_{t+1}$ and $\theta_{t}$ are the updated and previous synaptic weights in the readout layer, respectively, $\eta$ is the learning rate, $Q(s_{t},a_{t}\mid \theta_{t})$ is the vector representing the Q-values predicted by the readout layer for all possible actions given the current environment state $s_{t}$ using the previous readout weights, $\nabla_{\theta_{t}}Q(s_{t},a_{t}\mid \theta_{t})$ is the gradient of the Q-values with respect to the readout weights, and $Y_{t}$ is the vector containing the target Q-values that is obtained by feeding the next environment state $s_{t+1}$ to the LSM while using the previous readout weights. To encourage exploration during training, we follow the $\epsilon$-greedy policy [36] for selecting the actions based on the Q-values predicted by the LSM. Under the $\epsilon$-greedy policy, we select a random action with probability $\epsilon$ and the optimal action, i.e., the action pertaining to the highest Q-value, with probability $(1-\epsilon)$ during training. Initially, $\epsilon$ is set to a large value (closer to unity), thereby permitting the agent to pick many random actions and effectively explore the environment. As training progresses, $\epsilon$ gradually decays to a small value, thereby allowing the agent to exploit its past experiences. During evaluation, we similarly follow the $\epsilon$-greedy policy albeit with a much smaller $\epsilon$ so that there is a strong bias towards exploitation. Employing the $\epsilon$-greedy policy during evaluation also serves to mitigate the negative impact of overfitting or underfitting. In an effort to further improve stability during training and achieve better generalization performance, we use the experience replay technique proposed by [19]. Based on experience replay, we store the experience discovered at each timestep (i.e., $s_{t}$, $a_{t}$, $r_{t}$, and $s_{t+1}$) in a large table and later train the LSM by sampling mini-batches of experiences in a random manner over multiple training epochs, leading to improved generalization performance. For all the experiments reported in this work, we use the RMSProp algorithm [37] as the optimizer for error backpropagation with a mini-batch size of $32$. We adopt the $\epsilon$-greedy policy, wherein $\epsilon$ gradually decays from $1$ to a final value between $0.001$ and $0.1$ over the first $10\%$ of the training steps. The replay memory stores one million recently played frames, which are then used for mini-batch weight updates that are carried out after the initial $100$ training steps. The simulation parameters for Q-learning are summarized in Table 5.
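The training rule in Equations 6 and 7 and the $\epsilon$-greedy policy can be sketched as below. Several simplifications are assumptions of this example and not the setup used in this work: the readout is a single linear layer, $Q(s,a\mid\theta)=\theta_{a}\cdot x$, where $x$ is the liquid activation; the update is plain gradient descent on one transition instead of RMSProp on mini-batches from replay memory; the decay of $\epsilon$ is taken to be linear; and terminal states are not handled. All function names are hypothetical.

```python
import numpy as np

def epsilon_at(step, total_steps, eps_start=1.0, eps_end=0.1, decay_fraction=0.1):
    """Anneal epsilon over the first decay_fraction of training (linear shape assumed)."""
    progress = min(step / (decay_fraction * total_steps), 1.0)
    return eps_start + progress * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the highest-Q action."""
    if rng.uniform() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def q_learning_step(theta, x, action, reward, x_next, gamma, eta):
    """One update of Equations 6 and 7 for a linear readout.

    theta has shape (n_actions, n_features); x and x_next are the liquid
    activations for the current and next environment state.
    """
    target = reward + gamma * np.max(theta @ x_next)  # Y_t, Equation 7
    td_error = target - theta[action] @ x             # Y_t - Q(s_t, a_t | theta_t)
    theta = theta.copy()
    # For a linear readout, the gradient of Q(s_t, a_t) w.r.t. theta[action] is x.
    theta[action] += eta * td_error * x               # Equation 6
    return theta
```

Only the row of `theta` belonging to the chosen action changes, because the other rows do not influence $Q(s_{t},a_{t}\mid\theta_{t})$ in this linear setting.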
4 Experimental Results
We first present results motivating the importance of careful LSM initialization for obtaining a rich high-dimensional state representation, which is necessary for efficient training of the liquid-to-readout weights. We then demonstrate the utility of the recurrent-liquid synaptic connections and of careful LSM initialization using the classic cartpole-balancing RL task [20]. We then validate the capability of an appropriately initialized LSM, trained using the presented methodology, for solving complex RL tasks like Pacman [23] and Atari games [24].
4.1 LSM Hyperparameter Tuning
Initializing the LSM with appropriate parameters is an important step to construct a model that produces useful high-dimensional representations. Since the input-to-liquid and recurrent-liquid connectivity matrices of the LSM are fixed a priori during training, how these connections are initialized dictates the liquid dynamics. We choose the parameters $K$ (governing the input-to-liquid connectivity matrix) and $C$ (governing the recurrent-liquid connectivity matrices) empirically based on three observations: (1) stable spiking activity of the liquid, (2) eigenvalue analysis of the recurrent connectivity matrices, and (3) development of the liquid-excitatory neuron membrane potential.
Spiking activity of the liquid is said to be stable if every finite stream of inputs results in a finite period of response. Sustained activity indicates that small input noise can perturb the liquid state and lead to chaotic activity that is no longer dependent on the input stimuli. It is impractical to analyze the stability of the liquid for all possible input streams within a finite time. We investigate the liquid stability by feeding in random input stimuli and sampling the excitatory neuronal spike counts at regular time intervals over the LSM-simulation period for different values of $K$ and $C$. We separately adjust these parameters for each learning task using random representations of the environment from the games. Values of $K$ and $C$ are experimentally determined to be $3$ and $4$ for the cartpole and Pacman experiments, respectively, which ensures stable liquid spiking activity while enabling the liquid to exhibit fading memory of the past inputs. Fading memory indicates that the liquid retains input information for a short period of time after the input stimuli are cut off.
Analyzing the eigenvalue spectrum of the recurrent connectivity matrix is another tool to assess the stability of the liquid. Each eigenvalue in the spectrum represents an individual mode of the liquid. The real part indicates the decay rate of the mode while the imaginary part corresponds to the frequency of the mode [38]. Liquid spiking activity remains stable as long as all eigenvalues remain within the unit circle. However, this condition is not easily met for realistic recurrent-liquid connections with random synaptic weight initialization [39]. We constrain the recurrent weights (hyperparameter $\beta$) such that each neuron receives balanced excitatory and inhibitory synaptic currents as previously discussed in subsection 3.1. This results in eigenvalues that lie within the unit circle as illustrated in Figure 2(A). In order to emphasize the importance of LSM initialization, we also show the eigenvalue spectrum of the recurrent-liquid connectivity matrix when the weights are not properly initialized in Figure 2(B), where many eigenvalues are outside the unit circle. Finally, we also use the development of the excitatory neuronal membrane potential to guide hyperparameter tuning. The hyperparameters $C$ and $\beta$ are chosen to ensure that the membrane potential exhibits balanced fluctuation as illustrated in Figure 2(C), which plots the membrane potential of 10 randomly picked neurons in the liquid.
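The unit-circle check can be sketched with NumPy as shown below. How the four connection matrices are assembled into one recurrent matrix is not specified in the text, so the block layout here is an assumption: rows are presynaptic neurons, excitatory neurons come first, and weights leaving inhibitory neurons are entered with a negative sign to match Equation 2. The function name is hypothetical.

```python
import numpy as np

def spectral_radius(w_ee, w_ei, w_ie, w_ii):
    """Largest eigenvalue magnitude of the signed recurrent weight matrix.

    Shapes: w_ee (m, m), w_ei (m, n), w_ie (n, m), w_ii (n, n),
    all holding non-negative weight magnitudes.
    """
    # Inhibitory presynaptic weights contribute negative current.
    w = np.block([[w_ee, w_ei],
                  [-w_ie, -w_ii]])
    return float(np.max(np.abs(np.linalg.eigvals(w))))
```

A returned value below 1 means every eigenvalue lies inside the unit circle, which is the stability condition stated above; the spectrum itself can be plotted from `np.linalg.eigvals` to reproduce a figure like Figure 2(A).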
4.2 Learning to Balance a Cartpole
Cartpole-balancing is a classic control problem wherein the agent has to balance a pole attached to a wheeled cart that can move freely on a rail of certain length as shown in Figure 3(A). The agent can exert a unit force on the cart either to the left or right side for balancing the pole and keeping the cart within the rail. The environment state is characterized by cart position, cart velocity, angle of the pole, and angular velocity of the pole, which are designated by the tuple $(\chi ,\dot{\chi},\phi ,\dot{\phi})$. The environment returns a unit reward every timestep and concludes after $200$ timesteps if the pole does not fall and the cart does not go off the rail. Because the game is played for a finite time period, we constrain $(\chi ,\dot{\chi},\phi ,\dot{\phi})$ to be within the range specified by $(\pm 2.5,\pm 0.5,\pm 0.28,\pm 0.88)$ for efficiently mapping the real-valued state inputs to spike trains feeding into the LSM. Each real-valued state input is mapped to $10$ input neurons, which have firing rates proportional to the one-hot encoding of the input value representing $10$ distinct levels within the corresponding range.
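A minimal sketch of this state encoding is given below. It assumes equal-width levels within each range and a maximum input firing rate of $100$ $\mathrm{Hz}$ (the value used in this experiment); the names `STATE_LIMITS`, `LEVELS`, and `encode_state` are hypothetical.

```python
import numpy as np

# Clipping ranges for (position, velocity, angle, angular velocity) from the text.
STATE_LIMITS = np.array([2.5, 0.5, 0.28, 0.88])
LEVELS = 10          # distinct levels (input neurons) per state variable
MAX_RATE_HZ = 100.0  # maximum input firing rate


def encode_state(state):
    """Map a 4-D cartpole state to the firing rates of 4 * 10 = 40 input neurons."""
    clipped = np.clip(np.asarray(state, dtype=float), -STATE_LIMITS, STATE_LIMITS)
    # Rescale each variable to [0, 1], then pick one of LEVELS equal-width bins (assumption).
    unit = (clipped + STATE_LIMITS) / (2.0 * STATE_LIMITS)
    level = np.minimum((unit * LEVELS).astype(int), LEVELS - 1)
    rates = np.zeros((len(STATE_LIMITS), LEVELS))
    rates[np.arange(len(STATE_LIMITS)), level] = MAX_RATE_HZ  # one-hot per variable
    return rates.ravel()


# The angular velocity of 5.0 lies outside its range and is clipped to 0.88.
rates = encode_state([0.0, 0.12, -0.28, 5.0])
print(np.nonzero(rates)[0])
```

Exactly one neuron per state variable is active at a time, so the resolution of the representation is limited to $10$ levels per variable.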
We model the agent using an LSM containing $150$ liquid neurons, $32$ hidden neurons in the fully-connected layer between the liquid and output layer, and $2$ output neurons. The maximum firing rate for the input neurons representing the environment state is set to $100$ $\mathrm{Hz}$. The LSM is trained for ${10}^{5}$ timesteps, which are equally divided into $100$ training epochs containing $1,000$ timesteps per epoch. After each epoch, the LSM is evaluated for $1,000$ timesteps with the probability of choosing a random action, $\epsilon$, set to $0.05$. Note that the LSM is evaluated for $1,000$ timesteps (multiple gameplays) even though a single gameplay lasts a maximum of only $200$ timesteps as mentioned in the previous paragraph. We use the accumulated reward averaged over multiple gameplays as the true indicator of the LSM (agent) performance to account for the randomness in action selection as a result of the $\epsilon$-greedy policy. We train the LSM initialized with $10$ different random seeds and obtain the median accumulated reward as shown in Figure 3(B). Note that the maximum possible accumulated reward per gameplay is $200$ since each gameplay lasts at most $200$ timesteps. The increase in median accumulated reward over epochs indicates that the LSM learnt to balance the cartpole using the dynamically evolving high-dimensional liquid states. The ability of the liquid to provide rich high-dimensional input representations can be attributed to the careful initialization of the connectivity matrices and weights (explained in subsection 3.1), which ensures balance between the excitatory and inhibitory currents to the liquid neurons and preserves fading memory of past liquid activity. However, the median accumulated reward after $100$ training epochs saturates around $125$ and does not reach the maximum value of $200$.
We hypothesize that the game score saturation comes from the quantized representation of the environment state, and demonstrate in the following experiment with Pacman that the LSM can learn optimally given a better state representation. Finally, in order to emphasize the importance of LSM initialization, we also show the median accumulated reward per training epoch for training in which the LSM is initialized to have few synaptic connections. Figure 3(C) indicates that the median accumulated reward is around $90$ when the LSM initialization is suboptimal.
To visualize the learnt action-value function guiding action selection, we compare the Q-values produced by the LSM during evaluation in three different scenarios depicted in Figure 3(D). Note that each Q-value represents how good the corresponding action is for a given environment state. In scenario $1$ (see Figure 3(D)1), which corresponds to the beginning of the gameplay wherein the pole is almost balanced, the values of both actions are identical. This implies that either action (moving the cart left or right) will lead to a similar outcome. In scenario $2$ (see Figure 3(D)2), wherein the pole is unbalanced to the left side, the difference between the predicted Q-values increases. Specifically, the Q-value for applying a unit force on the right side of the cart is higher, which causes the cart to move to the left. Pushing the cart to the left in turn causes the pole to swing back right towards the balanced position. Similarly, in scenario $3$ (see Figure 3(D)3), wherein the pole is unbalanced to the right side, the Q-value is higher for applying a unit force on the left side of the cart, which causes the cart to move right and enables the pole to swing left towards the balanced position. This visually demonstrates the ability of the LSM (agent) to successfully balance the pole by pushing the cart appropriately to the left or right based on the learnt Q-values.
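The way Q-values drive action selection under the $\epsilon$-greedy policy can be sketched as follows. The Q-values and the action ordering in the example are hypothetical and are not read off Figure 3(D).

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the highest-valued action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
# Hypothetical Q-values; index 0 = force on the right side of the cart, index 1 = force on the left side.
q_pole_leaning_left = np.array([0.9, 0.4])
greedy_action = epsilon_greedy(q_pole_leaning_left, epsilon=0.0, rng=rng)
evaluation_actions = [epsilon_greedy(q_pole_leaning_left, 0.05, rng) for _ in range(2000)]
```

With $\epsilon =0.05$, as used during evaluation, the greedy action is still chosen in the large majority of timesteps, which is why rewards are averaged over multiple gameplays.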
4.3 Learning to Play Pacman
In order to comprehensively validate the efficacy of the high-dimensional environment representations provided by the liquid, we train the LSM to play a game of Pacman [23]. The objective of the game is to control Pacman (yellow in color) to capture all the foods (represented by small white dots) in a grid without being eaten by the ghosts as illustrated in Figure 4. The ghosts always hunt the Pacman; however, cherries (represented by large white dots) make the ghosts temporarily scared of the Pacman and run away. The game environment returns a unit reward whenever Pacman consumes a food, a cherry, or a scared ghost (white in color). The game environment also returns a unit reward and restarts when all foods are captured. We use the locations of Pacman, food, cherry, ghost, and scared ghost as the environment state representation. The location of each object is encoded as a two-dimensional binary array whose dimensions match those of the Pacman grid as shown in Figure 4. The binary intermediate representations of all the objects are then concatenated and flattened into a single vector to be fed to the input layer of the LSM.
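The state representation described above can be sketched in a few lines. The channel ordering in `OBJECT_TYPES`, the function name `encode_grid`, and the example layout are assumptions made for illustration.

```python
import numpy as np

OBJECT_TYPES = ("pacman", "food", "cherry", "ghost", "scared_ghost")


def encode_grid(positions, grid_shape):
    """Build one binary map per object type and flatten them into a single input vector.

    positions:  dict mapping an object type to a list of (row, col) cells.
    grid_shape: (rows, cols) of the Pacman grid.
    """
    maps = np.zeros((len(OBJECT_TYPES),) + tuple(grid_shape), dtype=np.uint8)
    for channel, name in enumerate(OBJECT_TYPES):
        for row, col in positions.get(name, []):
            maps[channel, row, col] = 1
    return maps.ravel()


# Hypothetical 7x7 layout with 3 foods and 1 ghost.
layout = {
    "pacman": [(3, 3)],
    "food": [(0, 0), (0, 6), (6, 6)],
    "ghost": [(6, 0)],
}
vector = encode_grid(layout, (7, 7))
print(vector.shape)
```

For a $7\times 7$ grid this yields $5\times 49=245$ binary inputs, so the input dimensionality grows linearly with the number of grid cells.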
The LSM configurations and game settings used for the Pacman experiments are summarized in Table 1, where each game setting has a different degree of complexity with regard to the Pacman grid size and the number of foods, ghosts, and cherries. In the first experiment, we use a $7\times 7$ grid with $3$ foods for Pacman to capture and a single ghost to prevent it from achieving its objective. Thus, the maximum possible accumulated reward at the end of a successful game is $4$ (one unit for each of the $3$ foods and one unit for capturing all foods). Figure 5(A) shows that the median accumulated reward gradually increases with the number of training epochs and converges close to the maximum possible reward, thereby validating the capability of the liquid to provide a useful high-dimensional representation of the environment state necessary for efficient training of the readout weights using the presented methodology. Interestingly, in the second experiment using a larger $7\times 17$ grid, we find that the median reward converges to $12$, which is greater than the number of foods available in the grid. This indicates that the LSM does not only learn to capture all the foods; it also learns to capture the cherries and the scared ghosts, which further increases the accumulated reward since consuming a scared ghost results in a unit immediate reward. In the final experiment, we train the LSM to control Pacman in a $17\times 19$ grid with sparsely dispersed foods. We find that the larger grid requires more exploration and training steps for the agent to perform well and achieve the maximum possible reward, resulting in a learning curve that is less steep compared to that obtained for smaller grid sizes in the earlier experiments as shown in Figure 5(C).
Finally, we plot the average of the Q-values produced by the LSM as the Pacman navigates the grid to visualize the correspondence between the learnt Q-values and the environment state. As discussed in subsection 3.2, each Q-value produced by the LSM provides a measure of how good a particular action is for a given environment state. The Q-value averaged over the set of all possible actions (known as the state-value function) thus indicates the value of being in a certain state. Figure 5(D) illustrates the state-value function while playing the Pacman game in a $7\times 17$ grid. The predicted state-value starts at a relatively high level because the foods are abundant in the grid and the ghosts are far away from the Pacman (see Figure 5(D)1). The state-value gradually decreases as the Pacman navigates through the grid and gets closer to the ghosts. The predicted state-value then shoots up after the Pacman consumes a cherry and makes the ghosts temporarily consumable (see Figure 5(D)2), leading to potential additional reward. The predicted state-value drops after the ghosts are reborn (see Figure 5(D)3). Finally, we observe a slight increase in the state-value towards the end of the game when the Pacman is closer to the last food after it consumes a cherry (see Figure 5(D)4). It is interesting to note that although the scenario in Figure 5(D)4 is similar to that in Figure 5(D)2, the state-value is smaller since the expected accumulated reward at this step is at most $3$ assuming that the Pacman can capture both the scared ghost and the last food. On the other hand, in the environment state shown in Figure 5(D)2, the expected accumulated reward is greater than $3$ since $4$ foods and $2$ scared ghosts are available for the Pacman to capture.
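The averaging used for this visualization can be written compactly as below. The symbols $s$ (environment state), $a$ (action), $\mathcal{A}$ (set of possible actions), $Q(s,a)$, and $V(s)$ are notation assumed here for illustration and may differ from that of subsection 3.2.

```latex
V(s) \;=\; \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} Q(s, a)
```

Under this definition, $V(s)$ rises whenever the Q-values of most actions rise together, for example right after a cherry is consumed.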
4.4 Learning to Play Atari Games
Finally, we train the LSM using the presented methodology to play Atari games [24], which are widely used to benchmark deep reinforcement learning networks. We arbitrarily select 4 games for evaluation, namely, Boxing, Gopher, Freeway, and Krull. We use the RAM of the Atari machine, which stores 128 bytes of information about an Atari game, as a representation of the environment [24]. During training, we modify the reward structure of the game by clipping all positive immediate rewards to $1$ and all negative immediate rewards to $-1$. However, we do not clip the immediate reward during testing and measure the actual accumulated reward following [19]. For all selected Atari games, we model the agent using an LSM containing $500$ liquid neurons and $128$ hidden neurons. The number of output neurons varies for each game as the number of possible actions is different. The maximum Poisson firing rate for the input neurons is set to $100$ $\mathrm{Hz}$. The LSM is trained for $5\times {10}^{3}$ steps. Figure 6 illustrates that the LSM learnt the optimal strategies to play Boxing and Krull without any prior knowledge of the rules, leading to high accumulated reward towards the end of the training. However, learning in Gopher and Freeway progresses relatively slowly. For detailed evaluation, we compare the median accumulated reward obtained from playing with the trained LSM to the average accumulated reward obtained from playing with random actions for $1\times {10}^{5}$ steps. We also compare the accumulated reward with that reported for human players in [19]. Table 2 shows that the LSM achieves a better score than human players on Boxing and Krull, and a comparable albeit lower score on Freeway and Gopher.
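The difference between the clipped training reward and the unclipped test reward can be illustrated with a small sketch, assuming negative rewards are clipped to $-1$ as in [19]. The function names and the example reward sequence are hypothetical.

```python
import numpy as np


def clip_reward(reward):
    """Training-time reward: +1 for any positive reward, -1 for any negative one, 0 otherwise."""
    return float(np.sign(reward))


def accumulate(rewards, training):
    """Sum the rewards of one game, clipping them only during training."""
    return sum(clip_reward(r) if training else float(r) for r in rewards)


episode_rewards = [0, 20, 0, -5, 100]  # hypothetical immediate rewards from one game
train_return = accumulate(episode_rewards, training=True)
test_return = accumulate(episode_rewards, training=False)
print(train_return, test_return)
```

Clipping keeps the scale of the learning signal similar across games whose raw scores differ by orders of magnitude, while the reported scores remain the unclipped ones.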
5 Discussion
The LSM, an important class of biologically plausible recurrent SNNs, has thus far been primarily demonstrated for pattern (speech/image) recognition [12, 13], gesture recognition [14, 15], and sequence generation tasks [16, 17, 18] using standard datasets. To the best of our knowledge, our work is the first demonstration of LSMs, trained using a Q-learning based methodology, for complex RL tasks like Pacman and Atari games commonly used to evaluate deep reinforcement learning networks. The benefits of the proposed LSM-based RL framework over the state-of-the-art deep learning models are twofold. First, the LSM entails fewer trainable parameters as a result of using fixed input-to-liquid and recurrent-liquid synaptic connections. However, this requires careful initialization of the respective matrices for efficient training of the liquid-to-readout weights as experimentally validated in section 4. We note that the stability of LSMs could be further enhanced by training the recurrent weights using localized Spike Timing Dependent Plasticity based learning rules [40, 41, 42], which incur lower computational complexity compared to the backpropagation-through-time algorithm [43, 12] used for training recurrent SNNs. Second, LSMs can be efficiently implemented on event-driven neuromorphic hardware like IBM TrueNorth [28] or Intel Loihi [29], leading to potentially much improved energy efficiency while achieving comparable performance to deep learning models on the chosen benchmark tasks. Note that the readout layer in the presented LSM needs to be implemented outside the neuromorphic fabric since it is composed of artificial rate-based neurons that are typically not supported in neuromorphic hardware realizations. Alternatively, a readout layer composed of spiking neurons could be used, which can be trained using spike-based error backpropagation algorithms [44, 45, 46, 47, 48, 18].
Future work could also explore STDP-based reinforcement learning rules [49, 50, 51, 52] to render the training algorithm amenable to neuromorphic hardware implementations.
6 Conclusion
The Liquid State Machine (LSM) is a bio-inspired recurrent spiking neural network composed of an input layer sparsely connected to a randomly interlinked liquid of spiking neurons for the real-time processing of spatio-temporal inputs. In this work, we proposed LSMs, trained using the presented Q-learning based methodology, for solving complex Reinforcement Learning (RL) tasks like playing Pacman and Atari that have hitherto been used to benchmark deep reinforcement learning networks. We presented initialization strategies for the fixed input-to-liquid and recurrent-liquid synaptic connectivity matrices and weights to enable the liquid to produce a useful high-dimensional representation of the environment state necessary for efficient training of the liquid-to-readout weights. We demonstrated the significance of the inherent capability of the liquid to produce rich representations by training the LSM to successfully balance a cartpole. Our experiments on the Pacman game showed that the LSM learns the optimal strategies for different game settings and grid sizes. Our analyses on a subset of Atari games indicated that the LSM achieves a score comparable to that reported for human players in existing works.
Author Contributions
Gopalakrishnan Srinivasan and Wachirawit Ponghiran wrote the paper. Wachirawit Ponghiran performed the simulations. All authors helped with developing the concepts, conceiving the experiments, and writing the paper.
Acknowledgments
This work was supported in part by the Center for Brain Inspired Computing (C-BRIC), one of the six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, by the Semiconductor Research Corporation, the National Science Foundation, Intel Corporation, the DoD Vannevar Bush Fellowship, and by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001.
References
 [1] Rodney J Douglas, Christof Koch, Misha Mahowald, KA Martin, and Humbert H Suarez. Recurrent excitation in neocortical circuits. Science, 269(5226):981–985, 1995.
 [2] Kenneth D Harris and Thomas D Mrsic-Flogel. Cortical connectivity and sensory coding. Nature, 503(7474):51, 2013.
 [3] Xiaolong Jiang, Shan Shen, Cathryn R Cadwell, Philipp Berens, Fabian Sinz, Alexander S Ecker, Saumil Patel, and Andreas S Tolias. Principles of connectivity among morphologically defined cell types in adult neocortex. Science, 350(6264):aac9462, 2015.
 [4] Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural computation, 14(11):2531–2560, 2002.
 [5] Wolfgang Maass, Thomas Natschläger, and Henry Markram. A model for real-time computation in generic neural microcircuits. In Advances in neural information processing systems, pages 229–236, Vancouver, British Columbia, Canada, 2003.
 [6] Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009.
 [7] Peter Dayan, LF Abbott, et al. Theoretical neuroscience: computational and mathematical modeling of neural systems. Journal of Cognitive Neuroscience, 15(1):154–155, 2003.
 [8] John E. Savage. Models of computation, volume 136. Addison-Wesley, Reading, MA, USA, 1998.
 [9] Daniel J. Amit. Modeling brain function: The world of attractor neural networks. Cambridge university press, New York, NY, USA, 1992.
 [10] Peter Auer, Harald Burgsteiner, and Wolfgang Maass. Reducing communication for distributed learning in neural networks. In International Conference on Artificial Neural Networks, pages 123–128. Springer, 2002.
 [11] David Verstraeten, Benjamin Schrauwen, Dirk Stroobandt, and Jan Van Campenhout. Isolated word recognition with the liquid state machine: a case study. Information Processing Letters, 95(6):521–528, 2005.
 [12] Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, and Wolfgang Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. arXiv preprint arXiv:1803.09574, 2018.
 [13] Gopalakrishnan Srinivasan, Priyadarshini Panda, and Kaushik Roy. Spilinc: Spiking liquid-ensemble computing for unsupervised speech and image recognition. Frontiers in neuroscience, 12, 2018.
 [14] Joseph Chrol-Cannon and Yaochu Jin. Learning structure of sensory inputs with synaptic plasticity leads to interference. Frontiers in Computational Neuroscience, 9:103, 2015.
 [15] Priyadarshini Panda and Narayan Srinivasa. Learning to recognize actions from limited training examples using a recurrent spiking neural model. Frontiers in neuroscience, 12:126, 2018.
 [16] Priyadarshini Panda and Kaushik Roy. Learning to generate sequences with combination of hebbian and non-hebbian plasticity in recurrent spiking neural networks. Frontiers in neuroscience, 11:693, 2017.
 [17] Wilten Nicola and Claudia Clopath. Supervised learning in spiking neural networks with force training. Nature communications, 8(1):2208, 2017.
 [18] Guillaume Bellec, Franz Scherr, Elias Hajek, Darjan Salaj, Robert Legenstein, and Wolfgang Maass. Biologically inspired alternatives to backpropagation through time for learning in recurrent neural nets. arXiv preprint arXiv:1901.09049, 2019.
 [19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [20] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.
 [21] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
 [22] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
 [23] John DeNero, Dan Klein, and Pieter Abbeel. The Pacman Projects, 2016.
 [24] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 [25] Prashant Joshi and Wolfgang Maass. Movement generation with circuits of spiking neurons. Neural Computation, 17(8):1715–1738, 2005.
 [26] Nicolas Berberich. Implementation of a real time liquid state machine on spinnaker for biomimetic robot control. Master's thesis, TUM, 2017.
 [27] Juan Camilo Vasquez Tieck, Marin Vlastelica Pogančić, Jacques Kaiser, Arne Roennau, Marc-Oliver Gewaltig, and Rüdiger Dillmann. Learning continuous muscle control for a multi-joint arm by extending proximal policy optimization with a liquid state machine. In International Conference on Artificial Neural Networks, pages 211–221. Springer, 2018.
 [28] Paul A Merolla, John V Arthur, Rodrigo Alvarez-Icaza, Andrew S Cassidy, Jun Sawada, Filipp Akopyan, Bryan L Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.
 [29] Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, et al. Loihi: A neuromorphic many-core processor with on-chip learning. IEEE Micro, 38(1):82–99, 2018.
 [30] Michael Wehr and Anthony M Zador. Balanced inhibition underlies tuning and sharpens spike timing in auditory cortex. Nature, 426(6965):442, 2003.
 [31] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, Haifa, Israel, 2010.
 [32] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
 [33] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [34] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
 [35] David Heeger. Poisson model of spike generation. Handout, University of Stanford, 5:1–13, 2000.
 [36] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, 1989.
 [37] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
 [38] Kanaka Rajan, LF Abbott, and Haim Sompolinsky. Stimulus-dependent suppression of chaos in recurrent neural networks. Physical Review E, 82(1):011903, 2010.
 [39] Kanaka Rajan and LF Abbott. Eigenvalue spectra of random matrices for neural networks. Physical review letters, 97(18):188104, 2006.
 [40] Guoqiang Bi and Muming Poo. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of neuroscience, 18(24):10464–10472, 1998.
 [41] Sen Song, Kenneth D Miller, and Larry F Abbott. Competitive hebbian learning through spike-timing-dependent synaptic plasticity. Nature neuroscience, 3(9):919, 2000.
 [42] Peter U Diehl and Matthew Cook. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in computational neuroscience, 9:99, 2015.
 [43] Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
 [44] Priyadarshini Panda and Kaushik Roy. Unsupervised regenerative learning of hierarchical features in spiking deep networks for object recognition. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 299–306, Vancouver, British Columbia, Canada, 2016. IEEE.
 [45] Jun Haeng Lee, Tobi Delbruck, and Michael Pfeiffer. Training deep spiking neural networks using backpropagation. Frontiers in neuroscience, 10:508, 2016.
 [46] Chankyu Lee, Priyadarshini Panda, Gopalakrishnan Srinivasan, and Kaushik Roy. Training deep spiking convolutional neural networks with stdp-based unsupervised pre-training followed by supervised fine-tuning. Frontiers in neuroscience, 12:435, 2018.
 [47] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Luping Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience, 12:331, 2018.
 [48] Yingyezhe Jin, Wenrui Zhang, and Peng Li. Hybrid macro/micro level backpropagation for training deep spiking neural networks. In Advances in Neural Information Processing Systems, pages 7005–7015, Montréal, Quebec, Canada, 2018.
 [49] Jean-Pascal Pfister, Taro Toyoizumi, David Barber, and Wulfram Gerstner. Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural computation, 18(6):1318–1348, 2006.
 [50] Răzvan V Florian. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 19(6):1468–1502, 2007.
 [51] Michael A Farries and Adrienne L Fairhall. Reinforcement learning with modulated spike timing–dependent synaptic plasticity. Journal of neurophysiology, 98(6):3648–3665, 2007.
 [52] Robert Legenstein, Dejan Pecevski, and Wolfgang Maass. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. PLoS computational biology, 4(10):e1000180, 2008.
Table 1: LSM configurations and game settings used for the Pacman experiments.
Grid size  Ghosts  Foods  Cherries  Training steps  Liquid neurons  Hidden neurons

$7\times 7$  $1$  $3$  $0$  $5\times {10}^{5}$  $500$  $128$ 
$7\times 17$  $2$  $6$  $2$  $5\times {10}^{5}$  $2,000$  $512$ 
$17\times 19$  $1$  $6$  $0$  $3\times {10}^{6}$  $3,000$  $512$ 
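Table 2: Accumulated reward on the selected Atari games for human players (reported in [19]), random actions, and the trained LSM (this work).
Game  Human player  Random actions  This work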

Boxing  $4.3$  $0.75$  $20.2$ 
Freeway  $29.6$  $0.0$  $19.75$ 
Gopher  $2,321$  $279.3$  $611.1$ 
Krull  $2,395$  $1,590$  $3,686$ 
Input-to-liquid connections

Connection type  Weight 
$P\to E$  $[0,0.6]$ 
Recurrent-liquid connections

Connection type  Weight 
$E\to E$  $[0,0.05]$ 
$E\to I$  $[0,0.25]$ 
$I\to E$  $[0,0.3]$ 
$I\to I$  $[0,0.01]$ 
Excitatory and inhibitory neurons  

Parameter  Value 
${V}_{rest}$  $0$ 
${V}_{reset}$  $0$ 
${V}_{thres}$  $0.5$ 
$\tau $  20 $\mathrm{ms}$ 
${\tau}_{refrac}$  1 $\mathrm{ms}$ 
$\mathrm{\Delta}t$ (simulation timestep)  1 $\mathrm{ms}$ 
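To make the neuron parameters above concrete, the following sketch simulates a single leaky integrate-and-fire neuron with a forward-Euler update. The exact neuron dynamics are defined earlier in the paper and are not reproduced here, so the update rule, the function name `simulate_lif`, and the input drive are assumptions.

```python
import numpy as np

# Neuron parameters from the table above (time constants in milliseconds).
V_REST, V_RESET, V_THRES = 0.0, 0.0, 0.5
TAU_MS, TAU_REFRAC_MS, DT_MS = 20.0, 1.0, 1.0


def simulate_lif(input_current):
    """Simulate one leaky integrate-and-fire neuron; returns (potentials, spikes)."""
    v = V_REST
    refrac_left = 0.0
    potentials, spikes = [], []
    for current in input_current:
        fired = False
        if refrac_left > 0.0:
            refrac_left -= DT_MS  # hold the potential during the refractory period
        else:
            # Forward-Euler step of tau * dv/dt = -(v - V_rest) + I (assumed form).
            v += (DT_MS / TAU_MS) * (-(v - V_REST) + current)
            if v >= V_THRES:
                fired = True
                v = V_RESET
                refrac_left = TAU_REFRAC_MS
        potentials.append(v)
        spikes.append(fired)
    return np.array(potentials), np.array(spikes)


# Hypothetical constant drive of 1.0 for 100 ms, followed by 100 ms without input.
drive = np.concatenate([np.ones(100), np.zeros(100)])
potentials, spikes = simulate_lif(drive)
print(int(spikes.sum()))
```

Once the drive is removed, the membrane potential decays towards ${V}_{rest}$ with time constant $\tau $ and the neuron stops firing, which is the single-neuron counterpart of the fading memory discussed for the liquid.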
Parameter  Value 
Readout weights update frequency  Once every gamestep 
Warm up steps before training begins  $100$ 
Batch size for experience replay  $32$ 
Experience replay buffer size  $1\times {10}^{6}$ 
Discount factor  $0.95$ 
Initial exploration probability during training  $1$ 
Final exploration probability during training (Cartpole)  $1\times {10}^{-3}$ 
Final exploration probability during training (Pacman & Atari)  $1\times {10}^{-1}$ 
Exploration probability during evaluation (Cartpole & Atari)  $5\times {10}^{-2}$ 
Exploration probability during evaluation (Pacman)  $0$ 
Learning rate for RMSProp algorithm  $2\times {10}^{-4}$ 
Term added to denominator for RMSProp algorithm  $1\times {10}^{-6}$ 
Weight decay for RMSProp algorithm  $0$ 
Smoothing constant for RMSProp algorithm  $0.99$ 
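The discount factor and batch size listed above enter the readout training through the one-step Q-learning target [22]. The sketch below shows that target for a replayed minibatch; it is a generic illustration under the assumption of uniform sampling from the replay buffer, and the function names and example values are hypothetical.

```python
import numpy as np

DISCOUNT = 0.95  # discount factor from the table above
BATCH_SIZE = 32  # batch size for experience replay from the table above


def td_targets(rewards, next_q_values, terminal):
    """One-step Q-learning targets r + gamma * max_a' Q(s', a'), without bootstrap at game end."""
    rewards = np.asarray(rewards, dtype=float)
    bootstrap = np.max(np.asarray(next_q_values, dtype=float), axis=1)
    return rewards + DISCOUNT * bootstrap * (1.0 - np.asarray(terminal, dtype=float))


def sample_batch(buffer_size, rng, batch_size=BATCH_SIZE):
    """Indices of a uniformly sampled minibatch from a buffer holding buffer_size transitions."""
    return rng.integers(0, buffer_size, size=batch_size)


rng = np.random.default_rng(0)
indices = sample_batch(buffer_size=1000, rng=rng)
# Two hypothetical transitions; the second one ends the game.
targets = td_targets(rewards=[1.0, 0.0],
                     next_q_values=[[0.2, 0.6], [1.0, 0.5]],
                     terminal=[False, True])
print(targets)
```

The readout weights would then be adjusted, for instance with RMSProp, to reduce the gap between the predicted Q-value of the taken action and this target.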
