Pre-training in Deep Reinforcement Learning for Automatic Speech Recognition

  • 2019-10-26 14:20:20
  • Thejan Rajapakshe, Rajib Rana, Siddique Latif, Sara Khalifa, Björn W. Schuller


Deep reinforcement learning (deep RL) is a combination of deep learning with reinforcement learning principles to create efficient methods that can learn by interacting with its environment. This led to breakthroughs in many complex tasks that were previously difficult to solve. However, deep RL requires a large amount of training time that makes it difficult to use in various real-life applications like human-computer interaction (HCI). Therefore, in this paper, we study pre-training in deep RL to reduce the training time and improve the performance in speech recognition, a popular application of HCI. We achieve significantly improved performance in less time on a publicly available speech command recognition dataset.


Thejan Rajapakshe1, Rajib Rana1, Siddique Latif1,2, Sara Khalifa2, Björn W. Schuller3,4
1University of Southern Queensland, Australia
2Distributed Sensing Systems Group, Data61, CSIRO Australia
3GLAM – Group on Language, Audio & Music, Imperial College London, UK
4ZD.B Chair of Embedded Intelligence for Health Care & Wellbeing, University of Augsburg, Germany

Index Terms—  Speech Recognition, Machine Learning, Deep Reinforcement Learning

1 Introduction

Reinforcement Learning (RL) follows the principle of behaviourist psychology and learns in a similar way as a child learns to perform a new task. RL has been repeatedly successful in the past [1, 2]; however, these successes were mostly limited to low-dimensional problems. In recent years, deep learning has significantly advanced the field of RL, with the use of deep learning algorithms within RL giving rise to the field of “deep reinforcement learning”. Deep learning enables RL to operate in high-dimensional state and action spaces, so that it can now be used for complex decision-making problems [3].

Deep RL algorithms have mostly been applied to video or image processing domains, ranging from playing video games [4, 5] to indoor navigation [6]. Only a very limited number of studies have explored the promising aspects of deep RL in the field of audio processing, and in particular speech processing. In this paper, we study this under-researched topic. Specifically, we conduct a case study of the feasibility of deep RL for automatic speech command classification.

A major challenge of deep RL is that it often requires a prohibitively large amount of training time and data to reach reasonable performance, making it inapplicable in real-world settings [7]. Leveraging humans to provide demonstrations (known as learning from demonstration (LfD)) in RL has recently gained traction as a possible way of speeding up deep RL [8, 9, 10]. In LfD, actions demonstrated by the human are considered as the ground truth labels for a given input game/image frame. An agent closely imitates the demonstrator’s policy at the start, and later on learns to surpass the demonstrator [7]. However, LfD poses a distinct challenge: the agent often has to acquire skills from only a few demonstrations and interactions, because these are time-consuming and expensive to collect [11]. Therefore, LfD is generally not scalable, especially for high-dimensional problems.

Pre-training the underlying deep neural network (in Section 2 we discuss the structure of RL in detail) is another approach to speed up training in deep RL. In [12], the authors combine Deep Belief Networks (DBNs) with RL to take advantage of the unsupervised pre-training phase in DBNs, and then use the DBN as the starting point for a neural network function approximator. Furthermore, in [13], the authors demonstrate that a pre-trained hidden layer architecture can reduce the time required to solve reinforcement learning problems. While these studies show the promise of using pre-trained deep neural networks in non-audio domains, the feasibility of pre-training is not well understood for the audio field in general.

We found very few studies in audio using RL/deep RL. In [14], the authors describe an avenue of using RL to classify audio files into several mood classes depending upon listener response during a performance. In [15], the authors introduce the ‘EmoRL’ model that triggers an emotion classifier as soon as it gains enough confidence while listening to emotional speech. The authors cast this problem as an RL problem by training an emotion classification agent to perform two actions: wait and terminate. The agent selects the terminate action to stop processing incoming speech and classify it based on the observation. The objective was to achieve a trade-off between performance (accuracy) and latency by penalising wrong classification actions as well as overly delayed predictions through the reward function. While the authors of the above studies use RL for audio, they do not focus on pre-training to improve the performance of deep RL.

In this paper, we propose pre-training for improving the performance and speed of deep RL while conducting speech classification. Results from the case study with the recent public Speech Commands Dataset [16] show that pre-training offers a significant improvement in accuracy and helps achieve faster convergence.

2 Methodology

In this study, we investigate the feasibility of pre-training in RL algorithms for speech recognition. We present the details of the proposed model in this section.

2.1 Speech Command Recognition Model

The considered policy network model consists of a speech command recognition model, as shown in Figure 1.

Fig. 1: Speech command recognition model architecture.

Considering the fact that Convolutional Neural Networks (CNNs) and Long Short Term Memory (LSTM) Recurrent Neural Networks (RNNs) can be combined to improve the performance [17], we placed CNN layers in front of an LSTM RNN layer. LSTM RNNs are good at learning the temporal structure of a feature map [18], and CNNs are effective at reducing frequency variations [19]. The outputs from the LSTM RNN layer are passed on to fully connected layers to learn discriminative features during training [17]. In this way, our proposed policy network combines convolutional layers for learning high-level abstractions, an LSTM layer for capturing long-term temporal context, and finally fully connected layers for learning a discriminative representation.

2.2 Deep Reinforcement Learning Framework

Fig. 2: Framework of the proposed Deep RL.

The reinforcement learning framework mainly consists of two entities, namely the “agent” and the “environment”. The action decided by the agent is executed on the environment, which then returns the reward and the next state to the agent. In this work, we focus on deep RL, which uses a Deep Neural Network (DNN) structure in the agent module to determine the action to take by observing the state, as illustrated in Figure 2. We modelled this problem as a Markov decision process (MDP) [20]. This can be considered as a tuple $(S, A, P, R)$, where $S$ is the state space, $A$ is the action space, $P$ is the state transition policy, and $R$ is the reward function. Since the core goal of this problem is classification, we modelled the MDP in such a way that the predicted classes serve as the actions, $A$, and the states, $S$, are the features of each audio segment in a batch of size $\eta$. An action decision is carried out by an RL agent, which receives a reward ($r_t$) given by the following reward function:

$$r_t = \begin{cases} +1, & \text{if } a_t = g_t \\ -1, & \text{otherwise,} \end{cases} \qquad (1)$$

where $g_t$ is the ground truth value of the specific speech utterance. We modelled the probability of actions using the following equation:

$$a_t = \mathrm{Softmax}(W_a S_t + b_a), \qquad (2)$$

where $a_t$ is the action selection probability, and $W_a$ and $b_a$ are the weight and bias values. $S_t$ is the output from the previous hidden layer. The softmax function is defined as:

$$\mathrm{Softmax}(\alpha_j) = \frac{e^{\alpha_j}}{\sum_{i=1}^{n} e^{\alpha_i}}. \qquad (3)$$
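Equations (1) to (3) can be illustrated with a short sketch. It is not the implementation used in this work: the function names, the toy weights, and the choice of taking the most probable class as the selected action are assumptions made for illustration only.

```python
import math

def softmax(logits):
    """Equation (3): normalised exponentials of the logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def action_probabilities(W_a, b_a, S_t):
    """Equation (2): softmax over the affine map W_a S_t + b_a."""
    logits = [sum(w * s for w, s in zip(row, S_t)) + b
              for row, b in zip(W_a, b_a)]
    return softmax(logits)

def reward(a_t, g_t):
    """Equation (1): +1 if the chosen class matches the ground truth, else -1."""
    return 1 if a_t == g_t else -1

# Hypothetical toy values: 3 classes, 2 hidden features.
W_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_a = [0.0, 0.0, 0.0]
S_t = [0.5, 2.0]

probs = action_probabilities(W_a, b_a, S_t)
a_t = max(range(len(probs)), key=probs.__getitem__)  # most probable class
print(a_t, reward(a_t, g_t=1))
```

Subtracting the maximum logit before exponentiating leaves the result of Equation (3) unchanged and only guards against overflow.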

The target of the RL agent is to maximise the expected return under the policy

$$J_a(\theta_a, \theta_s) = \mathbb{E}_{\pi(a_t \mid s_t; \theta_a, \theta_s)}[r_t], \qquad (4)$$

where $\pi(a_t \mid s_t; \theta_a, \theta_s)$ is the policy of the agent, and $r_t$ is the reward returned at step $t$.

2.2.1 REINFORCE Training

The REINFORCE algorithm is used to approximate the gradient to maximise the objective function $J_a(\theta_a, \theta_s)$.
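The gradient estimator itself is not written out here. For reference, the standard score-function (REINFORCE) form for an objective such as Equation (4) is given below, assuming that actions are drawn from the policy and writing $\theta = (\theta_a, \theta_s)$. Averaging over the $\eta$ steps of one episode is an assumption of this sketch.

```latex
\nabla_{\theta} J_a(\theta)
  = \mathbb{E}_{\pi(a_t \mid s_t;\, \theta)}
      \left[ r_t \, \nabla_{\theta} \log \pi(a_t \mid s_t;\, \theta) \right]
  \approx \frac{1}{\eta} \sum_{t=1}^{\eta}
      r_t \, \nabla_{\theta} \log \pi(a_t \mid s_t;\, \theta),
  \qquad \theta = (\theta_a, \theta_s)
```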

Algorithm 1: REINFORCE algorithm implementation

    initialise state space
    initialise policy network model
    pre-train policy network
    retrieve initial state s_1
    for i ← 1 to NE do
        initialise E_s, E_a, E_r
        while !d do
            a_i ← get action(s_i)
            s_{i+1}, r_i, d ← execute(a_i)
            E_s ← s_i + E_s
            E_r ← r_i + E_r
            E_a ← a_i + E_a
        end
        train(E_s, E_a, E_r)
    end

Algorithm 1 describes the steps followed throughout the RL action prediction process, where NE is the maximum number of episodes to run (10,000 in our experiments), $s_i$ is the state at instant $i$, $a_i$ is the predicted action for $s_i$ at the $i$th instant, $r_i$ is the reward obtained by executing the predicted action $a_i$, and $d$ is a boolean flag indicating the end of an episode, which is reached when $i$ equals the step size $\eta$ (50). $E_s$, $E_a$, and $E_r$ are arrays collecting the values of $s_i$, $a_i$, and $r_i$ for each step, which are consumed by the policy model’s training method train.

For a given set of examples in the state space $S$, the environment initially sends $s_1$ to the RL agent. The RL agent infers the corresponding action probabilities through the policy network, selects the most probable action $a_1$, and returns it to the environment. The environment then calculates the reward $r_1$ for the state-action combination and returns the reward and the next state $s_2$ to the agent. Each $r_i$, $a_i$, and $s_i$ is stored, and the policy network is then trained with the values collected in an episode.
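The interaction loop described above can be sketched in a few lines. This is a toy illustration rather than the code used in this work: the environment (`ToyEnvironment`), the single-layer `LinearPolicy` (a stand-in for the CNN-LSTM policy network), the learning rate, and the reduced episode count are all hypothetical, and the pre-training step is omitted. The sketch follows the text in choosing the most probable action; sampling from the policy is the more common choice in REINFORCE.

```python
import math
import random

ETA = 50          # steps per episode (the step size used in this work)
N_EPISODES = 200  # illustrative only; far fewer than 10,000
LR = 0.1          # hypothetical learning rate for this toy problem

class ToyEnvironment:
    """Hypothetical environment: noisy 1-D feature plus bias, two classes."""
    def __init__(self, rng):
        self.rng = rng
        self.step_count = 0
        self.label = None

    def _sample(self):
        self.label = self.rng.randrange(2)
        centre = 1.0 if self.label == 1 else -1.0
        return [centre + self.rng.gauss(0, 0.3), 1.0]

    def reset(self):
        self.step_count = 0
        return self._sample()

    def execute(self, action):
        r = 1 if action == self.label else -1  # reward of Equation (1)
        self.step_count += 1
        done = self.step_count >= ETA
        return self._sample(), r, done

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

class LinearPolicy:
    """Softmax policy with one linear layer (not the CNN-LSTM network)."""
    def __init__(self, n_features, n_actions):
        self.W = [[0.0] * n_features for _ in range(n_actions)]

    def probs(self, s):
        return softmax([sum(w * x for w, x in zip(row, s)) for row in self.W])

    def get_action(self, s):
        p = self.probs(s)
        return max(range(len(p)), key=p.__getitem__)  # most probable action

    def train(self, E_s, E_a, E_r):
        # Accumulate r * grad log pi(a|s) over the episode, then take one step.
        grad = [[0.0] * len(row) for row in self.W]
        for s, a, r in zip(E_s, E_a, E_r):
            p = self.probs(s)
            for k in range(len(self.W)):
                g = (1.0 if k == a else 0.0) - p[k]  # d log pi / d logit_k
                for j, x in enumerate(s):
                    grad[k][j] += r * g * x
        for k, row in enumerate(self.W):
            for j in range(len(row)):
                row[j] += LR * grad[k][j] / len(E_s)

def run(n_episodes=N_EPISODES, seed=0):
    rng = random.Random(seed)
    env = ToyEnvironment(rng)
    policy = LinearPolicy(n_features=2, n_actions=2)  # pre-training omitted
    scores = []
    for _ in range(n_episodes):
        E_s, E_a, E_r = [], [], []
        s, done = env.reset(), False
        while not done:
            a = policy.get_action(s)
            s_next, r, done = env.execute(a)
            E_s.append(s)
            E_a.append(a)
            E_r.append(r)
            s = s_next
        policy.train(E_s, E_a, E_r)
        scores.append(sum(E_r))  # episode score: sum of rewards
    return scores

scores = run()
print(scores[0], scores[-1])
```

Each entry of `scores` is the sum of the rewards collected in one episode, so it lies between -50 and 50 for a step size of 50.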

3 Experimental Setup

3.1 Datasets

The Speech Commands Dataset [16] is an audio corpus of 105,829 utterances containing 30 command keywords spoken by 2,618 speakers. Each utterance is stored as a one-second file in the ‘.wav’ format with a 16 kHz sampling rate. The dataset consists of two main subsets of command keywords, namely “Main Commands” and “Sub Commands”. Table 1 shows the distribution of the 30 keywords among the two subsets.

Table 1: Distribution of keywords in the Speech Commands Dataset
Subset | Commands
Main Commands | one, two, three, four, five, six, seven, eight, nine, down, go, left, no, off, on, right, stop, up, yes, zero
Sub Commands | bed, bird, cat, dog, happy, house, Marvin, Sheila, tree, wow

3.2 Feature Extraction

In this study, we use Mel Frequency Cepstral Coefficients (MFCC) to represent the speech signal. MFCCs are very popular features and widely used in speech and audio analysis/processing [17, 21]. We extract 40 MFCCs from the mel-spectrograms with a frame length of 2,048 and a hop length of 512 using Librosa [22].
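The number of frames produced for a clip follows from its length in samples and the hop length. The sketch below assumes centred framing, as in Librosa's default short-time transform, where the frame count is 1 + floor(N / hop) for N samples. The function name and the listed sample rates are illustrative assumptions; the rate at which the audio was loaded for feature extraction is not stated in this section.

```python
def n_frames(n_samples, hop_length=512):
    """Frame count under centred framing: 1 + floor(n_samples / hop_length)."""
    return 1 + n_samples // hop_length

# Hypothetical sample counts for a one-second clip at common sample rates.
for sr in (16000, 22050, 44100):
    print(sr, n_frames(sr))
```

Under this assumption, a clip of 44,100 samples yields 87 frames with a hop length of 512.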

3.3 Model Recipe

We use the TensorFlow library to implement the model, a combination of CNN and LSTM. The initial layers are 1D convolution layers wrapped in time-distributed wrappers with filter sizes of 16 and 8, respectively, followed by a max-pooling layer. The feature maps are then passed to an LSTM layer of 50 cells for learning the temporal features. A dropout layer with a dropout rate of 0.3 is used for regularisation. Finally, three fully connected layers of 512, 256, and 64 units, respectively, are added before the softmax layer. The input to the model is a matrix of size $n \times f$, where $n$ is the number of MFCCs (40), and $f$ is the number of frames (87) in the MFCC spectrum. We use a stochastic gradient descent optimiser with a learning rate of $10^{-4}$.

The score value $V_j$ is defined as the sum of the rewards $r_i$ produced within the $j$th episode. The score variable can be utilised to infer the overall accuracy ($H_j$) of the RL agent within a given episode:

$$H_j = \frac{V_j - \eta \times \min(r)}{\eta \times (\max(r) - \min(r))} \times 100\%. \qquad (5)$$
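As a worked example of Equation (5), the following sketch maps an episode score to an accuracy, assuming the step size of 50 and the rewards of -1 and +1 defined in Equation (1). The function name is hypothetical.

```python
def accuracy_percent(score, eta=50, r_min=-1, r_max=1):
    """Equation (5): rescale a score from [eta*r_min, eta*r_max] to [0, 100]."""
    return (score - eta * r_min) / (eta * (r_max - r_min)) * 100.0

print(accuracy_percent(50))   # every step correct
print(accuracy_percent(0))    # half of the steps correct
print(accuracy_percent(-50))  # every step wrong
```

A score of 0 therefore corresponds to 50% accuracy, and negative scores to accuracies below 50%.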

4 Results

Experiments were carried out focusing on the effect that pre-training has on the accuracy of the RL agent. Three subsets of the Speech Commands Dataset were selected, namely “binary”, “20 class”, and “30 class”. The binary subset contains only the speech commands of the “left” and “right” classes. The 20-class and 30-class subsets contain the “Main” commands and the union of the “Main” and “Sub” commands, respectively. Each subset was evaluated both without and with pre-training.

Fig. 3: Performance evaluations of the model on three different scenarios
Fig. 4: Standard Deviation of the score with episode on three different scenarios

The mean score over batches of 200 episodes is plotted against the episode number in Figure 3. The graphs in Figure 3 show that pre-training increases the overall score. With pre-training, the binary classification task nearly reaches the maximum score of 50 within the initial 2,500 episodes, and the other two subsets reach a score over 25 within 10,000 episodes.

The rates of change of score (velocity of score) for the initial 500 and 1,000 episodes were calculated by Equation 6 and tabulated in Table 2.

$$\mathrm{Velocity}_x = \frac{\mathrm{mean}(V_x : V_{x+5}) - \mathrm{mean}(V_0 : V_5)}{x}. \qquad (6)$$
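Equation (6) can be evaluated on a list of episode scores as sketched below. The function name and the synthetic score trace are hypothetical, and reading each range as a window of five consecutive scores is an assumption of the sketch.

```python
from statistics import mean

def velocity(scores, x, window=5):
    """Equation (6): change in windowed mean score per episode up to episode x."""
    if x <= 0:
        raise ValueError("x must be a positive episode index")
    return (mean(scores[x:x + window]) - mean(scores[:window])) / x

# Hypothetical score trace that rises by exactly 1 per episode.
scores = list(range(1100))
print(velocity(scores, 500))
print(velocity(scores, 1000))
```

For this linear trace the velocity equals the slope of the trace at both episode counts.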

Table 2 shows that pre-training increases the velocity of score over the initial 1,000 episodes for all three subsets. This suggests that pre-training can reduce the time the RL agent takes to converge to a better accuracy.

Table 2: Velocity of score change in the initial episodes.
# Classes | Change of velocity of score (%), initial 500 episodes | Change of velocity of score (%), initial 1,000 episodes
2 | -0.9 | 4.4
20 | 15.2 | 8.8
30 | 6.4 | 11.2

Table 3: Improvement of the score with pre-training.
# Classes | Score after 10,000 episodes, w/o pre-train | Score after 10,000 episodes, w/ pre-train | Improvement (%)
2 | 29.4 | 49.0 | 19.6
20 | -23.8 | 37.0 | 60.8
30 | -23.4 | 29.0 | 52.4

Table 3 shows the mean score of the last five episodes for the “with” (w/) and “without” (w/o) pre-training scenarios. The improvement column shows the increase in score of the “with pre-train” scenario relative to the “without pre-train” scenario, calculated by Equation 7. All improvements are positive, which indicates that the overall final score (accuracy) of the RL agent’s policy network model can be improved by pre-training.

$$\mathrm{Improvement} = \frac{x_{\mathrm{with}} - x_{\mathrm{without}}}{\eta \times (\max(r) - \min(r))} \times 100\%. \qquad (7)$$
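Equation (7) can be checked against the scores listed in Table 3. The sketch below assumes the step size of 50 and rewards of -1 and +1; the function name is hypothetical.

```python
def improvement_percent(score_with, score_without, eta=50, r_min=-1, r_max=1):
    """Equation (7): score difference as a percentage of the score range."""
    return (score_with - score_without) / (eta * (r_max - r_min)) * 100.0

# (classes, score w/o pre-train, score w/ pre-train) as listed in Table 3
table3 = [(2, 29.4, 49.0), (20, -23.8, 37.0), (30, -23.4, 29.0)]
for n_classes, without, with_ in table3:
    print(n_classes, round(improvement_percent(with_, without), 1))
```

With these inputs the computed values match the Improvement column of Table 3 (19.6, 60.8, and 52.4).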

According to Figure 4, the standard deviation of the score decreases rapidly with time in the pre-trained scenario. This suggests that the predictions of the RL agent become more consistent over time and that pre-training accelerates this process.

5 Conclusions

In this paper, we propose the use of pre-training in deep reinforcement learning (deep RL) for speech recognition. The proposed model uses pre-training knowledge to achieve a better score while reducing the convergence time. We evaluated the proposed RL model on the Speech Commands Dataset in three classification scenarios: binary (two different speech commands), 20-class, and 30-class tasks. Results show that pre-training helps to achieve considerably better results in fewer episodes. In future work, we want to study the feasibility of using unrelated data to pre-train deep RL to further improve its performance and convergence.


  • [1] Satinder Singh, Diane Litman, Michael Kearns, and Marilyn Walker, “Optimizing dialogue management with reinforcement learning: Experiments with the njfun system,” Journal of Artificial Intelligence Research, vol. 16, pp. 105–133, 2002.
  • [2] Gerald Tesauro, “Temporal difference learning and td-gammon,” Communications of the ACM, vol. 38, no. 3, pp. 58–68, 1995.
  • [3] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath, “A brief survey of deep reinforcement learning,” arXiv, vol. 2017, no. 1708.05866, 2017.
  • [4] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484, 2016.
  • [5] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529, 2015.
  • [6] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in Proceedings ICRA. IEEE, 2017, pp. 3357–3364.
  • [7] Gabriel V Cruz Jr, Yunshu Du, and Matthew E Taylor, “Pre-training neural networks with human demonstrations for deep reinforcement learning,” arXiv, vol. 2017, no. 1709.04083, 2017.
  • [8] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al., “Starcraft ii: A new challenge for reinforcement learning,” arXiv, vol. 2017, no. 1708.04782, 2017.
  • [9] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al., “Deep q-learning from demonstrations,” in Proceedings AAAI, 2018.
  • [10] Vitaly Kurin, Sebastian Nowozin, Katja Hofmann, Lucas Beyer, and Bastian Leibe, “The atari grand challenge dataset,” arXiv, no. 1705.10998, 2017.
  • [11] Sylvain Calinon, Learning from Demonstration (Programming by Demonstration), pp. 1–8, Springer Berlin Heidelberg, Berlin, Heidelberg, 2018.
  • [12] Farnaz Abtahi and Ian Fasel, “Deep belief nets as function approximators for reinforcement learning,” in Proceedings Workshops AAAI, 2011.
  • [13] Charles W Anderson, Minwoo Lee, and Daniel L Elliott, “Faster reinforcement learning after pretraining deep networks to predict state dynamics,” in Proceedings IJCNN. IEEE, 2015, pp. 1–7.
  • [14] Jack Stockholm and Philippe Pasquier, “Reinforcement learning of listener response for mood classification of audio,” in Proceedings ICCSE. IEEE, 2009, vol. 4, pp. 849–853.
  • [15] Egor Lakomkin, Mohammad Ali Zamani, Cornelius Weber, Sven Magg, and Stefan Wermter, “Emorl: continuous acoustic emotion classification using deep reinforcement learning,” in Proceedings ICRA. IEEE, 2018, pp. 1–6.
  • [16] Pete Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv, vol. 2018, no. 1804.03209.
  • [17] Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, and Julien Epps, “Direct modelling of speech emotion from raw speech,” arXiv, vol. 2019, no. 1904.03833, 2019.
  • [18] Siddique Latif, Muhammad Usman, Rajib Rana, and Junaid Qadir, “Phonocardiographic sensing using deep learning for abnormal heartbeat detection,” IEEE Sensors Journal, vol. 18, no. 22, pp. 9393–9400, 2018.
  • [19] Rajib Rana, Siddique Latif, Sara Khalifa, and Raja Jurdak, “Multi-task semi-supervised adversarial autoencoding for speech emotion,” arXiv preprint arXiv:1907.06078, 2019.
  • [20] Joel R. Tetreault and Diane J. Litman, “A reinforcement learning approach to evaluating state representations in spoken dialogue systems,” vol. 50, no. 8, pp. 683–696.
  • [21] Steven Davis and Paul Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE transactions on acoustics, speech, and signal processing, vol. 28, no. 4, pp. 357–366, 1980.
  • [22] Brian McFee, Colin Raffel, Dawen Liang, Daniel Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto, “librosa: Audio and music signal analysis in python,” pp. 18–24.