Augmenting conformers with structured state space models for online speech recognition

Abstract

Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), which are a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. We perform systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions. We find that the most effective design is to stack a small S4 using real-valued recurrent weights with a local convolution, allowing them to work complementarily. Our best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.
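
As a rough illustration of the design described above, the following sketch stacks a diagonal state-space recurrence with real-valued weights and a short causal convolution. It is a minimal toy under stated assumptions, not the paper's implementation: the function names, the single-channel setup, the parameter values, and the ordering (recurrence first, convolution second) are all hypothetical, and the S4 initialization and discretization are omitted.

```python
import numpy as np


def diagonal_ssm(u, a, b, c):
    """Causal diagonal state-space recurrence with real-valued weights.

    Assumed toy form: x_t = a * x_{t-1} + b * u_t,  y_t = sum(c * x_t).
    u has shape (T,); a, b, c have shape (N,), one entry per state dimension.
    """
    x = np.zeros_like(a, dtype=float)
    y = np.empty(len(u), dtype=float)
    for t, u_t in enumerate(u):
        x = a * x + b * u_t      # fixed-size state summarizes all left context
        y[t] = np.dot(c, x)
    return y


def causal_conv(u, kernel):
    """Local causal convolution: y_t = sum_k kernel[k] * u_{t-k}."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), u])  # left padding only
    return np.array([np.dot(kernel[::-1], padded[t:t + k]) for t in range(len(u))])


def s4_then_conv(u, a, b, c, kernel):
    """Hypothetical stacking order: long-range recurrence, then local convolution."""
    return causal_conv(diagonal_ssm(u, a, b, c), kernel)


if __name__ == "__main__":
    u = np.array([1.0, 0.0, 0.0, 0.0])  # unit impulse
    out = s4_then_conv(u, np.array([0.5]), np.array([1.0]),
                       np.array([1.0]), np.array([1.0, 1.0]))
    print(out)
```

Because the recurrence only carries a fixed-size state forward and the convolution pads only on the left, the output at frame t depends solely on frames up to t, which is the constraint that online recognition imposes.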
