Abstract
The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12 and 35-command tasks respectively.