Abstract
Transformer-based models have recently reached state-of-the-art accuracy in single-channel speech separation; however, their extreme computational load makes them difficult to deploy on resource-constrained mobile or IoT devices. We thus present Papez, a lightweight and computation-efficient single-channel speech separation model. Papez is based on three key techniques. First, we replace the inter-chunk Transformer with a small auditory working memory. Second, we adaptively prune the input tokens that do not need further processing. Finally, we reduce the number of parameters by using a recurrent Transformer. Our extensive evaluation shows that Papez achieves the best trade-off between resource usage and accuracy by a large margin. We publicly share our source code at \texttt{https://github.com/snuhcs/Papez}.