Abstract
We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework modeled on the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete \textbf{cochlear tokens}. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, as well as state-of-the-art lexical semantics. It also shows competitive performance on diverse downstream SUPERB speech tasks. Complementing these representational capabilities, AuriStream generates continuations of audio that can be visualized in spectrogram space and decoded back into audio, providing insights into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.