ASiT: Audio Spectrogram vIsion Transformer for General Audio Representation

Abstract

Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and by limited labeled data, most transformer-based models for audio tasks are finetuned from ImageNet-pretrained models, despite the huge gap between the natural image domain and the audio domain. This has motivated research into self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose ASiT, a novel self-supervised transformer for general audio representations that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts performance on all tasks and sets a new state of the art on five audio and speech classification tasks, outperforming recent methods, including approaches that use additional datasets for pretraining. The code and pretrained weights will be made publicly available to the scientific community.
