Abstract
Conformers have shown great results in speech processing due to their ability to capture both local and global interactions. In this work, we utilize a self-supervised contrastive learning framework to train conformer-based encoders that are capable of generating unique embeddings for small segments of audio, generalizing well to previously unseen data. We achieve state-of-the-art results for audio retrieval tasks while using only 3 seconds of audio to generate embeddings. Our models are almost completely immune to temporal misalignments and achieve state-of-the-art results in cases of other audio distortions such as noise, reverb or extreme temporal stretching. Code and models are made publicly available and the results are easy to reproduce as we train and test using popular and freely available datasets of different sizes.