Sound and Visual Representation Learning with Multiple Pretraining Tasks

Abstract

Different self-supervised learning (SSL) tasks reveal different features of the data, and the learned feature representations can perform differently on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) into a representation that generalizes well to all downstream tasks. Specifically, we investigate binaural sounds and image data in isolation. For binaural sounds, we propose three SSL tasks, namely spatial alignment, temporal synchronization of foreground objects and binaural audio, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into downstream task performance on video retrieval, spatial sound super resolution, and semantic prediction on the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models and fully supervised models in downstream task performance. As a check of applicability to another modality, we also formulate our Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods such as MoCov2, DenseCL and DetCo by 2.06%, 3.27% and 1.19%, respectively, on VOC07 classification and by +2.83, +1.56 and +1.61 AP on COCO detection. Code will be made publicly available.
