Abstract
Existing task-specific predictive approaches for audio-and-language focus on building complicated late-fusion mechanisms. However, these models tend to overfit when labels are limited and generalize poorly. In this paper, we present a Cross-modal Transformer for Audio-and-Language (CTAL), which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large number of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our pre-trained model on multiple downstream audio-and-language tasks, we observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification. On this basis, we further propose a specially designed fusion mechanism that can be used in the fine-tuning phase, which allows our pre-trained model to achieve better performance. Lastly, we present detailed ablation studies to demonstrate that both our novel cross-modality fusion component and our audio-language pre-training methods contribute significantly to the promising results.