Audio Captioning Transformer

Abstract

Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free. The proposed method has a better ability to model the global information within an audio signal as well as capture temporal relationships between audio events. We evaluate our model on AudioCaps, which is the largest audio captioning dataset publicly available. Our model shows competitive performance compared to other state-of-the-art approaches.
