Abstract
We investigate end-to-end speech-to-text translation on a corpus of audiobooks specifically augmented for this task. Previous work investigated the extreme case where source language transcription is available neither during learning nor during decoding; we also study a midway case where source language transcription is available at training time only. In this case, a single model is trained to decode source speech into target text in a single pass. Experimental results show that it is possible to train compact and efficient end-to-end speech translation models in this setup. We also distribute the corpus and hope that our speech translation baseline on this corpus will be challenged in the future.