Abstract
Current translation systems, despite being highly multilingual, cover only 5% of the world's languages. Expanding language coverage to the long tail of low-resource languages requires data-efficient methods that rely on cross-lingual and cross-modal knowledge transfer. To this end, we propose a character-based approach to improve adaptability to new languages and modalities. Our method leverages SONAR, a multilingual fixed-size embedding space with different modules for encoding and decoding. We use a teacher-student approach with parallel translation data to obtain a character-level encoder. Then, using ASR data, we train a lightweight adapter to connect a massively multilingual CTC ASR model (MMS) to the character-level encoder, potentially enabling speech translation from 1,000+ languages. Experimental results in text translation for 75 languages on FLORES+ demonstrate that our character-based approach can achieve better language transfer than traditional subword-based models, outperforming them especially in low-resource settings and demonstrating better zero-shot generalizability to unseen languages. Our speech adaptation, which maximizes knowledge transfer from the text modality, achieves state-of-the-art results in speech-to-text translation on the FLEURS benchmark on 33 languages, surpassing previous supervised and cascade models despite being a zero-shot model with minimal supervision from ASR data.