Adaptive Sparse Transformer for Multilingual Translation

Abstract

Multilingual machine translation has attracted much attention recently due to its support of knowledge transfer among languages and the low cost of training and deployment compared with numerous bilingual models. A known challenge of multilingual models is negative language interference. To enhance translation quality, deeper and wider architectures are applied to multilingual modeling for larger model capacity, which at the same time increases the inference cost. Recent studies have pointed out that parameters shared among languages are the cause of interference, while they may also enable positive transfer. Based on these insights, we propose an adaptive and sparse architecture for multilingual modeling, and train the model to learn shared and language-specific parameters to improve positive transfer and mitigate interference. The sparse architecture activates only a sub-network, which preserves inference efficiency, and the adaptive design selects different sub-networks based on the input languages. Our model outperforms strong baselines across multiple benchmarks. On the large-scale OPUS dataset with $100$ languages, we achieve $+2.1$, $+1.3$ and $+6.2$ BLEU improvements in one-to-many, many-to-one and zero-shot tasks respectively, compared to the standard Transformer, without increasing the inference cost.
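The idea of activating a single language-dependent sub-network can be illustrated with a toy sketch. This is not the model described in the paper: the class `AdaptiveSparseLayer`, the helper `make_scale_block`, and the hard-coded per-language flags are all hypothetical, and the "blocks" are plain scaling functions standing in for Transformer sub-layers. In the abstract the choice between shared and language-specific parameters is learned during training; here it is fixed by hand, purely to show that only one block runs per input.

```python
from typing import Callable, Dict, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_scale_block(factor: float) -> Block:
    """Toy stand-in for a Transformer sub-layer: scales every element."""
    return lambda x: [factor * v for v in x]


class AdaptiveSparseLayer:
    """Hypothetical layer with one shared block and optional per-language blocks.

    Exactly one block is executed per call, so the per-input compute does not
    grow with the number of languages (the "sparse" part). Which block runs
    depends on the language tag (the "adaptive" part).
    """

    def __init__(
        self,
        shared: Block,
        specific: Dict[str, Block],
        use_specific: Dict[str, bool],
    ) -> None:
        self.shared = shared
        self.specific = specific
        # Assumption: in a real system these flags would be learned;
        # here they are supplied by the caller.
        self.use_specific = use_specific

    def forward(self, x: Vector, lang: str) -> Vector:
        # Route to the language-specific block only if it exists and is enabled.
        if self.use_specific.get(lang, False) and lang in self.specific:
            return self.specific[lang](x)
        # Otherwise fall back to the parameters shared across languages.
        return self.shared(x)


layer = AdaptiveSparseLayer(
    shared=make_scale_block(2.0),
    specific={"de": make_scale_block(3.0), "fr": make_scale_block(5.0)},
    use_specific={"de": True, "fr": False},
)

print(layer.forward([1.0, 2.0], "de"))  # uses the "de" block
print(layer.forward([1.0, 2.0], "fr"))  # "fr" block disabled, uses shared
```

Stacking several such layers, each with its own flags, would give every language its own path through the network while the depth executed per sentence stays the same.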
