Abstract
We study the power of cross-attention in the Transformer architecture within the context of transfer learning for machine translation, and extend the findings of studies into cross-attention when training from scratch. We conduct a series of experiments through fine-tuning a translation model on data where either the source or target language has changed. These experiments reveal that fine-tuning only the cross-attention parameters is nearly as effective as fine-tuning all parameters (i.e., the entire translation model). We provide insights into why this is the case and observe that limiting fine-tuning in this manner yields cross-lingually aligned embeddings. The implications of this finding for researchers and practitioners include a mitigation of catastrophic forgetting, the potential for zero-shot translation, and the ability to extend machine translation models to several new language pairs with reduced parameter storage overhead.