Abstract
Finetuning pretrained models on downstream generation tasks often leads to catastrophic forgetting in zero-shot conditions. In this work, we focus on summarization and tackle the problem through the lens of language-independent representations. After training on monolingual summarization, we perform zero-shot transfer to new languages or language pairs. We first show that naively finetuned models are highly language-specific in both output behavior and internal representations, resulting in poor zero-shot performance. Next, we propose query-key (QK) finetuning to decouple task-specific knowledge from the pretrained language generation abilities. Then, after showing the downsides of the standard adversarial language classifier, we propose a balanced variant that more directly enforces language-agnostic representations. Moreover, our qualitative analyses show that removing source language identity correlates with zero-shot summarization performance. Our code is openly available.