Demystify Optimization Challenges in Multilingual Transformers

  • 2021-04-15 17:51:03
  • Xian Li, Hongyu Gong
  • 13

Abstract

Multilingual Transformers improve parameter efficiency and crosslingual transfer. How to train multilingual models effectively has not been well studied. Using multilingual machine translation as a testbed, we study optimization challenges from loss landscape and parameter plasticity perspectives. We found that imbalanced training data causes task interference between high- and low-resource languages, characterized by nearly orthogonal gradients for major parameters and an optimization trajectory mostly dominated by high-resource languages. We show that the local curvature of the loss surface affects the degree of interference, and that existing data subsampling heuristics implicitly reduce the sharpness, although they still face a trade-off between high- and low-resource languages. We propose a principled multi-objective optimization algorithm, Curvature Aware Task Scaling (CATS), which improves both optimization and generalization, especially for low-resource languages. Experiments on TED, WMT and OPUS-100 benchmarks demonstrate that CATS advances the Pareto front of accuracy while being efficient to apply to massive multilingual settings at the scale of 100 languages.
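
The "nearly orthogonal gradients" observation can be made concrete with a small sketch. The abstract does not say how gradient alignment is measured, so cosine similarity is an assumption here, and the function name and toy gradient values below are hypothetical illustrations, not results from the paper. A similarity near 0 means an update that helps one language does little for the other.

```python
import math


def cosine_similarity(g_a, g_b):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(x * y for x, y in zip(g_a, g_b))
    norm_a = math.sqrt(sum(x * x for x in g_a))
    norm_b = math.sqrt(sum(y * y for y in g_b))
    # A zero gradient has no direction; report 0.0 rather than dividing by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical gradients of the same shared parameters, computed on a
# high-resource and a low-resource language batch (made-up numbers).
grad_high = [1.0, 0.0, 2.0]
grad_low = [0.0, 3.0, 0.0]

# These toy vectors share no nonzero coordinate, so they are exactly orthogonal.
sim = cosine_similarity(grad_high, grad_low)
```

In a real model the gradients would be flattened per parameter group before comparison; that choice of granularity is likewise an assumption of this sketch.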
