Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Abstract

Tokenization is the first -- and often least scrutinized -- step of most NLPpipelines. Standard algorithms for learning tokenizers rely on frequency-basedobjectives, which favor languages dominant in the training data andconsequently leave lower-resource languages with tokenizations that aredisproportionately longer, morphologically implausible, or even riddled with<UNK> placeholders. This phenomenon ultimately amplifies computational andfinancial inequalities between users from different language backgrounds. Toremedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant ofthe widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizesthe compression gain of the currently worst-compressed language, trading asmall amount of global compression for cross-lingual parity. We findempirically that Parity-aware BPE leads to more equitable token counts acrosslanguages, with negligible impact on global compression rate and no substantialeffect on language-model performance in downstream tasks.

Quick Read (beta)

loading the full paper ...