TGIF: Tree-Graph Integrated-Format Parser for Enhanced UD with Two-Stage Generic- to Individual-Language Finetuning

Abstract

We present our contribution to the IWPT 2021 shared task on parsing into enhanced Universal Dependencies. Our main system component is a hybrid tree-graph parser that integrates (a) predictions of spanning trees for the enhanced graphs with (b) additional graph edges not present in the spanning trees. We also adopt a finetuning strategy where we first train a language-generic parser on the concatenation of data from all available languages, and then, in a second step, finetune on each individual language separately. Additionally, we develop our own complete set of pre-processing modules relevant to the shared task, including tokenization, sentence segmentation, and multiword token expansion, based on pre-trained XLM-R models and our own pre-training of character-level language models. Our submission reaches a macro-average ELAS of 89.24 on the test set. It ranks top among all teams, with a margin of more than 2 absolute ELAS over the next best-performing submission, and achieves the best score on 16 out of 17 languages.
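
The tree-graph integration can be illustrated with a minimal sketch. It is not the system's actual decoder: the function name, the edge representation, and the score threshold used to select additional edges are all assumptions made for illustration. The sketch only shows the general idea of taking the union of (a) one spanning-tree edge per token and (b) extra graph edges that the tree does not contain.

```python
from typing import Dict, List, Set, Tuple

# (head, dependent, label); head 0 stands for the artificial root.
Edge = Tuple[int, int, str]


def integrate_tree_and_graph(
    tree_heads: Dict[int, Tuple[int, str]],
    extra_edge_scores: Dict[Edge, float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[Edge]:
    """Return the union of spanning-tree edges and selected extra edges."""
    # (a) Spanning tree: exactly one incoming edge per token.
    edges: Set[Edge] = {
        (head, dep, label) for dep, (head, label) in tree_heads.items()
    }
    # (b) Additional graph edges, kept when their score clears the cutoff.
    for edge, score in extra_edge_scores.items():
        if score >= threshold:
            edges.add(edge)
    # Sort by dependent, then head, then label for a stable output order.
    return sorted(edges, key=lambda e: (e[1], e[0], e[2]))


# Toy sentence: 1=She 2=came 3=and 4=left
tree = {1: (2, "nsubj"), 2: (0, "root"), 3: (4, "cc"), 4: (2, "conj")}
extras = {(4, 1, "nsubj"): 0.9, (2, 3, "cc"): 0.1}
graph = integrate_tree_and_graph(tree, extras)
```

In the toy sentence, the extra edge gives "She" a second head ("left"), which a spanning tree alone cannot express; the low-scoring candidate edge is discarded.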
