75 Languages, 1 Model: Parsing Universal Dependencies Universally

  • 2019-04-03 16:52:55
  • Daniel Kondratyuk
  • 11

Abstract

We present UDify, a multilingual multi-task model capable of accurately predicting universal part-of-speech, morphological features, lemmas, and dependency trees simultaneously for all 124 Universal Dependencies treebanks across 75 languages. By leveraging a multilingual BERT self-attention model pretrained on 104 languages, we found that fine-tuning it on all datasets concatenated together with simple softmax classifiers for each UD task can result in state-of-the-art UPOS, UFeats, Lemmas, UAS, and LAS scores, without requiring any recurrent or language-specific components. We evaluate UDify for multilingual learning, showing that low-resource languages benefit the most from cross-linguistic annotations. We also evaluate for zero-shot learning, with results suggesting that multilingual training provides strong UD predictions even for languages that neither UDify nor BERT have ever been trained on. Code for UDify is available at https://github.com/hyperparticle/udify.

 


Daniel Kondratyuk
Charles University
Institute of Formal and Applied Linguistics
Saarland University
Department of Computational Linguistics
[email protected]


1 Introduction

Recent work in unsupervised textual representations shows that supervised NLP models which use these representations for transfer learning are capable of boosting evaluation performance significantly (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018; Howard and Ruder, 2018). One such model, BERT (Devlin et al., 2018), defines a self-attention (Transformer) network that uses a simple masked language model strategy trained on Wikipedia, substantially improving on state-of-the-art models when its contextual embeddings are fine-tuned. The authors released several versions of the model, including a multilingual model pretrained on the entirety of the 104 languages with the largest Wikipedias. Conveniently, these languages overlap almost completely with the languages supported by the Universal Dependencies treebanks, which is what enables the training of UDify, a single model capable of annotating text for any language that has a treebank.

The Universal Dependencies (UD) framework provides syntactic annotations consistent across a large collection of languages (Nivre et al., 2018; Zeman et al., 2018). This makes it an excellent candidate for analyzing annotations across multiple languages. UD offers tokenized sentences with annotations ideal for multi-task learning, including lemmas (Lem), treebank-specific part-of-speech tags (XPOS), universal part-of-speech tags (UPOS), morphological features (Feat), and dependency edges and labels (Deps) for each sentence.

We propose UDify, a semi-supervised multi-task model for automatically producing UD annotations based on UDPipe Future, a winner of the CoNLL 2018 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies (Straka, 2018; Zeman et al., 2018). We perform three primary modifications to UDPipe Future:

  1. We replace all recurrent components with a pretrained multilingual BERT self-attention network which provides contextual embeddings, and introduce a novel layer-wise attention for each task.

  2. We apply a heavy amount of regularization to BERT, including dropout, input masking, weight freezing, discriminative fine-tuning, and layer dropout.

  3. We train UDify on the entirety of UD by concatenating all available training sets together.

We evaluate our model against UDPipe Future as a strong baseline, provide additional analysis of the languages whose predictions benefit the most from multilingual training, and analyze zero-shot performance on treebanks that do not have a training set.

Our work uses the AllenNLP library built for the PyTorch framework. Code for UDify and a release of the fine-tuned BERT weights is available at https://github.com/hyperparticle/udify.

2 Multilingual Multi-Task Learning

In this section, we detail the multilingual training setup and the UDify multi-task model architecture. For an architecture diagram, see Appendix A.1.

2.1 Multilingual Pretraining with BERT

BERT provides a simple yet empirically powerful method for training an unsupervised masked language model on raw text (Devlin et al., 2018). We leverage the provided BERT base multilingual cased pretrained model (https://github.com/google-research/bert/blob/master/multilingual.md), with a self-attention network of 12 layers, 12 attention heads per layer, and hidden dimensions of 768. The model was trained by predicting randomly masked input words on the entirety of the top 104 languages with the largest Wikipedias. BERT uses a wordpiece tokenizer (Wu et al., 2016), which segments all text into (unnormalized) sub-word units.

2.2 Cross-Linguistic Training Issues

Table 1 displays a list of vocabulary sizes, showing that UD treebanks possess nearly 1.6M unique tokens combined. To sidestep the problem of a ballooning vocabulary, we use BERT’s wordpiece tokenizer directly. UD expects predictions to be along word boundaries, so we take the simple approach of applying the tokenizer to each word using UD’s provided segmentation. For prediction, we use the outputs of BERT corresponding to the first wordpiece per word, ignoring the rest. (We found that last, max, or average pooling of the wordpieces was neither better nor worse for evaluation.)
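The per-word tokenization and first-wordpiece selection described above can be sketched as follows. This is a minimal illustration, not the UDify implementation: `wordpiece_align` and `toy_tokenize` are hypothetical names, and `toy_tokenize` is only a stand-in for BERT's real wordpiece tokenizer.

```python
from typing import Callable, List, Tuple


def wordpiece_align(
    words: List[str], tokenize: Callable[[str], List[str]]
) -> Tuple[List[str], List[int]]:
    """Tokenize each UD word separately and record where its first piece lands."""
    pieces: List[str] = []
    first_idx: List[int] = []
    for word in words:
        sub = tokenize(word) or ["[UNK]"]  # never let a word vanish
        first_idx.append(len(pieces))      # index of this word's first wordpiece
        pieces.extend(sub)
    return pieces, first_idx


def toy_tokenize(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: fixed-size chunks with '##' continuations."""
    chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return chunks[:1] + ["##" + c for c in chunks[1:]]


pieces, first_idx = wordpiece_align(["Parsing", "is", "fun"], toy_tokenize)
# The encoder would run over `pieces`; only the outputs at `first_idx`
# would be passed on to the per-word classifiers.
```

Because the indices are computed per word, the number of predictions always matches the number of UD words, regardless of how many wordpieces each word produces.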

The XPOS annotations are not universal across languages, or even across treebanks. Because each treebank can possess a different annotation scheme for XPOS which can slow down inference, we omit training and evaluation of XPOS from our experiments.

Token Vocab Size
Word Form 1,588,655
BERT Wordpieces 119,547
UPOS 17
XPOS 19,126
UFeats 23,974
Lemmas (tags) 109,639
Deps 251
Table 1: Vocabulary sizes of words and tags over all of UD v2.3, with a total of 12,032,309 word tokens and 668,939 sentences.

2.3 Multi-Task Learning with UD

For predicting UD annotations, we employ a multi-task network like UDPipe Future (Straka, 2018), but with all embedding, encoder, and projection layers replaced with BERT. The remaining parts include the prediction layers for each task, detailed below. We compute softmax cross entropy loss to train the network. For more details on reasons behind architecture choices, see Appendix A.1.

UPOS, UFeats: As is standard for neural sequence tagging, we apply a softmax classifier along each word input, predicting the annotation string.

Lemmas: Similar to Lematus (Bergmanis and Goldwater, 2018), we reduce the problem of lemmatization to a sequence tagging problem by predicting a class representing the sequence of character operations to transform the word form to the lemma. To precompute the tags, we first find the longest common substring between the form and the lemma, and then compute the shortest edit script converting the prefix and suffix of the form into the prefix and suffix of the lemma using the Wagner–Fischer algorithm (Wagner and Fischer, 1974). Upon predicting a lemma edit script, we apply the edit operations to the word form to produce the final lemma. See also Straka (2018) for more details.
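To make the lemma-as-tag idea concrete, here is a simplified sketch. It keeps the longest-common-substring step but, as an assumption made for brevity, replaces the Wagner–Fischer edit script with a plain "cut and replace" rule for the prefix and suffix. The names `make_rule` and `apply_rule` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Tuple

# (chars to cut from the front, new prefix, chars to cut from the end, new suffix)
Rule = Tuple[int, str, int, str]


def make_rule(form: str, lemma: str) -> Rule:
    """Derive a lemma rule (the class to predict) from a form/lemma pair."""
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    m = matcher.find_longest_match(0, len(form), 0, len(lemma))
    cut_prefix = m.a                           # form characters before the match
    cut_suffix = len(form) - (m.a + m.size)    # form characters after the match
    return (cut_prefix, lemma[:m.b], cut_suffix, lemma[m.b + m.size:])


def apply_rule(form: str, rule: Rule) -> str:
    """Apply a predicted rule to a word form to produce its lemma."""
    cut_prefix, new_prefix, cut_suffix, new_suffix = rule
    stem = form[cut_prefix:len(form) - cut_suffix]
    return new_prefix + stem + new_suffix


rule = make_rule("walked", "walk")    # (0, "", 2, "")
lemma = apply_rule("jumped", rule)    # the same rule generalizes to other forms
```

The point of the reduction is that many form/lemma pairs share one rule, so the tagger predicts from a closed set of classes instead of generating characters.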

Deps: We use the graph-based biaffine attention parser developed by Dozat and Manning (2016), replacing the bidirectional LSTM layers with BERT. The final embeddings are projected through arc-head and arc-dep feedforward layers, which are combined using biaffine attention to produce a probability distribution of arc heads for each word. We then decode each tree with the Chu-Liu/Edmonds algorithm (Chu, 1965; Edmonds, 1967).
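As a rough illustration of the arc-scoring step, the sketch below computes biaffine scores of the form dep^T U head + u^T head and normalizes them over candidate heads. It is a toy in pure Python with hypothetical names (`biaffine_arc_probs`, `greedy_heads`). The greedy argmax at the end is an assumption made for brevity and does not guarantee a well-formed tree, which is why the text above decodes with Chu-Liu/Edmonds instead.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def biaffine_arc_probs(dep: Matrix, head: Matrix, U: Matrix, u: Vector) -> Matrix:
    """probs[i][j] = probability that candidate j is the head of word i."""
    probs: Matrix = []
    for d in dep:
        # d^T U, then add the head-only bias vector u.
        dU = [sum(d[a] * U[a][b] for a in range(len(d))) for b in range(len(u))]
        scores = [sum((dU[b] + u[b]) * h[b] for b in range(len(h))) for h in head]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


def greedy_heads(probs: Matrix) -> List[int]:
    """Pick the most probable head per word (no tree constraint enforced)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Toy example: candidate 0 is ROOT, candidates 1 and 2 are the two words.
head_repr = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
dep_repr = [[2.0, 0.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
arc_probs = biaffine_arc_probs(dep_repr, head_repr, identity, [0.0, 0.0])
heads = greedy_heads(arc_probs)
```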

3 Fine-Tuning BERT on UD Annotations

We employ several strategies for fine-tuning BERT for UD prediction, finding that regularization is absolutely crucial for producing a high-scoring network.

3.1 Layer Attention

Empirical results suggest that when fine-tuning BERT, combining the output of the last several layers is more beneficial for the downstream tasks than just using the last layer (Devlin et al., 2018). Instead of restricting the model to any subset of layers, we devise a simple layer-wise dot-product attention where the network computes a weighted sum of all intermediate outputs of the 12 BERT layers using the same weights for each token.

More formally, let $\alpha_i$ be a trainable scalar for the BERT embeddings $\mathbf{B}_{ij}$ at layer $i$ for the token at position $j$, and let $\beta$ be a global trainable scalar. We compute contextual embeddings $\mathbf{e}$ such that

$$\mathbf{e}_j = \beta \sum_i \mathbf{B}_{ij} \cdot \mathrm{softmax}(\boldsymbol{\alpha})_i \qquad (1)$$

To prevent the UD classifiers from overfitting to the information in any single layer, we devise layer dropout, where at each training step, we set each parameter $\alpha_i$ to $-\infty$ with probability 0.1. This effectively redistributes probability mass to all other layers, forcing the network to incorporate the information content of all BERT layers. We compute layer attention per task, using one set of $\boldsymbol{\alpha}, \beta$ parameters for each of UPOS, UFeats, Lemmas, and Deps.
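The weighted sum in Equation 1 and the effect of layer dropout can be sketched for a single token position as below. This is a pure-Python toy with hypothetical function names, not the PyTorch code used for training. Raising an error when every layer is dropped is an assumption of this sketch.

```python
import math
import random
from typing import AbstractSet, List, Set, Tuple


def layer_attention(
    layers: List[List[float]],   # one embedding B_ij per BERT layer i, fixed token j
    alpha: List[float],          # trainable scalars, one per layer
    beta: float,                 # global trainable scale
    dropped: AbstractSet[int] = frozenset(),
) -> Tuple[List[float], List[float]]:
    """Return (e_j, softmax weights) following Equation 1."""
    logits = [-math.inf if i in dropped else a for i, a in enumerate(alpha)]
    top = max(logits)
    if top == -math.inf:
        raise ValueError("at least one layer must be kept")
    exps = [math.exp(l - top) for l in logits]  # exp(-inf) == 0.0 for dropped layers
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layers[0])
    e_j = [beta * sum(w * layer[d] for w, layer in zip(weights, layers))
           for d in range(dim)]
    return e_j, weights


def sample_dropped(num_layers: int, p: float = 0.1, rng=random) -> Set[int]:
    """Choose which layers to drop for one training step."""
    return {i for i in range(num_layers) if rng.random() < p}


toy_layers = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
full, w_full = layer_attention(toy_layers, [0.0, 0.0, 0.0], beta=2.0)
part, w_part = layer_attention(toy_layers, [0.0, 0.0, 0.0], beta=2.0, dropped={0})
```

Dropping layer 0 gives it zero weight, and the remaining layers absorb its share because the softmax renormalizes over the layers that are left.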

3.2 Transfer Learning with ULMFiT

The ULMFiT strategy defines several useful methods for fine-tuning a network on a pretrained language model (Howard and Ruder, 2018). We apply the same methods, with a few minor modifications.

We split the network into two parameter groups, i.e., the parameters of BERT and all other parameters. We apply discriminative fine-tuning, setting the base learning rate of BERT to be 5e-5 and 1e-3 everywhere else. We also freeze the BERT parameters for the first epoch to increase training stability.

While ULMFiT recommends decaying the learning rate linearly after a linear warmup, we found that this is prone to training divergence, introducing vanishing gradients and underfitting. Instead, we apply an inverse square root learning rate decay with linear warmup (Noam) seen in training Transformer networks for machine translation (Vaswani et al., 2017).
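A minimal sketch of the resulting schedule, combining the two parameter groups, the frozen first epoch, and inverse square root decay with linear warmup. Two points are assumptions of this sketch rather than details from the text: the schedule is normalized so that each group peaks at its base learning rate when warmup ends, and freezing is modelled as a learning rate of zero. The learning rates and warmup length are the values given in Table 2.

```python
import math
from typing import Dict


def noam_factor(step: int, warmup: int = 8000) -> float:
    """Linear warmup to 1.0 at `warmup` steps, then inverse square root decay."""
    step = max(step, 1)
    return min(step / warmup, math.sqrt(warmup / step))


def learning_rates(
    step: int,
    epoch: int,
    base_lr: float = 1e-3,   # all non-BERT parameters
    bert_lr: float = 5e-5,   # BERT parameters (discriminative fine-tuning)
    warmup: int = 8000,
) -> Dict[str, float]:
    """Per-group learning rates; BERT stays frozen during the first epoch."""
    factor = noam_factor(step, warmup)
    return {
        "bert": 0.0 if epoch == 0 else bert_lr * factor,
        "other": base_lr * factor,
    }


rates_frozen = learning_rates(step=4000, epoch=0)
rates_peak = learning_rates(step=8000, epoch=1)
rates_late = learning_rates(step=32000, epoch=5)
```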

3.3 Input Masking

The authors of BERT recommend not to mask words randomly with [MASK] when fine-tuning the network. However, we discovered that masking often reduces the tendency of the classifiers to overfit to BERT by forcing the network to rely on the context of surrounding words. This “word dropout” strategy has been observed in other works showing improved test performance on a variety of NLP tasks (Iyyer et al., 2015; Bowman et al., 2016; Clark et al., 2018; Straka, 2018).
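Input masking of this kind amounts to a few lines. The sketch below replaces each word independently with `[MASK]` at the mask probability listed in Table 2. The function name and the choice to mask whole words before wordpiece tokenization are assumptions of this illustration.

```python
import random
from typing import List, Optional


def mask_words(
    words: List[str],
    p: float = 0.2,
    rng: Optional[random.Random] = None,
    mask_token: str = "[MASK]",
) -> List[str]:
    """Replace each word with the mask token with probability p (training only)."""
    rng = rng or random.Random()
    return [mask_token if rng.random() < p else w for w in words]


sentence = ["The", "quick", "brown", "fox", "jumps"]
masked = mask_words(sentence, p=0.2, rng=random.Random(13))
```

The gold annotations stay attached to the masked positions, so the classifiers must predict them from the surrounding context alone.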

4 Experiments

We evaluate UDify with respect to every test set in each treebank. As there are too many results to fit within one page, we display a salient subset of scores and compare them with UDPipe Future. See Appendix A.3 for the full results.

4.1 Datasets

For all experiments, we use the full Universal Dependencies v2.3 corpus available on LINDAT (Nivre et al., 2018). We omit evaluation of datasets that do not have their training annotations freely available, i.e., ar_nyuad, en_esl, fr_ftb, qhe_heincs, and ja_bccwj.

To train the multilingual model, we concatenate all available training sets together, similar to McDonald et al. (2011). Before each epoch, we shuffle all sentences and feed mixed batches of sentences to the network, where each batch may contain sentences from any language or treebank.
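The concatenate-and-shuffle procedure can be sketched as below. The treebank contents are placeholder strings and `mixed_batches` is a hypothetical helper; using a different seed per epoch is one way to reshuffle before each epoch.

```python
import random
from typing import Dict, List, Tuple


def mixed_batches(
    treebanks: Dict[str, List[str]], batch_size: int = 32, seed: int = 0
) -> List[List[Tuple[str, str]]]:
    """Pool all training sentences, shuffle them, and cut them into batches."""
    pool = [(name, sent) for name, sents in treebanks.items() for sent in sents]
    random.Random(seed).shuffle(pool)  # vary `seed` per epoch to reshuffle
    return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]


toy_treebanks = {
    "cs_pdt": [f"cs sentence {i}" for i in range(5)],
    "kk_ktb": [f"kk sentence {i}" for i in range(2)],
}
batches = mixed_batches(toy_treebanks, batch_size=3, seed=1)
```

Since sentences are sampled uniformly from the pool, larger treebanks contribute proportionally more sentences to each batch.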

4.2 Hyperparameters

A summary of hyperparameters can be found in Table 2. For additional explanation on hyperparameter choices, see Appendix A.2.

Hyperparameter Value
Dependency tag dimension 256
Dependency arc dimension 768
Optimizer Adam
β1,β2 0.9, 0.99
Weight decay 0.01
Label Smoothing 0.03
Dropout 0.5
BERT dropout 0.2
Mask probability 0.2
Layer dropout 0.1
Batch size 32
Epochs 80
Base learning rate 1e-3
BERT learning rate 5e-5
Learning rate warmup steps 8000
Gradient clipping 5.0
Table 2: A summary of model hyperparameters.

5 Results

We show scores of UPOS, UFeats, and Lemma accuracies, along with unlabeled and labeled attachment scores (UAS, LAS) evaluated using the official CoNLL 2018 Shared Task evaluation script (https://universaldependencies.org/conll18/evaluation.html). Results for a subset of high-resource and low-resource languages are shown in Table 3, with comparison between UDPipe Future and UDify fine-tuning on all languages. In addition, the table compares UDify with fine-tuning on a single language, or both (fine-tuning multilingually, then fine-tuning on the language with the saved multilingual weights) to provide a reference point for multilingual influences on UDify. We provide a full table of scores for all treebanks in Appendix A.3.
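For readers unfamiliar with the attachment metrics, the sketch below computes UAS and LAS for a single sentence. It assumes gold tokenization, so predicted and gold words align one-to-one. The official evaluation script additionally handles alignment between system and gold tokens, which this toy does not.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label); head 0 denotes ROOT


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned words."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_arcs = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_arcs / total


gold_arcs = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
uas, las = attachment_scores(gold_arcs, pred_arcs)
```

Here three of four heads are correct, but only two words have both the correct head and the correct label, so LAS is lower than UAS.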

A more comprehensive overview is shown in Table 4, comparing different attention strategies applied to UDify. We display an average of scores over all 89 treebanks with a training set.

For zero-shot learning evaluation, Table 5 displays a subset of test set evaluations of treebanks that do not have a training set.

Finally, we plot the layer attention weights $\boldsymbol{\alpha}$ after fine-tuning BERT in Figure 1, showing a set of weights per task.

Treebank Model UPOS Feats Lem UAS LAS
Czech PDT (cs_pdt) UDPipe 99.18 97.23 99.02 93.33 91.31
Lang 99.18 96.87 98.72 94.35 92.41
UDify 99.18 96.85 98.56 94.73 92.88
UDify+Lang 99.24 97.44 98.93 95.07 93.38
German GSD (de_gsd) UDPipe 94.48 90.68 96.80 85.53 81.07
Lang 94.77 91.73 96.34 87.54 83.39
UDify 94.55 90.65 94.82 87.81 83.59
UDify+Lang 95.29 91.94 96.74 88.11 84.13
Spanish AnCora (es_ancora) UDPipe 98.91 98.49 99.17 92.34 90.26
Lang 98.60 98.14 98.52 92.82 90.52
UDify 98.53 97.84 98.09 92.99 90.50
UDify+Lang 98.68 98.25 98.68 93.35 91.28
French GSD (fr_gsd) UDPipe 97.63 97.13 98.35 90.65 88.06
Lang 98.05 96.26 97.96 92.77 90.61
UDify 97.83 96.59 97.48 93.60 91.45
UDify+Lang 97.96 96.73 98.17 93.56 91.45
Russian SynTagRus (ru_syntagrus) UDPipe 99.12 97.57 98.53 93.80 92.32
Lang 98.90 96.58 95.16 94.40 92.72
UDify 98.97 96.35 94.43 94.83 93.13
UDify+Lang 99.08 97.22 96.58 95.13 93.70
Belarusian HSE (be_hse) UDPipe 93.63 73.30 87.34 78.58 72.72
Lang 95.88 76.12 84.52 83.94 79.02
UDify 97.54 89.36 85.46 91.82 87.19
UDify+Lang 97.25 85.02 88.71 90.67 86.98
Buryat BDT (bxr_bdt) UDPipe 40.34 32.40 58.17 32.60 18.83
Lang 52.54 37.03 54.64 29.63 15.82
UDify 61.73 47.86 61.06 48.43 26.28
UDify+Lang 61.73 42.79 58.20 33.06 18.65
Upper Sorbian UFAL (hsb_ufal) UDPipe 62.93 41.10 68.68 45.58 34.54
Lang 73.70 46.28 58.02 39.02 28.70
UDify 84.87 48.63 72.73 71.55 62.82
UDify+Lang 87.58 53.19 71.88 71.40 60.65
Kazakh KTB (kk_ktb) UDPipe 55.84 40.40 63.96 53.30 33.38
Lang 73.52 46.60 57.84 50.38 32.61
UDify 85.59 65.14 77.40 74.77 63.66
UDify+Lang 81.32 60.50 67.30 69.16 53.14
Lithuanian HSE (lt_hse) UDPipe 81.70 60.47 76.89 51.98 42.17
Lang 83.40 54.34 58.77 51.23 38.96
UDify 90.47 68.96 67.83 79.06 69.34
UDify+Lang 84.53 56.98 58.21 58.40 39.91
Table 3: Test set scores for a subset of high-resource (top) and low-resource (bottom) languages in comparison to UDPipe Future, with 3 UDify configurations: Lang, fine-tune on the treebank. UDify, fine-tune on all UD treebanks combined. UDify+Lang, fine-tune on the treebank using BERT weights saved from fine-tuning on all UD treebanks combined.
Model Configuration UPOS Feats Lem UAS LAS
UDPipe 93.76 91.04 94.63 84.37 79.76
UDify Task Layer Attn 93.40 88.72 90.41 85.69 80.43
UDify Global Layer Attn 93.12 87.53 89.03 85.07 79.49
UDify Sum Layers 93.02 87.20 88.70 84.97 79.33
Table 4: Ablation comparing the average of scores over all treebanks: task-specific layer attention, global layer attention for all tasks, and simple sum of layers.
Treebank UPOS Feats Lem UAS LAS
Breton KEB (br_keb) 63.67 46.75 53.15 63.97 40.19
Tagalog TRG (tl_trg) 61.64 35.27 75.00 64.73 39.38
Faroese OFT (fo_oft) 77.86 35.71 53.82 69.28 61.03
Naija NSC (pcm_nsc) 56.59 52.75 97.52 47.13 33.43
Sanskrit UFAL (sa_ufal) 40.21 18.45 37.60 41.73 19.80
Table 5: Test set results for zero-shot learning, i.e., no UD training annotations available. Of these languages, Breton and Tagalog are among BERT’s pretraining languages; Faroese, Naija, and Sanskrit are not.

6 Discussion

On average, UDify reveals a strong set of results that are comparable in performance with the state-of-the-art in parsing UD annotations. UDify excels in dependency parsing, exceeding UDPipe Future by a large margin, especially for low-resource languages. UDify slightly underperforms with respect to Lemmas and Universal Features, likely due to UDPipe Future additionally using character-level embeddings, while (for simplicity) UDify does not. Additionally, UDify severely underperforms the baseline on a few low-resource languages, e.g., cop_scriptorium. We surmise that this is due to using mixed batches on an unbalanced training set, which skews the model towards predicting larger treebanks more accurately. However, we find that fine-tuning on the treebank individually with BERT weights saved from UDify eliminates most of these gaps in performance.

UDify also shows strong improvement leveraging multilingual data. In low-resource cases, fine-tuning BERT on all treebanks can be far superior to fine-tuning monolingually. A second round of fine-tuning on an individual treebank using UDify’s BERT weights can improve this further, but whether it helps varies from treebank to treebank.

Interestingly, Slavic languages tend to perform the best with multilingual training. Languages like Czech and Russian possess the largest UD treebanks, so their multilingual scores differ little from monolingual fine-tuning (Table 3), yet a large degree of morphological and syntactic structure appears to have transferred to low-resource Slavic languages like Belarusian and Upper Sorbian. Kazakh, a Turkic language whose treebank contains only 1k sentences, also gains substantially from multilingual training.

The zero-shot results indicate that fine-tuning BERT can result in reasonably high scores on languages that do not have a training set. A combination of BERT pretraining and multilingual learning can improve predictions for Breton and Tagalog, which implies that the network has learned representations of syntax that cross lingual boundaries. Furthermore, despite the fact that neither BERT nor UDify has directly observed Faroese, Naija, or Sanskrit, we see unusually high performance in these languages. This can be partially attributed to each language closely resembling another: Faroese overlaps heavily with Danish, Naija (Nigerian Pidgin) is a variant of English, and Sanskrit is an ancient Indo-Aryan language related to Greek, Latin, and Hindi.

Table 4 shows that layer attention on BERT for each task is beneficial for test performance, much more than using a global weighted average. In fact, Figure 1 shows that each task prefers the layers of BERT differently. All tasks favor the information content in the last 3 layers, with a tendency to disprefer layers closer to the input. However, an interesting observation is that for Lemmas and UFeats, the classifier prefers to also incorporate the information of the first 3 layers. This meshes well with the linguistic intuition that morphological features are more closely related to the surface form of a word and rely less on context than other syntactic tasks. Curiously enough, the middle layers are highly dispreferred, meaning that the useful processing for multilingual syntax (tagging, dependency parsing) occurs in the last 3-4 layers.

Figure 1: The unnormalized BERT layer attention weights $\alpha_i$ contributing to layer $i$ for each task after training. A linear change in weight scales each BERT layer exponentially due to the softmax in Equation 1.

7 Conclusion

We have proposed and evaluated UDify, a multilingual multi-task self-attention network fine-tuned on BERT pretrained embeddings, capable of producing annotations for any UD treebank, and exceeding the state-of-the-art in UD dependency parsing while being comparable in tagging and lemmatization accuracy. Strong regularization and task-specific layer attention are highly beneficial for fine-tuning, and training multilingually reduces the number of required models to one. Multilingual learning is most beneficial for low-resource languages, even ones that do not possess a training set, and can be further improved by fine-tuning monolingually using BERT weights saved from UDify’s multilingual training.

References

  • Bergmanis and Goldwater (2018) Toms Bergmanis and Sharon Goldwater. 2018. Context sensitive neural lemmatization with Lematus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1391–1400.
  • Bowman et al. (2016) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. CoNLL 2016, page 10.
  • Chu (1965) Yoeng-Jin Chu. 1965. On the shortest arborescence of a directed graph. Scientia Sinica, 14:1396–1400.
  • Clark et al. (2018) Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc V Le. 2018. Semi-supervised sequence modeling with cross-view training. arXiv preprint arXiv:1809.08370.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dozat and Manning (2016) Timothy Dozat and Christopher D Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
  • Edmonds (1967) Jack Edmonds. 1967. Optimum branchings. Journal of Research of the national Bureau of Standards B, 71(4):233–240.
  • Goldberg (2019) Yoav Goldberg. 2019. Assessing BERT’s syntactic abilities. arXiv preprint arXiv:1901.05287.
  • Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 328–339.
  • Iyyer et al. (2015) Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1681–1691.
  • McDonald et al. (2011) Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the conference on empirical methods in natural language processing, pages 62–72. Association for Computational Linguistics.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Nivre et al. (2018) Joakim Nivre, Mitchell Abrams, Željko Agić, and Ahrenberg. 2018. Universal dependencies 2.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
  • Straka (2018) Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207, Brussels, Belgium. Association for Computational Linguistics.
  • Strubell et al. (2018) Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Wagner and Fischer (1974) Robert A Wagner and Michael J Fischer. 1974. The string-to-string correction problem. Journal of the ACM (JACM), 21(1):168–173.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Zeman et al. (2018) Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21, Brussels, Belgium. Association for Computational Linguistics.

Appendix A

In this section, we illustrate the main data flow of the UDify architecture in Section A.1, explain some additional hyperparameter and architecture choices and other interesting observations in Section A.2, and show full tables of results seen in the paper in Section A.3.

A.1 UDify Architecture

Figure 2 illustrates an example sentence being processed by UDify. An input word-tokenized sentence is further tokenized into wordpieces by BERT’s tokenizer. This sentence is then processed through BERT, producing 12 hidden vectors (one for each layer) of dimension 768 for each wordpiece. The layer outputs corresponding to the first wordpiece of each word are combined using layer attention, producing 4 vectors per word, one for each of the 4 tasks. These are independently processed through dense projection and softmax operations, and are finally decoded to their respective output values.

The UPOS, UFeats, and Lemma tasks use simple projection layers for sequence tagging, while the dependency parser produces two sets of intermediate projections corresponding to tag and arc dependency predictions, which are combined together using biaffine matrix operations to calculate scores across all possible head-relation pairs. See Dozat and Manning (2016) for more details.

Figure 2: An illustration of the UDify network architecture with task-specific layer attention, inputting word tokens and outputting UD annotations for each token.

A.2 Miscellaneous Details

To the best of our knowledge, this is the first self-attention only model providing strong results in POS tagging and dependency parsing. Previous attempts have shown promise in replacing recurrent architectures with self-attention, but for simple tasks that do not have high-resource data comparable to machine translation, RNNs have been shown to be superior (Strubell et al., 2018). We have shown that with enough pretraining and regularization, we can surpass RNN performance.

Self-attention networks have been shown to excel at extracting dependencies between words (Goldberg, 2019), which manifests in strong performance in unlabeled dependency parsing.

The final multilingual UDify model was trained over 25 days on an NVIDIA GTX 1080 Ti. We use half-precision training to be able to keep the BERT model in memory.

A.2.1 Extremely Long Sentences

BERT limits its positional encoding to 512 wordpieces, causing some sentences in UD to be too long to fit into the model. We use a sliding window approach to break up long sentences into windows of 512 wordpieces, overlapping each window by 256 wordpieces. After feeding the windows into BERT, we select the first 256 wordpieces of each window and any remaining wordpieces in the last window to represent the contextual embeddings of each word in the original sentence.
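The windowing scheme above can be sketched as follows. The `encode` argument is a hypothetical stand-in for running BERT on one window and returning one output per wordpiece. The defaults follow the 512/256 values in the text.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
V = TypeVar("V")


def encode_long(
    pieces: Sequence[T],
    encode: Callable[[Sequence[T]], List[V]],
    window: int = 512,
    stride: int = 256,
) -> List[V]:
    """Encode overlapping windows and stitch one output per input wordpiece."""
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    outputs: List[V] = []
    start = 0
    while True:
        end = min(start + window, len(pieces))
        encoded = encode(pieces[start:end])
        if end == len(pieces):
            outputs.extend(encoded)        # last window: keep everything remaining
            break
        outputs.extend(encoded[:stride])   # other windows: keep the first `stride`
        start += stride
    return outputs


# Stand-in encoder that tags each piece with the size of the window it saw.
def tag_with_window_size(win: Sequence[int]) -> List[tuple]:
    return [(p, len(win)) for p in win]


stitched = encode_long(list(range(7)), tag_with_window_size, window=4, stride=2)
```

Every wordpiece receives exactly one output vector. Under this scheme the kept pieces of each non-final window always have right-hand context, while pieces at the start of later windows have no left-hand context within their window.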

A.3 Full Results of UD Scores

We show in Tables 6, 7, 8, and 9 UDify scores evaluated on all 124 treebanks with the official CoNLL 2018 Shared Task evaluation script. For comparison, we also include the full test evaluation of UDPipe Future on the subset of 89 treebanks with a training set. We also make a machine-readable version of each table with higher numerical precision available in the supplemental material and repository code.

Treebank Model UPOS UFeats Lemmas UAS LAS CLAS MLAS BLEX
Afrikaans AfriBooms af_afribooms UDPipe 98.25 97.66 97.46 89.38 86.58 81.44 77.66 77.82
UDify 97.48 96.63 95.23 86.97 83.48 77.42 70.57 70.93
Akkadian PISANDUB akk_pisandub UDify 19.92 99.51 2.32 27.65 4.54 3.27 1.04 0.30
Amharic ATT am_att UDify 15.25 43.95 58.04 17.38 3.49 4.88 0.23 2.53
Ancient Greek PROIEL grc_proiel UDPipe 97.86 92.44 93.51 85.93 82.11 77.70 67.16 71.22
UDify 91.20 82.29 76.16 78.91 72.66 66.07 50.79 47.27
Ancient Greek Perseus grc_perseus UDPipe 93.27 91.39 85.02 78.85 73.54 67.60 53.87 53.19
UDify 85.67 81.67 70.51 70.51 62.64 55.60 39.15 35.05
Arabic PADT ar_padt UDPipe 96.83 94.11 95.28 87.54 82.94 79.77 73.92 75.87
UDify 96.58 91.77 73.55 87.72 82.88 79.47 70.52 50.26
Arabic PUD ar_pud UDify 79.98 40.32 0.00 76.17 67.07 65.10 10.67 0.00
Armenian ArmTDP hy_armtdp UDPipe 93.49 82.85 92.86 78.62 71.27 65.77 48.11 60.11
UDify 94.42 76.90 85.63 85.63 78.61 73.72 46.80 59.14
Bambara CRB bm_crb UDify 30.86 57.96 20.42 30.28 8.60 6.56 1.04 0.76
Basque BDT eu_bdt UDPipe 96.11 92.48 96.29 86.11 82.86 81.79 72.33 78.54
UDify 95.45 86.80 90.53 84.94 80.97 79.52 63.60 71.56
Belarusian HSE be_hse UDPipe 93.63 73.30 87.34 78.58 72.72 69.14 46.20 58.28
UDify 97.54 89.36 85.46 91.82 87.19 85.05 71.54 68.66
Breton KEB br_keb UDify 62.78 47.12 51.31 63.52 39.84 35.14 4.64 16.34
Bulgarian BTB bg_btb UDPipe 98.98 97.82 97.94 93.38 90.35 87.01 83.63 84.42
UDify 98.89 96.18 93.49 95.54 92.40 89.59 83.43 80.44
Buryat BDT bxr_bdt UDPipe 40.34 32.40 58.17 32.60 18.83 12.36 1.26 6.49
UDify 61.73 47.45 61.03 48.43 26.28 20.61 5.51 11.68
Cantonese HK yue_hk UDify 67.11 91.01 96.01 46.82 32.01 33.35 14.29 31.26
Catalan AnCora ca_ancora UDPipe 98.88 98.37 99.07 93.22 91.06 87.18 84.48 86.18
UDify 98.89 98.34 98.14 94.25 92.33 89.27 86.21 86.61
Chinese CFL zh_cfl UDify 83.75 82.72 98.75 62.46 42.48 43.46 21.07 42.22
Chinese GSD zh_gsd UDPipe 94.88 99.22 99.99 84.64 80.50 76.79 71.04 76.78
UDify 95.35 99.35 99.97 87.93 83.75 80.33 74.36 80.28
Chinese HK zh_hk UDify 82.86 86.47 100.00 65.53 49.32 47.84 22.85 47.84
Chinese PUD zh_pud UDify 92.68 98.40 100.00 79.08 56.51 55.22 40.92 55.22
Coptic Scriptorium cop_scriptorium UDPipe 94.70 96.35 95.49 85.58 80.97 72.24 64.45 68.48
UDify 27.17 52.85 55.71 27.58 10.82 6.50 0.19 1.44
Croatian SET hr_set UDPipe 98.13 92.25 97.27 91.10 86.78 84.11 73.61 81.19
UDify 98.02 89.67 95.34 94.08 89.79 87.70 72.72 82.00
Czech CAC cs_cac UDPipe 99.37 96.34 98.57 92.99 90.71 88.84 84.30 87.18
UDify 99.14 95.42 98.32 94.33 92.41 91.03 84.68 89.21
Czech CLTT cs_cltt UDPipe 98.88 91.59 98.25 86.90 84.03 80.55 71.63 79.20
UDify 99.17 93.66 98.86 91.69 89.96 87.59 79.50 86.79
Czech FicTree cs_fictree UDPipe 98.55 95.87 98.63 92.91 89.75 86.97 81.04 85.49
UDify 98.34 91.82 98.13 95.19 92.77 90.99 77.77 88.39
Czech PDT cs_pdt UDPipe 99.18 97.23 99.02 93.33 91.31 89.64 86.15 88.60
UDify 99.18 96.69 98.52 94.73 92.88 91.64 87.13 89.95
Czech PUD cs_pud UDify 97.93 93.98 96.94 92.59 87.95 84.85 77.39 82.81
Danish DDT da_ddt UDPipe 97.78 97.33 97.52 86.88 84.31 81.20 76.29 78.51
UDify 97.50 95.41 94.60 87.76 84.50 81.60 73.76 75.15
Dutch Alpino nl_alpino UDPipe 96.83 96.33 97.09 91.37 88.38 83.51 77.28 79.82
UDify 97.67 97.66 95.44 94.23 91.21 87.32 82.81 80.76
Dutch LassySmall nl_lassysmall UDPipe 96.50 96.42 97.41 90.20 86.39 81.88 77.19 78.83
UDify 96.70 96.57 95.10 94.34 91.22 88.03 82.06 81.40
Table 6: The full test results of UDify on 124 treebanks (part 1 of 4).
Treebank Code Model UPOS UFeats Lemmas UAS LAS CLAS MLAS BLEX
English EWT en_ewt UDPipe 96.29 97.10 98.25 89.63 86.97 84.02 79.00 82.36
UDify 96.21 96.02 97.28 90.96 88.50 86.25 79.80 83.39
English GUM en_gum UDPipe 96.02 96.82 96.85 87.27 84.12 78.55 73.51 74.68
UDify 95.44 94.12 93.15 89.14 85.73 83.03 72.55 74.30
English LinES en_lines UDPipe 96.91 96.31 96.45 84.15 79.71 77.44 71.38 73.22
UDify 95.31 91.34 94.50 87.33 83.71 82.95 68.62 76.23
English PUD en_pud UDify 96.18 93.50 94.20 91.52 88.66 87.83 75.61 80.57
English ParTUT en_partut UDPipe 96.10 95.51 97.74 90.29 87.27 82.58 76.44 80.33
UDify 96.16 92.61 96.45 92.84 90.14 86.28 74.59 82.01
Erzya JR myv_jr UDify 46.66 31.82 45.73 31.90 16.38 10.83 0.58 2.83
Estonian EDT et_edt UDPipe 97.64 96.23 95.30 88.00 85.18 83.62 78.72 78.51
UDify 97.44 95.13 86.56 89.53 86.67 85.17 79.20 69.31
Faroese OFT fo_oft UDify 77.46 35.20 51.09 67.24 59.26 51.17 2.39 21.92
Finnish FTB fi_ftb UDPipe 96.65 96.62 95.49 90.68 87.89 85.11 80.58 81.18
UDify 93.80 90.38 88.80 86.37 81.40 81.01 68.16 70.15
Finnish PUD fi_pud UDify 96.48 93.84 84.64 89.76 86.58 86.64 77.83 69.12
Finnish TDT fi_tdt UDPipe 97.45 95.43 91.45 89.88 87.46 85.87 80.43 76.64
UDify 94.43 90.48 82.89 86.42 82.03 82.62 70.89 63.66
French GSD fr_gsd UDPipe 97.63 97.13 98.35 90.65 88.06 84.35 79.76 82.39
UDify 97.83 96.17 97.34 93.60 91.45 88.54 81.61 84.51
French PUD fr_pud UDify 91.67 59.65 100.00 88.36 82.76 81.74 25.24 81.74
French ParTUT fr_partut UDPipe 96.93 94.43 95.70 92.17 89.63 84.62 75.22 78.07
UDify 96.12 88.36 93.97 90.55 88.06 83.19 63.03 74.03
French Sequoia fr_sequoia UDPipe 98.79 98.09 98.57 92.37 90.73 87.55 84.51 85.93
UDify 97.89 88.97 97.15 92.53 90.05 86.67 67.98 82.52
French Spoken fr_spoken UDPipe 95.91 100.00 96.92 82.90 77.53 71.82 68.24 69.47
UDify 96.23 98.67 96.59 85.24 80.01 75.40 69.74 72.77
Galician CTG gl_ctg UDPipe 97.84 99.83 98.58 86.44 83.82 78.58 72.46 77.21
UDify 96.51 97.10 97.08 84.75 80.89 74.62 65.86 72.17
Galician TreeGal gl_treegal UDPipe 95.82 93.96 97.06 82.72 77.69 71.69 63.73 68.89
UDify 94.59 80.67 94.93 84.08 76.77 73.06 49.76 66.99
German GSD de_gsd UDPipe 94.48 90.68 96.80 85.53 81.07 76.26 58.82 72.13
UDify 94.55 90.43 94.42 87.81 83.59 80.03 61.27 72.48
German PUD de_pud UDify 89.49 30.66 94.77 89.86 84.46 80.50 2.10 72.95
Gothic PROIEL got_proiel UDPipe 96.61 90.73 94.75 85.28 79.60 76.92 66.70 72.93
UDify 95.55 85.97 80.57 85.61 79.37 76.26 63.09 58.65
Greek GDT el_gdt UDPipe 97.98 94.96 95.82 92.10 89.79 85.71 78.60 79.72
UDify 97.72 93.29 89.43 94.33 92.15 88.67 77.89 71.83
Hebrew HTB he_htb UDPipe 97.02 95.87 97.12 89.70 86.86 81.45 75.52 78.14
UDify 96.94 93.41 94.15 91.63 88.11 83.04 72.55 74.87
Hindi HDTB hi_hdtb UDPipe 97.52 94.15 98.67 94.85 91.83 88.21 78.49 86.83
UDify 97.12 92.59 98.23 95.13 91.46 87.80 75.54 86.10
Hindi PUD hi_pud UDify 87.54 22.81 100.00 71.64 58.42 53.03 3.32 53.03
Hungarian Szeged hu_szeged UDPipe 95.76 91.75 95.05 84.04 79.73 78.65 67.63 73.63
UDify 96.36 86.16 90.19 89.68 84.88 83.93 64.27 72.21
Indonesian GSD id_gsd UDPipe 93.69 95.58 99.64 85.31 78.99 76.76 67.74 76.38
UDify 93.36 93.32 98.37 86.45 80.10 78.05 66.93 76.31
Indonesian PUD id_pud UDify 76.10 44.23 100.00 77.47 56.90 54.88 7.41 54.88
Irish IDT ga_idt UDPipe 92.72 82.43 90.48 80.39 72.34 63.48 46.49 55.32
UDify 90.49 71.84 81.27 80.05 69.28 60.02 34.39 43.07
Italian ISDT it_isdt UDPipe 98.39 98.11 98.66 93.49 91.54 87.34 84.28 85.49
UDify 98.51 98.01 97.72 95.54 93.69 90.40 86.54 86.70
Italian PUD it_pud UDify 94.73 58.16 96.08 94.18 91.76 90.05 25.55 83.74
Italian ParTUT it_partut UDPipe 98.38 97.77 98.16 92.64 90.47 85.05 81.87 82.99
UDify 98.21 98.38 97.55 95.96 93.68 89.83 86.83 86.44
Italian PoSTWITA it_postwita UDPipe 96.61 96.90 97.00 86.03 81.78 77.05 72.88 74.33
UDify 96.14 95.56 93.60 87.20 82.77 78.27 72.41 71.67
Table 7: The full test results of UDify on 124 treebanks (part 2 of 4).
Treebank Code Model UPOS UFeats Lemmas UAS LAS CLAS MLAS BLEX
Japanese GSD ja_gsd UDPipe 98.13 99.98 99.52 95.06 93.73 88.35 86.37 88.04
UDify 97.08 99.97 98.80 94.37 92.08 86.19 82.99 85.12
Japanese Modern ja_modern UDify 74.94 96.14 79.70 74.99 55.62 42.67 30.89 35.47
Japanese PUD ja_pud UDify 97.89 99.98 99.31 94.89 93.62 87.92 84.86 87.15
Kazakh KTB kk_ktb UDPipe 55.84 40.40 63.96 53.30 33.38 27.06 4.82 15.10
UDify 85.59 65.49 77.18 74.77 63.66 61.84 34.23 45.51
Komi Zyrian IKDP kpv_ikdp UDify 59.92 39.32 57.56 36.01 22.12 17.45 1.54 6.80
Komi Zyrian Lattice kpv_lattice UDify 38.57 29.45 55.33 28.85 12.99 10.79 0.72 3.28
Korean GSD ko_gsd UDPipe 96.29 99.77 93.40 87.70 84.24 82.05 79.74 76.35
UDify 90.56 99.63 82.84 82.74 74.26 71.72 65.94 57.58
Korean Kaist ko_kaist UDPipe 95.59 100.00 94.30 88.42 86.48 84.12 80.72 79.22
UDify 94.67 99.98 85.89 87.57 84.52 82.05 78.27 68.99
Korean PUD ko_pud UDify 64.43 60.47 70.47 63.57 46.89 45.29 16.26 30.94
Kurmanji MG kmr_mg UDPipe 53.36 41.54 69.58 45.23 34.32 29.41 2.74 19.39
UDify 60.23 37.78 58.08 35.86 20.40 14.75 1.42 7.28
Latin ITTB la_ittb UDPipe 98.34 96.97 98.99 91.06 88.80 86.40 82.35 85.71
UDify 98.48 95.81 98.08 92.43 90.12 87.93 82.24 85.97
Latin PROIEL la_proiel UDPipe 97.01 91.53 96.32 83.34 78.66 76.20 67.40 73.65
UDify 96.79 89.49 91.79 84.85 80.52 77.96 67.18 71.00
Latin Perseus la_perseus UDPipe 88.40 79.10 81.45 71.20 61.28 56.32 41.58 45.09
UDify 90.96 82.09 81.08 78.33 69.60 65.95 50.26 51.33
Latvian LVTB lv_lvtb UDPipe 96.11 93.01 95.46 87.20 83.35 80.90 71.92 76.64
UDify 96.02 89.78 91.00 89.33 85.09 82.34 69.51 72.58
Lithuanian HSE lt_hse UDPipe 81.70 60.47 76.89 51.98 42.17 38.93 18.17 28.70
UDify 90.47 70.00 67.17 79.06 69.34 66.00 36.21 36.35
Maltese MUDT mt_mudt UDPipe 95.99 100.00 100.00 84.65 79.71 71.49 66.75 71.49
UDify 91.98 99.89 100.00 83.07 75.56 65.08 58.14 65.08
Marathi UFAL mr_ufal UDPipe 80.10 67.23 81.31 70.63 61.41 57.44 29.34 45.87
UDify 88.59 59.22 72.82 79.37 67.72 60.13 21.71 39.25
Naija NSC pcm_nsc UDify 55.44 51.32 97.03 45.75 32.16 31.62 4.73 29.33
North Sami Giella sme_giella UDPipe 92.54 90.03 88.31 78.30 73.49 70.94 62.40 61.45
UDify 90.21 83.55 71.50 74.30 67.13 64.41 51.20 40.63
Norwegian Bokmaal no_bokmaal UDPipe 98.31 97.14 98.64 92.39 90.49 88.18 84.06 86.53
UDify 98.18 96.36 97.33 93.97 92.18 90.40 85.02 87.13
Norwegian Nynorsk no_nynorsk UDPipe 98.14 97.02 98.18 92.09 90.01 87.68 82.97 85.47
UDify 98.14 96.55 97.18 94.34 92.37 90.39 85.01 86.71
Norwegian NynorskLIA no_nynorsklia UDPipe 89.59 86.13 93.93 68.08 60.07 54.89 44.47 50.98
UDify 95.01 93.36 96.13 75.40 69.60 65.33 56.90 62.27
Old Church Slavonic PROIEL cu_proiel UDPipe 96.91 90.66 93.11 89.66 85.04 83.41 73.63 77.81
UDify 84.23 71.30 65.70 76.71 66.67 64.10 46.25 43.88
Old French SRCMF fro_srcmf UDPipe 96.09 97.81 100.00 91.74 86.83 83.85 79.91 83.85
UDify 95.73 96.98 100.00 91.74 86.65 83.49 78.85 83.49
Persian Seraji fa_seraji UDPipe 97.75 97.78 97.44 90.05 86.66 83.26 81.23 80.93
UDify 96.22 94.73 92.55 89.59 85.84 81.98 76.65 74.74
Polish LFG pl_lfg UDPipe 98.80 95.49 97.54 96.58 94.76 93.01 87.04 90.26
UDify 98.80 87.71 94.04 96.67 94.58 93.03 76.50 85.15
Polish SZ pl_sz UDPipe 98.34 93.04 97.16 93.39 91.24 89.39 81.06 85.99
UDify 98.36 67.11 93.92 93.67 89.20 87.31 48.47 80.24
Portuguese Bosque pt_bosque UDPipe 97.07 96.40 98.46 91.36 89.04 85.19 76.67 83.06
UDify 97.10 89.70 91.60 91.37 87.84 84.13 69.09 78.64
Portuguese GSD pt_gsd UDPipe 98.31 99.92 99.30 93.01 91.63 87.67 85.96 86.94
UDify 98.04 95.75 98.95 94.22 92.54 89.37 82.32 87.90
Portuguese PUD pt_pud UDify 90.14 51.16 99.79 87.02 80.17 74.10 17.51 74.10
Romanian Nonstandard ro_nonstandard UDPipe 96.68 90.88 94.78 89.12 84.20 78.91 65.93 73.44
UDify 96.83 88.89 89.33 90.36 85.26 80.41 64.68 68.11
Table 8: The full test results of UDify on 124 treebanks (part 3 of 4).
Treebank Code Model UPOS UFeats Lemmas UAS LAS CLAS MLAS BLEX
Romanian RRT ro_rrt UDPipe 97.96 97.53 98.41 91.31 86.74 82.57 79.02 81.09
UDify 97.73 96.12 95.84 93.16 88.56 84.87 79.20 79.92
Russian GSD ru_gsd UDPipe 97.10 92.66 97.37 88.15 84.37 82.66 74.07 80.03
UDify 96.91 87.45 77.73 90.71 86.03 84.51 67.24 62.08
Russian PUD ru_pud UDify 93.06 63.60 77.93 93.51 87.14 83.96 37.25 61.86
Russian SynTagRus ru_syntagrus UDPipe 99.12 97.57 98.53 93.80 92.32 90.85 87.91 89.17
UDify 98.97 96.29 94.47 94.83 93.13 91.87 86.91 85.44
Russian Taiga ru_taiga UDPipe 93.18 82.87 89.99 75.45 69.11 65.31 48.81 57.21
UDify 95.39 88.47 90.19 84.02 77.80 75.12 59.71 65.15
Sanskrit UFAL sa_ufal UDify 37.33 17.63 37.38 40.21 18.56 15.38 0.85 4.12
Serbian SET sr_set UDPipe 98.33 94.35 97.36 92.70 89.27 87.08 79.14 84.18
UDify 98.30 92.22 95.86 95.68 91.95 90.30 78.45 84.93
Slovak SNK sk_snk UDPipe 96.83 90.82 96.40 89.82 86.90 84.81 74.00 81.37
UDify 97.46 89.30 93.80 95.92 93.87 92.86 77.33 85.12
Slovenian SSJ sl_ssj UDPipe 98.61 95.92 98.25 92.96 91.16 88.76 83.85 86.89
UDify 98.73 93.44 96.50 94.74 93.07 90.94 81.55 86.38
Slovenian SST sl_sst UDPipe 93.79 86.28 95.17 73.51 67.51 63.46 52.67 60.32
UDify 95.40 89.81 95.15 80.37 75.03 71.19 61.32 67.24
Spanish AnCora es_ancora UDPipe 98.91 98.49 99.17 92.34 90.26 86.39 83.97 85.51
UDify 98.53 97.89 98.07 92.99 90.50 87.26 83.43 84.85
Spanish GSD es_gsd UDPipe 96.85 97.09 98.97 90.71 88.03 82.85 75.98 81.47
UDify 95.91 95.08 96.52 90.82 87.23 82.83 72.47 78.08
Spanish PUD es_pud UDify 88.98 54.58 100.00 90.45 83.08 77.42 18.06 77.42
Swedish LinES sv_lines UDPipe 96.78 89.43 97.03 86.07 81.86 80.32 66.48 77.38
UDify 96.85 87.24 92.70 88.77 85.49 85.61 66.99 77.62
Swedish PUD sv_pud UDify 96.36 80.04 88.81 89.17 86.10 85.25 57.12 72.92
Swedish Sign Language SSLC swl_sslc UDPipe 68.09 100.00 100.00 50.35 37.94 39.51 30.96 39.51
UDify 63.48 96.10 100.00 40.43 26.95 30.12 23.29 30.12
Swedish Talbanken sv_talbanken UDPipe 97.94 96.86 98.01 89.63 86.61 84.45 79.67 82.26
UDify 98.11 95.92 95.50 91.91 89.03 87.26 80.72 81.31
Tagalog TRG tl_trg UDify 60.62 35.62 73.63 64.04 40.07 36.84 0.00 13.16
Tamil TTB ta_ttb UDPipe 91.05 87.28 93.92 74.11 66.37 63.71 55.31 59.58
UDify 91.50 83.21 80.84 79.34 71.29 69.10 53.62 54.84
Telugu MTG te_mtg UDPipe 93.07 99.03 100.00 91.26 85.02 81.76 77.75 81.76
UDify 93.48 99.31 100.00 92.23 83.91 79.92 76.10 79.92
Thai PUD th_pud UDify 56.78 62.48 100.00 49.05 26.06 18.42 3.77 18.42
Turkish IMST tr_imst UDPipe 96.01 92.55 96.01 74.19 67.56 63.83 56.96 61.37
UDify 94.29 84.49 87.71 74.56 67.44 63.87 49.42 54.10
Turkish PUD tr_pud UDify 77.34 24.59 84.31 67.68 46.07 39.95 2.61 32.50
Ukrainian IU uk_iu UDPipe 97.59 92.66 97.23 88.29 85.25 81.90 73.81 79.10
UDify 97.71 88.63 94.00 92.83 90.30 88.15 72.93 81.04
Upper Sorbian UFAL hsb_ufal UDPipe 62.93 41.10 68.68 45.58 34.54 27.18 3.37 16.65
UDify 84.87 48.84 72.68 71.55 62.82 56.04 16.19 37.89
Urdu UDTB ur_udtb UDPipe 93.66 81.92 97.40 87.50 81.62 75.20 55.02 73.07
UDify 94.37 82.80 96.68 88.43 82.84 77.00 56.70 73.97
Uyghur UDT ug_udt UDPipe 89.87 88.30 95.31 78.46 67.09 60.85 47.84 57.08
UDify 75.88 70.80 79.70 65.89 48.80 38.95 21.75 31.31
Vietnamese VTB vi_vtb UDPipe 89.68 99.72 99.55 70.38 62.56 60.03 55.56 59.54
UDify 91.29 99.58 99.21 74.11 66.00 63.34 58.71 62.61
Warlpiri UFAL wbp_ufal UDify 33.44 18.15 39.17 21.66 7.96 7.49 0.00 0.88
Yoruba YTB yo_ytb UDify 50.86 78.32 85.56 37.62 19.09 16.56 6.30 12.15
Table 9: The full test results of UDify on 124 treebanks (part 4 of 4).