Abstract
Transformers have recently taken center stage in language modeling, after LSTMs were considered the dominant model architecture for a long time. In this project, we investigate the performance of two Transformer architectures, BERT and Transformer-XL, on the language modeling task. We use a sub-word model setting with the Finnish language and compare the results to the previous state-of-the-art (SOTA) LSTM model. BERT achieves a pseudo-perplexity score of 14.5, which is, as far as we know, the first such measure reported. Transformer-XL improves the perplexity score to 73.58, which is 27% better than the LSTM model.
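The two scores quoted above are not measured on the same scale, which a short sketch may help make concrete. The code below is illustrative only and is not taken from the paper: it assumes the common definitions, where perplexity is the exponential of the negative mean log-probability of each token given its left context, and pseudo-perplexity uses the same formula with each token masked in turn and scored given both its left and right context. The function name `perplexity` and all probability values are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp of the negative mean natural-log probability per token.

    With left-to-right log-probabilities this is ordinary perplexity;
    with masked-token log-probabilities (assumed definition) it is
    pseudo-perplexity.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token probabilities for one four-token sentence.
# A left-to-right model (e.g. an LSTM or Transformer-XL) sees only the past.
left_to_right_log_probs = [math.log(0.25)] * 4

# A masked model (e.g. BERT) sees both sides of the masked token, so its
# per-token probabilities are typically higher.
masked_log_probs = [math.log(0.5)] * 4

ppl = perplexity(left_to_right_log_probs)
pseudo_ppl = perplexity(masked_log_probs)
print(f"perplexity: {ppl:.2f}, pseudo-perplexity: {pseudo_ppl:.2f}")
```

Because the masked model conditions on context from both directions, a lower pseudo-perplexity does not by itself indicate a better language model than one with a higher ordinary perplexity; under these assumptions the two numbers are best compared only against scores of the same kind.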