FPM: A Collection of Large-scale Foundation Pre-trained Language Models

Abstract

Recent work in language modeling has shown that training large-scale Transformer models drives the latest developments in natural language processing applications. However, little work has been done to unify the currently effective models. In this work, we take currently effective model architectures and release a set of models trained with the most mainstream current techniques. We believe these will serve as foundation models in the future. For Chinese, using the GPT-2[9] model, a 10.3 billion parameter language model was trained on the Chinese dataset, and, in particular, a 2.9 billion parameter language model was trained on dialogue data; a BERT model with 495 million parameters was trained on the Chinese dataset; and a Transformer language model with 5.6 billion parameters was trained on the Chinese dataset. Corresponding training work was also done for English. Using the GPT-2 model, a language model with 6.4 billion parameters was trained on the English dataset; using the BERT[3] model, a language model with 1.24 billion parameters was trained on the English dataset, and, in particular, a 688 million parameter language model was trained using single-card training technology; and a Transformer language model with 5.6 billion parameters was trained on the English dataset. In the TNEWS classification task evaluated by CLUE[13], the BERT-C model reached an accuracy of 59.99%, exceeding the 59.46% accuracy of ALBERT-xxlarge by 0.53 percentage points. In the QQP classification task evaluated by GLUE[11], an accuracy of 78.95% surpassed the 72.1% accuracy of BERT-Large by 6.85 percentage points. Compared with the 75.2% accuracy of ERNIE, currently ranked first in the GLUE evaluation, this is an increase of 3.75 percentage points.
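The gains quoted in the abstract are absolute differences between accuracies, that is, percentage points rather than relative improvements. As a minimal sketch, the arithmetic can be rechecked as follows; the dictionary labels and the helper name `point_gain` are illustrative and do not come from the paper, and only the accuracy figures are taken from the text.

```python
from decimal import Decimal

# Accuracies (in percent) exactly as quoted in the abstract.
# The labels are illustrative; the abstract does not name the QQP model.
reported = {
    "TNEWS: BERT-C vs ALBERT-xxlarge": (Decimal("59.99"), Decimal("59.46")),
    "QQP: reported vs BERT-Large": (Decimal("78.95"), Decimal("72.1")),
    "QQP: reported vs ERNIE": (Decimal("78.95"), Decimal("75.2")),
}


def point_gain(ours: Decimal, baseline: Decimal) -> Decimal:
    """Absolute gain in percentage points (hypothetical helper)."""
    return ours - baseline


# Decimal avoids binary floating-point rounding noise in the differences.
gains = {name: point_gain(ours, base) for name, (ours, base) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
```

Under these figures, the three differences match the 0.53, 6.85, and 3.75 stated in the abstract.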
