BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark

Abstract

To advance Chinese financial natural language processing (NLP), we introduce BBT-FinT5, a new Chinese financial pre-trained language model based on the T5 model. To support this effort, we have built BBT-FinCorpus, a large-scale financial corpus with approximately 300GB of raw text from four different sources. In general-domain NLP, comprehensive benchmarks like GLUE and SuperGLUE have driven significant advancements in language model pre-training by enabling head-to-head comparisons among models. Drawing inspiration from these benchmarks, we propose BBT-CFLEB, a Chinese Financial Language understanding and generation Evaluation Benchmark, which includes six datasets covering both understanding and generation tasks. Our aim is to facilitate research in the development of NLP within the Chinese financial domain. Our model, corpus and benchmark are released at https://github.com/ssymmetry/BBT-FinCUGE-Applications. Our work is part of the Big Bang Transformer (BBT), a large-scale pre-trained language model project.
