TCM-SD: A Benchmark for Probing Syndrome Differentiation via Natural Language Processing

  • 2022-08-03 04:18:00
  • Mucheng Ren, Heyan Huang, Yuxiang Zhou, Qianwen Cao, Yuan Bu, Yang Gao

Abstract

Traditional Chinese Medicine (TCM) is a natural, safe, and effective therapy that has spread and been applied worldwide. The unique TCM diagnosis and treatment system requires a comprehensive analysis of a patient's symptoms hidden in the clinical record written in free text. Prior studies have shown that this system can be digitized and made intelligent with the aid of artificial intelligence (AI) technology, such as natural language processing (NLP). However, existing datasets have neither the quality nor the quantity needed to support the further development of data-driven AI technology in TCM. Therefore, in this paper, we focus on the core task of the TCM diagnosis and treatment system -- syndrome differentiation (SD) -- and we introduce the first public large-scale dataset for SD, called TCM-SD. Our dataset contains 54,152 real-world clinical records covering 148 syndromes. Furthermore, we collect a large-scale unlabelled textual corpus in the field of TCM and propose a domain-specific pre-trained language model, called ZY-BERT. We conducted experiments using deep neural networks to establish a strong performance baseline, reveal various challenges in SD, and demonstrate the potential of domain-specific pre-trained language models. Our study and analysis reveal opportunities for incorporating computer science and linguistics knowledge to explore the empirical validity of TCM theories.

 
