Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts

  • 2025-10-28 15:43:26
  • Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin, Haneul Yoo, Kyunghyun Cho, Alice Oh

Abstract

The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.

 
