LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP

  • 2024-08-08 18:58:06
  • Danlu Chen, Freda Shi, Aditi Agarwal, Jacobo Myerston, Taylor Berg-Kirkpatrick

Abstract

Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription -- this issue poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages: most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses.

 
