Benchmarking Large Language Models for Handwritten Text Recognition

Abstract

Traditional machine learning models for Handwritten Text Recognition (HTR) rely on supervised training, requiring extensive manual annotations, and often produce errors due to the separation between layout and text processing. In contrast, Multimodal Large Language Models (MLLMs) offer a general approach to recognizing diverse handwriting styles without the need for model-specific training. The study benchmarks various proprietary and open-source LLMs against Transkribus models, evaluating their performance on both modern and historical datasets written in English, French, German, and Italian. In addition, emphasis is placed on testing the models' ability to autonomously correct previously generated outputs. Findings indicate that proprietary models, especially Claude 3.5 Sonnet, outperform open-source alternatives in zero-shot settings. MLLMs achieve excellent results in recognizing modern handwriting and exhibit a preference for the English language due to their pre-training dataset composition. Comparisons with Transkribus show no consistent advantage for either approach. Moreover, LLMs demonstrate limited ability to autonomously correct errors in zero-shot transcriptions.
