Vision-Language Model Based Handwriting Verification

Abstract

Handwriting verification is critical in document forensics. Deep learning-based approaches often face skepticism from forensic document examiners due to their lack of explainability and reliance on extensive training data and handcrafted features. This paper explores using Vision Language Models (VLMs), such as OpenAI's GPT-4o and Google's PaliGemma, to address these challenges. By leveraging their Visual Question Answering capabilities and 0-shot Chain-of-Thought (CoT) reasoning, our goal is to provide clear, human-understandable explanations for model decisions. Our experiments on the CEDAR handwriting dataset demonstrate that VLMs offer enhanced interpretability, reduce the need for large training datasets, and adapt better to diverse handwriting styles. However, results show that the CNN-based ResNet-18 architecture outperforms the 0-shot CoT prompt engineering approach with GPT-4o (Accuracy: 70%) and supervised fine-tuned PaliGemma (Accuracy: 71%), achieving an accuracy of 84% on the CEDAR AND dataset. These findings highlight the potential of VLMs in generating human-interpretable decisions while underscoring the need for further advancements to match the performance of specialized deep learning models.
