Jochre 3 and the Yiddish OCR corpus

  • 2025-01-14 21:21:39
  • Assaf Urieli, Amber Clooney, Michelle Sigiel, Grisha Leyfer

Abstract

We describe the construction of a publicly available Yiddish OCR Corpus, and describe and evaluate the open source OCR tool suite Jochre 3, including an Alto editor for corpus annotation, OCR software for Alto OCR layer generation, and a customizable OCR search engine. The current version of the Yiddish OCR corpus contains 658 pages, 186K tokens and 840K glyphs. The Jochre 3 OCR tool uses various fine-tuned YOLOv8 models for top-down page layout analysis, and a custom CNN network for glyph recognition. It attains a CER of 1.5% on our test corpus, far out-performing all other existing public models for Yiddish. We analyzed the full 660M word Yiddish Book Center with Jochre 3 OCR, and the new OCR is searchable through the Yiddish Book Center OCR search engine.

 
