Abstract
In contrast to Connectionist Temporal Classification (CTC) approaches, Sequence-To-Sequence (S2S) models for Handwritten Text Recognition (HTR) suffer from errors such as skipped or repeated words, which often occur at the end of a sequence. In this paper, to combine the best of both approaches, we propose to use the CTC-Prefix-Score during S2S decoding: during beam search, paths that are invalid according to the CTC confidence matrix are penalised. Our network architecture is composed of a Convolutional Neural Network (CNN) as visual backbone, bidirectional Long Short-Term Memory cells (LSTMs) as encoder, and a Transformer decoder with inserted mutual attention layers. The CTC confidences are computed on the encoder, while the Transformer is only used for character-wise S2S decoding. We evaluate this setup on three HTR data sets: IAM, Rimes, and StAZH. On IAM, we achieve a competitive Character Error Rate (CER) of 2.95% when pretraining our model on synthetic data and including a character-based language model for contemporary English. Compared to other state-of-the-art approaches, our model requires about 10-20 times fewer parameters. Our implementation is available on GitHub: https://github.com/Planet-AI-GmbH/tfaip-hybrid-ctc-s2s.
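To illustrate how a CTC prefix score can penalise invalid beam-search paths, the following is a minimal sketch and not the implementation from the repository above. It computes, in the probability domain, the probability that the CTC output starts with a given label prefix, using the standard forward recursion over non-blank and blank path endings. The names `ctc_prefix_prob` and `combined_score`, the interpolation weight `lam`, and the log-linear combination with the S2S score are assumptions made for illustration only.

```python
import math


def ctc_prefix_prob(probs, prefix, blank=0):
    """Probability that the CTC output starts with `prefix`.

    probs:  list of T rows, each a list of per-frame label probabilities
            (the CTC confidence matrix), with index `blank` for the blank.
    prefix: list of label indices (no blanks).
    """
    T = len(probs)
    # Forward variables for the empty prefix: only blanks emitted so far.
    gamma_n = [0.0] * T  # paths ending in a non-blank label
    gamma_b = []         # paths ending in a blank
    running = 1.0
    for t in range(T):
        running *= probs[t][blank]
        gamma_b.append(running)

    psi = 1.0   # every output starts with the empty prefix
    last = None
    for c in prefix:
        new_n = [0.0] * T
        new_b = [0.0] * T
        # At the first frame, c can only be emitted if nothing precedes it.
        new_n[0] = probs[0][c] if last is None else 0.0
        psi = new_n[0]
        for t in range(1, T):
            # Mass of the previous prefix that may be extended by c at frame t.
            # A repeated label needs a separating blank, so gamma_n is dropped.
            phi = gamma_b[t - 1] + (0.0 if c == last else gamma_n[t - 1])
            new_n[t] = (new_n[t - 1] + phi) * probs[t][c]
            new_b[t] = (new_b[t - 1] + new_n[t - 1]) * probs[t][blank]
            psi += phi * probs[t][c]
        gamma_n, gamma_b, last = new_n, new_b, c
    return psi


def combined_score(s2s_log_prob, ctc_prob, lam=0.5):
    """Hypothetical log-linear mix of S2S and CTC prefix scores (0 < lam <= 1)."""
    ctc_log_prob = math.log(ctc_prob) if ctc_prob > 0.0 else -math.inf
    return (1.0 - lam) * s2s_log_prob + lam * ctc_log_prob


# Toy confidence matrix: 2 frames, labels {0: blank, 1: 'a', 2: 'b'}, uniform.
probs = [[1.0 / 3.0] * 3 for _ in range(2)]
print(ctc_prefix_prob(probs, [1]))     # prefix "a"
print(ctc_prefix_prob(probs, [1, 1]))  # prefix "aa" needs at least 3 frames
print(combined_score(-1.0, ctc_prefix_prob(probs, [1, 1])))
```

In this toy case the prefix "aa" cannot be produced from two frames, so its CTC prefix probability is zero and the combined score becomes negative infinity, which removes the hypothesis from the beam. A practical decoder would work in the log domain and update the forward variables incrementally per hypothesis rather than recomputing them from scratch.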