Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than relying on pure visual classification. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language models with noisy input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting. Firstly, the autonomous principle enforces explicit language modeling by decoupling the recognizer into a vision model and a language model and blocking gradient flow between the two. Secondly, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Thirdly, we propose an iterative correction scheme for the language model, which effectively alleviates the impact of noisy input. Finally, to improve ABINet++ on long text recognition, we propose to aggregate horizontal features by embedding Transformer units inside a U-Net, and design a position and content attention module which integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, which consistently demonstrates the superiority of our method in various environments, especially on low-quality images. In addition, extensive experiments in both English and Chinese show that a text spotter incorporating our language modeling method significantly improves in both accuracy and speed compared with commonly used attention-based recognizers.
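
To make the iterative correction idea concrete, the following is a minimal toy sketch and not the ABINet++ implementation. Everything in it is an assumption made for illustration: the three-word `LEXICON`, the helper names (`decode`, `cloze_language_model`, `fuse`, `iterative_correction`), the equal-weight fusion, and the simplification that the language model consumes a decoded string. The toy cloze model predicts each character from the characters on both sides of it, never from the character at that position, and its output is fused with the fixed vision prediction and fed back for another round. Gradient blocking is not shown because nothing is trained here; in an autodiff framework it would roughly correspond to detaching the vision output before it enters the language model.

```python
from typing import Dict, List

# One {character: probability} dict per character position.
Probs = List[Dict[str, float]]

# Hypothetical toy vocabulary standing in for learned linguistic knowledge.
LEXICON = ["house", "horse", "mouse"]


def decode(probs: Probs) -> str:
    """Greedy decoding: pick the most probable character at each position."""
    return "".join(max(p, key=p.get) for p in probs)


def cloze_language_model(text: str) -> Probs:
    """Toy bidirectional cloze: score position i from all OTHER positions only."""
    out: Probs = []
    for i in range(len(text)):
        votes: Dict[str, float] = {}
        for word in LEXICON:
            if len(word) != len(text):
                continue
            # Count context matches on both sides, excluding position i itself.
            match = sum(a == b for j, (a, b) in enumerate(zip(word, text)) if j != i)
            votes[word[i]] = votes.get(word[i], 0.0) + 2.0 ** match
        total = sum(votes.values())
        out.append({c: v / total for c, v in votes.items()})
    return out


def fuse(vision: Probs, language: Probs) -> Probs:
    """Equal-weight average of vision and language predictions (an assumption)."""
    fused: Probs = []
    for v, l in zip(vision, language):
        chars = set(v) | set(l)
        fused.append({c: 0.5 * v.get(c, 0.0) + 0.5 * l.get(c, 0.0) for c in chars})
    return fused


def iterative_correction(vision: Probs, num_iters: int = 3) -> str:
    """Repeatedly feed the fused prediction back into the language model."""
    current = vision
    for _ in range(num_iters):
        language = cloze_language_model(decode(current))
        current = fuse(vision, language)  # the vision prediction stays fixed
    return decode(current)


# Noisy vision output: the second character is misread as 'c' instead of 'o'.
vision_probs: Probs = [
    {"h": 0.9, "b": 0.1},
    {"c": 0.55, "o": 0.45},
    {"u": 0.9, "n": 0.1},
    {"s": 0.9, "5": 0.1},
    {"e": 0.9, "c": 0.1},
]

print(decode(vision_probs))                # hcuse
print(iterative_correction(vision_probs))  # house
```

In this toy case the first pass already repairs the misread character, and later passes leave the corrected string unchanged. The point of the loop is that each pass gives the language model a cleaner input than the raw vision prediction.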