Abstract
Handwritten Mathematical Expression Recognition (HMER) methods have made remarkable progress, with most existing approaches based on either a hybrid CNN/RNN architecture with GRUs or a Transformer architecture. Each of these has its strengths and weaknesses. Leveraging different model structures as viewers and effectively integrating their diverse capabilities presents an intriguing avenue for exploration. This involves addressing two key challenges: 1) how to fuse the two methods effectively, and 2) how to achieve higher performance at an appropriate level of complexity. This paper proposes an efficient CNN-Transformer multi-viewer, multi-task approach to enhance the model's recognition performance. Our MMHMER model achieves 63.96%, 62.51%, and 65.46% ExpRate on CROHME14, CROHME16, and CROHME19, respectively, outperforming Posformer with absolute gains of 1.28%, 1.48%, and 0.58%. The main contribution of our approach is a new multi-view, multi-task framework that effectively integrates the strengths of CNNs and Transformers. By leveraging the feature extraction capabilities of the CNN and the sequence modeling capabilities of the Transformer, our model can better handle the complexity of handwritten mathematical expressions.