Tokenization Repair in the Presence of Spelling Errors

Abstract

We consider the following tokenization repair problem: Given a natural language text with any combination of missing or spurious spaces, correct these. Spelling errors can be present, but it is not part of the problem to correct them. For example, given: "Tispa per isabout token izaionrep air", compute "Tis paper is about tokenizaion repair". It is tempting to think of this problem as a special case of spelling correction or to treat the two problems together. We make a case that tokenization repair and spelling correction should and can be treated as separate problems. We investigate a variety of neural models as well as a number of strong baselines. We identify three main ingredients to high-quality tokenization repair: deep language models with a bidirectional component, training the models on text with spelling errors, and making use of the space information already present. Our best methods can repair all tokenization errors on 97.5% of the correctly spelled test sentences and on 96.0% of the misspelled test sentences. With all spaces removed from the given text (the scenario from previous work), the accuracy falls to 94.5% and 90.1%, respectively. We conduct a detailed error analysis.
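The problem statement can be made concrete with a short sketch. It is illustrative only and is not the paper's code: the helper names `strip_spaces`, `is_valid_repair`, and `sequence_accuracy` are hypothetical. It also assumes that the reported percentages count a sentence as correct only when its predicted spacing matches the reference exactly. The sketch checks that a repair changes nothing but spaces, so spelling errors such as "Tis" and "tokenizaion" are left untouched.

```python
def strip_spaces(text: str) -> str:
    """Return the text with every space character removed."""
    return text.replace(" ", "")


def is_valid_repair(corrupted: str, repaired: str) -> bool:
    """A repair may only insert or delete spaces; all other characters stay as they are."""
    return strip_spaces(corrupted) == strip_spaces(repaired)


def sequence_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of sentences whose predicted spacing equals the reference exactly.

    Assumption: this mirrors the "repair all tokenization errors" criterion.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


# The example from the abstract: misspellings are preserved, only spaces move.
corrupted = "Tispa per isabout token izaionrep air"
repaired = "Tis paper is about tokenizaion repair"

print(is_valid_repair(corrupted, repaired))
print(is_valid_repair(corrupted, "This paper is about tokenization repair"))
```

The second call returns `False` because that output also corrects the spelling, which changes the non-space characters and therefore goes beyond tokenization repair as defined here.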
