Abstract
Fuzzy string matching and language classification are important tools in Natural Language Processing pipelines; this paper provides advances in both areas. We propose a fast, novel approach to string tokenisation for fuzzy language matching and experimentally demonstrate an 83.6% decrease in processing time with an estimated improvement in recall of 3.1% at the cost of a 2.6% decrease in precision. This approach is able to work even where keywords are subdivided into multiple words, without needing to scan character-to-character. So far, little work has considered using metadata to enhance language classification algorithms. We provide observational data and find that the Accept-Language header is 14% more likely to match the classification than the IP address.