HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

Abstract

Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should consist of close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings. Industrial text processing applications have to satisfy non-functional software quality requirements; what is more, frameworks supporting multiple languages are increasingly favored. This paper introduces HuSpaCy, an industry-ready Hungarian language processing toolkit. The presented tool provides components for the most important basic linguistic analysis tasks. It is open-source and is available under a permissive license. Our system is built upon spaCy's NLP components, resulting in an easily usable, fast yet accurate application. Experiments confirm that HuSpaCy has high accuracy while maintaining resource-efficient prediction capabilities.
