HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

  • 2022-01-06 07:49:45
  • György Orosz, Zsolt Szántó, Péter Berkecz, Gergő Szabó, Richárd Farkas

Abstract

Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should consist of close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings. Industrial text processing applications have to satisfy non-functional software quality requirements; moreover, frameworks supporting multiple languages are increasingly favored. This paper introduces HuSpaCy, an industry-ready Hungarian language processing pipeline. The presented tool provides components for the most important basic linguistic analysis tasks. It is open-source and is available under a permissive license. Our system is built upon spaCy's NLP components, which means that it is fast, has a rich ecosystem of NLP applications and extensions, and comes with extensive documentation and a well-known API. Besides an overview of the underlying models, we also present a rigorous evaluation on common benchmark datasets. Our experiments confirm that HuSpaCy has high accuracy in all subtasks while maintaining resource-efficient prediction capabilities.
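
Because the pipeline is built on spaCy, its annotations should be reachable through spaCy's usual `Doc`, `Token` and `Span` attributes. The following is a minimal sketch of what that might look like, not the authors' own example. The model package name `hu_core_news_lg` is an assumption that does not come from the abstract, and the helper names `load_pipeline` and `analyze` are hypothetical.

```python
from typing import Any, Callable, List, Tuple

TokenRow = Tuple[str, str, str]   # (surface form, lemma, coarse POS tag)
EntityRow = Tuple[str, str]       # (entity text, entity label)


def load_pipeline(model_name: str = "hu_core_news_lg") -> Any:
    """Load a spaCy pipeline by package name.

    The default name is an assumption, not stated in the abstract; the
    model package has to be installed separately before this call works.
    """
    import spacy  # imported lazily so analyze() can be used without spaCy

    return spacy.load(model_name)


def analyze(nlp: Callable[[str], Any], text: str) -> Tuple[List[TokenRow], List[EntityRow]]:
    """Run a spaCy-style pipeline and collect token and entity annotations."""
    doc = nlp(text)
    # Token.text, Token.lemma_ and Token.pos_ are standard spaCy attributes.
    tokens = [(tok.text, tok.lemma_, tok.pos_) for tok in doc]
    # Doc.ents holds the recognized entity spans with their labels.
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return tokens, entities
```

With spaCy and a Hungarian model installed, `analyze(load_pipeline(), "Budapest a Duna partján fekszik.")` would return the lemma and part-of-speech tag for each token along with any recognized entities. The exact output depends on the model version.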

 
