Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German

Abstract

This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than half a million sentences, it was generated using a customized web scraping tool that could be applied to other low-resource languages as well. The approach demonstrates how freely available web pages can be used to construct comprehensive text corpora, which are of fundamental importance for natural language processing. In an experimental evaluation, we show that using the new corpus leads to significant improvements for the task of language modeling. To capture new content, our approach will run continuously to keep increasing the corpus over time.
