Is Machine Learning Speaking my Language? A Critical Look at the NLP-Pipeline Across 8 Human Languages

Abstract

Natural Language Processing (NLP) is increasingly used as a key ingredient incritical decision-making systems such as resume parsers used in sorting a listof job candidates. NLP systems often ingest large corpora of human text,attempting to learn from past human behavior and decisions in order to producesystems that will make recommendations about our future world. Over 7000 humanlanguages are being spoken today and the typical NLP pipeline underrepresentsspeakers of most of them while amplifying the voices of speakers of otherlanguages. In this paper, a team including speakers of 8 languages - English,Chinese, Urdu, Farsi, Arabic, French, Spanish, and Wolof - takes a criticallook at the typical NLP pipeline and how even when a language is technicallysupported, substantial caveats remain to prevent full participation. Despitehuge and admirable investments in multilingual support in many tools andresources, we are still making NLP-guided decisions that systematically anddramatically underrepresent the voices of much of the world.

Quick Read (beta)

loading the full paper ...