Machine Translation for Accessible Multi-Language Text Analysis

  • 2023-01-20 04:11:38
  • Edward W. Chew, William D. Weisman, Jingying Huang, Seth Frey

Abstract

English is the international standard of social research, but scholars are increasingly conscious of their responsibility to meet the need for scholarly insight into communication processes globally. This tension is as true in computational methods as any other area, with revolutionary advances in the tools for English language texts leaving most other languages far behind. In this paper, we aim to leverage those very advances to demonstrate that multi-language analysis is currently accessible to all computational scholars. We show that English-trained measures computed after translation to English have adequate-to-excellent accuracy compared to source-language measures computed on original texts. We show this for three major analytics -- sentiment analysis, topic analysis, and word embeddings -- over 16 languages, including Spanish, Chinese, Hindi, and Arabic. We validate this claim by comparing predictions on original language tweets and their backtranslations: double translations from their source language to English and back to the source language. Overall, our results suggest that Google Translate, a simple and widely accessible tool, is effective in preserving semantic content across languages and methods. Modern machine translation can thus help computational scholars make more inclusive and general claims about human communication.
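
The backtranslation check described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the `translate(text, src, dst)` callable, and the `classify(text)` callable are all hypothetical placeholders. In practice the translator would wrap a machine translation service and the classifier would be a source-language model; here they are passed in as arguments so that no particular API is assumed.

```python
from typing import Callable, Hashable, Sequence

# Hypothetical signatures: translate(text, source_lang, target_lang) -> text,
# classify(text) -> label. Neither corresponds to a specific library's API.
Translator = Callable[[str, str, str], str]
Classifier = Callable[[str], Hashable]


def backtranslate(text: str, translate: Translator,
                  source_lang: str, pivot_lang: str = "en") -> str:
    """Translate source -> pivot (English by default) -> source."""
    pivot_text = translate(text, source_lang, pivot_lang)
    return translate(pivot_text, pivot_lang, source_lang)


def agreement_rate(texts: Sequence[str], source_lang: str,
                   translate: Translator, classify: Classifier) -> float:
    """Fraction of texts whose predicted label is unchanged by backtranslation."""
    if not texts:
        raise ValueError("texts must be non-empty")
    matches = 0
    for text in texts:
        original_label = classify(text)
        round_trip_label = classify(backtranslate(text, translate, source_lang))
        if original_label == round_trip_label:
            matches += 1
    return matches / len(texts)
```

A high agreement rate under this kind of check would suggest that translation preserves the content the classifier relies on. The paper's actual metrics and models may differ from this simplified label-agreement measure.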

 
