A Theory of Unsupervised Translation Motivated by Understanding Animal Communication

Abstract

Recent years have seen breakthroughs in neural language models that capture nuances of language, culture, and knowledge. Neural networks are capable of translating between languages, in some cases even between two languages where there is little or no access to parallel translations, in what is known as Unsupervised Machine Translation (UMT). Given this progress, it is intriguing to ask whether machine learning tools can ultimately enable understanding animal communication, particularly that of highly intelligent animals. Our work is motivated by an ambitious interdisciplinary initiative, Project CETI, which is collecting a large corpus of sperm whale communications for machine analysis. We propose a theoretical framework for analyzing UMT when no parallel data are available and when it cannot be assumed that the source and target corpora address related subject domains or possess similar linguistic structure. The framework requires access to a prior probability distribution that should assign non-zero probability to possible translations. We instantiate our framework with two models of language. Our analysis suggests that the accuracy of translation depends on the complexity of the source language and the amount of "common ground" between the source language and the target prior. We also prove upper bounds on the amount of data required from the source language in the unsupervised setting as a function of the amount of data required in a hypothetical supervised setting. Surprisingly, our bounds suggest that the amount of source data required for unsupervised translation is comparable to that required in the supervised setting. For one of the language models we analyze, we also prove a nearly matching lower bound. Our analysis is purely information-theoretic and as such can inform how much source data needs to be collected, but it does not yield a computationally efficient procedure.
