What do Language Representations Really Represent?

Abstract

A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, since the model can pick up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, while genetic relationships, a convenient benchmark used for evaluation in previous work, appear to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.
