Predictability and Causality in Spanish and English Natural Language Generation

Abstract

The field of Natural Language Generation (NLG) has been boosted in recent years by advances in deep learning technologies. Nonetheless, these new data-intensive methods introduce language-dependent disparities in NLG, as the main training data sets are in English. Also, most neural NLG systems use decoder-only (causal) transformer language models, which work well for English but were not designed with other languages in mind. In this work we start from the hypothesis that such models may introduce generation bias in target languages with less rigid word ordering, subject omission, or different attachment preferences for relative clauses, so that other language generation strategies may be more desirable for these target languages. This paper first compares causal and non-causal language modeling for English and Spanish, two languages with different grammatical structures and over 1.5 billion and 0.5 billion speakers, respectively. For this purpose, we define a novel metric, the average causal and non-causal context-conditioned entropy of the grammatical category distribution, for both languages as an information-theoretic a priori approach. The evaluation of natural text sources (such as training data) in both languages reveals lower average non-causal conditional entropy in Spanish and lower causal conditional entropy in English. According to this experiment, Spanish is more predictable than English given a non-causal context. Then, by applying a conditional relative entropy metric to text generation experiments, we find that the best performance is achieved with causal NLG in English and with non-causal NLG in Spanish. These insights support further research in NLG in Spanish using bidirectional transformer language models.
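
To make the idea of context-conditioned entropy over grammatical categories concrete, the following is a minimal sketch and not the paper's actual metric, whose exact definition is not given in this abstract. It assumes, purely for illustration, that a causal context consists of the two preceding part-of-speech tags and a non-causal context consists of one preceding and one following tag, so that both contexts have the same size. The function names and the toy tag sequence are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Estimate H(tag | context) in bits from (context, tag) pairs."""
    pairs = list(pairs)
    joint = Counter(pairs)                      # counts of (context, tag)
    context = Counter(c for c, _ in pairs)      # counts of each context
    n = len(pairs)
    h = 0.0
    for (c, _), k in joint.items():
        # p(context, tag) * log2 p(tag | context)
        h -= (k / n) * math.log2(k / context[c])
    return h


def causal_pairs(tags):
    # ASSUMPTION: causal context = the two preceding tags.
    return [((tags[i - 2], tags[i - 1]), tags[i]) for i in range(2, len(tags))]


def non_causal_pairs(tags):
    # ASSUMPTION: non-causal context = one preceding and one following tag.
    return [((tags[i - 1], tags[i + 1]), tags[i]) for i in range(1, len(tags) - 1)]


# Hypothetical toy sequence of part-of-speech tags (not data from the paper).
toy_tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "VERB", "ADV"]
h_causal = conditional_entropy(causal_pairs(toy_tags))
h_non_causal = conditional_entropy(non_causal_pairs(toy_tags))
print(f"causal: {h_causal:.3f} bits, non-causal: {h_non_causal:.3f} bits")
```

Under these assumptions, a lower value means the grammatical category is easier to predict from that kind of context. A real comparison would need large tagged corpora in each language and a shared tag set.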
