On the Discrepancy between Density Estimation and Sequence Generation

Abstract

Many sequence-to-sequence generation tasks, including machine translation and text-to-speech, can be posed as estimating the density of the output y given the input x: p(y|x). Given this interpretation, it is natural to evaluate sequence-to-sequence models using conditional log-likelihood on a test set. However, the goal of sequence-to-sequence generation (or structured prediction) is to find the best output y^ given an input x, and each task has its own downstream metric R that scores a model output by comparing against a set of references y*: R(y^, y* | x). While we hope that a model that excels in density estimation also performs well on the downstream metric, the exact correlation has not been studied for sequence generation tasks. In this paper, by comparing several density estimators on five machine translation tasks, we find that the correlation between rankings of models based on log-likelihood and BLEU varies significantly depending on the range of the model families being compared. First, log-likelihood is highly correlated with BLEU when we consider models within the same family (e.g., autoregressive models, or latent variable models with the same parameterization of the prior). However, we observe no correlation between rankings of models across different families: (1) among non-autoregressive latent variable models, a flexible prior distribution is better at density estimation but gives worse generation quality than a simple prior, and (2) autoregressive models offer the best translation performance overall, while latent variable models with a normalizing flow prior give the highest held-out log-likelihood across all datasets. Therefore, we recommend using a simple prior for the latent variable non-autoregressive model when fast generation speed is desired.
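
The comparison of model rankings described in the abstract can be illustrated with a rank correlation. The following is a minimal sketch, assuming Spearman's rank correlation as the statistic; the abstract does not name the measure the authors use, so this choice, the function names `ranks` and `spearman`, and the toy inputs are all assumptions made for illustration and are not results from the paper.

```python
from __future__ import annotations

from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, with the largest value ranked 1.

    Tied values share the average of the ranks they would occupy.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: the Pearson correlation of the two rank lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Toy inputs only (hypothetical, NOT numbers from the paper): one entry per model.
toy_log_likelihood = [1.0, 2.0, 3.0]
toy_metric_same_order = [10.0, 20.0, 30.0]
toy_metric_reversed = [30.0, 20.0, 10.0]

print(spearman(toy_log_likelihood, toy_metric_same_order))
print(spearman(toy_log_likelihood, toy_metric_reversed))
```

In this sketch, each list would hold one score per model: held-out log-likelihood in one list and a downstream metric such as BLEU in the other. A value near 1 means the two criteria rank the models the same way, and a value near 0 means the rankings are unrelated, which is the situation the abstract reports when models from different families are compared.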
