Abstract
The variational autoencoder (VAE) imposes a probabilistic distribution (typically Gaussian) on the latent space and penalizes the Kullback--Leibler (KL) divergence between the posterior and prior. In NLP, VAEs are extremely difficult to train due to the problem of KL collapsing to zero. One has to implement various heuristics such as KL weight annealing and word dropout in a carefully engineered manner to successfully train a VAE for text. In this paper, we propose to use the Wasserstein autoencoder (WAE) for probabilistic sentence generation, where the encoder could be either stochastic or deterministic. We show theoretically and empirically that, in the original WAE, the stochastically encoded Gaussian distribution tends to become a Dirac-delta function, and we propose a variant of WAE that encourages the stochasticity of the encoder. Experimental results show that the latent space learned by WAE exhibits properties of continuity and smoothness as in VAEs, while simultaneously achieving much higher BLEU scores for sentence reconstruction.
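
As a rough illustration of the KL term and the annealing heuristic mentioned above, the following is a minimal sketch, not the paper's implementation. It assumes a diagonal Gaussian posterior and a standard normal prior, which is the typical VAE setup; the function names `kl_to_standard_normal` and `kl_weight` and the linear warm-up schedule are hypothetical choices made for this example.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for one latent vector.

    Closed form per dimension: 0.5 * (mu^2 + sigma^2 - 1 - ln(sigma^2)).
    Assumes every sigma is strictly positive.
    """
    return sum(
        0.5 * (m * m + s * s - 1.0 - math.log(s * s))
        for m, s in zip(mu, sigma)
    )


def kl_weight(step, warmup_steps):
    """Hypothetical linear annealing schedule: ramps from 0 to 1, then stays at 1."""
    return min(1.0, step / warmup_steps)


# When the posterior equals the prior, the KL term is exactly zero.
# This is the "collapsed" state in which the latent code carries no
# information about the input sentence.
collapsed = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# A posterior that has moved away from the prior pays a positive penalty.
informative = kl_to_standard_normal([1.0], [1.0])

# Annealed penalty at a given training step (reconstruction loss omitted).
annealed_penalty = kl_weight(50, 100) * informative
```

Under these assumptions, the weight starts at zero so that the encoder is not pushed toward the prior early in training, and the full penalty applies only after the warm-up period.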