Abstract
Semantic sentence embedding models encode natural language sentences into vectors, such that closeness in embedding space indicates closeness in the semantics between the sentences. Bilingual data offers a useful signal for learning such embeddings: properties shared by both sentences in a translation pair are likely semantic, while divergent properties are likely stylistic or language-specific. We propose a deep latent variable model that attempts to perform source separation on parallel sentences, isolating what they have in common in a latent semantic vector, and explaining what is left over with language-specific latent vectors. Our proposed approach differs from past work on semantic sentence encoding in two ways. First, by using a variational probabilistic framework, we introduce priors that encourage source separation, and can use our model's posterior to predict sentence embeddings for monolingual data at test time. Second, we use high-capacity transformers as both data generating distributions and inference networks -- contrasting with most past work on sentence embeddings. In experiments, our approach substantially outperforms the state-of-the-art on a standard suite of unsupervised semantic similarity evaluations. Further, we demonstrate that our approach yields the largest gains on more difficult subsets of these evaluations where simple word overlap is not a good indicator of similarity.