Text Embeddings Reveal (Almost) As Much As Text

  • 2023-10-10 18:39:03
  • John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, Alexander M. Rush

Abstract

How much private information do text embeddings reveal about the original text? We investigate the problem of embedding inversion, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when re-embedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes. Our code is available on GitHub: https://github.com/jxmorris12/vec2text.
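The correct-and-re-embed loop described in the abstract can be illustrated with a toy sketch. This is not the authors' model or the vec2text API: the embedder here is a hypothetical position-aware one-hot encoding over a tiny made-up vocabulary, and a greedy word-substitution search stands in for the learned correction model. The sketch also assumes the attacker knows the text length. It shows only the control flow: propose a hypothesis, re-embed it, and keep the change if it moves closer to the target embedding.

```python
import math

# Hypothetical toy vocabulary and length limit (assumptions for illustration only).
VOCAB = ["alice", "bob", "visited", "called", "the",
         "clinic", "pharmacy", "today", "yesterday"]
MAX_LEN = 8


def embed(text):
    """Toy stand-in for an embedding model: one slot per (position, word), L2-normalised."""
    vec = [0.0] * (MAX_LEN * len(VOCAB))
    for pos, word in enumerate(text.split()[:MAX_LEN]):
        if word in VOCAB:  # out-of-vocabulary words are ignored
            vec[pos * len(VOCAB) + VOCAB.index(word)] = 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalised."""
    return sum(x * y for x, y in zip(a, b))


def invert(target, length, max_rounds=5):
    """Greedy iterative correction: edit one word, re-embed, keep the edit if it helps."""
    hypothesis = [VOCAB[0]] * length  # crude initial guess
    best = cosine(embed(" ".join(hypothesis)), target)
    for _ in range(max_rounds):
        improved = False
        for pos in range(length):
            for word in VOCAB:
                candidate = hypothesis[:pos] + [word] + hypothesis[pos + 1:]
                score = cosine(embed(" ".join(candidate)), target)
                if score > best + 1e-9:  # accept only strict improvements
                    hypothesis, best, improved = candidate, score, True
        if not improved:  # no single-word edit helps any more
            break
    return " ".join(hypothesis), best


if __name__ == "__main__":
    secret = "alice visited the clinic today"
    recovered, score = invert(embed(secret), length=len(secret.split()))
    print(recovered, round(score, 3))
```

In this toy setting the embedding is trivially invertible, so the loop always recovers the input. The point of the sketch is only the re-embedding feedback signal, which the paper's learned, multi-step method exploits against real dense embeddings.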

 
