Text Embedding Inversion Attacks on Multilingual Language Models

Abstract

Representing textual information as real-numbered embeddings has become the norm in NLP. Moreover, with the rise of public interest in large language models (LLMs), Embeddings as a Service (EaaS) has rapidly gained traction as a business model. This is not without outstanding security risks, as previous research has demonstrated that sensitive data can be reconstructed from embeddings, even without knowledge of the underlying model that generated them. However, such work is limited by its sole focus on English, leaving all other languages vulnerable to attacks by malicious actors. As many international and multilingual companies leverage EaaS, there is an urgent need for research into multilingual LLM security. To this end, this work investigates LLM security from the perspective of multilingual embedding inversion. Concretely, we define the problem of black-box multilingual and cross-lingual inversion attacks, with special attention to a cross-domain scenario. Our findings reveal that multilingual models are potentially more vulnerable to inversion attacks than their monolingual counterparts. This stems from the reduced data requirements for achieving comparable inversion performance in settings where the underlying language is not known a priori. To our knowledge, this work is the first to delve into multilinguality within the context of inversion attacks, and our findings highlight the need for further investigation and enhanced defenses in the area of NLP Security.
