Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers

Abstract

A central question in multilingual language modeling is whether large language models (LLMs) develop a universal concept representation, disentangled from specific languages. In this paper, we address this question by analyzing latent representations (latents) during a word translation task in transformer-based LLMs. We strategically extract latents from a source translation prompt and insert them into the forward pass on a target translation prompt. By doing so, we find that the output language is encoded in the latent at an earlier layer than the concept to be translated. Building on this insight, we conduct two key experiments. First, we demonstrate that we can change the concept without changing the language and vice versa through activation patching alone. Second, we show that patching with the mean over latents across different languages does not impair and instead improves the models' performance in translating the concept. Our results provide evidence for the existence of language-agnostic concept representations within the investigated models.
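
The patching procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy stand-in for a transformer (random weights, a causal mean over earlier positions in place of attention), and every name in it (`make_layers`, `block`, `forward`, `mean_latent`, `random_prompt`) is hypothetical. It only shows the mechanics: record the last-position latent of a source prompt at some layer, overwrite the same latent in the forward pass on a target prompt, and optionally patch with the mean over several source latents.

```python
import math
import random


def make_layers(n_layers, dim, seed=0):
    """Random dim x dim weight matrices, one per toy layer (assumption: no trained model)."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
            for _ in range(n_layers)]


def random_prompt(n_tokens, dim, seed):
    """A toy 'prompt': one random embedding vector per token position."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]


def block(weights, states):
    """Toy causal block: each position adds tanh(W @ mean of positions up to itself)."""
    out = []
    for p, h in enumerate(states):
        prefix = states[: p + 1]
        ctx = [sum(col) / len(prefix) for col in zip(*prefix)]
        update = [math.tanh(sum(w * c for w, c in zip(row, ctx))) for row in weights]
        out.append([x + u for x, u in zip(h, update)])
    return out


def forward(layers, prompt, patch=None):
    """Return the last-position latent after every layer.

    patch=(layer_index, vector) overwrites the last-position latent
    right after that layer, then the forward pass continues as usual.
    """
    states = [list(v) for v in prompt]
    last_latents = []
    for i, weights in enumerate(layers):
        states = block(weights, states)
        if patch is not None and patch[0] == i:
            states[-1] = list(patch[1])
        last_latents.append(list(states[-1]))
    return last_latents


def mean_latent(vectors):
    """Element-wise mean over latents, e.g. the same concept prompted in several languages."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


layers = make_layers(n_layers=4, dim=3)
source_a = random_prompt(3, 3, seed=1)  # stands in for a source translation prompt
source_b = random_prompt(3, 3, seed=2)  # stands in for the same concept in another language
target = random_prompt(3, 3, seed=3)    # stands in for the target translation prompt

layer = 1
lat_a = forward(layers, source_a)[layer]
lat_b = forward(layers, source_b)[layer]

# Single-source patching: insert the source latent into the target forward pass.
patched = forward(layers, target, patch=(layer, lat_a))

# Mean patching: insert the mean over latents from several source prompts.
mean_patched = forward(layers, target, patch=(layer, mean_latent([lat_a, lat_b])))
```

In a real model, the later layers still attend to the unpatched target-prompt positions, which is what allows the layer at which patching flips the output language to differ from the layer at which it flips the concept. The toy above keeps that property only in the crude form of the causal mean.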
