Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders

Abstract

The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into a sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features significantly reduces the abilities of LLMs in only one language, leaving others almost unaffected. Interestingly, we find some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating them individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs.
