Abstract
Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world's languages. In this paper, we extend these probing methods to a multilingual context, investigating the behaviors of LLMs across diverse languages. We conduct experiments on several open-source LLMs, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers, similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results highlight significant disparities in LLMs' multilingual capabilities and emphasize the need for improved modeling of low-resource languages.