Prediction hubs are context-informed frequent tokens in LLMs

Abstract

Hubness, the tendency for a few points to be among the nearest neighbours of a disproportionate number of other points, commonly arises when applying standard distance measures to high-dimensional data, often negatively impacting distance-based analysis. As autoregressive large language models (LLMs) operate on high-dimensional representations, we ask whether they are also affected by hubness. We first prove that the only large-scale representation comparison operation performed by LLMs, namely that between context and unembedding vectors to determine continuation probabilities, is not characterized by the concentration of distances phenomenon that typically causes the appearance of nuisance hubness. We then empirically show that this comparison still leads to a high degree of hubness, but the hubs in this case do not constitute a disturbance. They are rather the result of context-modulated frequent tokens often appearing in the pool of likely candidates for next token prediction. However, when other distances are used to compare LLM representations, we do not have the same theoretical guarantees, and, indeed, we see nuisance hubs appear. There are two main takeaways. First, hubness, while omnipresent in high-dimensional spaces, is not a negative property that needs to be mitigated when LLMs are being used for next token prediction. Second, when comparing representations from LLMs using Euclidean or cosine distance, there is a high risk of nuisance hubs and practitioners should use mitigation techniques if relevant.
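
To make the definition of hubness above concrete, the following is a minimal sketch of how it is often quantified: count how many times each point appears in the k-nearest-neighbour lists of the other points (the k-occurrence), then look at how skewed those counts are. Everything here is an illustrative assumption rather than the paper's procedure: the function names (`k_occurrence`, `skewness`, `demo`), the synthetic Gaussian data, the use of Euclidean distance, and the choice of skewness as the summary statistic.

```python
import math
import random


def k_occurrence(points, k):
    """Return N_k: how often each point appears in other points' k-NN lists."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Squared Euclidean distance from point i to every other point
        # (squaring preserves the neighbour ranking).
        dists = [
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n)
            if j != i
        ]
        dists.sort()
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Third standardized moment; large positive values suggest a few hubs."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    third = sum((v - mean) ** 3 for v in values) / n
    return third / math.pow(var, 1.5)


def demo(dim, n=200, k=10, seed=0):
    """Hypothetical demo on synthetic Gaussian points (not LLM data)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    counts = k_occurrence(points, k)
    return counts, skewness(counts)


if __name__ == "__main__":
    for dim in (3, 100):
        counts, skew = demo(dim)
        print(f"dim={dim}: max N_k={max(counts)}, skewness={skew:.2f}")
```

Every point has exactly k neighbours, so the counts always sum to n times k; hubness concerns how unevenly that total is distributed. Running the script for a low and a high dimension gives a rough feel for how the distribution of counts changes with dimensionality under Euclidean distance.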
