Geometric Signatures of Compositionality Across a Language Model's Lifetime

Abstract

By virtue of linguistic compositionality, few syntactic rules and a finite lexicon can generate an unbounded number of sentences. That is, language, though seemingly high-dimensional, can be explained using relatively few degrees of freedom. An open question is whether contemporary language models (LMs) reflect the intrinsic simplicity of language that is enabled by compositionality. We take a geometric view of this problem by relating the degree of compositionality in a dataset to the intrinsic dimension (ID) of its representations under an LM, a measure of feature complexity. We find not only that the degree of dataset compositionality is reflected in representations' ID, but that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training. Finally, our analyses reveal a striking contrast between nonlinear and linear dimensionality, showing they respectively encode semantic and superficial aspects of linguistic composition.
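To make the contrast between nonlinear and linear dimensionality concrete, here is a minimal sketch on synthetic data. The abstract does not state which estimators are used, so everything below is an assumption made for illustration: the nonlinear ID is estimated with the maximum-likelihood variant of the TwoNN estimator, the linear dimension is taken as the number of principal components needed to explain 99% of the variance, and the points are a hypothetical stand-in for LM representations (a 2-dimensional latent variable embedded nonlinearly in 20 dimensions). The names `twonn_id` and `pca_dim` are illustrative.

```python
import numpy as np


def twonn_id(X):
    """Maximum-likelihood TwoNN estimate of intrinsic dimension (assumed estimator)."""
    # Pairwise squared Euclidean distances via the Gram matrix.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)          # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)      # exclude each point's distance to itself

    # Distances to the first and second nearest neighbours of every point.
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    r1 = np.sqrt(nearest[:, 0])
    r2 = np.sqrt(nearest[:, 1])

    # Drop exact duplicates (r1 == 0), for which the ratio is undefined.
    keep = r1 > 0
    mu = r2[keep] / r1[keep]
    return keep.sum() / np.sum(np.log(mu))


def pca_dim(X, var_threshold=0.99):
    """Number of principal components needed to reach var_threshold of the variance."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, var_threshold) + 1)


# Hypothetical stand-in for representations: 2 latent degrees of freedom,
# embedded nonlinearly in a 20-dimensional ambient space.
rng = np.random.default_rng(0)
z = rng.uniform(size=(2000, 2))
W = rng.normal(scale=3.0, size=(2, 20))
b = rng.uniform(0.0, 2.0 * np.pi, size=20)
X = np.sin(z @ W + b)

nonlinear_id = twonn_id(X)
linear_dim = pca_dim(X)
print(f"TwoNN ID: {nonlinear_id:.2f}, PCA dimension (99% variance): {linear_dim}")
```

On data like this, the nonlinear estimate should stay close to the number of latent degrees of freedom, while the linear count grows with the curvature of the embedding. The two measures can therefore diverge on the same set of points.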
