Abstract
Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, the fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation beyond pre-trained sequence lengths. Expert-designed methods such as ALiBi and RoPE mitigate this limitation but demand extensive modifications to adapt to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy, particularly under context length extrapolation, but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.