
Theoretical Analysis of Positional Encodings in Transformer Models
Impact on Expressiveness and Generalization
Abstract
Positional encodings are a core component of transformer-based architectures, enabling such models to process sequential data without recurrence. Despite their critical role, the theoretical properties of various positional encoding schemes—including sinusoidal, learned, relative, and recent bias-based methods such as Attention with Linear Biases (ALiBi)—remain poorly understood. In this paper, we present a comprehensive theoretical framework to analyze how different positional encodings affect a transformer’s expressiveness, generalization ability, and extrapolation to sequences longer than those seen during training.
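For reference, the sinusoidal scheme from the original transformer assigns each position a fixed vector of sines and cosines at geometrically spaced frequencies. The following is a minimal NumPy sketch of that standard construction; the function name and shapes are illustrative, not taken from this paper.

```python
# Minimal sketch of the standard sinusoidal positional encoding
# (Vaswani et al., 2017). Names and shapes here are illustrative.
import numpy as np

def sinusoidal_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Return a (num_positions, d_model) matrix of sinusoidal encodings."""
    positions = np.arange(num_positions)[:, None]          # (N, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices
    angular_freq = 1.0 / (10000.0 ** (dims / d_model))      # geometric frequency ladder
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(positions * angular_freq)          # even dims: sine
    pe[:, 1::2] = np.cos(positions * angular_freq)          # odd dims: cosine
    return pe

pe = sinusoidal_encoding(num_positions=128, d_model=64)
print(pe.shape)  # (128, 64)
```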
Key Contributions
- Expressiveness Characterization: Formal analysis of model capacity under different PE schemes.
- Generalization Bounds: Rademacher complexity analysis for variable sequence lengths.
- Extrapolation Analysis: Unified framework for understanding length extrapolation in bias-based PEs.
- Novel Schemes: Proposed orthogonal function-based encodings (wavelets, Legendre polynomials) with superior extrapolation; a sketch of one such construction follows this list.
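To make the orthogonal-function family concrete, the sketch below shows one plausible Legendre-polynomial encoding: positions are rescaled to [-1, 1], the natural domain of the Legendre basis, and each embedding dimension is one polynomial evaluated at the rescaled position. The function name, rescaling, and clipping choice are illustrative assumptions, not the paper's exact construction.

```python
# Hypothetical sketch of an orthogonal-function positional encoding built from
# Legendre polynomials; the paper's actual construction may differ.
import numpy as np

def legendre_encoding(num_positions: int, d_model: int, max_len: int = 512) -> np.ndarray:
    """Return a (num_positions, d_model) matrix whose columns are Legendre
    polynomials P_0 ... P_{d_model-1} evaluated at rescaled positions."""
    x = 2.0 * np.arange(num_positions) / (max_len - 1) - 1.0  # map [0, max_len) to [-1, 1]
    x = np.clip(x, -1.0, 1.0)                                 # keep out-of-range positions in-domain
    # legvander gives the pseudo-Vandermonde matrix [P_0(x), ..., P_deg(x)].
    return np.polynomial.legendre.legvander(x, deg=d_model - 1)

pe = legendre_encoding(num_positions=128, d_model=32)
print(pe.shape)  # (128, 32)
```

With this clipping choice, every position beyond max_len maps to x = 1, where P_n(1) = 1 for all n, so out-of-range positions share a single constant vector; this is consistent with the limited Legendre extrapolation reported in the findings below.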
Core Questions
- What classes of functions can different PEs approximate?
- How do PEs affect generalization to unseen data?
- Why do methods like ALiBi enable effective extrapolation?
- Can orthogonal transforms surpass sinusoidal baselines?
Key Findings
Our lightweight experiments confirm that ALiBi extrapolates effectively to longer sequences by imposing a monotonic distance bias (a sketch follows the table below). Furthermore, our proposed wavelet-based encodings provide strong extrapolation, matching or surpassing ALiBi, because their high-frequency components decay exponentially beyond the training range.
| Encoding   | Extrapolation | Behavior                              |
|------------|---------------|---------------------------------------|
| Sinusoidal | Limited       | Cyclic ambiguity beyond N_max          |
| ALiBi      | Strong        | Linear bias ensures monotonicity       |
| Wavelet    | Strong        | Bounded difference, exponential decay  |
| Legendre   | Limited       | Collapses to constant vector           |
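To illustrate the monotonic distance bias behind ALiBi's strong extrapolation, the sketch below builds the additive bias that ALiBi adds to attention logits, using the geometric slope schedule from the ALiBi paper for head counts that are powers of two; variable names are ours.

```python
# Minimal sketch of the ALiBi distance bias (Press et al., 2022).
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) additive bias: -slope * |i - j|."""
    # Head-specific slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]           # query positions
    j = np.arange(seq_len)[None, :]           # key positions
    distance = np.abs(i - j)                   # grows monotonically with separation
    return -slopes[:, None, None] * distance   # larger distance => larger penalty

bias = alibi_bias(seq_len=16, num_heads=8)
# The bias is added to the attention logits before the softmax; in causal
# attention the upper triangle is masked out anyway.
```

Because the penalty keeps growing with distance rather than cycling, attention stays biased toward nearby tokens at any sequence length, which is the mechanism the findings above attribute to ALiBi's extrapolation.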
Conclusion
Positional encodings are far more than an implementation detail; they fundamentally shape a transformer's capabilities. Our theoretical framework, combined with novel PE schemes and lightweight empirical validation, lays the groundwork for transformer architectures that are more robust, generalize better, and extrapolate reliably to longer sequences.