
Positional Encoding

Sinusoidal

Attention is All You Need

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$

The encoding is then added to the input embeddings.
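
A minimal NumPy sketch of the sinusoidal table (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) table of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]         # dimension-pair index
    angle = pos / (10000 ** (2 * i / d_model))   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dims: sin
    pe[:, 1::2] = np.cos(angle)                  # odd dims: cos
    return pe

# x: (seq_len, d_model) token embeddings; the encoding is added elementwise
x = np.random.randn(16, 64)
x = x + sinusoidal_pe(16, 64)
```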

NoPE

Length Generalization of Causal Transformers without Position Encoding

No positional encoding.

Additive

ALiBi

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

  • Original attention score: $\mathrm{softmax}(q_i K^\top)$
  • With ALiBi: $\mathrm{softmax}\big(q_i K^\top + m \cdot [-(i-1), \dots, -2, -1, 0]\big)$
  • $m$ is a constant, head-specific slope: $2^{-8n/N}$ for the $n$-th of $N$ heads (for 8 heads: $2^{-1}, 2^{-2}, \dots, 2^{-8}$).
  • No positional encoding.
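
A small NumPy sketch of the ALiBi bias added to the attention logits (helper names are illustrative):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Head-specific slopes: the n-th of N heads uses 2^(-8n/N)."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """(n_heads, seq_len, seq_len) bias: m * -(i - j) for j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = np.minimum(j - i, 0)                  # -(i - j) for past tokens, 0 for future (masked anyway)
    return alibi_slopes(n_heads)[:, None, None] * dist

# usage: scores has shape (n_heads, seq_len, seq_len)
scores = np.random.randn(8, 10, 10)
scores = scores + alibi_bias(8, 10)              # then apply causal mask + softmax
```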

T5's RPE

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

$$B_{i,j} = r_{\min(i-j,\,K)}$$
  • $K$ is a hyper-parameter.
  • $r_i$ are learnable scalars.
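
A minimal sketch of this clipped relative bias (the actual T5 implementation uses log-spaced distance buckets; this illustrates only the simplified form above):

```python
import numpy as np

def clipped_relative_bias(seq_len: int, K: int, r: np.ndarray) -> np.ndarray:
    """B[i, j] = r[min(i - j, K)]; r holds K + 1 learnable scalars (distances 0..K)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    bucket = np.minimum(np.maximum(i - j, 0), K)   # clip the relative distance at K
    return r[bucket]                               # added to attention logits; future positions get masked

r = np.random.randn(33)                            # K = 32 plus distance 0
bias = clipped_relative_bias(10, 32, r)
```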

KERPLE

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

$$B_{i,j} = -r_1 \log\!\left(1 + r_2\,|i-j|\right)$$

Sandwich

Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis

$$B_{i,j} = r_1 \sum_{k=1}^{r_2} \cos\!\left(\frac{i-j}{10000^{k/d}}\right)$$

FIRE

Functional Interpolation for Relative Positions Improves Long Context Transformers

$$B_{i,j} = f_\theta\!\left(\frac{\psi(i-j)}{\psi(\max\{L,\, i\})}\right)$$

  • $f_\theta: \mathbb{R} \to \mathbb{R}$ is an MLP.
  • $\psi: \mathbb{N} \to \mathbb{R}^{+}$ is monotonically increasing (e.g. $\psi(x) = \log(cx+1)$).
  • $L > 0$ is a learnable scalar.
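
A hedged NumPy sketch of the FIRE bias; the two-layer MLP, its width, and the constant $c$ are illustrative stand-ins:

```python
import numpy as np

def psi(x: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Monotone transform psi(x) = log(c*x + 1)."""
    return np.log(c * x + 1.0)

def fire_bias(seq_len: int, W1, b1, W2, b2, L: float = 2.0) -> np.ndarray:
    """B[i, j] = f_theta(psi(i - j) / psi(max(L, i))) for the causal part j <= i."""
    i = np.arange(seq_len)[:, None].astype(float)
    j = np.arange(seq_len)[None, :].astype(float)
    rel = np.maximum(i - j, 0.0)                    # relative distance, causal part only
    x = psi(rel) / psi(np.maximum(L, i))            # progressive interpolation input
    h = np.maximum(x[..., None] @ W1 + b1, 0.0)     # hidden layer (ReLU stand-in)
    return (h @ W2 + b2)[..., 0]                    # scalar bias per (i, j)

# illustrative MLP parameters: 1 -> 32 -> 1
W1, b1 = np.random.randn(1, 32) * 0.1, np.zeros(32)
W2, b2 = np.random.randn(32, 1) * 0.1, np.zeros(1)
bias = fire_bias(10, W1, b1, W2, b2)
```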

CaPE

CAPE: Context-Adaptive Positional Encoding for Length Extrapolation

  • $A^{\mathrm{CAPE}}(X) = XW_Q (XW_K)^\top + f\!\big(XW_Q (XW_K)^\top,\; B\big)$
  • $f$ is a two-layer LeakyReLU neural network.
  • $B$ denotes positional bias matrices (e.g. from ALiBi or FIRE).

FoX

Forgetting Transformer: Softmax Attention with a Forget Gate

  • Dynamic down-weighting of past information.
  • No need for position embeddings.
  • Compatible with FlashAttention.
  1. Scalar Forget Gate

    $$f_t = \sigma(w_f^\top x_t + b_f)$$

    where $w_f$ and $b_f$ are learnable, per-head parameters (for multi-head attention).

  2. Cumulative Forget Factor

    $$F_{ij} = \prod_{l=j+1}^{i} f_l \quad (= 1 \text{ if } i = j), \qquad B_{ij} = \log F_{ij} = \sum_{l=j+1}^{i} \log f_l$$
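
A hedged single-head NumPy sketch of how the cumulative forget factor enters attention as a logit bias (the unprojected $XX^\top$ stands in for $QK^\top$; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fox_logit_bias(X: np.ndarray, w_f: np.ndarray, b_f: float) -> np.ndarray:
    """B[i, j] = sum_{l=j+1}^{i} log f_l (0 on the diagonal), single head."""
    log_f = np.log(sigmoid(X @ w_f + b_f))         # log forget gates, shape (T,)
    c = np.concatenate([[0.0], np.cumsum(log_f)])  # c[t] = sum_{l <= t-1} log f_l
    B = c[1:, None] - c[None, 1:]                  # B[i, j] = c[i+1] - c[j+1]
    return B                                       # only j <= i is used (causal mask)

T, d = 6, 16
X = np.random.randn(T, d)
w_f, b_f = np.random.randn(d) * 0.1, 1.0
scores = (X @ X.T) / np.sqrt(d) + fox_logit_bias(X, w_f, b_f)   # then causal mask + softmax
```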

FoX (Pro)

CoPE

Contextual Position Encoding: Learning to Count What's Important

  • Dynamically decide which tokens should be counted based on the context.
  • More flexible position addressing (e.g. i-th specific word, noun, or sentence).

  1. Gate Computation

    $$g_{ij} = \sigma(q_i^\top k_j)$$
  2. Contextual Position Calculation

    $$p_{ij} = \sum_{k=j}^{i} g_{ik}$$
  3. Position Embedding Interpolation

    • Because $p_{ij}$ may be fractional, interpolation is used to compute its embedding vector.
    • For each integer position $p$, a learnable embedding vector $e[p]$ is used.
    • For fractional positions: $e[p_{ij}] = (p_{ij} - \lfloor p_{ij} \rfloor)\, e[\lceil p_{ij} \rceil] + (1 - p_{ij} + \lfloor p_{ij} \rfloor)\, e[\lfloor p_{ij} \rfloor]$
  4. Attention Calculation

    Raw:

    • $a_{ij} = \mathrm{Softmax}\!\big(q_i^\top k_j + q_i^\top e[p_{ij}]\big)$

    Optimized (the query interacts with the integer-position embeddings before interpolation):

    • Pre-compute for all integer positions $p$: $z_i[p] = q_i^\top e[p]$
    • Interpolate the scalar attention contribution: $z_i[p_{ij}] = (p_{ij} - \lfloor p_{ij} \rfloor)\, z_i[\lceil p_{ij} \rceil] + (1 - p_{ij} + \lfloor p_{ij} \rfloor)\, z_i[\lfloor p_{ij} \rfloor]$
    • $a_{ij} = \mathrm{Softmax}\!\big(q_i^\top k_j + z_i[p_{ij}]\big)$
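
A hedged single-head NumPy sketch of CoPE following the optimized path above (the table size `p_max` and all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cope_logits(Q: np.ndarray, K: np.ndarray, E: np.ndarray) -> np.ndarray:
    """Pre-softmax logits q_i.k_j + z_i[p_ij]; E is a (p_max + 1, d) table of integer-position embeddings."""
    T, d = Q.shape
    qk = Q @ K.T                                     # content logits q_i . k_j, shape (T, T)
    gates = sigmoid(qk) * np.tril(np.ones((T, T)))   # g_ij, zeroed for future tokens j > i
    # p_ij = sum_{k=j}^{i} g_ik: reversed cumulative sum along each row
    p = np.flip(np.cumsum(np.flip(gates, axis=1), axis=1), axis=1)
    p = np.clip(p, 0.0, E.shape[0] - 1)              # clamp to the embedding table size
    z = Q @ E.T                                      # z_i[p] for every integer p, shape (T, p_max + 1)
    lo = np.floor(p).astype(int)                     # floor(p_ij)
    hi = np.minimum(lo + 1, E.shape[0] - 1)          # ceil(p_ij), clamped
    frac = p - lo
    z_interp = (1 - frac) * np.take_along_axis(z, lo, axis=1) \
             + frac * np.take_along_axis(z, hi, axis=1)
    return qk + z_interp                             # then causal mask + softmax over j

T, d, p_max = 8, 16, 8
Q, K = np.random.randn(T, d), np.random.randn(T, d)
E = np.random.randn(p_max + 1, d)                    # learnable in practice
logits = cope_logits(Q, K, E)
```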

SBA

Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study

  • Uses the stick-breaking process as a replacement for softmax in attention.
  • Naturally incorporates a recency bias.
  • No positional encoding needed.
  1. Original Logits

    $$z_{ij} = \frac{q_j^\top k_i}{\sqrt{d_{head}}}$$
  2. Breakpoint Probability

    $$\beta_{ij} = \sigma(z_{ij})$$
  3. Attention Weights

    • From j to i (backwards in time).
    $$A_{i,j} = \beta_{i,j} \prod_{i<k<j} (1 - \beta_{k,j})$$
  4. Output

    $$o_j = \sum_{i=1}^{j-1} A_{i,j}\, v_i$$
  • Numerically Stable Implementation

    Via a log-space formulation.

    1. Sigmoid in log-space:

      $$\log \beta_{i,j} = \log \sigma(z_{i,j}) = z_{i,j} - \log\!\big(1 + \exp(z_{i,j})\big)$$

      $$\log(1 - \beta_{k,j}) = \log\!\big(1 - \sigma(z_{k,j})\big) = -\log\!\big(1 + \exp(z_{k,j})\big)$$

      where $\log(1 + \exp(x))$ is commonly known as $\mathrm{softplus}(x)$.

    2. Compute $A_{i,j}$ in log-space:

      $$A_{i,j} = \exp\!\left(\log \beta_{i,j} + \sum_{k=i+1}^{j-1} \log(1 - \beta_{k,j})\right)$$

      $$A_{i,j} = \exp\!\left(z_{i,j} - \log\!\big(1 + \exp(z_{i,j})\big) - \sum_{k=i+1}^{j-1} \log\!\big(1 + \exp(z_{k,j})\big)\right)$$

      $$A_{i,j} = \exp\!\left(z_{i,j} - \sum_{k=i}^{j-1} \log\!\big(1 + \exp(z_{k,j})\big)\right)$$

    3. Stabilized softplus: $\mathrm{softplus}(x) = \begin{cases} \log(1 + \exp(x)) & \text{if } x < 15 \\ x & \text{otherwise} \end{cases}$
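
A hedged NumPy reference for single-head stick-breaking attention using the log-space form above (the paper's implementation is a fused kernel; this naive version is for clarity only):

```python
import numpy as np

def softplus(x: np.ndarray) -> np.ndarray:
    """Numerically stable log(1 + exp(x))."""
    return np.where(x < 15.0, np.log1p(np.exp(np.minimum(x, 15.0))), x)

def stick_breaking_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """o_j = sum_{i<j} A_{i,j} v_i with A_{i,j} = beta_{i,j} * prod_{i<k<j} (1 - beta_{k,j})."""
    T, d = Q.shape
    z = (Q @ K.T) / np.sqrt(d)                     # z[j, i] = q_j . k_i / sqrt(d)
    out = np.zeros_like(V)
    for j in range(T):
        # log A_{i,j} = z_{i,j} - sum_{k=i}^{j-1} softplus(z_{k,j}) for i < j
        sp = softplus(z[j, :j])                    # softplus(z_{k,j}) for k = 0..j-1
        tail = np.cumsum(sp[::-1])[::-1]           # tail[i] = sum_{k=i}^{j-1} sp[k]
        A = np.exp(z[j, :j] - tail)                # attention weights, total mass <= 1
        out[j] = A @ V[:j]
    return out

T, d = 8, 16
Q, K, V = (np.random.randn(T, d) for _ in range(3))
O = stick_breaking_attention(Q, K, V)
```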

Rotary

RoPE

RoFormer: Enhanced Transformer with Rotary Position Embedding

$$\underbrace{\begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix}}_{W_m}
\begin{pmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{pmatrix}$$

where $\theta_i = 10000^{-2(i-1)/d}$.

It works because $(W_m q)^\top (W_n k) = q^\top W_m^\top W_n k = q^\top W_{n-m}\, k$.
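
A minimal NumPy sketch of RoPE that rotates adjacent dimension pairs as in the matrix above, plus a check of the relative-position property (names are illustrative):

```python
import numpy as np

def rope(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """Rotate vector x (even length d) by position m: pair (x[2i], x[2i+1]) is rotated by m*theta_i."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # theta_i = 10000^{-2(i-1)/d}, 0-indexed here
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin           # first row of each 2x2 rotation block
    out[1::2] = x_even * sin + x_odd * cos           # second row
    return out

# relative-position property: (W_m q) . (W_n k) depends only on n - m
q, k = np.random.randn(64), np.random.randn(64)
print(np.allclose(rope(q, 3) @ rope(k, 7), rope(q, 13) @ rope(k, 17)))   # True
```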

2D-RoPE

Rotary Position Embedding for Vision Transformer

$$R(n, 2t) = e^{i\theta_t x_n}, \qquad R(n, 2t+1) = e^{i\theta_t y_n}, \qquad \theta_t = 100^{-t/(d_{head}/4)}, \quad \text{where } t \in \{0, 1, \dots, d_{head}/4\}$$

LieRE

LieRE: Lie Rotational Positional Encodings

  • For high-dimensional positions ($n$-dimensional, e.g. images and video).

  • Learn a basis of skew-symmetric matrices $\{A_i\}$.

    Skew-symmetric: $A_i^\top = -A_i$

  • A position $p$ is encoded via the generator $\sum_{i=0}^{n} p_i A_i$.

  • $R(p) = \exp\!\left(\sum_{i=0}^{n} p_i A_i\right)$

  • $Q_i' = R(p_i)\, Q_i$, $\quad K_i' = R(p_i)\, K_i$
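
A hedged NumPy/SciPy sketch of the LieRE rotation for a 2-D position (basis size, initialization scale, and names are illustrative):

```python
import numpy as np
from scipy.linalg import expm

def skew(M: np.ndarray) -> np.ndarray:
    """Project a square matrix onto the skew-symmetric matrices: A^T = -A."""
    return M - M.T

def liere_rotation(p: np.ndarray, basis: list) -> np.ndarray:
    """R(p) = exp(sum_i p_i A_i); each A_i is skew-symmetric, so R(p) is orthogonal."""
    G = sum(p_i * A_i for p_i, A_i in zip(p, basis))
    return expm(G)

d, n = 8, 2                                           # head dim 8, 2-D positions (e.g. image patches)
basis = [skew(np.random.randn(d, d) * 0.1) for _ in range(n)]   # learnable in practice
p = np.array([3.0, 5.0])                              # (row, col) position of a patch
q = np.random.randn(d)
q_rot = liere_rotation(p, basis) @ q                  # rotated query; keys are treated the same way
print(np.allclose(q_rot @ q_rot, q @ q))              # rotation preserves the norm -> True
```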

ComRoPE

ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices

  • $f$ is a relative positional encoding if and only if there exists a $g$ satisfying:

    $$g(q, k, x - y) = \rho\big(f(q, x),\, f(k, y)\big)$$

    where $\rho$ is a similarity function.

  • In this form, RoPE can be represented as:

    $$\begin{cases} f(q, x) = R_f(x)\, q \\ \rho(q, k) = q^\top k \\ g(q, k, x - y) = q^\top R_f(y - x)\, k \end{cases}$$

    which also satisfies:

    $$R_f(x)^\top R_f(y) = R_f(y - x)$$
  • $R$ is a rotation matrix function if

    $$R(x; \{A_1, A_2, \dots\}) = \exp\!\left(\sum_{i=1}^{N} A_i x_i\right), \quad \text{where } \exp(X) = \sum_{k=0}^{\infty} \frac{X^k}{k!}$$

    THEOREM:

    $$R_f(x)^\top R_f(y) = R_f(y - x) \iff \forall i, j,\ A_i A_j = A_j A_i$$

    Let $A_i = \mathrm{diag}(B_{i1}, B_{i2}, \dots, B_{im})$; the condition then becomes

    $$\forall i, j, k,\quad B_{ik} B_{jk} = B_{jk} B_{ik}$$
  • Then we need to find block matrices $B$ that satisfy the above (see the sketch below).

    • ComRoPE-AngleMatrices:

      $$B_{ij} = \begin{cases} P_j - P_j^\top, & \text{if } j \equiv i \pmod{N} \\ O, & \text{otherwise} \end{cases}$$

      where $P_j$ is trainable.

    • ComRoPE-LinearlyDependent:

      $$\lambda_1 B_1 = \lambda_2 B_2 = \cdots = \lambda_N B_N$$

      Specifically,

      $$B_i = \theta_i (P - P^\top)$$
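
A small NumPy/SciPy sketch checking the ComRoPE-AngleMatrices construction above: the generators commute, so the relative-position property holds (dimensions and names are illustrative):

```python
import numpy as np
from scipy.linalg import expm

def angle_matrices(d: int, b: int, N: int, rng) -> list:
    """A_i is block-diagonal with block j equal to P_j - P_j^T iff j = i (mod N), else zero."""
    m = d // b                                        # number of b x b blocks
    P = [rng.standard_normal((b, b)) * 0.1 for _ in range(m)]
    A = []
    for i in range(N):
        Ai = np.zeros((d, d))
        for j in range(m):
            if j % N == i % N:
                Ai[j*b:(j+1)*b, j*b:(j+1)*b] = P[j] - P[j].T   # skew-symmetric block
        A.append(Ai)
    return A

def R(x: np.ndarray, A: list) -> np.ndarray:
    """R(x; {A_i}) = exp(sum_i x_i A_i)."""
    return expm(sum(xi * Ai for xi, Ai in zip(x, A)))

rng = np.random.default_rng(0)
A = angle_matrices(d=8, b=2, N=2, rng=rng)
x, y = rng.standard_normal(2), rng.standard_normal(2)
print(np.allclose(A[0] @ A[1], A[1] @ A[0]))          # generators commute -> True
print(np.allclose(R(x, A).T @ R(y, A), R(y - x, A)))  # relative-position property -> True
```
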
| Method | Commutativity | Extra Parameters | Extra Time Complexity |
|---|---|---|---|
| APE | – | $nd$ | $O(nd)$ |
| Vanilla RoPE | Yes | $0$ | $O\big(Lnd(bN + b^2 + \tfrac{d}{h})\big) = O\big(\tfrac{Lnd^2}{h}\big)$ |
| LieRE | Commonly Not | $LNdb$ | $O\big(Lnd(bN + b^2 + \tfrac{d}{h})\big)$ |
| ComRoPE-AP | Yes | $Ldb$ | $O\big(Lnd(bN + b^2 + \tfrac{d}{h})\big)$ |
| ComRoPE-LD | Yes | $Ld(b + \tfrac{N}{b})$ | $O\big(Lnd(bN + b^2 + \tfrac{d}{h})\big)$ |

FoPE

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

TaPE

conTextualized equivariAnt Position Embedding

Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding

  • Permutation Equivariance