
Positional Encoding

Sinusoidal

Attention is All You Need

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$

The encoding is then added to the input embeddings.
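
A minimal NumPy sketch of the sinusoidal table (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) table of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]         # dimension-pair index
    angle = pos / (10000 ** (2 * i / d_model))   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dims: sin
    pe[:, 1::2] = np.cos(angle)                  # odd dims: cos
    return pe

# x: (seq_len, d_model) token embeddings; the encoding is added elementwise
x = np.random.randn(16, 64)
x = x + sinusoidal_pe(16, 64)
```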

NoPE

Length Generalization of Causal Transformers without Position Encoding

No positional encoding.

Additive

ALiBi

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

  • Original attention score: $\mathrm{softmax}(q_i K^\top)$
  • With ALiBi: $\mathrm{softmax}\big(q_i K^\top + m \cdot [-(i-1), \dots, -2, -1, 0]\big)$
  • $m$ is a constant, head-specific slope: $2^{-8n/N}$ for the $n$-th of $N$ heads (for 8 heads: $2^{-1}, 2^{-2}, \dots, 2^{-8}$).
  • No positional encoding.
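
A small NumPy sketch of the ALiBi bias added to the attention logits (helper names are illustrative):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Head-specific slopes: the n-th of N heads uses 2^(-8n/N)."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """(n_heads, seq_len, seq_len) bias: m * -(i - j) for j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = np.minimum(j - i, 0)                  # -(i - j) for past tokens, 0 for future (masked anyway)
    return alibi_slopes(n_heads)[:, None, None] * dist

# usage: scores has shape (n_heads, seq_len, seq_len)
scores = np.random.randn(8, 10, 10)
scores = scores + alibi_bias(8, 10)              # then apply causal mask + softmax
```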

T5's RPE

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

$$B_{i,j} = r_{\min(i-j,\,K)}$$
  • $K$ is a hyper-parameter.
  • $r_i$ are learnable scalars.
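
A minimal sketch of this clipped relative bias (the actual T5 implementation uses log-spaced distance buckets; this illustrates only the simplified form above):

```python
import numpy as np

def clipped_relative_bias(seq_len: int, K: int, r: np.ndarray) -> np.ndarray:
    """B[i, j] = r[min(i - j, K)]; r holds K + 1 learnable scalars (distances 0..K)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    bucket = np.minimum(np.maximum(i - j, 0), K)   # clip the relative distance at K
    return r[bucket]                               # added to attention logits; future positions get masked

r = np.random.randn(33)                            # K = 32 plus distance 0
bias = clipped_relative_bias(10, 32, r)
```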

KERPLE

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

$$B_{i,j} = -r_1 \log\!\left(1 + r_2\,|i-j|\right)$$

Sandwich

Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis

$$B_{i,j} = r_1 \sum_{k=1}^{r_2} \cos\!\left(\frac{i-j}{10000^{k/d}}\right)$$

FIRE

Functional Interpolation for Relative Positions Improves Long Context Transformers

$$B_{i,j} = f_\theta\!\left(\frac{\psi(i-j)}{\psi(\max\{L,\, i\})}\right)$$

  • $f_\theta: \mathbb{R} \to \mathbb{R}$ is an MLP.
  • $\psi: \mathbb{N} \to \mathbb{R}^{+}$ is monotonically increasing (e.g. $\psi(x) = \log(cx+1)$).
  • $L > 0$ is a learnable scalar.
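
A hedged NumPy sketch of the FIRE bias; the two-layer MLP, its width, and the constant $c$ are illustrative stand-ins:

```python
import numpy as np

def psi(x: np.ndarray, c: float = 1.0) -> np.ndarray:
    """Monotone transform psi(x) = log(c*x + 1)."""
    return np.log(c * x + 1.0)

def fire_bias(seq_len: int, W1, b1, W2, b2, L: float = 2.0) -> np.ndarray:
    """B[i, j] = f_theta(psi(i - j) / psi(max(L, i))) for the causal part j <= i."""
    i = np.arange(seq_len)[:, None].astype(float)
    j = np.arange(seq_len)[None, :].astype(float)
    rel = np.maximum(i - j, 0.0)                    # relative distance, causal part only
    x = psi(rel) / psi(np.maximum(L, i))            # progressive interpolation input
    h = np.maximum(x[..., None] @ W1 + b1, 0.0)     # hidden layer (ReLU stand-in)
    return (h @ W2 + b2)[..., 0]                    # scalar bias per (i, j)

# illustrative MLP parameters: 1 -> 32 -> 1
W1, b1 = np.random.randn(1, 32) * 0.1, np.zeros(32)
W2, b2 = np.random.randn(32, 1) * 0.1, np.zeros(1)
bias = fire_bias(10, W1, b1, W2, b2)
```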

CaPE

CAPE: Context-Adaptive Positional Encoding for Length Extrapolation

  • $A^{\mathrm{CAPE}}(X) = XW_Q (XW_K)^\top + f\!\big(XW_Q (XW_K)^\top,\; B\big)$
  • $f$ is a two-layer LeakyReLU neural network.
  • $B$ denotes positional bias matrices (e.g. from ALiBi or FIRE).

FoX

Forgetting Transformer: Softmax Attention with a Forget Gate

  • Dynamic down-weighting of past information.
  • No need for position embeddings.
  • Compatible with FlashAttention.
  1. Scalar Forget Gate

    $$f_t = \sigma(w_f^\top x_t + b_f)$$

    where $w_f$ and $b_f$ are learnable, per-head parameters (for multi-head attention).

  2. Cumulative Forget Factor

    $$F_{ij} = \prod_{l=j+1}^{i} f_l \quad (= 1 \text{ if } i = j), \qquad B_{ij} = \log F_{ij} = \sum_{l=j+1}^{i} \log f_l$$
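
A hedged single-head NumPy sketch of how the cumulative forget factor enters attention as a logit bias (the unprojected $XX^\top$ stands in for $QK^\top$; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fox_logit_bias(X: np.ndarray, w_f: np.ndarray, b_f: float) -> np.ndarray:
    """B[i, j] = sum_{l=j+1}^{i} log f_l (0 on the diagonal), single head."""
    log_f = np.log(sigmoid(X @ w_f + b_f))         # log forget gates, shape (T,)
    c = np.concatenate([[0.0], np.cumsum(log_f)])  # c[t] = sum_{l <= t-1} log f_l
    B = c[1:, None] - c[None, 1:]                  # B[i, j] = c[i+1] - c[j+1]
    return B                                       # only j <= i is used (causal mask)

T, d = 6, 16
X = np.random.randn(T, d)
w_f, b_f = np.random.randn(d) * 0.1, 1.0
scores = (X @ X.T) / np.sqrt(d) + fox_logit_bias(X, w_f, b_f)   # then causal mask + softmax
```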

FoX (Pro)

CoPE

Contextual Position Encoding: Learning to Count What's Important

  • Dynamically decide which tokens should be counted based on the context.
  • More flexible position addressing (e.g. i-th specific word, noun, or sentence).

  1. Gate Computation

    $$g_{ij} = \sigma(q_i^\top k_j)$$
  2. Contextual Position Calculation

    $$p_{ij} = \sum_{k=j}^{i} g_{ik}$$
  3. Position Embedding Interpolation

    • Because $p_{ij}$ may be fractional, interpolation is used to compute its embedding vector.
    • For each integer position $p$, a learnable embedding vector $e[p]$ is used.
    • For fractional positions: $e[p_{ij}] = (p_{ij} - \lfloor p_{ij} \rfloor)\, e[\lceil p_{ij} \rceil] + (1 - p_{ij} + \lfloor p_{ij} \rfloor)\, e[\lfloor p_{ij} \rfloor]$
  4. Attention Calculation

    Raw:

    • $a_{ij} = \mathrm{Softmax}\!\big(q_i^\top k_j + q_i^\top e[p_{ij}]\big)$

    Optimized (the query interacts with the integer-position embeddings before interpolation):

    • Pre-compute for all integer positions $p$: $z_i[p] = q_i^\top e[p]$
    • Interpolate the scalar attention contribution: $z_i[p_{ij}] = (p_{ij} - \lfloor p_{ij} \rfloor)\, z_i[\lceil p_{ij} \rceil] + (1 - p_{ij} + \lfloor p_{ij} \rfloor)\, z_i[\lfloor p_{ij} \rfloor]$
    • $a_{ij} = \mathrm{Softmax}\!\big(q_i^\top k_j + z_i[p_{ij}]\big)$
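
A hedged single-head NumPy sketch of CoPE following the optimized path above (the table size `p_max` and all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cope_logits(Q: np.ndarray, K: np.ndarray, E: np.ndarray) -> np.ndarray:
    """Pre-softmax logits q_i.k_j + z_i[p_ij]; E is a (p_max + 1, d) table of integer-position embeddings."""
    T, d = Q.shape
    qk = Q @ K.T                                     # content logits q_i . k_j, shape (T, T)
    gates = sigmoid(qk) * np.tril(np.ones((T, T)))   # g_ij, zeroed for future tokens j > i
    # p_ij = sum_{k=j}^{i} g_ik: reversed cumulative sum along each row
    p = np.flip(np.cumsum(np.flip(gates, axis=1), axis=1), axis=1)
    p = np.clip(p, 0.0, E.shape[0] - 1)              # clamp to the embedding table size
    z = Q @ E.T                                      # z_i[p] for every integer p, shape (T, p_max + 1)
    lo = np.floor(p).astype(int)                     # floor(p_ij)
    hi = np.minimum(lo + 1, E.shape[0] - 1)          # ceil(p_ij), clamped
    frac = p - lo
    z_interp = (1 - frac) * np.take_along_axis(z, lo, axis=1) \
             + frac * np.take_along_axis(z, hi, axis=1)
    return qk + z_interp                             # then causal mask + softmax over j

T, d, p_max = 8, 16, 8
Q, K = np.random.randn(T, d), np.random.randn(T, d)
E = np.random.randn(p_max + 1, d)                    # learnable in practice
logits = cope_logits(Q, K, E)
```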

SBA

Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study

  • Uses the stick-breaking process as a replacement for softmax in attention.
  • Naturally incorporates a recency bias.
  • No positional encoding needed.
  1. Original Logits

    $$z_{ij} = \frac{q_j^\top k_i}{\sqrt{d_{head}}}$$
  2. Breakpoint Probability

    $$\beta_{ij} = \sigma(z_{ij})$$
  3. Attention Weights

    • From j to i (backwards in time).
    $$A_{i,j} = \beta_{i,j} \prod_{i<k<j} (1 - \beta_{k,j})$$
  4. Output

    $$o_j = \sum_{i=1}^{j-1} A_{i,j}\, v_i$$
  • Numerically Stable Implementation

    Via a log-space formulation.

    1. Sigmoid in log-space:

      $$\log \beta_{i,j} = \log \sigma(z_{i,j}) = z_{i,j} - \log\!\big(1 + \exp(z_{i,j})\big)$$

      $$\log(1 - \beta_{k,j}) = \log\!\big(1 - \sigma(z_{k,j})\big) = -\log\!\big(1 + \exp(z_{k,j})\big)$$

      where $\log(1 + \exp(x))$ is commonly known as $\mathrm{softplus}(x)$.

    2. Compute $A_{i,j}$ in log-space:

      $$A_{i,j} = \exp\!\left(\log \beta_{i,j} + \sum_{k=i+1}^{j-1} \log(1 - \beta_{k,j})\right)$$

      $$A_{i,j} = \exp\!\left(z_{i,j} - \log\!\big(1 + \exp(z_{i,j})\big) - \sum_{k=i+1}^{j-1} \log\!\big(1 + \exp(z_{k,j})\big)\right)$$

      $$A_{i,j} = \exp\!\left(z_{i,j} - \sum_{k=i}^{j-1} \log\!\big(1 + \exp(z_{k,j})\big)\right)$$

    3. Stabilized softplus: $\mathrm{softplus}(x) = \begin{cases} \log(1 + \exp(x)) & \text{if } x < 15 \\ x & \text{otherwise} \end{cases}$
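
A hedged NumPy reference for single-head stick-breaking attention using the log-space form above (the paper's implementation is a fused kernel; this naive version is for clarity only):

```python
import numpy as np

def softplus(x: np.ndarray) -> np.ndarray:
    """Numerically stable log(1 + exp(x))."""
    return np.where(x < 15.0, np.log1p(np.exp(np.minimum(x, 15.0))), x)

def stick_breaking_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """o_j = sum_{i<j} A_{i,j} v_i with A_{i,j} = beta_{i,j} * prod_{i<k<j} (1 - beta_{k,j})."""
    T, d = Q.shape
    z = (Q @ K.T) / np.sqrt(d)                     # z[j, i] = q_j . k_i / sqrt(d)
    out = np.zeros_like(V)
    for j in range(T):
        # log A_{i,j} = z_{i,j} - sum_{k=i}^{j-1} softplus(z_{k,j}) for i < j
        sp = softplus(z[j, :j])                    # softplus(z_{k,j}) for k = 0..j-1
        tail = np.cumsum(sp[::-1])[::-1]           # tail[i] = sum_{k=i}^{j-1} sp[k]
        A = np.exp(z[j, :j] - tail)                # attention weights, total mass <= 1
        out[j] = A @ V[:j]
    return out

T, d = 8, 16
Q, K, V = (np.random.randn(T, d) for _ in range(3))
O = stick_breaking_attention(Q, K, V)
```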

Rotary

RoPE

RoFormer: Enhanced Transformer with Rotary Position Embedding

$$\underbrace{\begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix}}_{W_m}
\begin{pmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{pmatrix}$$

where $\theta_i = 10000^{-2(i-1)/d}$.

It works because $(W_m q)^\top (W_n k) = q^\top W_m^\top W_n k = q^\top W_{n-m}\, k$.
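
A minimal NumPy sketch of RoPE that rotates adjacent dimension pairs as in the matrix above, plus a check of the relative-position property (names are illustrative):

```python
import numpy as np

def rope(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """Rotate vector x (even length d) by position m: pair (x[2i], x[2i+1]) is rotated by m*theta_i."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # theta_i = 10000^{-2(i-1)/d}, 0-indexed here
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin           # first row of each 2x2 rotation block
    out[1::2] = x_even * sin + x_odd * cos           # second row
    return out

# relative-position property: (W_m q) . (W_n k) depends only on n - m
q, k = np.random.randn(64), np.random.randn(64)
print(np.allclose(rope(q, 3) @ rope(k, 7), rope(q, 13) @ rope(k, 17)))   # True
```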

2D-RoPE

Rotary Position Embedding for Vision Transformer

$$R(n, 2t) = e^{i\theta_t x_n}, \qquad R(n, 2t+1) = e^{i\theta_t y_n}, \qquad \theta_t = 100^{-t/(d_{head}/4)}, \quad \text{where } t \in \{0, 1, \dots, d_{head}/4\}$$

LieRE

LieRE: Lie Rotational Positional Encodings

  • For high-dimensional positions ($n$-dimensional, e.g. images and video).

  • Learn a basis of skew-symmetric matrices $\{A_i\}$.

    Skew-symmetric: $A_i^\top = -A_i$

  • A position $p$ is encoded via the generator $\sum_{i=0}^{n} p_i A_i$.

  • $R(p) = \exp\!\left(\sum_{i=0}^{n} p_i A_i\right)$

  • $Q_i' = R(p_i)\, Q_i$, $\quad K_i' = R(p_i)\, K_i$
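
A hedged NumPy/SciPy sketch of the LieRE rotation for a 2-D position (basis size, initialization scale, and names are illustrative):

```python
import numpy as np
from scipy.linalg import expm

def skew(M: np.ndarray) -> np.ndarray:
    """Project a square matrix onto the skew-symmetric matrices: A^T = -A."""
    return M - M.T

def liere_rotation(p: np.ndarray, basis: list) -> np.ndarray:
    """R(p) = exp(sum_i p_i A_i); each A_i is skew-symmetric, so R(p) is orthogonal."""
    G = sum(p_i * A_i for p_i, A_i in zip(p, basis))
    return expm(G)

d, n = 8, 2                                           # head dim 8, 2-D positions (e.g. image patches)
basis = [skew(np.random.randn(d, d) * 0.1) for _ in range(n)]   # learnable in practice
p = np.array([3.0, 5.0])                              # (row, col) position of a patch
q = np.random.randn(d)
q_rot = liere_rotation(p, basis) @ q                  # rotated query; keys are treated the same way
print(np.allclose(q_rot @ q_rot, q @ q))              # rotation preserves the norm -> True
```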

ComRoPE

ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices

  • $f$ is a relative positional encoding if and only if there exists a $g$ satisfying:

    $$g(q, k, x - y) = \rho\big(f(q, x),\, f(k, y)\big)$$

    where $\rho$ is a similarity function.

  • In this form, RoPE can be represented as:

    $$\begin{cases} f(q, x) = R_f(x)\, q \\ \rho(q, k) = q^\top k \\ g(q, k, x - y) = q^\top R_f(y - x)\, k \end{cases}$$

    which also satisfies:

    $$R_f(x)^\top R_f(y) = R_f(y - x)$$
  • $R$ is a rotation matrix function if

    $$R(x; \{A_1, A_2, \dots\}) = \exp\!\left(\sum_{i=1}^{N} A_i x_i\right), \quad \text{where } \exp(X) = \sum_{k=0}^{\infty} \frac{X^k}{k!}$$

    THEOREM:

    $$R_f(x)^\top R_f(y) = R_f(y - x) \iff \forall i, j,\ A_i A_j = A_j A_i$$

    Let $A_i = \mathrm{diag}(B_{i1}, B_{i2}, \dots, B_{im})$; the condition then becomes

    $$\forall i, j, k,\quad B_{ik} B_{jk} = B_{jk} B_{ik}$$
  • Then we need to find block matrices $B$ that satisfy the above (see the sketch below).

    • ComRoPE-AngleMatrices:

      $$B_{ij} = \begin{cases} P_j - P_j^\top, & \text{if } j \equiv i \pmod{N} \\ O, & \text{otherwise} \end{cases}$$

      where $P_j$ is trainable.

    • ComRoPE-LinearlyDependent:

      $$\lambda_1 B_1 = \lambda_2 B_2 = \cdots = \lambda_N B_N$$

      Specifically,

      $$B_i = \theta_i (P - P^\top)$$
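
A small NumPy/SciPy sketch checking the ComRoPE-AngleMatrices construction above: the generators commute, so the relative-position property holds (dimensions and names are illustrative):

```python
import numpy as np
from scipy.linalg import expm

def angle_matrices(d: int, b: int, N: int, rng) -> list:
    """A_i is block-diagonal with block j equal to P_j - P_j^T iff j = i (mod N), else zero."""
    m = d // b                                        # number of b x b blocks
    P = [rng.standard_normal((b, b)) * 0.1 for _ in range(m)]
    A = []
    for i in range(N):
        Ai = np.zeros((d, d))
        for j in range(m):
            if j % N == i % N:
                Ai[j*b:(j+1)*b, j*b:(j+1)*b] = P[j] - P[j].T   # skew-symmetric block
        A.append(Ai)
    return A

def R(x: np.ndarray, A: list) -> np.ndarray:
    """R(x; {A_i}) = exp(sum_i x_i A_i)."""
    return expm(sum(xi * Ai for xi, Ai in zip(x, A)))

rng = np.random.default_rng(0)
A = angle_matrices(d=8, b=2, N=2, rng=rng)
x, y = rng.standard_normal(2), rng.standard_normal(2)
print(np.allclose(A[0] @ A[1], A[1] @ A[0]))          # generators commute -> True
print(np.allclose(R(x, A).T @ R(y, A), R(y - x, A)))  # relative-position property -> True
```
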
| Method | Commutativity | Extra Parameters | Extra Time Complexity |
|---|---|---|---|
| APE | – | $nd$ | $O(nd)$ |
| Vanilla RoPE | Yes | $0$ | $O\big(Lnd(bN + b^2 + \tfrac{d}{h})\big) = O\big(\tfrac{Lnd^2}{h}\big)$ |
| LieRE | Commonly Not | $LNdb$ | $O\big(Lnd(bN + b^2 + \tfrac{d}{h})\big)$ |
| ComRoPE-AP | Yes | $Ldb$ | $O\big(Lnd(bN + b^2 + \tfrac{d}{h})\big)$ |
| ComRoPE-LD | Yes | $Ld(b + \tfrac{N}{b})$ | $O\big(Lnd(bN + b^2 + \tfrac{d}{h})\big)$ |

FoPE

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

TaPE

conTextualized equivariAnt Position Embedding

Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding

  • Permutation Equivariance