Linear Attention
Original Linear Attention
The original attention mechanism is defined as:

$$O = \mathrm{softmax}\!\left(Q K^\top\right) V, \qquad Q, K, V \in \mathbb{R}^{N \times d}.$$

Complexity: $O(N^2 d)$, quadratic in the sequence length $N$, since the $N \times N$ attention matrix must be materialized.

If we omit the softmax operator it becomes:

$$O = (Q K^\top)\, V = Q\,(K^\top V).$$

Complexity: $O(N d^2)$ once we compute $K^\top V \in \mathbb{R}^{d \times d}$ first, which is linear in $N$.
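As a sanity check, here is a minimal PyTorch sketch (the library choice and toy shapes are assumptions for illustration): the two multiplication orders give the same result, but only the second avoids the $N \times N$ intermediate.

```python
import torch

N, d = 256, 64                        # sequence length, head dimension (toy values)
Q = torch.randn(N, d)
K = torch.randn(N, d)
V = torch.randn(N, d)

# Quadratic order: (Q K^T) V  -> materializes an N x N matrix, O(N^2 d)
out_quadratic = (Q @ K.T) @ V

# Linear order: Q (K^T V)     -> only a d x d intermediate, O(N d^2)
out_linear = Q @ (K.T @ V)

# Without softmax the two orderings are mathematically identical
print(torch.allclose(out_quadratic, out_linear, atol=1e-3))
```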
In Autoregressive Models
In autoregressive models like GPT we need a causal mask so that position $t$ only attends to positions $j \le t$:

$$O = \mathrm{softmax}\!\left(Q K^\top + M_{-\infty}\right) V,$$

where $M_{-\infty}$ sets entries above the diagonal to $-\infty$. Removing softmax yields:

$$O = \left((Q K^\top) \odot M\right) V,$$

with $M$ the lower-triangular 0/1 causal mask ($M_{ij} = 1$ iff $j \le i$). Unfortunately the Hadamard product ($\odot$) with $M$ sits between $Q K^\top$ and $V$ and breaks associativity: we can no longer compute $K^\top V$ first, so the computation remains quadratic in $N$.
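A small sketch under the same assumed toy setup makes the problem concrete: because the mask sits between $Q K^\top$ and $V$, naively reassociating the product drops causality.

```python
import torch

N, d = 8, 4
Q, K, V = (torch.randn(N, d) for _ in range(3))
M = torch.tril(torch.ones(N, N))     # causal mask: M[i, j] = 1 iff j <= i

# Masked (causal) linear attention in its parallel, quadratic form
out_masked = ((Q @ K.T) * M) @ V

# Reassociating as Q (K^T V) ignores the mask and gives a non-causal result,
# so the O(N d^2) trick no longer applies directly
out_wrong = Q @ (K.T @ V)
print(torch.allclose(out_masked, out_wrong))   # False
```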
RNN-like Sequential Form
However, Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention observes that the masked equation can be written in an RNN-like recurrent form:

$$S_t = S_{t-1} + k_t^\top v_t, \qquad o_t = q_t S_t, \qquad S_0 = 0,$$

where $q_t, k_t, v_t \in \mathbb{R}^{1 \times d}$ are the $t$-th rows of $Q, K, V$ and $S_t \in \mathbb{R}^{d \times d}$ is a running state.
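A minimal recurrent sketch, assuming the single-head row-vector conventions above, checked against the masked parallel form:

```python
import torch

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention via the recurrence S_t = S_{t-1} + k_t^T v_t, o_t = q_t S_t."""
    N, d = Q.shape
    S = torch.zeros(d, V.shape[1])                       # running d x d_v state
    outputs = []
    for t in range(N):
        S = S + K[t].unsqueeze(1) @ V[t].unsqueeze(0)    # rank-1 update k_t^T v_t
        outputs.append(Q[t].unsqueeze(0) @ S)            # o_t = q_t S_t
    return torch.cat(outputs, dim=0)

N, d = 16, 8
Q, K, V = (torch.randn(N, d) for _ in range(3))
M = torch.tril(torch.ones(N, N))
print(torch.allclose(linear_attention_recurrent(Q, K, V),
                     ((Q @ K.T) * M) @ V, atol=1e-4))    # matches the parallel form
```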
This recurrence achieves linear time, but it introduces two drawbacks:

- High memory footprint during autograd: every intermediate state $S_t \in \mathbb{R}^{d \times d}$ must be stored, resulting in $O(N d^2)$ memory usage. The authors alleviate this by recomputing the states on the fly during back-propagation.
- Poor training parallelism: the update is an element-wise sequential recurrence rather than a few large matrix multiplications, which under-utilizes GPU tensor cores.
A compromise is the chunkwise algorithm proposed in Transformer Quality in Linear Time: split the sequence into chunks of size $C$, carry the state recurrently across chunk boundaries, and handle within-chunk interactions with masked matrix multiplications. This recovers parallelism while remaining linear in $N$.

From top to bottom: parallel form, recurrent form, chunkwise parallel form:

$$O = \left((Q K^\top) \odot M\right) V$$

$$S_t = S_{t-1} + k_t^\top v_t, \qquad o_t = q_t S_t$$

$$S_{[i+1]} = S_{[i]} + K_{[i]}^\top V_{[i]}, \qquad O_{[i]} = Q_{[i]} S_{[i]} + \left((Q_{[i]} K_{[i]}^\top) \odot M\right) V_{[i]}$$

Here $Q_{[i]}, K_{[i]}, V_{[i]} \in \mathbb{R}^{C \times d}$ are the rows of chunk $i$ and $S_{[i]}$ is the state at its left boundary.
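A minimal chunkwise sketch under the same assumed conventions (ungated linear attention, toy shapes):

```python
import torch

def linear_attention_chunkwise(Q, K, V, C=4):
    """Chunkwise-parallel causal linear attention: recurrent across chunks,
    masked matrix multiplications within each chunk."""
    N, d = Q.shape
    S = torch.zeros(d, V.shape[1])                    # state carried across chunks
    mask = torch.tril(torch.ones(C, C))               # intra-chunk causal mask
    outputs = []
    for start in range(0, N, C):
        q, k, v = Q[start:start+C], K[start:start+C], V[start:start+C]
        m = mask[:q.shape[0], :q.shape[0]]            # handle a short final chunk
        inter = q @ S                                 # contribution from earlier chunks
        intra = ((q @ k.T) * m) @ v                   # causal contribution within the chunk
        outputs.append(inter + intra)
        S = S + k.T @ v                               # update state for the next chunk
    return torch.cat(outputs, dim=0)

N, d = 16, 8
Q, K, V = (torch.randn(N, d) for _ in range(3))
M = torch.tril(torch.ones(N, N))
print(torch.allclose(linear_attention_chunkwise(Q, K, V),
                     ((Q @ K.T) * M) @ V, atol=1e-4))
```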
Gated Linear Attention
A learnable 2D forget gate $G_t \in (0,1)^{d \times d}$ generalizes the recurrence to

$$S_t = G_t \odot S_{t-1} + k_t^\top v_t, \qquad o_t = q_t S_t.$$

This is very general and encompasses many recent RNNs with 2D hidden states: for example, $G_t = 1$ recovers vanilla linear attention, a constant scalar decay $G_t = \gamma$ gives RetNet-style retention, and a data-dependent outer-product gate gives GLA.
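A generic recurrent sketch of this gated update; the random gate tensor here is only a stand-in for whatever parameterization a particular model uses.

```python
import torch

def gated_linear_attention_recurrent(Q, K, V, G):
    """Gated update: S_t = G_t * S_{t-1} + k_t^T v_t,  o_t = q_t S_t.
    G has shape (N, d_k, d_v): one 2D forget gate per time step."""
    N, d_k = Q.shape
    S = torch.zeros(d_k, V.shape[1])
    outputs = []
    for t in range(N):
        S = G[t] * S + K[t].unsqueeze(1) @ V[t].unsqueeze(0)
        outputs.append(Q[t].unsqueeze(0) @ S)
    return torch.cat(outputs, dim=0)

N, d = 16, 8
Q, K, V = (torch.randn(N, d) for _ in range(3))
G = torch.sigmoid(torch.randn(N, d, d))          # placeholder data-dependent 2D gates
print(gated_linear_attention_recurrent(Q, K, V, G).shape)   # torch.Size([16, 8])
```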
Gated Linear Attention Transformers with Hardware-Efficient Training proposes GLA, which uses an outer-product gate $G_t = \alpha_t^\top \mathbf{1}$ with a data-dependent decay $\alpha_t \in (0,1)^{1 \times d}$:

- Recurrent form:

$$S_t = \mathrm{Diag}(\alpha_t)\, S_{t-1} + k_t^\top v_t, \qquad o_t = q_t S_t$$

- Parallel form: with cumulative decays $b_t = \prod_{j \le t} \alpha_j$ stacked into $B$,

$$O = \Big(\big((Q \odot B)\,(K / B)^\top\big) \odot M\Big) V$$

- Chunkwise parallel form: as before, the state is carried recurrently across chunks while the intra-chunk part uses masked matrix multiplications with the decays folded into the chunk's queries and keys; this is the form the paper's hardware-efficient training algorithm builds on.
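A recurrent sketch of GLA under the same assumed single-head conventions; the sigmoid gate projection `W_alpha` is a hypothetical stand-in for the paper's gate parameterization.

```python
import torch

def gla_recurrent(Q, K, V, alpha):
    """GLA recurrence: S_t = Diag(alpha_t) S_{t-1} + k_t^T v_t,  o_t = q_t S_t.
    alpha has shape (N, d_k) with entries in (0, 1)."""
    N, d_k = Q.shape
    S = torch.zeros(d_k, V.shape[1])
    outputs = []
    for t in range(N):
        S = alpha[t].unsqueeze(1) * S + K[t].unsqueeze(1) @ V[t].unsqueeze(0)
        outputs.append(Q[t].unsqueeze(0) @ S)
    return torch.cat(outputs, dim=0)

N, d = 16, 8
Q, K, V, X = (torch.randn(N, d) for _ in range(4))
W_alpha = torch.randn(d, d)                  # hypothetical gate projection
alpha = torch.sigmoid(X @ W_alpha)           # data-dependent decay in (0, 1)
print(gla_recurrent(Q, K, V, alpha).shape)   # torch.Size([16, 8])
```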
References
Sonta. "Zhihu answer." https://www.zhihu.com/question/9740764576/answer/80735153803
Katharopoulos A., Vyas A., Pappas N., Fleuret F. "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." https://arxiv.org/abs/2006.16236
Hua W., Dai Z., Liu H., Le Q.V. "Transformer Quality in Linear Time." https://arxiv.org/abs/2202.10447
Yang S., Wang B., et al. "Gated Linear Attention Transformers with Hardware-Efficient Training." https://arxiv.org/abs/2312.06635