Rotary Positional Encoding

Why do we need Positional Encoding

In the Transformer model, we apply self-attention to compute the relationships between tokens.

But the attention mechanism does not take the position of each token into account.

For example, suppose a single sentence contains two "Jane" tokens.

Based on human knowledge, the first "Jane" is closely related to the "went to the movie theater" part, and the second "Jane" is related to the "go back to her home" part.

But with self-attention alone, without positional encoding, "Jane" attends equally to both meanings.

If we embed positional information before the attention step, we can make each "Jane" attend to its nearby context. Since the first "Jane" is close to "went to the movie theater", it attends to that meaning, and the second "Jane" attends to "go back to her home".

How to Embed Positional Information

In the paper "Attention is All you Need", it uses Sinusoidal Positional Encoding. It adds sinusoidal value to the embedding vector to encode position information.

Please look at the paper to understand how it is used in the Transformer.
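As a rough illustration (my own sketch, not code from the paper), the sinusoidal encoding $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})$, $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})$ can be computed like this:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    exponent = np.arange(0, d_model, 2) / d_model    # 2i / d_model for each dimension pair
    angle = pos / np.power(10000.0, exponent)        # (max_len, d_model // 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# The encoding is simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```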

Now we can embed position information by slightly changing the embedding vector. Let's look at Rotary Positional Encoding (RoPE), an alternative to sinusoidal positional encoding.

Rotary Positional Encoding(RoPE)

An embedding vector is a vector in a d_model-dimensional space. In RoPE, we don't add values to the vector; instead, we rotate it.

RoPE has many interesting features. To deeply understand RoPE, we have to understand how this encoding scheme was derived.

RoPE in 2D case

To rotate an embedding vector, we multiply it by a rotation matrix:

$$\begin{bmatrix} \cos{m\theta} & -\sin{m\theta} \\ \sin{m\theta} & \cos{m\theta} \end{bmatrix}$$

For example, if we multiply it by an arbitrary vector $(r\cos{\alpha}, r\sin{\alpha})$, we get the rotated vector.

$$\begin{bmatrix} \cos{m\theta} & -\sin{m\theta} \\ \sin{m\theta} & \cos{m\theta} \end{bmatrix} \begin{bmatrix} r\cos{\alpha} \\ r\sin{\alpha} \end{bmatrix} = \begin{bmatrix} r\cos{(\alpha + m\theta)} \\ r\sin{(\alpha + m\theta)} \end{bmatrix}$$

The amazing part comes from combining RoPE with dot-product attention. Here $f_{q,k}(x_m, m)$ denotes the query/key vector obtained from the embedding vector $x_m$ with RoPE applied.

$$f_{q,k}(x_m, m)=\begin{bmatrix} \cos{m\theta} & -\sin{m\theta} \\ \sin{m\theta} & \cos{m\theta} \end{bmatrix} \begin{bmatrix} W_{q,k}^{(11)} & W_{q,k}^{(12)} \\ W_{q,k}^{(21)} & W_{q,k}^{(22)} \end{bmatrix}\begin{bmatrix} x_m^{(1)} \\ x_m^{(2)} \end{bmatrix}$$

If we compute the attention score, we get the following equation. This means that RoPE provides a relative position embedding: the score depends on the positions only through $n - m$.

$$q_m^T k_n = f_q(x_m,m)^T f_k(x_n,n)$$
$$= \left( \begin{bmatrix} \cos{m\theta} & -\sin{m\theta} \\ \sin{m\theta} & \cos{m\theta} \end{bmatrix} \begin{bmatrix} W_{q}^{(11)} & W_{q}^{(12)} \\ W_{q}^{(21)} & W_{q}^{(22)} \end{bmatrix}\begin{bmatrix} x_m^{(1)} \\ x_m^{(2)} \end{bmatrix} \right)^T \left( \begin{bmatrix} \cos{n\theta} & -\sin{n\theta} \\ \sin{n\theta} & \cos{n\theta} \end{bmatrix} \begin{bmatrix} W_{k}^{(11)} & W_{k}^{(12)} \\ W_{k}^{(21)} & W_{k}^{(22)} \end{bmatrix}\begin{bmatrix} x_n^{(1)} \\ x_n^{(2)} \end{bmatrix} \right)$$
$$= x_m^T W_q^T \begin{bmatrix} \cos{(n-m)\theta} & -\sin{(n-m)\theta} \\ \sin{(n-m)\theta} & \cos{(n-m)\theta} \end{bmatrix} W_k x_n$$
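As a quick sanity check of this property (my own sketch, not code from the paper), we can verify numerically that the rotated dot product equals $x_m^T W_q^T R_{(n-m)\theta} W_k x_n$, i.e. the score depends on the positions only through $n - m$:

```python
import numpy as np

def rot(angle: float) -> np.ndarray:
    """2D rotation matrix."""
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

rng = np.random.default_rng(0)
theta = 0.3                                   # assumed base angle
W_q, W_k = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
x_m, x_n = rng.normal(size=2), rng.normal(size=2)
m, n = 5, 9                                   # arbitrary token positions

q_m = rot(m * theta) @ W_q @ x_m              # f_q(x_m, m)
k_n = rot(n * theta) @ W_k @ x_n              # f_k(x_n, n)

lhs = q_m @ k_n                               # attention score with RoPE
rhs = x_m @ W_q.T @ rot((n - m) * theta) @ W_k @ x_n
print(np.isclose(lhs, rhs))                   # True: only n - m matters
```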

RoPE in General Form

We can expand the 2D form of RoPE to a general $d$-dimensional embedding by splitting the vector into $d/2$ pairs of dimensions and rotating each pair by its own angle $m\theta_i$, as sketched below.
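Here is a minimal sketch of how this is commonly implemented (assuming the usual convention $\theta_i = 10000^{-2i/d}$ and consecutive dimension pairs; this is my own illustration, not the paper's code):

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """Rotate each consecutive pair of dimensions of x by position * theta_i.

    x: (seq_len, d) query or key vectors, d assumed even.
    positions: (seq_len,) positions m.
    """
    seq_len, d = x.shape
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)     # (d/2,) per-pair base angles
    angles = positions[:, None] * theta[None, :]     # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # the two coordinates of each pair
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2D rotation applied pair-wise
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Usage sketch: q_rot = apply_rope(Q, np.arange(seq_len)); k_rot = apply_rope(K, np.arange(seq_len))
```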

The main purpose of RoPE is that it can be used with the Linear Transformer.

If you are not familiar with the Linear Transformer, please look at two articles I wrote: SVM and Linear Transformer.

In the SVM post, you can skip the first part and focus on the kernel function.

In the Linear Transformer, we can rewrite the attention method as follows:
shape of $Q, K$: (seq_len, d_model)
shape of $V$: (seq_len, d_model // n_head)
$Q_m$ means the $m$-th row of matrix $Q$
$\phi(\cdot)$ is a row-wise kernel function

$$V_m' = \frac{\sum_{n}^{N}\phi(Q_m)^T\phi(K_n)V_n}{\sum_{n}^{N}\phi(Q_m)^T\phi(K_n)} \quad (1)$$
$$= \frac{\phi(Q_m)^T\sum_{n}^{N}\phi(K_n)V_n}{\phi(Q_m)^T\sum_{n}^{N}\phi(K_n)} \quad (2)$$
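For concreteness, here is a rough sketch of (non-causal) linear attention implementing equation (2); the feature map $\phi(x) = \mathrm{elu}(x) + 1$ follows the Linear Transformer paper, and both key sums are computed once and shared by every query row:

```python
import numpy as np

def phi(x: np.ndarray) -> np.ndarray:
    """Kernel feature map from the Linear Transformer paper: elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Q, K: (N, d_k), V: (N, d_v). Equation (2), with the two key sums cached."""
    Qf, Kf = phi(Q), phi(K)
    S = Kf.T @ V                          # sum_n phi(K_n) V_n^T -> (d_k, d_v), computed once
    z = Kf.sum(axis=0)                    # sum_n phi(K_n)       -> (d_k,),     computed once
    return (Qf @ S) / (Qf @ z)[:, None]   # V'_m for every m reuses S and z
```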

In the Linear Transformer, we reuse the terms $\sum_{n}^{N}\phi(K_n)$ and $\sum_{n}^{N}\phi(K_n)V_n$. But once these sums are precomputed, it is hard to account for every positional relationship $(m, n=0), (m, n=1), (m, n=2), \dots$, because the per-position information has already been collapsed into $\sum_{n}^{N}\phi(K_n)$ and $\sum_{n}^{N}\phi(K_n)V_n$.

So we use RoPE! Unlike the sinusoidal embedding, we apply the positional encoding after the kernel computation:

$$\phi(Q_m) \to R_{\theta,m}^d\,\phi(Q_m) \qquad \phi(K_n) \to R_{\theta,n}^d\,\phi(K_n)$$

Then we can rewrite equation (1) as follows:

$$V_m' = \frac{\sum_{n}^{N} \left(R_{\theta,m}^d\,\phi(Q_m)\right)^T R_{\theta,n}^d\,\phi(K_n)V_n}{\sum_{n}^{N} \phi(Q_m)^T\phi(K_n)} = \frac{\phi(Q_m)^T \sum_{n}^{N}R_{\theta,n-m}\,\phi(K_n)V_n}{\phi(Q_m)^T \sum_{n}^{N}\phi(K_n)}$$

You might think that we can no longer reuse the $\sum_{n}^{N}\phi(K_n)V_n$ term. But this is not true: we can cache the rotated sum $\sum_{n}^{N}R_{\theta,n}^d\,\phi(K_n)V_n$ once, and as $m$ changes we only multiply it by the rotated query $\left(R_{\theta,m}^d\,\phi(Q_m)\right)^T$; the cached sum itself never changes. A rotation also doesn't change the norm of the term, so we can still reuse it!
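Concretely, a sketch of the numerator (reusing the hypothetical `apply_rope` and `phi` helpers from the earlier snippets, which are my own illustrations): the rotated key sum is cached once, and every query position only rotates its own query features. The denominator is left unrotated, exactly as in the equation above.

```python
import numpy as np

def rope_linear_attention_numerator(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Numerator of V'_m for all m at once, reusing one cached rotated key sum.

    Q, K: (N, d_k) with d_k even, V: (N, d_v).
    """
    pos = np.arange(Q.shape[0])
    Qr = apply_rope(phi(Q), pos)   # R_{theta,m} phi(Q_m) for every m
    Kr = apply_rope(phi(K), pos)   # R_{theta,n} phi(K_n) for every n
    S = Kr.T @ V                   # cached once: sum_n R_{theta,n} phi(K_n) V_n^T
    return Qr @ S                  # row m equals (R_{theta,m} phi(Q_m))^T S
```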

References

[1] RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864

[2] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. https://arxiv.org/pdf/2006.16236

[3] Rotary Embeddings: A Relative Revolution (EleutherAI blog). https://blog.eleuther.ai/rotary-embeddings/