Why do we need Positional Encoding
In the Transformer model, we apply self-attention to calculate the relationships between tokens.
But attention does not account for the position of each token.
For example, in the figure above, there are two tokens of "Jane" in a single sentence.
Based on human knowledge, the first "Jane" is closely related to the "went to the movie theater" part, and the second "Jane" is related to "go back to her home".
But with self-attention and no positional encoding, each "Jane" will attend equally to both meanings.
If we embed positional information before attention, we can make each "Jane" attend to its nearby context. Since the first "Jane" is close to "went to the movie theater", it will attend to that meaning, and the second "Jane" will attend to "go back to her home".
In the paper "Attention is All you Need", it uses Sinusoidal Positional Encoding
. It adds sinusoidal value to the embedding vector to encode position information.
Please look at the paper to understand how it is used in the Transformer.
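As a rough illustration (not the paper's reference code), here is a minimal NumPy sketch of the sinusoidal encoding, where $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{\text{model}}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{\text{model}}})$; the function name and the toy sizes are my own:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...). Assumes even d_model."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even indices get sine
    pe[:, 1::2] = np.cos(angles)                       # odd indices get cosine
    return pe

# The encoding is simply added to the token embeddings before attention.
embeddings = np.random.randn(16, 64)                   # (seq_len, d_model), dummy values
embeddings = embeddings + sinusoidal_positional_encoding(16, 64)
```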
Now we can embed position information by slightly changing the embedding vector. Let's look at Rotary Positional Encoding (RoPE), an alternative to Sinusoidal Positional Encoding.
Rotary Positional Encoding (RoPE)
An embedding vector is a vector in a $d_{\text{model}}$-dimensional space. In RoPE, we don't add values to the vector; instead, we rotate it.
RoPE has many interesting features. To deeply understand RoPE, we have to understand how this encoding scheme was derived.
RoPE in the 2D Case
To rotate an embedding vector, we multiply it by a rotation matrix.
$$
\begin{bmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{bmatrix}
$$

For example, if we multiply it with an arbitrary vector $(r\cos\alpha, r\sin\alpha)$, we get the rotated vector.
$$
\begin{bmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{bmatrix}
\begin{bmatrix} r\cos\alpha \\ r\sin\alpha \end{bmatrix}
=
\begin{bmatrix} r\cos(\alpha + m\theta) \\ r\sin(\alpha + m\theta) \end{bmatrix}
$$

The amazing part is what happens when we combine RoPE with dot-product attention. $f_{q,k}(x_m, m)$ denotes the query/key vector computed from the embedding vector $x_m$ at position $m$ with RoPE applied.
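As a quick numerical check (with arbitrary values picked for $r$, $\alpha$, $m$, and $\theta$), the rotation matrix shifts the angle by $m\theta$ while keeping the norm:

```python
import numpy as np

r, alpha, m, theta = 2.0, 0.3, 5, 0.1                  # arbitrary example values
R = np.array([[np.cos(m * theta), -np.sin(m * theta)],
              [np.sin(m * theta),  np.cos(m * theta)]])
v = np.array([r * np.cos(alpha), r * np.sin(alpha)])

rotated = R @ v
expected = np.array([r * np.cos(alpha + m * theta), r * np.sin(alpha + m * theta)])
assert np.allclose(rotated, expected)                  # angle shifted by m*theta, norm unchanged
```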
$$
f_{q,k}(x_m, m) =
\begin{bmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{bmatrix}
\begin{bmatrix} W_{q,k}^{(11)} & W_{q,k}^{(12)} \\ W_{q,k}^{(21)} & W_{q,k}^{(22)} \end{bmatrix}
\begin{bmatrix} x_m^{(1)} \\ x_m^{(2)} \end{bmatrix}
$$

If we calculate the attention weight, we get the following equation. This means that RoPE provides a Relative Position Embedding.
$$
q_m^T k_n = f_q(x_m, m)^T f_k(x_n, n)
$$
$$
= \left(
\begin{bmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{bmatrix}
\begin{bmatrix} W_q^{(11)} & W_q^{(12)} \\ W_q^{(21)} & W_q^{(22)} \end{bmatrix}
\begin{bmatrix} x_m^{(1)} \\ x_m^{(2)} \end{bmatrix}
\right)^T
\left(
\begin{bmatrix} \cos n\theta & -\sin n\theta \\ \sin n\theta & \cos n\theta \end{bmatrix}
\begin{bmatrix} W_k^{(11)} & W_k^{(12)} \\ W_k^{(21)} & W_k^{(22)} \end{bmatrix}
\begin{bmatrix} x_n^{(1)} \\ x_n^{(2)} \end{bmatrix}
\right)
$$
$$
= x_m^T W_q^T
\begin{bmatrix} \cos(n-m)\theta & -\sin(n-m)\theta \\ \sin(n-m)\theta & \cos(n-m)\theta \end{bmatrix}
W_k x_n
$$

The attention score depends only on the offset $n - m$, not on the absolute positions. We can simply extend the 2D form of RoPE to higher dimensions by splitting the vector into 2D pairs and rotating each pair with its own angle.
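Below is a minimal sketch of this multi-dimensional extension, assuming the common frequency choice $\theta_i = 10000^{-2i/d}$ from the RoPE paper; the helper name `rope_rotate` and the toy dimensions are illustrative. The final assert checks the relative-position property: the score between positions $m$ and $n$ depends only on the offset $n - m$.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE to vector x at position `pos`: rotate each 2D pair by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)          # one frequency per 2D pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                          # split into pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# The score between positions m and n only depends on the offset n - m.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
score_a = rope_rotate(q, 3) @ rope_rotate(k, 7)        # m=3, n=7, offset 4
score_b = rope_rotate(q, 10) @ rope_rotate(k, 14)      # m=10, n=14, same offset 4
assert np.isclose(score_a, score_b)
```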
The main purpose of RoPE is to use it with the Linear Transformer.
If you are not familiar with the Linear Transformer, please look at two articles I wrote:
SVM, Linear Transformer
In the SVM post, you can skip the front part and focus on the kernel function.
In the Linear Transformer, we can rewrite the attention computation as follows:
shape of $Q, K$: (seq_len, d_model)
shape of $V$: (seq_len, d_model // n_head)
$Q_m$ means the $m$-th row of matrix $Q$
$\phi(\cdot)$ is a row-wise kernel function
$$
V'_m = \frac{\sum_n^N \phi(Q_m)^T \phi(K_n)\, V_n}{\sum_n^N \phi(Q_m)^T \phi(K_n)} \quad \text{...(1)}
$$
$$
V'_m = \frac{\phi(Q_m)^T \sum_n^N \phi(K_n)\, V_n}{\phi(Q_m)^T \sum_n^N \phi(K_n)} \quad \text{...(2)}
$$

In the Linear Transformer, we reuse the $\sum_n^N \phi(K_n)$ and $\sum_n^N \phi(K_n) V_n$ terms. But once we reuse these terms, it is hard to account for each position pair $(m, n=0), (m, n=1), (m, n=2), \ldots$, because $\sum_n^N \phi(K_n)$ and $\sum_n^N \phi(K_n) V_n$ are already computed and shared across all query positions (see the sketch below).
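Here is a rough sketch of that reuse, assuming the kernel $\phi(x) = \text{elu}(x) + 1$ used in the Linear Transformer paper; the shapes and variable names are toy examples. Note that the reused key/value term is written as an outer product $\phi(K_n) V_n^T$ so the matrix shapes work out.

```python
import numpy as np

def phi(x):
    """Row-wise kernel feature map phi(x) = elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

seq_len, d_model, d_v = 16, 32, 8                       # toy sizes
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((seq_len, d_model)), rng.standard_normal((seq_len, d_model))
V = rng.standard_normal((seq_len, d_v))

phi_Q, phi_K = phi(Q), phi(K)

# These two sums over n are computed once and reused for every query position m.
KV_sum = phi_K.T @ V                                    # sum_n phi(K_n) V_n^T, shape (d_model, d_v)
K_sum = phi_K.sum(axis=0)                               # sum_n phi(K_n),       shape (d_model,)

V_out = (phi_Q @ KV_sum) / (phi_Q @ K_sum)[:, None]     # equation (2) for all m at once
```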
So we use RoPE! Unlike the sinusoidal embedding, we apply the positional encoding after the kernel calculation.
$$
\phi(Q_m) \rightarrow R^d_{\theta,m}\,\phi(Q_m), \qquad \phi(K_n) \rightarrow R^d_{\theta,n}\,\phi(K_n)
$$

Then we can rewrite equation (1) as follows:
$$
V'_m = \frac{\sum_n^N \left(R^d_{\theta,m}\,\phi(Q_m)\right)^T R^d_{\theta,n}\,\phi(K_n)\, V_n}{\sum_n^N \phi(Q_m)^T \phi(K_n)}
= \frac{\phi(Q_m)^T \sum_n^N R_{\theta,n-m}\,\phi(K_n)\, V_n}{\phi(Q_m)^T \sum_n^N \phi(K_n)}
$$

You might think that we can no longer reuse the $\sum_n^N \phi(K_n) V_n$ term. But this is not true. Since $R_{\theta,n-m} = (R^d_{\theta,m})^T R^d_{\theta,n}$, we can precompute $\sum_n^N R^d_{\theta,n}\,\phi(K_n) V_n$ once; for each query position $m$ we only rotate this precomputed term by $(R^d_{\theta,m})^T$. The norm of the term doesn't change, it just rotates as $m$ changes, so we can still reuse it!
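The sketch below illustrates why the reuse still works; it redefines the hypothetical `phi` and `rope_rotate` helpers from the earlier sketches so it stands alone. The rotated key/value sum is precomputed once, independent of the query position $m$, and the per-query work is just one rotation of $\phi(Q_m)$.

```python
import numpy as np

def phi(x):
    """Kernel feature map phi(x) = elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def rope_rotate(x, pos, base=10000.0):
    """Rotate each 2D pair of x by pos * theta_i (same sketch as above)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

seq_len, d_model, d_v = 16, 8, 4
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((seq_len, d_model)), rng.standard_normal((seq_len, d_model))
V = rng.standard_normal((seq_len, d_v))

# Precompute the rotated key/value sum ONCE; it does not depend on the query position m.
S = sum(np.outer(rope_rotate(phi(K[n]), n), V[n]) for n in range(seq_len))

# Numerator for any query position m: rotate phi(Q_m) and multiply with the precomputed sum.
m = 5
fast = rope_rotate(phi(Q[m]), m) @ S
slow = sum((rope_rotate(phi(Q[m]), m) @ rope_rotate(phi(K[n]), n)) * V[n] for n in range(seq_len))
assert np.allclose(fast, slow)                          # same numerator, without re-summing per m
```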
References