Attention Is All You Need

Summary

If somebody asked me to explain the attention method to a non-CS person, I would explain it as follows:

The purpose of attention is to convert a sentence into a mathematical format that embeds the relationships between words.

Attention method

The attention method is the core of the Transformer. In this section, we are going to take a closer look at what the attention method is.

Example 1

We have 6 students in a class, and we know the test scores of 5 of them. How can we guess the last student's score?

88, 67, 90, 100, 98, __

We can simply assume the last student's score will be the mean of the other 5 students' scores.

$$score = \frac{1}{5}\cdot 88 + \frac{1}{5}\cdot 67 + \frac{1}{5}\cdot 90 + \frac{1}{5}\cdot 100 + \frac{1}{5}\cdot 98$$

We estimated the last student's score by weighting each of the other students' scores by $\frac{1}{5}$ and summing them.

Example 2

Continuing from Example 1, let's say student 6 always studied with students 4 and 5. Then we can expect student 6's score to be closer to students 4 and 5's scores. We can change the equation as follows:

$$score = \frac{1}{10}\cdot 88 + \frac{1}{10}\cdot 67 + \frac{1}{10}\cdot 90 + \frac{7}{20}\cdot 100 + \frac{7}{20}\cdot 98$$

We put more weight on student 4's and student 5's scores.
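
As a quick sanity check, here is a small Python sketch (the variable names are mine, not from the post) that computes both the uniform average from Example 1 and the weighted version from Example 2:

```python
scores = [88, 67, 90, 100, 98]

# Example 1: every known score gets the same weight of 1/5
uniform_weights = [1/5] * 5
guess_uniform = sum(w * s for w, s in zip(uniform_weights, scores))

# Example 2: students 4 and 5 get larger weights (the weights still sum to 1)
biased_weights = [1/10, 1/10, 1/10, 7/20, 7/20]
guess_biased = sum(w * s for w, s in zip(biased_weights, scores))

print(guess_uniform)  # 88.6
print(guess_biased)   # 93.8
```

Note that both sets of weights sum to 1; attention weights behave the same way.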

Why do we need the attention method?

Let's say we are trying to convert the following sentence into a matrix:

"Jake had a walk with his cute dog"

Assume that we can convert each word into an embedding vector. Would just concatenating all the vectors into a matrix be enough? No.

"Jake had a walk with his cute dog" vs "Jake didn't had a walk with his dog"

Both sentences contain the word dog, but it carries different information: in the first sentence, the dog is cute and had a walk with Jake. In the second sentence, we don't know if the dog is cute, and the dog didn't have a walk with Jake.

Even the same word can carry a different meaning depending on its context, so just concatenating the embeddings isn't enough.

How does the attention method work?

The examples above are closely related to the attention method.

"Jake had a walk with his cute dog"

Let's say we are trying to convert the word "dog" into a vector. "Dog" is closely related to some words in the sentence: "dog" (of course, it is closely related to itself), "cute", "walk", and "Jake".

The attention method gives weights to these closely related words and sums their vectors. We can express the final vector of "dog" with the following equation:

$$A\cdot(\text{vector of "Jake"}) + B\cdot(\text{vector of "walk"}) + C\cdot(\text{vector of "cute"}) + D\cdot(\text{vector of "dog"}) \quad (1)$$
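
To make equation (1) concrete, here is a hedged sketch: the embeddings and weights below are made-up numbers purely for illustration, and the weights $A, B, C, D$ are assumed to be given for now.

```python
import numpy as np

# Toy 4-dimensional embeddings, invented purely for illustration
embeddings = {
    "Jake": np.array([0.1, 0.3, 0.0, 0.5]),
    "walk": np.array([0.7, 0.1, 0.2, 0.0]),
    "cute": np.array([0.0, 0.9, 0.4, 0.1]),
    "dog":  np.array([0.2, 0.2, 0.8, 0.3]),
}

# Assumed attention weights A, B, C, D from equation (1); they sum to 1
weights = {"Jake": 0.20, "walk": 0.20, "cute": 0.25, "dog": 0.35}

# Contextualized vector for "dog": weighted sum of the related word vectors
dog_vector = sum(weights[word] * embeddings[word] for word in embeddings)
print(dog_vector)
```

The next section explains where these weights come from.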

How to get the attention weights?

We discussed that the attention method sums other words' vectors multiplied by weights. So how can we get the weights?

Key, Query, Value

Before discussing how to get the weights, we should look at some terms. Let's say we want to calculate $A$ in equation (1). We have to calculate how much "Jake" and "dog" are related.

Assume there is a function $relationship(key, query)$ that calculates the relationship between two words.

Here, "dog" is the Key and "Jake" is the Query. Once we have the weight, we multiply it by the vector of "Jake", which is the Value.

$$relationship(key, query)$$

In the paper [1], this relationship is computed with a dot product followed by a softmax: simply take the dot product of the two vectors (key and query), then apply a softmax so that the weights sum to 1. (In equation (1), the sum of $A, B, C, D$ should be 1.)

Please read the paper for more information.
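
To tie the pieces together, here is a minimal NumPy sketch of this dot-product-with-softmax idea. It is a simplified, single-head version: each word's query is dotted with every word's key, the softmax turns those scores into weights that sum to 1, and the output is the weighted sum of the value vectors. (The paper additionally scales the scores by $\frac{1}{\sqrt{d_k}}$ and uses learned projection matrices to produce $Q$, $K$, $V$; those details are omitted here.)

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # relationship(key, query) for every pair of words: plain dot products
    scores = queries @ keys.T                 # shape: (num_words, num_words)
    weights = softmax(scores, axis=-1)        # each row sums to 1 (the A, B, C, D, ...)
    return weights @ values                   # weighted sum of the value vectors

# Toy example: 8 words ("Jake had a walk with his cute dog"), 4-dim embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
out = attention(x, x, x)                      # simplest case: Q = K = V = the embeddings
print(out.shape)                              # (8, 4): one contextualized vector per word
```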

References



[1] Vaswani et al., "Attention Is All You Need" (2017). https://arxiv.org/abs/1706.03762
[2] https://www.youtube.com/watch?v=wjZofJX0v4M