Attention Is All You Need
Summary
If somebody asked me to explain the attention method to people outside of CS, I would explain it as follows:
The purpose of attention is to convert a sentence into a mathematical representation that embeds the relationships between the words.
Attention method
The attention method is the core of the Transformer. In this section, we are going to take a closer look at what the attention method is.
Example 1
We have 6 students in a class, and we know the test scores of 5 of them. How can we guess the last student's score?
We can simply assume that the last student's score will be the mean of the other 5 students' scores.
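In equation form, writing $\text{score}_i$ for student $i$'s score:

$$\text{score}_6 = \frac{\text{score}_1 + \text{score}_2 + \text{score}_3 + \text{score}_4 + \text{score}_5}{5} = \sum_{i=1}^{5}\frac{1}{5}\,\text{score}_i$$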
Notice what we did: we got the last student's score by giving a weight (here, $\frac{1}{5}$ each) to every other student's score and summing them up.
Example 2
Continuing from Example 1, let's say student 6 always studied with students 4 and 5. Then we can expect student 6's score to be closer to the scores of students 4 and 5. We can change the equation as follows:
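For example (these particular weights are made up for illustration; the point is only that students 4 and 5 get larger weights while the weights still sum to 1):

$$\text{score}_6 = 0.1\,\text{score}_1 + 0.1\,\text{score}_2 + 0.1\,\text{score}_3 + 0.35\,\text{score}_4 + 0.35\,\text{score}_5$$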
Here, we gave more weight to the scores of students 4 and 5.
Why do we need Attention method?
Let's say we are trying to convert the following sentence into a matrix:
"Jake had a walk with his cute dog"
Assume that we can convert each word into an embedding vector. Would simply concatenating all the vectors into a matrix be enough? No.
"Jake had a walk with his cute dog" vs "Jake didn't had a walk with his dog"
Both sentences contain the word "dog", but it carries different information: in the first sentence, the dog is cute and had a walk with Jake; in the second, we don't know whether the dog is cute, and the dog didn't have a walk with Jake.
So even the same word can mean different things depending on its context; simply concatenating context-independent vectors isn't enough.
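To see this concretely, here is a tiny sketch with a hypothetical, hand-made embedding table (all numbers are invented): a context-independent lookup returns the exact same vector for "dog" in both sentences.

```python
# Toy static (context-independent) embedding table; all values are made up.
embedding = {
    "Jake": [0.9, 0.1], "had": [0.2, 0.3], "have": [0.2, 0.3], "a": [0.1, 0.1],
    "walk": [0.5, 0.7], "with": [0.2, 0.2], "his": [0.3, 0.1],
    "cute": [0.8, 0.6], "dog": [0.4, 0.9], "didn't": [0.1, 0.8],
}

sentence1 = "Jake had a walk with his cute dog".split()
sentence2 = "Jake didn't have a walk with his dog".split()

# Concatenating per-word vectors gives "dog" the exact same vector in both
# sentences: the lookup has no way to see the surrounding words.
matrix1 = [embedding[w] for w in sentence1]
matrix2 = [embedding[w] for w in sentence2]
print(matrix1[-1] == matrix2[-1])  # True
```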
How does the attention method work?
The examples above are closely related to the attention method.
"Jake had a walk with his cute dog"
Let's say we are trying to convert the word "dog" into a vector. "dog" is closely related to several words in the sentence: "dog" (of course, it is closely related to itself), "cute", "walk", and "Jake".
The attention method gives weights to the closely related words and sums them up! We can express the final vector of "dog" with the following equation:
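Writing $v_x$ for the embedding vector of the word $x$ and $w_x$ for its attention weight (notation chosen here for illustration):

$$v'_{\text{dog}} = w_{\text{dog}}\,v_{\text{dog}} + w_{\text{cute}}\,v_{\text{cute}} + w_{\text{walk}}\,v_{\text{walk}} + w_{\text{Jake}}\,v_{\text{Jake}} + \cdots \tag{1}$$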
How to get the attention weights?
We discussed that the attention method sums the other words' vectors, each multiplied by a weight. Then how can we get these weights?
Key, Query, Value
Before discussing how to get the weights, we should look at some terms. Let's say we want to calculate the weight $w_{\text{Jake}}$ in equation (1). We have to calculate how strongly "Jake" and "dog" are related.
Assume there is a function $f$ that calculates the relationship between two words, so that $w_{\text{Jake}} = f(v_{\text{dog}}, v_{\text{Jake}})$.
At this point, "dog" is the Query (the word we are currently encoding), and "Jake" is the Key (the word we compare it against). Once we have the weight, we have to multiply it with $v_{\text{Jake}}$, which is the Value.
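As a minimal, runnable sketch (toy 4-dimensional embeddings with made-up random values; the real Transformer first maps each embedding to separate query/key/value vectors through learned projection matrices, which are omitted here), the paper's scaled dot-product attention computes all the weights at once with a softmax:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in the paper."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relatedness of every Query-Key pair
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted sums of the Values

# Toy example: one made-up 4-dim embedding per word of our sentence.
rng = np.random.default_rng(0)
words = ["Jake", "had", "a", "walk", "with", "his", "cute", "dog"]
E = rng.normal(size=(len(words), 4))

# Self-attention: every word acts as a Query against all words as Keys/Values.
new_vectors, weights = scaled_dot_product_attention(E, E, E)
print("attention weights for 'dog':", np.round(weights[words.index("dog")], 2))
```

The row of `weights` for "dog" plays the role of $w_{\text{dog}}, w_{\text{cute}}, \ldots$ in equation (1), and `new_vectors[-1]` is $v'_{\text{dog}}$; with trained embeddings and projections, these weights would concentrate on the genuinely related words rather than being near-uniform.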
Please read the paper for more information.
References
Vaswani, A. et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30. https://arxiv.org/abs/1706.03762