Why do we need a mask in Transformer

I was working on implementing a Transformer model, and I wondered why we need masks in the Transformer. In this post, I will talk about why the Transformer needs two types of masks: the padding mask and the look-ahead mask.

Padding Mask

Let's assume each word maps to exactly one embedding vector. (In reality, one word can be split into several tokens and therefore several embedding vectors.)

Padding token

"I am jinho choi" "I want to work at NVIDA"

These sentences have different lengths. However, since the model expects inputs of a fixed shape, the sequences must be made the same length, so we add <PAD> tokens. For example, if the max length is 6, then
"I am jinho choi" -> ["I", "am", "jinho", "choi", "<PAD>", "<PAD>"]
"I want to work at NVIDIA" -> ["I", "want", "to", "work", "at", "NVIDIA"]
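
As a small illustration (plain Python, with a hypothetical pad_sentences helper, not code from the original implementation), padding just appends <PAD> tokens until every sentence reaches the maximum length:

```python
PAD = "<PAD>"

def pad_sentences(sentences, max_len):
    # Append <PAD> tokens so every sentence has exactly max_len tokens.
    return [tokens + [PAD] * (max_len - len(tokens)) for tokens in sentences]

batch = [
    ["I", "am", "jinho", "choi"],
    ["I", "want", "to", "work", "at", "NVIDIA"],
]
print(pad_sentences(batch, max_len=6))
# [['I', 'am', 'jinho', 'choi', '<PAD>', '<PAD>'],
#  ['I', 'want', 'to', 'work', 'at', 'NVIDIA']]
```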

Padding token has no meaning

Since the padding token has no meaning, we should mask it out when applying attention. The mask prevents the other words from "interacting" with the <PAD> token.

The figure below shows the padding mask. Here a, b, and c are real words, and D is the padding token.

As a result, if we add the mask to the dot-product scores, setting the masked positions to −∞ before the softmax, we get the following attention weights:
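
Here is a minimal PyTorch sketch of this idea (the PAD_ID value and function names are my own, not taken from any particular implementation): the padding mask marks real tokens, and masked positions get −∞ scores, so the softmax assigns them essentially zero weight.

```python
import torch
import torch.nn.functional as F

PAD_ID = 0  # assume id 0 is reserved for <PAD>

def padding_mask(token_ids):
    # True where the token is real, False where it is <PAD>.
    # Shape (batch, 1, 1, seq_len) so it broadcasts over heads and query positions.
    return (token_ids != PAD_ID).unsqueeze(1).unsqueeze(2)

def attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    if mask is not None:
        # Masked positions become -inf, so softmax gives them ~0 weight.
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights
```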

Look-ahead Mask

We need the look-ahead mask in the decoder self-attention to make the Transformer auto-regressive. For example, let's say we are doing an En->Kr translation task.

"Hello, my name is jinho choi" -> "안녕, 내 이름은 최진호야"

The decoder produces the following outputs step by step:

  1. "안녕"

  2. "안녕, 내"

  3. "안녕, 내 이름은"

  4. "안녕, 내 이름은 최진호야"

The output of the previous step is fed back into the decoder self-attention layer. However, it is important to block "안녕" from attending to "이름은", because "이름은" comes later in the sequence and must not influence earlier positions. ("이름은" can still attend to "안녕".)

So we add a look-ahead mask to the decoder self-attention layer.
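
As a minimal PyTorch sketch, the look-ahead mask is just a lower-triangular boolean matrix: row i marks the positions that step i is allowed to attend to (itself and everything before it).

```python
import torch

seq_len = 4  # e.g. the four decoder steps above

# Look-ahead (causal) mask: True where attention is allowed.
# Position i may attend to positions 0..i, but not to any later position.
look_ahead_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(look_ahead_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])

# In the decoder, this is typically combined with the padding mask
# from the earlier sketch, e.g. combined = look_ahead_mask & padding_mask(ids)
```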

References

[1] https://gmongaras.medium.com/how-do-self-attention-masks-work-72ed9382510f