Batch Normalization

Summary

Each layer's input affects the inputs of all subsequent layers. If the input distribution keeps fluctuating, the weights have to adapt to a different input distribution at every training step (Internal Covariate Shift). Batch normalization fixes the input distribution of each layer.

Problem with Deep Neural Network

Internal Covariate Shift

If you look at the neural network figure below, the input of layer 1 affects the inputs of all subsequent layers. A weight change during training therefore causes the input distributions of the subsequent layers to change, and the weights of each layer have to adapt to a changing input distribution at every training step.

This phenomenon is called Internal Covariate Shift.

In summary, the following loop occurs during training (a small numerical sketch follows the list).

  1. The input distribution of layer 1 changes.

  2. The input distributions of all subsequent layers change.

  3. The weights try to adapt to the changed input distributions.

  4. As the weights change, the input distributions change again.

  5. Go to 2.
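
To see the shift concretely, here is a minimal numpy sketch: the input data is held fixed, yet a single hypothetical update to the layer-1 weights already moves the statistics of the input that layer 2 sees. The layer sizes and the size of the perturbation are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A fixed mini-batch u and hypothetical layer-1 parameters.
u = rng.normal(size=(256, 32))             # 256 samples, 32 features
W1 = rng.normal(scale=0.5, size=(32, 64))  # layer-1 weights
b1 = np.zeros(64)

def layer2_input(W1, b1):
    """Layer 2 receives the activations produced by layer 1."""
    return sigmoid(u @ W1 + b1)

h = layer2_input(W1, b1)
print(f"before update: mean={h.mean():.3f} std={h.std():.3f}")

# Pretend one SGD step nudged W1 and b1 (the perturbation is arbitrary).
W1_new = W1 + 0.3 * rng.normal(size=W1.shape)
b1_new = b1 + 0.5

h_new = layer2_input(W1_new, b1_new)
print(f"after update:  mean={h_new.mean():.3f} std={h_new.std():.3f}")
# The statistics of layer 2's input move even though the data u never changed:
# this drift is exactly the internal covariate shift described above.
```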

Saturated Activation Problem

Saturated activation means that as $|x|$ gets bigger, the gradient of the activation function converges to 0. Sigmoid is a popular saturated activation function. We can use a non-saturating activation such as ReLU (Rectified Linear Unit) instead, but there are cases where sigmoid still has to be used.

Figure: Sigmoid function.

Let's think of a simple linear layer followed by the sigmoid activation function $g$:

$$x = Wu + b, \qquad g(x) = \frac{1}{1 + \exp(-x)}$$

If we calculate the gradient with respect to $W$,

$$\frac{\partial g(x)}{\partial W} = g'(Wu + b) \cdot u$$

As $|Wu + b|$ increases, the gradient $\frac{\partial g(x)}{\partial W}$ converges to 0. This is the Gradient Vanishing Problem.
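
To see how fast the sigmoid saturates, here is a small sketch that evaluates its derivative, $g'(x) = g(x)(1 - g(x))$, at a few sample points (the chosen values of $x$ are arbitrary).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # g'(x) = g(x) * (1 - g(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:4.1f} -> g'(x) = {sigmoid_grad(x):.6f}")

# x =  0.0 -> g'(x) = 0.250000
# x =  2.0 -> g'(x) = 0.104994
# x =  5.0 -> g'(x) = 0.006648
# x = 10.0 -> g'(x) = 0.000045
```

Since $\frac{\partial g(x)}{\partial W} = g'(Wu + b) \cdot u$, once $|Wu + b|$ lands in this flat region the weight gradient is essentially zero regardless of $u$.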

Batch Normalization

Batch Normalization reduces internal covariate shift by fixing each layer's input distribution: each feature of the layer's input is normalized with the mean and variance of the current mini-batch, and a learnable scale γ and shift β are applied afterwards.

Figure: Batch Normalization Algorithm.
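
The transform itself is simple; below is a minimal numpy sketch of the training-mode forward pass (the running statistics used at inference and the backward pass are left out).

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (N, D) feature-wise,
    then apply the learnable scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                    # mini-batch mean, shape (D,)
    var = x.var(axis=0)                    # mini-batch variance, shape (D,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(128, 4))  # a badly scaled layer input
gamma, beta = np.ones(4), np.zeros(4)

y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3))  # ~0 for every feature
print(y.std(axis=0).round(3))   # ~1 for every feature
```

Because γ and β are learned, the network can undo the normalization if that is what minimizes the loss, so fixing the distribution does not restrict what the layer can represent.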

References


[1] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv. https://arxiv.org/abs/1502.03167

[2] Sigmoid function. Wikipedia. https://en.wikipedia.org/wiki/Sigmoid_function