Batch Normalization

Summary

Each layer's input affects the inputs of all subsequent layers. If the input distribution keeps fluctuating, the weights have to adapt to a different input distribution at every training step (Internal Covariate Shift). Batch normalization fixes the input distribution of each layer.

Problem with Deep Neural Network

Internal Covariate Shift

If you look at the neural network figure below, the input of layer 1 affects the inputs of all subsequent layers. A weight change during training therefore causes the input distributions of the subsequent layers to change, and the weights of each layer have to adapt to a changing input distribution at every training step.

This phenomenon is called Internal Covariate Shift.

In summary, the following loop occurs during training (a small numerical sketch follows the list).

  1. The input distribution of layer 1 changes.

  2. The input distributions of all subsequent layers change.

  3. The weights try to adapt to the changed input distributions.

  4. As the weights change, the input distributions change again.

  5. Go to 2.
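
To see the shift concretely, here is a minimal numpy sketch: the input data is held fixed, yet a single hypothetical update to the layer-1 weights already moves the statistics of the input that layer 2 sees. The layer sizes and the size of the perturbation are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A fixed mini-batch u and hypothetical layer-1 parameters.
u = rng.normal(size=(256, 32))             # 256 samples, 32 features
W1 = rng.normal(scale=0.5, size=(32, 64))  # layer-1 weights
b1 = np.zeros(64)

def layer2_input(W1, b1):
    """Layer 2 receives the activations produced by layer 1."""
    return sigmoid(u @ W1 + b1)

h = layer2_input(W1, b1)
print(f"before update: mean={h.mean():.3f} std={h.std():.3f}")

# Pretend one SGD step nudged W1 and b1 (the perturbation is arbitrary).
W1_new = W1 + 0.3 * rng.normal(size=W1.shape)
b1_new = b1 + 0.5

h_new = layer2_input(W1_new, b1_new)
print(f"after update:  mean={h_new.mean():.3f} std={h_new.std():.3f}")
# The statistics of layer 2's input move even though the data u never changed:
# this drift is exactly the internal covariate shift described above.
```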

Saturated Activation Problem

Saturated activation means that as $|x|$ gets bigger, the gradient of the activation function converges to 0. Sigmoid is a popular saturated activation function. We can use a non-saturating activation such as ReLU (Rectified Linear Unit) instead, but there are cases where sigmoid still has to be used.

Figure: Sigmoid function.

Let's think of a simple linear layer followed by the sigmoid activation function $g$:

$$x = Wu + b, \qquad g(x) = \frac{1}{1 + \exp(-x)}$$

If we calculate the gradient with respect to $W$,

$$\frac{\partial g(x)}{\partial W} = g'(Wu + b) \cdot u$$

As $|Wu + b|$ increases, the gradient $\frac{\partial g(x)}{\partial W}$ converges to 0. This is the Gradient Vanishing Problem.
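
To see how fast the sigmoid saturates, here is a small sketch that evaluates its derivative, $g'(x) = g(x)(1 - g(x))$, at a few sample points (the chosen values of $x$ are arbitrary).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # g'(x) = g(x) * (1 - g(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:4.1f} -> g'(x) = {sigmoid_grad(x):.6f}")

# x =  0.0 -> g'(x) = 0.250000
# x =  2.0 -> g'(x) = 0.104994
# x =  5.0 -> g'(x) = 0.006648
# x = 10.0 -> g'(x) = 0.000045
```

Since $\frac{\partial g(x)}{\partial W} = g'(Wu + b) \cdot u$, once $|Wu + b|$ lands in this flat region the weight gradient is essentially zero regardless of $u$.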

Batch Normalization

Batch Normalization reduces internal covariate shift by fixing each layer's input distribution: each feature of the layer's input is normalized with the mean and variance of the current mini-batch, and a learnable scale γ and shift β are applied afterwards.

Figure: Batch Normalization Algorithm.
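
The transform itself is simple; below is a minimal numpy sketch of the training-mode forward pass (the running statistics used at inference and the backward pass are left out).

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (N, D) feature-wise,
    then apply the learnable scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                    # mini-batch mean, shape (D,)
    var = x.var(axis=0)                    # mini-batch variance, shape (D,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(128, 4))  # a badly scaled layer input
gamma, beta = np.ones(4), np.zeros(4)

y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3))  # ~0 for every feature
print(y.std(axis=0).round(3))   # ~1 for every feature
```

Because γ and β are learned, the network can undo the normalization if that is what minimizes the loss, so fixing the distribution does not restrict what the layer can represent.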

References


[1] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv. https://arxiv.org/abs/1502.03167

[2] Sigmoid function. Wikipedia. https://en.wikipedia.org/wiki/Sigmoid_function