Batch Normalization
Each layer's output affects the inputs of all subsequent layers. If a layer's input distribution keeps fluctuating, its weights must adapt to a different input distribution at every step (Internal Covariate Shift). Batch Normalization fixes the input distribution of each layer.
If you look at the neural network figure below, the input of layer 1 affects the inputs of all subsequent layers. Weight changes during training therefore cause the input distributions of the subsequent layers to change, and the weights of each layer have to adapt to a changing input distribution at every training step.
This phenomenon is called Internal Covariate Shift.
In summary, the following loop occurs during training (see the sketch after the list):

1. The input distribution of layer 1 changes.
2. The input distributions of all subsequent layers change.
3. The weights try to adapt to the changed input distributions.
4. As the weights change, the input distributions change again.
5. Go to 2.
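As a rough illustration of this loop, here is a minimal NumPy sketch (the layer sizes, weight values, and update step are made up for this example) showing how changing the first layer's weights shifts the distribution of the next layer's input.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A fixed batch of inputs and a toy first layer (sizes chosen arbitrarily).
x = rng.normal(size=(256, 10))        # batch of 256 samples, 10 features
W1 = rng.normal(size=(10, 8)) * 0.5   # layer 1 weights

def layer2_input(W1):
    # Layer 2's input is layer 1's activation.
    return sigmoid(x @ W1)

h = layer2_input(W1)
print("before update: mean=%.3f std=%.3f" % (h.mean(), h.std()))

# Pretend a training step changed layer 1's weights.
W1 = W1 + rng.normal(size=W1.shape) * 0.5
h = layer2_input(W1)
print("after update:  mean=%.3f std=%.3f" % (h.mean(), h.std()))
```

Even though the data `x` never changed, the statistics of layer 2's input moved, so layer 2's weights would have to keep chasing a moving target.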
Sigmoid is a popular saturating activation function. We can use a non-saturating activation such as ReLU (Rectified Linear Unit) instead, but there are cases where sigmoid has to be used.
Batch Normalization reduces internal covariate shift by fixing each layer's input distribution.
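As a minimal sketch of the idea (not the full algorithm from the paper, which also keeps running statistics for inference), batch normalization standardizes each feature over the current mini-batch and then applies a learnable scale gamma and shift beta:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    x:     (batch_size, num_features) activations
    gamma: (num_features,) learnable scale
    beta:  (num_features,) learnable shift
    """
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # standardized activations
    return gamma * x_hat + beta              # restore representational power

# Usage: whatever distribution the previous layer produces, the normalized
# output has roughly zero mean and unit variance per feature (before gamma/beta).
x = np.random.default_rng(0).normal(loc=3.0, scale=5.0, size=(128, 8))
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))
```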
A saturating activation means that as the magnitude of its input gets larger, the gradient of the activation function converges to 0.
Let's think of a simple linear layer followed by a sigmoid activation:

$$z = w^\top x + b, \qquad a = \sigma(z) = \frac{1}{1 + e^{-z}}$$

If we calculate the gradient of $\sigma(z)$,

$$\frac{\partial \sigma(z)}{\partial z} = \sigma(z)\,(1 - \sigma(z))$$

As $|z|$ increases, the gradient converges to 0.
This is the Vanishing Gradient Problem.
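To see the saturation numerically, the short sketch below evaluates the sigmoid gradient $\sigma(z)(1-\sigma(z))$ at a few illustrative points (the chosen values of $z$ are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  d(sigma)/dz = {sigmoid_grad(z):.6f}")
# The gradient peaks at 0.25 for z = 0 and rapidly approaches 0 as |z| grows,
# which is what makes training with saturating activations hard.
```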