Adam Optimizer
Most deep learning training uses the Adam optimizer. In this post, I would like to discuss what an optimizer is and why we use Adam.
If you want more detailed information about the Adam optimizer, please look at the original paper, "Adam: A Method for Stochastic Optimization" (Kingma & Ba, 2014).
In deep learning, we optimize the model to lower the loss. We call this procedure training.
To optimize the model (i.e., minimize the loss), we use the gradient descent method: we calculate each parameter's gradient and subtract it, scaled by the learning rate, from the parameter value.
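As a concrete reference (my notation, not from the original post: $\theta_t$ for the parameters at step $t$, $\alpha$ for the learning rate, $g_t$ for the gradient of the loss $L$), the vanilla gradient descent update is:

$$
g_t = \nabla_\theta L(\theta_t), \qquad \theta_{t+1} = \theta_t - \alpha\, g_t
$$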
The following GIF compares various algorithms in a local-minimum situation.
SGD cannot escape the local minimum. The momentum algorithm escapes in the final seconds. The other algorithms escape the local minimum easily.
As you can see, a good algorithm can bring faster training and better model performance.
We are going to focus on the Adam algorithm, but first things first: let's look at the momentum algorithm and the RMSprop algorithm.
The momentum algorithm uses the previous moment value to calculate the current moment.
- $m_t$: moment value at timestep $t$
- $\beta$: moment weight constant
- $g_t$: gradient calculated at timestep $t$
- $\theta_t$: model parameter at timestep $t$
- $\alpha$: learning rate constant
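Using these symbols, a common form of the momentum update is the following (the exact scaling of the gradient term, here $1 - \beta$, is my assumption; some formulations add $g_t$ without scaling):

$$
m_t = \beta\, m_{t-1} + (1 - \beta)\, g_t, \qquad \theta_{t+1} = \theta_t - \alpha\, m_t
$$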
The key idea of RMSprop is the following:
If the gradient is large, the parameter reaches convergence early. On the other hand, if the gradient is small, convergence is delayed.
So RMSprop scales the learning rate by the size of the gradient.
To measure the size of the gradient, it simply squares the gradient value and keeps a running average of it.
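In formula form, RMSprop looks like the following ($v_t$ for the running average of squared gradients, its decay constant $\rho$, and the small stability constant $\epsilon$ are my notation, not from the original post):

$$
v_t = \rho\, v_{t-1} + (1 - \rho)\, g_t^{2}, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t} + \epsilon}\, g_t
$$

Dividing by $\sqrt{v_t}$ shrinks the step for parameters that keep seeing large gradients and enlarges it for parameters with small gradients.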
The Adam algorithm is a mixture of momentum and RMSprop.
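For reference, the Adam update from the original paper keeps both moment estimates and applies a bias correction before the parameter step ($\beta_1$ and $\beta_2$ are the decay constants for the first and second moments, $\epsilon$ a small stability constant):

$$
m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^{2}
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}}, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
$$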
You can see there is a bias-correction step in the algorithm. Let's take a look at the reason.
Let's think of the gradients as samples from a probability distribution: for example, $g_1, \dots, g_t$ are sampled from some distribution $p(g)$.
As the timestep $t$ increases, the moment estimate $m_t$ converges toward the expected gradient $\mathbb{E}[g_t]$.
However, there is a problem. When $t$ is small, $m_t$ is greatly biased toward zero, because $m_0$ and $v_0$ are initialized to zero.
So Adam divides $m_t$ by $1 - \beta_1^{\,t}$ (and $v_t$ by $1 - \beta_2^{\,t}$) to correct the bias.
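To tie everything together, here is a minimal NumPy sketch of a single Adam update step. The function name, default hyperparameters, and the toy usage example are illustrative assumptions, not taken from the original post.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. theta: parameters, g: gradient at step t (t starts at 1),
    m, v: running first/second moment estimates (initialized to zeros)."""
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum part)
    v = beta2 * v + (1 - beta2) * g**2       # second moment (RMSprop part)
    m_hat = m / (1 - beta1**t)               # bias correction for m
    v_hat = v / (1 - beta2**t)               # bias correction for v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage sketch: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 501):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)  # approaches 0
```

Without the two bias-correction lines, the early steps would be much smaller than intended, since `m` and `v` start at zero.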