Data Science - Interview Questions
What is the difference between gradient descent optimisation algorithms Adam and Momentum?
ADAM ALGORITHM : Adaptive Moment Estimation, or Adam for short, is a combination of Momentum and RMSProp. In the AdaGrad algorithm, the accumulated sum of squared gradients only ever grows, which steadily shrinks the effective learning rate and slows training. RMSProp (root mean square propagation) fixes this by applying a decay factor to that running sum. Mathematically, Adam uses two decay rates, beta1 and beta2: beta1 controls the first moment, an exponentially decaying average of the gradients, while beta2 controls the second moment, an exponentially decaying average of the squared gradients. Since Momentum accelerates progress along the descent direction and RMSProp adapts the step size separately for each parameter, the combination of the two works well. Thus, Adam is widely considered the go-to optimiser for deep learning.
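
A minimal NumPy sketch of one Adam update, assuming a parameter vector theta, its gradient grad, and an iteration counter t; the function name adam_step and the default hyperparameters (lr, beta1, beta2, eps) are illustrative choices, not taken from the answer above.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: decaying average of gradients (the Momentum part).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: decaying average of squared gradients (the RMSProp part).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the early iterations (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step is scaled per parameter by the second moment.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Example: minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    theta, m, v = adam_step(2 * theta if False else theta, 2 * theta, m, v, t)[0], *adam_step(theta, 2 * theta, m, v, t)[1:]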

MOMENTUM ALGORITHM : Gradient descent with momentum accelerates vanilla gradient descent so that it moves faster towards the minimum. Mathematically, a decay rate is multiplied by the previous accumulated gradient (the velocity) and the present gradient is added to it to form the new velocity, which is then used to update the parameters. When the decay rate is set to zero, the method reduces to plain gradient descent. When the decay rate is set to 1, the updates oscillate like a ball rolling in a frictionless bowl and never settle. Hence the decay rate is typically chosen around 0.8 to 0.9 so that the iterates converge. Momentum also gives the optimiser a better chance of escaping shallow local minima and continuing towards the global minimum.
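
A minimal NumPy sketch of one common formulation of the momentum update (here the learning rate is folded into the velocity); the function name momentum_step and the defaults lr=0.01, decay=0.9 are illustrative assumptions.

import numpy as np

def momentum_step(theta, grad, velocity, lr=0.01, decay=0.9):
    # Velocity is a decayed sum of past gradient steps plus the present one.
    # decay = 0 recovers plain gradient descent; decay = 1 never damps the oscillation.
    velocity = decay * velocity + lr * grad
    return theta - velocity, velocity

# Example: minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta, velocity = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, velocity = momentum_step(theta, 2 * theta, velocity)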