Introduction to neural network optimizers [part 3] – Adam optimizer

This is the third part of a three-article series that introduces some general concepts and concrete algorithms in the field of neural network optimizers. As a reminder, here is the table of contents:

1. Part 1: momentum optimizer
2. Part 2: adaptive learning rates
3. Part 3: Adam optimizer

We covered two important concepts of optimizers in the previous parts, namely the introduction of a momentum term and adaptive learning rates. However, other variations, combinations and even additional concepts have also been proposed.

Each optimizer has its own advantages and limitations, making it suitable for specific contexts. It is beyond the scope of this series to name or introduce them all. Instead, we briefly explain the well-established Adam optimizer as one example. It also reuses some of the ideas discussed previously.

Before we proceed, we want to share some thoughts regarding the combination of optimizers. One obvious choice might be to combine the momentum optimizer with the adaptive learning scheme. Even though this is theoretically possible, and even an option in an implementation of the RMSProp algorithm, there might be a problem.

The main concept of the momentum optimizer is to accelerate when the direction of the gradient remains the same in subsequent iterations. As a result, the update vector increases in magnitude. This, however, contradicts one of the goals of adaptive learning rates, which is to keep the gradients in “reasonable ranges”. This may lead to issues when the momentum vector $$\fvec{m}$$ increases but then gets scaled down again by the scaling vector $$\fvec{s}$$.

The authors of RMSProp also note that directly combining adaptive learning rates with a momentum term does not work so well. The theoretical argument just discussed might be one cause of this observation.

In the following, we first define the Adam algorithm and then look at the differences compared to previous approaches. The first is the usage of first-order moments, which behave differently compared to a momentum vector. We use an example to see how this choice has an advantage in skipping suboptimal local minima. The second difference is the usage of bias-correction terms, which are necessary due to the zero-initialization of the moment vectors. Finally, we also take a look at different trajectories.

Mathematical formulation

This optimizer was introduced by Diederik P. Kingma and Jimmy Ba in 2014. It mainly builds upon the ideas from AdaGrad and RMSProp, i.e. adaptive learning rates, and extends these approaches. The name is derived from adaptive moment estimation.

In addition to the variables used in classical gradient descent, let $$\fvec{m} = (m_1, m_2, \ldots, m_n) \in \mathbb{R}^n$$ and $$\fvec{s} = (s_1, s_2, \ldots, s_n) \in \mathbb{R}^n$$ be the vectors with the estimates of the first and second raw moments of the gradients (same lengths as the weight vector $$\fvec{w}$$). Both vectors are initialized to zero, i.e. $$\fvec{m}(0) = \fvec{0}$$ and $$\fvec{s}(0) = \fvec{0}$$. The hyperparameters $$\beta_1, \beta_2 \in [0;1[$$ denote the decaying rates for the moment estimates and $$\varepsilon \in \mathbb{R}^+$$ is a smoothing term. Then, the Adam optimizer defines the update rules

\begin{align} \begin{split} \fvec{m}(t) &= \beta_1 \cdot \fvec{m}(t-1) + (1-\beta_1) \cdot \nabla E\left( \fvec{w}(t-1) \right) \\ \fvec{s}(t) &= \beta_2 \cdot \fvec{s}(t-1) + (1-\beta_2) \cdot \nabla E \left( \fvec{w}(t-1) \right) \odot \nabla E \left( \fvec{w}(t-1) \right) \\ \fvec{w}(t) &= \fvec{w}(t-1) - \eta \cdot \frac{\fvec{m}(t)}{1-\beta_1^t} \oslash \sqrt{\frac{\fvec{s}(t)}{1-\beta_2^t} + \varepsilon} \end{split} \label{eq:AdamOptimizer_Adam} \end{align}

to find a path from the initial position $$\fvec{w}(0)$$ to a local minimum of the error function $$E\left(\fvec{w}\right)$$. The symbol $$\odot$$ denotes the point-wise multiplication and $$\oslash$$ the point-wise division between vectors.
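A minimal NumPy sketch of one such update step might look as follows (the gradient function `grad_E` is a placeholder for whichever error function is used; note that $$\varepsilon$$ is placed inside the square root here to match the equation above, whereas some reference implementations place it outside):

```python
import numpy as np

def adam_step(w, m, s, t, grad_E, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update according to the rules above.

    w, m and s are the weight and moment vectors, t is the 1-based iteration
    index and grad_E a function returning the gradient of the error function.
    """
    g = grad_E(w)
    m = beta1 * m + (1 - beta1) * g              # first raw moment estimate
    s = beta2 * s + (1 - beta2) * g * g          # second raw moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    s_hat = s / (1 - beta2 ** t)
    w = w - eta * m_hat / np.sqrt(s_hat + eps)   # point-wise division
    return w, m, s
```

The point-wise operations $$\odot$$ and $$\oslash$$ map directly to NumPy's element-wise `*` and `/`.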

There is a very close relationship to adaptive learning rates. In fact, the update rule of $$\fvec{s}(t)$$ in \eqref{eq:AdamOptimizer_Adam} is identical to the one in the adaptive learning scheme. We also see that there is an $$\fvec{m}$$ vector, although it is different from the one defined in momentum optimization. We will pick up this point again shortly.

In the description of Adam, the arguments are more statistically-driven: $$\fvec{m}$$ and $$\fvec{s}$$ are interpreted as exponentially moving averages of the first and second raw moment of the gradient. That is, $$\fvec{m}$$ is a biased estimate of the means of the gradients and $$\fvec{s}$$ is a biased estimate of the uncentred variances of the gradients. In total, we can say that the Adam update process uses information about where the gradients are located on average and how they tend to scatter.

First-order moments

In momentum optimization, we keep track of an exponentially decaying sum, whereas in Adam we have an exponentially decaying average. The difference is that in Adam we do not add the full new gradient vector $$\nabla E\left( \fvec{w}(t-1) \right)$$. Instead, only a fraction is used while at the same time a fraction of the old momentum is removed (this last part is identical to the momentum optimizer). For example, if we set $$\beta_1 = 0.9$$, we keep 90 % of the old value and add 10 % of the new one. The bottom line is that we build up much less momentum, i.e. the momentum vector does not grow as much.
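This difference is easy to see with a hypothetical constant gradient of 1: the decaying sum of the momentum optimizer approaches $$g / (1-\beta_1)$$ (tenfold the gradient for $$\beta_1 = 0.9$$), whereas Adam's decaying average approaches the gradient itself:

```python
beta = 0.9
g = 1.0  # hypothetical constant gradient

m_sum = 0.0  # momentum optimizer: exponentially decaying sum
m_avg = 0.0  # Adam: exponentially decaying average
for _ in range(100):
    m_sum = beta * m_sum + g
    m_avg = beta * m_avg + (1 - beta) * g

print(m_sum)  # approaches g / (1 - beta) = 10
print(m_avg)  # approaches g = 1
```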

In the analogy of a ball rolling down a valley, we may think of the moment updates in \eqref{eq:AdamOptimizer_Adam} as a very heavy ball with a lot of friction. It accelerates less and needs more time to take the gradient information into account. The ball rolls down the valley according to the running average of gradients along the track. Since it takes some time until the old gradient information is lost, it is less likely to stop at small plateaus and can hence overshoot small local minima.

We now want to test this argument on a small example function. For this, we leave out the second moments $$\fvec{s}$$ for now so that \eqref{eq:AdamOptimizer_Adam} reduces to

\begin{align} \begin{split} \fvec{m}(t) &= \beta_1 \cdot \fvec{m}(t-1) + (1-\beta_1) \cdot \nabla E\left( \fvec{w}(t-1) \right) \\ \fvec{w}(t) &= \fvec{w}(t-1) - \eta \cdot \frac{\fvec{m}(t)}{1-\beta_1^t}. \end{split} \label{eq:AdamOptimizer_AdamFirstMoment} \end{align}
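A minimal sketch of this reduced rule (the gradient function `grad_E` is again a placeholder):

```python
import numpy as np

def first_moment_step(w, m, t, grad_E, eta, beta1):
    """One update of the reduced rule (first-order moments only)."""
    g = grad_E(w)
    m = beta1 * m + (1 - beta1) * g     # exponentially decaying average
    w = w - eta * m / (1 - beta1 ** t)  # bias-corrected first moment
    return w, m
```

For $$\beta_1 = 0$$, the rule collapses to classical gradient descent: $$\fvec{m}(t)$$ equals the current gradient and the correction factor $$1-\beta_1^t$$ becomes 1.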

We want to compare these first moment updates with classical gradient descent. The following figure shows the example function and allows you to play around with a trajectory which starts near the summit of the hill.

Directly after the first descent there is a small local minimum, and we see that classical gradient descent ($$\beta_1 = 0$$) gets stuck here. However, with first-order moments (e.g. $$\beta_1 = 0.95$$), we leverage the fact that the moving average does not decrease fast enough, so that we can still roll over this small hole and make it down to the valley.

We can see from the error landscape that the first gradient component has the major impact on the updates as it is the direction of the steepest hill. It is insightful to visualize the first component $$m_1(t)$$ of the first-order moments over iteration time $$t$$:

Figure 2: First component $$m_1(t)$$ of the first-order moments over iteration time $$t$$. The values are calculated according to \eqref{eq:AdamOptimizer_AdamFirstMoment} and use the same starting point as the trajectory in the previous figure. 150 iterations and a global learning rate of $$\eta=20$$ were used. The $$\beta_1 = 0$$ curve corresponds to classical gradient descent and the $$\beta_1 = 0.95$$ curve to an update scheme which employs first-order moments.

With classical gradient descent ($$\beta_1 = 0$$), we move fast down the hill but then get stuck in the first local minimum. As only local gradient information is used in the update process, the chances of escaping the hole are very low.

In contrast, when using first-order moments, we pick up speed more slowly as only a fraction of the large first gradients is used. However, $$m_1(t)$$ also decreases more slowly when reaching the first hole. In this case, the behaviour of the moving average helps to step over the short increase and to move further down the valley.

Building momentum and accelerating when we move in the same direction in subsequent iterations is the main concept and advantage of momentum optimization. However, as we already saw in the toy example used in the momentum optimizer article, large momentum vectors may be problematic as they can overstep local minima and lead to oscillations. What is more, as stressed in the argument above, it is not entirely clear whether momentum optimization works well together with adaptive learning rates. Hence, it seems reasonable that the momentum optimizer is not used directly in Adam.

Bias correction

The final change in the Adam optimizer compared to its predecessors is the bias-correction terms, where we divide the moment vectors by $$(1-\beta_1^t)$$ and $$(1-\beta_2^t)$$, respectively. This is necessary because the moment vectors are initialized to zero, so that the moving averages are, especially in the beginning, biased towards the origin. The factors are a countermeasure to correct this bias.

Practically speaking, these terms boost both vectors in the beginning since each is divided by a number usually $$< 1$$. This can speed up convergence when the true moving averages are not located at the origin but are larger instead. As the factors have the iteration number $$t$$ in the exponent of the hyperparameters, they approach 1 over time and hence become less influential.
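The effect is easiest to see with a hypothetical constant gradient $$g$$: the raw average evaluates to $$m(t) = (1-\beta_1^t) \cdot g$$, i.e. it is biased towards zero, and dividing by $$(1-\beta_1^t)$$ recovers $$g$$ (up to floating-point error):

```python
beta1 = 0.9
g = 5.0  # hypothetical constant gradient

m = 0.0
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected estimate
    # m itself is still biased towards zero (m = (1 - beta1**t) * g),
    # while the corrected estimate m_hat recovers g
    print(t, m, m_hat)
```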

We now consider, once again, a one-dimensional example and define measures to compare the update vectors of the second iteration using either classical gradient descent or the Adam optimizer. To visualize the effect of the bias-correction terms, we repeat the process once more, this time leaving these terms out.

Denoting the gradients of the first two iterations as $$g_t = \nabla E\left( w(t-1) \right)$$, we build the moment estimates

\begin{align*} m(1) &= \beta_1 \cdot m(0) + (1-\beta_1) \cdot g_1 = (1-\beta_1) \cdot g_1 \\ m(2) &= \beta_1 \cdot m(1) + (1-\beta_1) \cdot g_2 = \beta_1 \cdot (1-\beta_1) \cdot g_1 + (1-\beta_1) \cdot g_2 \\ s(1) &= \beta_2 \cdot s(0) + (1-\beta_2) \cdot g_1^2 = (1-\beta_2) \cdot g_1^2 \\ s(2) &= \beta_2 \cdot s(1) + (1-\beta_2) \cdot g_2^2 = \beta_2 \cdot (1-\beta_2) \cdot g_1^2 + (1-\beta_2) \cdot g_2^2 \end{align*}

so that we can define a comparison measure as

\begin{equation} \label{eq:AdamOptimizer_AdamMeasureCorrection} C_A(g_1,g_2) = \left| \eta \cdot \frac{\frac{m(2)}{1-\beta_1^2}}{\sqrt{\frac{s(2)}{1-\beta_2^2} + \varepsilon}} \right| - |\eta \cdot g_2| = \left| \eta \cdot \frac{\sqrt{1-\beta_2^2}}{1-\beta_1^2} \cdot \frac{m(2)}{\sqrt{s(2) + (1-\beta_2^2) \cdot \varepsilon}} \right| - |\eta \cdot g_2|. \end{equation}

To make the effect of the bias-correction terms more evident, we moved them out of the compound fraction and used them as a prefactor. We define a similar measure without these terms

\begin{equation} \label{eq:AdamOptimizer_AdamMeasureNoCorrection} \tilde{C}_A(g_1,g_2) = \left| \eta \cdot \frac{m(2)}{\sqrt{s(2) + \varepsilon}} \right| - |\eta \cdot g_2|. \end{equation}
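Both forms of $$C_A$$ and the uncorrected measure $$\tilde{C}_A$$ can be checked against each other for concrete (hypothetical) gradient values:

```python
from math import sqrt

eta, eps = 1.0, 1e-8
beta1, beta2 = 0.9, 0.999
g1, g2 = 0.5, -0.3  # hypothetical gradients of the first two iterations

# Moment estimates m(2) and s(2) from the closed forms derived above
m2 = beta1 * (1 - beta1) * g1 + (1 - beta1) * g2
s2 = beta2 * (1 - beta2) * g1**2 + (1 - beta2) * g2**2

# C_A: compound-fraction form and prefactor form of the corrected measure
C_compound = abs(eta * (m2 / (1 - beta1**2))
                 / sqrt(s2 / (1 - beta2**2) + eps)) - abs(eta * g2)
C_prefactor = abs(eta * sqrt(1 - beta2**2) / (1 - beta1**2)
                  * m2 / sqrt(s2 + (1 - beta2**2) * eps)) - abs(eta * g2)

# Measure without bias correction
C_tilde = abs(eta * m2 / sqrt(s2 + eps)) - abs(eta * g2)

print(abs(C_compound - C_prefactor) < 1e-9)  # True: both forms agree
print(C_tilde > C_compound)                  # True: effect is stronger without correction
```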

The following figure compares the two measures by interpreting the gradients of the first two iterations as variables.

Figure 3: Effect of the bias correction terms in the Adam optimizer. The left plot shows the measure $$C_A(g_1,g_2)$$ (\eqref{eq:AdamOptimizer_AdamMeasureCorrection}) and the right $$\tilde{C}_A(g_1,g_2)$$ (\eqref{eq:AdamOptimizer_AdamMeasureNoCorrection}). In the former measure, bias correction terms are used and in the latter not. Both measures compare the updates of the Adam optimizer with the ones of classical gradient descent. The learning rate is set to $$\eta = 1$$, the smoothing term to $$\varepsilon = 10^{-8}$$ and the exponentially decaying rates to $$\beta_1 = 0.9$$ and $$\beta_2 = 0.999$$.

With correction terms (left image), we can observe that small gradients get amplified and larger ones attenuated. This is inherited from the adaptive learning scheme. Back then, however, this behaviour was more centred around the origin, whereas here small gradients get amplified less and more independently of $$g_1$$. This is likely an effect of the $$m(2)$$ term, which uses only a small fraction (10 % in this case) of the first gradient $$g_1$$, leading to a smaller numerator.

When we compare this result with the one without any bias correction (right image), we see a much brighter picture. That is, both the amplification of small gradients and the attenuation of large ones are stronger. This is not surprising, as the prefactor

\begin{equation*} \frac{\sqrt{1-\beta_2^2}}{1-\beta_1^2} = \frac{\sqrt{1-0.999^2}}{1-0.9^2} \approx 0.2353 \end{equation*}

is smaller than 1 and hence leads to an overall decrease (the term $$(1-\beta_2^2) \cdot \varepsilon$$ is too small to have a visible effect). The bias-correction terms therefore also ensure that the update vectors behave moderately at the beginning of the learning process.
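The value of the prefactor is quickly verified:

```python
from math import sqrt

beta1, beta2 = 0.9, 0.999
prefactor = sqrt(1 - beta2**2) / (1 - beta1**2)
print(round(prefactor, 4))  # 0.2353
```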

Trajectories

Like in previous articles, we now also want to compare different trajectories when using the Adam optimizer. For this, we can use the following widget which implements the Adam optimizer.

Basically, the parameters behave as expected: larger values for $$\beta_1$$ make the accumulated gradients decrease more slowly so that we first overshoot the minimum. $$\beta_2$$ again controls the preference for direction ($$\beta_2$$ small) vs. magnitude ($$\beta_2$$ large).

Note that even though the Adam optimizer is much more advanced than classical gradient descent, this does not mean it is immune to extreme settings. It is still possible for weird effects such as oscillations to occur, or for the overshooting mechanism to discard good minima (example settings). Hence, it may still be worthwhile to search for good values for the hyperparameters.

Conclusion

This concludes the main concepts of the Adam optimizer. It is a popular optimization technique and its default settings are often a good starting point. Personally, I have had good experience with this optimizer and would definitely use it again. However, depending on the problem, it might not be the best choice or may require tuning of the hyperparameters. For this, it is good to know what the hyperparameters do and also how the other optimization techniques work.
