A neural network is a model with many parameters which are used to derive an output based on an input. In the learning process, we show the network a series of example inputs with associated outputs so that it can adapt its parameters (weights) according to a defined error function. Since these error functions are usually too complex, we cannot simply determine an explicit formula for the optimal parameters. The usual approach to tackle this problem is to start with a random initialization of the weights and use gradient descent to iteratively find a local minimum in the error landscape. That is, we adapt the weights over multiple iterations according to the gradients until the value of the error function is sufficiently low.

However, using this approach without modifications (classical gradient descent) has some disadvantages: it can be slow, may end up in a suboptimal local minimum and requires a careful choice of the learning rate. Hence, multiple approaches have been developed over the years to improve classical gradient descent and address some of these problems.

This is the first part of a series of three articles with the goal of introducing some general concepts and concrete algorithms in the field of neural network optimizers. We start in this article with the momentum optimizer, which tries to improve convergence by speeding up when we keep moving in the same direction. The second part introduces the concept of individual learning rates for each weight in the form of an adaptive learning scheme which scales according to the accumulation of past gradients. Finally, the third part introduces the Adam optimizer, which re-uses some of the ideas discussed in the first two parts and also tries to target better local minima. In summary:

- Part 1: momentum optimizer
- Part 2: adaptive learning rates
- Part 3: Adam optimizer

### The basic idea behind momentum optimization

In classical gradient descent, we perform each update step solely on the basis of the current gradient. In particular, no information about the previous gradients is included in the update step. However, this might not be very beneficial as the previous gradients can provide us with information to speed up convergence. This is where the momentum optimizer comes into play^{1}.

The general idea is to speed up learning when we move in the same direction over multiple iterations and we slow down when the direction changes. With “direction” we mean the sign of the gradient components as we update each weight individually. For example, if we had moved to the right in the first two iterations (positive gradient components), we could assume that our optimum is somewhere at the right and move even a bit further in this direction. Of course, this decision is only possible when we still have the information about the gradient of the first iteration.

There is an analogy which is often used: imagine we place a ball on a hill and it rolls down into a valley. We use gradient descent to update the position of the ball based on the current slopes. If we use classical gradient descent, it is like placing the ball each time at the new position, completely independent of the previous trajectory. With momentum optimization, however, the ball picks up speed as it builds momentum^{2}. This allows the ball to reach the valley much faster than with classical gradient descent.

There is one problem with the current approach, though: if we accelerated uncontrollably as long as we move in the same direction, we might end up with huge update steps predestined to overshoot local minima. Hence, we need some kind of friction mechanism to act as a counterforce and slow down the ball again.

In the following, we introduce the concepts behind momentum optimization. We begin with the mathematical formulation where we also compare it with classical gradient descent. We then apply the formulas on a small numerical example to understand their basic functionality and look at different trajectories. Next, we analyse the speed improvements of momentum optimization from an exemplary and theoretical point of view. Lastly, we discuss the two basic concepts in momentum optimization, namely acceleration and deceleration, in more detail.

### Mathematical formulation

The basic algorithm to iteratively find a minimum of a function is (classical) gradient descent. It uses, as the name suggests, the gradients \(\nabla E\) of the error function \(E\left(\fvec{w}\right)\) as the main ingredient.

Let \(\fvec{w} = (w_1, w_2, \ldots, w_n) \in \mathbb{R}^n\) be a vector of weights (e.g. of a neural network) initialized randomly at \(\fvec{w}(0)\), \(\eta \in \mathbb{R}^+\) a global learning rate and \(t \in \mathbb{N}^+\) the iteration number. Then, classical gradient descent uses the update rule

\begin{equation} \label{eq:MomentumOptimizer_ClassicalGradientDescent} \fvec{w}(t) = \fvec{w}(t-1) - \eta \cdot \nabla E\left( \fvec{w}(t-1) \right) \end{equation}to find a path from the initial position \(\fvec{w}(0)\) to a local minimum of the error function \(E\left(\fvec{w}\right)\) based on the gradients \(\nabla E\).

That is, we move in each iteration in the direction of the current gradient vector (the gradient is subtracted since we want to reach a minimum and not a maximum of \(E(\fvec{w})\) in gradient *descent*). In momentum optimization, we move in the direction of the momentum vector \(\fvec{m}\) instead.
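As a quick sketch, the classical update rule translates directly into a few lines of Python. The quadratic bowl used as the error function here is an illustrative assumption, not part of the rule itself:

```python
import numpy as np

def gradient_descent(grad_E, w0, eta, n_iterations):
    """Classical gradient descent: w(t) = w(t-1) - eta * grad_E(w(t-1))."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iterations):
        w = w - eta * grad_E(w)
    return w

# Illustrative quadratic bowl E(w) = w1^2 + w2^2 with gradient 2w;
# the minimum lies at the origin.
w_min = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0], eta=0.1, n_iterations=100)
```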

In addition to the variables used in classical gradient descent, let \(\fvec{m} = (m_1, m_2, \ldots, m_n) \in \mathbb{R}^n\) be the momentum vector of the same length as the weights \(\fvec{w}\), which is initialized to zero, i.e. \(\fvec{m}(0) = \fvec{0}\), and \(\alpha \in [0;1]\) the momentum parameter (friction parameter). Then, the momentum optimizer defines the update rules

\begin{align} \begin{split} \fvec{m}(t) &= \alpha \cdot \fvec{m}(t-1) + \nabla E\left( \fvec{w}(t-1) \right) \\ \fvec{w}(t) &= \fvec{w}(t-1) - \eta \cdot \fvec{m}(t) \end{split} \label{eq:MomentumOptimizer_Momentum} \end{align}to find a path from the initial position \(\fvec{w}(0)\) to a local minimum of the error function \(E\left(\fvec{w}\right)\).

These equations summarize the former ideas: we keep track of previous gradients by adding them to the momentum vector, we subtract the momentum vector \(\fvec{m}\) which includes information about the complete previous trajectory and we have a friction mechanism in terms of the momentum parameter \(\alpha\).

If we set \(\alpha = 0\), we do not consider previous gradients and end up with classical gradient descent, i.e. \eqref{eq:MomentumOptimizer_ClassicalGradientDescent} and \eqref{eq:MomentumOptimizer_Momentum} produce the same sequence of weights. In the other extreme, \(\alpha = 1\), we have no friction at all and accumulate every previous gradient. This is not really a useful setting as the updates tend to get way too large so that we almost certainly overshoot every local minimum. For values in-between, i.e. \(\alpha \in \; ]0;1[\), we can control the amount of friction we want to have with lower values of \(\alpha\) meaning a higher amount of friction as more of the previous gradients are discarded. We are comparing different values for this hyperparameter^{3} later.
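To summarize the formulation, the two momentum update rules can be sketched as a short loop. The quadratic example function and all parameter values are again illustrative assumptions; with \(\alpha = 0\) the loop reduces to classical gradient descent:

```python
import numpy as np

def momentum_descent(grad_E, w0, eta, alpha, n_iterations):
    """Momentum optimizer:
    m(t) = alpha * m(t-1) + grad_E(w(t-1)); w(t) = w(t-1) - eta * m(t)."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)  # m(0) = 0
    for _ in range(n_iterations):
        m = alpha * m + grad_E(w)
        w = w - eta * m
    return w

# Illustrative quadratic bowl E(w) = w1^2 + w2^2 (gradient 2w)
w_min = momentum_descent(lambda w: 2 * w, w0=[3.0, -4.0],
                         eta=0.1, alpha=0.9, n_iterations=500)
```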

### Numerical example

First, let us look at a numerical example which compares classical gradient descent with momentum optimization. For this, suppose that we have two weights \(w_1\) and \(w_2\) used in the (fictional) error function

\begin{equation} \label{eq:MomentumOptimizer_ExampleFunction} E(\fvec{w}) = 3 \cdot w_1^2 + 10 \cdot w_2^2. \end{equation}We are starting in our error landscape at the position \(\fvec{w}(0) = (-15,20)\), using a learning rate of \(\eta = 0.001\) and want to apply two updates to the weights \(\fvec{w}\). We first need to calculate the gradients of our error function

\begin{equation} \label{eq:MomentumOptimizer_ExampleFunctionGradient} \nabla E\left( \fvec{w} \right) = \cvec{\xfrac{\partial E}{\partial w_1} \\ \xfrac{\partial E}{\partial w_2}} = \cvec{6 \cdot w_1 \\ 20 \cdot w_2} \end{equation}and evaluate it at the initial position

\begin{equation*} \nabla E \left( \fvec{w}(0) \right) = \cvec{6 \cdot (-15) \\ 20 \cdot 20} = \cvec{-90 \\ 400}. \end{equation*}We can then start with classical gradient descent and apply the first update^{4}

\begin{equation*} \fvec{w}_c(1) = \fvec{w}(0) - \eta \cdot \nabla E\left( \fvec{w}(0) \right) = \cvec{-15 \\ 20} - 0.001 \cdot \cvec{-90 \\ 400} = \cvec{-14.91 \\ 19.6}. \end{equation*}

The new weight position \(\fvec{w}_c(1)\) is the basis for the evaluation of the next gradient

\begin{equation*} \nabla E \left( \fvec{w}_c(1) \right) = \cvec{6 \cdot (-14.91) \\ 20 \cdot 19.6} = \cvec{-89.46 \\ 392} \end{equation*}required for the second update (results are rounded to two decimal places)

\begin{equation} \label{eq:MomentumOptimizer_W2Classic} \fvec{w}_c(2) = \fvec{w}_c(1) - \eta \cdot \nabla E\left( \fvec{w}_c(1) \right) = \cvec{-14.91 \\ 19.6} - 0.001 \cdot \cvec{-89.46 \\ 392} \approx \cvec{-14.82 \\ 19.21}. \end{equation}We now repeat the same process but with momentum optimization enabled and the momentum parameter set to \(\alpha = 0.9\). Since the momentum vector is initialized to \( \fvec{m}(0) = \fvec{0} \), the first update is the same as before

\begin{align*} \fvec{m}(1) &= \alpha \cdot \fvec{m}(0) + \nabla E\left( \fvec{w}(0) \right) = 0.9 \cdot \fvec{0} + \cvec{-90 \\ 400} = \cvec{-90 \\ 400} \\ \fvec{w}_m(1) &= \fvec{w}(0) - \eta \cdot \fvec{m}(1) = \cvec{-15 \\ 20} - 0.001 \cdot \cvec{-90 \\ 400} = \cvec{-14.91 \\ 19.6}. \end{align*}In the second update, however, the momentum optimizer leverages the fact that we move again in the same direction (\(w_1\) to the right and \(w_2\) to the left) and accelerates (the gradient is the same as before, i.e. \(\nabla E \left( \fvec{w}_m(1) \right) = \nabla E \left( \fvec{w}_c(1) \right)\))

\begin{align*} \fvec{m}(2) &= \alpha \cdot \fvec{m}(1) + \nabla E\left( \fvec{w}_m(1) \right) = 0.9 \cdot \cvec{-90 \\ 400} + \cvec{-89.46 \\ 392} = \cvec{-170.46 \\ 752} \\ \fvec{w}_m(2) &= \fvec{w}_m(1) - \eta \cdot \fvec{m}(2) = \cvec{-14.91 \\ 19.6} - 0.001 \cdot \cvec{-170.46 \\ 752} \approx \cvec{-14.74 \\ 18.85}. \end{align*}This situation is also visualized in the following figure:

We can see that the position with momentum \(\fvec{w}_m(2)\) is a fair way ahead of the position \(\fvec{w}_c(2)\) reached via classical gradient descent. That is, the first weight moved more to the right (\(w_{1,m}(2) > w_{1,c}(2)\)) and the second weight more to the left (\(w_{2,m}(2) < w_{2,c}(2)\)).
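The two update steps of this example can be reproduced with a short script, a sketch applying \eqref{eq:MomentumOptimizer_Momentum} to the example function \eqref{eq:MomentumOptimizer_ExampleFunction}:

```python
import numpy as np

def grad_E(w):
    # Gradient of the example function E(w) = 3*w1^2 + 10*w2^2
    return np.array([6 * w[0], 20 * w[1]])

w0 = np.array([-15.0, 20.0])
eta, alpha = 0.001, 0.9

# Two updates with classical gradient descent
w_c = w0.copy()
for _ in range(2):
    w_c = w_c - eta * grad_E(w_c)

# Two updates with the momentum optimizer
w_m, m = w0.copy(), np.zeros(2)
for _ in range(2):
    m = alpha * m + grad_E(w_m)
    w_m = w_m - eta * m

print(np.round(w_c, 2))  # [-14.82  19.21]
print(np.round(w_m, 2))  # [-14.74  18.85]
```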

### Trajectories

So, how do the two approaches proceed further on their way to the minimum? To answer this question, you can play around in the following animation. With the default error function and without considering extreme settings, all trajectories should find their way to the local (and in this case also global) minimum of \(\fvec{w}^* = (0, 0)\).

If we start with classical gradient descent (\(\alpha = 0\)) and then increase the momentum parameter \(\alpha\), we see how we get closer to the minimum in the same number of iterations. However, setting the parameter too high can lead to negative effects. For example, the \(\alpha = 0.9\) path first moves in the wrong direction (down) before it turns back to the minimum. This is a consequence of too little friction. The accumulation from the first few gradients is still very high so that the step sizes are very large (enable the \(m_2\) trace to make this effect more prominent). It takes some iterations before the friction can do its job and decrease the accumulation so that the path can turn and move in the correct direction again. This happens once more when the path is already near the local minimum (zoom in to see this better).

Generally speaking, when we set the momentum parameter \(\alpha\) too high, the path is likely to oscillate around an optimum. It may overshoot the minimum several times before the accumulated values have decreased enough.

Of course, instead of adjusting the momentum parameter \(\alpha\), we could also tune the learning rate \(\eta\). Increasing it can likewise help to reach the minimum faster, but it can lead to similar oscillation problems^{6}. However, \(\eta\) is a global setting and influences all step sizes independently of the current location in the error surface. The momentum parameter \(\alpha\), on the other hand, specifies how much we take from the currently accumulated momentum, and this depends on all previous gradients. What is more, a good setting for the learning rate \(\eta\) also helps in momentum optimization.

Note that in neural networks, which usually have many more parameters than just two, it is not easily possible to visualize the trajectory as done here. One has to rely on other measures instead, like the value of the error function \(E(\fvec{w})\).

### Speed improvements

The goal of momentum optimization is to converge faster to a local optimum and we can see that this is indeed the case when we keep track of the error course. To make this comparison even clearer, we are elaborating on this point a bit further in this section.

One note before we proceed, though: the results here serve only as a general hint and do not necessarily represent realistic speed improvements. That is because we use \eqref{eq:MomentumOptimizer_ExampleFunction} as error function which is only a toy example. Error surfaces from the real world are usually more complex so that optimizers may behave differently.

However, the nice thing about this error function is that it has only one minimum, so any reduction of the error value means that we come closer to the same optimum. This allows us to compare the value of the error function in each step \(t\). This is done in the following figure for three trajectories with different values for the momentum parameter \(\alpha\). The lower the value on the \(y\)-axis, the closer the path is to the minimum, and the earlier this happens, the better.

We can clearly see that with classical gradient descent (\(\alpha = 0\)) we approach the optimum most slowly compared to the other curves. Using the momentum optimizer helps here in both cases, with \(\alpha = 0.9\) being even a bit faster than \(\alpha = 0.6\). However, we also see the problems of the \(\alpha = 0.9\) curve, which has a small bump at around \(t = 20\). This is when the corresponding trajectory moves too far to the south of the error function before turning to the optimum.
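Such an error course can be computed with a few lines of code. This is only a sketch; the iteration count of 100 is an arbitrary choice:

```python
import numpy as np

def error_course(alpha, eta=0.001, n_iterations=100):
    """Value of E(w) = 3*w1^2 + 10*w2^2 after each momentum update step."""
    w = np.array([-15.0, 20.0])
    m = np.zeros(2)
    errors = []
    for _ in range(n_iterations):
        m = alpha * m + np.array([6 * w[0], 20 * w[1]])
        w = w - eta * m
        errors.append(3 * w[0] ** 2 + 10 * w[1] ** 2)
    return errors

# Final error values after 100 iterations for the three settings
final_errors = {alpha: error_course(alpha)[-1] for alpha in (0.0, 0.6, 0.9)}
```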

### Theoretical speedup

Regarding the speed improvements, there is also a theoretical argument we want to stress here. It concerns a possible theoretical speedup with momentum optimization. We use a one-dimensional example and assume that the gradient \(g\) stays constant during all iterations. We can then calculate the sequence of momentum updates

\begin{align*} m(0) &= 0 \\ m(1) &= \alpha \cdot m(0) + g = g \\ m(2) &= \alpha \cdot m(1) + g = \alpha \cdot g + g \\ m(3) &= \alpha \cdot m(2) + g = \alpha \cdot (\alpha \cdot g + g) + g = \alpha^2 \cdot g + \alpha \cdot g + g \\ &\vdots \\ m(t) &= \sum_{i=0}^{t-1} \alpha^i \cdot g \Rightarrow \lim\limits_{t \rightarrow \infty} g \sum_{i=0}^{t-1} \alpha^i = g \frac{1}{1-\alpha}. \end{align*}In the last step, we assumed \(|\alpha| < 1\) and used the limit of the geometric series to simplify the formula.

As we can see, the influence of previous gradients decreases exponentially. In total, momentum optimization effectively scales the gradient by a factor that depends on the momentum parameter \(\alpha\). For example, if we use \(\alpha = 0.9\), we get a speedup of \(\xfrac{1}{0.1} = 10\) in the limit. That is, the weight updates can become up to 10 times larger than without momentum optimization.
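This limit is easy to check numerically, a small sketch with a constant gradient:

```python
# Momentum updates with a constant gradient g converge to g / (1 - alpha)
alpha, g = 0.9, 1.0
m = 0.0
for _ in range(200):
    m = alpha * m + g
print(m)  # ≈ 10.0
```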

Note that this is only a theoretical perspective and the simplification of constant gradients is not very realistic. After all, we could just adapt the learning rate with increasing iterations to achieve a similar effect in this case. Still, it gives a general idea of the convergence improvements from momentum optimization and which role the hyperparameter \(\alpha\) plays.

### Two basic cases

So far, we have focused mainly on the acceleration functionality of momentum optimization. As we saw, we accelerate when we move in the same direction as in the previous iteration^{7}. That is, the magnitude of the momentum vector increases, i.e. \( \left\| \fvec{m}(t) \right\| > \left\| \fvec{m}(t-1) \right\| \). On the other hand, we decelerate when we change directions. In this case, the magnitude of the momentum vector decreases, i.e. \( \left\| \fvec{m}(t) \right\| < \left\| \fvec{m}(t-1) \right\| \). These are the two fundamental cases in the momentum optimizer. We now want to discuss in more detail when and to what extent these cases apply.

To make things simpler, we use, again, one dimension so that only one weight and, more importantly, one gradient per iteration remains. We consider two iterations of the momentum update and interpret the gradients as variables: \(g_1 = \nabla E \left( w(0) \right)\) for the first and \(g_2 = \nabla E \left( w(1) \right)\) for the second iteration:

\begin{align} \begin{split} m(1) &= g_1 \\ m(2) &= \alpha \cdot g_1 + g_2. \end{split} \label{eq:MomentumOptimizer_CasesUpdates} \end{align}The idea is to measure the acceleration and deceleration between the two momentum updates as a function of the gradients \(g_i\). We can do so by simply calculating the difference between \(m(1)\) and \(m(2)\) to see how much the momentum changes. However, since “the correct direction” (positive to the right, negative to the left) depends on the sign of the first gradient \(g_1\), we need to handle these cases separately:

\begin{equation} \label{eq:MomentumOptimizer_Cases} C_m(g_1, g_2) = \begin{cases} m(2) - m(1), & g_1 \geq 0 \\ m(1) - m(2), & g_1 < 0 \\ \end{cases} = \begin{cases} (\alpha \cdot g_1 + g_2) - g_1, & g_1 \geq 0 \\ g_1 - (\alpha \cdot g_1 + g_2), & g_1 < 0 \\ \end{cases} \end{equation}For the measure \(C_m(g_1, g_2)\), we can interpret the sign and the magnitude leading to the following cases:

- For \(C_m(g_1, g_2) > 0\), the gradient \(g_2\) points in the same direction as \(g_1\) and the magnitude \(\left| m \right| \) increases by \( \left|C_m(g_1, g_2)\right| \).
- For \(C_m(g_1, g_2) < 0\), the gradient \(g_2\) points in the opposite direction as \(g_1\) and the magnitude \(\left| m \right| \) decreases by \( \left|C_m(g_1, g_2)\right| \).
- For \(C_m(g_1, g_2) = 0\), the magnitude \(\left| m \right| \) remains unchanged. This is not really an interesting case and only listed for the sake of completeness.

Note that the reverse of the above statements is not true in general. For example, if the gradient \(g_2\) points in the same direction as \(g_1\), it does not necessarily mean that the magnitude of the momentum increases. Let \(g_1 = 2, g_2 = 1\) and \(\alpha = 0.4\) to make this point clear:

\begin{align*} \begin{split} m(1) &= 2 \\ m(2) &= 0.4 \cdot 2 + 1 = 1.8. \end{split} \end{align*}Both gradients point in the same direction but the momentum still changes by \(m(2) - m(1) = -0.2\), i.e. it gets smaller. This is due to the friction parameter \(\alpha\) and the second gradient being too small to account for the friction loss \(0.6 \cdot 2 = 1.2\).
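The measure \eqref{eq:MomentumOptimizer_Cases} and the counterexample above can be written out directly, a minimal sketch for the one-dimensional case:

```python
def C_m(g1, g2, alpha):
    """Change in momentum magnitude between the first two updates (1D)."""
    m1 = g1
    m2 = alpha * g1 + g2
    return m2 - m1 if g1 >= 0 else m1 - m2

# Both gradients point in the same direction, but g2 = 1 is smaller than
# the friction loss (1 - 0.4) * 2 = 1.2, so the momentum still shrinks
c_small = C_m(2.0, 1.0, alpha=0.4)  # ≈ -0.2
# A larger second gradient covers the friction loss and builds momentum
c_large = C_m(2.0, 1.5, alpha=0.4)  # ≈ 0.3
```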

The following figure visualizes \eqref{eq:MomentumOptimizer_Cases} graphically. The red regions correspond to the first case and the blue regions to the second case. Additionally, we see by the intensity of the colour how much the momentum changes. We are building momentum in the bottom-left and top-right quarters, i.e. when the gradients share the same sign and the second gradient \(g_2\) can account for the friction loss.

The friction parameter \(\alpha\) influences the slope of the contour lines. It is highest for \(\alpha = 0\) where the previous gradient is ignored completely and the momentum term only increases when the new gradient is larger, i.e. \(g_2 > g_1\). This is not surprising as this case reduces to classical gradient descent where we do not have a momentum term. The slope is lowest (more precisely: zero) for \(\alpha = 1\). This is the case where we do not have any friction at all and sum up the full magnitudes of all previous gradients. Hence, when the second gradient has e.g. a value of \(g_2 = 1\), the momentum term increases by this value: \(m(2) - m(1) = 1\).

### Conclusion

This was the story about momentum optimization. It introduces an important idea to speed up convergence in neural network optimization, namely that we can be smarter when we keep moving in the same direction over iteration time. However, it should not be used without caution since the friction parameter \(\alpha\) highly influences our success. Set it too low and we may learn more slowly than we could. Set it too high and we may lose control over our trajectory in the error landscape. This is especially hard to detect for higher-dimensional error functions (usually the case in neural networks).

Even though the concept of momentum optimization has its merits, I have to admit that I can't remember ever using it on its own. There are usually enough (critical) hyperparameters in a network, so adjusting the friction parameter \(\alpha\) as well can be a bit scary. What is more, other optimization techniques have been developed which work reasonably well and have less critical hyperparameters.

A popular example is the Adam optimizer, which reuses the idea of momentum, even though in a different form and with different effects. We are also covering the Adam optimizer in this series about neural network optimization (third part). But before we take a look at this technique, we should first learn about a new concept: adaptive learning rates. This is also very important for the Adam optimizer.

List of attached files:

- MomentumOptimizer.nb [PDF] (Mathematica notebook with some basic computations and visualizations used to write this article)