This is the second part of a series consisting of three articles with the goal to introduce some general concepts and concrete algorithms in the field of neural network optimizers. As a reminder, here is the table of contents:
- Part 1: momentum optimizer
- Part 2: adaptive learning rates
- Part 3: Adam optimizer
In the optimization techniques discussed previously (classical gradient descent and momentum optimizer), we used a global learning rate \(\eta\) for every weight in the model. However, this might not be a suitable setting as some neurons can benefit from a higher and others from a lower learning rate. What is more, per-neuron learning rates can help the weights move more in lockstep in the error landscape. This can be beneficial in approaching an optimum.
In a neural network with maybe thousands and thousands of parameters, we certainly do not want to adjust them all by hand. Hence, an automatic solution is required. There are multiple approaches to tackle this problem1. Here, we introduce the concept of adaptive learning rates which apply a different scaling to each gradient according to information from previous gradients essentially changing the learning rate.
The optimization technique we describe here follows the RMSProp algorithm introduced by Tieleman and Hinton in 2012. This is again an advancement of the AdaGrad method which was proposed in 2011 by Duchi, Hazan and Singer. We stick to the advanced version and refer to it as the adaptive learning scheme.
There is a second family of problems with relevance to this context. There are two extreme cases of gradient descent which can become problematic. On the one hand, there is the problem of plateaus in the error landscape. There, the gradient is very low and hence learning happens only very slowly. It would be nice to escape from these plateaus faster. Related to this problem are vanishing gradients, which can occur especially in deep neural networks, originating from the multiplication of small numbers (e.g. due to a plateau) and leading to even smaller numbers. If the gradients were larger, we could alleviate these problems.
On the other hand, gradients can also get very large, e.g. when moving across a steep hill in the error landscape. There, at least some components of the gradients are very large. If the steps are too big, we may overstep local minima hampering the goal of finding a good solution. Related to this are exploding gradients where gradients can grow uncontrollably fast due to the multiplication of large numbers. Even though less likely than vanishing gradients, they can still be problematic. If the gradients were smaller, we could alleviate these problems as well.
Generally speaking, problems can occur when gradients become either too large or too small. One goal is to keep the gradients in “reasonable ranges”. That is, amplify very small and attenuate very large gradients so that the problems discussed are less likely. Using an update scheme with adaptive learning rates can help.
An additional advantage of adaptive learning rates is that the global learning rate \(\eta\) needs much less tuning because each weight adapts its learning speed on its own. This leaves \(\eta\) being more of a general indicator than a critical design choice. To put it differently, the damage which can be caused by \(\eta\), e.g. by setting it way too high, is far smaller as each weight also applies its own scaling.
We are starting by defining the update rules for the adaptive learning scheme. We then look at a small numerical example, play around with trajectories in the error landscape and discuss some general concepts. Last but not least, there is also a new hyperparameter on which we want to elaborate. We skip speed comparisons as this would require a more realistic error function which can leverage some of the advantages of the procedure (e.g. faster escape from plateaus).
Mathematical formulation
The basic extension of the adaptive learning scheme, compared to classical gradient descent, is the introduction of scaling factors \(\fvec{s}\) for each weight.
Additionally to the variables used in classical gradient descent, let \(\fvec{s} = (s_1, s_2, \ldots, s_n) \in \mathbb{R}^n\) be a vector with scaling factors for each weight in \(\fvec{w}\) initialized to zero, i.e. \(\fvec{s}(0) = \fvec{0}\), \(\beta \in [0;1]\) the decaying parameter and \(\varepsilon \in \mathbb{R}_{>0}\) a smoothing term. Then, the update rules
\begin{align} \begin{split} \fvec{s}(t) &= \beta \cdot \fvec{s}(t-1) + (1-\beta) \cdot \nabla E \left( \fvec{w}(t-1) \right) \odot \nabla E \left( \fvec{w}(t-1) \right) \\ \fvec{w}(t) &= \fvec{w}(t-1) - \eta \cdot \nabla E \left( \fvec{w}(t-1) \right) \oslash \sqrt{\fvec{s}(t) + \varepsilon} \end{split} \label{eq:AdaptiveLearning_AdaptiveLearningScheme} \end{align}define the adaptive learning scheme and find a path from the initial position \(\fvec{w}(0)\) to a local minimum of the error function \(E\left(\fvec{w}\right)\). The symbol \(\odot\) denotes the point-wise multiplication and \(\oslash\) the point-wise division between vectors.
The scaling factors \(\fvec{s}\) accumulate information about all previous gradients in their squared form so that only the magnitude and not the sign of the gradient is changed. The squaring has the advantage (e.g. compared to the absolute value) that small gradients get even smaller and large gradients even larger so that their contribution in the scaling factors \(\fvec{s}\) is boosted. This intensifies the effects of the adaptive learning rates which we discuss below.
The hyperparameter \(\beta\) controls the influence of the previous scales \(\fvec{s}(t-1)\) compared to the influence of the new gradient \(\nabla E \left( \fvec{w}(t-1) \right)\). We are going to take a look at this one a bit later.
In each step, the weights now move in the direction of the update vector
\begin{equation} \label{eq:Optimizer_AdaptiveUpdateVector} \fvec{v}(t) = \eta \cdot \nabla E \left( \fvec{w}(t-1) \right) \oslash \sqrt{\fvec{s}(t) + \varepsilon} \end{equation}which scales the gradient according to the square root of \(\fvec{s}\) effectively reverting the previous squaring step. Additionally, a small smoothing term \(\varepsilon\) to avoid division by zero is used. It is usually set to a very small number, e.g. \(\varepsilon = 10^{-8}\).
Numerical example
With the mathematical toolbox set up, we are ready to start with a small numerical example where we compare classical gradient descent with adaptive learning rates. We are using the same error function as before, i.e.
\begin{equation} \label{eq:AdaptiveLearning_ExampleFunction} E(\fvec{w}) = 3 \cdot w_1^2 + 10 \cdot w_2^2. \end{equation}We do not change the parameters for classical gradient descent so the results \(\fvec{w}_c(1) = (-14.91, 19.6)\) and \(\fvec{w}_c(2) \approx (-14.82, 19.21)\) remain valid.
We set the decaying parameter to \(\beta = 0.9\) and the smoothing term to \(\varepsilon = 10^{-8}\). This leaves only the global learning rate to clarify. As we do not change the classical gradient descent approach, this leaves \(\eta_c = 0.001\). This is not an appropriate rate for the adaptive scheme, however. Due to the attenuation of higher gradients, this value is way too low so that we would not make much progress. We increase it therefore to2 \(\eta_a = 0.05\). This would make speed comparisons more difficult but since we are not intending to do them, this is of no concern.
With the gradient
\begin{equation} \label{eq:AdaptiveLearning_ExampleFunctionGradient} \nabla E\left( \fvec{w} \right) = \cvec{\xfrac{\partial E}{\partial w_1} \\ \xfrac{\partial E}{\partial w_2}} = \cvec{6 \cdot w_1 \\ 20 \cdot w_2} \end{equation}evaluated at the initial position
\begin{equation*} \nabla E \left( \fvec{w}(0) \right) = \cvec{6 \cdot (-15) \\ 20 \cdot 20} = \cvec{-90 \\ 400}, \end{equation*}we can apply the first update step (results are rounded to two decimal places)
\begin{align} \begin{split} \fvec{s}(1) &= \beta \cdot \fvec{s}(0) + (1-\beta) \cdot \nabla E \left( \fvec{w}(0) \right) \odot \nabla E \left( \fvec{w}(0) \right) \\ &= 0.9 \cdot \fvec{0} + 0.1 \cdot \cvec{ (-90) \cdot (-90) \\ 400 \cdot 400 } = \cvec{810 \\ 16000} \\ \fvec{w}_a(1) &= \fvec{w}(0) - \eta_a \cdot \nabla E \left( \fvec{w}(0) \right) \oslash \sqrt{\fvec{s}(1) + \varepsilon} \\ &= \cvec{-15 \\ 20} - 0.05 \cdot \cvec{\xfrac{-90}{\sqrt{810 + 10^{-8}}} \\ \xfrac{400}{\sqrt{16000 + 10^{-8}}}} \approx \cvec{-14.84 \\ 19.84}. \end{split} \label{eq:AdaptiveLearning_AdaptiveStep1} \end{align}It is not a coincidence that the fractional part \(.84\) is the same for both vector components of \(\fvec{w}_a(1)\). This is because we initialized the scaling factors to \(\fvec{s}(0) = \fvec{0}\) so that only the squared gradient times \(1-\beta\) remain. In the weight update step, the gradient cancels nearly out due to the square root. In the end, both weight components change roughly by the same value (\(\approx +0.16\) for the first and \(\approx -0.16\) for the second component). Only the direction, indicated by the sign of the gradient, is different (we move to the right for \(w_1\) and to the left for \(w_2\)). We take up this point again later.
The new weight position \(\fvec{w}_a(1)\) is the basis to evaluate the next gradient
\begin{equation*} \nabla E \left( \fvec{w}_a(1) \right) = \cvec{6 \cdot (-14.84) \\ 20 \cdot 19.84} = \cvec{-89.04 \\ 396.8} \end{equation*}so that we can proceed in a similar way with the second update
\begin{align*} \fvec{s}(2) &= \beta \cdot \fvec{s}(1) + (1-\beta) \cdot \nabla E \left( \fvec{w}_a(1) \right) \odot \nabla E \left( \fvec{w}_a(1) \right) \\ &= 0.9 \cdot \cvec{810 \\ 16000} + 0.1 \cdot \cvec{ (-89.04) \cdot (-89.04) \\ 396.8 \cdot 396.8 } \approx \cvec{1521.81 \\ 30145.02} \\ \fvec{w}_a(2) &= \fvec{w}_a(1) - \eta_a \cdot \nabla E \left( \fvec{w}_a(1) \right) \oslash \sqrt{\fvec{s}(2) + \varepsilon} \\ &= \cvec{-14.84 \\ 19.84} - 0.05 \cdot \cvec{\xfrac{-89.04}{\sqrt{1521.81 + 10^{-8}}} \\ \xfrac{396.84}{\sqrt{30145.02 + 10^{-8}}}} \approx \cvec{-14.73 \\ 19.73}. \end{align*}This situation is also visualized in the following figure.
The picture here is quite different from the momentum optimizer example. The difference is not in how far the vectors move (this would be hard to compare anyway due to the different learning rates) but rather how they move. With adaptive learning rates, we move more to the right and less to the bottom than with classical gradient descent.
Trajectories
Similar to what we did in momentum optimization, we want to know how the sequence from the example proceeds further. For this, you can play around in the following animation. It is the same as in the previous article except that the update sequence now uses adaptive learning rates.
We can see that with adaptive learning rates the trajectory is very different from classical gradient descent. Instead of moving down the steepest hill and then walking along the horizontal direction of the valley, the adaptive route points more directly towards the optimum. In this case, this results in a shorter path.
This is an effect of the general principle that the adaptive scheme tends to focus more on the direction than on the magnitude of the gradient. In momentum optimization, on the other side, we mainly focused on improving the magnitude. We even accepted small divergences from the optimal path for larger step sizes (cf. the oscillations of an \(\alpha = 0.9\) path). The opposite is the case with adaptive learning rates. Here, we focus more on the direction of the gradient vector instead of the magnitude.
The reason behind this lies in the scaling by the \(\fvec{s}\) vector. On the one hand, when the previous gradients were high so that the scaling factor is \(\sqrt{\fvec{s} + \varepsilon} > 1\), then the update step is reduced. On the other hand, with small previous gradients, the scaling factor is \(\sqrt{\fvec{s} + \varepsilon} < 1\) which increases the update step. This happens for each vector component individually.
The scaling has the effect that the magnitude of the gradient becomes less important as the update vectors \(\fvec{v}(t)\) are more forced to stay in “reasonable ranges”. This design was chosen with the intent to overcome the aforementioned problems of extremely small or large gradients. However, it is to note that the sign of the gradient is unaffected by this scaling which ensures that the update vector \(\fvec{v}(t)\) points always in the same direction as the gradient \(\nabla E(t)\).
In our example, this means that the vertical vector component decreases more since the gradients are higher in this direction (this is the direction with the steepest decline). As a result, the update vectors \(\fvec{v}\) point more to the right than with classical gradient descent. In the case of elongated error functions, this helps to point more directly at the optimum.
When we compare a \(\beta = 0\) with a \(\beta = 0.9\) trajectory, we see that the former is shaped more angularly and the latter is smoother. This is the case because the \(\beta = 0.9\) path does involve, at least to some extent, the magnitude of the gradients in the update decisions so that effectively a broader range of orientations is incorporated in the path. With \(\beta = 0\), however, we mainly restrict the path to the eight cardinal directions, e.g. in this case first to the south-east (\(\fvec{v} \approx (1, -1)\)) and then to the south (\(\fvec{v} \approx (0, -1)\)).
The decaying parameter \(\beta\)
We discussed that with adaptive learning rates the focus is more on the direction than on the magnitude of the gradients. A valid question may be whether we have control over this focus. It turns out that we can use the hyperparameter \(\beta\) for this purpose and this is what we want to discuss now.
Let us begin with the extreme cases. When we set \(\beta = 0\), we do not use information from previous gradients and use only the current gradient instead. This is essentially what happened in the first update step of the numerical example (\eqref{eq:AdaptiveLearning_AdaptiveStep1}) where both vector components were updated by roughly the same value4. As we saw, this cancelled the magnitude of the gradient nearly completely out giving a strong importance to the sign. This is a way of expressing an exclusive focus on the direction.5
Setting \(\beta = 1\) does not use the current gradient in the scaling vector \(\fvec{s}\). Instead, only the initial value \(\fvec{s}(0)\) is used. In the case of zero-initialization, this leaves \(\sqrt{\varepsilon}\), i.e. a very small number, in the denominator so that an increase of the magnitude \( \left| \nabla E \left( \fvec{w}_a(1) \right) \right| \) is likely. In a way, this is an extreme focus on the magnitude. However, it is doubtful that this is a useful case as it contradicts the goal of keeping the update vectors in “reasonable ranges”.
For values in-between, i.e. \(\beta \in \; ]0;1[\), we have the choice of focusing more on the direction (smaller values of \(\beta\)) or more on the magnitude (larger values of \(\beta\)). For the latter, it is to note that the upper limit is not the original magnitude. Rather, it is possible that the magnitude even amplifies, as in the case of the extreme value \(\beta = 1\).
We now want to analyse how the update vectors \(\fvec{v}\) of the adaptive learning scheme are different from the gradients \(\nabla E\) used directly in classical gradient descent. For this, we use a one-dimensional example, consider two update iterations and interpret the gradients \(g_t = \nabla E \left( w(t-1) \right)\) as variables. We can then set up the scaling values
\begin{align*} s(1) &= (1-\beta) \cdot g_1^2 \\ s(2) &= \beta \cdot s(1) + (1-\beta) \cdot g_2^2 = \beta \cdot \left( (1-\beta) \cdot g_1^2 \right) + (1-\beta) \cdot g_2^2 \\ &= (1-\beta) \cdot (\beta \cdot g_1^2 + g_2^2). \end{align*}We are interested in the interplay between the first two gradients so we consider only the update value of the second iteration
\begin{equation*} v(2) = \eta \cdot \frac{g_2}{\sqrt{s(2) + \varepsilon}} = \eta \cdot \frac{g_2}{\sqrt{(1-\beta) \cdot (\beta \cdot g_1^2 + g_2^2) + \varepsilon}}. \end{equation*}The question now is in which way the update value \(v(2)\) is different from the gradient \(g_2\), i.e. how we move differently with adaptive learning rates than without. More precisely, we want to know if the original gradient is amplified or attenuated. The idea is to measure this as a simple difference between the two update schemes:
\begin{equation} \label{eq:AdaptiveLearning_AdaptiveDifference} C_a(g_1, g_2) = |\eta \cdot v(2)| - |\eta \cdot g_2| = \eta \cdot \begin{cases} v(2) - g_2, & g_2 \geq 0 \\ g_2 - v(2), & g_2 < 0 \\ \end{cases} \end{equation}The measure \(C_a(g_1, g_2)\) informs us about whether we move further in the adaptive learning scheme compared to classical gradient descent. This is, again, a sign-dependent decision as moving further means more to the right of \(g_2\) when \(g_2\) is already positive and more to the left of \(g_2\) when \(g_2\) is already negative.
When we interpret the sign and the magnitude of \(C_a(g_1, g_2)\), we can distinguish between the following cases:
- For \(C_a(g_1, g_2) > 0\), the magnitude of the gradient \(|g_2|\) is amplified by \(\left| C_a(g_1, g_2) \right|\). The update step is larger with adaptive learning rates than without.
- For \(C_a(g_1, g_2) < 0\), the magnitude of the gradient \(|g_2|\) is attenuated by \(\left| C_a(g_1, g_2) \right|\). The update step is smaller with adaptive learning rates than without.
- For \(C_a(g_1, g_2) = 0\), the magnitude of the gradient \(|g_2|\) remains unchanged. The update step is the same with both approaches.
The following figure visualizes \eqref{eq:AdaptiveLearning_AdaptiveDifference} for different values of the hyperparameter \(\beta\).
The red regions show where the gradient \(g_2\) is enlarged (first case). We see that this happens especially in areas where the value of both gradients is low. This is what helps us leaving a plateau of the error landscape escaping more quickly also alleviating the problem of vanishing gradients.
In the blue regions, the gradient \(g_2\) gets smaller. This corresponds to the second case. We see that these regions are visible strongest at the borders. That is, when either \(g_1\) or \(g_2\) are high (or both). This makes sure that the gradients do not become too large so that steep hills in the error landscape pose a risk of overstepping local minima. Smaller gradients also alleviate the problem of exploding gradients.
When we compare the result of different \(\beta\)-values, we can see how the focus changes from the direction to the magnitude of the gradients. The extreme value \(\beta = 0\) reveals a nearly linear relationship between \(v(2)\) and \(g_2\) as the magnitude is almost completely neglected setting the update essentially to \(\left| v(2) \right| \approx 1\).
Conclusion
This was also the story about adaptive learning rates. The global learning rate \(\eta\) does not have the power to suit the needs for every weight. Hence, using an individual learning rate (via the scaling vector \(\fvec{s}\)) per weight is a great and powerful idea which is also an important aspect of the Adam optimizer (covered in the next article). It is also nice that the scaling factors are calculated automatically. But, to be fair, we also do not have much of a choice since there is no practical way we could manually adjust the learning rates for each weight in a larger network.
Another advantage is that the global learning rate \(\eta\) becomes a less critical hyperparameter to adjust. Even though it still has an influence (otherwise, we could remove it), the effects are less drastic. This also means that it can cause less damage when it is set incorrectly.