## Introduction to the Hessian feature detector for finding blobs in an image

In many computer vision applications, it is important to extract special features from an image which are distinctive, unambiguous to locate and occur in different images showing the same scene. There is a complete subbranch of the computer vision field dedicated to this task: feature matching. Usually, this process consists of three tasks: detection, description and matching. I want to briefly summarize these steps, but you can easily find further information on the web or in related books1.

1. The first step is the detection of distinctive points in an image. This could be a corner of an object, an intensity blob (e.g. the inner parts of the eyes) or any other special shape or structure you can imagine. Straight lines, for example, are not good candidates: it is hard to locate a point on a line unambiguously in two images (which point on the line to choose?).
2. The goal of the description step is to describe the area around a detected point. A lot of techniques can be used here and it is quite common to obtain as output a vector of numbers which describes the area around the point. If two points have a similar surrounding, the vectors should also contain similar numbers.
3. This is essential for the third and last step: matching. Imagine two images showing the same scene from different viewpoints. In both images, distinctive points are detected and the surrounding of each point is described. Which points correspond, i.e. which points show the same part of the scene? This can e.g. be decided by measuring the similarity between the description vectors.

For each of these steps, many different algorithms exist. As you may have guessed from the title of this article I want to focus here on the detection step. More precisely, I introduce the Hessian detector which is mathematically defined as:

Definition 1: Hessian feature detector

An image is scaled to a size defined by the scale parameter $$\sigma$$. Let $$H_{\sigma}$$ denote the Hessian matrix at a specific image location in level $$\sigma$$, with e.g. $$\partial_{xx} = \frac{\partial^2 I}{\partial x^2}$$ denoting the second order derivative of the image $$I$$ along the $$x$$-axis. We can use the normalized determinant response of the Hessian

$$\label{eq:HessianDetector_Definition} \sigma^4 \cdot \det(H_{\sigma}) = \sigma^4 \cdot \left( \partial_{xx} \cdot \partial_{yy} - \partial_{xy}^2 \right)$$

to detect image features (blobs and notches) by searching for maxima in each image location across scale.
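To make the definition concrete, the response can be approximated with Gaussian derivative filters. The following is a minimal sketch and not the implementation used for the figures in this article; the function name `hessian_response` and the use of SciPy's `gaussian_filter` are my own choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_response(image, sigma):
    """Scale-normalized determinant-of-Hessian response (minimal sketch)."""
    # Second order Gaussian derivatives; `order` selects the derivative per axis
    Ixx = gaussian_filter(image, sigma, order=(2, 0))
    Iyy = gaussian_filter(image, sigma, order=(0, 2))
    Ixy = gaussian_filter(image, sigma, order=(1, 1))
    # sigma^4 * det(H_sigma) as in the definition above
    return sigma**4 * (Ixx * Iyy - Ixy**2)
```

For a synthetic Gaussian blob, this response peaks at the blob centre, which is exactly the behaviour the detector exploits.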

I just wanted to give the definition of the detector at the beginning; the intention of this article is to give some insights into why this equation might indeed be useful. Also, note that the detector is designed to find blobs in an image (and not corners).

So, what exactly is a blob in an image? It is a homogeneous area with roughly equal intensity values compared to its surrounding. The ideal artificial example is a 2D Gaussian function where the intensity values decay equally in a circular way, blending together with the surrounding – visible in the image as a smooth blob. A more realistic example would be the inner part of a sunflower, which is usually dark compared to the bright petals. The following figure shows examples of both. A blob is not necessarily related to a circle-like structure, though. Any kind of intensity notch may also be detected as a blob. And the idea is that these circles and notches are part of the object and are therefore also visible in other images showing the same object.

The rest of this article is structured as follows: I begin by introducing the Hessian matrix, analyse the curvature information it contains and describe how this information is used in the Hessian detector. But I also want to talk a bit about how the detector fits into the scale space usually used in feature matching to achieve scale invariance.

### The Hessian matrix

Let's start with the Hessian matrix $$H$$ (the scale index $$\sigma$$ is omitted for simplicity). Mathematically, this matrix is defined to hold the information about every possible second order derivative (here shown in the 2D case)3:

\begin{equation*} H = \begin{pmatrix} \frac{\partial^2 I}{\partial x^2} & \frac{\partial^2 I}{\partial x \partial y} \\ \frac{\partial^2 I}{\partial y \partial x} & \frac{\partial^2 I}{\partial y^2} \\ \end{pmatrix}. \end{equation*}

Since $$\frac{\partial^2 I}{\partial x \partial y} = \frac{\partial^2 I}{\partial y \partial x}$$, this matrix is symmetric4. Conceptually, $$H$$ can be understood as the second order derivative of a higher-dimensional function. As we know from the 1D case, the second order derivative tells us something about the curvature of a function: is it convex, concave or neither of those? The result here is a number whose sign tells us which of the cases is true. But what is the curvature of a 2D function?

It turns out that things get more complicated when introducing another dimension (who on earth would have guessed that...). We can't express the curvature as a single number anymore since the question now is: what is the curvature in a certain direction? Imagine you are walking over an arête in the mountains where it goes down on the left and right side and straight on in front of you. What is the curvature when you keep moving ahead? Since there are not many changes in height, probably (roughly) zero. And what if you move right, directly down the precipice? Probably very high, since there are a lot of changes in height. So, direction obviously matters.

To get the curvature of a 2D function we need to provide three parameters: the Hessian matrix $$H$$, which acts as the curvature tensor containing the information about every direction, a specific direction $$\fvec{s}$$ in which we are interested and the location $$\left( x_0, y_0 \right)^T$$ at which we want to know the curvature. The direction is usually provided as a unit-length (i.e. $$\left\| \fvec{s} \right\|=1$$) column vector. We can then compute the so-called directional derivative

$$\label{eq:HessianDetector_DirectionalDerivative} \frac{\partial^2 I}{\partial \fvec{s}^2} = \fvec{s}^T H \fvec{s}$$

which gives us the curvature (the second order derivative) in the direction $$\fvec{s}$$. Before showing some visualizations, I want to provide a numeric example. For this, consider the following 2D function and suppose you want to know the curvature in the $$45^{\circ}=\frac{\pi}{4}$$ direction at the origin

\begin{equation*} f(x,y) = \cos(x), \quad \fvec{s} = \begin{pmatrix} \cos\left( \frac{\pi}{4} \right) \\ \sin\left( \frac{\pi}{4} \right) \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} x_0 \\ y_0 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \end{equation*}

It is actually a bit odd to call this a 2D function since it is just the standard cosine, but the resulting graph fits nicely with my arête analogy. The first thing to do is to calculate the derivatives for the Hessian matrix

\begin{align*} \frac{\partial^2 f(x, y)}{\partial x^2} &= - \frac{\partial}{\partial x} \sin(x) = -\cos(x) \\ \frac{\partial^2 f(x, y)}{\partial x \partial y} &= 0 \\ \frac{\partial^2 f(x, y)}{\partial y \partial x} &= - \frac{\partial}{\partial y} \sin(x) = 0 \\ \frac{\partial^2 f(x, y)}{\partial y^2} &= 0 \end{align*}

and summarize the results in the Hessian matrix

\begin{equation*} H = \begin{pmatrix} -\cos(x) & 0 \\ 0 & 0 \\ \end{pmatrix}. \end{equation*}

Next, we apply \eqref{eq:HessianDetector_DirectionalDerivative} and get

\begin{align*} \frac{\partial^2 f}{\partial \fvec{s}^2} = \fvec{s}^T H \fvec{s} &= \begin{pmatrix} \cos\left( \frac{\pi}{4} \right) & \sin\left( \frac{\pi}{4} \right) \end{pmatrix} \begin{pmatrix} -\cos(x) & 0 \\ 0 & 0 \\ \end{pmatrix} \begin{pmatrix} \cos\left( \frac{\pi}{4} \right) \\ \sin\left( \frac{\pi}{4} \right) \end{pmatrix} \\ &= \begin{pmatrix} \cos\left( \frac{\pi}{4} \right) & \sin\left( \frac{\pi}{4} \right) \end{pmatrix} \begin{pmatrix} -\cos(x) \cos\left( \frac{\pi}{4} \right) \\ 0 \end{pmatrix} \\ &= - \cos\left( \frac{\pi}{4} \right) \cos(x) \cos\left( \frac{\pi}{4} \right) \\ &= - \frac{1}{\sqrt{2}} \frac{1}{\sqrt{2}} \cos(x) \\ &= - \frac{\cos(x)}{2} \end{align*}

i.e. a curvature of $$-\frac{1}{2}$$ in the direction of $$\fvec{s}$$ at our evaluation point $$x_0=0$$ (origin).
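This small calculation is easy to verify numerically; here is a quick check with NumPy (the variable names are my own):

```python
import numpy as np

# Hessian of f(x, y) = cos(x) evaluated at the origin: -cos(0) = -1
H = np.array([[-1.0, 0.0],
              [ 0.0, 0.0]])
s = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)])  # unit vector, 45° direction

curvature = s @ H @ s  # -0.5 (up to floating point)
```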

You can imagine this as follows: at every point of our 2D function, we have a normal vector pointing orthogonally to the surface at that point. Together with the direction $$\fvec{s}$$, the normal vector forms a plane in 3D space. This plane intersects with our function, resulting in a projection which is a 1D function, and \eqref{eq:HessianDetector_DirectionalDerivative} is then just the curvature of this function (like for any other 1D function). The following animation shows the discussed points and lets you choose an arbitrary direction.

Even though this is very exciting, one is often not interested in the curvature of an arbitrary direction. What is more of interest is the direction of highest and lowest curvature. The good news is that it is also possible to retrieve this information from $$H$$. The information lies in the eigenvectors and eigenvalues: the eigenvector $$\fvec{e}_1$$ points in the direction of the highest curvature with the magnitude $$\lambda_1$$. Similarly, $$\fvec{e}_2$$ corresponds to the direction of lowest curvature with the strength of $$\lambda_2$$5. As you may have noticed, I already included $$\fvec{e}_1$$ in the previous animation. Can you guess in which direction $$\fvec{e}_2$$ points and what the value of $$\lambda_2$$ is?6

For a 2D function, this means that we can obtain two orthogonal vectors pointing in the directions of highest and lowest curvature. The following animation aims to visualize this. In order to make things a little bit more interesting, I also used a different function: $$L(x,y)$$. This is now a “proper” 2D function, meaning that both variables are actually used. It is essentially a linear combination of Gaussian functions building a landscape; therefore, I call it a Gaussian landscape7.

Take your time and explore the landscape a bit. A few cases I want to point to:

1. If you are at the centre of the Gaussian blob (bottom left corner, $$\left( -10, -10 \right)^T$$), did you notice that both eigenvalues are equal ($$\left| \lambda_1 \right| = \left| \lambda_2 \right|$$)? This is because the curvature is the same in every direction, so there is no unique minimal or maximal direction. This is, for example, also true for every point on a sphere.
2. When you are on the ground, e.g. at position $$\left( -6, -6\right)^T$$, there is no curvature at all ($$\left| \lambda_1 \right|, \left| \lambda_2 \right| \approx 0$$). This is true for every point on a planar surface.
3. Another fact is already known from the previous animation: when you walk along the arête (e.g. at position $$\left( 0, -5\right)^T$$), the magnitude of the first eigenvalue is high whereas the second is roughly zero ($$\left| \lambda_1 \right| > 0, \left| \lambda_2 \right| \approx 0$$). This is for example also true for tube-like objects (e.g. a cylinder).
4. Last but not least, when you descend from the hill (e.g. at position $$\left( 0, -9\right)^T$$), there is curvature present in both directions but the first direction is still stronger ($$\left| \lambda_1 \right| > \left| \lambda_2 \right| > 0$$). This is a very interesting point, as we will see later.

What does this have to do with blobs in an image? Well, the first and last case I have just pointed out fall directly under the definition of a blob, the first being more circle-like and the last more notch-like. Since we want to detect both cases, it does not really matter which one we have. One way to detect these cases based on the eigenvalues is to use the Gaussian curvature

$$\label{eq:HessianDetector_GaussianCurvature} K = \lambda_1 \cdot \lambda_2.$$

The main observation is that this product is only large when both eigenvalues are large. This is true for the first and fourth case and false for the other two, since the product is already low when $$\left| \lambda_2 \right| \approx 0$$ (note that the eigenvalues are assumed to be sorted descendingly, i.e. $$\left| \lambda_1 \right| \geq \left| \lambda_2 \right|$$).

We can now detect blobs at each image position by calculating the Hessian matrix via image derivatives, its eigenvalues and then the Gaussian curvature $$K$$. Wherever $$K$$ is high, we can label the corresponding pixel position as a blob. However, “wherever something is high” is not a very precise formulation and usually comes with the cost of a threshold in practice. Since we also have a strength for each blob (the determinant response), we can easily overcome this by using only the largest blobs in an image, and these are the local maxima of $$K$$. It is e.g. possible to sort the local maxima descendingly and use the $$n$$ best (or define a threshold, or use just every maximum, or do something clever).
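The local-maxima selection just described can be sketched with SciPy's `maximum_filter`; the function name `strongest_maxima` and the 3×3 neighbourhood size are my own choices:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def strongest_maxima(response, n):
    # A pixel is a local maximum if it equals the maximum of its 3x3 neighbourhood
    # (flat plateaus also pass this test, so we sort by strength afterwards)
    is_max = response == maximum_filter(response, size=3)
    ys, xs = np.nonzero(is_max)
    order = np.argsort(response[ys, xs])[::-1][:n]  # descending, keep the n best
    return [(int(ys[i]), int(xs[i])) for i in order]
```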

But wait a moment: if \eqref{eq:HessianDetector_GaussianCurvature} is already the answer to our initial problem (finding blobs), why does \eqref{eq:HessianDetector_Definition} use the determinant of $$H$$ and this strange $$\sigma^4$$ prefactor? The second part of this question is a bit more complicated and postponed to the next section. But the first part is easily answered: there is no difference, since the equation

\begin{equation*} \det(H) = \lambda_1 \cdot \lambda_2 \end{equation*}

holds. This means that \eqref{eq:HessianDetector_Definition} already calculates the Gaussian curvature. Before moving to the discussion of the scale parameter though, I want to show the result when the determinant of the Hessian is applied to the Gaussian landscape.
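This identity is easy to check numerically for a symmetric matrix (the concrete values below are arbitrary):

```python
import numpy as np

H = np.array([[-1.5,  0.4],
              [ 0.4, -0.9]])          # an arbitrary symmetric "Hessian"
l1, l2 = np.linalg.eigvalsh(H)        # eigenvalues of the symmetric matrix
ok = np.isclose(l1 * l2, np.linalg.det(H))  # product of eigenvalues == det(H)
```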

As you can see, the determinant response clearly separates our landscape with the maxima identifying locations which are either circle-like (bottom left) or notch-like (rest) blobs. Next, I want to apply this technique to a real image.

Notice the small intensity blobs present in the original image, which are very well captured by the determinant response (e.g. the small white dots at the bottom of the image). Even though we can now detect some relevant parts of the image, did you notice that other parts are not detected? Especially the larger blobs, like the round structure in the top right corner of the image. This is where the scale normalization parameter $$\sigma$$ becomes important. But before dealing with it, I want to analyse one aspect of the curvature a bit further.

### What about the sign of the eigenvalues?

There is one detail which I have left out so far: the eigenvalues obtained from the Hessian matrix $$H$$ have a sign, and you may have noticed that I used the absolute values $$\left| \lambda_i \right|$$ when comparing the eigenvalues of the Gaussian landscape. Furthermore, we only searched for maxima and not minima in the determinant response. Before deciding whether the sign is important or not, let's analyse the curvature of a one-dimensional Gaussian in order to understand what the sign actually means.

The following figure shows a Gaussian function $$g(x)$$ together with its second order derivative, i.e. the curvature. Besides the normal Gaussian (which is called positive), a negative Gaussian is also shown. The latter is just an inverted version which has a hole in the middle instead of a peak. In an image, the positive Gaussian corresponds to a white blob on a dark background and the negative Gaussian to a black blob on a white background (both are possible and we want to detect both).

The interesting part is the second order derivative. For the positive Gaussian, it starts with a low positive curvature, changes to a strong negative curvature in the middle and reverts back to a low positive curvature. You can observe three extrema, but only one of them is strong and corresponds to the blob we want to detect. In the case of the negative Gaussian, the strong peak corresponds to a maximum instead of a minimum. So you may wonder why we search for a maximum if both are possible. Remember that this is only a one-dimensional view; for images, we have a second dimension and deal with the curvature of the two main directions. Since we are only interested in detecting occurrences of the strong extrema, there are three cases to consider for the second order derivative:

1. We have a maximum in both directions, i.e. $$\lambda_1 > 0$$ and $$\lambda_2 > 0$$.
2. We have a minimum in both directions, i.e. $$\lambda_1 < 0$$ and $$\lambda_2 < 0$$.
3. We have a minimum in one and a maximum in the other direction, e.g. $$\lambda_1 < 0$$ and $$\lambda_2 > 0$$.

I created an example for each of the cases based on Gaussian functions $$g$$. All two-dimensional plots are coloured based on the determinant response.

We certainly want to detect holes and blobs in the image (and the corresponding notches). So the first two cases are something we want to consider. But what about the third case? Is this a structure of interest? It is a white and a black line crossing smoothly. It is not clear what kind of image structure this corresponds to, and in the implementations I know of, these curvature points are simply ignored. You could also argue that you would rather detect the four white blobs surrounding the black blob instead (not as strong as the black blob, but still). Hence, it might not be necessary to search for this case after all (however, there might be applications where this case is of interest).

Ok, let's agree on detecting only the first two cases. We still have a maximum in one case and a minimum in the other. Is this a problem? Well, remember that the definition of the Gaussian curvature (\eqref{eq:HessianDetector_GaussianCurvature}) multiplies the two eigenvalues. For our two cases this means we either multiply two negative or two positive values with each other, i.e. we definitely end up with a positive value. With this mystery solved, we can now move on to the discussion of the scale parameter $$\sigma$$.
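A quick numerical sanity check of this argument (the example matrices are arbitrary diagonal Hessians, so the diagonal entries are directly the eigenvalues):

```python
import numpy as np

H_blob_case_1 = np.diag([ 2.0,  3.0])  # both eigenvalues positive
H_blob_case_2 = np.diag([-2.0, -3.0])  # both eigenvalues negative
H_saddle      = np.diag([-2.0,  3.0])  # mixed signs (the ignored third case)

# Both blob cases yield a positive determinant, the saddle a negative one
dets = [np.linalg.det(H) for H in (H_blob_case_1, H_blob_case_2, H_saddle)]
```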

### Scale normalization

So, the first important observation is that blobs may occur in different sizes. This was already the case for the sunflower image at the beginning. Not all sunflowers have the same size and, additionally, they project to different sizes on the camera depending on their distance to the camera. What we want to achieve is a so-called scale invariance: blobs should be detected regardless of the size they occupy in the image.

But how can we achieve scale invariance? The answer is to build a scale space, e.g. by blurring the image repeatedly with a Gaussian function. The result is a scale pyramid with a different intrinsic image resolution on each level. This resolution is denoted by a $$\sigma_i$$ parameter per level $$i$$ which corresponds to the Gaussian function $$G_{\sigma_i}$$ used to create the level (via convolution, i.e. $$I*G_{\sigma_i}$$). With increasing scale level, the $$\sigma_i$$ value also increases and the image is blurred more intensely.

Our extrema search is therefore extended by an additional dimension: we can now search for maxima in the determinant response not only spatially in the image but also across scale, in different resolutions of the same image. At each position, we can consider our determinant response as a function of scale and pick the scale level with the highest response. But before we can do that, we first need to solve a problem.

The thing is that Gaussian blurring flattens out the intensities in the image, since in each blurring step the intensity values come closer together due to the weighting process. Per se, this is not a problem and is precisely the job of Gaussian blurring: flatten out outliers to create a smooth transition between the values. But the weighting process also decreases the possible range of values. If we, for example, start with an image where the range of intensity values lies in $$[10;20]$$, then after the blurring this range must be smaller, e.g. $$[12;17]$$ (depending on the distribution of the values and the precise $$\sigma_i$$ value). In the end, when we use a really high $$\sigma_i$$ value, we have only one flat surface left with exactly one intensity value remaining, and this is the average intensity of the first image (the convolution with a Gaussian converges to the average image value).
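The shrinking intensity range is easy to demonstrate; a minimal sketch, where the random test image and the concrete sigma values are my own choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
image = rng.uniform(10, 20, size=(64, 64))   # intensity values in [10; 20]

ranges = []
for sigma in (1, 2, 4, 8):
    blurred = gaussian_filter(image, sigma)
    ranges.append(blurred.max() - blurred.min())
# The intensity range narrows with every increase of sigma
```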

Ok, but why is this a problem? When the intensity range decreases over scale, it is very likely that the derivative values also decrease (fewer changes) and hence the determinant response decreases. But when the determinant responses tend to get lower in higher scale levels, we introduce a systematic error while searching for extrema across scale. Responses from higher levels will always carry the burden of being computed on a smaller intensity range compared to the first scale levels of the pyramid. Note that this does not mean that the response value for a specific location strictly decreases over scale. It is still possible that locally a new blob structure is detected which produces (slightly) higher response values than the previous level. But is it correct and significant? To find out, we must compensate for the decreased intensity range, and we need to compensate more the higher we climb in our scale pyramid.

The idea is to use the $$\sigma_i$$ values for compensation. Since they increase with higher scales, they fit our requirements. And this is exactly the reason why \eqref{eq:HessianDetector_Definition} contains the $$\sigma^4$$ prefactor9, which stretches our intensity range to be (roughly) equal to before the blurring process (e.g. $$[10.4;19.7]$$ instead of $$[12;17]$$; don't expect exact results in a noisy and discrete world). This makes the determinant response $$\det(H_{\sigma})$$ comparable across scale. With this problem solved, we can move on and detect extrema over scale. For this, consider the following figure, which shows the determinant response for the position in the blob in the top right corner of the airport image (also labelled in the image below).

Technically, we have a unique maximum in both cases. Even when we do not normalize, we get a peak. However, it is not as wide as in the normalized case. But back to the question: is this peak correct and significant? To answer it, let me emphasize that this is only the trend for a specific location, and when you look at the original image you see that at a very fine scale there is just a white area (i.e. a roughly zero determinant response). But with increasing scale, the borders of the circle-like blob become visible and we therefore have an increase in response. This doesn't answer the question about correctness and significance, though. For this, consider the following figure, which shows the global response range over all locations and scales for both cases.

This graph shows very clearly what I meant by the decrease of response over scale when no normalization is used. In this case, we start with a large range in the first levels but the range narrows very quickly to a nearly zero-wide range. This explains why the magnitude of the response values was so low in the previous figure (e.g. only about 0.00006 at the peak) and really reinforces doubts about the correctness and significance of the peak. However, the range of the normalized response gives a different picture. Even though the range does not stay fixed (but this wasn't expected and highly depends on the example image), its width remains rather solid. There is no strict decay and the values do not get ridiculously small. This shows very clearly that scale normalization is indeed important. To come back to the question about significance: it is obvious that any comparison between the unnormalized values can't lead to correct results, and hence any extremum detected in this range is certainly not significant.
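The scale-selection behaviour can be reproduced on a synthetic example. The following sketch (my own construction, not the notebook used for the figures) models a blob as a Gaussian of a given standard deviation and picks the scale with the highest normalized determinant response at the blob centre; a wider blob should win at a larger scale:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def det_hessian(image, sigma):
    # sigma^4-normalized determinant of the Hessian via Gaussian derivatives
    Ixx = gaussian_filter(image, sigma, order=(2, 0))
    Iyy = gaussian_filter(image, sigma, order=(0, 2))
    Ixy = gaussian_filter(image, sigma, order=(1, 1))
    return sigma**4 * (Ixx * Iyy - Ixy**2)

y, x = np.mgrid[0:129, 0:129]
sigmas = list(range(1, 13))

def best_scale(blob_std):
    # Synthetic blob centred at (64, 64); response at the centre per scale
    blob = np.exp(-((x - 64)**2 + (y - 64)**2) / (2 * blob_std**2))
    responses = [det_hessian(blob, s)[64, 64] for s in sigmas]
    return sigmas[int(np.argmax(responses))]
```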

To also answer the question about correctness and significance numerically for a specific example, let's compare the maximum response at our specific location $$(x_0,y_0)=(532, 98)^T$$ with the global maximum response over all locations in all scale levels10

\begin{align*} \frac{ \max_{i=1,\ldots,50}(\det(H_{\sigma_i}(532, 98))) }{ \max_{i=1,\ldots,50}(\det(H_{\sigma_i})) } &= % MathJax offers a way to apply css to math elements: http://docs.mathjax.org/en/latest/tex.html#html % The math must contain a symbol, i.e. it should not be empty. Therefore, an arbitrary symbol is printed with the same colour as the face of the circle \frac{ \style{border: 2.2px solid #5e81b5; border-radius: 50%; background: white; width: 0.6em; height: 0.6em;}{\color{white}{o}} } { \style{border: 2.2px solid #5e81b5; border-radius: 50%; background: black; width: 0.6em; height: 0.6em;}{\color{black}{o}} } = \frac{0.0000621447}{1.10636} = 0.0000561703 \\ \frac{ \max_{i=1,\ldots,50}(\sigma_i^4 \cdot \det(H_{\sigma_i}(532, 98))) }{ \max_{i=1,\ldots,50}(\sigma_i^4 \cdot \det(H_{\sigma_i})) } &= \frac{ \style{border: 2.2px solid #e19c24; border-radius: 50%; background: white; width: 0.6em; height: 0.6em;}{\color{white}{o}} } { \style{border: 2.2px solid #e19c24; border-radius: 50%; background: black; width: 0.6em; height: 0.6em;}{\color{black}{o}} } = \frac{15.3165}{34.6083} = 0.442567. \end{align*}

The idea behind this is that we need to compare different response values over scale for extrema detection, and this relation measures how close the values are to each other. If values across scale do not operate in the same domain, a comparison is not really useful. This is like comparing distance values where one is measured in metres and the other in kilometres. Since the value denoted in metres is naturally higher than the value denoted in kilometres, any comparison operation we perform on these values is prone to errors. As you can see from the calculated values, this is true for the maximum response without normalization, as it is absolutely not significant and hence does not represent a correct extremum. On the other hand, the normalized response does not suffer from a huge difference between the values, and therefore I would conclude this to be a very significant and correct extremum.

To finish up, let's take a look at the response images for the different scale levels. In order to make responses visually comparable across scale and to achieve a better image contrast, I scaled each response with the minimum and maximum value over all scale levels and all pixel locations, i.e. the response image $$R_{\sigma_i}$$ for the scale $$\sigma_i$$ is calculated as

$$\label{eq:HessianDetector_ResponseImageScale} R_{\sigma_i} = \frac{ \sigma_i^4 \cdot \det(H_{\sigma_i}) - \min_{i=1,\ldots,50}(\sigma_i^4 \cdot \det(H_{\sigma_i})) } { \max_{i=1,\ldots,50}(\sigma_i^4 \cdot \det(H_{\sigma_i})) - \min_{i=1,\ldots,50}(\sigma_i^4 \cdot \det(H_{\sigma_i})) }.$$

This way all response values stay in $$[0;1]$$ and a response value of e.g. 0.4 means the same in all images, i.e. it does not depend on the current scale level. But this formula is only used to achieve a better visualization; feel free to ignore it.
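For completeness, a sketch of this visualization rescaling, assuming the per-level responses are collected in a list (the function name is my own):

```python
import numpy as np

def rescale_for_display(responses):
    # `responses` holds one response image per scale level; rescale with the
    # global minimum/maximum so that a value like 0.4 means the same on every level
    stack = np.stack(responses)
    lo, hi = stack.min(), stack.max()
    return (stack - lo) / (hi - lo)
```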

As you can see, the blob at our location of interest $$(x_0,y_0)=(532, 98)^T$$ is now detected and $$\sigma_{16}=32$$ indeed gives a fairly strong response. With the same technique we can also detect other blobs: find the maximum determinant response across scale for each image position.

This was also the last part of our journey, which started with \eqref{eq:HessianDetector_Definition} where we took a deep breath at the Hessian matrix and analysed the curvature information it contains. We moved forward while keeping the Gaussian curvature $$K$$ in our bag, made sure signs didn't cross our way and used $$K$$ to find extrema responses in the image. We then needed to take a break and first sort out correct scale normalization. With this problem solved via the $$\sigma_i^4$$ prefactor, we could search for extrema over the complete scale pyramid and so the circle closed again with \eqref{eq:HessianDetector_Definition}.


1. I can e.g. recommend chapter 16 (p. 493ff) of the Learning OpenCV 3 book by Gary Bradski and Adrian Kaehler. If you want a deeper understanding I would recommend looking at a specific feature detector. E.g. the book Digital Image Processing (second edition) by Burger and Burge has a great chapter about SIFT (chapter 25, p. 609ff).
2. Image from Frank Vincentz (Own work) [GFDL or CC-BY-SA-3.0], via Wikimedia Commons.
3. It is possible to derive the definition of the Hessian matrix but I will only present the results here. For more information see e.g. chapter 3 (p. 88ff) of the book Data Visualization - Principles and Practice (second edition) by Alexandru C. Telea.
4. Due to a property called symmetry of second derivatives.
5. If you want to know why this is true, you may be interested in reading chapter 7 (p. 254ff) of the book Data Visualization - Principles and Practice (second edition) by Alexandru C. Telea.
6. It is probably not too hard to see that the second eigenvector points in the $$\alpha = 90^{\circ}$$ direction (the $$\alpha = 270^{\circ}$$ direction has an equal curvature in this case), i.e. $$\fvec{e}_2 = \begin{pmatrix} \cos(90^{\circ}) \\ \sin(90^{\circ}) \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$ with $$\lambda_2=0$$ (zero curvature).
7. If you are interested in the exact definition, check out the corresponding Mathematica notebook. But since the function values do not really matter here, I did not want to confuse you.
8. Part of an image by Hynek Moravec (Self-photographed) [GFDL, CC-BY-SA-3.0 or CC BY 2.5], via Wikimedia Commons
9. To be honest, I don't know why the exponent $$^4$$ is used here. What I can guess is that it starts with $$\sigma$$ and per derivative this factor somehow gets squared, so we have $$\sigma$$ for $$\partial_x$$ and $$\partial_y$$ and $$\sigma^2$$ for $$\partial_{xx}, \partial_{yy}$$ and $$\partial_{xy}$$. When we plug this into the determinant formula \begin{equation*} \sigma^2 \partial_{xx} \cdot \sigma^2 \partial_{yy} - \sigma^2 \partial_{xy} \cdot \sigma^2 \partial_{xy} = \sigma^4 \cdot \left( \partial_{xx} \cdot \partial_{yy} - \partial_{xy}^2 \right) = \sigma^4 \cdot \det(H_{\sigma}) \end{equation*} we get \eqref{eq:HessianDetector_Definition}. You can find more about this topic in the paper Feature Detection with Automatic Scale Selection by Tony Lindeberg. But the important concept here is that there must be some kind of compensation; otherwise, extrema detection would not work.
10. Values correspond with the circles shown in the figures. The exact values are calculated in the Mathematica notebook.

## Introduction to convolution with link to filters in computer vision

Convolution is arguably one of the most important operations in computer vision. It can be used to modify the image (e.g. blurring), find relevant structures (e.g. edge detection) or infer arbitrary features (e.g. machine learning). It is often one of the first steps in an image processing pipeline. Abstractly speaking, this operation takes an image as input and also produces an image as output by operating on each pixel in the input image. For this, it takes the neighbourhood of the pixel into account, and its behaviour can be configured by so-called kernels. This makes it a very powerful and versatile operation.

At its core, convolution is first of all a mathematical operation with general properties, not necessarily restricted to the computer vision domain (it is popular in signal processing in general). However, the mathematical definition can be confusing at first glance compared to the actual calculations we perform. The goal of this article is to provide a deep understanding of the convolution operation and to bridge the gap between the theoretical definitions and the practical operations. For this, we are going to look at various perspectives: we start in the 1D discrete domain, move on to the 1D continuous case and conclude in the 2D discrete world (skip to this last section if you are only interested in image processing).

### 1D – discrete

Let's start with the one-dimensional and discrete case. There is a very good example at BetterExplained which we want to take up here and use as a basis. Suppose you are a doctor leading a medical study where you oversee the test of a new medicine (a new pill) which has a strict plan of intake. A patient has to come to your office on three consecutive days, taking three pills on the first, two pills on the second and one pill on the third day. Now, assume further that you don't have just one patient but multiple patients instead. To make things even more complicated, the patients start their treatment on different days. Let's assume that on the first day just one patient comes to your office. He/She will get three pills from you. On the second day, two new patients and the one from yesterday come to your office. The patient from yesterday gets two and the other two will get three pills each — leaving a total of eight pills you have to deposit in your office (to satisfy all demands)1. Of course, you want to keep track of the number of pills required per day. You could go on with business as usual, but this will probably get complicated and confusing. Maybe there is a better way...

We are now going to formalize this problem. We have two variables in this scenario: the plan for the intake of the pills per treatment day $$\tau = 1, 2, 3$$ of a patient and the number of new patients coming to the office on each day $$t = 1, 2, \ldots$$ of the experiment. In the end, we would like to retrieve the number of pills we need to hand out per experiment day $$t$$.

The plan is a list of values denoting the number of pills required per treatment day $$\tau$$ of one patient, i.e. $$\fvec{k} = (3, 2, 1)$$ in this example. The number of patients can also be stored in a list denoting the number of new patients arriving on a day $$t$$ of the experiment. Let us assume that this list would look like $$\fvec{s} = (1, 2, 3, 4, 5)$$ for our example (1 new patient comes on the first, 2 new patients come on the second day, etc.). We now want to combine the two lists in a way so that we know how many pills we need per experiment day $$t$$. For this, we summarize the information in a table.

The top row shows the plan with the number of pills in decreasing order. Below are the number of new patients arranged (visibility depends on the experiment day $$t$$). We multiply the two lists together over the treatment days $$\tau$$ so that each column denotes the total number of pills required for first-time, second-time and third-time patients. Then, we sum up these values and this result is the total number of required pills per experiment day $$t$$.

This is a very dynamic scenario and we end up with one table per experiment day $$t$$. This is essentially what you can control with the slider and has the effect of sliding the patient list from left to right over the plan list. However, if we used the original order of the patient list, the number 5 would come first but there is only one patient on the first day. So, we have to reverse the patient list: $$(5, 4, 3, 2, 1)$$.

Hopefully, we can agree that this representation makes the problem much clearer than the manual (naive) approach. Just with a combination of basic operations (multiplication and addition), we can easily retrieve the value we want.

You may have noticed that some terms in the previous paragraphs were highlighted. In summary:

• Plan: a function $$\label{eq:Convolution_ExamplePlan} k(\tau) = \begin{cases} k_{\tau + 1} & 0 \leq \tau \leq 2 \\ 0 & \text{else} \end{cases}$$ with the treatment day $$\tau$$ as argument and the number of required pills as the function value2.
• Patients: a function $$\label{eq:Convolution_ExamplePatients} s(t) = \begin{cases} s_t & 1 \leq t \leq 5 \\ 0 & \text{else} \end{cases}$$ with the day $$t$$ of the experiment as argument and the number of new patients coming on that day as the function value.
• Reverse order of the patient list (helpful to model our problem).
• Multiplication of the values in the patient and plan lists along the treatment days $$\tau$$.
• Addition of the previous multiplication results.
• New function $$p(t)$$ which returns the total number of pills we need to have in stock on each day $$t$$ of the experiment.

These points provide the ingredients for the definition of the convolution operation.

Definition 1: Convolution [1D, discrete]

Let $$s(t)$$ and $$k(t)$$ denote a one-dimensional and discrete ($$t \in \mathbb{Z}$$) signal as well as kernel, respectively. Then, applying the convolution operator $$*$$ on both functions

$$\label{eq:Convolution_1DDiscrete} {\color{Mulberry}p}(t) = {\color{LimeGreen}s}(t) * {\color{Aquamarine}k}(t) = {\color{Mahogany}\sum_{\tau=-\infty}^{\infty}} {\color{LimeGreen}s}({\color{Orange}-}{\color{Mahogany}\tau} + t) {\color{Red}\cdot} {\color{Aquamarine}k}({\color{Mahogany}\tau})$$

yields the response function $$p(t)$$. In essence, this operation reverses the signal $${\color{LimeGreen}s}$$, combines it with the kernel $${\color{Aquamarine}k}$$ and aggregates the result into a new function $${\color{Mulberry}p}$$.

It is, of course, not really practicable to have a sum which iterates over $$\infty$$. However, this is not a problem when our discrete functions store only a finite number of elements (always the case in practice). In this case, we can define the values outside the relevant range as 0 (like we did in \eqref{eq:Convolution_ExamplePlan} and \eqref{eq:Convolution_ExamplePatients}) so that most of the multiplications disappear. In the end, we only iterate over the relevant intersection between $$s(t)$$ and $$k(t)$$. It makes sense to let the relevant range be defined by the smaller function so that we save computations. In our example, this is the plan $$k(\tau)$$ which has only $$n=3$$ elements letting us rewrite \eqref{eq:Convolution_1DDiscrete} to

\begin{equation*} {\color{Mulberry}p}(t) = {\color{Mahogany}\sum_{\tau = 0}^{n-1}} {\color{LimeGreen}s}({\color{Orange}-}{\color{Mahogany}\tau} + t) {\color{Red}\cdot} {\color{Aquamarine}k}({\color{Mahogany}\tau}). \end{equation*}

This function can easily be evaluated. For example, when we want to know how many pills we need to hand out on the third day of the experiment, we calculate

\begin{equation*} p(3) = s(3) \cdot k(0) + s(2) \cdot k(1) + s(1) \cdot k(2) = 9 + 4 + 1 = 14. \end{equation*}
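The pill example can also be checked numerically; NumPy's `np.convolve` implements exactly this discrete convolution (a small sketch, assuming NumPy is available):

```python
import numpy as np

# Plan: pills per treatment day of one patient (3, 2, 1)
k = np.array([3, 2, 1])
# New patients arriving on each experiment day
s = np.array([1, 2, 3, 4, 5])

# Full discrete convolution: p[t] = sum over tau of s[t - tau] * k[tau]
p = np.convolve(s, k)

print(p)  # [ 3  8 14 20 26 14  5]
```

The third entry reproduces $$p(3) = 14$$ from the calculation above, and the second entry matches the eight pills required on the second day of the experiment.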

To visualize the function $$p(t)$$ of \eqref{eq:Convolution_1DDiscrete}, we can imagine that the patient function $$s(-\tau + t)$$ slides over the plan function $$k(\tau)$$. At each step $$\tau$$, the functions get multiplied together. This is also illustrated in the following animation.

The abbreviations which have been used so far are not arbitrarily chosen. $$s$$ stands for the signal since convolution is often used with signals (audio, images, etc.). $$k$$ is the kernel which defines the kind of operation we want to perform. There are for example kernels for blurring, averaging or finding the derivatives of a signal. Even the convolution variable $$\tau$$ is chosen deliberately. Both $$t$$ and $$\tau$$ denote some point in time. $$t$$ is used to specify the timestamp currently of interest and $$\tau$$ is used to iterate over the complete relevant time range which we need to consider in order to calculate the result at the step $$t$$.

In our example, we flipped the signal (patients) and slid it over the kernel (plan). This was not much of a problem because both vectors contain only a few numbers. However, with larger signals (like from the real world) this procedure would get impractical. Luckily, the convolution operation has the commutativity property, i.e.

$$\label{eq:Convolution_Commutativity} s(t)*k(t) = k(t)*s(t).$$

In other words: it doesn't matter whether we flip and slide the signal or the kernel. In terms of the previous example, this would mean that we could also have fixed the patient data and reversed and slid the plan list over the patient list. The result is the same. In practice, we can always flip and slide the smaller of the two vectors and save computation time.
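The commutativity property is easy to verify numerically for the example vectors (a quick sketch with NumPy):

```python
import numpy as np

s = np.array([1, 2, 3, 4, 5])  # signal (patients)
k = np.array([3, 2, 1])        # kernel (plan)

# Flipping and sliding the signal over the kernel or vice versa
# yields exactly the same response
print(np.array_equal(np.convolve(s, k), np.convolve(k, s)))  # True
```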

### 1D – continuous

The next step is to extend \eqref{eq:Convolution_1DDiscrete} to the continuous case. This turns out to be pretty simple since we basically have to replace the sum with an integral:

Definition 2: Convolution [1D, continuous]

Let $$s(t)$$ and $$k(t)$$ denote a one-dimensional and continuous ($$t \in \mathbb{R}$$) signal as well as kernel, respectively. Then, applying the convolution operator $$*$$ on both functions

$$\label{eq:Convolution_1DContinuous} p(t) = s(t)*k(t) = \int_{-\infty}^{\infty} \! s(-\tau + t) \cdot k(\tau) \, \mathrm{d}\tau$$

yields the response function $$p(t)$$.

So, instead of adding up a discrete set of values, we integrate over the complete (relevant) range. This means that we are interested in the area underneath $$s(-\tau + t) \cdot k(\tau)$$ for each step $$t$$. However, the integral over the complete range $$]-\infty;\infty[$$ makes it really hard to express the resulting function analytically. It is often possible, though, if at least one of the functions is only defined on a limited range or the functions are not very complicated themselves. To look at an example, let's consider the increasing step function

\begin{equation*} s(t) = \begin{cases} 1 & t \geq 0 \\ 0 & t < 0 \end{cases} \end{equation*}

as a signal. In an image, this could for example model a sudden contrast increase, i.e. an edge. As kernel, the difference of Gaussian function shall be used

$$\label{eq:Convolution_KernelContinuous} k(t) = \operatorname{DoG}_{\sigma_1,\sigma_2}(t) = G_{\sigma_1}(t) - G_{\sigma_2}(t).$$

This is an approximation of the normalized version of the Laplacian of Gaussian function which is created by applying the Laplace $$\nabla^{2}$$ operator to a Gaussian function. In 2D images, this implements edge detection based on the second order derivatives. Since noise amplifies with increasing order of derivatives, it is often a good idea to smooth the signal first with a Gaussian. The Laplacian of Gaussian combines both operations (smoothing and derivative calculation)3. The following animation shows how the response of \eqref{eq:Convolution_1DContinuous} forms a short impulse response.

So, why does the response function look like it looks? Well, first note that the $$\operatorname{DoG}$$ function consists of a positive part surrounded by two negative parts. When sliding over the step function, first it overlaps with the right negative part. In this range, the multiplication $$s(-\tau + t) \cdot k(\tau) =_{t \geq 0} k(\tau)$$ is non-zero and yields nothing more than the kernel function itself since the step function forms an identity function for $$t \geq 0$$. The integral inherits its sign from the kernel and is hence negative so that we get a negative response for this first part.

When sliding further, the positive part of the kernel function reaches $$t \geq 0$$. Now, we have a positive and a negative part “active” which lowers the response (the two integral areas balance out a bit). Until the second negative part reaches the step function, the response increases further. After this point, the response decreases again due to the left negative part coming in. The response will approximate (but never reach) 0 since the $$\operatorname{DoG}$$ function consists of two Gaussian functions which have non-zero values everywhere in $$\mathbb{R}$$.
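The behaviour described above can be reproduced by discretizing the continuous convolution on a fine grid (a sketch; the values $$\sigma_1 = 1, \sigma_2 = 2$$ are arbitrary choices, not taken from the figures):

```python
import numpy as np

def gauss(t, sigma):
    """Normalized 1D Gaussian function."""
    return np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

t, dt = np.linspace(-10, 10, 2001, retstep=True)
step = (t >= 0).astype(float)        # the increasing step function (an "edge")
dog = gauss(t, 1.0) - gauss(t, 2.0)  # DoG: positive peak, negative surround

# Riemann-sum approximation of the continuous convolution integral
response = np.convolve(step, dog, mode="same") * dt

# Negative dip first, then a positive peak, finally a decay towards 0
print(response.min() < 0, response.max() > 0, abs(response[-1]) < 1e-3)
```

The printed checks confirm the qualitative shape discussed above: the response first turns negative, then rises to a positive peak and finally approaches (but never reaches) zero.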

### 2D – discrete/continuous

It is now time to add an additional dimension so that we are finally reaching the image domain. This means that our functions now depend on two variables, e.g. $$x$$ and $$y$$. In the convolution formulas, this results in an additional sum respectively integral. The following table summarizes the convolution formulas in the 1D/2D and discrete/continuous cases.

|  | 1D | 2D |
| --- | --- | --- |
| Discrete | $$s(t)*k(t) = \sum_{\tau = -\infty}^{\infty} s(-\tau + t) \cdot k(\tau)$$ | $$s(x, y)*k(x, y) = \sum_{\tau = -\infty}^{\infty} \sum_{\kappa = -\infty}^{\infty} s(-\tau + x, -\kappa + y) \cdot k(\tau, \kappa)$$ |
| Continuous | $$s(t)*k(t) = \int_{-\infty}^{\infty} \! s(-\tau + t) \cdot k(\tau) \, \mathrm{d}\tau$$ | $$s(x, y)*k(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \! s(-\tau + x, -\kappa + y) \cdot k(\tau, \kappa) \, \mathrm{d}\tau \mathrm{d}\kappa$$ |

We now integrate over the complete relevant area around each point and “each point” means all points in a 2D grid. This sounds like an expensive operation (which it also sometimes is) but when the kernel size is small compared to the image size, it is manageable. Personally, I have never used the 2D continuous convolution formula before (just mentioned for completeness), so let's stick to the 2D discrete case from now on.

As already mentioned, convolution is a very important operation in computer vision4. The signal is most of the time just the image itself. The interesting part is the kernel since it defines how the image will be altered. The kernel is centred5 and flipped around the current pixel position and then each element of the kernel gets multiplied with the corresponding pixel value in the image, i.e. the current pixel with the kernel centre and the surroundings of the kernel with the neighbours of the current pixel. This is what we called the area around a position. To make this point clear, consider the following signal matrix $$A$$ and the $$3 \times 3$$ kernel $$K$$

\begin{equation*} A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & {\color{Aquamarine}5} & 6 \\ 7 & 8 & 9 \\ \end{pmatrix} \quad \text{and} \quad K = \begin{pmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \\ \end{pmatrix}. \end{equation*}

For the highlighted position, this results in the following operation

\begin{equation*} (A*K)_{22} = 9 k_{11}+8 k_{12}+7 k_{13}+6 k_{21}+5 k_{22}+4 k_{23}+3 k_{31}+2 k_{32}+k_{33}. \end{equation*}

This shows how the flipping of the kernel works: basically, each index is mirrored independently around the centre, so that e.g. $$k_{13} \rightarrow k_{31}$$ or, to put it differently, the kernel matrix gets rotated by 180°. Note that usually the kernel is flipped and not the image. This is possible due to the commutativity property (\eqref{eq:Convolution_Commutativity}) and in general a good idea since the kernel will usually be much smaller than the image.
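The flip can be sketched in code. The kernel values below are hypothetical numbers (just to make the check concrete); `np.flip` without an axis argument mirrors both dimensions, i.e. performs the 180° rotation:

```python
import numpy as np

A = np.arange(1, 10, dtype=float).reshape(3, 3)       # the signal matrix from above
K = np.arange(1, 10, dtype=float).reshape(3, 3) / 10  # hypothetical kernel values

# The expansion from the text: 9*k11 + 8*k12 + ... + 1*k33
expansion = np.sum(A.ravel()[::-1] * K.ravel())

# Same result via flipping the kernel along both axes and multiplying
# element-wise with the neighbourhood
flipped = np.sum(A * np.flip(K))

print(np.isclose(expansion, flipped))  # True
```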

It is now time for a small example. Let's look at the following image $$I$$ and the kernel $$K_{\nabla^{2}}$$

\begin{equation*} I = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & {\color{Aquamarine}1} & 2 & 0 \\ 0 & 0 & 1 & 0 \\ 3 & 0 & 0 & 0 \end{pmatrix} \quad \text{and} \quad K_{\nabla^{2}} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix} \end{equation*}

and see how we can calculate the convolution response for the highlighted position

\begin{alignat*}{4} 0 \cdot 0 & {}+{} & 1 \cdot 1 & {}+{} & 0 \cdot 1 & {}+{} & \\ 1 \cdot 0 & {}+{} & (-4) \cdot 1 & {}+{} & 1 \cdot 2 & {}+{} & \\ 0 \cdot 0 & {}+{} & 1 \cdot 0 & {}+{} & 0 \cdot 1 & {}{} & = {\color{Aquamarine}-1} = (I*K_{\nabla^{2}})_{22}. \end{alignat*}

Basically, we position the kernel at the target position in the image, multiply the two areas together and aggregate the result. If you compare these computations with the formula from the table, you might notice the similarity. To get the response function, the steps are repeated for every image pixel (this is the sliding of the kernel over the image) resulting in the following response image (assuming that every pixel value outside the image is zero6)

\begin{equation*} I * K_{\nabla^{2}} = \left( \begin{array}{cccc} 1 & -2 & -1 & 1 \\ 1 & {\color{Aquamarine}-1} & -5 & 2 \\ 3 & 2 & -2 & 1 \\ -12 & 3 & 1 & 0 \\ \end{array} \right). \end{equation*}
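The full response image can be reproduced with a few lines of NumPy (a sketch assuming zero padding for the border handling; note that the Laplacian kernel is symmetric under the 180° rotation, so the flipping has no visible effect here):

```python
import numpy as np

I = np.array([[0, 1, 1, 0],
              [0, 1, 2, 0],
              [0, 0, 1, 0],
              [3, 0, 0, 0]])
K = np.array([[0,  1, 0],
              [1, -4, 1],
              [0,  1, 0]])

padded = np.pad(I, 1)   # zero padding: pixels outside the image are 0
flipped = np.flip(K)    # 180-degree rotation (a no-op for this symmetric kernel)

# Slide the kernel over every pixel and aggregate the element-wise products
response = np.array([[np.sum(padded[y:y + 3, x:x + 3] * flipped)
                      for x in range(I.shape[1])]
                     for y in range(I.shape[0])])
print(response)
```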

The previous computations are also visualized in the following figure which focuses on the kernel positioning and the sliding over the input image.

Since this article is assigned to the computer vision category, it would be a shame to not show at least one image example. Therefore, the following figure shows an example image which gets convolved by the same kernel as in the previous example.

Did you notice that the index $$\nabla^2$$ was used for the kernel? This is not a coincidence since the kernel $$K_{\nabla^2}$$ is precisely the 2D discrete version of the Laplace operator and, like in the 1D continuous case (approximated in the form of \eqref{eq:Convolution_KernelContinuous}), this filter responds most strongly to edges, e.g. visible at the whiskers of the cat.

There are many other filters and it is beyond the scope of this article to cover them all here. Instead, you can visit this website where you see other filters and examples which run directly in the browser. When you scroll down a bit, you can even test your own kernels interactively.

One final note, though. The convolution operation is also of special interest in the machine learning domain in the form of Convolutional Neural Networks (CNNs). The idea is to introduce a stack of layers in the network hierarchy each basically performing the convolution operation based on the previous input, e.g. the input image or the output of another convolutional operation. The trick is that the kernels of the layers are not predefined but learned instead. That is, based on a set of labelled examples and a target function, the network adapts the kernel weights (elements $$k_{ij}$$) automatically. This way the kernels fit very well to the application domain without the need for manual feature engineering.7


1. Don't worry if this description sounds confusing. The goal is that you see in a moment that the convolution operation is perfectly suited for this problem.
2. Precisely speaking, the function $$k(\tau)$$ does not expect the treatment day $$\tau$$ but rather its index $$\tau - 1$$ as argument (first non-zero value at $$\tau = 0$$). This is not necessary in general and an application-dependent choice instead. Here, it has the effect that we yield 0 total pills for the day $$t=0$$ which sounds reasonable for our problem.
3. It also resembles the edge detection which our retina ganglion cells perform.
4. However, it is sometimes termed differently as linear filtering with the kernel being the filter.
5. This is an implementation detail not taken into account by the mathematical definition. You would need to manually add some constant to centre your kernel, e.g. $$+ 1$$ for a $$3 \times 3$$ kernel in the 2D discrete case: $$s(x, y)*k(x, y) = \sum_{\tau = 0}^{n-1} \sum_{\kappa = 0}^{n-1} s(-\tau + x, -\kappa + y) \cdot k(\tau + 1, \kappa + 1)$$.
6. This is one case of the general border handling problem. See this Wikipedia section or page 113 of the book Digital Image Processing (second edition) by Burger and Burge for further details.
7. If you are interested and want to learn more about this topic, then I can recommend the last chapter of the book Neural Networks and Deep Learning by Michael A. Nielsen.

## Scale invariance through Gaussian scale space

When analysing images, it is often desired to make the analysis scale invariant. Objects may appear in different sizes in an image either because the image resolution is different or because the distance of the object to the camera varies between a series of images. Scale-invariant image processing algorithms incorporate this observation and adjust their calculations accordingly, e.g. by making sure that a similar output is produced for two images showing the same object but in different sizes. This is not easy, but a general idea is to consider multiple scale levels of an image, meaning the same image is analysed in different resolutions. Feature algorithms like SIFT implement this idea by building a (Gaussian) scale space where the original image is repeatedly blurred to achieve the behaviour of different scales. In this article, I want to talk a bit about scale spaces and what they imply.

I start by analysing what scaling in an image actually means and how we can implement it. We will see that scaling has also an influence on the area we need to consider when we try to analyse a certain structure. The goal of the first part is to make the necessity of a scale space clear and convey insights into what we need to consider when using it. The second part shows an example of how we could build a scale space by trying to cover important ideas in this field.

### Why size matters and why the intrinsic resolution matters even more

Let’s dive directly in by considering an example. The following two images show the same object but in different sizes. The original image (right) has a size of $$3866 \times 4320$$ pixels (width by height) whereas the left one is scaled-down by 0.25 to the new size of $$966 \times 1079$$ pixels using the nearest neighbour technique (i.e. no interpolation).

In the following, I want to analyse the local structure around the eye in more detail. If we restricted the analysis to the mentioned rectangle of size $$100 \times 100$$ in both images, the resulting image regions would be very different since the rectangles extend to different parts of the image. This is not surprising since a distance of 100 px depends on the size of the image. We can say that the meaning of 100 px varies depending on the image size, i.e. it is not scale invariant.

We obviously need to adjust the rectangle's size. Since the left image is scaled-down by $$\frac{1}{4}$$, let's scale up the right rectangle by 4, so that the new size becomes $$400 \times 400$$ pixels.

Now, both rectangles cover roughly the same image region, but the original one has much more information (more details, e.g. visible in the reflection of the eye). You can also argue with the entropy here (the amount of information per area): the entropy of the left image is lower than the entropy of the right image. To achieve scale invariance, we want the entropy to be roughly the same in both image regions. Remember: we want to analyse an object regardless of its size, but when one area shows the object in much more detail then the result still varies depending on the image size.

Let’s think a little bit about the scaling operation. What does scaling mean? We decrease our image size, obviously, and that means we lose information. This is unavoidable since we use fewer pixels to describe the object, fewer pixels to describe all the beautiful details in the image (of course, this is more noticeable in the eye than in the skin – repeating pixels don’t give us more information).

The exact loss of information depends also on the reducing technique. A very easy technique would be to just sub-sample the image by removing every n-th pixel (e.g. remove every second pixel to retrieve a half-sized version of the image). A more sophisticated approach takes the surrounding pixels into account. Blurring is one way to calculate a new pixel value depending on the local neighbourhood. It also reduces information, since the neighbouring pixels are weighted, which levels out the variation of pixel differences. The kind of weighting depends on the kind of blurring. One very common way is to use Gaussian blurring for the job.

Gaussian blurring offers a very nice way to visualize the loss of information. For this, we leave our spatial domain and take a tour to the frequency domain. The Fourier transform of a Gaussian function is very special: it remains a Gaussian function. What changes is the $$\sigma$$ value. The Gaussian function in the spatial domain

$$\label{eq:GaussianScaleSpace_GaussSpatial} g_{\sigma}(x,y) = \frac{1}{2 \pi \sigma^2} \cdot e^{-\frac{x^2+y^2}{2 \sigma ^2}}$$

leads to (after the Fourier transform)

$$\label{eq:GaussianScaleSpace_GaussFrequency} G_{\sigma}(\omega_1, \omega_2) = \frac{1}{2 \pi} \cdot e^{-\frac{\left(\omega_1^2+\omega_2^2\right) \cdot \sigma^2}{2}}$$

in the frequency domain. So, with increasing $$\sigma_s$$ in the spatial domain, the corresponding $$\sigma_f$$ in the frequency domain decreases. More precisely, the relation $$\label{eq:GaussianScaleSpace_GaussRelationFrequency} \sigma_s = \frac{1}{\sigma_f}$$ can be observed. This means a wide Gaussian curve in the spatial domain corresponds to a peaked Gaussian curve in the frequency domain. This is also visualised in the following animation.

Notice how a large $$\sigma$$-value (high blurring) results in a very small peak in the frequency domain. This gives us important insights when considering the convolution theorem: convolution in the spatial domain corresponds to multiplication in the frequency domain, meaning $$g(x) * k(x)$$ is the same as $$(2\pi)^{\frac{n}{2}} \cdot G(\omega) \cdot K(\omega)$$, with $$n$$ the number of independent variables, e.g. $$2$$ in the image case. So, when we blur our image (which is done by convolution) with a Gaussian function with a high $$\sigma$$ value, it is the same as multiplying the image in the frequency domain with a Gaussian function with a low $$\sigma$$ value. By applying this low-pass filter (keeping only frequencies from the lower band), we remove all the high frequencies since a narrow Gaussian consists of mostly zeros outside its peak. The thing is that these high frequencies are responsible for all the fine details in the image. High frequencies, i.e. fast changes of intensity values, are needed to show the colour changes which details usually consist of (otherwise we would not call them details but rather homogeneous areas).
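The self-similarity of the Gaussian under the Fourier transform can be checked numerically in 1D. The sketch below uses a unit-normalized Gaussian and the convention $$F(\omega) = \int f(t) e^{-i \omega t} \, \mathrm{d}t$$, under which the spectrum of $$g_{\sigma_s}$$ is $$e^{-\omega^2 \sigma_s^2 / 2}$$, i.e. a Gaussian with $$\sigma_f = 1 / \sigma_s$$:

```python
import numpy as np

sigma_s = 2.0
N, dt = 4096, 0.01
t = (np.arange(N) - N // 2) * dt  # symmetric time grid around t = 0

g = np.exp(-t**2 / (2 * sigma_s**2)) / (sigma_s * np.sqrt(2 * np.pi))

# Approximate the continuous Fourier transform with the FFT
# (ifftshift moves t = 0 to index 0 so that the phase comes out real)
G = np.fft.fft(np.fft.ifftshift(g)).real * dt
omega = 2 * np.pi * np.fft.fftfreq(N, dt)

# The spectrum is again a Gaussian, now with sigma_f = 1 / sigma_s
spectrum = np.exp(-omega**2 * sigma_s**2 / 2)
print(np.max(np.abs(G - spectrum)))  # numerically zero
```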

Back to the example from the beginning. The original image (with increased rectangle size, i.e. $$400 \times 400$$ pixels) contains too many details compared to the reduced version, so let’s blur it with a Gaussian:

The number of details (or the entropy) is now much more similar between the two images. E.g. the stripes in the reflection of the eye are not as clearly visible anymore as before. This is exactly what we wanted to achieve. But, compared to the reduced version, the transitions are smoother, which is a direct effect of using blurring instead of a local nearest-neighbour approach.

Also note that for the right image the actual image size has not decreased, we still use the same number of pixels as before. Only the intrinsic image resolution2 changed: fewer details in the image. That is why we increased the size of the rectangle to account for the different intrinsic image resolution. If we now analysed the two image regions, we would probably get a better result than before, since we cover the same spatial area (rectangle size) and consider roughly the same amount of detail (blurring). This is precisely what we wanted to achieve for scale invariance since it allows us to analyse the same objects in images of different sizes.

So far, we created one additional scaled version of the original image (0.25 of the original size). Since I already used the term scale space before, there is indeed more. The idea is to add a third dimension to the original image which covers all possible scalings. This results in a three-dimensional function $$\hat{I}(x,y,\sigma)$$ with the scaling parameter $$\sigma$$ as the third dimension. It is then possible to consider (theoretically) all possible scaled versions of an image. In the real world, of course, this process will be discretely sampled3. In the end, scale invariance is achieved by considering different scale levels up to a certain point and applying the used algorithms for all these scale levels (or adapting the algorithms to work directly inside the scale space, cf. the SIFT paper).

I will show an example of a scale space later. But first, let's summarize what we want from a scale space so that it can help us with scale invariance:

• We need to adjust the area we want to analyse to fit the current image size. This is especially important for filter operations where a filter kernel operates at each location in a local neighbourhood. The size of the neighbourhood (or the kernel size) corresponds to the size of the rectangle above. Hence, we may need to adjust the kernel size to be scale invariant. Alternatively, we stick to one kernel size but reduce the image size instead (see below).
• To consider an object at different sizes, we need different scaled versions of an image. More importantly, the intrinsic information (or the entropy) must change. We can use Gaussian blurring for this job which offers very good quality.

Before I continue to show how a scale space can be built based on these requirements, I want to take a detailed look at the scale parameter $$\sigma$$ and how it behaves when the image is blurred multiple times.

### A detailed look at the scale parameter $$\sigma$$

You may have wondered why I blurred the image with $$\sigma=1.936$$. Actually, this is a little bit complicated. It is based on the assumption that the original image is already a blurred version of the continuous function with $$\sigma_s=0.5$$ [616]4, which is somehow related to the Nyquist–Shannon sampling theorem (sampling frequency of $$\sigma_f=2$$ is related to $$\sigma_s=0.5$$). Next, if we double our $$\sigma$$ value, we should also double the size of the rectangle [642]. Since our rectangle is 4 times larger than the original one, we do the same with the $$\sigma$$ value.

So, why $$\sigma = 1.936$$ instead of $$\sigma = \sigma_s \cdot 4 = 2$$? Well, this is probably only from theoretical interest; you won’t see a difference between images blurred with $$\sigma=2$$ or $$\sigma=1.936$$. The reason is that we want our continuous function to be blurred with $$\sigma = 2$$. The continuous function is the theoretical view of the light signals from the real world which were captured by the camera to produce the image. But we don’t have the continuous function, only the image which is a blurred version of the continuous function (with $$\sigma_s=0.5$$). So we have ($$f$$ denotes the continuous function and $$I$$ the discrete image function)

$$\label{eq:GaussianScaleSpace_BlurContinuousFunction} f * g_{\sigma} = \underbrace{f * g_{\sigma_s}}_{=I} * g_{\sigma_x} = I * g_{\sigma_x}$$

and this means we need to calculate the unknown $$\sigma_x$$. There is a rule5 [761] which states

$$\label{eq:GaussianScaleSpace_ScaleCombinations} \sigma = \sqrt{\sigma_s^2 + \sigma_x^2}$$

which only needs to be solved for $$\sigma_x$$, and that is where the mysterious value comes from

\begin{equation*} \sigma_x = \sqrt{\sigma^2 - \sigma_s^2} = \sqrt{4 - 0.25} \approx 1.936. \end{equation*}

$$g_{\sigma_x}$$ is therefore the kernel we need to convolve the image $$I$$ with to retrieve a result which is equivalent to the continuous function blurred with $$\sigma = 2$$, as stated by \eqref{eq:GaussianScaleSpace_BlurContinuousFunction}. Admittedly, this correction barely changes the visible result (it may be important in other contexts or with other values, though). I should also mention that in real scenarios a base scaling for the image is defined (e.g. $$\sigma = 1.6$$) [617], which already compensates for some of the noise in the image. This blurred version of the input image is then the basis for the other calculations.

Note that \eqref{eq:GaussianScaleSpace_ScaleCombinations} also applies when the image is blurred multiple times. Blurring the image two times with a Gaussian is equivalent to blurring once with a Gaussian whose $$\sigma$$ value is calculated according to \eqref{eq:GaussianScaleSpace_ScaleCombinations}. This will become important when we build our scale space later.

Let's discuss the appropriate size of the rectangle a bit more. To make things easier, assume we only use squares and therefore only need to calculate a width $$w$$. The idea is to connect the width with the $$\sigma$$ value in such a way that the entropy inside the rectangle stays roughly constant. This means when $$\sigma$$ increases, $$w$$ must also increase to compensate for the information loss caused by the blurring (assuming the real resolution of the image stays the same), i.e. the loss of entropy due to the blurring is recovered by expanding the area of interest. The same logic can be applied when $$\sigma$$ decreases. Of course, there must be a base width $$w_0$$ (like 100 px) and the actual width is computed relative to this base width. More concretely, since the original continuous function has a base blurring of $$\sigma_s$$ and we want to achieve a blurring of $$\sigma$$, the appropriate width can be calculated as

$$\label{eq:GaussianScaleSpace_RectangleSize} w = w_0 \cdot \frac{\sigma}{\sigma_s}.$$

In the previous example this was

\begin{equation*} w = 100\,\text{px} \cdot \frac{2}{0.5} = 400\,\text{px}. \end{equation*}
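Both calculations can be sketched in a few lines of Python; the values for $$\sigma_s$$, $$\sigma$$ and the base width $$w_0 = 100\,\text{px}$$ are the ones assumed in the text above:

```python
import math

# Values assumed in the text: the discrete image is a blurred version of the
# continuous function with sigma_s = 0.5 and we want a total blur of sigma = 2
sigma_s = 0.5
sigma = 2.0

# sigma = sqrt(sigma_s^2 + sigma_x^2)  =>  sigma_x = sqrt(sigma^2 - sigma_s^2)
sigma_x = math.sqrt(sigma ** 2 - sigma_s ** 2)
print(round(sigma_x, 3))  # 1.936

# Rectangle width relative to the base width w_0 = 100 px from the example
w_0 = 100
w = w_0 * sigma / sigma_s
print(w)  # 400.0
```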

If you want to play a bit around with different rectangle sizes and scale levels, you can use the Mathematica notebook6 which I created for this purpose. There you can also move the rectangle to a different position to analyse other structures.

### Building a scale space

In this section, I want to show a concrete example of a scale space. Note the word example here. There is no unique way of building a scale space, but there are general ideas which I try to embed here. Also, the example presented is not one I use in practical applications; I invented it solely for educational purposes. However, it is easy to find other approaches on the web7.

First, we must talk about the scaling operation (again). Like before, I want to use Gaussian functions to repeatedly blur the image and hence build the scale space. A Gaussian offers good quality and has a nice counterpart in the frequency domain (namely itself). But it comes at the cost of performance. Blurring an image means convolving the image with a discrete kernel. The Gaussian function is by its nature continuous. Hence, we need to discretize it, e.g. by taking the function values at every integer position and putting them in a kernel. When you look at the graph from the animation again, you will see that this does not result in nice numbers. We definitely need floating-point operations for the job and this will be computationally intensive. Remember that blurring is an essential operation when we want to change the intrinsic resolution of an image and hence the underlying operation should be fast.

Luckily, there are other filters which approximate the behaviour of a Gaussian, like the commonly used binomial filters. They are extracted from Pascal's triangle, where each row in the triangle can be used as a filter with a specific variance. Since the triangle is built only by addition of integers, the resulting filters will also consist of integers. This increases our computational efficiency.

Here, I want to approximate a Gaussian with $$\sigma=1$$ since this makes the calculations later easier. The corresponding binomial filter $$B_n$$ to use in this case is extracted from the row $$n=4$$ and is defined as8

\begin{equation*} B_4 = \frac{1}{16} \cdot \left( 1, 4, 6, 4, 1 \right)^T. \end{equation*}

Note the prefactor, which makes sure that the filter weights sum to 1 and which can be applied at the end of the convolution operation. The good thing about binomial filters is that they fit a Gaussian function very well, as the following figure depicts.

As you can see, even though we have only five integer values, the Gaussian is approximated very well. Other rows from Pascal's triangle behave similarly but with different $$\sigma$$-values. To use the filters on images (which require a 2D filter), separability can be exploited. But it is also possible to expand the vector to a matrix via the outer product:

\begin{equation*} B_4^T \cdot B_4 = \frac{1}{256} \cdot \left( \begin{array}{ccccc} 1 & 4 & 6 & 4 & 1 \\ 4 & 16 & 24 & 16 & 4 \\ 6 & 24 & 36 & 24 & 6 \\ 4 & 16 & 24 & 16 & 4 \\ 1 & 4 & 6 & 4 & 1 \\ \end{array} \right) \end{equation*}
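The construction of $$B_4$$, its normalization and the claim from footnote 8 that it corresponds to $$\sigma = 1$$ can be verified with a small Python sketch. The centred offsets $$-2, \ldots, 2$$ for the empirical distribution are an assumption matching a kernel centred at the origin:

```python
from math import comb, sqrt

# Row n = 4 of Pascal's triangle yields the binomial filter B_4
n = 4
b4 = [comb(n, k) for k in range(n + 1)]
print(b4)  # [1, 4, 6, 4, 1]

# Normalized weights (prefactor 1/2^n = 1/16); they sum to 1
weights = [v / 2 ** n for v in b4]

# Standard deviation of the empirical distribution centred at the origin
# (offsets -2..2): this is why B_4 approximates a Gaussian with sigma = 1
offsets = range(-(n // 2), n // 2 + 1)
variance = sum(w * x ** 2 for w, x in zip(weights, offsets))
print(sqrt(variance))  # 1.0

# 2D kernel via the outer product B_4^T . B_4 (overall prefactor 1/256)
kernel_2d = [[a * b for b in b4] for a in b4]
print(kernel_2d[2])  # [6, 24, 36, 24, 6]
```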

After the question of the filter kernel to use is now solved, we should start with the discussion of the scale space. It is very common to build a scale pyramid like shown in the following figure.

As mentioned before, the first step is to apply a base scaling to the input image. This already compensates for noise which is usually present in images produced by a camera. Then, the image is repeatedly blurred with our binomial filter until the scaling has doubled. This is the moment where we have blurred and hence reduced the intrinsic resolution of the image so much that the real resolution can be reduced as well without loss of information. This is done by subsampling, e.g. by removing every second pixel, effectively halving the image size. The idea is that the reduced version contains the same information: every structure and change in intensity values present before the subsampling is also present afterwards but described with fewer pixels. The frequency domain also gives us insights here. According to \eqref{eq:GaussianScaleSpace_GaussRelationFrequency}, when we increase the scaling by a factor of $$\sigma_s = 2$$, we multiply our signal with $$\sigma_f = 0.5$$ in the frequency domain, which means that most of the frequencies in the upper half are removed. Due to the different image sizes, we get the notion of an octave, which comprises all images of the same size.

Two benefits arise from the subsampling process. Most obvious is the performance increase. Usually, we build a scale space to do something with it in the next step and it is e.g. much faster to apply derivative filters to a scale pyramid where the levels decrease in size instead of a scale space where all levels have the same size as the input image. The benefit is quite huge since with each subsampling step the total number of pixels is reduced to a quarter (both width and height are halved).
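A quick sanity check of this cost argument; the 512 × 512 input size, the four octaves and the three levels per octave are hypothetical values chosen purely for illustration:

```python
# Hypothetical 512x512 input with three levels per octave; each subsampling
# halves both width and height, i.e. quarters the number of pixels
width = height = 512
levels_per_octave = 3

total_pixels = 0
for octave in range(4):
    total_pixels += levels_per_octave * width * height
    width //= 2
    height //= 2

# Compare with a scale space where all 12 levels keep the full input size
full = 4 * levels_per_octave * 512 * 512
print(total_pixels / full)  # 0.33203125, i.e. roughly a third of the work
```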

But there is also another advantage. Remember that we needed to adjust the size of the rectangle when we want to analyse the same content at different sizes. For the derivative filters, this would mean that we need to increase the kernel size as we increase the blurring. Imagine a tiny $$3 \times 3$$ filter applied to all levels of the same size. There is no way to capture larger image structures with this approach. But due to the subsampling, this problem is implicitly attenuated. When the image size decreases, our $$3 \times 3$$ filter automatically works on a larger local neighbourhood. Of course, we still need different filter sizes for the levels inside each octave but we can re-use the filters across octaves (e.g. the same filter can be applied to level $$I_{11}$$ and level $$I_{21}$$). Note that this also holds true for the binomial filters used to build the scale pyramid itself. In our example, we only need three binomial filters (not counting the initial blurring) to build the complete pyramid and when we apply an iterative scheme we only need one!9

To understand the build process of the scale pyramid better, let us apply it to a small one-dimensional input signal

$$\label{eq:GaussianScaleSpace_ExampleData} \fvec{u}_0 = \left( 1, 3, 1, 5, 11, 2, 8, 4, 4 \right)^T.$$

The following animation shows the steps for two octaves of the build process.

Take a look at step 5 denoting the subsampling process which selects every second value for removal. When you take a closer look at which values are to be removed, you see that they don't convey crucial information. They lie, more or less, on the interpolation line between two other values. This shows that the subsampling is indeed justified and we don't lose too much information when we do it. The same is true for step 10 but not so much for the intermediate blurring steps. We could not recover the shape defined by the signal if we subsampled already in step 7 (the value at point 4 is a bit above the interpolation line between points 3 and 5), for example. The build process of the scale pyramid is also summarized in the table below.

| Level $$i$$ | Octave | Step | Operation | Scale $$\sigma_i$$ |
|---|---|---|---|---|
| 0 | 0 | Input image | $$\fvec{u}_{0}$$ | $$0.5$$ |
| 1 | 1 | Initial blurring | $$\fvec{u}_{10} = \fvec{u}_{0} * G_{\sigma_b}$$ | $$1$$ |
| 2 | 1 | Iterative blurring | $$\fvec{u}_{11} = \fvec{u}_{10} * B_4$$ | $$\sqrt{2} = \sqrt{1^2 + 1^2}$$ |
| 3 | 1 | Iterative blurring | $$\fvec{u}_{12} = \fvec{u}_{11} * B_4$$ | $$\sqrt{3} = \sqrt{(\sqrt{2})^2 + 1^2}$$ |
| 4 | 1 | Iterative blurring | $$\fvec{u}_{13} = \fvec{u}_{12} * B_4$$ | $$2 = \sqrt{(\sqrt{3})^2 + 1^2}$$ |
| 5 | 2 | Subsampling | $$\fvec{u}_{20} = \operatorname{sample}(\fvec{u}_{13})$$ | $$2$$ |
| 6 | 2 | Iterative blurring | $$\fvec{u}_{21} = \fvec{u}_{20} * B_4$$ | $$\sqrt{8} = \sqrt{2^2 + 2^2}$$ |
| 7 | 2 | Iterative blurring | $$\fvec{u}_{22} = \fvec{u}_{21} * B_4$$ | $$\sqrt{12} = \sqrt{(\sqrt{8})^2 + 2^2}$$ |
| 8 | 2 | Iterative blurring | $$\fvec{u}_{23} = \fvec{u}_{22} * B_4$$ | $$4 = \sqrt{(\sqrt{12})^2 + 2^2}$$ |
| 9 | 3 | Subsampling | $$\fvec{u}_{30} = \operatorname{sample}(\fvec{u}_{23})$$ | $$4$$ |
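The build process can be sketched in Python. Note that the initial blurring with $$G_{\sigma_b}$$ is skipped here and the mirrored border handling is my own assumption; the article does not specify how the borders are treated:

```python
def convolve_b4(signal):
    """Blur a 1D signal with the binomial filter B_4 = (1, 4, 6, 4, 1)/16.
    Border values are mirrored (an assumption made for this sketch)."""
    kernel = [1, 4, 6, 4, 1]
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0
        for k, weight in enumerate(kernel):
            j = i + k - 2                          # kernel centred at i
            j = abs(j) if j < 0 else j             # mirror left border
            j = 2 * (n - 1) - j if j >= n else j   # mirror right border
            acc += weight * signal[j]
        out.append(acc / 16)
    return out

def subsample(signal):
    # Keep every second value (indices 0, 2, 4, ...)
    return signal[::2]

u = [1, 3, 1, 5, 11, 2, 8, 4, 4]  # example signal u_0 from the text
for octave in range(2):
    for _ in range(3):   # three iterative blurring steps per octave
        u = convolve_b4(u)
    u = subsample(u)     # scale has doubled -> halve the resolution
print(len(u))  # 3 (the signal length shrinks 9 -> 5 -> 3)
```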

I think only the last column may be a bit confusing. What I wanted to show is the scaling in the respective level. As mentioned before, strictly speaking this requires the underlying continuous function, which we don't know but of which $$\fvec{u}_0$$ is assumed to be a blurred version (with $$\sigma_s = 0.5$$). But this only affects the initial blurring, which I deliberately chose so that we reach a scale of $$\sigma_1 = 1$$ in the first level. This part is not too important, though.

However, the other calculations are important. Essentially, each scale parameter is calculated via \eqref{eq:GaussianScaleSpace_ScaleCombinations} with the scale parameter from the previous level (left operand) and the new increase in scaling introduced by the binomial filter (right operand). The left operand follows a direct iterative scheme and the right operand is identical in each octave. This is the case because I only use $$B_4$$ as blurring operator; other scale pyramids may differ in this point. Whenever the scale parameter has doubled, the signal is subsampled.

Did you notice that the scale parameter for the binomial filter doubled as well? It is $$\sigma_1 = 1$$ in the first and $$\sigma_2 = 2$$ in the second octave. This is due to the subsampling done before and related to the implicit increase of the local neighbourhood which the subsampling process introduces. Since the same filter now works on a broader range of values, we have effectively doubled its standard deviation. This effect is also visible in the animation above. The size of the Gaussian function stays the same when you move e.g. from step 5 to step 6, but the number of values affected by the Gaussian is much larger in step 6 (all of them, actually) than in step 5. You can also think of it this way: suppose you applied the Gaussian as noted in step 6. Now, you want to achieve the same result by applying a Gaussian directly to step 4 (i.e. skipping the subsampling step). How would you need to change your Gaussian? Precisely: you would need to double its width (or, in more mathematical terms, its standard deviation) in order to include the same points in your computation as in step 6.
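The scale column of the table above can be reproduced with this iterative scheme: starting at $$\sigma = 1$$ after the initial blurring, each blurring step applies the combination rule \eqref{eq:GaussianScaleSpace_ScaleCombinations} and each subsampling doubles the filter's effective $$\sigma_B$$:

```python
from math import sqrt, isclose

# sigma = 1 after the initial blurring; B_4 contributes sigma_b = 1 in the
# first octave, and sigma_b doubles with every subsampling step
sigma, sigma_b = 1.0, 1.0
scales = [sigma]
for octave in range(2):
    for _ in range(3):                       # three blurring steps per octave
        sigma = sqrt(sigma ** 2 + sigma_b ** 2)  # combination rule
        scales.append(sigma)
    sigma_b *= 2                             # subsampling doubles sigma_b

print([round(s, 3) for s in scales])
# [1.0, 1.414, 1.732, 2.0, 2.828, 3.464, 4.0]
```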

We have now discussed how different scalings for an image can be achieved and why we may need to adjust the size of the local neighbourhood when analysing the image structure in different scale levels. We have seen how the frequency domain can help us to understand basic concepts and we analysed the scale parameter in more detail. Also, the question of how such a scale space can be built is answered in terms of a scale pyramid.

OK, this all sounds great, but wait: wasn't the intention to analyse the same object appearing at different sizes in different images? How is this job done and how precisely does the scale space help us with that? Well, it is hard to answer these questions without also talking about a concrete feature detector. Usually, we have a detector which operates in our scale pyramid and analyses the image structure in each level. The detector performs some kind of measurement and calculates a function value at each 3D position ($$x, y$$ and $$\sigma$$). Normally, we need to maximize the function and this can e.g. be done across all scale levels. But this depends highly on the feature detector in use. You may be interested in reading my article about the Hessian feature detector if you want to know more about this topic.


1. Permission to use this image thankfully granted by the artist. Further licence information is included on the referenced website.
2. This term is from a chapter about scale space, which I highly recommend.
3. In one of my FED articles, I show an animation with a continuous Gaussian blurring. It is linked to an example which uses a different blurring approach. The thing is that Gaussian blurring is homogeneous meaning that the same operation is applied to every pixel regardless of the content. This can e.g. be improved by considering the local gradient at each position and adjust the amount of blurring based on the magnitude. This is done in isotropic diffusion.
4. Numbers in square brackets are the page numbers of Digital Image Processing (second edition) by Burger and Burge.
5. The rule can be seen by applying the convolution theorem to the expression $$g_{\sigma_s} * g_{\sigma_x}$$: \begin{equation*} 2\pi \cdot G_{\sigma_s} \cdot G_{\sigma_x} = 2\pi \cdot \frac{1}{2 \pi} \cdot e^{-\frac{\left(\omega_1^2+\omega_2^2\right) \cdot \sigma_s^2}{2}} \cdot \frac{1}{2 \pi} \cdot e^{-\frac{\left(\omega_1^2+\omega_2^2\right) \cdot \sigma_x^2}{2}} = \frac{1}{2 \pi} \cdot e^{-\frac{\left(\omega_1^2+\omega_2^2\right) \cdot \left(\sigma_s^2 + \sigma_x^2\right)}{2}} = \frac{1}{2 \pi} \cdot e^{-\frac{\left(\omega_1^2+\omega_2^2\right) \cdot \left(\sqrt{\sigma_s^2 + \sigma_x^2}\right)^2}{2}} = G_{\sqrt{\sigma_s^2 + \sigma_x^2}}, \end{equation*} which gives us exactly the relation, since $$G_{\sqrt{\sigma_s^2 + \sigma_x^2}}$$ transforms directly back to $$g_{\sqrt{\sigma_s^2 + \sigma_x^2}}$$ in the spatial domain. Therefore, it is possible to convolve any signal with $$g_{\sigma_s}$$ and the result again with $$g_{\sigma_x}$$ or directly convolve the signal with $$g_{\sqrt{\sigma_s^2 + \sigma_x^2}}$$. A proof without the detour over the frequency domain and rather using the convolution operator instead can be found in section 4.5.4 of this book (Machine Vision).
6. If you don't have access to Mathematica, you may be able to use the free CDF Player.
7. E.g. the paper Fast Computation of Characteristic Scale Using a Half-Octave Pyramid describes an approach very similar to the one I show here.
8. Why does $$\sigma=1$$ correspond to $$B_4$$? This can be seen by building an empirical distribution from the values in $$B_4$$, i.e. by using the binomial coefficients as function values for a discrete function centred around the origin and then calculating the standard deviation from this distribution. Check out the corresponding Mathematica notebook for implementation details.
9. With iterative I mean that each filter is applied to the result of the previous filtering process. This is done here and hence only $$B_4$$ is needed: $$I_{11} = I_{10} * B_4, I_{12} = I_{11} * B_4$$ and $$I_{13} = I_{12} * B_4$$. Theoretically, you could imagine applying a filter always on the base image, e.g. $$I_{10}$$, to achieve the same result. In this case, three filters are needed: $$I_{11} = I_{10} * G_{\sigma_1}, I_{12} = I_{10} * G_{\sigma_2}$$ and $$I_{13} = I_{10} * G_{\sigma_3}$$