Introduction to the Hessian feature detector for finding blobs in an image


In many computer vision applications, it is important to extract special features from the image which are distinctive, unambiguously to locate and occur in different images showing the same scene. There is a complete subbranch in the computer vision field dedicated to this task: feature matching. Usually, this process consists of three tasks: detection, description and matching. I shortly want to summarize the steps but you can easily find further information on the web or in related books1.

  1. The first step is about the detection of distinctive points in an image. This could be a corner of an object, an intensity blob (e.g. the inner parts of the eyes) or any other special shape or structure you can imagine. Not good candidate points are e.g. straight lines. It is hard to detect a point on a line unambiguously in two images (which point on the line to choose?).
  2. The goal of the description step is to describe the area around a detected point. A lot of techniques can be used here and it is quite common to get a vector with some numbers which describe the area around the point as output. If two points have a similar surrounding, it would be good if the vectors would also have similar numbers.
  3. This is essential for the third and last step: matching. Imagine two images showing the same scene with a different viewpoint. In both images are distinctive points detected and the surrounding of each point described. Which points correspond, i.e. which points show the same part of the scene? This can e.g. be done by measuring the similarity between the description vectors.

For each of these steps, many different algorithms exist. As you may have guessed from the title of this article I want to focus here on the detection step. More precisely, I want to give an introduction to the Hessian detector which is mathematically defined as

\begin{equation} \label{eq:HessianDetector_Definition} \sigma^4 \cdot \det(H_{\sigma}) = \sigma^4 \cdot \left( \partial_{xx} \cdot \partial_{yy} - \partial_{xy}^2 \right) \end{equation}

with \(H_{\sigma}\) as the Hessian matrix calculated at the scale level \(\sigma\) and e.g. \(\partial_{xx} = \frac{\partial^2 I}{\partial^2 x^2}\) denoting the second order derivative of the image \(I\). I just wanted to give the definition of the detector at the beginning; the intention of this article is to give some insights why this equation might indeed be useful. Also, note that the detector is designed to detect blobs in an image (and not corners).

So, what exactly is a blob in an image? It is a homogeneous area with roughly equal intensity values compared to the surrounding. The ideal artificial example is a 2D Gaussian function where the intensity values equally decay in a circle way blending together with the surrounding – visible in the image as a smooth blob. A more realistic example would be the inner part of a sunflower which is usually dark compared to the bright leaves. The following figure shows examples for both. A blob is not necessarily related to a circle like structure though. Any kind of intensity notch may also be detected as a blob. And the idea is that these circles and notches are part of the object and therefore are also visible in other images showing the same object.

3D and density plot showing a Gaussian blob 17180 detected blobs in an image showing sunflowers
Left: blob of the Gaussian function \(G_{\sigma}(x,y) = \frac{1}{2 \pi \sigma^2} \cdot e^{-\frac{x^2+y^2}{2 \sigma ^2}}\) as 3D and density plot. Right: 17180 detected keypoints on a sunflower image2 using a blob detector (here the AKAZE features which use the Hessian detector were used). The circle size corresponds to the scale level in which the keypoint was detected (size of the object the keypoint covers) and the intensity of the circle colour roughly corresponds to the strength of the detected keypoint (more precisely: on the determinant response of the Hessian, but more on this later). The detector did not find the large sunflower blobs in the foreground but it worked quite well for sunflowers more away from the camera. Notice how the inner parts of the sunflowers relate to very strong keypoints (blobs).

The rest of this article is structured as follows: I begin introducing the Hessian matrix, analyse the curvature information it contains and describes how this information is used in the Hessian detector. But I also want to talk a bit about how the detector incorporates in the scale space usually used in feature matching to achieve scale invariance.

The Hessian matrix

Let's start with the Hessian matrix \(H\) (scale index \(\sigma\) for simplicity omitted). Mathematically, this matrix is defined to hold the information about every possible second order derivative (here shown in the 2D case)3:

\begin{equation*} H = \begin{pmatrix} \frac{\partial^2 I}{\partial^2 x^2} & \frac{\partial^2 I}{\partial x \partial y} \\ \frac{\partial^2 I}{\partial y \partial x} & \frac{\partial^2 I}{\partial^2 y^2} \\ \end{pmatrix}. \end{equation*}

Since \(\frac{\partial^2 I}{\partial x \partial y} = \frac{\partial^2 I}{\partial y \partial x}\) this matrix is symmetric4. Conceptually, \(H\) can be understood as the second order derivative of a higher dimensional function. As we know from the 1D case, the second order derivative tells us something about the curvature of a function: is it convex, concave or neither of those? The result here is a number where the signum function tells us which of the cases is true. But what is the curvature of a 2D function?

It turns out that things get more complicated when introducing another dimension (who on earth would have guessed that...). We can't express the curvature as a single number anymore since the question now is: how is the curvature in a certain direction? Imagine you are walking over an arête in the mountains where it goes down on the left and right side and straight on in front of you. What is the curvature when you keep moving ahead? Since there are not many changes in height probably (roughly) zero. And what if you move right directly downwards the precipice? Probably very high since there are a lot of changes in height. So, direction obviously matters.

To get the curvature of a 2D function we need to provide three parameters: the Hessian matrix \(H\), which acts as the curvature tensor containing the information about every direction, a specific direction \(\vec{s}\) in which we are interested and the location \(\left( x_0, y_0 \right)^T\) where we want to know the curvature of. The direction is usually provided as uniform (i.e. \(\left\| \vec{s} \right\|=1\)) column vector. We can then compute the so-called directional derivative

\begin{equation} \label{eq:HessianDetector_DirectionalDerivative} \frac{\partial^2 I}{\partial^2 \vec{s}^2} = \vec{s}^T H \vec{s} \end{equation}

which gives as the curvature (the second order derivative) in the direction \(\vec{s}\). Before showing some visualization, I want to provide a numeric example. For this, consider the following 2D function and suppose you want to know the curvature in the \(45^{\circ}=\frac{\pi}{4}\) direction at the origin

\begin{equation*} f(x,y) = \cos(x), \quad \vec{s} = \begin{pmatrix} \cos\left( \frac{\pi}{4} \right) \\ \sin\left( \frac{\pi}{4} \right) \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} x_0 \\ y_0 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \end{equation*}

It is actually a bit stupid to call this a 2D function since it is just the standard cosine but the resulting graph fits nice to my arête analogy. The first thing to do is to calculate the derivatives of the Hessian matrix

\begin{align*} \frac{\partial^2 f(x, y)}{\partial^2 x^2} &= - \frac{\partial}{\partial x} \sin(x) = -\cos(x) \\ \frac{\partial^2 f(x, y)}{\partial x \partial y} &= 0 \\ \frac{\partial^2 f(x, y)}{\partial y \partial x} &= - \frac{\partial}{\partial y} \sin(x) = 0 \\ \frac{\partial^2 f(x, y)}{\partial x \partial y} &= 0 \end{align*}

and summarize the results in the Hessian matrix

\begin{equation*} H = \begin{pmatrix} -\cos(x) & 0 \\ 0 & 0 \\ \end{pmatrix}. \end{equation*}

Next, we apply \eqref{eq:HessianDetector_DirectionalDerivative} and get

\begin{align*} \frac{\partial^2 f}{\partial^2 \vec{s}^2} = \vec{s}^T H \vec{s} &= \begin{pmatrix} \cos\left( \frac{\pi}{4} \right) & \sin\left( \frac{\pi}{4} \right) \end{pmatrix} \begin{pmatrix} -\cos(x) & 0 \\ 0 & 0 \\ \end{pmatrix} \begin{pmatrix} \cos\left( \frac{\pi}{4} \right) \\ \sin\left( \frac{\pi}{4} \right) \end{pmatrix} \\ &= \begin{pmatrix} \cos\left( \frac{\pi}{4} \right) & \sin\left( \frac{\pi}{4} \right) \end{pmatrix} \begin{pmatrix} -\cos(x) \cos\left( \frac{\pi}{4} \right) \\ 0 \end{pmatrix} \\ &= - \cos\left( \frac{\pi}{4} \right) \cos(x) \cos\left( \frac{\pi}{4} \right) \\ &= - \frac{1}{\sqrt{2}} \frac{1}{\sqrt{2}} \cos(x) \\ &= - \frac{\cos(x)}{2} \end{align*}

i.e. a curvature of \(-\frac{1}{2}\) in the direction of \(\vec{s}\) at our evaluation point \(x_0=0\) (origin).

You can imagine this as follows: at every point of our 2D function, we have a normal vector pointing orthogonal to the surface at that point. The normal vector forms together with the direction \(\vec{s}\) a plane in the 3D space. This plane intersects with our function resulting in a projection which is a 1D function and \eqref{eq:HessianDetector_DirectionalDerivative} is then just the curvature of this function (like from every other 1D function). The following animation shows the discussed points and lets you choose an arbitrary direction.


Animation which shows the directional derivative from \eqref{eq:HessianDetector_DirectionalDerivative} on the function \(f(x)=\cos(x)\). The slider controls the direction for which the curvature should be calculated (in degree). Besides the current direction \(\vec{s}\) (red) also the direction of the maximum curvature \(\vec{e}_1\) (blue) is shown (\(\vec{e}_1\) denotes the first eigenvector, but I come to this point later). The length of the vectors corresponds roughly to the magnitude of the current curvature.

Even though this is very exciting, one is often not interested in the curvature of an arbitrary direction. What is more from interest is the direction of highest and lowest curvature. The good news is that it is also possible to retrieve this information from \(H\). The information lies in the eigenvectors and eigenvalues: the eigenvector \(\vec{e}_1\) points in the direction of the highest curvature with the magnitude \(\lambda_1\). Similarly, \(\vec{e}_2\) corresponds to the direction of lowest curvature with the strength of \(\lambda_2\)5. As you may have noticed, I already included \(\vec{e}_1\) in the previous animation. Can you guess in which direction \(\vec{e}_2\) points and what the value of \(\lambda_2\) is?6

For a 2D function, this means that we can obtain two orthogonal vectors pointing in the highest and lowest direction. The following animation has the aim to visualize this. In order to make things a little bit more interesting, I also used a different function: \(L(x,y)\). This is now a proper 2D function, meaning that both variables are actually used. It is essentially a linear combination of Gaussian functions building a landscape; therefore, I called it Gaussian landscape (if you are interested in the exact definition, check out the corresponding Mathematica notebook, but since the function values do not really matter here I did not want to confuse you).

Curvature analysis in a Gaussian landscape. Two vectors are shown at the current position: the blue vector indicates the first eigenvector \(\vec{e}_1\) and points in the direction of highest curvature (at that point). Likewise, the second eigenvector \(\vec{e}_2\) denotes the direction of lowest curvature. Both vectors are scaled to roughly represent the magnitude of curvature (also shown precisely over the image in form of the eigenvalues \(\lambda_1\) and \(\lambda_2\)). You can use the top 2D slider to choose a different position.

Take your time and explore the landscape a bit. A few cases I want point to:

  1. If you are at the centre of the Gaussian blob (bottom left corner, \(\left( -10, -10 \right)^T\), did you notice that both eigenvalues are equal (\(\lambda_1 = \lambda_2\))? This is because the curvature is the same in every direction so there is no unique minimal or maximal direction. This is for example also true for every point on a sphere.
  2. When you are on the ground, e.g. at position \(\left( -6, -6\right)^T\), there is no curvature at all (\(\lambda_1, \lambda_2 \approx 0\)). This is true for every point on a planar surface.
  3. Another fact is already known from the previous animation: when you walk alongside the arête (e.g. at position \(\left( 0, -5\right)^T\)), the magnitude of the first eigenvalue is high whereas the second is roughly zero (\( \left| \lambda_1 \right| > 0, \lambda_2 \approx 0\)). This is for example also true for tube-like objects (e.g. a cylinder).
  4. Last but not least, when you descent from the hill (e.g. at position \(\left( 0, -9\right)^T\)) there is curvature present in both directions but the first direction is still stronger (\( \left| \lambda_1 \right| > \left| \lambda_2 \right| > 0\)). This is a very interesting point, as we will see later.

What has this to do with blobs in an image? Well, the first and last case I have just noted to fall directly into the definition of a blob. Whereas the first one is more circle-like and the last one is more notch-like. Since we want to detect both cases, it does not really matter which one we have. One way to detect these cases based on the eigenvalues is by using the Gaussian curvature

\begin{equation} \label{eq:HessianDetector_GaussianCurvature} K = \lambda_1 \cdot \lambda_2. \end{equation}

The main observation is that this product is only large when both eigenvalues are large. This is true for the first and fourth case and false for the other two since the product is already low when \(\lambda_2 \approx 0\) (note that it is assumed that the eigenvalues are descendingly sorted, i.e. \(\lambda_1 \geq \lambda_2\)).

We can now detect blobs at each image position by calculating the Hessian matrix via image derivatives, their eigenvalues and then the Gaussian curvature \(K\). Wherever \(K\) is high we can label the corresponding pixel position as a blob. However, wherever something is high is not a very precise formulation, and, usually, comes with the cost of a threshold in practice. Since we have also a strength of the blob (the determinant response) we can easily overcome this by using only the largest blobs in an image and these are the local maxima of \(K\). It is e.g. possible to sort the local maxima descendingly and use the \(n\) best (or define a threshold, or use just every maximum, or do something clever).

But wait a moment, when \eqref{eq:HessianDetector_GaussianCurvature} is already the answer to our initial problem (finding blobs) why does \eqref{eq:HessianDetector_Definition} use the determinant of \(H\) and this strange \(\sigma^4\) prefactor? The second part of this question is a bit more complicated and postponed to the next section. But the first part of this question is easily answered: there is no difference, since the equation

\begin{equation*} \det(H) = \lambda_1 \cdot \lambda_2 \end{equation*}

holds. This means that \eqref{eq:HessianDetector_Definition} already calculates the Gaussian curvature. Before moving to the discussion of the scale parameter though, I want to show the result when the determinant of the Hessian is applied to the Gaussian landscape.

Determinant response of the Hessian applied to the Gaussian landscape
Determinant response of the Hessian applied to the Gaussian landscape \(L(x,y)\). Grey level intensity is used to encode the magnitude of the determinant response whereas darker areas stand for a very low response (absence of curvature) and brighter areas are symbolic for a high response (presence of curvature). Additionally, the local maxima are labelled in red. Note that every local maximum denotes a blob in the function.

As you can see, the determinant response clearly separates our landscape with the maxima identifying locations which are either circle-like (bottom left) or notch-like (rest) blobs. Next, I want to apply this technique to a real image.

Airport image from a bird's eye view
Airport image7 from a bird's eye view.
Determinant response of the airport image at scale level 2
Determinant response of the Hessian applied to the airport image at the scale level \(\sigma_{2}=4\). In this case already \eqref{eq:HessianDetector_Definition} was use, i.e. including the scale normalization parameter which will be discussed in the next section. The image derivatives are calculated via Gaussian derivatives (with \(\sigma=4\)). The image is scaled for better contrast and comparison as \(R_{\sigma_2}\) like defined in \eqref{eq:HessianDetector_ResponseImageScale} (next section).

Notice the small intensity blobs present in the original image which are very well captured by the determinant response (e.g. the small white dots in the bottom of the image). Even though we can now detect some relevant parts of the image, did you notice that other parts are not detected? Especially the larger blobs like the round structure in the top right corner of the image. This is where the scale normalization parameter \(\sigma\) gets important and is the concern of the next section.

Scale normalization

So, the first important observation is that blobs may occur in different sizes. This was already the case for the sunflower image at the beginning. Not all sunflowers have the same size, and, additionally, they project to different sizes on the camera depending on the distance to the camera. What we want to achieve is a so-called scale invariance: blobs should be detected regardless of the size in the image.

But how can we achieve scale invariance? The answer is to build a scale space, e.g. by blurring the image repeatedly with a Gaussian function. The result is a scale pyramid with a different intrinsic image resolution on each level. This resolution is denoted by a \(\sigma_i\) parameter per level \(i\) which corresponds to a Gaussian function \(G_{\sigma_i}\) used to create the level (via convolution, i.e. \(I*G_{\sigma_i}\)). With increasing scale level the \(\sigma_i\) value also increases and the image is blurred more intense.

Our extrema search is therefore extended by an additional dimension: we can now search for maxima in the determinant response not only spatially in the image but also across scale in different resolutions of the same image. At each position, we can consider our determinant response also as a function of scale and pick the scale level with the highest response. But before we can do that we first need to solve a problem.

The thing is that the Gaussian blurring flattens the intensities in the image out since in each blurring step the intensity values come closer together due to the weighting process. Per se this is not a problem and is precisely the job of Gaussian blurring: flatten out outliers to create a smooth transition between the values. But the weighting process also decreases the possible range of values. If we, for example, start with an image where the range of intensity values lies in \([10;20]\) then after the blurring this range must be smaller, e.g. \([12;17]\) (depending on the distribution of the values and the precise \(\sigma_i\) value). At the end, when we use a really high \(\sigma_i\) value we have only one flat surface left with exactly one intensity value remaining and this is the average intensity of the first image (this is because the convolution with a Gaussian converges to the average image value).

Ok, but why is this now a problem? The fact is that when the intensity range decreases over scale it is also very likely that the derivative values will decrease (fewer changes) and hence the determinant response decreases. But when the determinant responses tend to get lower in higher scale levels we introduce a systematic error while searching for extrema across scale. Responses from higher levels will always have the burden of being computed based on a smaller intensity range compared to the first scale levels of the pyramid. Note that this does not mean that a response value for a specific location strictly decreases over scale. It is still possible that locally a new blob structure is detected which produces (slightly) higher response values than the previous level. But is it correct and significant? To find out, we must find a compensation for the decreased intensity range and we need to compensate more as higher we climb in our scale pyramid.

The idea is to use the \(\sigma_i\) values for compensation. Since they increase with higher scales, they fit our requirements. And this is exactly the reason why \eqref{eq:HessianDetector_Definition} contains \(\sigma^4\) as prefactor8 which stretches our intensity range again to be (roughly) equal as before the blurring process (e.g. \([10.4;19.7]\) instead of \([12;17]\); don't expect exact results in a noisy and discrete world). This makes the determinant response \(\det(H_{\sigma})\) comparable across scale. With this problem now solved we can move on and detect extrema over scale. For this, consider the following figure which shows the determinant response for the position in the blob of the top right corner of the airport image (also labelled in the image below).

Determinant response over different scale levels with and without scale normalization
Determinant response over 50 scale levels beginning with \(\sigma_1=2\) and proceeding further in steps of \(\sigma=2\) up to \(\sigma_{50}=100\). The position of interest is \((x_0,y_0)=(532, 98)^T\) and the responses are calculated using Gaussian derivatives with the corresponding \(\sigma_i\) per scale level. The determinant response with (orange) and without (blue) the scale normalization prefactor is shown. The maximum for both cases is depicted with a circle. Note the different plot ranges for the two cases shown on the left respectively right side.

Technically, we have a unique maximum in both cases. Even when we do not normalize we get a peak. However, it is not as wide as in the normalized case. But back to the question: is this peak correct and significant? To answer the question let me emphasize that this is only the trend for a specific location and when you look at the original image you see that in a very fine scale space there is just a white area (i.e. roughly zero determinant response). But when increasing scale the borders of the circle-like blob get visible and we have therefore an increase in response. This doesn't answer the question about correctness and significance, though. For this, let's consider the following figure which shows the global response range for all response values in all locations over scale for both cases.

Range of determinant response values in all locations as a function of scale
Minimum (dotted) and maximum (solid) determinant response values in all locations as a function of scale. The range for both with (orange) and without (blue) scale normalization is shown. The maximum in both cases is depicted with a circle. Note again the different plot ranges for the two cases.

This graph shows very clearly what I meant with the decrease of response over scale when no normalization is used. In this case, we start with a large range in the first levels but the range narrows very fast to a nearly zero wide range. This explains why the magnitude of the response values have been so low in the previous figure (e.g. only about 0.00006 at the peak) and really reinforces doubts about the correctness and significance of the peak. However, the range of the normalized response gives a different picture. Even though the range does not stay fixed (but this wasn't expected and highly depends on the example image), it's width remains rather solid. No strict decay and the values do not get ridiculously small. This shows very clearly that scale normalization is indeed important.

To answer the question about correctness and significance (finally!), let's compare the maximum response from our specific location \((x_0,y_0)=(532, 98)^T\) with the global maximum response of all locations in all scale levels9

\begin{align*} \frac{ \max_{i=1,\ldots,50}(\det(H_{\sigma_i}(532, 98))) }{ \max_{i=1,\ldots,50}(\det(H_{\sigma_i})) } &= % MathJax offers a way to apply css to math elements: http://docs.mathjax.org/en/latest/tex.html#html % The math must contain a symbol, i.e. it should not be empty. Therefore, an arbitrary symbol is printed with the same colour as the face of the circle \frac{ \style{border: 2.2px solid #5e81b5; border-radius: 50%; background: black; width: 0.6em; height: 0.6em;}{\color{black}{o}} }{ \style{border: 2.2px solid #5e81b5; border-radius: 50%; background: white; width: 0.6em; height: 0.6em;}{\color{white}{o}} } = \frac{0.0000621447}{1.10636} = 0.0000561703 \\ \frac{ \max_{i=1,\ldots,50}(\sigma_i^4 \cdot \det(H_{\sigma_i}(532, 98))) }{ \max_{i=1,\ldots,50}(\sigma_i^4 \cdot \det(H_{\sigma_i})) } &= \frac{ \style{border: 2.2px solid #e19c24; border-radius: 50%; background: black; width: 0.6em; height: 0.6em;}{\color{black}{o}} }{ \style{border: 2.2px solid #e19c24; border-radius: 50%; background: white; width: 0.6em; height: 0.6em;}{\color{white}{o}} } = \frac{15.3165}{34.6083} = 0.442567. \end{align*}

The idea behind this is that we need to compare different response values over scale for extrema detection and this relation measures how near the values are to each other. As you can see, the maximum response without normalization is absolutely not significant and hence does not represent a correct extremum. On the other hand, the normalized response does not suffer from a huge difference between the values and therefore I would conclude this to be a very significant and correct extremum.

To finish up, let's take a look at the response image for the different scale levels. In order to make responses across scale visually comparable and to achieve a better image contrast, I scaled each response value with the minimum and maximum over all scale levels and all pixel locations, i.e. the response image \(R_{\sigma_i}\) for the scale size \(\sigma_i\) is calculated as

\begin{equation} \label{eq:HessianDetector_ResponseImageScale} R_{\sigma_i} = \frac{ \sigma_i^4 \cdot \det(H_{\sigma_i}) - \min_{i=1,\ldots,50}(\sigma_i^4 \cdot \det(H_{\sigma_i})) } { \max_{i=1,\ldots,50}(\sigma_i^4 \cdot \det(H_{\sigma_i})) - \min_{i=1,\ldots,50}(\sigma_i^4 \cdot \det(H_{\sigma_i})) }. \end{equation}

This way all response values stay in \([0;1]\) and a response value of e.g. 0.4 means the same in all images, i.e. does not depend on the current scale level. But this formula is only used to achieve a better visualization; feel free to ignore it.

Airport image from a bird's eye view
Airport image with the analysed position \((x_0,y_0)=(532, 98)^T\) labelled in red.

Determinant response over the complete scale space. The image is scaled for better contrast and comparison as \(R_{\sigma_i}\) like defined in \eqref{eq:HessianDetector_ResponseImageScale}.

As you can see, the blob at our location of interest \((x_0,y_0)=(532, 98)^T\) is now detected and \(\sigma_{16}=32\) gives indeed a fairly strong response. With the same technique we can also detect other blobs: find the maximum determinant response across scale for each image position.

This was also the last part of our journey which started with \eqref{eq:HessianDetector_Definition} were we took a deep breath at the Hessian matrix and analysed the curvature information it contains. We moved forward while keeping the Gaussian curvature \(K\) in our bag and used this technique to find extrema responses in the image. We needed to take a break and first got around about correct scale normalization. This problem solved via the \(\sigma_i^4\) prefactor, we could search for extrema over the complete scale pyramid and so the circle completed again with \eqref{eq:HessianDetector_Definition}.

List of attached files:

  • HessianDetector_GaussianBlob.nb [PDF] (Mathematica notebook which I used to create the Gaussian blob image)
  • HessianDetector_HessianCurvature.nb [PDF] (Mathematica notebook which contains the analysis of the curvature information embedded in the Hessian matrix, e.g. it contains the first two animations I used here)
  • HessianDetector_DeterminantResponse.nb [PDF] (Mathematica notebook for the calculation of the determinant response; both on the Gaussian landscape and on the airport image. Contains also the figures for the local response trend and global response range)

1. I can e.g. recommend chapter 16 (p. 493ff) of the Learning OpenCV 3 book by Gary Bradski and Adrian Kaehler. If you want a deeper understanding I would recommend looking at a specific feature detector. E.g. the book Digital Image Processing (2nd edition) by Burger and Burge has a great chapter about SIFT (chapter 25, p. 609ff).
2. Image from Frank Vincentz (Own work) [GFDL or CC-BY-SA-3.0], via Wikimedia Commons.
3. It is possible to derive the definition of the Hessian matrix but I will only present the results here. For more information see e.g. chapter 3 (p. 88ff) of the book Data Visualization - Principles and Practice (2nd edition) by Alexandru C. Telea.
4. Due to a property called symmetry of second derivatives.
5. If you want to know why this is true, you may are interested in reading chapter 7 (p. 254ff) of the book Data Visualization - Principles and Practice (2nd edition) by Alexandru C. Telea.
6. It is probably not too hard to see that the second eigenvector points in the \(\alpha = 90^{\circ}\) direction (the \(\alpha = 270^{\circ}\) direction has an equal curvature in this case), i.e. \(\vec{e}_2 = \begin{pmatrix} \cos(90^{\circ}) \\ \sin(90^{\circ}) \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}\) with \(\lambda_2=0\) (zero curvature).
7. Part of an image by Hynek Moravec (Self-photographed) [GFDL, CC-BY-SA-3.0 or CC BY 2.5], via Wikimedia Commons
8. To be honest, I don't know why the exponent \(^4\) is used here. What I can guess is that it starts with \(\sigma\) and per derivative this factor gets somehow squared, so we have \(\sigma\) for \(\partial_x\) and \(\partial_y\) and \(\sigma^2\) for \(\partial_{xx}, \partial_{yy}\) and \(\partial_{xy}\). When we plug this into the determinant formula \begin{equation*} \sigma^2 \partial_{xx} \cdot \sigma^2 \partial_{yy} + \sigma^2 \partial_{xy} \cdot \sigma^2 \partial_{xy} = \sigma^4 \cdot \left( \partial_{xx} \cdot \partial_{yy} - \partial_{xy}^2 \right) = \sigma^4 \cdot \det(H_{\sigma}) \end{equation*} we get \eqref{eq:HessianDetector_Definition}. You can find more about this topic in the paper Feature Detection with Automatic Scale Selection by Tony Lindeberg. But the important concept here is that there must be some kind of compensation, otherwise, extrema detection would not work.
9. Values correspond with the circles shown in the figures. The exact values are calculated in the Mathematica notebook.

Introduction to convolution with link to filters in computer vision


In this article, I want to discuss one of the basic topics in computer vision: convolution. It is probably one of the most important topics in this field since it offers some very basic operations used on images. When it comes to images, the talk about convolution is also the talk about filters and kernels. The blurring of an image is for example implemented by convolution. But there is also a strict mathematical definition. Here I want to correlate these two aspects by introducing convolution in the 1D discrete, moving further to the 1D continuous and concluding in the 2D discrete (and for the sake of completeness also short the continuous) world.

1D – discrete

Let's start in the 1D discrete world. I want to discuss an example from BetterExplained, which helped me a lot when I tried to understand the topic (after frustratingly leaving the corresponding Wikipedia site). Suppose you are a doctor leading a medical study where you oversee the test of a new medicine (a new pill). For this new pill exists a strict plan of intake. A patient has to come at three consecutive days to your office: intaking three pills on the first, two pills on the second and one pill on the third day. Now additionally assume that you not only have one patient but multiple instead. To make things more complicated, the patients start on different days with their treatment. Let's assume that on the first day just one patient comes to your office. He/She will get three pills from you. One the second day two new patients and the known one from yesterday come to your office. The patient from yesterday gets two and the other two will get three pills each leaving a total of eight pills you have to deposit in your office (to satisfy all demands). Of course, you want to keep track of the number of pills you need each day. You could go on with business as usual, but this will probably get complicated and confusing. Maybe there is a better way...

We will now start to optimize this problem a little bit. We have two variables in this scenario: the plan for the intake of the pills and the number of patients coming to your office each day. The plan is a list of values denoting the number of pills needed per treatment day (per patient), \((3, 2, 1)^T\) in this example. The number of patients can also be stored in a list denoting the number of new patients arrive at a specific day. Let us assume that this list would look like \((1, 2, 3, 4, 5)^T\) in this example (1 new patient at the first, 2 new patients at the second day, etc.). It might help to write this problem in a table format

Table showing the plan and medicine list
The first version of the plan and patient list in a table format.

The top row shows the plan with the number of pills per timestamp in decreasing order. Below that are the number of new patients arranged. Each column should give us the information, how many patients need \(n\) number of pills. Since this is a very dynamic scenario, one table won't be enough, so let's assume that the numbers of the patient row will slide in from left to the right. But if we would slide the patients in the defined order, the number 5 would come first, so we have to reverse the patient list \((5, 4, 3, 2, 1)^T\). If we now multiply the numbers from the first and second row of each column, we have the information how many pills are needed to satisfy the demand for the new-comers requesting 3 pills and for the others as well. To retrieve the total number of needed pills at each timestamp, we just have to add up the values from the last row. This concept is also summarized in the following animation, where you can play around with the timestamp.


Clear overview of the described problem showing the plan of pills and the patient list at one timestamp. The number in the bottom left corner denotes the total amount of pills needed for one day. You can change the timestamp with the controls underneath the table.

I hope you agree that this way of writing the problem is much clearer than the manual (naive) approach. You may have noticed that I highlighted a few terms in the previous paragraphs. In summary:

  • Plan: a function \(k(\tau)\) with the day of treatment as argument and the number of needed pills as the function value.
  • Patients: a function \(s(\tau)\) with the day of the study as argument and the number of new patients coming at that day as the function value.
  • Timestamp: independent time variable of this problem, e.g. the number of pills needed at one day \(t\) in the timeline.
  • Reverse order: \(s(-\tau)\) as reverse of the patient list helping to model the problem.
  • Multiplication of \(s(-\tau) \cdot k(\tau)\) to retrieve the number of needed pills per treatment day.
  • Addition of every relevant (both functions are different from zero) result from the previous multiplication step to retrieve the total number of pills needed at one day \(t\), e.g.
    \( p(3) = s(3) \cdot k(1) + s(2) \cdot k(2) + s(1) \cdot k(3) = 9 + 4 + 1 = 14\).

All the previous steps concluding in one single formula leaves

\begin{equation} \label{eq:Convolution_1DDiscrete} \definecolor{energy}{RGB}{114,0,172} p({\color{Mulberry}t}) = {\color{LimeGreen}s}({\color{Mulberry}t}) * {\color{Aquamarine}k}({\color{Mulberry}t}) = {\color{Mahogany}\sum_{\tau=-\infty}^{\infty}} {\color{LimeGreen}s}({\color{Orange}-}\tau + {\color{Mulberry}t}) {\color{Red}\cdot} {\color{Aquamarine}k}(\tau) = {\color{Mahogany}\sum_{\tau = 0}^{n-1}} {\color{LimeGreen}s}({\color{Orange}-}\tau + {\color{Mulberry}t}) {\color{Red}\cdot} {\color{Aquamarine}k}(\tau), \end{equation}

where \(*\) denotes the convolution operator and \(n = 2\) in our case. And, as you may have guessed, this is exactly the definition of convolution in the 1D discrete case. Note the iteration over the complete relevant domain at each timestamp \( \left({\color{Mahogany}\sum}\right) \). It is very important for practical scenarios to determine \(n\), so that the computation can be performed efficiently. Luckily, it is often the case that the functions are just defined on a fixed range with only zero values outside of the boundaries.

To visualize \eqref{eq:Convolution_1DDiscrete}, you can imagine the patient list sliding over the plan list, combining each related values and resulting in a function value for the current timestamp (slide position). The situation is also illustrated in the following animation.


Plot of the function \(p(t)\) build over time \(t\) by sliding the patient list \(s(-\tau + t)\) over the plan list \(k(\tau)\). The line between the points is just drawn for better clarification. The functions are not really continuous.

The abbreviations I used so far are not arbitrarily chosen. \(s\) stands for the signal, since convolution is often used with signals (audio, image signals etc.). \(k\) is the kernel, which defines the kind of operation performed. There are for example kernels for blurring, averaging or finding the derivations of the signal. This is especially true when convolution is used on images. Even the convolution variable \( \tau \) is chosen deliberately. Both \( t \) and \( \tau \) are denoted to some point in time. \( t \) is used to specify the timestamp currently of interest and \( \tau \) is used to iterate over the complete relevant time range needed to calculate the result of \( t \).

Signals can be very large and therefore it would be simpler to flip the kernel and slide it over the signal, instead of the other way around like before. Hence it is very handy that the convolution operation has the nice property of commutativity, meaning \( s(t)*k(t) = k(t)*s(t) \). In the previous example, this would mean that the patient list stays fixed and the plan list gets reversed and slid over the patient list. The result is the same. I leave it to the reader to verify this as an exercise.

1D – continuous

The next step is to extend the known definition to the 1D continuous case. This turns out to be pretty simple actually since we basically have to replace the sum with an integral:

\begin{equation*} s(t)*k(t) = \int_{-\infty}^{\infty} \! s(-\tau + t) \cdot k(\tau) \, \mathrm{d}\tau. \end{equation*}

So, instead of adding up a discrete set of values, we integrate over the complete (relevant) range. This means that the result value is the area underneath \( s(-\tau + t) \cdot k(\tau) \). However, the integral over the complete range \(]-\infty;\infty[\) makes it really hard to name the result function analytically. It is often possible though if at least one of the function is only defined on a limited range or the functions are not very complicated itself. To look at an example, I use the increasing step function

\begin{equation*} s(t) = \begin{cases} 1 & t \geq 0 \\ 0 & t < 0 \end{cases} \end{equation*}

as signal. In an image, this could for example model a sudden contrast increase on an edge. As kernel, the difference of Gaussian function shall be used

\begin{equation*} k(t) = \operatorname{DoG}_{\sigma_1,\sigma_2}(t) = G_{\sigma_1}(t) - G_{\sigma_2}(t). \end{equation*}

This is an approximation of the normalized version of the Laplacian of Gaussian \(\nabla^{2}\) which, in 2D images, implements some kind of edge detection. It is also a good model of the edge detection our retina ganglion cells implement. The following animation shows how the resulting function forms a short impulse response.


Example for the convolution operation in the 1D continuous case. A simple step function gets convolved with a difference of Gaussian (\(\sigma_1 = 0.1, \sigma_2 = 1\)) resulting in a short impulse. The resulting function is analytically describable but too long and complicated to show here. The definition of the resulting function is available in the appended Mathematica notebook.

So, why does the response function look like it looks? Well, first note that the \(\operatorname{DoG}\)-function consists of a positive part surrounded by two negative parts. When sliding over the step function, first it overlaps in one negative part. In this range the multiplication \( s(-\tau + t) \cdot k(\tau) =_{t \geq 0} k(\tau) \) is non-zero and nothing more than the kernel function itself, since the step function forms an identity function for \(t \geq 0\). The integral is therefore also negative resulting in a negative response. When sliding further, the positive part of the kernel function reaches \(t \geq 0\). Now there is a positive and negative part active which results in a lowering response (the two integral areas compensate each other). Until the second negative part reaches the step function, the response increases further, but after this point, the response decreases again due to the new negative part coming in. The response will approximate (but never reach) 0, since the \(\operatorname{DoG}\)-function consist of two Gauss-functions which differ just in their \(\sigma\) value, meaning the areas finally compensate each other completely

\begin{equation*} \int_{-\infty}^{\infty} \! \operatorname{DoG}_{\sigma_1,\sigma_2}(t) \, \mathrm{d}t = 0. \end{equation*}

2D – discrete/continuous

It is now time to add an additional dimension. This means that our signal depends now on two variables, e.g. \(x\) and \(y\) and we finally reached the image domain. In the convolution formulas, this results in an additional sum respectively integral. The following table summarizes the convolution formula in the 1D/2D and discrete/continuous cases.

1D 2D
Discrete \( s(t)*k(t) = \sum_{\tau = 0}^{n-1} s(-\tau + t) \cdot k(\tau) \) \( s(x, y)*k(x, y) = \sum_{\tau = 0}^{n-1} \sum_{\kappa = 0}^{n-1} s(-\tau + x, -\kappa + y) \cdot k(\tau, \kappa) \)
Continuous \( s(t)*k(t) = \int_{-\infty}^{\infty} \! s(-\tau + t) \cdot k(\tau) \, \mathrm{d}\tau \) \( s(x, y)*k(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \! s(-\tau + x, -\kappa + y) \cdot k(\tau, \kappa) \, \mathrm{d}\tau \mathrm{d}\kappa \)

So, we now integrate over the complete relevant area around each point and each point means now all points from a 2D array. This sounds like an expensive operation, what it sometimes is, but when the kernel size is small compared to the image size, it is manageable. Personally, I have never used the 2D continuous convolution formula before (just mentioned for completeness), so I stick to the 2D discrete case from now on.

I already mentioned that convolution is a very important operation in computer vision. In this domain also the term linear filtering is used. One part, the signal, is most of the time the image itself. The interesting part is the kernel since it defines how the image will be transformed. The kernel is centred and flipped around the current pixel position1 and then each element of the kernel gets multiplied with the corresponding pixel value. The current pixel with the kernel centre and the surroundings of the kernel with the neighbours of the current pixel. This is the area around a position. Note that here the kernel is flipped and not the signal. This is possible due to the above-mentioned property of commutativity and in general a good idea since the kernel will usually be much smaller compared to the image. For clarification, I first want to look at the matrix \(A\), which contains sequenced numbers, and an arbitrary \(3 \times 3\) kernel \(K\)

\begin{equation*} A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & {\color{Aquamarine}5} & 6 \\ 7 & 8 & 9 \\ \end{pmatrix} \quad \text{and} \quad K = \begin{pmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \\ \end{pmatrix} \end{equation*}

For the highlighted position, this results in the following operation

\begin{equation*} (A*K)_{22} = 9 k_{11}+8 k_{12}+7 k_{13}+6 k_{21}+5 k_{22}+4 k_{23}+3 k_{31}+2 k_{32}+k_{33}. \end{equation*}

This shows how the flipping of the kernel works: basically, each index is mirrored independently around the centre axis, so that e.g. \( k_{13} \rightarrow k_{31} \). But to also give an example with real numbers, let's look at the following image \(I\) and the kernel \(K_{\nabla^{2}}\)

\begin{equation*} I = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 1 & 2 & 0 \\ 0 & 0 & {\color{Aquamarine}1} & 0 \\ 3 & 0 & 0 & 0 \end{pmatrix} \quad \text{and} \quad K_{\nabla^{2}} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix} \end{equation*}

and apply the kernel, again on the highlighted position

\begin{alignat*}{4} 0 \cdot 1 & {}+{} & 1 \cdot 2 & {}+{} & 0 \cdot 0 & {}+{} & \\ 1 \cdot 0 & {}+{} & (-4) \cdot 1 & {}+{} & 1 \cdot 0 & {}+{} & \\ 0 \cdot 0 & {}+{} & 1 \cdot 0 & {}+{} & 0 \cdot 0 & {}{} & = {\color{Aquamarine}-2} = (I*K_{\nabla^{2}})_{33}. \end{alignat*}

If you compare this computation with the formula from the table, you might notice the similarity. In total, the above step is repeated for every image element (this is the sliding of the kernel over the image) resulting in the following image (assuming that every pixel value outside the image is zero2):

\begin{equation*} I * K_{\nabla^{2}} = \left( \begin{array}{cccc} 1 & -2 & -1 & 1 \\ 1 & -1 & -5 & 2 \\ 3 & 2 & {\color{Aquamarine}-2} & 1 \\ -12 & 3 & 1 & 0 \\ \end{array} \right) \end{equation*}

Since this is an article about computer vision, it would be a shame to not show at least one image. Therefore, in the following an image and its convolved version using the same kernel as in the previous example.

Original image Image convolved with the Laplacian filter
Original image (left) and the filtered image is shown as absolute version \( \left| I * K_{\nabla^{2}} \right| \) (the kernel can produce negative values). The image shows my old (and already dead) cat Mila – in case you ever wondered where this website has its name from.

Did you noticed that I used the index \(\nabla^2\) for the kernel? This is not a coincidence since the kernel \(K_{\nabla^2}\) is precisely the 2D discrete version of the Laplacian operator. And like in the 1D continuous case from above this filter responses best on edges, visible for instance at the whiskers of the cat.

There are many other filters and I don't have the intention to cover all of them here. Instead, I link you to another website where you see other filters and examples which run directly in the browser. When you scroll down a bit, you can even test your own kernels interactively.

List of attached files:


1. This is an implementation detail not taken into account by the mathematical definition. You would need to manually add some constant to centre your kernel, e.g. \( + 1\) for a \(3 \times 3\) kernel in the 2D discrete case: \( s(x, y)*k(x, y) = \sum_{\tau = 0}^{n-1} \sum_{\kappa = 0}^{n-1} s(-\tau + x, -\kappa + y) \cdot k(\tau + 1, \kappa + 1) \).
2. This is one case of the general border handling problem. See this Wikipedia section or page 113 of the book Digital Image Processing (2nd edition) by Burger and Burge for further details.

Scale invariance through Gaussian scale space


When analysing images, it is often desired to make the analysis scale invariant. Objects may appear in different sizes in an image either because the image resolution is different or because the distance of the object to the camera varies between a series of images. Invariance with respect to scale is a property of image processing algorithms which incorporate this observation and adjust its calculations accordingly, e.g. by making sure that a similar output is produced for two images showing the same object but in different sizes. This is not easy, but a general idea is to consider multiple scale levels of an image meaning the same image is analysed in different resolutions. Feature algorithms like SIFT implement this idea by building a (Gaussian) scale space where the original image is repeatedly blurred to achieve the behaviour of different scales. In this article, I want to talk a bit about scale spaces and what they imply.

I start by analysing what scaling in an image actually means and how we can implement it. We will see that scaling has also an influence on the area we need to consider when we try to analyse a certain structure. The goal of the first part is to make the necessity of a scale space clear and convey insights into what we need to consider when using it. The second part shows an example of how we could build a scale space by trying to cover important ideas in this field.

Why size matters and why the intrinsic resolution matters even more

Let’s dive directly in by considering an example. The following two images show the same object but in different sizes. The original image (right) has a size of \(3866 \times 4320\) pixels (width by height) whereas the left one is scaled-down by 0.25 to the new size of \(966 \times 1079\) pixels using the nearest neighbour technique (i.e. no interpolating).

Triss Merigold in quarter resolution Triss Merigold in full resolution
The same image once scaled-down by 0.25 (left) and once in full resolution (right). The image shows a wonderful painting by Mike Norse (©2014 VoodooHammer)1 of Triss Merigold (Witcher 3).

In the following, I want to analyse the local structure around the eye in more detail. If we would restrict the analysis to the mentioned rectangle of size \(100 \times 100\), the resulting image regions are very different since the rectangles extend to different parts of the image. This is not surprising since a distance of 100 px depends on the size of the image. We can say that the meaning of 100 px varies depending on the image size, i.e. it is not scale invariant.

Two images with different resolution and same rectangle size
Detailed view with same rectangle size (left: scaled-down, right: full resolution). The centre of the two rectangles is roughly placed at the same physical location (left notch of the eye).

We need to adjust the rectangle's size, obviously. Since the left image is scaled-down by \( \frac{1}{4} \), lets scale-up the right rectangle by 4, so that the new size becomes \(400 \times 400\) pixels.

Two images with different resolution and adjusted rectangle size
Detailed view with adjusted rectangle size, so that the covered regions contain the same image patch (left: scaled-down, right: full resolution).

Now, both rectangles cover roughly the same image region, but the original one has much more information (more details, e.g. visible in the reflection of the eye). You can also argue with the entropy here (the amount of information per area): the entropy of the left image is lower than the entropy of the right image. To achieve scale invariance, we want the entropy to be roughly the same in both image regions. Remember: we want to analyse an object regardless of its size, but when one area shows the object in much more detail then the result still varies depending on the image size.

Let’s think a little bit about the scaling operation. What does scaling mean? We decrease our image size, obviously, and that means we lose information. This is unavoidable since we use fewer pixels to describe the object, fewer pixels to describe all the beautiful details in the image (of course, this is more noticeable in the eye than in the skin – repeating pixels don’t give us more information).

The exact loss of information depends also on the reducing technique. A very easy technique would be to just sub-sample the image by removing every n-th pixel (e.g. remove every second pixel to retrieve a half-sized version of the image). A more sophisticated approach takes the surrounding pixels into account. Blurring is one way to calculate a new pixel value depending on the local neighbourhood. It also reduces information, since the neighbouring pixels are weighted which levels-out the variation of pixel differences. The kind of weighting depends on the kind of blurring. One very common way is to use Gaussian blurring for the job.

Gaussian blurring offers a very nice way to visualize the loss of information. For this, we leave our spatial domain and take a tour to the frequency domain. The Fourier transform of a Gaussian function is very special: it remains a Gaussian function. What changes is the \(\sigma\) value. The Gaussian function in the spatial domain

\begin{equation} \label{eq:GaussianScaleSpace_GaussSpatial} g_{\sigma}(x,y) = \frac{1}{2 \pi \sigma^2} \cdot e^{-\frac{x^2+y^2}{2 \sigma ^2}} \end{equation}

leads to (after the Fourier transform)

\begin{equation} \label{eq:GaussianScaleSpace_GaussFrequency} G_{\sigma}(\omega_1, \omega_2) = \frac{1}{2 \pi} \cdot e^{-\frac{\left(\omega_1^2+\omega_2^2\right) \cdot \sigma^2}{2}} \end{equation}

in the frequency domain. So, with increasing \(\sigma_s\) in the spatial domain, the corresponding \(\sigma_f\) in the frequency domain decreases. More precisely, the relation \begin{equation} \label{eq:GaussianScaleSpace_GaussRelationFrequency} \sigma_s = \frac{1}{\sigma_f} \end{equation} can be observed. This means a wide Gaussian curve in the spatial domain corresponds to a peaked Gaussian curve in the frequency domain. This is also visualised in the following animation.


Fourier transform of a Gaussian function. Higher values for \(\sigma=\sigma_s\) lead to a wide function in the spatial domain and a peaked function in the frequency domain. The top graph shows the functions from \eqref{eq:GaussianScaleSpace_GaussSpatial} (blue) and \eqref{eq:GaussianScaleSpace_GaussFrequency} (orange) whereas the bottom graph shows a slice along the \(y\)-axis for the same functions. You can use the slider to control the \(\sigma\) value.

Notice how a large \(\sigma\)-value (high blurring) results in a very small peak in the frequency domain. This gives us important insights when considering the convolution theorem: convolution in the spatial domain corresponds to multiplication in the frequency domain meaning \(g(x) * k(x)\) is the same like \((2\pi)^{\frac{n}{2}} \cdot G(\omega) \cdot K(\omega)\), with \(n\) the number of independent variables; e.g. \(2\) in the image case. So, when we blur our image (which is done by convolution) with a Gaussian function with a high \(\sigma\) value, it is the same as multiplying the image in the frequency domain with a Gaussian function with a low \(\sigma\) value. And by applying this low-pass filter (keeping only frequencies from the lower band), we remove all the high frequencies since a narrow Gaussian consists of mostly zeros outside the peak. The thing is that these high frequencies are responsible for holding all the high details in the image together. High frequencies, i.e. fast changes of intensity values, are needed to show the colour changes details usually consists of (otherwise we would not call them details but rather homogeneous areas).

Back to the example from the beginning. The original image (with increased rectangle size, i.e. \(400 \times 400 \) pixels) contains too many details compared to the reduced version, so let’s blur it with a Gaussian:

Two images with different resolution and adjusted rectangle size with the right image blurred by a Gaussian
Detailed view with a blurred version (\(\sigma = 1.936\)) of the right image (left: scaled-down, right: full resolution). The left rectangle has size of \(100 \times 100\) and the right \(400 \times 400\) pixels.

The number of details (or the entropy) is now much more similar between the two images. E.g. the stripes in the reflection of the eye are not as clearly visible anymore as before. This is exactly what we wanted to achieve. But, compared to the reduced version, the transitions are smoother, which is a direct effect of the blurring instead of a local neighbour approach.

Also note that for the right image the actual image size has not decreased, we still use the same number of pixels as before. Only the intrinsic image resolution2 changed: fewer details in the image. That is why we increased the size of the rectangle to account for the different real image resolution. If we would now analyse the two image regions, we would probably get a better result as before, since we cover the same spatial area (rectangle size) and consider roughly the same amount of details (blurring). This is precisely what we wanted to achieve for scale invariance since it allows us to analyse the same objects in images of different sizes.

So far, we created one additional scaled version of the original image (0.25 of the original size). Since I already used the term scale space before, there is indeed more. The idea is to add a third dimension to the original image which covers all possible scalings. This results in a three-dimensional function \(\hat{I}(x,y,\sigma)\) with the scaling parameter \(\sigma\) as the third dimension. It is then possible to consider (theoretically) all possible scaled version of an image. In the real world, of course, this process will be discretely sampled3. In the end, scale invariance is achieved by considering different scale levels up to a certain point and applying the used algorithms for all these scale levels (or adapt the algorithms to work directly inside the scale space, c.f. the SIFT paper).

I will show an example of a scale space later. But first, summarize what we want from a scale space so that it can help us with scale invariance:

  • We need to adjust the area we want to analyse to fit the current image size. This is especially important for filter operations where a filter kernel operates at each location in a local neighbourhood. The size of the neighbourhood (or the kernel size) corresponds to the size of the rectangle above. Hence, we may need to adjust the kernel size to be scale invariant. Alternatively, we stick to one kernel size but reduce the image size instead (see below).
  • To consider an object at different sizes, we need different scaled versions of an image. More importantly, the intrinsic information (or the entropy) must change. We can use Gaussian blurring for this job which offers very good quality.

Before I continue to show how a scale space can be built based on these requirements, I want to take a detailed look at the scale parameter \(\sigma\) and how it behaves when the image is blurred multiple times.

A detailed look at the scale parameter \(\sigma\)

You may have wondered why I blurred the image with \(\sigma=1.936\). Actually, this is a little bit complicated. It is based on the assumption that the original image is already a blurred version of the continuous function with \(\sigma_s=0.5\) [616]4, which is somehow related to the Nyquist–Shannon sampling theorem (sampling frequency of \(\sigma_f=2\) is related to \(\sigma_s=0.5\)). Next, if we double our \(\sigma\) value, we should also double the size of the rectangle [642]. Since our rectangle is 4 times larger than the original one, we do the same with the \(\sigma\) value.

So, why \(\sigma = 1.936\) instead of \(\sigma = \sigma_s \cdot 4 = 2\)? Well, this is probably only from theoretical interest; you won’t see a difference between images blurred with \(\sigma=2\) or \(\sigma=1.936\). The reason is that we want our continuous function to be blurred with \(\sigma = 2\). The continuous function is the theoretical view of the light signals from the real world which were captured by the camera to produce the image. But we don’t have the continuous function, only the image which is a blurred version of the continuous function (with \(\sigma_s=0.5\)). So we have (\(f\) denotes the continuous function and \(I\) the discrete image function)

\begin{equation} \label{eq:GaussianScaleSpace_BlurContinuousFunction} f * g_{\sigma} = \underbrace{f * g_{\sigma_s}}_{=I} * g_{\sigma_x} = I * g_{\sigma_x} \end{equation}

and this means we need to calculate the unknown \(\sigma_x\). There is a rule5 [761] which states

\begin{equation} \label{eq:GaussianScaleSpace_ScaleCombinations} \sigma = \sqrt{\sigma_s^2 + \sigma_x^2} \end{equation}

which only needs to be solved by \(\sigma_x\) and that is where the mysterious value comes from

\begin{equation*} \sigma_x = \sqrt{\sigma^2 - \sigma_s^2} = \sqrt{4 - 0.25} \approx 1.936. \end{equation*}

\(g_{\sigma_x}\) is therefore the kernel we need to convolve the image \(I\) with to retrieve a result which is equivalent to the continuous function blurred with \(\sigma = 2\) as stated by \eqref{eq:GaussianScaleSpace_BlurContinuousFunction}. But I agree when you say that this operation really does not change the world (it may be important in other contexts or with other values though). I should also mention that in real scenarios a base scaling for the image is defined (e.g. \(\sigma = 1.6\)) [617], which already compensates for some of the noise in the image. This blurred version of the input image is then the basis for the other calculations.

Note that \eqref{eq:GaussianScaleSpace_ScaleCombinations} applies also when the image is blurred multiple times. Blurring the image two times with a Gaussian is equivalent to one Gaussian function with a \(\sigma\) value calculated according to \eqref{eq:GaussianScaleSpace_ScaleCombinations}. This will get important when we build our scale space later.

Let's discuss the appropriate size of the rectangle a bit more. To make things easier, assume we only use squares and therefore only need to calculate a width \(w\). The idea is to connect the width with the \(\sigma\) value in such a way that the entropy inside the rectangle stays roughly constant. This means when \(\sigma\) increases \(w\) must also increase to account for the information loss which the blurring is responsible for (assuming the real resolution of the image stays the same), i.e. the loss of entropy due to the blurring is recovered by expanding the area of interest. The same logic can be applied when \(\sigma\) decreases. Of course, there must be a base width \(w_0\) (like 100 px) and the actual width is computed relative to this base width. More concretely, since the original continuous function has a base blurring of \(\sigma_s\) and we want to achieve a blurring of \(\sigma\), the appropriate width can be calculated as

\begin{equation} \label{eq:GaussianScaleSpace_RectangleSize} w = w_0 \cdot \frac{\sigma}{\sigma_s}. \end{equation}

In the previous example this was

\begin{equation*} w = 100\,\text{px} \cdot \frac{2}{0.5} = 400\,\text{px}. \end{equation*}

If you want to play a bit around with different rectangle sizes and scale levels, you can use the Mathematica notebook6 which I created for this purpose. There you can also move the rectangle to a different position to analyse other structures.

Example screenshot of the Mathematica script
Screenshot of the Mathematica script with the example image. The amount of blurring is controlled by the top slider and the centre of the locator can be changed by dragging the mouse in the left image. Higher values for \(\sigma\) increase the rectangle in a way so that the entropy inside stays roughly constant by applying \eqref{eq:GaussianScaleSpace_RectangleSize}.

Building a scale space

In this section, I want to show a concrete example of a scale space. Note the word example here. There is no unique way of building a scale space but there are general ideas which I try to embed here. Also, the example presented is not used by me in practical applications, I invented it solely for educational purposes. However, it is easy to find other approaches on the web7.

First, we must talk about the scaling operation (again). Like before, I want to use Gaussian functions to repeatedly blur the image and hence build the scale space. A Gaussian offers good quality and has a nice counterpart in the frequency domain (namely itself). But it comes at the cost of performance. Blurring an image means convolving the image with a discrete kernel. The Gaussian function is by its nature continuous. Hence, we need to discretize it, e.g. by taking the function values at every integer position and put them in a kernel. When you look at the graph from the animation again, you will see that this does not result in nice numbers. We definitively need floating operations for the job and this will be computationally intensive. Remember, that blurring is an essential operation when we want to change the intrinsic resolution of an image and hence the underlying operation should be fast.

Luckily, there are other filters which approximate the behaviour of a Gaussian-like binomial filters which are commonly used. They are extracted from Pascal's triangle where each row in the triangle can be used as a filter with a specific variance. Since the triangle is built only by addition of integers, the resulting filters will also consist of integers. This increases our computational efficiency.

Here, I want to approximate a Gaussian with \(\sigma=1\) since this makes the calculations later easier. The corresponding binomial filter \(B_n\) to use in this case is extracted from the row \(n=4\) and is defined as filter as8

\begin{equation*} B_4 = \frac{1}{16} \cdot \left( 1, 4, 6, 4, 1 \right)^T. \end{equation*}

Note the prefactor which makes sure that the filter weights up to 1 and can be applied at the end of the convolution operation. The good thing about binomial filters is that they fit very well to a Gaussian function like the following figure depicts.

Plot of two graphs: binomial values and a Gaussian function with the same variance
Values from the binomial filter \(B_4\) plotted (red) together with a continuous Gaussian function (blue) with the same standard deviation of \(\sigma=1\). The binomial filter is modelled as an empirical distribution which is centred around the origin.

As you can see, even though we have only five integer values the Gaussian is very well approximated. Other rows from Pascal's triangle have similar behaviour but with different \(\sigma\)-values. To use the filters for images (which require a 2D-filter), separation can be used. But it is also possible to expand the vector to a matrix via the outer product:

\begin{equation*} B_4^T \cdot B_4 = \frac{1}{256} \cdot \left( \begin{array}{ccccc} 1 & 4 & 6 & 4 & 1 \\ 4 & 16 & 24 & 16 & 4 \\ 6 & 24 & 36 & 24 & 6 \\ 4 & 16 & 24 & 16 & 4 \\ 1 & 4 & 6 & 4 & 1 \\ \end{array} \right) \end{equation*}

After the question of the filter kernel to use is now solved, we should start with the discussion of the scale space. It is very common to build a scale pyramid like shown in the following figure.

Pyramid of the example scale space
Pyramid of the example scale space. Four levels of the same size make up an octave whereby the first level is either produced by the initial blurring of the input image or by subsampling. In the subsampling process, every second pixel is removed effectively halving the image size. \(I_0\) is the input image and \(I_{oi}\) denotes the \(i\)-th image in octave \(o\).

As mentioned before, the first step is to apply a base scaling to the input image. This already compensates for noise which is usually present in images produced by a camera. Then, the image is repeatedly blurred with our binomial filter until the scaling doubled. This is the moment where we have blurred and hence reduced the intrinsic resolution of the image so much that also the real resolution can be reduced as well without loss of information. This is done by subsampling which e.g. removes every second pixel effectively halving the image size. The idea is that the reduced version contains the same information: every structure and change in intensity values present before the subsampling is also present afterwards but described with fewer pixels. The frequency domain does also give us insights here. According to \eqref{eq:GaussianScaleSpace_GaussRelationFrequency}, when we increase the scaling by a factor of \(\sigma_s = 2\), we multiply our signal with \(\sigma_f = 0.5\) in the frequency domain which means that most of the frequencies in the upper half are removed. Due to the different image sizes, we get the notion of an octave which comprises all images of the same size.

Two benefits arise from the subsampling process. Most obvious is the performance increase. Usually, we build a scale space to do something with it in the next step and it is e.g. much faster to apply derivative filters to a scale pyramid where the levels decrease in size instead of a scale space where all levels have the same size as the input image. The benefit is quite huge since with each subsampling process the total number of pixels is reduced by 0.25 (both the width and height are reduced by 0.5).

But there is also another advantage. Remember that we needed to adjust the size of the rectangle when we want to analyse the same content at different sizes. For the derivative filters, this would mean that we need to increase its kernel size as we increase the blurring. Imagine a tiny \(3 \times 3\) filter applied to all levels of the same size. There is no way to capture larger image structures with this approach. But due to the subsampling, this problem is implicitly attenuated. When the image size increases, our \(3 \times 3\) filter automatically works on a larger local neighbourhood. Of course, we still need different filter sizes for the levels inside each octave but we can re-use the filters across octaves (e.g. the same filter can be applied to level \(I_{11}\)) and level \(I_{21}\)). Note that this holds also true for the binomial filters used to build the scale pyramid itself. In our example, we only need three binomial filters (not counting the initial blurring) to build the complete pyramid and when we apply an iterative scheme we only need one!9.

To understand the build process of the scale pyramid better, let us apply it to a small one-dimensional input signal

\begin{equation} \label{eq:GaussianScaleSpace_ExampleData} \vec{u}_0 = \left( 1, 3, 1, 5, 11, 2, 8, 4, 4 \right)^T. \end{equation}

The following animation shows the steps for two octaves of the build process.


Steps to build a scale space for \(\vec{u}_0\). The build process for the first two octaves is shown and \(B_4\) is used as blurring filter. Besides the signal values in blue also a Gaussian function with \(\sigma = 1\) is shown exemplarily in orange. Note that the Gaussian is centred at the signal value in the middle, but the convolution operation shifts this function over each signal value (values outside the borders are considered zero), of course. The first step applies \(\sigma_b = \sqrt{1^2-0.5^2} = 0.866025\) as base scaling so that the original continuous function \(f\) (\(\sigma_s = 0.5\) is assumed) is lifted to a scale of \(\sigma_1 = 1\).

Take a look at step 5 denoting the subsampling process which selects every second value for removal. When you take a closer look which values are to be removed, you see that they don't convey utterly important information. They merely lie in the interpolation between two other values (more or less). This shows that the subsampling is indeed justified and we don't lose too much information when we do it. The same is true for step 10 but not so much for the intermediate blurring steps. We cannot recover the shape defined by the signal if we would already subsample in step 7 (the value at point 4 is a bit above the interpolation line between point 3 and 5), for example. The build process of the scale pyramid is also summarized in the table below.

Level \(i\) Octave Step Operation Scale \(\sigma_i\)
0 0 Input image \(\vec{u}_{0}\) \(0.5\)
1 1 Initial blurring \(\vec{u}_{10} = \vec{u}_{0} * G_{\sigma_b}\) \(1\)
2 1 Iterative blurring \(\vec{u}_{11} = \vec{u}_{10} * B_4\) \(\sqrt{2} = \sqrt{1^2 + 1^2}\)
3 1 Iterative blurring \(\vec{u}_{12} = \vec{u}_{11} * B_4\) \(\sqrt{3} = \sqrt{(\sqrt{2})^2 + 1^2}\)
4 1 Iterative blurring \(\vec{u}_{13} = \vec{u}_{12} * B_4\) \(2 = \sqrt{(\sqrt{3})^2 + 1^2}\)
5 2 Subsampling \(\vec{u}_{20} = \operatorname{sample}(\vec{u}_{13})\) \(2\)
6 2 Iterative blurring \(\vec{u}_{21} = \vec{u}_{20} * B_4\) \(\sqrt{8} = \sqrt{2^2 + 2^2}\)
7 2 Iterative blurring \(\vec{u}_{22} = \vec{u}_{21} * B_4\) \(\sqrt{12} = \sqrt{(\sqrt{8})^2 + 2^2}\)
8 2 Iterative blurring \(\vec{u}_{23} = \vec{u}_{22} * B_4\) \(4 = \sqrt{(\sqrt{12})^2 + 2^2}\)
9 2 Subsampling \(\vec{u}_{30} = \operatorname{sample}(\vec{u}_{23})\) \(4\)

I think only the last column may be a bit confusing. What I wanted to show is the scaling in the respective layer. As mentioned before, this requires, being a stickler, the underlying continuous function which we don't know but where it is assumed that \(\vec{u}_0\) is a blurred version of it (with \(\sigma_s = 0.5\)). But this only affects the initial blurring which I deliberately chose so that we reach a scale of \(\sigma_1 = 1\) in the first level. But this part is not too important.

However, the other calculations are important. Essentially, each parameter is calculated via \eqref{eq:GaussianScaleSpace_ScaleCombinations} with the scale parameter from the previous level (left operand) and the new increase in scaling introduced by the binomial filter (right operand). The left operand follows a direct iterative schema and the right operand is identical in each octave. This is the case because I only use \(B_4\) as blurring operator. Other scale pyramids may differ in this point. Whenever the scale parameter doubled, the signal is subsampled.

Did you notice that the scale parameter for the binomial filter doubled as well? It is \(\sigma_1 = 1\) in the first and \(\sigma_2 = 2\) in the second octave. This is due to the subsampling done before and related to the implicit increase of local neighbourhood which the subsampling process introduces. Since the same filter now works on a broader range of values we have effectively doubled its standard deviation. This effect is also visible in the animation above. The size of the Gaussian function stays the same when you move e.g. from step 5 to step 6. But the number of values affected by the Gaussian are much larger in step 6 (all actually) than in step 5. You can also think this way: suppose you applied the Gaussian as noted in step 6. Now, you want to achieve the same result but rather applying a Gaussian directly to step 4 (i.e. skipping the subsampling step). How would you need to change your Gaussian? Precisely, you need to double its width (or in a more mathematically term: double its standard deviation) in order to include the same points in your computation as in step 6.

We have now discussed how different scalings for an image can be achieved and why we may need to adjust the size of the local neighbourhood when analysing the image structure in different scale levels. We have seen how the frequency domain can help us to understand basic concepts and we analysed the scale parameter in more detail. Also, the question of how such a scale space can be built is answered in terms of a scale pyramid.

Ok, this sounds all great, but wait, was the intention not to analyse the same object which appears in different sizes in different images? How is this job done and how precisely does the scale space help us with that? Well, it is hard to answer these questions without talking about a concrete feature detector. Usually, we have a detector which operates in our scale pyramid and analyses the image structure in each level. The detector performs some kind of measurement and calculates a function value at each 3D position (\(x, y\) and \(\sigma\)). Normally, this function value needs to get maximised somehow and this process can e.g. be done across all scale levels for a certain position. But this depends highly on the feature detector in use. You are maybe interested in reading my article about the Hessian feature detector if you want to know more about this topic.

List of attached files:


1. Permission to use this image thankfully granted by the artist. Further licence information is included on the referenced website.
2. This term is from a chapter about scale space, which I highly recommend.
3. In one of my FED articles, I show an animation with a continuous Gaussian blurring. It is linked to an example which uses a different blurring approach. The thing is that Gaussian blurring is homogeneous meaning that the same operation is applied to every pixel regardless of the content. This can e.g. be improved by considering the local gradient at each position and adjust the amount of blurring based on the magnitude. This is done in isotropic diffusion.
4. Numbers in square brackets are the page numbers of Digital Image Processing (2nd edition) by Burger and Burge.
5. The rule can be seen by applying the convolution theorem to the expression \(g_{\sigma_s} * g_{\sigma_x}\): \begin{equation*} 2\pi \cdot G_{\sigma_s} \cdot G_{\sigma_x} = 2\pi \cdot \frac{1}{2 \pi} \cdot e^{-\frac{\left(\omega_1^2+\omega_2^2\right) \cdot \sigma_s^2}{2}} \cdot \frac{1}{2 \pi} \cdot e^{-\frac{\left(\omega_1^2+\omega_2^2\right) \cdot \sigma_x^2}{2}} = \frac{1}{2 \pi} \cdot e^{-\frac{\left(\omega_1^2+\omega_2^2\right) \cdot \left(\sigma_s^2 + \sigma_x^2\right)}{2}} = \frac{1}{2 \pi} \cdot e^{-\frac{\left(\omega_1^2+\omega_2^2\right) \cdot \left(\sqrt{\sigma_s^2 + \sigma_x^2}\right)^2}{2}} = G_{\sqrt{\sigma_s^2 + \sigma_x^2}}, \end{equation*} which gives us exactly the relation, since \( G_{\sqrt{\sigma_s^2 + \sigma_x^2}} \) transforms directly back to \( g_{\sqrt{\sigma_s^2 + \sigma_x^2}} \) in the spatial domain. Therefore, it is possible to convolve any signal with \(g_{\sigma_s}\) and the result again with \(g_{\sigma_x}\) or directly convolve the signal with \(g_{\sqrt{\sigma_s^2 + \sigma_x^2}}\). A proof without the detour over the frequency domain and rather using the convolution operator instead can be found in section 4.5.4 of this book (Machine Vision).
6. If you don't have access to Mathematica, you may be able to use the free CDF Player.
7. E.g. the paper Fast Computation of Characteristic Scale Using a Half-Octave Pyramid describes an approach very similar to the one I show here.
8. Why does \(\sigma=1\) correspond to \(B_4\)? This can be seen by building an empirical distribution from the values in \(B_4\), i.e. by using the binomial coefficients as function values for a discrete function centred around the origin and then calculating the standard deviation from this distribution. Check out the corresponding Mathematica notebook for implementation details.
9. With iterative I mean that each filter is applied to the result of the previous filtering process. This is done here and hence only \(B_4\) is needed: \(I_{11} = I_{10} * B_4, I_{12} = I_{11} * B_4\) and \(I_{13} = I_{12} * B_4\). Theoretically, you could imagine applying a filter always on the base image, e.g. \(I_{10}\), to achieve the same result. In this case, three filters are needed: \(I_{11} = I_{10} * G_{\sigma_1}, I_{12} = I_{10} * G_{\sigma_2}\) and \(I_{13} = I_{10} * G_{\sigma_3}\)