Introduction to convolution with link to filters in computer vision

In this article, I want to discuss one of the basic topics in computer vision: convolution. It is probably one of the most important topics in this field since it offers some very basic operations used on images. When it comes to images, the talk about convolution is also the talk about filters and kernels. The blurring of an image is for example implemented by convolution. But there is also a strict mathematical definition. Here I want to correlate these two aspects by introducing convolution in the 1D discrete, moving further to the 1D continuous and concluding in the 2D discrete (and for the sake of completeness also short the continuous) world.

1D – discrete

Let's start in the 1D discrete world. I want to discuss an example from BetterExplained, which helped me a lot when I tried to understand the topic (after frustratingly leaving the corresponding Wikipedia site). Suppose you are a doctor leading a medical study where you oversee the test of a new medicine (a new pill). For this new pill exists a strict plan of intake. A patient has to come at three consecutive days to your office: intaking three pills on the first, two pills on the second and one pill on the third day. Now additionally assume that you not only have one patient but multiple instead. To make things more complicated, the patients start on different days with their treatment. Let's assume that on the first day just one patient comes to your office. He/She will get three pills from you. One the second day two new patients and the known one from yesterday come to your office. The patient from yesterday gets two and the other two will get three pills each leaving a total of eight pills you have to deposit in your office (to satisfy all demands). Of course, you want to keep track of the number of pills you need each day. You could go on with business as usual, but this will probably get complicated and confusing. Maybe there is a better way...

We will now start to optimize this problem a little bit. We have two variables in this scenario: the plan for the intake of the pills and the number of patients coming to your office each day. The plan is a list of values denoting the number of pills needed per treatment day (per patient), $$(3, 2, 1)^T$$ in this example. The number of patients can also be stored in a list denoting the number of new patients arrive at a specific day. Let us assume that this list would look like $$(1, 2, 3, 4, 5)^T$$ in this example (1 new patient at the first, 2 new patients at the second day, etc.). It might help to write this problem in a table format

The top row shows the plan with the number of pills per timestamp in decreasing order. Below that are the number of new patients arranged. Each column should give us the information, how many patients need $$n$$ number of pills. Since this is a very dynamic scenario, one table won't be enough, so let's assume that the numbers of the patient row will slide in from left to the right. But if we would slide the patients in the defined order, the number 5 would come first, so we have to reverse the patient list $$(5, 4, 3, 2, 1)^T$$. If we now multiply the numbers from the first and second row of each column, we have the information how many pills are needed to satisfy the demand for the new-comers requesting 3 pills and for the others as well. To retrieve the total number of needed pills at each timestamp, we just have to add up the values from the last row. This concept is also summarized in the following animation, where you can play around with the timestamp.

I hope you agree that this way of writing the problem is much clearer than the manual (naive) approach. You may have noticed that I highlighted a few terms in the previous paragraphs. In summary:

• Plan: a function $$k(\tau)$$ with the day of treatment as argument and the number of needed pills as the function value.
• Patients: a function $$s(\tau)$$ with the day of the study as argument and the number of new patients coming at that day as the function value.
• Timestamp: independent time variable of this problem, e.g. the number of pills needed at one day $$t$$ in the timeline.
• Reverse order: $$s(-\tau)$$ as reverse of the patient list helping to model the problem.
• Multiplication of $$s(-\tau) \cdot k(\tau)$$ to retrieve the number of needed pills per treatment day.
• Addition of every relevant (both functions are different from zero) result from the previous multiplication step to retrieve the total number of pills needed at one day $$t$$, e.g.
$$p(3) = s(3) \cdot k(1) + s(2) \cdot k(2) + s(1) \cdot k(3) = 9 + 4 + 1 = 14$$.

All the previous steps concluding in one single formula leaves

$$\label{eq:Convolution_1DDiscrete} \definecolor{energy}{RGB}{114,0,172} p({\color{Mulberry}t}) = {\color{LimeGreen}s}({\color{Mulberry}t}) * {\color{Aquamarine}k}({\color{Mulberry}t}) = {\color{Mahogany}\sum_{\tau=-\infty}^{\infty}} {\color{LimeGreen}s}({\color{Orange}-}\tau + {\color{Mulberry}t}) {\color{Red}\cdot} {\color{Aquamarine}k}(\tau) = {\color{Mahogany}\sum_{\tau = 0}^{n-1}} {\color{LimeGreen}s}({\color{Orange}-}\tau + {\color{Mulberry}t}) {\color{Red}\cdot} {\color{Aquamarine}k}(\tau),$$

where $$*$$ denotes the convolution operator and $$n = 2$$ in our case. And, as you may have guessed, this is exactly the definition of convolution in the 1D discrete case. Note the iteration over the complete relevant domain at each timestamp $$\left({\color{Mahogany}\sum}\right)$$. It is very important for practical scenarios to determine $$n$$, so that the computation can be performed efficiently. Luckily, it is often the case that the functions are just defined on a fixed range with only zero values outside of the boundaries.

To visualize \eqref{eq:Convolution_1DDiscrete}, you can imagine the patient list sliding over the plan list, combining each related values and resulting in a function value for the current timestamp (slide position). The situation is also illustrated in the following animation.

The abbreviations I used so far are not arbitrarily chosen. $$s$$ stands for the signal, since convolution is often used with signals (audio, image signals etc.). $$k$$ is the kernel, which defines the kind of operation performed. There are for example kernels for blurring, averaging or finding the derivations of the signal. This is especially true when convolution is used on images. Even the convolution variable $$\tau$$ is chosen deliberately. Both $$t$$ and $$\tau$$ are denoted to some point in time. $$t$$ is used to specify the timestamp currently of interest and $$\tau$$ is used to iterate over the complete relevant time range needed to calculate the result of $$t$$.

Signals can be very large and therefore it would be simpler to flip the kernel and slide it over the signal, instead of the other way around like before. Hence it is very handy that the convolution operation has the nice property of commutativity, meaning $$s(t)*k(t) = k(t)*s(t)$$. In the previous example, this would mean that the patient list stays fixed and the plan list gets reversed and slid over the patient list. The result is the same. I leave it to the reader to verify this as an exercise.

1D – continuous

The next step is to extend the known definition to the 1D continuous case. This turns out to be pretty simple actually since we basically have to replace the sum with an integral:

$$\label{eq:Convolution_1DContinuous} s(t)*k(t) = \int_{-\infty}^{\infty} \! s(-\tau + t) \cdot k(\tau) \, \mathrm{d}\tau.$$

So, instead of adding up a discrete set of values, we integrate over the complete (relevant) range. This means that the result value is the area underneath $$s(-\tau + t) \cdot k(\tau)$$. However, the integral over the complete range $$]-\infty;\infty[$$ makes it really hard to name the result function analytically. It is often possible though if at least one of the function is only defined on a limited range or the functions are not very complicated itself. To look at an example, I use the increasing step function

\begin{equation*} s(t) = \begin{cases} 1 & t \geq 0 \\ 0 & t < 0 \end{cases} \end{equation*}

as signal. In an image, this could for example model a sudden contrast increase on an edge. As kernel, the difference of Gaussian function shall be used

\begin{equation*} k(t) = \operatorname{DoG}_{\sigma_1,\sigma_2}(t) = G_{\sigma_1}(t) - G_{\sigma_2}(t). \end{equation*}

This is an approximation of the normalized version of the Laplacian of Gaussian function which is created by applying the Laplace $$\nabla^{2}$$ operator to a Gaussian function. In 2D images, this implements edge detection based on the second order derivatives. Since noise amplifies with increasing order of derivatives, it is often a good idea to smooth the signal first with a Gaussian. The Laplacian of Gaussian combines both operations (smoothing and derivative calculation). It is also a good model of the edge detection our retina ganglion cells implement. The following animation shows how the response of \eqref{eq:Convolution_1DContinuous} forms a short impulse response.

So, why does the response function look like it looks? Well, first note that the $$\operatorname{DoG}$$-function consists of a positive part surrounded by two negative parts. When sliding over the step function, first it overlaps in one negative part. In this range the multiplication $$s(-\tau + t) \cdot k(\tau) =_{t \geq 0} k(\tau)$$ is non-zero and nothing more than the kernel function itself, since the step function forms an identity function for $$t \geq 0$$. The integral is therefore also negative resulting in a negative response. When sliding further, the positive part of the kernel function reaches $$t \geq 0$$. Now there is a positive and negative part “active” which results in a lowering response (the two integral areas compensate each other). Until the second negative part reaches the step function, the response increases further, but after this point, the response decreases again due to the new negative part coming in. The response will approximate (but never reach) 0, since the $$\operatorname{DoG}$$-function consist of two Gauss-functions which differ just in their $$\sigma$$ value, meaning the areas finally compensate each other completely

\begin{equation*} \int_{-\infty}^{\infty} \! \operatorname{DoG}_{\sigma_1,\sigma_2}(t) \, \mathrm{d}t = 0. \end{equation*}

2D – discrete/continuous

It is now time to add an additional dimension. This means that our signal depends now on two variables, e.g. $$x$$ and $$y$$ and we finally reached the image domain. In the convolution formulas, this results in an additional sum respectively integral. The following table summarizes the convolution formula in the 1D/2D and discrete/continuous cases.

1D 2D
Discrete $$s(t)*k(t) = \sum_{\tau = 0}^{n-1} s(-\tau + t) \cdot k(\tau)$$ $$s(x, y)*k(x, y) = \sum_{\tau = 0}^{n-1} \sum_{\kappa = 0}^{n-1} s(-\tau + x, -\kappa + y) \cdot k(\tau, \kappa)$$
Continuous $$s(t)*k(t) = \int_{-\infty}^{\infty} \! s(-\tau + t) \cdot k(\tau) \, \mathrm{d}\tau$$ $$s(x, y)*k(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \! s(-\tau + x, -\kappa + y) \cdot k(\tau, \kappa) \, \mathrm{d}\tau \mathrm{d}\kappa$$

So, we now integrate over the complete relevant area around each point and “each point” means now all points from a 2D array. This sounds like an expensive operation, what it sometimes is, but when the kernel size is small compared to the image size, it is manageable. Personally, I have never used the 2D continuous convolution formula before (just mentioned for completeness), so I stick to the 2D discrete case from now on.

I already mentioned that convolution is a very important operation in computer vision. In this domain also the term linear filtering is used. One part, the signal, is most of the time the image itself. The interesting part is the kernel since it defines how the image will be transformed. The kernel is centred and flipped around the current pixel position1 and then each element of the kernel gets multiplied with the corresponding pixel value. The current pixel with the kernel centre and the surroundings of the kernel with the neighbours of the current pixel. This is the area around a position. Note that here the kernel is flipped and not the signal. This is possible due to the above-mentioned property of commutativity and in general a good idea since the kernel will usually be much smaller compared to the image. For clarification, I first want to look at the matrix $$A$$, which contains sequenced numbers, and an arbitrary $$3 \times 3$$ kernel $$K$$

\begin{equation*} A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & {\color{Aquamarine}5} & 6 \\ 7 & 8 & 9 \\ \end{pmatrix} \quad \text{and} \quad K = \begin{pmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \\ \end{pmatrix} \end{equation*}

For the highlighted position, this results in the following operation

\begin{equation*} (A*K)_{22} = 9 k_{11}+8 k_{12}+7 k_{13}+6 k_{21}+5 k_{22}+4 k_{23}+3 k_{31}+2 k_{32}+k_{33}. \end{equation*}

This shows how the flipping of the kernel works: basically, each index is mirrored independently around the centre axis, so that e.g. $$k_{13} \rightarrow k_{31}$$. But to also give an example with real numbers, let's look at the following image $$I$$ and the kernel $$K_{\nabla^{2}}$$

\begin{equation*} I = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 1 & 2 & 0 \\ 0 & 0 & {\color{Aquamarine}1} & 0 \\ 3 & 0 & 0 & 0 \end{pmatrix} \quad \text{and} \quad K_{\nabla^{2}} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix} \end{equation*}

and apply the kernel, again on the highlighted position

\begin{alignat*}{4} 0 \cdot 1 & {}+{} & 1 \cdot 2 & {}+{} & 0 \cdot 0 & {}+{} & \\ 1 \cdot 0 & {}+{} & (-4) \cdot 1 & {}+{} & 1 \cdot 0 & {}+{} & \\ 0 \cdot 0 & {}+{} & 1 \cdot 0 & {}+{} & 0 \cdot 0 & {}{} & = {\color{Aquamarine}-2} = (I*K_{\nabla^{2}})_{33}. \end{alignat*}

If you compare this computation with the formula from the table, you might notice the similarity. In total, the above step is repeated for every image element (this is the sliding of the kernel over the image) resulting in the following image (assuming that every pixel value outside the image is zero2):

\begin{equation*} I * K_{\nabla^{2}} = \left( \begin{array}{cccc} 1 & -2 & -1 & 1 \\ 1 & -1 & -5 & 2 \\ 3 & 2 & {\color{Aquamarine}-2} & 1 \\ -12 & 3 & 1 & 0 \\ \end{array} \right) \end{equation*}

Since this is an article about computer vision, it would be a shame to not show at least one image. Therefore, in the following an image and its convolved version using the same kernel as in the previous example.

Did you noticed that I used the index $$\nabla^2$$ for the kernel? This is not a coincidence since the kernel $$K_{\nabla^2}$$ is precisely the 2D discrete version of the Laplace operator. And like in the 1D continuous case from above this filter responses best on edges, visible for instance at the whiskers of the cat.

There are many other filters and I don't have the intention to cover all of them here. Instead, I link you to another website where you see other filters and examples which run directly in the browser. When you scroll down a bit, you can even test your own kernels interactively.

List of attached files:

1. This is an implementation detail not taken into account by the mathematical definition. You would need to manually add some constant to centre your kernel, e.g. $$+ 1$$ for a $$3 \times 3$$ kernel in the 2D discrete case: $$s(x, y)*k(x, y) = \sum_{\tau = 0}^{n-1} \sum_{\kappa = 0}^{n-1} s(-\tau + x, -\kappa + y) \cdot k(\tau + 1, \kappa + 1)$$.
2. This is one case of the general border handling problem. See this Wikipedia section or page 113 of the book Digital Image Processing (2nd edition) by Burger and Burge for further details.