## Introduction to kernel density estimation (Parzen window method)

In probability theory, it is common to work with distributions which describe a stochastic process and reveal information about how the process may behave. In reality, it is not unusual to deal only with data directly, without any knowledge of a formal distribution. However, even though we don't know the distribution, it is still valid to assume that the data arise from a hidden one. This means we acknowledge that there is a distribution which produced our data, but we don't have a clue which one it is and there is probably no way to find out for sure. There are, however, techniques which can estimate a distribution based on the observed data. One is known as kernel density estimation (also known as Parzen window density estimation or Parzen-Rosenblatt window method). This article is dedicated to this technique and tries to convey the basics needed to understand it.

The Parzen window approach is a so-called non-parametric estimation method since we don't even know the type of the underlying probability distribution. In contrast, when we estimate the PDF $$\hat{p}(x)$$1 in a parametric way, we know (or assume) the type of the PDF (e.g. a Gaussian) and only have to estimate the parameters of the assumed distribution. But in our case we really know nothing and hence need a different approach which is not based on finding the best-suited parameters.

So, to formulate the problem: suppose we have a set of $$N$$ data points $$X = (x_1, x_2, \ldots, x_N)^T$$. Then, our goal is to find a PDF which satisfies the following conditions:

1. The PDF should resemble the data. This means that the general structure of the data should still be visible. We don't require an exact mapping but groups in the dataset should also be noticeable as high-density values in the PDF.
2. We are interested in a smooth representation of the PDF. Naturally, $$X$$ contains only a discrete list of points but the underlying distribution is most likely continuous. This should be covered by the PDF as well. More importantly, we want something better than a simple step function as offered by a histogram (even though histograms will be the starting point of our discussion). Hence, the PDF should approximate the distribution beyond the evidence from the data points.
3. The last point is already implied by the word PDF itself: we want the PDF to be valid. Mathematically, the most important properties (among others) which the estimation $$\hat{p}(x)$$ must obey are $$\label{eq:ParzenWindow_PDFConstraints} \int_{-\infty}^{\infty} \! \hat{p}(x) \, \mathrm{d}x = 1 \quad \text{and} \quad \hat{p}(x) \geq 0.$$

Why the effort to find $$\hat{p}(x)$$ in the first place? Many applications rely on information about the underlying distribution. This is especially true for classifiers, which can e.g. use $$\hat{p}(x)$$ to get the likelihood of an unknown $$x$$ not present in the data. But it also gives insight into the structure of the data; a little bit like a better histogram (broadly speaking). This is, of course, application-specific, but I guess when you search for a PDF estimator, then you already know why you do so.

The rest of this article is structured as follows: I start with one-dimensional data, and even though I don't intend to give a full introduction to histograms, they are IMHO a very good starting point for our discussion. We will get to know the problems of classical histograms, and the solution indeed ends in the Parzen window estimator. We will look at the definition of the estimator and try to get some intuition about its components and parameters. It is also possible to reach the estimator from a different direction, namely from signal theory by using convolution, and we will take a look at this, too. In the last part, I generalize the estimator to the multi-dimensional case and provide an example in 2D.

### It starts with histograms

Histograms are a very easy way to get information about the distribution of the data. They show us how often certain data points occur in the whole dataset. A common example is an image histogram, which shows the distribution of the intensity values in an image. In this case, we can simply count how often each of the 256 intensity values occurs and draw a line for each intensity value whose height corresponds to the number of times that intensity value is present in the image. Another nice property of the histogram is that it can be scaled to be a valid discrete probability distribution itself (though not a PDF yet2). This means we reduce the height of each line so that the total sum is 1. In the case of the image histogram, we can simply turn the count into a relative frequency by dividing by the total number of pixels, and we end up with a valid distribution. To make it more specific, let $$b_i$$ denote the number of pixels with intensity $$i$$ and $$N$$ the total number of pixels in the image, so that we can calculate the height of each line as

$$\label{eq:ParzenWindow_ImageHistogram} \hat{p}_i = \frac{b_i}{N}.$$

$$\hat{p}_i$$ is then an estimate of the probability of the $$i$$-th intensity value (the relative frequency).
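In code, this normalization boils down to a count followed by a division. The following sketch assumes Python with NumPy; the image array is made up for illustration:

```python
import numpy as np

# A made-up 8-bit grayscale "image"; any integer-valued array works the same way
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64))

# b_i: number of pixels with intensity i, for i = 0, ..., 255
counts = np.bincount(image.ravel(), minlength=256)

# p_i = b_i / N: relative frequencies which sum to 1 by construction
p_hat = counts / image.size
```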

An image histogram is a very simple example because we have only a finite set of values to deal with. In other cases, the border values (e.g. 0 and 255 in the image example) are not that clear. What is more, the values might not even be discrete. If every value $$x \in \mathbb{R}$$ is imaginable, the histogram is not very interesting since most of the values will occur only once. E.g. if we have the values 0.1, 0.11 and 0.112, they end up as three lines in our histogram.

Obviously, we need something more sophisticated. The solution is, instead of counting the occurrences of each possible value, to collect the values in so-called bins. Each bin has a certain width and we assign each value to the bin it falls into. Instead of lines, we now draw bars, and the height of each bar is proportional to the number of values which fell into the bin. For simplicity, I will assume that each bin has the same width, but note that this is not a requirement.

Let's redefine $$b_i$$ to denote the number of values which fall into the $$i$$-th bin. Can we still use \eqref{eq:ParzenWindow_ImageHistogram} to calculate the height of each bar when the requirement is now to obtain a valid PDF3? No, we can't, since the histogram estimate $$\hat{p}(x)$$ would violate the first constraint of \eqref{eq:ParzenWindow_PDFConstraints}, i.e. the areas of all bars would not integrate to 1. This is due to the binning, where we effectively expand each line horizontally to the bin width. This changes the occupied area and we need some kind of compensation. But the solution is very simple: we just scale the height by the bin width4 $$h > 0$$

$$\label{eq:ParzenWindow_Histogram} \hat{p}(x) = \begin{cases} \frac{1}{h} \cdot \frac{b_i}{N} & \text{if } x \text{ falls into bin } b_i \\ 0 & \text{else} \\ \end{cases}$$

I want to give an example which will also be used later when we dive into the Parzen window approach. We consider a short dataset $$X$$ consisting of only four points (or, more precisely, values)

$$\label{eq:ParzenWindow_ExampleData} X = \begin{pmatrix} 1 \\ 1.5 \\ 3 \\ 5 \\ \end{pmatrix}.$$

Suppose we want to create a histogram with bins of width $$h=2$$. The first question is: where do we start binning, i.e. what is the left coordinate of the first bin? 1? 0? -42? This is actually application-specific, since sometimes we have a natural range where it makes sense to start e.g. with 0, but in this case there is no application context behind the data, so I decide to start at $$\min(X) = 1$$. With this approach, we can cover the complete dataset with three bins ($$B_i$$ denotes the range of the $$i$$-th bin)

\begin{equation*} B_1 = [1;3[, B_2 = [3;5[ \quad \text{and} \quad B_3 = [5;7]. \end{equation*}

The first two data points fall into $$B_1$$ and the other two into $$B_2$$ and $$B_3$$, respectively. The heights of the bins according to \eqref{eq:ParzenWindow_Histogram} are given by

\begin{equation*} \hat{p}(B_1) = \frac{1}{2} \cdot \frac{2}{4} = 0.25 \quad \text{and} \quad \hat{p}(B_2) = \hat{p}(B_3) = \frac{1}{2} \cdot \frac{1}{4} = 0.125. \end{equation*}

It is easy to see that the bar areas indeed sum to $$2 \cdot 0.25 + 2 \cdot 2 \cdot 0.125 = 1$$. The following figure shows a graphical representation of the histogram.
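The worked example can be checked with NumPy's histogram routine (a sketch; `density=True` performs exactly the division by $$N$$ and the bin width $$h$$ described above):

```python
import numpy as np

X = np.array([1, 1.5, 3, 5])
h = 2

# Bin edges for B1 = [1;3[, B2 = [3;5[ and B3 = [5;7]
edges = np.array([1, 3, 5, 7])

# density=True divides each count by N and by the bin width,
# yielding the heights [0.25, 0.125, 0.125]
heights, _ = np.histogram(X, bins=edges, density=True)

# Total bar area (width times height per bar) is 1: a valid step-function PDF
area = (heights * np.diff(edges)).sum()
```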

So, why do we need something else when we can already obtain a valid PDF via the histogram method? The problem is that histograms provide only a rough representation of the underlying distribution. And as you can see from the last example, $$\hat{p}(x)$$ is merely a step function, and usually we want something smoother. This is where the Parzen window estimator enters the field.

### From histograms to the Parzen window estimator

Our goal is to improve the histogram method by finding a function which is smoother but still a valid PDF. The general idea of the Parzen window estimator is to use multiple so-called kernel functions and place one at each position where a data point is located. We then superpose all of these kernels and scale the result to our needs. The function resulting from this process is our PDF. We implement this idea by replacing the cardinality of the bins $$b_i$$ in \eqref{eq:ParzenWindow_Histogram} with something more sophisticated (still for the one-dimensional case):

Kernel density estimation (Parzen window estimation) [1D]

Let $$X = (x_1, x_2, \ldots, x_N)^T \in \mathbb{R}^{N \times 1}$$ be the column vector containing the data points $$x_i \in \mathbb{R}$$. Then, a valid PDF, which returns the probability density for an arbitrary $$x$$-value, can be estimated via the Parzen window estimator as

$$\label{eq:ParzenWindow_KernelDensity} \hat{p}(x) = {\color{Orange}\frac{1}{h \cdot N}} \cdot {\color{Aquamarine}\sum_{i=1}^N} {\color{Red}K}\left( \frac{{\color{Mahogany}x_i -} x}{{\color{LimeGreen}h}} \right).$$

The data is collected in bins of width $$h$$ and the estimation is created via superposition of kernels $$K$$. The kernels have to be valid PDFs themselves.

To evaluate the estimate at an arbitrary value $$x$$, we now need to iterate over the complete dataset and sum up the results of the evaluated kernel functions $$K(x)$$. The nice thing is that $$\hat{p}(x)$$ inherits the properties of the kernel functions. If $$K(x)$$ is smooth, $$\hat{p}(x)$$ tends to be smooth as well. Therefore, the kernel dictates how $$\hat{p}(x)$$ behaves, how smooth it is and what the contribution of each $$x_i$$ is to any $$x$$ which we might plug into $$\hat{p}(x)$$. A common choice is to use standard normal Gaussians, which we will also use in the example later. What is more, it can be shown that if $$K(x)$$ is a valid PDF, i.e. satisfies, among others, \eqref{eq:ParzenWindow_PDFConstraints}, then $$\hat{p}(x)$$ is a valid PDF as well!

Two operations are performed in the argument of the kernel function: the kernel is shifted to the position $$x_i$$ and stretched horizontally by $$h$$, which determines the main range of its contribution. Similar to before, $$h$$ specifies how many points contribute to the bins; only that the bins are now more complicated. We don't have a clear bin structure anymore since the bins are the result of the superposed kernel functions.

And the last part is the scaling of the whole function by $$h \cdot N$$. As before in \eqref{eq:ParzenWindow_Histogram}, this is necessary so that the area enclosed by $$\hat{p}(x)$$ equals 1 (a requirement for a valid PDF). This is also intuitively clear: the more points we have in our dataset, the more we sum up, and we need to scale down $$\hat{p}(x)$$ to compensate. Likewise, the larger the bin width $$h$$, the wider our kernels get and the more points they consider. In the end, we have to compensate for this as well.
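Putting the pieces together, \eqref{eq:ParzenWindow_KernelDensity} translates almost literally into code. This is a Python/NumPy sketch; the function and parameter names are my own:

```python
import numpy as np

def parzen_1d(x, data, h, kernel):
    """Evaluate the Parzen window estimate at the positions x.

    `kernel` can be any function implementing a valid PDF K.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)

    # K((x_i - x) / h) for every pair of data point and query position
    contributions = kernel((data[:, None] - x[None, :]) / h)

    # Sum over the data points, then scale by 1 / (h * N)
    return contributions.sum(axis=0) / (h * len(data))

def K_gauss(u):
    # Standard normal kernel, one possible choice for K
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
```

Evaluating `parzen_1d` with the example data on a fine grid and integrating numerically gives an area of (numerically) 1, as required.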

We could now directly start and use this approach to retrieve a smooth PDF. But I first want to give an example, and this is easier when we stick to uniform kernels for now. This will not give us a smooth function in the end, but hopefully some insights into the approach itself. The uniform kernels have width 1 and are centred around the origin

$$\label{eq:ParzenWindow_KernelUniform} K_U(x) = \begin{cases} 1, & -0.5 \leq x \leq 0.5 \\ 0, & \text{else} \\ \end{cases}$$

When we want to know how this kernel behaves, we must first talk about the kernel argument $$\frac{x_i - x}{h}$$. How does the kernel change with this argument? As mentioned before, it stretches and shifts the kernel to the position of $$x_i$$:

\begin{align*} -0.5 \leq \frac{x_i - x}{h} \leq 0.5 &\Leftrightarrow -0.5 \cdot h \leq x_i - x \leq 0.5 \cdot h \\ &\Leftrightarrow -0.5 \cdot h \leq x - x_i \leq 0.5 \cdot h \\ &\Leftrightarrow -0.5 \cdot h + x_i \leq x \leq 0.5 \cdot h + x_i \\ \end{align*}

I basically moved everything except $$x$$ to the borders so that we can see how each side is scaled by $$h$$ and shifted to the position $$x_i$$. Note also that this kernel is symmetric and hence has the same absolute values on both sides (which is why multiplying by $$-1$$ does not change the interval). The rest of \eqref{eq:ParzenWindow_KernelDensity} is straightforward since it only involves a summation and a scaling at the end (to ensure that the integral over the complete range is 1). Let's move on and apply \eqref{eq:ParzenWindow_KernelDensity} to the example values $$X$$ (\eqref{eq:ParzenWindow_ExampleData}), still using $$h=2$$

\begin{align*} \hat{p}(x) &= \frac{1}{8} \left( \begin{cases} 1, & 0 \leq x \leq 2 \\ 0, & \text{else} \\ \end{cases} + \begin{cases} 1, & 0.5 \leq x \leq 2.5 \\ 0, & \text{else} \\ \end{cases} + \begin{cases} 1, & 2 \leq x \leq 4 \\ 0, & \text{else} \\ \end{cases} + \begin{cases} 1, & 4 \leq x \leq 6 \\ 0, & \text{else} \\ \end{cases} \right) \\ &\approx \frac{1}{8} \begin{cases} 1, & 0 \leq x < 0.5 \\ 2, & 0.5 \leq x < 2.5 \\ 1, & 2.5 \leq x \leq 6 \\ 0, & \text{else} \\ \end{cases} \end{align*}

As you can see, each piecewise-defined function is centred around one data point $$x_i$$, and after the summation of these functions we end up with a step function5. It is also easy to show that $$\hat{p}(x)$$ indeed integrates to

\begin{equation*} \int_{-\infty}^{\infty} \! \hat{p}(x) \, \mathrm{d}x = \frac{1}{8} \cdot \left( 0.5 \cdot 1 + 2 \cdot 2 + 3.5 \cdot 1 \right) = 1. \end{equation*}
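As a quick sanity check, the step values above can be reproduced numerically (a sketch; the variable names are my own):

```python
import numpy as np

X = np.array([1, 1.5, 3, 5])
h, N = 2, len(X)

def K_uniform(u):
    # Uniform kernel of width 1 centred around the origin
    return np.where(np.abs(u) <= 0.5, 1.0, 0.0)

def p_hat(x):
    # (1 / (h * N)) * sum of K((x_i - x) / h) over the data points
    return K_uniform((X - x) / h).sum() / (h * N)

print(p_hat(0.25))  # 0.125: only the kernel around x = 1 covers this position
print(p_hat(1.0))   # 0.25:  the kernels around 1 and 1.5 both cover it
print(p_hat(5.5))   # 0.125: only the kernel around 5 covers it
```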

The next plot shows our estimate $$\hat{p}(x)$$ together with the previous histogram.

Even though the two PDFs are not identical, they are very similar in that both are step functions, and you probably wouldn't call that smooth. That $$\hat{p}(x)$$ is not smooth is a manifestation of the fact that \eqref{eq:ParzenWindow_KernelDensity} inherits its properties from the underlying kernels: uniform kernels are not smooth and the same is true for $$\hat{p}(x)$$.

So much for the detour over uniform kernels. I hope it has helped to understand the intuition behind \eqref{eq:ParzenWindow_KernelDensity}. Things get even more interesting when we start using a more sophisticated kernel; a Gaussian kernel for instance (with mean $$\mu$$ and standard deviation $$\sigma$$)

$$\label{eq:ParzenWindow_Gaussian} g(x) = \frac{1}{\sqrt{2 \pi} \sigma} \cdot e^{-\frac{\left(x - \mu\right)^2}{2 \sigma^2}}.$$

This makes the Parzen estimator more complex, but the added complexity is required to achieve our smoothness constraint. The good thing, though, is that our previous findings remain valid. More concretely, we use standard normally distributed kernels, i.e. $$\mu=0, \sigma=1$$

\begin{equation*} K_g(x) = \frac{1}{\sqrt{2 \pi}} \cdot e^{-\frac{x^2}{2}}. \end{equation*}

If we use this kernel in \eqref{eq:ParzenWindow_KernelDensity} with our example data $$X$$ (\eqref{eq:ParzenWindow_ExampleData}), we end up with

$$\label{eq:ParzenWindow_EstimateGaussian} \hat{p}(x) = \frac{1}{4 \sqrt{2 \pi } h} \cdot \left( e^{-\frac{1}{2 h^2} \cdot (x-5)^2}+e^{-\frac{1}{2 h^2} \cdot (x-3)^2}+e^{-\frac{1}{2h^2} \cdot (x-1.5)^2}+e^{-\frac{1}{2 h^2} \cdot (x-1)^2} \right).$$

Even though the function is now composed of more terms, we can still see the individual Gaussian functions and how they are placed at the locations of the data points. The interesting thing with the standard Gaussian kernel $$K_g(x)$$ is that the shift $$x_i - x$$ and the parameter $$h$$ correspond directly to the mean $$\mu$$ and the standard deviation $$\sigma$$ of the non-standard Gaussian function $$g(x)$$. You can see this by comparing \eqref{eq:ParzenWindow_Gaussian} with \eqref{eq:ParzenWindow_EstimateGaussian} which basically leads to the following assignments:

\begin{align*} \mu &= x_i \\ \sigma &= h \end{align*}

i.e. the Gaussian kernel is scaled by $$h$$ and shifted to the position of $$x_i$$. What does $$\hat{p}(x)$$ look like now and what is the global effect of the bin width $$h$$? You can find out in the following animation.

Note how (as before) each kernel is centred at the position of a data point and how the bin width $$h$$ still influences how many points are aggregated. But compared to the uniform kernels before, we don't have clear bin endings anymore; it is only noticeable that the Gaussians get wider. However, this is exactly what makes the resulting function smooth, and this is, after all, precisely what we wanted to achieve in the first place.
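The correspondence $$\mu = x_i$$, $$\sigma = h$$ can also be verified numerically: the Parzen estimate with a standard normal kernel is exactly an equal-weight mixture of Gaussians centred at the data points. A sketch assuming SciPy is available; the names are my own:

```python
import numpy as np
from scipy.stats import norm

X = np.array([1, 1.5, 3, 5])
h = 2

def p_parzen(x):
    # (1 / (h * N)) * sum of K_g((x_i - x) / h) with the standard normal K_g
    u = (X[:, None] - np.atleast_1d(x)[None, :]) / h
    return (np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)).sum(axis=0) / (h * len(X))

def p_mixture(x):
    # Equal-weight mixture of Gaussians with mu = x_i and sigma = h
    return np.mean([norm.pdf(x, loc=xi, scale=h) for xi in X], axis=0)

# Both formulations agree everywhere (up to floating-point rounding)
xs = np.linspace(-5, 11, 200)
agree = np.allclose(p_parzen(xs), p_mixture(xs))
```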

Before I show you an example in the two-dimensional case, I first want to make a detour and reach the Parzen window estimator from a different point of view: convolution.

### From convolution to the Parzen window estimator

If you asked a specialist from the field of signal theory about the Parzen window estimator, you might get a different explanation, one which includes the words “convolution” and “delta functions”. Even though this approach is quite different from what we have seen so far, the result will still cohere with our previous findings. The first thing we need to do is convert our data points into a continuous function $$s(x)$$ (the input signal). We can do so by placing a Dirac delta function $$\delta(x)$$ at each data point

\begin{equation*} s(x) = \delta (x-5)+\delta (x-3)+\delta (x-1.5)+\delta (x-1). \end{equation*}

Then, we simply convolve our signal $$s(x)$$ with the kernel we want to use

\begin{equation*} \hat{p}(x) = s(x) * \frac{1}{h \cdot N} \cdot K_g\left(\frac{x}{h}\right) = \frac{1}{4 \sqrt{2 \pi } h} \cdot \left( e^{-\frac{1}{2 h^2} \cdot (x-5)^2}+e^{-\frac{1}{2 h^2} \cdot (x-3)^2}+e^{-\frac{1}{2h^2} \cdot (x-1.5)^2}+e^{-\frac{1}{2 h^2} \cdot (x-1)^2} \right). \end{equation*}

And the response is already the estimate $$\hat{p}(x)$$! If you compare the result with the previous estimate in \eqref{eq:ParzenWindow_EstimateGaussian}, you can see that they are indeed identical. Note that the shift $$x_i - x$$ is not necessary anymore since this part is already handled by the delta functions. An important aspect to mention here is the identity property of the convolution operation: if you convolve a function with the delta function, you get the input function back as the return value. Applied to the kernel, this property means

$$\label{eq:ParzenWindow_ConvolutionIdentity} K_g(x) * \delta(x_i - x) = K_g(x_i - x)$$

and that explains how the kernels are shifted to the positions of the data points. To see how the response emerges “over time”, you can check out the following animation.

Of course, the responses are also visually identical. This is not surprising since the same function is plotted. If you take the time, you can see why the resulting function looks the way it does. E.g. around $$t=1.5$$ the PDF $$\hat{p}(x)$$ reaches its maximum. This becomes clear if you consider the kernel functions around that position: when approaching $$t=1.5$$ from the left side, two Gaussians are rising (orange and green) and one is falling (blue). After $$t=1.5$$, only one function is rising (green) and two are falling (blue and orange). Hence, the summed kernel responses must also decrease after $$t=1.5$$.

So much for the detour into the convolution approach. It is interesting to see that both approaches lead to the same result and in essence cover the same idea (but implemented differently): use a set of kernel functions to generate a PDF which reflects the distribution of the data points. Before finishing this article, I want to generalize \eqref{eq:ParzenWindow_KernelDensity} to arbitrary dimensions and give an example for the two-dimensional case.

### Multi-dimensional data

Extending \eqref{eq:ParzenWindow_KernelDensity} to work in arbitrary dimensions is actually pretty simple. The idea is that our bin, which we have so far assumed to be a structure expanding in only one dimension, is now a hypercube7 expanding to arbitrary dimensions. This results in a square in 2D and a cube in 3D. But the general idea is always the same: collect all the data which falls into the hypercube, i.e. lies inside the boundaries of the hypercube. Note that this is also true if you want to calculate a histogram of higher-dimensional data.

The factor $$\frac{1}{h}$$ in \eqref{eq:ParzenWindow_KernelDensity} (or \eqref{eq:ParzenWindow_Histogram}) compensated for the bin width so far. When we use hypercubes, we must compensate by the hypervolume instead (area in 2D, volume in 3D etc.). So, the only adjustment needed to work in $$d$$-dimensional space is to scale by $$\frac{1}{h^d}$$ instead. This leads to the generalization of \eqref{eq:ParzenWindow_KernelDensity}:

Kernel density estimation (Parzen window estimation) [$$d$$-dimensional]

Let $$X = \left( \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N \right)^T \in \mathbb{R}^{N \times d}$$ be the matrix containing the $$d$$-dimensional data points $$\vec{x}_i \in \mathbb{R}^d$$ ($$i$$-th row of $$X$$). Each column corresponds to a variable (or feature) and each row to an observation. Then, a valid PDF, which returns the probability density for $$d$$ independent variables $$\vec{x} = \left( x_1, x_2, \ldots, x_d\right)^T$$, can be estimated via the Parzen window estimator as

$$\label{eq:ParzenWindow_KernelDensity2D} \hat{p}(\vec{x}) = \frac{1}{h{\color{Mulberry}^d} \cdot N} \cdot \sum_{i=1}^N K \left( \frac{\vec{x}_i - \vec{x}} {h} \right)$$

The data is collected in hypercubes of hypervolume $$h^d$$ and the estimation is created via superposition of kernels $$K$$. The kernels have to be valid PDFs themselves.

The only new parameter is the dimension $$d$$. The other change is that the independent variable $$\vec{x} = (x_1, x_2, \ldots, x_d)^T$$ is now a vector instead of a scalar since $$\hat{p}(\vec{x})$$ is now a multi-dimensional PDF. In the following tab pane, you can see an example for the two-dimensional case.
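A direct NumPy translation of the generalized estimator might look as follows. This is a sketch; the names and the example kernel (a standard multivariate Gaussian with identity covariance) are my own choices:

```python
import numpy as np

def parzen_nd(x, data, h, kernel):
    """Parzen estimate at the query points x (shape (M, d)).

    data has shape (N, d); `kernel` maps (..., d) arrays to densities.
    """
    data = np.asarray(data, dtype=float)
    x = np.atleast_2d(np.asarray(x, dtype=float))
    N, d = data.shape

    # (x_i - x) / h for all pairs of data and query points, shape (N, M, d)
    u = (data[:, None, :] - x[None, :, :]) / h

    # Sum the kernel responses over the data points, scale by 1 / (h^d * N)
    return kernel(u).sum(axis=0) / (h**d * N)

def K_gauss(u):
    # Standard multivariate Gaussian kernel (zero mean, identity covariance)
    d = u.shape[-1]
    return np.exp(-(u**2).sum(axis=-1) / 2) / (2 * np.pi) ** (d / 2)

# Two-dimensional example: two small clusters of points
data = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.2], [1.1, 0.9]])
density = parzen_nd([[0.1, 0.05]], data, h=0.5, kernel=K_gauss)
```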

Did you notice that the range of $$h$$ was $$[0.2;2.5]$$ in the first and $$[0.05;0.5]$$ in the second example? As you have also seen in the animations, the same value of $$h$$ has a different effect. This means that $$h$$ actually depends on the scaling of the data, which is also intuitively clear: the width (or hypervolume) of the bins is spanned in the space of the data and hence inherits its scaling. It is therefore not possible to give general recommendations for the parameter $$h$$. However, a good starting point might be to use the average distance $$\bar{d}$$ between all point pairs. The measure $$\bar{d}$$ directly incorporates the scaling of the data, e.g. if we scale the data, we will also get greater distances between the points. More concretely, I was told that it is common to set the bin width to

\begin{equation*} h = \frac{1}{2} \bar{d}. \end{equation*}

I'm not sure why we have the factor $$\frac{1}{2}$$ here, but for the one-dimensional example from above, this leads to a value of $$h = 1.125$$, which is indeed a pretty good guess here. For other automatic bandwidth selection methods, take e.g. a look at these slides (p. 17 seqq.).
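For the example data, the heuristic is easy to compute (a sketch; `d_bar` averages over all unordered point pairs):

```python
import numpy as np

X = np.array([1, 1.5, 3, 5])

# All pairwise distances; keep each unordered pair once (strict upper triangle)
dists = np.abs(X[:, None] - X[None, :])
d_bar = dists[np.triu_indices(len(X), k=1)].mean()

# Heuristic bin width: half the average pairwise distance
h = d_bar / 2
print(d_bar, h)  # 2.25 and 1.125 for the example data
```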

That was it from my side regarding the Parzen window estimator. We started by defining what a good PDF is composed of (resemblance to the data, smoothness and validity) and then approached the goal in two ways. One way led over histograms, which can be simplistic PDFs by themselves but can be extended to also fulfil the smoothness constraint. But even the extension (the Parzen window estimator) inherits the basic behaviour and ideas of histograms, namely the bins which collect the data. We also saw how the bin width has a strong influence on the resulting function $$\hat{p}(x)$$ and that it is a parameter which depends on the scaling of the data. The other way led to signal theory, where we transferred our data points to a superposition of delta functions which we then simply convolved with the kernel. And as we saw, the two ways lead to identical results. In the last part, we generalized the approach to arbitrary dimensions and took a look at a two-dimensional example.


1. Statisticians like to put a $$\hat{}$$ on their variables when they want to make clear that the variable is only an estimation of the real PDF $$p(x)$$.
2. We can't integrate to 1 over a function which consists only of discrete values. Luckily, it is sufficient for a discrete probability distribution to sum the values up to 1 instead.
3. Note that we work with continuous data now and hence need the density instead of the direct probability.
4. In the $$d$$-dimensional case we would need to scale by $$\frac{1}{h^d}$$ instead.
5. I used $$\approx$$ instead of $$=$$ in the last step since at the position $$x=2$$ we have, precisely speaking, $$\hat{p}(2) = \frac{3}{8}$$, but covering this case would only make the function more complex, and the distinction between $$\leq$$ and $$<$$ is not relevant for a PDF anyway.
6. Precisely speaking, it is not possible to draw $$\delta(x)$$. Even though $$\delta(0)\neq 0$$, the value at $$x=0$$ is still undefined. However, it is very common to draw a single line at this position instead (as also done here).
7. Hyperbin would also be a neat name :-).