Density estimation based on the nearest neighbours is another technique to estimate the unknown PDF \(\hat{p}(x)\) from observed data. It implements kind of the opposite idea of the Parzen window estimator where we place kernels at each data point with a certain side length \(h\) which determines the local influence of the kernel. Using a large \(h\) results in wide kernels which collect more points on the way. In the nearest neighbour density estimation, we approach from a different perspective. Instead of fixing the side length \(h\) and collecting a varying number of \(k\) neighbours for each kernel, we now fix \(k\) and adjust the influence area accordingly.
The idea is to use a volume function \(V_k(x)\) which gives us a value denoting how large the volume at a position \(x\) must be to cover \(k\) data points. This results in low values at positions where we have many data points and high values when only a few data points are in the neighbourhood since then we need more space to cover the same number of neighbours. This results in the following approximation of
\begin{equation} \label{eq:NearestNeighbourDensityEstimation_Approximation} \hat{p}(x) \approx \frac{k}{N} \cdot \frac{1}{V_k(x)}. \end{equation}The volume function is inverted so that we have a high density at positions with many data points and a low density at locations where points are only sparsely located. Note that this is already a behaviour we naturally expect from a PDF. Additionally, the inverted volume is weighted by the fraction of points it reflects, i.e. the \(k\) considered points out of \(N\) total points.
The following animation shows in a one-dimensional example how \(\hat{p}(x)\) is generated for the four points
\begin{equation} \label{eq:NearestNeighbourDensityEstimation_Data} X = \begin{pmatrix} 1 \\ 1.5 \\ 3 \\ 5 \\ \end{pmatrix} \end{equation}with the volume function
\begin{equation} \label{eq:NearestNeighbourDensityEstimation_Volume} V_k(x) = 2 \cdot \left| x - \tilde{x}_k \right|. \end{equation}\(\tilde{x}_k\) denotes the \(k\)-th nearest neighbour to \(x\). Note how the “volume” in the one-dimensional case is only the absolute value to the last neighbour. We will use \(k=2\) in the following and the critical part is now to determine what the second nearest neighbour for each position \(x\) is since the distance to that neighbour determines the resulting density value. For this, you can use the following animation and slide over the current value \(x_c\) to see which point the second nearest neighbour is and how the resulting PDF evolves.
Note also that a PDF approximated by \eqref{eq:NearestNeighbourDensityEstimation_Approximation} is actually not a valid PDF. More precisely, it can be shown (page 4) that the required property
\begin{equation*} \int_{-\infty}^{\infty} \! \hat{p}(x) \, \mathrm{d}x = 1 \end{equation*}is violated.
List of attached files:
- NearestNeighbourDensityEstimation.nb [PDF] (Mathematica notebook used to create the animation)
← Back to the overview page