The correlation coefficient is an important metric for measuring the linear dependency between two variables \(X\) and \(Y\). It is defined as

\begin{equation*} r_{XY} = \frac{s_{XY}}{s_{X} \cdot s_{Y}} \in [-1;1] \end{equation*}

where \(s_{XY}\) denotes the covariance and \(s_{X}, s_{Y}\) the standard deviations of the two variables. A magnitude \(\left| r_{XY} \right|\) close to 1 indicates a strong linear relationship between the variables. Seen from the other direction: if we start from a perfect linear relationship and add increasing noise to the variables, \(\left| r_{XY} \right|\) gets smaller.
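To make the definition concrete, here is a minimal sketch in Python that computes \(r_{XY}\) directly from the covariance and the standard deviations. The function name `correlation` and the small sample arrays are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def correlation(x, y):
    """Correlation coefficient r_XY = s_XY / (s_X * s_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Sample covariance: off-diagonal entry of the 2x2 covariance matrix.
    s_xy = np.cov(x, y, ddof=1)[0, 1]
    # Sample standard deviations, using the same normalisation (ddof=1)
    # so that the factors cancel consistently.
    s_x = np.std(x, ddof=1)
    s_y = np.std(y, ddof=1)
    return s_xy / (s_x * s_y)


# Illustrative data: roughly y = 2x with small deviations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = correlation(x, y)
print(r)                        # close to 1 for this nearly linear data
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in result for comparison
```

Both printed values should agree up to floating-point error, since `np.corrcoef` evaluates the same ratio.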

This is illustrated in the animation below. For \(X\), points are generated from -4 to 4 in steps of 0.001, and a direct linear relationship is imposed on the second variable

\begin{equation*} Y = 2X. \end{equation*}

Hence, without further changes, all points lie exactly on a line with positive slope, which gives \(r_{XY} = 1\). To analyse the influence of noise, Gaussian noise is added to each variable separately

\begin{align*} \tilde{X} &= X + N(0, \sigma_x) \\ \tilde{Y} &= Y + N(0, \sigma_y). \end{align*}

The noise parameters \(\sigma_x\) and \(\sigma_y\) can both be controlled in the animation.
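For readers without access to the animation, the experiment can be approximated with a few lines of Python. This is only a sketch under stated assumptions, not the code behind the animation: \(\sigma_x\) and \(\sigma_y\) are taken to be standard deviations, the two noise terms are drawn independently, and the seed and the helper name `noisy_correlation` are arbitrary choices.

```python
import numpy as np

# Fixed seed so the run is reproducible (an arbitrary choice).
rng = np.random.default_rng(seed=0)

# X from -4 to 4 in steps of 0.001 (np.arange excludes the endpoint 4),
# and the exact linear relationship Y = 2X.
x = np.arange(-4.0, 4.0, 0.001)
y = 2.0 * x


def noisy_correlation(sigma_x, sigma_y):
    """Add independent Gaussian noise to X and Y and return r_XY."""
    # sigma_x and sigma_y are interpreted as standard deviations (assumption).
    x_noisy = x + rng.normal(0.0, sigma_x, size=x.shape)
    y_noisy = y + rng.normal(0.0, sigma_y, size=y.shape)
    return np.corrcoef(x_noisy, y_noisy)[0, 1]


r_clean = noisy_correlation(0.0, 0.0)  # no noise: points lie on a line
r_some = noisy_correlation(0.5, 1.0)   # moderate noise
r_more = noisy_correlation(2.0, 4.0)   # strong noise

print(r_clean, r_some, r_more)
```

The three values should decrease in that order, mirroring what happens when the sliders are moved to the right.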

Figure 1: Adding noise to a once perfect linear relationship. The figure shows the data points in blue and a trend line in red. Note also the ranges on the left and bottom sides, which show the mean and standard deviation of each variable. The correlation coefficient \(r_{XY}\) and other metrics are shown at the top. The slider range of \(\sigma_y\) is twice as large as that of \(\sigma_x\) since the values of \(Y\) are also twice as large as the values of \(X\) (note also the different axis ranges).
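
The effect of the two noise parameters can also be estimated with a short calculation. As a sketch, assume that \(\sigma_x\) and \(\sigma_y\) are standard deviations and that the two noise terms are independent of each other and of \(X\). Then the noise does not change the covariance, but it inflates both variances, so that for large samples

\begin{align*} s_{\tilde{X}\tilde{Y}} &\approx 2 s_X^2, \qquad s_{\tilde{X}}^2 \approx s_X^2 + \sigma_x^2, \qquad s_{\tilde{Y}}^2 \approx 4 s_X^2 + \sigma_y^2, \\ r_{\tilde{X}\tilde{Y}} &\approx \frac{2 s_X^2}{\sqrt{\left(s_X^2 + \sigma_x^2\right)\left(4 s_X^2 + \sigma_y^2\right)}}. \end{align*}

In the special case \(\sigma_y = 2\sigma_x\), this simplifies to \(r_{\tilde{X}\tilde{Y}} \approx s_X^2 / (s_X^2 + \sigma_x^2)\), so both variables contribute the same relative amount of noise. This is consistent with giving \(\sigma_y\) twice the slider range of \(\sigma_x\).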
