Pearson correlation, the most common type of correlation, is widely used in data science. However, incorrect conclusions are often drawn from a low or a high correlation. We will see below two counterexamples that should help to remember some limitations of the Pearson correlation.

# First, let's import some useful libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.notebook_repr_html', False)

## A low correlation means that there is no relationship between two sequences

WRONG! Two sequences may have a low Pearson correlation while one sequence can be entirely predicted from the other.

Let’s define two sequences:

df = pd.DataFrame()
df['feature1'] = np.linspace(0, 1, 1001)
# the small offset avoids a division by zero at feature1 = 0
df['feature2'] = 1 / (df.feature1 + 1e-10)

and plot them:

start = 100  # skip the first points, where feature2 is too large to share a scale with feature1
plt.plot(df.feature1.iloc[start:], df.feature1.iloc[start:])
plt.plot(df.feature1.iloc[start:], df.feature2.iloc[start:])
plt.legend(('feature1', 'feature2'))
plt.show()

We compute the Pearson correlation:

df.corr(method='pearson')
          feature1  feature2
feature1  1.000000 -0.054718
feature2 -0.054718  1.000000


We see that the Pearson correlation has a very small absolute value: about 0.05. However, each sequence can be computed exactly from the other.

Below are the two functions to construct one sequence from the other:

f = lambda x: 1 / (x + 1e-10)
# check if f(df.feature1) and df.feature2 are element-wise equal within a tolerance:
np.allclose(f(df.feature1), df.feature2)
True

g = lambda x: 1 / x - 1e-10
# check if g(df.feature2) and df.feature1 are element-wise equal within a tolerance:
np.allclose(g(df.feature2), df.feature1)
True
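
The low value comes from the fact that the Pearson correlation only measures linear association. As a minimal sketch (the helper `pearson` and the names `r_raw` and `r_linearized` are ours and are not part of the example above), we can compute the coefficient directly from its definition and check that inverting feature2, which makes the relationship linear, brings the correlation back to about 1:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation computed from its definition:
    covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centered x
    yc = y - y.mean()  # centered y
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# same sequences as above, rebuilt as plain NumPy arrays
feature1 = np.linspace(0, 1, 1001)
feature2 = 1 / (feature1 + 1e-10)

# correlation between the raw sequences: the relationship is strongly non-linear
r_raw = pearson(feature1, feature2)

# correlation after inverting feature2: the relationship becomes linear
r_linearized = pearson(feature1, 1 / feature2)

print(r_raw, r_linearized)
```

The first value should be close to the -0.05 reported by `df.corr`, while the second should be very close to 1: the information was there all along, but not in a linear form.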


## A high correlation means that we should be able to approximately predict one sequence from the other

WRONG! Two sequences may have a high Pearson correlation while knowing one sequence does not help in predicting the other.

Let’s define two sequences feature1 and feature2 as the cumulative sums of random numbers:

np.random.seed(123)
df = pd.DataFrame()
df['random1'] = np.random.randint(0, 10, 20)
df['random2'] = np.random.randint(0, 10, 20)
df['feature1'] = df.random1.cumsum()
df['feature2'] = df.random2.cumsum()

Let’s look at the first values of these sequences:

df.head()
   random1  random2  feature1  feature2
0        2        7         2         7
1        2        3         4        10
2        6        2        10        12
3        1        4        11        16
4        3        7        14        23


We now compute the Pearson correlation:

df.corr(method='pearson')
           random1   random2  feature1  feature2
random1   1.000000  0.077607 -0.058061 -0.143278
random2   0.077607  1.000000 -0.019545  0.025920
feature1 -0.058061 -0.019545  1.000000  0.986428
feature2 -0.143278  0.025920  0.986428  1.000000


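As we can see, feature1 and feature2 are highly correlated, their Pearson correlation being approximately 0.99. However, both sequences are obtained by accumulating random numbers drawn independently, so knowing one sequence does not help in predicting the other. The high correlation only reflects the fact that both cumulative sums keep increasing, since the increments are never negative. The increments themselves, random1 and random2, have a correlation of only 0.08.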
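
To check that this is not an accident of one particular seed, here is a small sketch that repeats the experiment many times. The seed, the number of trials and the variable names are arbitrary choices of ours, and it uses NumPy's `default_rng` generator rather than `np.random.seed`, so the draws differ from those shown above:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n_trials, n_steps = 200, 20     # 20 steps as in the example above

corr_increments = []  # correlation between the raw random numbers
corr_cumsums = []     # correlation between their cumulative sums

for _ in range(n_trials):
    # two independent sequences of integers in [0, 10)
    a = rng.integers(0, 10, size=n_steps)
    b = rng.integers(0, 10, size=n_steps)
    corr_increments.append(np.corrcoef(a, b)[0, 1])
    corr_cumsums.append(np.corrcoef(a.cumsum(), b.cumsum())[0, 1])

mean_abs_increment_corr = np.mean(np.abs(corr_increments))
mean_cumsum_corr = np.mean(corr_cumsums)

print(mean_abs_increment_corr, mean_cumsum_corr)
```

The cumulative sums should be strongly correlated on average while the increments are only weakly correlated, which suggests that the high correlation comes from the common upward trend and not from any link between the two sequences.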

Be careful with correlation! As we saw with the previous examples, it is possible to have:

• a high correlation between two sequences without being able to approximately predict a sequence from the other,
• a low correlation between two sequences while one sequence can be exactly predicted from the other.