Pearson correlation, the most common type of correlation, is widely used in Data Science. However, incorrect conclusions are often drawn from a low or high correlation. The counterexamples below are meant to make some limitations of the Pearson correlation easier to remember.
    # First, let's import some useful libraries:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    pd.set_option('display.notebook_repr_html', False)
"A low correlation means that there is no relationship between two sequences": WRONG!
Two sequences may have a low Pearson correlation even though one sequence can be entirely predicted from the other.
Let’s define two sequences:
    df = pd.DataFrame()
    df['feature1'] = np.linspace(0, 1, 1001)
    df['feature2'] = 1 / (df.feature1 + 1e-10)
and plot them:
    start = 100
    plt.plot(df.feature1.iloc[start:], df.feature1[start:])
    plt.plot(df.feature1.iloc[start:], df.feature2[start:])
    plt.legend(('feature1', 'feature2'))
    plt.show()
We compute the Pearson correlation (for instance with df.corr(), which uses Pearson by default):
              feature1  feature2
    feature1  1.000000 -0.054718
    feature2 -0.054718  1.000000
We see that the Pearson correlation has a very small absolute value: about 0.05. However, each sequence can be computed exactly from the other.
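One way to see why the value is so low is the following minimal sketch. It rebuilds the same two sequences and compares the Pearson correlation over the full range with the one obtained after dropping the first 100 points (the same `start` used for the plot). The names `r_full`, `r_tail` and `tail` are introduced here for illustration only. Pearson correlation measures linear association, and the first value of feature2 (about 1e10) dwarfs all the others, so the full-range coefficient is dominated by that single point.

```python
import numpy as np
import pandas as pd

# Same construction as in the text above.
df = pd.DataFrame()
df['feature1'] = np.linspace(0, 1, 1001)
df['feature2'] = 1 / (df.feature1 + 1e-10)

# Pearson correlation over the full range (Series.corr uses Pearson by default).
r_full = df['feature1'].corr(df['feature2'])

# Same computation after dropping the first 100 points, as in the plot.
start = 100
tail = df.iloc[start:]
r_tail = tail['feature1'].corr(tail['feature2'])

print(f"full range:       {r_full:.6f}")
print(f"from index {start}:   {r_tail:.6f}")
```

On the truncated range the coefficient should be much larger in absolute value, although it still does not reach -1 because the relationship is not linear.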
Below are the two functions to construct one sequence from the other:
    f = lambda x: 1 / (x + 1e-10)
    # check if f(df.feature1) and df.feature2 are element-wise equal within a tolerance:
    np.allclose(f(df.feature1), df.feature2)

    g = lambda x: 1 / x - 1e-10
    # check if g(df.feature2) and df.feature1 are element-wise equal within a tolerance:
    np.allclose(g(df.feature2), df.feature1)
"A high correlation means that we should be able to approximately predict one sequence from the other": WRONG!
Two sequences may have a high Pearson correlation even though knowing one sequence does not help in predicting the other.
Let’s define two sequences feature1 and feature2 as the cumulative sums of random numbers:
    np.random.seed(123)
    df = pd.DataFrame()
    df['random1'] = np.random.randint(0, 10, 20)
    df['random2'] = np.random.randint(0, 10, 20)
    df['feature1'] = df.random1.cumsum()
    df['feature2'] = df.random2.cumsum()
Let’s look at the first values of these sequences (for instance with df.head()):
       random1  random2  feature1  feature2
    0        2        7         2         7
    1        2        3         4        10
    2        6        2        10        12
    3        1        4        11        16
    4        3        7        14        23
We now compute the Pearson correlation:
               random1   random2  feature1  feature2
    random1   1.000000  0.077607 -0.058061 -0.143278
    random2   0.077607  1.000000 -0.019545  0.025920
    feature1 -0.058061 -0.019545  1.000000  0.986428
    feature2 -0.143278  0.025920  0.986428  1.000000
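As we can see, feature1 and feature2 are highly correlated: their Pearson correlation is about 0.99. However, both sequences are obtained by accumulating independently drawn random numbers, so knowing one sequence does not help in predicting the other. The correlation is high only because both cumulative sums increase over time (the increments are never negative), and the Pearson correlation picks up this shared trend.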
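The following sketch repeats the idea under slightly different assumptions: it uses a NumPy Generator with an arbitrary seed and 1000 steps instead of the np.random.seed(123) and 20-step setup above, so the numbers will not match the table. The names `r_levels` and `r_increments` are illustrative. It compares the correlation of the two cumulative sums with the correlation of their step-to-step increments.

```python
import numpy as np

# Assumption: arbitrary seed and 1000 steps, chosen for illustration only;
# this is not the np.random.seed(123) / 20-step setup used in the text.
rng = np.random.default_rng(0)
n_steps = 1000

# Two independent sequences of random integers in [0, 10).
random1 = rng.integers(0, 10, n_steps)
random2 = rng.integers(0, 10, n_steps)

# Cumulative sums, as in the text.
feature1 = random1.cumsum()
feature2 = random2.cumsum()

# Pearson correlation of the cumulative sums (the "levels").
r_levels = np.corrcoef(feature1, feature2)[0, 1]

# Pearson correlation of the increments; np.diff recovers
# random1[1:] and random2[1:] from the cumulative sums.
r_increments = np.corrcoef(np.diff(feature1), np.diff(feature2))[0, 1]

print(f"cumulative sums: {r_levels:.4f}")
print(f"increments:      {r_increments:.4f}")
```

The cumulative sums should come out almost perfectly correlated, while the increments, which are what actually distinguishes one sequence from the other, should show a correlation close to zero.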
Be careful with correlation! As we saw with the previous examples, it is possible to have:
- a high correlation between two sequences without being able to approximately predict a sequence from the other,
- a low correlation between two sequences while one sequence can be exactly predicted from the other.