Tutorial: How to detect spurious correlations, and how to find the real ones
Specifically designed in the context of big data in our research lab, the new and simple strong correlation synthetic metric proposed in this article should be used whenever you want to check if there is a real association between two variables, especially in large-scale automated data science or machine learning projects. Use this new metric now, to avoid being accused of reckless data science and even being sued for wrongful analytic practice.
In this paper, the traditional correlation is referred to as the weak correlation, as it captures only a small part of the association between two variables: weak correlation results in capturing spurious correlations and predictive modeling deficiencies, even with as few as 100 variables. In short, our strong correlation (with a value between 0 and 1) is high (say above 0.80) only if the weak correlation is high (in absolute value) and the internal structures (auto-dependencies) of the two variables X and Y that you want to compare exhibit a similar pattern or correlogram. Yet this new metric is simple and involves just one parameter a (with a = 0 corresponding to weak correlation, and a = 1 being the recommended value for strong correlation). This setting is designed to avoid over-fitting.
Our strong correlation blends the concept of ordinary or weak regression - indeed, an improved, robust, outlier-resistant version of ordinary regression (or see my book pages 130-140) - with the concept of X and Y sharing similar bumpiness (or see my book pages 125-128).
In short, even nowadays, what makes two variables X and Y seem related in most scientific articles, and in pretty much all articles written by journalists, is ordinary (weak) regression. But there are plenty of other metrics that you can use to compare two variables. Including bumpiness in the mix (together with weak regression, in one single blended metric called strong correlation, to boost accuracy) means that a high strong correlation reflects two variables that are really associated: not just based on flawed, old-fashioned weak correlations, but also based on sharing similar internal auto-dependencies and structure. To put it differently, two variables can be highly weakly correlated yet have very different bumpiness coefficients, as shown in my original article - meaning that there might be no causal relationship (or see my book pages 165-168) or hidden factors explaining the link. An artificial example is provided below in figure 3.
Using strong, rather than weak, correlation eliminates the majority of these spurious correlations, as we shall see in the examples below. This strong correlation metric is designed to be integrated in automated data science algorithms.
1. Formal definition of strong correlation
Let us define:
- Weak correlation c(X, Y) as the absolute value of the ordinary correlation, with value between 0 and 1. This number is high (close to 1) if X and Y are highly correlated. I recommend using my rank-based, L-1 correlation (or see my book pages 130-140) to eliminate problems caused by outliers.
- c1(X) as the lag-1 auto-correlation for X, that is, if we have n observations X_1, ..., X_n, then c1(X) = c( (X_1, ..., X_{n-1}), (X_2, ..., X_n) )
- c1(Y) as the lag-1 auto-correlation for Y
- d-correlation d(X, Y) = exp{ -a * | ln( c1(X) / c1(Y) ) | }, with possible adjustment if numerator or denominator is zero, and parameter a must be positive or zero. This number, with value between 0 and 1, is high (close to 1) if X and Y have similar lag-1 auto-correlations.
- Strong correlation r(X, Y) = min{ c(X, Y), d(X, Y) }
Note that c1(X) and c1(Y) are the bumpiness coefficients (or see my book pages 125-128) for X and Y. Also, d(X, Y) and thus r(X, Y) are between 0 and 1, with 1 meaning strong similarity between X and Y, and 0 meaning either dissimilar lag-1 auto-correlations for X and Y, or lack of old-fashioned correlation.
The strong correlation between X and Y is, by definition, r(X, Y). This is an approximation to having both spectra identical, a solution mentioned in my article The curse of Big Data (see also my book pages 41-45).
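The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the spreadsheet's implementation: it assumes the ordinary (Pearson) correlation for c rather than the rank-based L-1 version, and the function names and the eps guard against a zero auto-correlation are assumptions made for the example (the article only says that an adjustment is possible, not which one).

```python
import numpy as np

def weak_corr(x, y):
    """c(X, Y): absolute value of the ordinary (Pearson) correlation, in [0, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(abs(np.corrcoef(x, y)[0, 1]))

def lag1_autocorr(x):
    """c1(X): weak correlation between (X_1..X_{n-1}) and (X_2..X_n)."""
    x = np.asarray(x, dtype=float)
    return weak_corr(x[:-1], x[1:])

def d_corr(x, y, a=1.0, eps=1e-12):
    """d(X, Y) = exp(-a * |ln(c1(X) / c1(Y))|), in [0, 1].

    eps is an assumed adjustment that keeps the ratio finite when one of
    the auto-correlations is exactly zero.
    """
    cx = max(lag1_autocorr(x), eps)
    cy = max(lag1_autocorr(y), eps)
    return float(np.exp(-a * abs(np.log(cx / cy))))

def strong_corr(x, y, a=1.0):
    """r(X, Y) = min(c(X, Y), d(X, Y))."""
    return min(weak_corr(x, y), d_corr(x, y, a))

# Example: a smooth series compared with a bumpy one.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 9, 1, 8, 3, 7, 4, 6, 5]
print("weak:", weak_corr(x, y), "strong (a = 1):", strong_corr(x, y, a=1.0))
```

Constant series are not handled here: their correlation is undefined, and np.corrcoef returns nan for them.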
This definition of strong correlation was initially suggested in one of our weekly challenges.
2. Comparison with traditional (weak) correlation
When a = 0, weak and strong correlations are identical. Note that the strong correlation r(X, Y) still shares the same properties as the weak correlation c(X, Y): it is symmetric and invariant under linear transformations (such as re-scaling) of variables X or Y, regardless of a.
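These properties follow from the definitions in section 1. A short derivation, given as a sketch and assuming that c1(X) and c1(Y) are both non-zero, with α ≠ 0 and β arbitrary constants introduced here for the linear transformation:

```latex
\begin{align*}
a = 0 \;&\Rightarrow\; d(X,Y) = e^{0} = 1
        \;\Rightarrow\; r(X,Y) = \min\{c(X,Y),\, 1\} = c(X,Y), \\[4pt]
\left|\ln\frac{c_1(X)}{c_1(Y)}\right| = \left|\ln\frac{c_1(Y)}{c_1(X)}\right|
      \;&\Rightarrow\; d(X,Y) = d(Y,X)
        \;\Rightarrow\; r(X,Y) = r(Y,X), \\[4pt]
c_1(\alpha X + \beta) = c_1(X)
      \;&\Rightarrow\; d(\alpha X + \beta,\, Y) = d(X,Y)
        \;\Rightarrow\; r(\alpha X + \beta,\, Y) = r(X,Y).
\end{align*}
```

The last line uses the fact that both lagged copies of X are transformed by the same α and β, so their correlation is unchanged, and that c(αX + β, Y) = c(X, Y) because c is an absolute value.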
In figures 1 and 2, we simulated a large number (> 10,000) of uniformly and independently distributed random variables Y, each with n observations, and computed the correlation with an arbitrary variable X with pre-specified values. So in theory, you would expect all the correlations to be close to zero. The following scatterplots (figures 1 and 2) show otherwise.
Figure 1: weak (x axis) and d-correlations (y axis) computed on thousands of (simulated) paired variables with n = 9 observations, that are, by design, non-correlated. Each dot represents a correlation.
Figure 2: weak (x axis) and d-correlations (y axis) computed on thousands of (simulated) paired variables with n = 4 observations, that are, by design, non-correlated. Each dot represents a correlation.
In practice, tons of weak correlations are still well above 0.60 if you look at figure 1, though few strong correlations are above 0.20 in the same figure (strong correlation is the minimum of the weak correlation and the d-correlation). Figure 2 is more difficult to interpret visually because n is too small (n = 4), though the conclusion is similar and obvious if you check the results in the spreadsheet (see next section). In this example, a = 4.
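The experiment can be approximated outside Excel with a short vectorized simulation. This is a sketch under stated assumptions: X is drawn once at random as a stand-in for the pre-specified values in the spreadsheet, c is the Pearson correlation, and the function name simulate, the seed and the eps guard are hypothetical choices made for the example. The shares it prints depend on these choices and will not match the spreadsheet exactly.

```python
import numpy as np

def simulate(n=9, num_y=10_000, a=4.0, seed=0, eps=1e-12):
    """Return weak, d- and strong correlations of one X against num_y random Y's."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=n)             # assumed stand-in for the fixed variable X
    ys = rng.uniform(size=(num_y, n))   # independent uniform Y's, one per row

    def row_abs_corr(u, v):
        # Row-wise absolute Pearson correlation between two 2-D arrays.
        u = u - u.mean(axis=1, keepdims=True)
        v = v - v.mean(axis=1, keepdims=True)
        num = (u * v).sum(axis=1)
        den = np.sqrt((u ** 2).sum(axis=1) * (v ** 2).sum(axis=1))
        return np.clip(np.abs(num / den), 0.0, 1.0)

    weak = row_abs_corr(np.broadcast_to(x, ys.shape), ys)
    # Lag-1 auto-correlations, floored at eps (an assumed adjustment for zeros).
    c1_x = max(row_abs_corr(x[None, :-1], x[None, 1:])[0], eps)
    c1_y = np.maximum(row_abs_corr(ys[:, :-1], ys[:, 1:]), eps)
    d = np.exp(-a * np.abs(np.log(c1_x / c1_y)))
    strong = np.minimum(weak, d)
    return weak, d, strong

weak, d, strong = simulate(n=9, a=4.0)
print("share of weak correlations above 0.60:  ", np.mean(weak > 0.60))
print("share of strong correlations above 0.60:", np.mean(strong > 0.60))
```

Plotting weak against d for each row gives a scatterplot of the same kind as figures 1 and 2.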
3. Excel spreadsheet with computations and examples
The spreadsheet shows the simulation of a variable X with n observations, stored in the first row, with thousands of simulated Y's in the subsequent rows. There are two tabs: one for n = 4, and one for n = 9. For instance, in the n = 9 tab, column J represents the weak correlation c(X, Y), column M represents c1(Y), and column N represents the strong correlation r(X, Y). The parameter a is stored in cell P1, and summary stats are found in cells Q1:T12. The spreadsheet is a bit unusual in the sense that rows represent variables, and columns represent observations.
Download spreadsheet (about 20 MB in compressed format)
The Excel spreadsheet has some intrinsic value besides explaining my methodology: you will learn about Excel techniques and formulas such as percentiles, random numbers and indirect countif formulas to build cumulative distributions. For better random generators, read this document or my book pages 161-163.
Figure 3: Both red and green series are highly weakly correlated to the blue series, but the green (bumpy) almost periodic series is not strongly correlated to the (smooth) blue series when using a = 4
Confidence intervals for these correlations are easy to obtain, by running these simulations 10 times and seeing what min and max you get. For details, read this article or my book pages 158-161.
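The min/max procedure can be sketched as follows, here for one summary statistic only: the share of weak correlations above 0.60 with n = 9. X is kept fixed across runs while the Y's are re-drawn, mimicking a spreadsheet recalculation. The random X, the seed and the function name share_above are assumptions made for the example; the same loop applies to any other summary statistic.

```python
import numpy as np

def share_above(x, rng, threshold=0.60, num_y=10_000):
    """Share of |corr(X, Y)| above threshold for one batch of random Y's."""
    ys = rng.uniform(size=(num_y, len(x)))
    xc = x - x.mean()
    yc = ys - ys.mean(axis=1, keepdims=True)
    corr = (yc @ xc) / np.sqrt((yc ** 2).sum(axis=1) * (xc ** 2).sum())
    return float(np.mean(np.abs(corr) > threshold))

rng = np.random.default_rng(0)
x = rng.uniform(size=9)                              # assumed stand-in for the fixed X
shares = [share_above(x, rng) for _ in range(10)]    # 10 independent runs
low, high = min(shares), max(shares)
print("share of weak correlations above 0.60 lies between", low, "and", high)
```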
4. When to use strong versus weak correlation?
The strong correlation is useful in contexts where you are comparing (or correlating or clustering) tons of data bins, for instance for transaction scoring, using a limited number n of features (n corresponding to the one used in this article). This is the case with granular hidden decision trees (see also my book pages 153-158 and 224-228).
It is also useful when comparing millions of small, local time series, for instance in the context of HFT (High Frequency Trading), when you try to find cross-correlations with time lags among thousands of stocks.
Note that a = 4 (as used in my spreadsheet) is too high in most situations, and I recommend a = 1, which has the following advantages:
- Simplification of the formula for r(X, Y)
- The fact that d(X, Y) [see section 1] is a raw rather than transformed, artificially skewed number, and thus likely to be more compatible and blendable with c(X, Y). This is obvious when you check the summary statistics in the spreadsheet and set a to 1.
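One way to read these two points, assuming c1(X) and c1(Y) are both strictly positive: the exponential and the logarithm cancel, so for any a the d-correlation is a power of the ratio of the smaller to the larger auto-correlation, and a = 1 leaves that ratio untransformed.

```latex
\begin{align*}
d(X,Y) &= \exp\left\{-a \left|\ln\frac{c_1(X)}{c_1(Y)}\right|\right\}
        = \left(\frac{\min\{c_1(X),\, c_1(Y)\}}{\max\{c_1(X),\, c_1(Y)\}}\right)^{a}, \\[4pt]
a = 1 \;&\Rightarrow\;
r(X,Y) = \min\left\{c(X,Y),\; \frac{\min\{c_1(X),\, c_1(Y)\}}{\max\{c_1(X),\, c_1(Y)\}}\right\}.
\end{align*}
```

With a = 4, the ratio is raised to the fourth power, which pushes d(X, Y) towards 0 unless the two auto-correlations are nearly equal.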
Note that in the spreadsheet, when n = 4 and a = 4, about 40% of all weak correlations are above 0.60, while only 5% of strong correlations are above 0.60. Technically, all the simulated Y's are uniform, random, independent variables, so it is amazing to see so many high (weak) correlations - these are indeed all spurious correlations. Even with n = 9, the contrasts between weak and strong correlations are still significant. The strong correlation clearly eliminates a very large chunk of the spurious correlations, especially when a > 2. But it can eliminate true correlations as well, thus my recommendation to use a = 1, as a compromise. A high value for a has effects similar to over-fitting and should be avoided.
It is possible to integrate auto-correlations of lag 1, 2, and up to n-2, but we then risk over-fitting, unless we put decaying weights on the various lags. This approach has certainly been investigated by other scientists (can you provide references?) as it amounts to doing a spectral analysis of time series, as mentioned in my article The Curse of Big Data (or see my book pages 41-45), and comparing spectra, likely using Fourier transforms.
It would be great to do this analysis on actual data, not just simulated random noise. Or even on non-random simulated data, using for instance the artificially correlated data (with correlations injected into the data) described in my article Jackknife Regression. Would we miss too many correlations, on truly correlated data, using strong correlation with a moderate a = 1 and n between 5 and 15?
Finally there are many other metrics available to measure other forms of correlations (for instance on unusual domains), see for instance my article on Structuredness coefficient or my book page 141.
For those participating in our data science apprenticeship (DSA), we have added strong correlation as one of the projects that you can work on: specifically, answer questions from sections 5 and 6 of this article (see project #1 under "data science research", in the DSA project list).
6. Asymptotic properties, additional research
I haven't done research in that direction yet. I have a few questions:
- Does the choice of a test variable X (I mean, the values in the first row of my spreadsheet) have an impact on the summary statistics in my spreadsheet?
- When n is large (> 20), does strong correlation still outperform weak correlation?
- Could we estimate the proportion of genuine, real correlations that are missed by the strong correlation (false negatives)?
- What proportion of spurious correlations is avoided with strong correlations, depending on n and a?
7. About synthetic metrics and our research lab
The strong correlation is a synthetic metric, and belongs to the family of synthetic metrics that we created over the last few years. Synthetic metrics are designed to efficiently solve a problem, rather than being crafted for their beauty, elegance and mathematical properties: they are directly derived from data experiments (bottom-up approach) rather than the other way around (top-down: from theory to application) as in traditional science. Other synthetic metrics include:
- Synthetic variance for Hadoop and big data
- Predictive power metric, related to entropy (that is, information quantification), used in big data frameworks, for instance to identify optimum feature combinations for scoring algorithms.
- Correlation for big data, defined by an algorithm and closely related to the optimum variance metric discussed here.
- Structuredness coefficient
- Bumpiness coefficient
Click here for details, or check my book pages 187-194.
This article builds on previous data science (robust correlations, bumpiness coefficient) to design a one-parameter, synthetic metric that can be used to dramatically reduce - by well over 50% - the number of spurious correlations. It is part of a family of robust, almost parameter-free metrics and methods developed in our research lab, to be integrated in black-box data science software or automated data science, and used by engineers or managers who are not experts in data science.
Source: www.datasciencecentral.com