Correlation vs. Causation: What are the Differences? - NBD Lite #32
Learn the differences between the two concepts
In statistics and data science, we often encounter the terms correlation and causation.
Correlation is mentioned more often than causation, and it is usually understood as a description of how two variables are related.
However, correlation and causation are distinct terms, each with its own definition.
The two concepts are intertwined and often used together, but we still need to know how they differ.
So, what are correlation and causation?
And how do they differ, and what is each one useful for? Let's explore these concepts a bit.
Correlation and Causation
Correlation
Let's define both terms properly. Correlation can be defined as a statistical relationship between two variables in which knowing one helps predict the other. The relationship also indicates the direction in which one variable moves relative to the other.
For example, a positive correlation means that when one variable increases in value, the other tends to increase as well. If the reverse happens (an increase in one variable tends to come with a decrease in the other), it is a negative correlation.
Correlation is often measured as a numerical coefficient. For example, the Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables, from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 meaning no linear relationship. The magnitude of the coefficient represents how strongly the two variables are correlated.
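To make the coefficient concrete, here is a minimal sketch that computes Pearson's r from its textbook definition (covariation divided by the product of each variable's variation). The function name `pearson_r` and the small data lists are made up for illustration; in practice you would likely use a statistics library instead.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # How x and y vary together around their means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # How each variable varies around its own mean
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical numbers, chosen only to show the sign of the coefficient
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]   # tends to rise with hours_studied
hours_gaming = [5, 4, 3, 2, 1]      # falls exactly as hours_studied rises

print("positive:", pearson_r(hours_studied, exam_score))
print("negative:", pearson_r(hours_studied, hours_gaming))
```

The first pair gives a coefficient close to +1 and the second gives -1, matching the positive and negative cases described above.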
So, if two variables are correlated, one of them must be affecting the other, right? Well, the answer is no. Let's take a look at the image below.
The image above shows that bee occurrence increased as the number of ice creams sold increased.
If we applied a correlation technique, these two variables would likely show a strong correlation. But are they affecting each other, is it just a coincidence, or do both depend on something else? Correlation alone cannot answer this, but a causal analysis may be able to.
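One way to see how this can happen is a small simulation. The sketch below assumes a third variable, temperature, that drives both ice cream sales and bee sightings; that assumption, all coefficients, and the noise levels are invented for illustration and are not taken from the image. The two series never influence each other in the code, yet they correlate strongly until temperature is accounted for.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient (textbook definition)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def residuals(y, x):
    """What is left of y after removing a least-squares straight line in x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return [b - (slope * a + intercept) for a, b in zip(x, y)]

random.seed(42)  # fixed seed so the sketch is reproducible

# Assumed confounder: daily temperature in degrees Celsius
temperature = [random.uniform(10, 35) for _ in range(500)]
# Each series depends on temperature plus its own noise, never on the other series
ice_creams = [20 * t + random.gauss(0, 40) for t in temperature]
bees = [3 * t + random.gauss(0, 8) for t in temperature]

r_raw = pearson_r(ice_creams, bees)
# Correlation of what remains once the temperature trend is removed from both
r_adjusted = pearson_r(residuals(ice_creams, temperature), residuals(bees, temperature))

print("raw correlation:", round(r_raw, 3))
print("after adjusting for temperature:", round(r_adjusted, 3))
```

The raw correlation is high, while the adjusted one is close to zero. In real data we usually do not know the confounder in advance, which is why the question needs more than a correlation coefficient.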
Causation
Causation can be defined as a cause-and-effect relationship between two variables, where a change in one variable directly results in a change in the other.
Correlation only tells us the strength of association between two variables, based on what we can learn from the sample data. In contrast, establishing causation requires a more rigorous investigation, ideally a controlled experiment.
This is because causation implies that intervening on one variable will predictably influence the other.
A common way to establish causation is A/B testing, in which we randomly assign subjects to one of two groups. One group receives the treatment, and the other serves as the control.
The results are then measured and statistically tested to determine whether the intervention affected the outcome. The goal is to find evidence of whether our action has a causal effect.
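The A/B testing steps above can be sketched in a few lines. This is a simulation under stated assumptions, not a recipe for a real experiment: the 400 subjects, the baseline outcome, and the built-in effect of 6 units are all hypothetical, and the permutation test is just one of several tests that could be used for the comparison.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Step 1: randomly assign 400 hypothetical subjects to treatment or control
subject_ids = list(range(400))
random.shuffle(subject_ids)
treatment_ids = set(subject_ids[:200])

TRUE_EFFECT = 6.0  # assumed effect, built into the simulation

# Step 2: measure the outcome in each group (simulated here)
treatment, control = [], []
for subject in range(400):
    baseline = random.gauss(50, 10)  # outcome the subject would have untreated
    if subject in treatment_ids:
        treatment.append(baseline + TRUE_EFFECT)
    else:
        control.append(baseline)

def mean(values):
    return sum(values) / len(values)

def permutation_test(a, b, n_permutations=2000):
    """Two-sided permutation test for a difference in group means."""
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under "no effect", group labels are arbitrary, so reshuffle them
        random.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # +1 in numerator and denominator keeps the estimate away from exactly 0
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Step 3: test whether the difference is larger than chance would explain
observed_diff, p_value = permutation_test(treatment, control)
print("difference in means:", round(observed_diff, 2))
print("p-value:", round(p_value, 4))
```

Because the groups were formed at random, a small p-value here points to the treatment itself rather than to some pre-existing difference between the groups.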
Some Example Techniques
We have established how correlation and causation work. Next, we will look briefly at a few techniques used to measure correlation or to establish causation.
Correlation Techniques
There are two standard families of correlation measures: parametric and non-parametric. The choice depends on your data, the assumptions of the technique, and the purpose of the analysis.
Let's look at an example of a parametric technique: Pearson's correlation.
Pearson's correlation (Pearson's r)
Pearson's correlation is a parametric method to measure the linear relationship between two numerical variables. Pearson's correlation relies on a few assumptions, including:
The data is numerical,
The two variables jointly follow a bivariate normal distribution,
There is a linear relationship between the variables.
For the non-parametric example, we use Spearman's rank correlation.
Spearman's rank correlation (Spearman's rho)
Spearman's correlation is a nonparametric method for measuring the relationship between two variables (continuous or ordinal).
Spearman's correlation is based on the ranks of the data and is useful when the relationship between variables is not linear (or when the data does not meet the assumptions of the parametric method). Its assumptions are:
The data can be either continuous or ordinal.
The relationship between the variables is monotonic (as one variable increases, the other consistently increases or consistently decreases).
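The difference between the two measures shows up clearly on data that is monotonic but not linear. The sketch below computes Spearman's rho as Pearson's r applied to ranks (with tied values sharing their average rank) and compares it with Pearson's r on an exponential curve. The helper names and the data are illustrative choices, not part of any particular library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient (textbook definition)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values all receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Monotonic but strongly non-linear relationship
x = list(range(1, 11))
y = [math.exp(v) for v in x]

print("Pearson: ", round(pearson_r(x, y), 3))
print("Spearman:", round(spearman_rho(x, y), 3))
```

Spearman's rho is 1 here because y always increases with x, while Pearson's r is noticeably lower because the increase is far from a straight line.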
Causation Techniques
As with correlation, various techniques exist to establish causation, including the A/B testing mentioned above.
Another example is Causal Impact, an approach used to determine whether an intervention or action affected time-series data. As guidance, see the image below.
In the image above, we want to know whether the campaign revenue increased because of the intervention.
We can assess the possible effect of the intervention using a Bayesian Structural Time Series model.
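The core idea is to predict what the series would have looked like without the intervention (the counterfactual) and compare that with what was actually observed. The sketch below illustrates only that idea: it swaps the Bayesian Structural Time Series model for a plain least-squares line fitted on a control series, so it gives no uncertainty intervals and is not the Causal Impact method itself. The control series, the lift of 15, and all other numbers are simulated assumptions.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

N_PRE, N_POST = 60, 20   # days before and after the campaign starts (assumed)
TRUE_LIFT = 15.0         # assumed campaign effect, built into the simulation

# Hypothetical control series that the campaign does not affect
control = [100 + 0.5 * t + random.gauss(0, 2) for t in range(N_PRE + N_POST)]
# Revenue follows the control series, plus the lift once the campaign starts
revenue = [
    1.2 * c + 10 + random.gauss(0, 2) + (TRUE_LIFT if t >= N_PRE else 0.0)
    for t, c in enumerate(control)
]

def fit_line(x, y):
    """Least-squares slope and intercept for y as a straight line in x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

# Learn the revenue/control relationship from the pre-intervention period only
slope, intercept = fit_line(control[:N_PRE], revenue[:N_PRE])

# Counterfactual: revenue we would expect after the start date with no campaign
counterfactual = [slope * c + intercept for c in control[N_PRE:]]
pointwise_effect = [obs - cf for obs, cf in zip(revenue[N_PRE:], counterfactual)]
average_effect = sum(pointwise_effect) / N_POST

print("estimated average lift:", round(average_effect, 2))
```

The estimate lands near the simulated lift. A structural time-series model adds what this sketch lacks, such as trend and seasonality components and a credible interval around the estimated effect.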
Summary
We can summarize the differences between correlation and causation in a few points:
1. Directionality
Correlation reveals a (possible) relationship between two variables but does not provide definitive evidence of cause and effect. On the other hand, causation establishes a clear directionality in the relationship between the two variables.
2. Implications
Correlation only gives information about the direction and strength of the relationship between two variables, whereas causation implies that changing one variable will change the other.
3. Establishing Evidence
Correlation can be established from observational data alone, without manipulating or controlling the variables. Causation, however, requires controlled experiments or statistical techniques that rule out confounding factors.
That's a simple overview of what makes correlation and causation different.
Are there any more things you would love to discuss? Let’s talk about it together!