How can it be measured?

Causal inference measures the real impact on Y because of X. For example, what is the effect of ad campaigns on the sales of a product? It is critical to precisely understand the causal effects of these interventions on the subject.
One of the main threats to causal inference is the confounding effect from other variables.
In the case of ad campaigns, it could be a reduction in price of the product, change in the overall economy or various other factors that could be inducing the change in sales at the same time.
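To make the confounding problem concrete, here is a small simulated sketch in R. Everything in it (variable names, probabilities, effect sizes) is made up for illustration and is not taken from any real campaign. Ads are more likely to run during a price cut, so a naive comparison credits the ad with part of the price cut's effect.

```r
# Hypothetical simulation: all names and numbers are illustrative assumptions.
set.seed(42)
n <- 5000
price_cut <- rbinom(n, 1, 0.5)                      # the confounder
ad        <- rbinom(n, 1, 0.2 + 0.6 * price_cut)    # ads run more often during price cuts
sales     <- 10 + 2 * ad + 5 * price_cut + rnorm(n) # built-in ad effect is 2

# Naive estimate: ignores the price cut
naive <- unname(coef(lm(sales ~ ad))["ad"])

# Adjusted estimate: holds the price cut fixed
adjusted <- unname(coef(lm(sales ~ ad + price_cut))["ad"])

print(c(naive = naive, adjusted = adjusted))
```

The naive coefficient lands well above the built-in effect of 2, while the adjusted one sits close to it. In real data the confounders are rarely all observed, which is why the methods discussed in this article lean on control groups instead.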
So how do we correctly attribute the change in sales to the ad campaign?

There are two ways to estimate the true causal impact of the intervention on the subject.
Randomized experiment: It’s the most reliable method to infer the actual causal impact of the treatment, where we induce a change in the process at random and measure the corresponding change in the outcome variable.
However, in most cases it is impossible to run such an experiment and keep the whole system truly random.
Causal Inference in Econometrics: This method involves the application of statistical procedures to the data that is available already to arrive at the causal estimate while controlling for confounders.
Some approaches under this method are what we’ll be looking at in this analysis.
The following are the approaches:
- Difference in Differences (DD)
- Causal Impact
- Synthetic Control

The Basque dataset will be used for demonstration.
Using this data, we’ll estimate the true economic impact of terrorist conflict in the Basque country, an autonomous community in Spain, with the help of data from 17 other regions.
Let’s look at some facts about the data and the experimental design.
- The dataset contains information from 1955 to 1997.
- Information about 18 Spanish regions is available, one of which is the average for the whole country of Spain (we’ll remove that).
- The treatment year is considered to be 1975.
- The treatment region is “Basque Country (Pais Vasco)”.
- The economic impact measurement variable is GDP per capita (in thousands).

The analysis has been done in R and the source code can be found in my GitHub.
We’ll get started with the approaches mentioned.
First Difference Estimate

Before going into the Difference in Differences method, let’s look at First Differences and what it does.
Our goal here is to quantify the change in GDP before and after the terrorist conflict in the Basque country.
In a naive way, we can actually achieve this by constructing a first difference regression and observing the estimate.
Let’s look at the general trend of GDP per capita for the Basque country.
From the graph, we can see how the trend plunges right after the terrorist intervention and then recovers.
Our goal is to identify the magnitude of the plunge that we see.
The first difference estimate will tell us the difference in GDP before and after the treatment.
Let’s construct a first difference equation by having GDP as the dependent variable and pre-post indicator as the independent variable.

    library(stargazer)
    # Regress GDP per capita on the pre/post indicator
    f_did <- lm(data = basq_fdid, gdpcap ~ post)
    stargazer(f_did, type = "text")

The coefficient of the post indicator suggests that GDP per capita is ~2.5 units higher in the post period. That’s not quite what we want, because we want to capture the declining trend.
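A toy series shows why the first difference estimate can carry the wrong sign. The sketch below does not use the Basque data. It builds a hypothetical GDP series with steady growth and a built-in drop of 1 unit after 1975, then fits the same kind of pre/post regression.

```r
# Toy series (hypothetical values): steady growth plus a drop of 1 unit after 1975.
year   <- 1955:1997
post   <- as.integer(year > 1975)                 # 1 in the post period, 0 before
gdpcap <- 3 + 0.15 * (year - 1955) - 1 * post
toy    <- data.frame(year, post, gdpcap)

# First difference regression: mean of post period minus mean of pre period
fd      <- lm(gdpcap ~ post, data = toy)
fd_post <- unname(coef(fd)["post"])
print(fd_post)   # positive, although the built-in effect is -1
```

The coefficient only compares the two period averages, so the underlying growth swamps the drop. Any factor that moves GDP over time ends up inside the estimate.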
That’s because there’s an expected problem as we mentioned earlier.
The trend in GDP could have been altered because of a lot of other variables, besides the terrorist conflict, occurring at the same time — otherwise known as confounders.
Possible confounders in this case are:
- Passing of a trade law which would affect local businesses and GDP
- Mutiny within local groups
- Perception of a corrupt or dysfunctional government

The solution to this is to compare the trend to a control region which was not impacted by the terrorist conflict.
This comparison allows us to remove the confounding effect after the intervention period and arrive at the real causal impact.
That’s where the Difference in differences method helps.
Difference in differences (DD)

The underlying assumption of the Difference-in-Differences (DD) design is that the trend of the control group provides an adequate proxy for the trend that would have been observed in the treatment group in the absence of treatment.
Thus, the difference in change of slope would be the actual treatment effect.
This requires that the treatment group and control group follow the same trend in the pre-period.
For this analysis, the control region was identified by finding the region with the lowest variation, across years, in the percentage difference between its GDP and that of the Basque country.
Alternatively, we can look for the control region by eyeballing the GDP trend for treatment and control groups when feasible.
In this case, Cataluna region was recognized to be the best control region.
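The selection rule described above can be sketched roughly as follows. This is one reading of the rule, not the author's code: the region names and values are hypothetical, and restricting the comparison to pre-period years is an assumption.

```r
# Toy pre-period data; region names and values are hypothetical.
year    <- 1955:1975
treated <- 3 + 0.15 * (year - 1955)
candidates <- data.frame(
  region_a = treated * 1.10,                                          # steady 10% gap
  region_b = treated * (1 + 0.02 * (year - 1955)),                    # gap widens each year
  region_c = treated + rep(c(-0.5, 0.5), length.out = length(year))   # gap flips sign
)

# Percentage difference from the treated region, per year and candidate
pct_diff <- 100 * (as.matrix(candidates) - treated) / treated

# Variation of that gap across years; lower means a steadier gap
gap_sd <- apply(pct_diff, 2, sd)
best   <- names(which.min(gap_sd))

print(round(gap_sd, 2))
print(best)
```

A steady percentage gap means the candidate moves in step with the treated region, which is what the DD design needs from a control.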
Let’s look at the GDP trend for test and control regions below:

The GDP trend for the Cataluna region goes hand in hand with Basque’s GDP, with the exception of a few years in the pre-period.
Thus, there should be no problem in considering Cataluna to be our control region.
Let’s go ahead and fit the regression with GDP as the dependent variable and treatment indicator and a pre-post indicator as independent variables.
The critical aspect here is to feed the interaction between treatment and pre-post indicator as we want the estimate to contain the effect of being treated along with being in post-period in comparison to not being treated and being in pre-period.
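Written out, the regression just described looks like this. The notation is mine rather than the original analysis's: i indexes regions, t indexes years, and T and C denote the treated and control regions.

```latex
\begin{align}
\text{gdpcap}_{it} &= \beta_0 + \beta_1\,\text{treat}_i + \beta_2\,\text{post}_t
  + \beta_3\,(\text{treat}_i \times \text{post}_t) + \varepsilon_{it} \\
\hat{\beta}_3 &= \left(\bar{Y}_{T,\text{post}} - \bar{Y}_{T,\text{pre}}\right)
  - \left(\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}}\right)
\end{align}
```

The second line is where the method gets its name: with only these two indicators and their interaction, the least squares estimate of the interaction coefficient equals the change in the treated region minus the change in the control region.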
After fitting, let’s look at the regression results below:

    # The interaction term treat:post is the DD estimate
    did <- lm(data = did_data, gdpcap ~ treat*post)
    stargazer(did, type = "text")

The estimate of the interaction term suggests that GDP per capita in the Basque country fell by 0.85 units because of the terrorist conflict.
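To see what the interaction coefficient measures, here is a sketch on a simulated two-region panel. The data are hypothetical, with a built-in effect of -1, and are unrelated to the Basque estimate. It checks that the regression coefficient matches the difference of the two before/after changes computed by hand.

```r
# Toy two-region panel (hypothetical values, not the Basque data).
set.seed(1)
panel <- expand.grid(year = 1955:1997, treat = c(0, 1))
panel$post <- as.integer(panel$year > 1975)
panel$gdpcap <- 3 + 0.15 * (panel$year - 1955) +   # trend shared by both regions
  0.5 * panel$treat -                               # constant level gap
  1.0 * panel$treat * panel$post +                  # built-in treatment effect of -1
  rnorm(nrow(panel), sd = 0.05)

did_toy <- lm(gdpcap ~ treat * post, data = panel)
beta3   <- unname(coef(did_toy)["treat:post"])

# The same number from the four group means
m <- with(panel, tapply(gdpcap, list(treat, post), mean))
by_hand <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

print(c(regression = beta3, by_hand = by_hand))
```

Because both regions share the trend, it cancels in the subtraction and the estimate recovers the built-in drop, which the plain pre/post comparison cannot do.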
Now there is a stark difference between estimates provided by First difference method and the DD method.
Quantitatively, we can see how the First difference estimates could be deceiving and naive to look at.
If you’re interested, you can read more about Difference in difference here.
For now, let’s move on to other causal inference methods.
Causal Impact

Causal Impact is a methodology developed by Google to estimate the causal impact of a treatment on the treated group.
The official documentation can be found here.
The motivation for using the Causal Impact methodology is that Difference in Differences is limited in the following ways:
- DD is traditionally based on a static regression model that assumes independent and identically distributed data, despite the fact that the design has a temporal component.
- Most DD analyses only consider two time points: before and after the intervention. In practice, we also have to consider the manner in which an effect evolves over time, especially its onset and decay structure.

The idea here is to use the trend in the control group to forecast what the trend in the treated group would have been if the treatment had not happened.
Then the actual causal estimate would be the difference between the actual trend and the counter-factual trend of the treated group that we predicted.
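The forecasting idea can be sketched with a deliberately simplified stand-in. The code below uses an ordinary linear regression on simulated data, not the Bayesian structural time-series model that Causal Impact fits. All names and numbers are hypothetical, with a built-in effect of -0.8.

```r
# Simplified stand-in for the idea only: plain lm(), NOT the CausalImpact model.
# All data are simulated; names and values are illustrative assumptions.
set.seed(7)
year    <- 1955:1997
pre     <- year <= 1975
control <- 3 + 0.15 * (year - 1955) + rnorm(length(year), sd = 0.05)
treated <- 0.4 + 1.05 * control + rnorm(length(year), sd = 0.05)
treated[!pre] <- treated[!pre] - 0.8               # built-in effect after 1975

toy <- data.frame(year, control, treated)

# Learn the treated/control relationship from pre-treatment years only
fit <- lm(treated ~ control, data = toy[pre, ])

# Forecast the treated series as if the treatment had not happened
counterfactual <- predict(fit, newdata = toy[!pre, ])

pointwise  <- toy$treated[!pre] - counterfactual   # effect in each post-period year
avg_effect <- mean(pointwise)
print(round(avg_effect, 2))
```

The pointwise differences show how the effect evolves year by year, which is the part a two-point DD comparison cannot give. The real method also produces uncertainty intervals around the forecast, which this sketch leaves out.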
Causal Impact uses Bayesian structural time-series models to explain the temporal evolution of an observed outcome.
Essentially, Causal Impact methodology is very close to the Synthetic control methodology we are going to see next.
Control region, in this case, is considered to be Cataluna again.
With the treated and control region’s GDP in place, let’s feed them to the Causal Impact function in R and look at the results.
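
    library(CausalImpact)
    # Assumption: pre-period dates of 1955 to 1975, taken from the dataset
    # span and treatment year stated earlier
    pre.period  <- as.Date(c("1955-01-01", "1975-01-01"))
    post.period <- as.Date(c("1976-01-01", "1997-01-01"))
    impact <- CausalImpact(basq_CI, pre.period, post.period)
    summary(impact)

The absolute effect is the difference between the actual GDP after the treatment and the counter-factual GDP.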
From the results, we can see that the absolute effect is -0.76, which means that GDP per capita fell by 0.76 units, i.e. 8%, because of the terrorist conflict in the Basque country.
This is close to the estimate we saw using the Difference in Differences method.
For those interested in Causal Impact, the authors give an exhaustive explanation of the method, which can be found here.
Synthetic Control

Synthetic control is a technique which is very similar to Causal Impact in estimating the true impact of a treatment.
Both methods use control groups to construct a counter-factual for the treated group, giving us an idea of what the trend would have been if the treatment had not happened.
The counter-factual GDP of the treated group would be predicted by the GDP of the control groups and also other possible covariates in the control group.
The synth algorithm predicts the counter-factual by assigning weights to the regressors in the control groups which helps identify individual regressors and their influence in prediction.
Ultimately, the true causal impact is the difference in GDP between actual GDP and the counter-factual GDP if the treatment had not happened.
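The weighting idea can be illustrated with a stripped-down sketch in base R. It is not the Synth package's algorithm: it matches only the pre-treatment outcome path, ignores covariates, and uses simulated donor regions with a built-in effect of -0.6. The weights are constrained to be non-negative and to sum to one.

```r
# Simplified sketch, NOT the Synth package: outcome-path matching only.
# Donor regions, weights and effect size are simulated assumptions.
set.seed(3)
year  <- 1955:1997
pre   <- year <= 1975
trend <- 0.15 * (year - 1955)
donors <- cbind(
  d1 = 2.5 + 1.0 * trend,
  d2 = 3.5 + 1.2 * trend,
  d3 = 4.0 + 0.6 * trend
) + matrix(rnorm(3 * length(year), sd = 0.03), ncol = 3)
true_w  <- c(0.5, 0.3, 0.2)
treated <- as.vector(donors %*% true_w)
treated[!pre] <- treated[!pre] - 0.6               # built-in effect after 1975

# Softmax keeps the weights non-negative and summing to 1,
# so optim() can search without explicit constraints
softmax <- function(z) exp(z - max(z)) / sum(exp(z - max(z)))
loss    <- function(z) sum((treated[pre] - donors[pre, ] %*% softmax(z))^2)
opt     <- optim(rep(0, 3), loss, method = "BFGS")
w       <- softmax(opt$par)

synthetic <- as.vector(donors %*% w)               # counter-factual path
gap       <- treated - synthetic
pre_rmse  <- sqrt(mean(gap[pre]^2))                # quality of the pre-period match
avg_gap   <- mean(gap[!pre])                       # average post-period effect

print(round(w, 2))
print(round(c(pre_rmse = pre_rmse, avg_gap = avg_gap), 3))
```

A small pre-period error says the weighted donors reproduce the treated region before treatment. The post-period gap is then read as the effect.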
The difference between Synthetic Control and Causal Impact is that Synthetic Control uses only pre-treatment variables for matching while Causal Impact uses the full pre and post-treatment time series of predictor variables for matching.
Let’s look at a plot for GDP trend in all the 17 other regions in the data.
As seen from the plot above, all the control regions show an upward trend in GDP similar to that of the Basque country in the pre-period.
This suggests that GDP in Basque could be constructed fairly accurately using the data from other regions.
The implementation of synthetic control on this problem statement has already been given in the official documentation of the package found here.
After execution, let’s look at the plot of actual GDP against the counter-factual GDP.
The path graph shows how closely the synthetic trend tracks the actual trend in the pre-period and how the two diverge gradually once the treatment happens.
That difference in trends in the post-period would be our average treatment effect.
The root mean squared error between the actual and synthetic trends was found to be 0.…
We can conclude that the true causal impact of the terrorist conflict on the Basque country is a reduction in GDP of 0.57 units, calculated using the Synthetic Control method.
Let’s look at the comparison of results from all three methods below:
- Difference in Differences: reduction of 0.85 units
- Causal Impact: reduction of 0.76 units
- Synthetic Control: reduction of 0.57 units

The magnitude of the causal impact differs only by a small margin between the three methods, and there is no method which will give us the “correct answer.” Most times, the approaches we use will be restricted by the nature of the experiment and what causal threats we are trying to address.
Some other techniques useful for inferring causal impact are Propensity score Matching, Fixed Effects Regression, Instrumental variables, and Regression Discontinuity.