A guide to avoiding the common pitfalls of event studies
Event studies are useful tools in the context of causal inference. They are used in quasi-experimental settings, where the treatment is not randomly assigned. In contrast to randomized experiments (i.e., A/B tests), one cannot rely on a simple comparison of group means to make causal inferences, and event studies are very useful in these situations.
Event studies are also frequently used to check for pre-treatment differences between the treated and untreated groups as a way to pretest for parallel trends, a crucial assumption of a popular causal inference method called difference-in-differences (DiD).
However, recent literature illustrates a variety of pitfalls in event studies. If ignored, these pitfalls can have significant consequences when event studies are used for causal inference or as a pretest for parallel trends.
In this article, I discuss these pitfalls and recommendations on how to avoid them. I focus on applications in the context of panel data where I observe units over time. I use a toy example to illustrate the pitfalls and recommendations. You can find the full code used to simulate and analyze the data here. In the article itself, I limit the code to the most essential parts to avoid clutter.
An Illustrative Example
Event studies are commonly used to analyze the impact of an event such as a new law in a country. A recent example of such an event is the implementation of lockdowns due to the pandemic. Many businesses were affected by the lockdowns because people started spending more time at home. For example, a music streaming platform might want to know whether people's music consumption patterns changed as a result of lockdowns so that it can respond to these changes and serve its customers better.
A researcher working for this platform can investigate whether the amount of music consumed changed after the lockdown. The researcher can use the countries that never imposed a lockdown, or imposed one later, as control groups. An event study would be appropriate in this situation. Assume for this article that the countries that impose a lockdown keep it in place until the end of our observation period and that the lockdown is binary (i.e., ignore that the strictness of the lockdown can vary).
Event Study Specification
I will focus on event studies of the form:
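Yᵢₜ = αᵢ + γₜ + Σₗ βₗ Dˡᵢₜ + ϵᵢₜ
where the sum runs over the included leads and lags l of the treatment (the dummy for the reference period, typically l = -1, is dropped to avoid perfect multicollinearity).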
Yᵢₜ is the outcome of interest. αᵢ is the unit fixed effect, which controls for time-constant unit characteristics. γₜ is the time fixed effect, which controls for time trends or seasonality. l is the time relative to the treatment and indicates how many periods it has been since the treatment at a given time t. For example, l = -1 means it is one period before the treatment, and l = 2 means it is two periods after the treatment. Dˡᵢₜ is the treatment dummy for relative time period l at time t for unit i. Essentially, we include both the leads and lags of the treatment. ϵᵢₜ is the random error.
The coefficient of interest, βₗ, indicates the average treatment effect in a given relative time period l. In the observation window, there are T periods, so the periods range from 0 to T-1. Units are treated at different periods. Each group of units that is treated at the same time forms a treatment cohort. This type of event study is a difference-in-differences (DiD) design in which units receive the treatment at different points in time (Borusyak et al. 2021).
Illustrative example continued:
In line with the illustrative example, I simulate a panel dataset. In this dataset, there are 10,000 customers (or units) and 5 periods (period 0 to 4). I sample unit and time fixed effects at random for these units and periods, respectively. Overall, we have 50,000 (10,000 units x 5 periods) observations at the customer-period level. The outcome of interest is music consumption measured in hours.
I randomly assign the customers to 3 different countries. One of these countries imposed a lockdown in period 2, another in period 3, and one never imposed a lockdown. Thus, customers from these countries are treated at different times. To make it easy to follow, I refer to the customers by their treatment cohorts depending on when they were treated: cohort period 2 and cohort period 3 for customers treated in periods 2 and 3, respectively. One of the cohorts is never treated and, thus, I refer to it as cohort period 99 for ease of coding.
In the simulation, after the customers are randomly assigned to one of these cohorts, I create the treatment dummy variable treat, which equals 1 if period >= cohort_period and 0 otherwise. treat indicates whether a unit is treated in a given period. Next, I create a dynamic treatment effect that grows with each treated period (e.g., 1 hour in the period in which the treatment starts and 2 hours in the period after that). Treatment effects are zero in the pre-treatment periods.
I calculate the outcome of interest, hrs_listened, as the sum of a constant that I chose (80), the unit and time fixed effects, the treatment effect, and an error term (random noise) for each unit and period. By construction, the treatment (lockdowns) has a growing positive impact on music consumption.
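For concreteness, the following is a minimal sketch of what such a simulation could look like. It is an illustration under my own assumptions (fixed-effect distributions, noise level, and the exact effect ramp), not the make_data() function used below and provided in the full code.
# A minimal, self-contained sketch of a make_data()-style simulator
library(data.table)

make_data_sketch <- function(n_units = 10000, n_periods = 5, seed = 123) {
  set.seed(seed)
  units <- data.table(unit = 1:n_units,
                      unit_fe = rnorm(n_units),
                      # first lockdown period: 2, 3, or 99 (never treated)
                      cohort_period = sample(c(2, 3, 99), n_units, replace = TRUE))
  periods <- data.table(period = 0:(n_periods - 1),
                        period_fe = rnorm(n_periods))
  dt <- CJ(unit = 1:n_units, period = 0:(n_periods - 1))
  dt <- merge(dt, units, by = "unit")
  dt <- merge(dt, periods, by = "period")
  # treatment dummy: 1 from the cohort's first treated period onward
  dt[, treat := as.integer(period >= cohort_period)]
  # cumulative treatment effect growing by 1 hour per treated period (an assumption)
  dt[, tau_cum := treat * (period - cohort_period + 1)]
  # outcome: constant + fixed effects + treatment effect + random noise
  dt[, hrs_listened := 80 + unit_fe + period_fe + tau_cum + rnorm(.N, sd = 0.5)]
  dt[]
}
The snippets below only rely on a data object with the columns unit, period, cohort_period, treat, tau_cum, and hrs_listened.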
I skip some of the setup and simulation parts of the code to avoid clutter, but you can find the full code here.
In the following image, I show a snapshot of the data. unit refers to the customers, cohort_period indicates when a unit was treated, and hrs_listened is the dependent variable, measuring a given customer's music consumption in hours in a given period.
rm(list = ls())
library(data.table)
library(fastDummies)
library(tidyverse)
library(ggthemes)
library(fixest)
library(kableExtra)

# Simulate the panel data (make_data() and select_cols are defined in the full code)
data <- make_data(...)
kable(head(data[, ..select_cols]), 'simple')
In the following figure, I show the trends in average music listening by cohort and period. I also mark when the countries imposed lockdowns for the first time. You can see that there seems to be a positive impact of the lockdowns for both the earlier- and later-treated countries compared to the customers from the untreated cohort.
# Graph average music listening by cohort and period
avg_dv_period <- data[, .(mean_hrs_listened = mean(hrs_listened)), by = c('cohort_period', 'period')]

# cbPalette is a color palette defined in the full setup code
ggplot(avg_dv_period, aes(fill = factor(cohort_period), y = mean_hrs_listened, x = period)) +
  geom_bar(position = "dodge", stat = "identity") + coord_cartesian(ylim = c(79, 85)) +
  labs(x = "Period", y = "Hours", title = 'Average music listening (hours)',
       caption = 'Cohort 2 is the early treated, cohort 3 is the late treated and cohort 99 is the never treated group.') +
  theme(legend.position = 'bottom',
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)) + scale_fill_manual(values = cbPalette) +
  geom_vline(xintercept = 1.5, color = '#999999', lty = 5) +
  geom_vline(xintercept = 2.5, color = '#E69F00', lty = 5) +
  geom_text(label = 'Cohort period 2 is treated', aes(1.4, 83), color = '#999999', angle = 90) +
  geom_text(label = 'Cohort period 3 is treated', aes(2.4, 83), color = '#E69F00', angle = 90) +
  guides(fill = guide_legend(title = "Treatment cohort period"))
Since this dataset is simulated, I know the true treatment effect of the lockdowns for each cohort and each period. In the following graph, I present the true treatment effects.
In the period in which a cohort is first treated, listening increases by 1 hour; in the next period, the treatment effect is 2 hours; and two periods after the start of the treatment (observed only for the earlier-treated cohort), the effect is 3 hours.
One thing to notice here is that the treatment effect profile is homogeneous across cohorts over relative periods (e.g., 1 hour in the first treated period and 2 hours in the period after, for both cohorts). Later, we will see what happens when this is not the case.
# Graph the true treatment effects
avg_treat_period <- data[treat == 1, .(mean_treat_effect = mean(tau_cum)), by = c('cohort_period', 'period')]

ggplot(avg_treat_period, aes(fill = factor(cohort_period), y = mean_treat_effect, x = period)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Period", y = "Hours", title = 'True treatment effect (hrs)',
       caption = 'Cohort 2 is the early treated, cohort 3 is the late treated and cohort 99 is the never treated group.') +
  theme(legend.position = 'bottom',
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)) + scale_fill_manual(values = cbPalette) +
  guides(fill = guide_legend(title = "Treatment cohort period"))
Now, we run an event study by regressing hrs_listened on the relative period dummies. The relative period is the difference between period and cohort_period. Negative relative periods indicate the periods before the treatment and positive ones indicate the periods after the treatment. We use unit fixed effects (αᵢ) and period fixed effects (γₜ) in all the event study regressions.
In the following table, I report the results of this event study. Unsurprisingly, no effects are detected pre-treatment. The post-treatment effects are precisely and correctly estimated as 1, 2, and 3 hours. So everything works so far! Let's look at situations where things do not work as well…
# Create relative time dummies to use in the regression
data <- data %>%
  # make the relative period indicator (99 for the never-treated cohort)
  mutate(rel_period = ifelse(cohort_period == 99, 99, period - cohort_period))
summary(data$rel_period)

data <- data %>%
  dummy_cols(select_columns = "rel_period")
rel_per_dummies <- colnames(data)[grepl('rel_period_', colnames(data))]

# Replace the minus signs in the names to handle them more easily
rel_per_dummies_new <- gsub('-', 'min', rel_per_dummies)
setnames(data, rel_per_dummies, rel_per_dummies_new)

# Event study: drop the never-treated dummy and use relative period -1 as the reference
covs <- setdiff(rel_per_dummies_new, c('rel_period_99', 'rel_period_min1'))
covs_collapse <- paste0(covs, collapse = '+')

formula <- as.formula(paste0('hrs_listened ~ ', covs_collapse))
model <- feols(formula,
               data = data, panel.id = "unit",
               fixef = c("unit", "period"))
summary(model)
Everything has worked well so far, but here are the top 4 things to watch out for to avoid potential pitfalls when using the event study approach:
1. No-anticipation assumption
Many applications of event studies in the literature impose a no-anticipation assumption. The no-anticipation assumption means that treated units do not change their behavior in expectation of the treatment before the treatment starts. When the no-anticipation assumption holds, one can use the period just before the event as (one of) the reference period(s) and compare the other periods to it.
However, the no-anticipation assumption might not hold in some cases, e.g., when the treatment is announced before it is imposed and units can respond to the announcement by adjusting their behavior. In this case, one needs to choose the reference periods carefully to avoid bias. If you have an idea of when the subjects start to anticipate the treatment and change their behavior, you can use that period as the de facto start of the treatment and use the period(s) before it as the reference period(s) (Borusyak et al. 2021).
For example, if you suspect that the subjects change their behavior in l = -1 (one period before the treatment) because they anticipate the treatment, you can use l = -2 (two periods before the treatment) as your reference period. You do this by dropping Dˡᵢₜ for l = -2 from the equation instead of the dummy for l = -1; this way, the l = -2 period serves as the reference period. To check whether your hunch about units changing their behavior in l = -1 is right, you can test whether the estimated treatment effect in l = -1 is statistically significant.
Illustrative example continued:
Going back to our illustrative example, lockdowns are usually announced a bit before they are imposed, which might affect the units' pre-treatment behavior. For example, people might already start working from home once the lockdown is announced but not yet imposed.
As a result, people may change their music-listening behavior even before the actual implementation of the lockdown. If the lockdown is announced one period before the actual implementation, one can use relative period -2 as the reference period by dropping the dummy for relative period -2 from the specification (and including the dummy for relative period -1).
In line with this example, I copy and modify the original data to introduce an anticipation effect: a 0.5-hour increase in hours listened for all treated units in relative period -1. I call this new dataset data_anticip.
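As a rough sketch (assuming the data object and rel_period column created above), the anticipation dataset could be built as follows; the actual construction in the full code may differ.
# Copy the data and add a 0.5-hour anticipation effect in relative period -1
data_anticip <- copy(data)
data_anticip[rel_period == -1, hrs_listened := hrs_listened + 0.5]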
The next graph shows the average music listening time over relative periods. It is easy to notice that listening time already starts to pick up in relative period -1 compared to relative periods -2 and -3. Ignoring this significant change in listening time can produce misleading results.
# Summarize the hours listened over relative periods (excluding the untreated cohort)
avg_dep_anticip <- data_anticip[rel_period != 99, .(mean_hrs_listened = mean(hrs_listened)), by = 'rel_period']
setorder(avg_dep_anticip, 'rel_period')
rel_periods <- sort(unique(avg_dep_anticip$rel_period))

ggplot(avg_dep_anticip, aes(y = mean_hrs_listened, x = rel_period)) +
  geom_bar(position = "dodge", stat = "identity", fill = 'deepskyblue') + coord_cartesian(ylim = c(79, 85)) +
  labs(x = "Relative period", y = "Hours", title = 'Average music listening over relative time period',
       caption = 'Only for the treated units') +
  theme(legend.position = 'bottom',
        legend.title = element_blank(),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)) + scale_x_continuous(breaks = min(rel_periods):max(rel_periods))
Now, let's run an event study as we did before by regressing hours listened on the relative time period dummies. Keep in mind that the only thing I changed is the effect in relative period -1; the rest of the data is exactly the same as before.
You can see in the following table that the pre-treatment effects are negative and significant even though there are no real treatment effects in those periods. The reason is that we use relative period -1 as the reference period, which distorts all the effect estimates. What we need to do is use a period with no anticipation as the reference period.
# Event study on the data with anticipation (relative period -1 is still the reference)
formula <- as.formula(paste0('hrs_listened ~ ', covs_collapse))
model <- feols(formula,
               data = data_anticip, panel.id = "unit",
               fixef = c("unit", "period"))
summary(model)
In the following table, I report the event study results from the new regression, where I use relative period -2 as the reference period. Now we get the right estimates! No effect is detected in relative period -3, while an effect is correctly detected in relative period -1. Moreover, the effect sizes for the post-treatment periods are now correctly estimated.
# Use relative period -2 as the reference period instead:
# add the dummy for -1 back in and drop the dummy for -2
covs_anticip <- setdiff(c(covs, 'rel_period_min1'), 'rel_period_min2')
covs_anticip_collapse <- paste0(covs_anticip, collapse = '+')

formula <- as.formula(paste0('hrs_listened ~ ', covs_anticip_collapse))
model <- feols(formula,
               data = data_anticip, panel.id = "unit",
               fixef = c("unit", "period"))
summary(model)
2. Assumption of homogeneous treatment effects across cohorts
In the equation shown before, the treatment effect can only vary by relative time period. The implicit assumption is that these treatment effects are homogeneous across treatment cohorts. If this implicit assumption is wrong, however, the estimated treatment effects can differ considerably from the actual treatment effects, causing bias (Borusyak et al. 2021). An example would be a situation where earlier cohorts benefit more from the treatment than later-treated groups, meaning that treatment effects differ across cohorts.
The simplest solution to this issue is to allow for heterogeneity. To allow for treatment effect heterogeneity between cohorts, one can estimate relative-time- and cohort-specific treatment effects, as in the specification below, where c stands for the treatment cohort. Everything is the same as in the previous specification, except that a treatment effect βₗ,c is now estimated for each combination of relative time and treatment cohort. Dᵢᶜ stands for the treatment cohort dummy for a given unit i.
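Yᵢₜ = αᵢ + γₜ + Σc Σₗ βₗ,c (Dᵢᶜ · Dˡᵢₜ) + ϵᵢₜ
where the outer sum runs over the treatment cohorts c and the inner sum over the included leads and lags l.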
Illustrative example continued:
In the lockdown example, it may be that the effect of lockdowns differs across the treated countries for various reasons (e.g., perhaps in one of the countries people are more likely to comply with the new rule). Thus, one should estimate country- and relative-time-specific treatment effects instead of simply estimating relative-time-specific treatment effects.
In the original simulated dataset, I introduce cohort heterogeneity in the treatment effects across periods and call this new dataset data_hetero. The treatment effect for cohort period 2 is 1.5 times that of cohort period 3 in every treated period, as illustrated in the next graph.
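One way to introduce this kind of heterogeneity is sketched below, assuming the cohort-period-3 effect stays at the original tau_cum while the cohort-period-2 effect is scaled by 1.5; the actual construction in the full code may differ.
# Scale cohort period 2's treatment effect to 1.5 times that of cohort period 3
data_hetero <- copy(data)
data_hetero[cohort_period == 2 & treat == 1, hrs_listened := hrs_listened + 0.5 * tau_cum]
data_hetero[cohort_period == 2 & treat == 1, tau_cum := 1.5 * tau_cum]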
Now, as we did before, let's run an event study on data_hetero. The results are reported in the following table. Even though there are no treatment or anticipation effects in the pre-treatment periods, the event study detects statistically significant effects! This is because we do not account for the heterogeneity across cohorts.
# Event study that ignores the cohort heterogeneity
formula <- as.formula(paste0('hrs_listened ~ ', covs_collapse))
model <- feols(formula,
               data = data_hetero, panel.id = "unit",
               fixef = c("unit", "period"))
summary(model)
Let's account for the heterogeneity in treatment effects across cohorts by regressing hours listened on cohort-specific relative period dummies. In the following table, I report the results of this event study: the treatment effect estimates for each cohort and relative period. By allowing the treatment effects to vary by cohort, we account for the heterogeneity and, as a result, we get the right estimates! No effects are detected pre-treatment, as it should be.
# Create dummies for the cohort period
data_hetero <- data_hetero %>%
  dummy_cols(select_columns = "cohort_period")
cohort_dummies <- c('cohort_period_2', 'cohort_period_3')

# Create interactions between the relative period and cohort dummies
interact <- as.data.table(expand_grid(cohort_dummies, covs))
interact[, interaction := paste0(cohort_dummies, ':', covs)]
interact_covs <- interact$interaction
interact_covs_collapse <- paste0(interact_covs, collapse = '+')

# Run the event study with cohort-specific relative period dummies
formula <- as.formula(paste0('hrs_listened ~ ', interact_covs_collapse))
model <- feols(formula,
               data = data_hetero, panel.id = "unit",
               fixef = c("unit", "period"))
summary(model)
3. Under-identification in the fully dynamic specification in the absence of a never-treated group
In a fully dynamic event study specification that includes all leads and lags of the treatment (usually only relative time -1 is dropped to avoid perfect multicollinearity), the treatment effect coefficients are not identified in the absence of a never-treated group. The reason is that the dynamic causal effects cannot be distinguished from a combination of unit and time effects (Borusyak et al. 2021). The practical solution is to drop another pre-treatment dummy (i.e., another one of the lead treatment dummies) to avoid the under-identification problem.
Illustrative example continued:
Imagine that we do not have data on any untreated countries, so our sample contains only the treated countries. We can still run an event study by exploiting the variation in treatment timing. In this case, however, we have to use not one but at least two reference periods to avoid under-identification. One can do this by dropping the dummies for the period right before the treatment and for the most negative relative period from the specification.
In the simulated dataset, I drop the observations from the untreated cohort and call this new dataset data_under_id. Now we have only treated cohorts in our sample; the rest is the same as in the original simulated dataset. Thus, we have to use at least two reference periods by dropping two of the pre-treatment relative period dummies. I choose to exclude the dummies for relative periods -1 and -3, as in the sketch below. I report the results from this event study below. As you can see, only one pre-treatment relative period (-2) is now estimated in the model. The estimates are correct, great!
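A sketch of this exercise, reusing the relative period dummies created earlier:
# Drop the never-treated cohort
data_under_id <- data[cohort_period != 99]

# Use two reference periods: covs already excludes relative period -1, so drop -3 as well
covs_under_id <- setdiff(covs, 'rel_period_min3')
covs_under_id_collapse <- paste0(covs_under_id, collapse = '+')

formula <- as.formula(paste0('hrs_listened ~ ', covs_under_id_collapse))
model <- feols(formula,
               data = data_under_id, panel.id = "unit",
               fixef = c("unit", "period"))
summary(model)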
4. Using event studies as a pretest for the parallel trends assumption
It is a common strategy to use event studies as a pretest for the parallel trends assumption (PTA), a crucial assumption of the difference-in-differences (DiD) approach. The PTA states that, in the absence of the treatment, the treated and untreated units would follow parallel trends in the outcome of interest. Event studies are used to see whether the treated group behaves differently from the untreated group before the treatment occurs. The idea is that if no statistically significant difference between the treated and untreated groups is detected, the PTA is likely to hold.
However, Roth (2022) shows that this approach can be problematic. One issue is that these pretests often have low statistical power, which makes it harder to detect diverging trends. Another issue is that, with high statistical power, you might detect pre-treatment differences (pre-trends) even when they are not substantively important.
Roth (2022) recommends a couple of approaches to deal with this problem:
- Do not rely solely on the statistical significance of the pretest coefficients; take the statistical power of the pretest into account. If the power is low, the event study will not be very informative about the existence of a pre-trend. If statistical power is high, the results of the pretest might still be misleading, as you might find a statistically significant pre-trend that is not substantively important.
- Consider approaches that avoid pretesting altogether, e.g., use economic knowledge of the context to choose the right PTA, such as a conditional PTA. Another approach is to use the later-treated group as the control group if you think the treated and never-treated groups follow different trends and are not as comparable. See Callaway & Sant'Anna's 2021 paper for potential ways to relax the PTA; one possible implementation is sketched right after this list.
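To make the last point concrete, here is a sketch using the did R package, which implements the estimators of Callaway & Sant'Anna (2021). The call below is my own illustration based on the simulated data's column names, not part of the original analysis.
# Callaway & Sant'Anna (2021) estimator via the did package
library(did)

# did expects the first-treatment period in gname, with never-treated units coded as 0
data_cs <- copy(data)
data_cs[, cohort_period := ifelse(cohort_period == 99, 0, cohort_period)]

cs_res <- att_gt(yname = "hrs_listened",
                 tname = "period",
                 idname = "unit",
                 gname = "cohort_period",
                 data = data_cs,
                 control_group = "notyettreated")  # later-treated units serve as controls
summary(aggte(cs_res, type = "dynamic"))            # event-study-style aggregation
With control_group = "notyettreated", the later-treated cohort serves as the comparison group for the earlier-treated one, in line with the second bullet above.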
Illustrative example continued:
Going back to the original example with three countries, let's say that we want to perform a DiD analysis and want to find support for the PTA holding in this context. The PTA would mean that, had the treated countries not been treated, their music consumption would have moved in parallel with the music consumption in the untreated country.
We consider using an event study as a way to pretest the PTA because there is no way to test the PTA directly. First, we need to take the statistical power of the test into account. Roth (2022) provides some tools for this. Although this is beyond the scope of this article, I can say that in this simulated dataset we have relatively high statistical power, because the random noise is low and we have a relatively large sample with not that many coefficients to estimate. Still, it would be good to run scenario analyses to see how large a pre-treatment effect one could correctly detect.
Second, regardless of the statistical significance of the pre-treatment estimates, take the specific context into account. Do I expect the treated countries to follow the same trends as the untreated country? In my simulated data, I know this for sure because I determine what the data looks like. In the real world, however, it is unlikely that this would hold unconditionally. Thus, I would consider using a conditional PTA by conditioning on various covariates that make the countries more comparable to each other.
Conclusion
Event studies are powerful tools, but one should be aware of their potential pitfalls. In this article, I explored the most commonly encountered pitfalls and provided recommendations on how to deal with them, using a simulated dataset. I discussed issues related to the no-anticipation assumption, heterogeneity of treatment effects across cohorts, under-identification in the absence of an untreated cohort, and the use of event studies as a pretest for the PTA.