Class 17 Regression Discontinuity Design

Author
Affiliation

Dr Wei Miao

UCL School of Management

Published

November 29, 2023

1 Natural Experiment

1.1 From RCTs to Secondary Data

  • RCTs are the gold standard of causal inference: In an RCT, the treatment is randomized and hence uncorrelated with any confounding factors, i.e., \(cov(X,\epsilon)=0\)

  • In practice, however, it can be challenging to implement a perfect RCT.

    1. Crossover and spillover effects;

    2. Costly in terms of time and money

  • Therefore, we may want to exploit causal effects from existing secondary data. Besides the instrumental variable method, we can also investigate natural experiments.

1.2 Comparison: RCT & Natural Experiment

Natural Experiment

A natural experiment is an event in which individuals are exposed to the experimental conditions that are determined by nature or exogenous factors beyond researchers’ control. The process governing the exposures arguably resembles randomized experiments.

RCT

  1. Assignment of treatment is randomized by us

  2. Treatment is under control by us

  3. Primary data

Natural Experiment

  1. Assignment of treatment is randomized by nature

  2. Treatment is not controlled by us

  3. Secondary data

2 Regression Discontinuity Design

2.1 What is an RDD

  • A regression discontinuity design (RDD) is a natural-experimental design that aims to determine the causal effects of interventions by identifying a cutoff around which an intervention is as if randomized across individuals.

Visual illustration of RDD

2.2 Motivating Example

Business objective: What is the causal effect of receiving a Master’s degree with Distinction versus Merit on students’ future salary?

  • Can we run the following simple linear regression and obtain the causal effects?
    • No, because the regression suffers from endogeneity issue, such as omitted variable bias. Individual’s ability is correlated with their average grades and also affects individual salary, which is omitted from the regression.

\[ Salary_i = \alpha + \beta Distinction_i + \epsilon_i \]

  • Can we use RCT?
    • No, RCTs would be unethical to run for this research question.
  • Can we use instrumental variables?
    • Theoretically yes, but in practice, it is difficult to find valid instrumental variables.

2.3 A Natural Experiment in the UK

  • In the UK Education system, students receiving 70% or above final average grades will receive Distinction while students below 70% will receive Merit.

  • The above setting gives us a nice natural experiment:

    • Students may improve their average grades from 60% to 69% by working harder, but they cannot perfectly control their average grades around the cutoff, say from 69.9% to 70%.

2.4 Visual Illustration of RDD: An Example of Distinction on Salary

2.5 Why RDD Gives Causal Effects?

  • For students just above 70%, to measure the treatment effects of receiving Distinction, we would need their counterfactual salaries if they had not received Distinction.

  • At the same time, because the “running variable” cannot be perfectly controlled by the individuals around the cutoff point, it’s as if the treatment was randomized near the cutoff. Thus, individuals near the cutoff should be very similar, such that there should be no systematic differences across the treatment and control group.

    • Similar to RCT, we overcome the fundamental problem of causal inference using students just below 70 as the control group.
  • All else being equal, a sudden change in the outcome variable at the cutoff can only be attributed to the treatment effect.

2.6 When Can We Use an RDD

  • An RDD design arises when treatment is assigned based on whether an underlying continuous variable crosses a cutoff.
    • The continuous variable is often referred to as the running variable.
  • AND the characteristic cannot be perfectly manipulated by individuals
    • We should only focus on individuals close to the cutoff point.

Exercise: eBay endorses sellers with 10,000 orders as Gold Seller. Can we use RDD to identify the causal effect of receiving Gold Seller endorsement on seller sales?

No, because sellers can have perfect control over their sales around the cutoff point.

3 Implementation of RDD

3.1 Step 1: Select Sample of Analysis

  1. Determine the bandwidth above and below the cutoff and select the subset of individuals within the bandwidth
    • e.g., if we choose a bandwidth of 0.5, we need to filter out students with average scores between 69.5 and 70.5
  • We face a trade-off when selecting the bandwidth: If we choose a smaller bandwidth around the cut-off
    • Pros: Individuals should be more similar around the cutoff, thus it is more likely the control group and treatment group are “as-if randomized”, thus higher internal validity.
    • Cons: We have a smaller subset of subjects which may not be representative of remaining individuals, thus lower external validity; We have a smaller sample size due to fewer individuals selected
  • In practice, there is no specific rule how to determine the bandwidth. We need to run a set of different bandwidths as robustness checks.

3.2 Step 2: Examine Continuity of Observed Characteristics

  1. Examine if other characteristics of the treatment group and control group are continuous at the cut-off point.
    • The idea is similar to “randomization check” in an RCT.

3.3 Step 3: Data Analysis

  1. Regress the outcome variable on the treatment indicator to obtain the causal effect.

\[ Y_i = \beta_0 + \beta_1 Treated + \epsilon_i \]

  • \(Treated\) is a binary variable for whether or not the running variable is above the cutoff.

  • Sometimes, we may also want to control the running variable in the regression to mitigate its confounding effects.

3.4 The Causal Effect of Distinction on Salary

  • Generate a synthetic dataset
pacman::p_load(dplyr,fixest,modelsummary)
n <- 1000 # 1000 individuals
set.seed(888)
score<-runif(n,61,75) # generate scores between 61 and 75
experience<-runif(n,0,3) # generate work experience between 0 and 3

salary<-30000+ 2000*(score>=70) + # causal effect is 2000
  500 * score  + 400*experience + rnorm(n,0,800)

data_rdd <- data.frame(ID = 1:n,
                       score = score,
                       experience = experience,
                       salary = salary)
  • Generate the treatment indicator, Distinction, in the dataset using dplyr
data_rdd <- data_rdd %>%
  mutate(Distinction = ifelse(score >= 70, 1, 0))

3.5 Linear Regression Analyses

  • Run a linear regression: salary ~ Distinction
feols(
  fml  = salary ~ Distinction ,
  data = data_rdd) %>%
  modelsummary(stars = T,
               gof_map = c('nobs','r.squared'))
 (1)
(Intercept) 63306.835***
(57.722)
Distinction 5533.565***
(94.638)
Num.Obs. 1000
R2 0.774
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
  • The result suggests that Distinction can increase the salary by 5.5k, which is far from the true causal effect.

3.6 RDD Analysis

  • Step 1: Select a bandwidth around the cutoff, between 68% to 72%

  • Step 2: Examine discontinuity of other variables (randomization check).

  • Step 3: Run the linear regression on the smaller sample.

feols(
  fml  = salary ~ Distinction,
  data = data_rdd %>%
    filter(score > 68 & score < 72)) %>%
  modelsummary(stars = T, 
             gof_map = c('nobs','r.squared'))
 (1)
(Intercept) 65201.101***
(81.291)
Distinction 2898.285***
(110.725)
Num.Obs. 282
R2 0.710
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
  • Let’s try other bandwidths on the Quarto document. As we tighten the bandwidth, what do you find?

As we use stricter bandwidths, the coefficient of Distinction is getting closer to the real treatment effects. The reason is that treatment and control group individuals are getting more similar.

3.7 Visualization of RDD

  • rdrobust package provides a nice visualization tool for RDD.
pacman::p_load(rdrobust)
rdplot(y = salary, # outcome variable 
       x = score, # x is the running variable
      c = 70, # c is the cutoff point
      p = 2 # polynomial order to fit the trends
)

3.8 Regression Discontinuity in Time

  • A natural experiment occurred on a day affecting all customers, we can then implement a Regression Discontinuity in Time design (RDiT) as follows
    • The running variable is time; the cutoff is the day on which the natural experiment took place

\[ Y_{it}=\alpha+\beta_{1} Post_{it} + \mu_{i t} \]

  • The underlying assumption for RDiT is that, the pre-treatment outcomes before the natural experiment are good counterfactuals for the post-treatment outcomes if the natural experiment had not happened.
    • For the underlying assumption to hold, we need to take a short time window before and after the natural experiment, e.g., 7 days, 14 days, or 30 days.
  • The coefficient \(\beta_1\) then measures the changes in the outcome variable before and after the natural experiment.

3.9 After-class Reading