Class 18 Natural Experiment I: Regression Discontinuity Design

Author: Dr Wei Miao

Affiliation: UCL School of Management

Published: November 27, 2024

1 Natural Experiment

1.1 Class Objectives

  • Concept of regression discontinuity design

  • Estimation of causal effects using regression discontinuity design

  • Application of regression discontinuity design in the business field

1.2 From RCTs to Secondary Data

  • RCTs are the gold standard of causal inference: In an RCT, the treatment is randomized and hence uncorrelated with any confounding factors, i.e., \(cov(X,\epsilon)=0\)

  • In practice, however, it can be challenging to implement a perfect RCT.

    1. Crossover and spillover effects;

    2. Costly in terms of time and money

  • Therefore, we may want to estimate causal effects from existing secondary data. Besides the instrumental variable method, we can also exploit natural experiments.

1.3 Comparison: RCT & Natural Experiment

Natural Experiment

A natural experiment is an event in which individuals are exposed to experimental conditions that are determined by nature or by exogenous factors beyond researchers’ control. The process governing exposure arguably resembles a randomized experiment.

RCT

  1. Assignment of treatment is randomized by us

  2. Treatment is controlled by us

  3. Primary data

Natural Experiment

  1. Assignment of treatment is randomized by nature

  2. Treatment is not controlled by us

  3. Secondary data

2 Regression Discontinuity Design

2.1 What is an RDD

  • A regression discontinuity design (RDD) is a natural-experimental design that aims to determine the causal effects of interventions by identifying a cutoff around which an intervention is as if randomized across individuals.

Visual illustration of RDD

2.2 Motivating Example

Business objective: What is the causal effect of receiving a Master’s degree with Distinction versus Non-Distinction on students’ future salary?

  • Can we run the following simple linear regression and obtain the causal effect?

\[ Salary_i = \alpha + \beta Distinction_i + \epsilon_i \]

    • No, because the regression suffers from endogeneity, in particular omitted variable bias. An individual’s ability is correlated with their average grade and also affects their salary, but it is omitted from the regression.

  • Can we use RCT?
    • No, RCTs would be unethical to run for this research question.
  • Can we use instrumental variables?
    • Theoretically yes, but in practice, it is difficult to find valid instrumental variables.
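The omitted variable bias described above can be illustrated with a small simulation. This is only a sketch on made-up data: the variable `ability`, the assumed true effect of 3000, and every other parameter value are hypothetical and do not come from the course dataset.

```r
# Simulated data: all names and parameter values are hypothetical.
set.seed(1)
n <- 5000
ability <- rnorm(n)                             # confounder, unobserved in practice
score <- 65 + 5 * ability + rnorm(n, sd = 2)    # grades depend on ability
distinction <- as.integer(score >= 70)          # Distinction awarded at 70 or above
true_effect <- 3000                             # assumed causal effect of Distinction
salary <- 60000 + true_effect * distinction + 2000 * ability + rnorm(n, sd = 500)

# Naive regression: ability is omitted, so its effect is absorbed by Distinction
naive <- lm(salary ~ distinction)

# Infeasible benchmark: the same regression if ability could be observed
oracle <- lm(salary ~ distinction + ability)

naive_est <- unname(coef(naive)["distinction"])
oracle_est <- unname(coef(oracle)["distinction"])
print(c(naive = naive_est, oracle = oracle_est, truth = true_effect))
```

In this setup the naive coefficient should come out well above the assumed true effect, because graduates with Distinction also have higher ability on average.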

2.3 A Natural Experiment in the UK

  • In the UK education system, students with a final average grade of 70% or above receive a Distinction, while students just below 70% receive a Merit.

  • The above setting gives us a nice natural experiment:

    • Students may improve their average grades significantly, such as moving from 60% to 69% by working harder, but they cannot perfectly control their average grades around the cutoff, say, from 69.9% to 70%.

2.4 Visual Illustration of RDD: An Example of Distinction on Salary

2.5 Why Does RDD Give Causal Effects?

  • For students just above 70%, to measure the treatment effects of receiving Distinction, we would need their counterfactual salaries if they had not received Distinction.

  • At the same time, because the “running variable” cannot be perfectly controlled by the individuals around the cutoff point, it’s as if the treatment was randomized near the cutoff. Thus, individuals near the cutoff should be very similar, such that there should be no systematic differences across the treatment and control group.

    • Similar to RCT, we overcome the fundamental problem of causal inference using students just below 70 as the control group.
  • All else being equal, a sudden change in the outcome variable at the cutoff can only be attributed to the treatment effect.
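The argument above is often summarised in a single expression. The notation here is introduced for illustration and is not from the lecture: \(X_i\) is the running variable, \(c\) is the cutoff (70 in the example), and \(Y_i\) is the outcome.

```latex
\[ \tau_{RDD} = \lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x] \]
```

The first term is the expected outcome just above the cutoff (treated) and the second is the expected outcome just below it (control). Interpreting the difference as causal relies on the outcome without treatment changing smoothly in \(X_i\) around \(c\), and the resulting effect is local to individuals near the cutoff.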

2.6 Conditions for Using an RDD

  • An RDD arises when treatment is assigned based on whether an underlying continuous variable crosses a cutoff.
    • The continuous variable is often referred to as the running variable.
  • AND the running variable cannot be perfectly manipulated by individuals around the cutoff
    • We should only focus on individuals close to the cutoff point.

Exercise: eBay endorses sellers with 10,000 orders as Gold Seller. Can we use RDD to identify the causal effect of receiving Gold Seller endorsement on seller sales?

No, because sellers can have perfect control over their sales around the cutoff point.
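One informal way to look for manipulation is to compare how many observations fall just below and just above the cutoff; a large excess just above would be a warning sign. The sketch below uses simulated scores without any manipulation, and the window width `h` and all other values are assumptions made for illustration. It is a rough heuristic, not a formal density test.

```r
# Simulated running variable with no manipulation; all values are hypothetical.
set.seed(2)
n <- 2000
score <- runif(n, 60, 80)
cutoff <- 70
h <- 1                                   # assumed narrow window on each side

# Count observations just below and just above the cutoff
n_below <- sum(score >= cutoff - h & score < cutoff)
n_above <- sum(score >= cutoff & score < cutoff + h)

# Without sorting around the cutoff, an observation in the window should be
# roughly equally likely to fall on either side
check <- binom.test(n_above, n_below + n_above, p = 0.5)
print(c(below = n_below, above = n_above, p_value = check$p.value))
```

In the eBay exercise, one would expect this kind of comparison to show many more sellers just above 10,000 orders than just below.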

3 Implementation of RDD

3.1 Step 1: Select Sample of Analysis

  1. Determine the bandwidth above and below the cutoff and select the subset of individuals within the bandwidth.
    • e.g., if we choose a bandwidth of 0.5, we keep only students with average scores between 69.5 and 70.5.
  • We face a trade-off when selecting the bandwidth. If we choose a smaller bandwidth around the cutoff:
    • Pros: Individuals are more similar around the cutoff, so the control and treatment groups are more likely to be “as-if randomized”, giving higher internal validity.
    • Cons: The smaller subset of subjects may not be representative of the remaining individuals, giving lower external validity; the sample size is also smaller, so estimates are less precise.
  • In practice, there is no specific rule for choosing the bandwidth. We need to re-run the analysis with a set of different bandwidths as robustness checks.
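The bandwidth robustness check can be sketched as a loop over candidate bandwidths. The data below are simulated, and the assumed jump of 3000, the slope of 200, and the list of bandwidths are all hypothetical choices for illustration.

```r
# Simulated data: salary rises smoothly with score and jumps by 3000 at the cutoff.
set.seed(3)
n <- 10000
score <- runif(n, 50, 90)
treated <- as.integer(score >= 70)
salary <- 50000 + 200 * score + 3000 * treated + rnorm(n, sd = 1000)
sim <- data.frame(score, treated, salary)

# Re-estimate the difference in means for several bandwidths around the cutoff
bandwidths <- c(0.5, 1, 2, 5)
results <- do.call(rbind, lapply(bandwidths, function(h) {
    near <- sim[abs(sim$score - 70) < h, ]          # keep observations within h of 70
    fit <- lm(salary ~ treated, data = near)
    data.frame(
        bandwidth = h,
        n_obs = nrow(near),
        estimate = unname(coef(fit)["treated"])
    )
}))
print(results)
```

Narrow bandwidths leave few observations, while wide bandwidths let the smooth relationship between score and salary leak into the estimate, so in this simulation the estimate should drift upward as the bandwidth grows.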

3.2 Step 2: Examine Continuity of Observed Characteristics

  1. Examine if other characteristics of the treatment group and control group are continuous at the cut-off point.
    • The idea is similar to “randomization check” in an RCT.
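A continuity check of this kind can be sketched by regressing a pre-treatment characteristic on the treatment indicator within the bandwidth. The covariate `age` and all values below are hypothetical; the course dataset may not contain such a variable.

```r
# Simulated data: age is a pre-treatment covariate unrelated to the cutoff.
set.seed(4)
n <- 3000
score <- runif(n, 60, 80)
treated <- as.integer(score >= 70)
age <- rnorm(n, mean = 24, sd = 2)
sim <- data.frame(score, treated, age)

# Restrict to the chosen bandwidth (assumed here to be 2 points on each side)
near <- sim[abs(sim$score - 70) < 2, ]

# If treatment is as-if randomized near the cutoff, the coefficient on treated
# should be close to zero and statistically insignificant
balance <- lm(age ~ treated, data = near)
balance_est <- summary(balance)$coefficients["treated", ]
print(balance_est)
```

A large and significant coefficient on `treated` would suggest that the two groups differ systematically at the cutoff, which would undermine the design.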

3.3 Step 3: Data Analysis

  1. Regress the outcome variable on the treatment indicator to obtain the causal effect.

\[ Y_i = \beta_0 + \beta_1 Treated_i + \beta_2 running\_variable_i + \epsilon_i \]

  • \(Treated_i\) is a binary variable for whether or not the running variable is above the cutoff.

  • We may also want to control for the running variable in the regression to mitigate its confounding effect.
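The role of the running variable as a control can be sketched on simulated data. The centred variable `score_c`, the assumed jump of 3000, and the other parameter values are hypothetical and chosen only for illustration.

```r
# Simulated data: salary rises smoothly with score and jumps by 3000 at the cutoff.
set.seed(5)
n <- 5000
score <- runif(n, 60, 80)
treated <- as.integer(score >= 70)
salary <- 50000 + 200 * score + 3000 * treated + rnorm(n, sd = 800)
sim <- data.frame(score, treated, salary)

# Keep observations within 2 points of the cutoff and centre the running variable
near <- sim[abs(sim$score - 70) < 2, ]
near$score_c <- near$score - 70

# Difference in means only: picks up part of the smooth effect of score
fit_diff <- lm(salary ~ treated, data = near)

# Adding the running variable as a control, as in the equation above
fit_rdd <- lm(salary ~ treated + score_c, data = near)

diff_est <- unname(coef(fit_diff)["treated"])
rdd_est <- unname(coef(fit_rdd)["treated"])
print(c(difference_in_means = diff_est, with_running_variable = rdd_est))
```

Centring the score at the cutoff does not change the coefficient on `treated`; it only makes the intercept interpretable as the expected salary just below the cutoff.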

3.4 The Causal Effect of Distinction on Salary

  • We collect a dataset of 1000 graduates with their MSc final grade and salary.
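# Load the required packages (pacman installs any that are missing)
pacman::p_load(dplyr, fixest, modelsummary, ggplot2)

# Read the data and flag graduates whose final score is at or above the 70% cutoff
data_rdd <- read.csv("https://www.dropbox.com/scl/fi/z4rgm15cmp19m3i65il3a/data_rdd.csv?rlkey=wnb5ypssg79whq2x4iiov6vte&dl=1") %>%
    mutate(Distinction = ifelse(score >= 70, 1, 0))

# Preview the first five rows
data_rdd %>%
    slice(1:5)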

3.5 Linear Regression Analyses

  • Run a linear regression: salary ~ Distinction
# Naive OLS on the full sample: salary regressed on the Distinction indicator
feols(
    fml  = salary ~ Distinction,
    data = data_rdd
) %>%
    modelsummary(
        stars = TRUE,
        gof_map = c("nobs", "r.squared")
    )
                 (1)
(Intercept)      63306.835***
                 (57.722)
Distinction      5533.565***
                 (94.638)
Num.Obs.         1000
R2               0.774
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
  • The result suggests that Distinction is associated with a salary that is about 5.5k higher. As a causal effect, this is likely overestimated due to omitted variable bias.

3.6 RDD Analysis

  • Step 1: Select a bandwidth around the cutoff, here from 68% to 72% (2 percentage points on each side).
  • Step 2: Examine discontinuity of other variables (randomization check).
  • Step 3: Run a linear regression on the subsample.
# Keep only graduates within 2 points of the 70% cutoff, then regress salary on Distinction
result_RDD <- feols(
    fml = salary ~ Distinction,
    data = data_rdd %>%
        filter(score > 68 & score < 72)
)

3.7 RDD Results

# Report the RDD estimate with significance stars and basic fit statistics
result_RDD %>%
    modelsummary(
        stars = TRUE,
        gof_map = c("nobs", "r.squared")
    )
                 (1)
(Intercept)      65201.101***
                 (81.291)
Distinction      2898.285***
                 (110.725)
Num.Obs.         282
R2               0.710
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
  • The result suggests that Distinction increases salary by about 2.9k, which is likely closer to the causal effect than the full-sample OLS estimate.

3.8 Visualization of RDD

pacman::p_load(ggplot2, ggthemes)
data_rdd %>%
    ggplot(aes(x = score, y = salary)) +
    geom_point() +
    geom_vline(xintercept = 70, linetype = "dashed") +
    labs(
        title = "RDD: Distinction on Salary",
        x = "Final Score",
        y = "Salary"
    ) +
    geom_smooth(
        data = subset(data_rdd, score < 70),
        method = "lm", se = FALSE,
        color = "red"
    ) +
    geom_smooth(
        data = subset(data_rdd, score >= 70),
        method = "lm", se = FALSE,
        color = "blue"
    ) +
    theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

3.9 Application of RDD in Marketing and Business Context

3.10 After-class Reading