Class 18 Natural Experiment I: Regression Discontinuity Design

Author: Dr Wei Miao

Affiliation: UCL School of Management

Published: November 26, 2025

1 Natural Experiment

1.1 Class Objectives

  • Concept of regression discontinuity design

  • Estimation of causal effects using regression discontinuity design

  • Application of regression discontinuity design in the business field

1.2 From RCTs to Secondary Data

  • RCTs are the gold standard of causal inference. In practice, however, it can be challenging to implement a perfect RCT.

    • Crossover and spillover effects
    • Costly in terms of time and money
  • Therefore, we may want to estimate causal effects from existing secondary data. The instrumental variable method is a common way to do so. However, it requires finding valid instrumental variables, which can be challenging in practice.

  • The natural experiment method is another way to estimate causal effects from secondary data, which is our focus in this lecture.

    • Regression discontinuity design (RDD)
    • Difference-in-differences (DiD)

1.3 Comparison: RCT & Natural Experiment

Note: Natural Experiment

A natural experiment is an event in which individuals are exposed to quasi-experimental conditions determined by nature or by exogenous factors beyond their control. The process governing exposure arguably resembles a randomised experiment.

RCT

  1. Assignment of treatment is randomised by us

  2. Treatment is controlled by us

  3. Primary data

Natural Experiment

  1. Assignment of treatment is randomised by nature

  2. Treatment is not controlled by us

  3. Secondary data

2 Regression Discontinuity Design

2.1 What is an RDD

  • A regression discontinuity design (RDD) is a natural-experimental design that aims to determine the causal effects of interventions by identifying a cutoff around which an intervention is as if randomised across individuals.

Visual illustration of RDD


2.2 Motivating Example

Business objective: What is the causal effect of receiving a Master’s degree with Distinction versus Non-Distinction on students’ future outcomes?

  • Can we run the following simple linear regression and obtain the causal effect?

\[ Salary_i = \alpha + \beta Distinction_i + \epsilon_i \]

    • No, because the regression suffers from endogeneity, such as omitted variable bias. Individual ability is omitted from the regression, yet it is correlated with receiving a Distinction (through average grades) and also affects salary.

  • Can we use an RCT?
    • No, RCTs would be unethical to run for this research question.
  • Can we use instrumental variables?
    • Theoretically yes, but in practice, it is difficult to find valid instrumental variables.
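The omitted variable bias in the simple regression above can be illustrated with a small simulation. This is a sketch on made-up data, not the lecture dataset: the variable `ability`, the assumed true effect of 2000, and every other number are hypothetical. Base R `lm()` is used so that the sketch needs no extra packages.

```r
# Simulated illustration of omitted variable bias (all numbers are assumptions)
set.seed(1)
n <- 5000
ability <- rnorm(n)                              # unobserved in real data
score <- 65 + 5 * ability + rnorm(n, sd = 2)     # grades depend on ability
distinction <- as.numeric(score >= 70)
true_effect <- 2000                              # assumed causal effect of Distinction
salary <- 60000 + true_effect * distinction + 3000 * ability + rnorm(n, sd = 1000)

naive <- lm(salary ~ distinction)                # ability omitted, as in the equation above
full <- lm(salary ~ distinction + ability)       # infeasible in practice: ability is unobserved

coef(naive)["distinction"]   # biased upwards relative to true_effect
coef(full)["distinction"]    # close to true_effect
```

Because high-ability students are both more likely to receive a Distinction and more likely to earn more, the naive coefficient absorbs part of the ability effect.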

2.3 A Natural Experiment in the UK

  • In the UK education system, students receiving a final average grade of 70% or above will receive a Distinction, while students just below 70% will typically receive a Merit.

  • The above setting gives us a nice natural experiment:

    • Students may improve their average grades significantly, such as moving from 60% to 69% by working harder, but they cannot perfectly control their average grades around the cutoff, say, from 69.9% to 70%.

2.4 Visual Illustration of RDD: An Example of Distinction on Salary

2.5 Why Does RDD Give Causal Effects?

  • For students just above 70%, measuring the treatment effect of receiving a Distinction requires their counterfactual salaries had they not received a Distinction, which we never observe.

  • Because individuals cannot perfectly control the “running variable” around the cutoff, treatment is as if randomised near the cutoff. Individuals just above and just below the cutoff should therefore be very similar, with no systematic differences between the treatment and control groups.

    • As in an RCT, we overcome the fundamental problem of causal inference by using students just below 70% as the control group.
  • With all else equal, a sudden change in the outcome variable at the cutoff can be attributed to the treatment.
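The argument above can be summarised in one expression. The notation is introduced here for illustration only: \(\tau_{RDD}\) denotes the treatment effect at the cutoff and \(Score_i\) the running variable.

\[ \tau_{RDD} = \lim_{s \downarrow 70} E[Salary_i \mid Score_i = s] - \lim_{s \uparrow 70} E[Salary_i \mid Score_i = s] \]

The first term is the expected salary approaching the cutoff from above (Distinction) and the second is the expected salary approaching it from below (no Distinction). If nothing else changes discontinuously at 70%, the gap between the two is the effect of the Distinction for students at the cutoff.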

2.6 Conditions for Using an RDD

  • An RDD arises when treatment is assigned based on whether an underlying continuous variable crosses a cutoff. The continuous variable is often referred to as the running variable.

  • In addition, the running variable must not be perfectly manipulable by individuals.

Exercise: eBay endorses sellers with 10,000 orders as Gold Seller. Can we use RDD to identify the causal effect of receiving Gold Seller endorsement on seller sales?

No, because sellers can have perfect control over their sales around the cutoff point.
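Manipulation of this kind usually shows up as bunching just above the cutoff. The sketch below simulates hypothetical seller data (the order counts, the 80% manipulation rate, and the window of 100 orders are all assumptions) and uses a simple binomial test to compare the number of sellers just below and just above the threshold. It is a rough diagnostic, not a formal density test.

```r
# Simulated sellers; some of those just below the cutoff push themselves over it
set.seed(2)
cutoff <- 10000
orders <- round(runif(20000, min = 8000, max = 12000))

# Assumed manipulation: 80% of sellers within 100 orders below the cutoff top up
near_below <- orders >= cutoff - 100 & orders < cutoff
pushes <- near_below & runif(length(orders)) < 0.8
orders[pushes] <- cutoff + sample(0:20, sum(pushes), replace = TRUE)

# Without manipulation, counts just below and just above should be similar
window <- 100
n_below <- sum(orders >= cutoff - window & orders < cutoff)
n_above <- sum(orders >= cutoff & orders < cutoff + window)
test <- binom.test(n_above, n_above + n_below, p = 0.5)

c(below = n_below, above = n_above)
test$p.value   # a very small p-value signals bunching above the cutoff
```

When the counts on the two sides differ this sharply, sellers near the cutoff are not comparable and the as-if-random argument fails.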

3 Implementation of RDD

3.1 Data to Collect

  • We collect a dataset of 1000 graduates with their MSc final grade and salary.
Code
pacman::p_load(dplyr, fixest, modelsummary, ggplot2)
data_rdd <- read.csv(
    "https://www.dropbox.com/scl/fi/z4rgm15cmp19m3i65il3a/data_rdd.csv?rlkey=wnb5ypssg79whq2x4iiov6vte&dl=1"
) %>%
    mutate(Distinction = ifelse(score >= 70, 1, 0))

data_rdd %>%
    slice(1:5)
  • Key variables to collect:
    • Running variable: Final score
    • Outcome variable: Salary
    • Treatment variable: Distinction

3.2 Linear Regression Analyses

  • Run a linear regression: salary ~ Distinction
                  (1)
(Intercept)   63306.835***
              (57.722)
Distinction   5533.565***
              (94.638)
Num.Obs.      1000
R2            0.774
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
  • The estimate suggests that a Distinction is associated with a salary that is 5533.565 higher, but this is likely an over-estimate of the causal effect because of omitted variable bias.

3.3 RDD Analysis: Step 1

  • Step 1: Select a bandwidth around the cutoff, here between 68% and 72%
Code
result_RDD <- feols(
    fml = salary ~ Distinction,
    data = data_rdd %>%
        filter(score > 68 & score < 72)
)
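The choice of bandwidth involves a trade-off: a narrow window keeps students more comparable but leaves fewer observations. One way to see this is to re-estimate the model over several bandwidths. The sketch below uses simulated data rather than `data_rdd`; the data-generating process, the assumed jump of 1800, and the bandwidth grid are all hypothetical, and base R `lm()` stands in for `feols()`.

```r
# Simulated data with an assumed jump of 1800 at a score of 70
set.seed(3)
n <- 1000
score <- runif(n, 50, 90)
distinction <- as.numeric(score >= 70)
salary <- 30000 + 500 * score + 1800 * distinction + rnorm(n, sd = 1000)
sim <- data.frame(score, distinction, salary)

# Re-estimate the jump for a grid of bandwidths (in grade points either side of 70)
bandwidths <- c(1, 2, 4, 8)
sensitivity <- do.call(rbind, lapply(bandwidths, function(h) {
    local_data <- subset(sim, abs(score - 70) < h)
    fit <- lm(salary ~ distinction + score, data = local_data)
    data.frame(
        bandwidth = h,
        n_obs = nrow(local_data),
        estimate = unname(coef(fit)["distinction"]),
        std_error = summary(fit)$coefficients["distinction", "Std. Error"]
    )
}))
sensitivity
```

Wider bandwidths use more observations and give smaller standard errors. In real data they also rely more heavily on the functional form being correct away from the cutoff.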

3.4 RDD Analysis: Step 2

  • Step 2: Examine the distribution of the running variable and the continuity of other variables around the cutoff (randomisation check).
Code
# Visualise the running variable within the bandwidth;
# bunching just above 70 would suggest manipulation
data_rdd %>%
    filter(score > 68 & score < 72) %>%
    ggplot(aes(x = score)) +
    # Explicit bin width, with a bin edge placed exactly at the cutoff
    geom_histogram(binwidth = 0.2, boundary = 70) +
    geom_vline(xintercept = 70, linetype = "dashed") +
    labs(
        title = "Running Variable: Final Score",
        x = "Final Score"
    ) +
    theme_minimal()
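Beyond the running variable, the randomisation check can also be applied to variables determined before treatment: they should not jump at the cutoff. The sketch below runs such a balance check on simulated data. The covariate `age` is hypothetical (the lecture dataset is not assumed to contain it), and base R `lm()` stands in for `feols()`.

```r
# Simulated data with a pre-determined covariate that is smooth through the cutoff
set.seed(4)
n <- 1000
score <- runif(n, 50, 90)
distinction <- as.numeric(score >= 70)
age <- 24 + 0.02 * (score - 70) + rnorm(n, sd = 2)   # no jump at 70 by construction
sim <- data.frame(score, distinction, age)

# Same specification as the RDD regression, but with the covariate as the outcome
balance <- lm(age ~ distinction + score, data = subset(sim, score > 68 & score < 72))
balance_row <- summary(balance)$coefficients["distinction", ]
balance_row   # estimate, standard error, t value and p-value for the "jump" in age
```

A large and statistically significant coefficient on `distinction` in such a regression would be a warning sign that students on the two sides of the cutoff differ systematically.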

3.5 RDD Analysis: Step 3

  • Step 3: Run the RDD regression within the bandwidth.
Code
result_RDD <- feols(
    fml = salary ~ Distinction + score,
    data = data_rdd %>%
        filter(score > 68 & score < 72)
)
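A common variant of this regression centres the running variable at the cutoff and lets the slope differ on each side, so that the coefficient on the treatment dummy is the jump exactly at 70. The sketch below shows the idea on simulated data. The data-generating process, the assumed jump of 1800, and the bandwidth of 5 points are hypothetical, and base R `lm()` stands in for `feols()`.

```r
# Simulated data: different slopes on each side of the cutoff, assumed jump of 1800 at 70
set.seed(5)
n <- 1000
score <- runif(n, 50, 90)
distinction <- as.numeric(score >= 70)
salary <- 65000 + 400 * (score - 70) + 300 * distinction * (score - 70) +
    1800 * distinction + rnorm(n, sd = 1000)
sim <- data.frame(score, distinction, salary)
sim$score_c <- sim$score - 70   # running variable centred at the cutoff

# distinction * score_c expands to both main effects plus their interaction
local_fit <- lm(salary ~ distinction * score_c, data = subset(sim, abs(score_c) < 5))
coef(local_fit)["distinction"]   # estimated jump at the cutoff
```

Centring matters once the interaction is included: without it, the coefficient on `distinction` would measure the gap between the two fitted lines at a score of 0 rather than at 70.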

3.6 RDD Results

Code
result_RDD %>%
    modelsummary(
        stars = TRUE,
        gof_map = c("nobs", "r.squared")
    )
                  (1)
(Intercept)   28203.125***
              (6687.975)
Distinction   1777.295***
              (228.350)
score         536.677***
              (97.007)
Num.Obs.      282
R2            0.739
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
  • The estimate suggests that, near the 70% cutoff, a Distinction increases salary by 1777.295. This is likely closer to the causal effect than the naive OLS estimate of 5533.565.
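In equation form, the regression estimated within the bandwidth can be written as follows, extending the simple regression from the motivating example. The symbol \(\gamma\) is introduced here for the slope on the running variable.

\[ Salary_i = \alpha + \beta Distinction_i + \gamma Score_i + \epsilon_i \]

With the estimates reported above, \(\hat{\beta} = 1777.295\) is the estimated jump in salary at the cutoff, and \(\hat{\gamma} = 536.677\) is the estimated change in salary per additional grade point within the 68% to 72% window.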

3.7 Visualisation of RDD

Code
pacman::p_load(ggplot2, ggthemes)
data_rdd %>%
    ggplot(aes(x = score, y = salary)) +
    geom_point() +
    geom_vline(xintercept = 70, linetype = "dashed") +
    labs(
        title = "RDD: Distinction on Salary",
        x = "Final Score",
        y = "Salary"
    ) +
    geom_smooth(
        data = subset(data_rdd, score < 70),
        method = "lm",
        se = FALSE,
        color = "red"
    ) +
    geom_smooth(
        data = subset(data_rdd, score >= 70),
        method = "lm",
        se = FALSE,
        color = "blue"
    ) +
    theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

3.8 Additional Example

  • Scores vs. stars: A regression discontinuity study of online consumer reviews:
    • eBay’s star rating system creates a natural discontinuity
    • Sellers with average ratings ≥ 4.75 receive 5 stars; those below receive 4.5 stars
  • Research question: Does the visual star rating affect sales?
    • Treatment: Receiving 5 stars vs. 4.5 stars at the 4.75 threshold
    • Outcome: Product sales and buyer behaviour
    • Finding: A significant jump in sales at the 4.75 cutoff, indicating a causal effect of the displayed star rating

3.9 After-class Reading