Class 19 Frontiers of Marketing Analytics

Author: Dr Wei Miao
Affiliation: UCL School of Management
Published: December 6, 2023

1 Causal Machine Learning

1.1 When Machine Learning Meets Causal Inference

  • Causal Machine Learning (CML) represents a state-of-the-art development in data science.

    • While conventional machine learning excels at finding patterns and making predictions, it often falls short in understanding causation.

    • While conventional causal inference techniques (instrumental variables, DiD, RDD) can estimate average treatment effects, they are less suited to estimating heterogeneous treatment effects.

  • This is where CML steps in: it aims to uncover causal relationships while borrowing the predictive power of machine learning tools.

1.2 Causal Forest

  • Causal Forest, a key tool in the CML toolkit, is particularly noteworthy. It is an extension of the Random Forest: its core idea is to estimate the causal effect of a treatment using recursive binary splitting, similar to the decision trees in a random forest.

  • It does so by building a large number of causal trees, each based on a random subsample of the data and a random subset of the features.

1.3 Visual Illustration of Causal Forest

  • Unlike standard decision trees, which aim to predict outcomes, the trees in a Causal Forest predict the effect of the intervention in each leaf.

  • Each causal tree considers binary splits at candidate values of the features \(X\), and tries to make the predicted treatment effects in the resulting leaves as differentiated as possible (a simplified expression of this splitting criterion is sketched below).
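
One simplified way to express this splitting criterion (a sketch of the heterogeneity-maximizing rule used in causal trees; grf's actual implementation relies on a gradient-based approximation with additional honesty adjustments, which are omitted here): a candidate split of a parent node into child nodes \(L\) and \(R\) is scored by

\[
\Delta \;\propto\; \frac{n_L\, n_R}{n_L + n_R}\,\bigl(\hat{\tau}_L - \hat{\tau}_R\bigr)^2,
\]

where \(\hat{\tau}_L\) and \(\hat{\tau}_R\) are the estimated treatment effects in the two child nodes and \(n_L\), \(n_R\) are their sample sizes. The split that makes \(\Delta\) largest, i.e. the one producing the most heterogeneous treatment effects, is chosen.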

1.4 Implementation and Additional Reading Materials

  • grf is the R package that implements causal forests.

    • The Microsoft Research team has developed a Python counterpart to grf, named EconML.

    • The Stanford YouTube channel also provides comprehensive tutorial videos by Prof Susan Athey and colleagues.

  • Data cleaning: reshape the data so that each row represents an individual.

# Load packages: grf for causal forests, fixest for the base_did example data
pacman::p_load(grf, fixest, dplyr, ggplot2, ggthemes)
data("base_did")

# Outcome Y: first difference of each individual's average outcome (post minus pre)
data_Y <- base_did %>%
  mutate(Post = ifelse(period >= 6, 1, 0)) %>%
  group_by(id, Post) %>%
  summarise(avg_outcome = mean(y)) %>%
  group_by(id) %>%
  summarise(first_diff = avg_outcome[2] - avg_outcome[1]) %>%
  ungroup()

# Treatment indicator W: one row per individual
data_W <- base_did %>%
  select(id, treat) %>%
  unique()

# Feature X: pre-treatment average of x1 for each individual
data_X <- base_did %>%
  filter(period < 6) %>%
  group_by(id) %>%
  summarise(avg_x = mean(x1)) %>%
  ungroup()

1.5 Run Causal Forest

  • We can use the causal forest to estimate the conditional average treatment effect (CATE) for each individual and plot the histogram of the predicted effects.

# Fit the causal forest: X must be a matrix; Y and W are vectors aligned with X by id
cf <- causal_forest(X = data.matrix(data_X$avg_x),
                    Y = data_Y$first_diff,
                    W = data_W$treat)

# Out-of-bag predictions of the individual treatment effects (CATEs)
predicted_CATE <- predict(cf)

# Histogram of the predicted CATEs
ggplot() +
  geom_histogram(data = predicted_CATE,
                 aes(x = predictions),
                 color = 'black', fill = 'white') +
  theme_stata()
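
Beyond the individual-level predictions, the fitted forest can also be summarized; below is a minimal follow-up sketch using grf's average_treatment_effect() and variable_importance() functions.

# Doubly robust estimate of the average treatment effect implied by the forest
average_treatment_effect(cf, target.sample = "all")

# How often each feature is used for splitting (here there is only avg_x)
variable_importance(cf)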

1.6 Application: Heterogeneous Causal Effect of Surge Pricing on Uber Drivers

  • Miao et al. (2022) study the causal effects of surge pricing on drivers' labor supply decisions. The ridesharing company introduced surge pricing in one city but not in another, giving a natural difference-in-differences setup.

  • Using the causal forest method, we can compute the treatment effect of surge pricing for each individual driver and plot its distribution.

1.7 Clustering for Heterogeneity Analyses

  • We use K-means clustering to segment drivers into two clusters: full-time and part-time drivers (a minimal sketch of this step follows this list).

    • Full-time drivers saw decreased weekly revenues due to capacity constraints.

    • Part-time drivers flooded into the market and increased their weekly revenues by working more days.

  • Although surge pricing enlarged the total pie for the company, the benefit was unevenly distributed across full-time and part-time drivers.
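
As an illustration of this clustering step, here is a minimal sketch that applies K-means to the predicted CATEs from the grf example above (not the Uber data used in the paper).

# Cluster individuals into 2 segments based on their predicted treatment effects
set.seed(1234)
km <- kmeans(predicted_CATE$predictions, centers = 2)

# Attach segment labels and compare average predicted effects across segments
data_X %>%
  mutate(predicted_effect = predicted_CATE$predictions,
         segment = factor(km$cluster)) %>%
  group_by(segment) %>%
  summarise(avg_effect = mean(predicted_effect),
            n_individuals = n())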

1.8 Tips for Term 3 Dissertation Using Causal Forest

  • Help the company analyze

    • A/B tests they have run; investigate heterogeneous treatment effects

    • natural experiments, e.g., policies that were introduced in some markets before others

  • Focus on how treatment effects vary across individuals with different characteristics

2 Unstructured Data

2.1 Sentiment Analysis

  • Sentiment Analysis leverages the power of natural language processing (NLP) and machine learning to understand customer emotions and opinions.

  • This analytical technique processes vast amounts of unstructured text data from sources like social media posts, reviews, forum discussions, and customer feedback. By evaluating the tone and context of these texts, sentiment analysis classifies them into categories such as positive, negative, or neutral. This classification helps businesses gauge overall customer sentiment, monitor brand reputation, and understand consumer needs and preferences.

2.2 Implementation of Sentiment Analysis
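
A minimal sketch of a lexicon-based sentiment analysis in R, assuming a data frame data_review with columns review_id and text (as in the topic-modeling example in Section 2.6) and using the Bing lexicon shipped with tidytext.

pacman::p_load(tidytext, dplyr, tidyr)

# Tokenize the reviews, remove stop words, attach positive/negative labels
# from the Bing lexicon, and compute a net sentiment score per review
sentiment_by_review <- data_review %>%
  select(review_id, text) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(review_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)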

2.3 Topic Modeling

  • Topic Modeling is a natural language processing (NLP) technique used to automatically identify and extract underlying topics from large volumes of text data.

2.4 Implementation of Topic Modeling

  • The intuition behind topic modeling is that documents comprise mixtures of topics, where a topic is characterized by a cluster of words with high probability of appearing together. Algorithms like Latent Dirichlet Allocation (LDA) are commonly used; they assume each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. This probabilistic approach enables the algorithm to categorize and group words into topics without any prior labeling or training, making it an unsupervised machine learning technique.

  • Implementation: topic modeling in R using the topicmodels package (see the worked example in Section 2.6).

2.5 Application in Marketing

  • Topic modeling enables marketers to uncover prevailing subjects in customer feedback or online discussions, thus providing insights into consumer behavior and preferences. This can inform targeted marketing strategies, product development, and content creation.

  • For instance, by analyzing customer reviews, a company can identify common themes in customer satisfaction or dissatisfaction, guiding product improvements or highlighting areas for enhanced customer service.

2.6 Example of NLP: Guess the Book Name

# Load packages for text mining and topic modeling
pacman::p_load(tidyverse, tidytext, topicmodels, tm, SnowballC)

# Each row of data_review is one review, with a review_id and the review text
data_review <- read.csv("https://www.dropbox.com/scl/fi/zi0pt5uuxnxhj2vksx0ga/data_review.csv?rlkey=cfu7ab85qa9z8zo2uutmkssk7&dl=1")

# Tokenize, remove stop words, count word frequencies per review,
# and cast the counts into a document-term matrix
data_text <- data_review %>%
  select(review_id, text) %>%
  rename(document = review_id) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(document, word, sort = TRUE) %>%
  cast_dtm(document, word, n)

# Fit an LDA model with 3 topics
LDA_result <- data_text %>%
  LDA(k = 3, control = list(seed = 1234))

# Extract the per-topic word probabilities (beta) and plot the top 3 words per topic
LDA_result %>%
  tidy(matrix = 'beta') %>%
  group_by(topic) %>%
  slice_max(beta, n = 3) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
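
To see which topic dominates each review, one can also extract the per-document topic proportions (the gamma matrix); a minimal follow-up sketch:

# Per-document topic proportions: gamma is the share of each review's words
# attributed to each topic
LDA_result %>%
  tidy(matrix = 'gamma') %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%   # keep the dominant topic of each review
  ungroup() %>%
  count(topic)                  # number of reviews per dominant topic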

2.7 Tips for Term 3 Dissertation Projects Using Unstructured Data

  • Collect text data from Google Reviews, online forums, Trustpilot, or Twitter using data crawlers.

  • Conduct sentiment analysis and topic modeling on the text data month by month (a sketch of the monthly aggregation follows this list).

  • Investigate the evolving trends of sentiments and topics, and how company strategies dynamically affect customer sentiment and topics.
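
A minimal sketch of the monthly aggregation step, assuming each review also carries a hypothetical date column (not present in the example data above) and reusing the sentiment_by_review scores from Section 2.2.

pacman::p_load(dplyr, lubridate)

# Average net sentiment per calendar month; `date` is a hypothetical review-date column
monthly_sentiment <- data_review %>%
  select(review_id, date) %>%
  inner_join(sentiment_by_review, by = "review_id") %>%
  mutate(month = floor_date(as.Date(date), "month")) %>%
  group_by(month) %>%
  summarise(avg_net_sentiment = mean(net_sentiment))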

References

Miao, Wei, Yiting Deng, Wei Wang, Yongdong Liu, and Christopher S. Tang. 2022. “The Effects of Surge Pricing on Driver Behavior in the Ride-Sharing Market: Evidence from a Quasi-Experiment.” Journal of Operations Management. https://doi.org/10.2139/ssrn.3909409.