Class 14 Linear Regression for Causal Inference
1 Basics of Linear Regression
1.1 Linear Regression Models
A linear regression model takes the following form. \[ y_i = \beta_0 + x_{i1} \beta_1 + x_{i2}\beta_2+ \ldots + x_{ik}\beta_k + \epsilon_i \]
\(y_i\): Dependent variable/outcome variable
\(x_{ik}\): Independent variable/explanatory variable/control variable
\(\beta\): Regression coefficients; \(\beta_0\): intercept (should always be included)
\(\epsilon_i\): Error term, which captures the deviation of \(y_i\) from the regression line. Its conditional mean is assumed to be 0, i.e., \(E[\epsilon|X]=0\)
1.2 Linear Regression Models (Continued)
- If we take the conditional expectation of both sides given X, the error term drops out (since \(E[\epsilon|X]=0\)), leaving \[ E[Y|X] = \beta_0 + x_{i1} \beta_1 + x_{i2}\beta_2+ \ldots + x_{ik}\beta_k \]
1.3 Origin of the Name “Regression”
The term “regression” was first coined by Francis Galton to describe a biological phenomenon: The heights of descendants of tall ancestors tend to regress down towards a normal average.
The term “regression” was later extended by statisticians Udny Yule and Karl Pearson to a more general statistical context (Pearson, 1903).
In supervised learning models, “regression” has a different meaning: when the outcome variable to be predicted is continuous, the task is called a regression task. The terminology differs because ML models were developed by computer scientists, while causal inference models were developed by statisticians and economists.
2 Estimation of Coefficients
2.1 How to Run Regression in R
In R, there are many packages that can run OLS regressions. The basic function is `lm()`. In this module, we will be using the `fixest` package, because it can accommodate more complex regressions, especially those with high-dimensional fixed effects.
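As a quick illustration, here is a minimal sketch of both approaches, assuming the M&S data used later in this class are already loaded in a data frame `df` with columns `total_spending` and `Income`:

```r
library(fixest)

# Base-R OLS with lm()
fit_lm <- lm(total_spending ~ Income, data = df)
summary(fit_lm)

# The same OLS regression with fixest's feols()
fit_feols <- feols(total_spending ~ Income, data = df)
summary(fit_feols)
```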
2.2 Report Regression Results
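A minimal sketch of reporting, assuming the `fit_feols` object estimated above: `fixest` ships its own table function `etable()`, and the `modelsummary` package produces tables in the format shown in Section 3.1.

```r
library(fixest)
library(modelsummary)

# fixest's built-in regression table
etable(fit_feols)

# modelsummary table with significance stars
modelsummary(fit_feols, stars = TRUE)
```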
2.3 Parameter Estimation: Univariate Regression Case
- Regressions with a single regressor are called univariate regressions. Let’s take a univariate regression as an example:
\[ total\_spending_i = a + b \cdot income_i + \epsilon_i \]
- For each guess of a and b, we can compute the error for customer \(i\),
\[ e_i = total\_spending_{i}-a-b \cdot income_{i} \]
- We can compute the sum of squared residuals (SSR) across all customers
\[ SSR =\sum_{i=1}^{n}\left(total\_spending_{i}-a-b \cdot income_{i}\right)^{2} \]
Objective of estimation: Search for the unique pair of \(a\) and \(b\) that minimizes the SSR.
This estimation method that minimizes the SSR is called Ordinary Least Squares (OLS).
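To make the objective concrete, here is a minimal sketch that searches for the SSR-minimizing \(a\) and \(b\) numerically and checks the answer against `lm()`; the simulated data and all numbers are illustrative assumptions.

```r
set.seed(14)
# Simulated data (illustrative only)
income <- runif(200, 20000, 100000)
total_spending <- -500 + 0.02 * income + rnorm(200, sd = 100)

# Sum of squared residuals for a candidate (a, b)
ssr <- function(par) {
  sum((total_spending - par[1] - par[2] * income)^2)
}

# Numerical search; parscale helps because a and b live on very
# different scales
est <- optim(c(0, 0), ssr,
             control = list(parscale = c(100, 0.01), maxit = 10000))$par

est                                # numerical minimizer of the SSR
coef(lm(total_spending ~ income))  # OLS closed-form solution
```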
2.4 Visualization: Estimation of Univariate Regression
- In the M&S dataset, if we regress total spending (Y) on income (X), different candidate lines yield very different SSRs:
| Model | Color | Sum of Squared Residuals |
|---|---|---|
| \(Y = -556.823 + 0.06 \cdot X\) | Purple | \(1.5403487 \times 10^{13}\) |
| \(Y = 0 + 0.004 \cdot X\) | Red | \(6.420375 \times 10^{11}\) |
| \(Y = -556.823 + 0.022 \cdot X\) | Green | \(1.4356017 \times 10^{9}\) |
2.5 Multivariate Regression
- The OLS estimation also applies to multivariate regression with multiple regressors.
\[ y_i = b_0 + b_1 x_{i1} + \ldots + b_k x_{ik}+\epsilon_i \]
- Objective of estimation: Search for the unique set of \(b\) that can minimize the sum of squared residuals.
\[ SSR= \sum_{i=1}^{n}\left(y_{i}-b_0 - b_1 x_{i1} - \ldots - b_k x_{ik} \right)^{2} \]
3 Interpretation of Coefficients
3.1 Coefficients Interpretation
- Now, in your Quarto document, let’s run a new regression where the DV is \(total\_spending\) and X includes \(Income\) and \(Kidhome\), as sketched below.
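A minimal sketch of this regression, assuming the M&S data frame `df` from before; the `modelsummary()` call produces the table that follows.

```r
library(fixest)
library(modelsummary)

fit <- feols(total_spending ~ Income + Kidhome, data = df)
modelsummary(fit, stars = TRUE)
```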
| | (1) |
|---|---|
| (Intercept) | -299.119*** |
| | (28.069) |
| Income | 0.019*** |
| | (0.000) |
| Kidhome | -230.610*** |
| | (16.945) |
| Num.Obs. | 2000 |
| R2 | 0.661 |
| R2 Adj. | 0.660 |
| AIC | 29130.7 |
| BIC | 29147.5 |
| RMSE | 351.51 |
| Std.Errors | IID |

Note: + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
- Controlling for Kidhome, a one-unit increase in `Income` increases `total_spending` by £0.019.
3.2 Standard Errors and P-Values
If we collect all data from the whole population, the regression coefficient is called the population regression coefficient.
Because the regression is estimated on a random sample of the population, if we rerun the regression on different samples from the same population, we would obtain a different set of sample regression coefficients each time.
In theory, the sample regression coefficient estimates follow a t-distribution centered at the true \(\beta\). The standard error of an estimate is the estimated standard deviation of that estimate across repeated samples.
Knowing that the coefficients follow a t-distribution, we can test whether the coefficients are statistically different from 0 using hypothesis testing.
Both `Income` and `Kidhome` are statistically significant at the 1% level.
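To inspect the underlying test statistics directly, a small sketch assuming the `fit` object estimated in Section 3.1:

```r
# Coefficients, standard errors, t-statistics, and p-values
fixest::coeftable(fit)
```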
3.3 R-Squared
R-squared (\(R^2\)) is a statistical measure that represents the proportion of the variance of the dependent variable that is explained by all variables included in a regression (see the formula after this list).
Interpretation: 66% of the variation in `total_spending` can be explained by `Income` and `Kidhome`.
As the number of variables increases, the \(R^2\) will mechanically increase, so sometimes we need to penalize the number of variables using the so-called adjusted R-squared.
R-squared is mainly important for supervised learning prediction tasks, because it measures the predictive power of the explanatory variables. In causal inference tasks, however, \(R^2\) does not matter much.
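For reference, in terms of the SSR defined in Section 2, the (unadjusted) R-squared is
\[ R^2 = 1 - \frac{SSR}{SST}, \qquad SST = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2, \]
where the total sum of squares SST measures the overall variation in the outcome.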
4 Regression for A/B/N Testing
4.1 Categorical variables
So far, the independent variables we have used, `Income` and `Kidhome`, are continuous variables.
Some variables do not take meaningful numeric values; we need to treat them as categorical variables, e.g., gender, education group, city.
In A/B/N testing, the treatment assignment is also a categorical variable.
4.2 Handling Categorical Variables in R using factor()
- In R, we need to use the `factor()` function to explicitly inform R that a variable is categorical, so that statistical models will treat it differently from continuous variables.
  - e.g., we can use `factor(Education)` to indicate that `Education` is a categorical variable.
- We can use `levels()` to check how many categories there are in the factor variable.
  - e.g., `Education` has 5 different levels.
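A minimal sketch, assuming the M&S data frame `df` stores `Education` as character strings:

```r
# Convert Education into a factor variable
df$Education_factor <- factor(df$Education)

# Inspect the detected categories
levels(df$Education_factor)   # the distinct education groups
nlevels(df$Education_factor)  # 5, per the example above
```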
4.3 Setting the Baseline Group using relevel()
`factor()` will check all levels of the categorical variable and then choose the default baseline level based on alphabetical order.
If needed, we can change the baseline group to another group using the `relevel()` function.
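A sketch of re-baselining, continuing from the `Education_factor` variable above; the level name `"Basic"` is an assumption about how the basic education group is coded:

```r
# Make "Basic" the baseline (reference) group instead of the
# alphabetical default
df$Education_factor <- relevel(df$Education_factor, ref = "Basic")
levels(df$Education_factor)  # the baseline is now listed first
```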
4.4 Running Regression with Factor Variables
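A minimal sketch of such a regression; the exact specification (which controls to include alongside the factor) is an assumption for illustration:

```r
library(fixest)
library(modelsummary)

# R expands the factor into K-1 dummy variables automatically
fit_edu <- feols(total_spending ~ Income + Kidhome + Education_factor,
                 data = df)
modelsummary(fit_edu, stars = TRUE)
```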
4.5 Interpretation of Coefficients for Categorical Variables
- In general, R encodes a factor variable with K levels into K-1 coefficients, with one level serving as the baseline group.
- The interpretation of coefficients for factor variables: Ceteris paribus, compared with the [baseline group], the [outcome variable] of [group X] is higher/lower by [coefficient], and the coefficient is statistically [significant/insignificant].
- Ceteris paribus, compared with the basic education group, the total spending of the PhD group is lower by 153.190 dollars. The coefficient is statistically significant at the 1% level.
- Now please rerun the regression using `Education_factor` and interpret the coefficients. What is your finding?
- Conclusion: factor variables can only measure the relative difference in the outcome variable across groups; they cannot tell us the absolute level of each group.
4.6 Application of Categorical Variables in Marketing
- Quantify the treatment effects in A/B/N testing, where \(Treatment_i\) is a categorical variable that specifies the treatment group customer \(i\) is in:
\[ Outcome_i = \beta_0 + \delta \, Treatment_i + \epsilon_i \]
- Quantify the brand premiums or country-of-origin effects:
\[ Sales_i = \beta_0 + \beta_1 Brand_i + \beta_2 Country_i + X_i\beta +\epsilon_i \]
4.7 Application: A/B/N Testing Analysis Using Regression
- Let’s analyze our Instagram gamification experiment data using linear regression.
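A hedged sketch of the analysis; the data frame and variable names (`ig_df`, `engagement`, `Treatment`) are assumptions, since the actual experiment data are loaded elsewhere in the module:

```r
library(fixest)

# Treatment assignment is categorical (control vs. one or more
# gamification variants), so encode it as a factor
ig_df$Treatment <- factor(ig_df$Treatment)

# Each Treatment coefficient estimates the average difference in
# the outcome relative to the baseline (control) group
fit_ab <- feols(engagement ~ Treatment, data = ig_df)
summary(fit_ab)
```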