Class 12 OLS Regression Basics
1 Basics of Linear Regression
1.1 Linear Regression Models
A linear regression model takes the following form. \[ Y_i = \beta_0 + x_1 \beta_1 + x_2\beta_2+ \ldots + x_k\beta_k + \epsilon_i \]
\(Y_i\): Outcome variable/dependent variable/regressand/response variable/LHS variable
\(\beta\): Regression coefficients/estimates/parameters; \(\beta_0\): intercept
\(x_k\): Control variable/independent variable/regressor/explanatory variable/RHS variable
- Lower case such as \(x_1\) usually indicates a single variable while upper case such as \(X_{ik}\) indicates a set of several variables
\(\epsilon_i\): Error term, which captures the deviation of Y from the prediction
Its conditional mean is assumed to be 0, i.e., \(E[\epsilon|X]=0\)
Taking the expectation of Y conditional on X, we have \[ E[Y|X] = \beta_0 + x_1 \beta_1 + x_2\beta_2+ \ldots + x_k\beta_k \]
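The step from the model to this conditional mean can be sketched as follows. It relies only on linearity of expectation and the assumption \(E[\epsilon|X]=0\), treating the regressors as given:
\[ \begin{aligned} E[Y|X] &= E[\beta_0 + x_1 \beta_1 + \ldots + x_k\beta_k + \epsilon \,|\, X] \\ &= \beta_0 + x_1 \beta_1 + \ldots + x_k\beta_k + E[\epsilon|X] \\ &= \beta_0 + x_1 \beta_1 + \ldots + x_k\beta_k \end{aligned} \]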
1.2 Why the Name “Regression”?
The term “regression” was coined by Francis Galton to describe a biological phenomenon: The heights of descendants of tall ancestors tend to regress down towards a normal average.
The term “regression” was later extended by statisticians Udny Yule and Karl Pearson to a more general statistical context (Pearson, 1903).
In supervised learning models, “regression” has a different meaning: when the outcome is continuous, the task is called a regression task.
2 Estimation of Coefficients
2.1 How to Run Regression in R
In R, there are tons of packages that can run OLS regression.
In this module, we will be using the `fixest` package, because it’s able to estimate high-dimensional fixed effects.
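As a minimal sketch of what an OLS call with `fixest` might look like: the data below are simulated stand-ins for the course dataset, and the data frame `df`, the data-generating numbers, and the object name `ols_fit` are all assumptions made for illustration.

```r
# Minimal sketch: simulated data stand in for the course dataset (assumption).
library(fixest)

set.seed(123)
n <- 500
df <- data.frame(
  Income  = runif(n, 20000, 100000),
  Kidhome = sample(0:2, n, replace = TRUE)
)
# Hypothetical data-generating process, chosen for illustration only
df$total_spending <- -300 + 0.02 * df$Income - 200 * df$Kidhome +
  rnorm(n, sd = 300)

# feols() estimates OLS; the formula reads outcome ~ regressors
ols_fit <- feols(total_spending ~ Income + Kidhome, data = df)
coef(ols_fit)
```

In `fixest` formulas, fixed effects are placed after a `|`, which is where the package's high-dimensional fixed-effects support comes in; the sketch above uses none.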
2.2 Report Regression Results
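One possible way to report results is `etable()` from `fixest`, which places models side by side. This is a sketch under assumptions: the data are simulated, and the names `fit_1`, `fit_2`, and `results_table` are hypothetical. Other reporting packages can produce similar tables.

```r
# Minimal sketch: simulated data, hypothetical object names.
library(fixest)

set.seed(123)
n <- 500
df <- data.frame(
  Income  = runif(n, 20000, 100000),
  Kidhome = sample(0:2, n, replace = TRUE)
)
df$total_spending <- -300 + 0.02 * df$Income - 200 * df$Kidhome +
  rnorm(n, sd = 300)

fit_1 <- feols(total_spending ~ Income, data = df)
fit_2 <- feols(total_spending ~ Income + Kidhome, data = df)

# etable() lists coefficients, standard errors, and fit statistics,
# with one column per model
results_table <- etable(fit_1, fit_2)
results_table
```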
2.3 Parameter Estimation: Univariate Regression Case
- Let’s take a univariate regression as an example
\[ y = a + b x_1 + \epsilon \]
For each guess of \(a\) and \(b\), we can compute the residual (prediction error) for customer \(i\), \[ e_i = y_{i}-a-b x_{1i} \]
We can compute the sum of squared residuals (SSR) across all customers
\[ SSR =\sum_{i=1}^{n}\left(y_{i}-a-b x_{1i}\right)^{2} \]
Objective of estimation: Search for the unique pair of \(a\) and \(b\) that minimizes the SSR.
The estimation method that minimizes the SSR is called Ordinary Least Squares (OLS).
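For a single regressor, the SSR-minimizing values have a well-known closed form: \(b = \mathrm{Cov}(x_1, y)/\mathrm{Var}(x_1)\) and \(a = \bar{y} - b\,\bar{x}_1\). The sketch below checks this on simulated data (all numbers and names are made up for illustration) and compares the result with base R's `lm()`.

```r
# Minimal sketch on simulated data (not the course dataset).
set.seed(42)
n  <- 200
x1 <- rnorm(n, mean = 50, sd = 10)
y  <- 5 + 2 * x1 + rnorm(n, sd = 4)

# SSR for any guess of (a, b)
ssr <- function(a, b) sum((y - a - b * x1)^2)

# Closed-form OLS solution for one regressor
b_hat <- cov(x1, y) / var(x1)
a_hat <- mean(y) - b_hat * mean(x1)

# Base R's lm() should give the same coefficients
lm_coef <- coef(lm(y ~ x1))

ssr(a_hat, b_hat)        # SSR at the OLS solution
ssr(a_hat, b_hat + 0.1)  # a different guess gives a larger SSR
```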
2.4 Visualization: Estimation of Univariate Regression
- In the Tesco dataset, suppose we regress total spending (Y) on income (X). Each candidate line below gives a different sum of squared errors:
Model | Color | Sum of Squared Error |
---|---|---|
\(Y = -552 + 0.06 \times X\) | Purple | \(1.6176047 \times 10^{13}\) |
\(Y = 0 + 0.004 \times X\) | Red | \(5.093683 \times 10^{11}\) |
\(Y = -552 + 0.021 \times X\) | Green | \(2.0205681 \times 10^{9}\) |
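
The same comparison can be sketched in code. The example below does not use the Tesco data: `income` and `spending` are simulated, and the candidate intercepts and slopes are hypothetical. It shows only that the OLS line attains the smallest sum of squared errors among the candidates.

```r
# Minimal sketch on simulated data (assumption: not the Tesco dataset).
set.seed(1)
n <- 300
income   <- runif(n, 10000, 100000)
spending <- -500 + 0.02 * income + rnorm(n, sd = 300)

ols <- coef(lm(spending ~ income))

# Three candidate lines: two arbitrary guesses and the OLS fit
candidates <- data.frame(
  label     = c("too steep", "too flat", "OLS fit"),
  intercept = c(-500, 0, ols[[1]]),
  slope     = c(0.06, 0.004, ols[[2]])
)

# Sum of squared errors for each candidate line
candidates$ssr <- mapply(
  function(a, b) sum((spending - a - b * income)^2),
  candidates$intercept, candidates$slope
)

candidates[order(candidates$ssr), ]
```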
2.5 Multivariate Regression
- The OLS estimation also applies to multivariate regression with multiple regressors.
\[ y_i = b_0 + b_1 x_{1i} + \ldots + b_k x_{ki}+\epsilon_i \]
- Objective of estimation: Search for the unique set of \(b_0, b_1, \ldots, b_k\) that minimizes the sum of squared residuals.
\[ SSR= \sum_{i=1}^{n}\left(y_{i}-b_0 - b_1 x_{1i} - \ldots - b_k x_{ki} \right)^{2} \]
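With several regressors, the minimizer is usually written in matrix form as the solution of the normal equations \((X'X)\,b = X'y\), which is unique when \(X'X\) is invertible. The sketch below is an illustration on simulated data with hypothetical names. It solves these equations directly and compares the result with `lm()`.

```r
# Minimal sketch on simulated data (all names are illustrative).
set.seed(7)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

# Design matrix with a column of ones for the intercept
X <- cbind(1, x1, x2)

# Solve the normal equations (X'X) b = X'y
b_hat <- solve(crossprod(X), crossprod(X, y))

# lm() minimizes the same SSR, so the coefficients should match
lm_coef <- coef(lm(y ~ x1 + x2))
cbind(normal_equations = as.numeric(b_hat), lm = unname(lm_coef))
```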
3 Interpretation of Coefficients
3.1 Coefficients Interpretation
- Now, in your Quarto document, let’s run a new regression, where the DV is \(total\_spending\) and the regressors are \(Income\) and \(Kidhome\).
| | (1) |
|---|---|
| (Intercept) | −316.878*** |
| | (26.972) |
| Income | 0.019*** |
| | (0.000) |
| Kidhome | −210.613*** |
| | (16.282) |
| Num.Obs. | 2000 |
| R2 | 0.658 |
| R2 Adj. | 0.658 |
| AIC | 28971.2 |
| BIC | 28988.0 |
| RMSE | 337.77 |
| Std.Errors | IID |

Note: + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
- Controlling for `Kidhome`, a one-unit increase in `Income` is associated with a £0.019 increase in `total_spending`, on average.
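The arithmetic behind this interpretation can be sketched by plugging the reported (rounded) coefficients into the fitted equation. The two customers below are hypothetical, and `predict_spending()` is a helper defined only for this illustration.

```r
# Coefficients copied (rounded) from the regression table above
b <- c(intercept = -316.878, Income = 0.019, Kidhome = -210.613)

# Hypothetical helper: fitted total spending for given regressor values
predict_spending <- function(income, kidhome) {
  b[["intercept"]] + b[["Income"]] * income + b[["Kidhome"]] * kidhome
}

# Two hypothetical customers with the same Kidhome, incomes one unit apart
base   <- predict_spending(income = 50000, kidhome = 1)
plus_1 <- predict_spending(income = 50001, kidhome = 1)

# The difference in fitted spending equals the Income coefficient
plus_1 - base
```

Holding `Kidhome` fixed is what "controlling for" means here: only `Income` changes between the two customers.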
3.2 Standard Errors and P-Values
Because the regression is estimated on a random sample of the population, rerunning it on a different sample would give a different set of regression coefficients each time.
In theory, the standardized coefficient estimate follows a t-distribution, and the estimates are centered on the true \(\beta\). The standard error of an estimate is the estimated standard deviation of its sampling distribution, i.e., how much the estimate would vary across samples.
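A small simulation can make the sampling variation concrete. The sketch below uses made-up data (a true slope of 2 and standard normal noise, with hypothetical names throughout). It draws many samples, re-estimates the slope each time, and compares the spread of those estimates with the standard error reported from one sample.

```r
# Minimal simulation sketch (all values are assumptions for illustration).
set.seed(2024)
true_beta <- 2

# Draw one random sample and return the estimated slope
one_sample_slope <- function(n = 100) {
  x <- rnorm(n)
  y <- 1 + true_beta * x + rnorm(n)
  coef(lm(y ~ x))[["x"]]
}

slopes <- replicate(1000, one_sample_slope())

mean(slopes)  # centered near the true beta
sd(slopes)    # spread of the estimates across samples

# The standard error from ONE sample estimates that spread
x <- rnorm(100)
y <- 1 + true_beta * x + rnorm(100)
se_one_sample <- summary(lm(y ~ x))$coefficients["x", "Std. Error"]
se_one_sample
```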
We can test whether the coefficients are statistically different from 0 using hypothesis testing.
- Null hypothesis: the true regression coefficient \(\beta\) is 0
`Income`/`Kidhome` is statistically significant at the 1% level.
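The test can be reproduced by hand from the reported numbers. The sketch below takes the `Kidhome` estimate and standard error from the table above and assumes the usual residual degrees of freedom (observations minus estimated coefficients). The variable names are illustrative.

```r
# Kidhome estimate, standard error, and sample size from the table above
estimate  <- -210.613
std_error <- 16.282
n_obs     <- 2000
n_coef    <- 3          # intercept, Income, Kidhome

# t-statistic under the null hypothesis that the true coefficient is 0
t_stat <- (estimate - 0) / std_error

# Two-sided p-value from the t-distribution
df_res  <- n_obs - n_coef
p_value <- 2 * pt(-abs(t_stat), df = df_res)

t_stat
p_value < 0.01  # TRUE means significant at the 1% level
```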
3.3 R-Squared
R-squared (\(R^2\)) is a statistical measure that represents the proportion of the variance of the dependent variable that is explained by all included variables in a regression.
Interpretation: 65.8% of the variation in `total_spending` can be explained by `Income` and `Kidhome`.
As the number of variables increases, \(R^2\) never decreases, so we sometimes penalize the number of variables using the so-called adjusted R-squared.
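Both quantities can be computed by hand. The sketch below uses simulated data with hypothetical names and the standard formulas \(R^2 = 1 - SSR/SST\) and adjusted \(R^2 = 1 - (1-R^2)\frac{n-1}{n-k-1}\), where \(k\) is the number of regressors excluding the intercept. It also adds a pure-noise regressor to show that \(R^2\) does not fall.

```r
# Minimal sketch on simulated data (all names are illustrative).
set.seed(99)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2)
k   <- 2  # number of regressors, excluding the intercept

ssr <- sum(residuals(fit)^2)   # unexplained variation
sst <- sum((y - mean(y))^2)    # total variation

r2     <- 1 - ssr / sst
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding a regressor that is pure noise cannot lower R-squared
noise   <- rnorm(n)
r2_more <- summary(lm(y ~ x1 + x2 + noise))$r.squared

c(r2 = r2, adj_r2 = adj_r2, r2_with_noise = r2_more)
```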
Important: R-squared is only important for supervised learning prediction tasks, because it measures the predictive power of X. In causal inference tasks, however, \(R^2\) does not matter much.