Class 13 OLS Regression Advanced
1 Categorical Variables
1.1 Categorical variables
So far, the independent variables we have used are Income and Kidhome, which we treat as continuous (numeric) variables.
Some variables are intrinsically not countable; we need to treat them as categorical variables
- e.g., gender, education group, city.
1.2 Handling Categorical Variables in R using factor()
- In R, we need to use the function factor() to explicitly tell R that a variable is categorical, so that statistical models treat it differently from continuous variables.
  - e.g., we can use factor(Education) to indicate that Education is a categorical variable.
- We can use levels() to check which categories the factor variable has.
  - e.g., Education has 5 different levels:
[1] "2n Cycle" "Basic" "Graduation" "Master" "PhD"
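A minimal sketch of the two functions, using a hypothetical toy vector education_raw in place of the course data (only the level names are taken from the output above):

```r
# Hypothetical toy vector standing in for the Education column of the course data.
education_raw <- c("Graduation", "PhD", "Basic", "Master", "2n Cycle", "PhD")

# factor() converts the character vector into a categorical variable.
education_factor <- factor(education_raw)

# levels() lists the categories (sorted alphabetically by default);
# nlevels() counts them.
print(levels(education_factor))
print(nlevels(education_factor))

# table() counts how many observations fall into each category.
print(table(education_factor))
```

Without factor(), R would keep the column as plain text; wrapping it in factor() is what lets a regression function expand it into group indicators.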
1.3 Running Regression with Factor Variables
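pacman::p_load(fixest, modelsummary)
# Regress total spending on income, kids at home, and the education factor
# (Education_factor_2 uses "Basic" as the baseline level).
feols_categorical <- feols(data = data_full,
  fml = total_spending ~ Income + Kidhome + Education_factor_2)
# Report coefficients with significance stars, plus N and R-squared.
modelsummary(feols_categorical,
  stars = TRUE,
  gof_map = c('nobs','r.squared'))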
| | (1) |
|---|---|
| (Intercept) | −180.297** |
| | (56.305) |
| Income | 0.020*** |
| | (0.000) |
| Kidhome | −227.761*** |
| | (16.961) |
| Education_factor_22n Cycle | −164.044** |
| | (60.448) |
| Education_factor_2Graduation | −119.695* |
| | (56.176) |
| Education_factor_2Master | −143.015* |
| | (58.443) |
| Education_factor_2PhD | −153.190** |
| | (57.751) |
| Num.Obs. | 2000 |
| R2 | 0.662 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
1.4 One-Hot Encoding of factor()
- In the raw data, Education is label-encoded with 5 levels.
- After factorizing education with “Basic” as the baseline group, internally, we have 4 binary indicators, one for each remaining level. Because we have the intercept, “Basic” is omitted as the baseline group. The other groups represent the comparison relative to the baseline group.
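The indicators can be inspected directly with base R's model.matrix(). The sketch below uses a hypothetical toy vector with one observation per education level; the object names are illustrative and not from the course data.

```r
# Hypothetical toy data: one observation per education level.
education <- c("Basic", "2n Cycle", "Graduation", "Master", "PhD")

# Make "Basic" the baseline (first) level.
education_factor <- relevel(factor(education), ref = "Basic")

# model.matrix() returns the design matrix a regression would use:
# an intercept column plus one 0/1 indicator per non-baseline level.
design <- model.matrix(~ education_factor)
print(design)
```

The row for “Basic” has zeros in all four indicator columns, which is why its level is absorbed by the intercept.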
1.5 Interpretation of Coefficients for Categorical Variables
- In general, R encodes a factor variable with K levels as K-1 binary (dummy) variables, i.e., one-hot encoding with the baseline level dropped.
  - As we have the intercept term, we can only have K-1 binary variables: a full set of K indicators would sum to 1 for every observation and be perfectly collinear with the intercept.
- The interpretation of coefficients for factor variables: Ceteris paribus, compared with the [baseline group], the [outcome variable] of [group X] is higher/lower by [coefficient], and the coefficient is statistically [significant/insignificant].
  - Ceteris paribus, compared with the basic education group, the total spending of the PhD group is lower by 153.190 dollars. The coefficient is statistically significant at the 1% level.
- Now please rerun the regression using Education_factor and interpret the coefficients. What’s your finding?
  - Conclusion: factor variables can only measure the relative difference in the outcome variable across different groups rather than the absolute levels.
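The conclusion can be checked on simulated data. The sketch below is an illustration under stated assumptions: the data are made up, base R's lm() stands in for feols() so that it runs without extra packages, and the first fit uses R's default alphabetical baseline (“2n Cycle”), which may or may not match how Education_factor was built in class.

```r
set.seed(1)
n <- 500

# Simulated (hypothetical) data with made-up group shifts in spending.
education <- factor(sample(c("2n Cycle", "Basic", "Graduation", "Master", "PhD"),
                           n, replace = TRUE))
income <- rnorm(n, mean = 50000, sd = 15000)
group_shift <- c("2n Cycle" = -50, "Basic" = 0, "Graduation" = 100,
                 "Master" = 150, "PhD" = 200)
spending <- 100 + 0.02 * income +
  unname(group_shift[as.character(education)]) + rnorm(n, sd = 50)
toy <- data.frame(spending, income, education)

# Fit A: default baseline ("2n Cycle"). Fit B: baseline switched to "Basic".
fit_a <- lm(spending ~ income + education, data = toy)
toy$education_b <- relevel(toy$education, ref = "Basic")
fit_b <- lm(spending ~ income + education_b, data = toy)

print(coef(fit_a))
print(coef(fit_b))

# The PhD-versus-Basic gap is the same in both fits, and so are the fitted values.
gap_a <- unname(coef(fit_a)["educationPhD"] - coef(fit_a)["educationBasic"])
gap_b <- unname(coef(fit_b)["education_bPhD"])
print(all.equal(gap_a, gap_b))
print(all.equal(fitted(fit_a), fitted(fit_b)))
```

Changing the baseline changes every reported coefficient, but not the differences between groups or the model's predictions.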
1.6 Application of Categorical Variables in Marketing
- Analyze the treatment effects in A/B/N testing, where \(Treatment_i\) is a categorical variable that specifies the treatment group customer \(i\) is in:
\[ Outcome_i = \beta_0 + \delta Treatment_i + \epsilon_i \]
- Analyze the brand premiums or country-of-origin effects:
\[ Sales_i = \beta_0 + \beta_1 Brand_i + \beta_2 Country_i + X_i\beta + \epsilon_i \]
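As an illustration of the A/B/N case, the sketch below simulates a hypothetical experiment with a control group and two treatment arms (all names and effect sizes are made up) and fits the regression with base R's lm(). With only the treatment factor on the right-hand side, each coefficient equals the difference in mean outcome between that arm and the control group.

```r
set.seed(42)
n <- 900

# Hypothetical experiment: customers are randomly assigned to control, A, or B.
treatment <- factor(sample(c("control", "A", "B"), n, replace = TRUE),
                    levels = c("control", "A", "B"))  # control is the baseline
true_effect <- c(control = 0, A = 2, B = 5)           # made-up effects
outcome <- 10 + unname(true_effect[as.character(treatment)]) + rnorm(n)
ab_data <- data.frame(outcome, treatment)

# Regress the outcome on the treatment factor.
fit_ab <- lm(outcome ~ treatment, data = ab_data)
print(coef(fit_ab))

# Compare with raw differences in group means relative to control.
group_means <- tapply(ab_data$outcome, ab_data$treatment, mean)
print(group_means - group_means["control"])
```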
2 Non-linear Effects
2.1 Quadratic Terms
- If we believe the relationship between the outcome variable and an explanatory variable is quadratic, we can include an additional quadratic term in the regression to model this non-linear relationship.
\[ totalspending = \beta_0 + \beta_1Income + \beta_2Income^2 + \epsilon \]
2.2 Quadratic Terms
- If the coefficient for \(Income^2\) is negative, then we have a downward-opening parabola. That is, as income increases, total spending first increases and then decreases, i.e., a non-linear, non-monotonic effect.
- As income initially rises, customers increase their spending with Tesco due to the income effect; however, as customers get even richer, they may switch to more premium brands such as Waitrose, so their spending may decrease due to the substitution effect.
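A minimal sketch of fitting such a model, assuming simulated data with a built-in inverted-U relationship and base R's lm() in place of feols(). The variable names and numbers are hypothetical; the point is how the squared term enters the formula.

```r
set.seed(123)
n <- 1000

# Simulated (hypothetical) data: spending rises with income, then falls.
income <- runif(n, min = 0, max = 100)
spending <- 5 + 2 * income - 0.015 * income^2 + rnorm(n, sd = 5)
toy <- data.frame(income, spending)

# Linear-only model for comparison.
fit_linear <- lm(spending ~ income, data = toy)

# I() protects ^ so that it means "square" inside the formula.
fit_quadratic <- lm(spending ~ income + I(income^2), data = toy)

print(coef(fit_quadratic))
print(c(linear = summary(fit_linear)$r.squared,
        quadratic = summary(fit_quadratic)$r.squared))
```

An alternative is to create the squared column first (for example a column named Income_squared, as in the regression table in this section) and add it as a regular regressor; both approaches estimate the same model.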
2.3 Quadratic Terms in Linear Regression
- Let’s run two regressions in the Quarto document, with and without the quadratic term.
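# Compare the models without (1) and with (2) the quadratic term;
# scientific notation keeps the very small Income_squared coefficient readable.
modelsummary(list(feols_noquadratic,
                  feols_quadratic),
  stars = TRUE,
  fmt = fmt_sprintf("%.2e"),
  gof_map = c('nobs','r.squared'))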
| | (1) | (2) |
|---|---|---|
| (Intercept) | −5.57e+02*** | −6.27e+02*** |
| | (2.17e+01) | (3.65e+01) |
| Income | 2.24e−02*** | 2.53e−02*** |
| | (3.84e−04) | (1.30e−03) |
| Income_squared | | −2.66e−08* |
| | | (1.12e−08) |
| Num.Obs. | 2000 | 2000 |
| R2 | 0.629 | 0.630 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
2.4 Quadratic Terms: Compute the Vertex
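- We can compute the vertex, i.e., the income level at which predicted total spending is maximized. Setting the derivative \(\beta_1 + 2\beta_2 Income\) to zero gives
\[ Income^* = -\frac{\beta_1}{2\beta_2} \]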
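As a worked example, the vertex of a parabola \(\beta_0 + \beta_1 Income + \beta_2 Income^2\) lies at \(-\beta_1 / (2\beta_2)\). The sketch below plugs in the coefficients of model (2) as rounded in the regression table, so the result is approximate; in practice one would read the unrounded values from the fitted model with coef().

```r
# Coefficients of model (2), as rounded in the regression table.
beta_income <- 2.53e-02
beta_income_squared <- -2.66e-08

# Vertex of the parabola: the income at which predicted spending peaks.
vertex_income <- -beta_income / (2 * beta_income_squared)
print(vertex_income)

# Marginal effect of income at a given income level; it is zero at the vertex.
marginal_effect <- function(income) {
  beta_income + 2 * beta_income_squared * income
}
print(marginal_effect(vertex_income))
```

With these rounded inputs the vertex is at an income of roughly 475,600. Below that level the estimated marginal effect of income on spending is positive; above it the effect turns negative.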
3 Linear Probability Model
3.1 Linear Probability Model
In Predictive Analytics, we learned how to use decision trees and random forests to make predictions for binary outcome variables.
In fact, linear regression can also be used as another supervised learning model to predict binary outcomes. When the outcome variable is binary, the linear regression model is also called a linear probability model (LPM).
On the one hand, regression predicts the expectation of the response \(Y\) conditional on \(X\); that is, \[ E[Y|X] = E[X\beta+\epsilon|X] = X\beta \]
On the other hand, for a binary outcome variable, if the probability of the outcome occurring is \(p\), then the expectation of \(Y\) is \[ E[Y|X] = 1 \cdot p + 0 \cdot (1 - p) = p \]
As a result, we have the following equation \[ p = X \beta \]
Interpretation of LPM coefficients: Everything else equal, a one-unit increase in \(x\) changes the probability of the outcome occurring by \(\beta\), i.e., by \(100 \times \beta\) percentage points.
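The interpretation can be illustrated with a small simulation. The sketch below uses made-up data (a hypothetical coupon that raises the purchase probability from 0.20 to 0.35) and base R's lm() rather than feols(). With a single binary regressor, the LPM slope is exactly the difference in purchase shares between the two groups.

```r
set.seed(7)
n <- 2000

# Hypothetical data: a coupon raises the purchase probability by 0.15.
received_coupon <- rbinom(n, size = 1, prob = 0.5)
purchase_prob <- 0.20 + 0.15 * received_coupon
purchased <- rbinom(n, size = 1, prob = purchase_prob)
toy <- data.frame(purchased, received_coupon)

# Linear probability model: OLS with a 0/1 outcome.
lpm <- lm(purchased ~ received_coupon, data = toy)
print(coef(lpm))

# Share of purchasers in each group, for comparison with the coefficients.
share_by_group <- tapply(toy$purchased, toy$received_coupon, mean)
print(share_by_group)
```

The intercept estimates the purchase probability without a coupon, and the slope estimates the change in that probability from receiving one.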
3.2 Pros and Cons of LPM
We use the linear regression function feols() to train the LPM on the training data, and predict(LPM, data_test) to make predictions on the test data.
Advantages
- Fast to run, even with a large number of fixed effects and features
- High interpretability: coefficients have clear economic meanings
Disadvantages
- Predicted probabilities may fall outside the [0,1] range
- Accuracy tends to be low
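The train-and-predict workflow and the out-of-range problem can be sketched as follows. This is an illustration on simulated data with hypothetical names (responded, income_k), and it uses base R's lm() so that it runs without extra packages; the class code uses feols() with predict() in the same way.

```r
set.seed(2024)
n <- 1000

# Simulated (hypothetical) data: response probability follows an S-shaped curve.
income_k <- rnorm(n, mean = 50, sd = 20)       # toy income, in thousands
true_prob <- plogis(-4 + 0.08 * income_k)
responded <- rbinom(n, size = 1, prob = true_prob)
toy <- data.frame(responded, income_k)

# 70/30 split into training and test data.
train_rows <- sample(n, size = 700)
data_train <- toy[train_rows, ]
data_test <- toy[-train_rows, ]

# Train the LPM on the training data and predict on the test data.
lpm <- lm(responded ~ income_k, data = data_train)
predicted_prob <- predict(lpm, newdata = data_test)

# Range of predictions and share of them outside [0, 1].
print(range(predicted_prob))
print(mean(predicted_prob < 0 | predicted_prob > 1))

# Classify with a 0.5 cutoff and compute test accuracy.
predicted_class <- as.integer(predicted_prob > 0.5)
print(mean(predicted_class == data_test$responded))
```

Because the fitted line is unbounded, observations with very low or very high values of the predictor can receive predicted probabilities below 0 or above 1.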