Class 9 Supervised Machine Learning and Tree-Based Models

Author: Dr Wei Miao

Affiliation: UCL School of Management

Published: October 29, 2025

1 Supervised Learning

1.1 Learning Objectives

Understand the fundamentals of supervised learning and its key components
Distinguish between classification and regression tasks
Recognize the accuracy-interpretability and bias-variance tradeoffs in machine learning
Learn how to implement and interpret decision trees
Understand random forests and their advantages over single decision trees
Apply cross-validation techniques to mitigate overfitting

1.2 Supervised Learning

A supervised learning model is used when we have one or more explanatory variables AND a response variable and we would like to learn the underlying true relationship between the explanatory variables and the response variable as accurately as possible.

1.3 Data Generating Process (DGP)

We use the following notations for supervised learning tasks: \[ Y = f(X;\theta) + \epsilon \]

$Y$ is the response/outcome/target variable to be predicted
$X = (X_1,X_2,...,X_p)$ are a set of explanatory variables/features/predictors
$f(X;\theta) + \epsilon$ is the true relationship between $X$ and $Y$, or DGP, which is never known to us¹; $\epsilon$ is the randomness term or error term
$\theta$ represents the set of parameters to be learnt from the data
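
To make the notation concrete, here is a minimal simulation sketch. It assumes a linear $f$ with $\theta = (2, 3)$ and a standard normal error term; these values are chosen purely for illustration, since in real data the DGP is unknown. It generates data from the DGP and then learns $\theta$ from the data with a linear regression.

```r
# Simulate a data generating process Y = f(X; theta) + epsilon.
# Assumption for illustration only: f is linear with theta = (2, 3).
set.seed(123)                            # fix the seed for reproducibility
n       <- 1000
x       <- runif(n, min = 0, max = 10)   # explanatory variable X
epsilon <- rnorm(n, mean = 0, sd = 1)    # error term
y       <- 2 + 3 * x + epsilon           # response generated by the DGP

# Learn theta from the simulated data
fit       <- lm(y ~ x)
theta_hat <- unname(coef(fit))           # estimated intercept and slope
print(round(theta_hat, 2))               # should be close to the true (2, 3)
```

Because we simulated the data ourselves, we can compare the estimates with the true parameters, which is never possible with real business data.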

1.4 Types of Supervised Learning Algorithms

Depending on the type of the response variable, supervised learning tasks can be divided into two groups:

Classification tasks if the outcome is categorical
- Whether a customer responds to marketing offers (e.g., 1 for response, 0 for no response)
- Whether a customer churns (e.g., 1 for churn, 0 for no churn)
- Which product a customer purchases (e.g., 1 for product A, 2 for product B, etc.)
Regression tasks if the outcome is continuous
- Customer total spending in each period (e.g., $100, $200, etc.)
- Demand forecasting such as the daily sales of a product (e.g., 100 units, 120 units, etc.)

1.5 Difference between Supervised and Unsupervised Learning

	Supervised Learning	Unsupervised Learning
Description	Estimate or predict an output based on one or more inputs.	Find structure and relationships from inputs. No “supervising” output.
Variables	Explanatory and Response variables	Explanatory variables only
Goal	(1) predict new values or (2) understand existing relationships between explanatory and response variables	Group observations into clusters based on similarity
Types of algorithms	(1) Regression and (2) Classification	Clustering

2 Fundamental Tradeoffs

2.1 Accuracy-Interpretability Trade-off

Simpler models are easier to interpret but typically give lower accuracy
More complex models can give better prediction accuracy but results are harder to interpret

After-Class Reading

Due to time constraints, we only cover tree-based models in depth. To learn about other ML models, see this video: https://youtu.be/E0Hmnixke2g?si=0ISH4dy2XKkT4uIh

2.2 Comparison of Classic Supervised Learning Models

Linear regression class models (easy to interpret, typically lower accuracy)
- Linear regression coefficients have economic interpretations but prediction accuracy is often lower than that of more flexible models
Tree-based Models (balance between interpretability and accuracy)
- Decision tree, random forest, and gradient boosting models
Neural-network based models (hard to interpret, high accuracy)
- Deep learning only gives estimated weights that have no direct business interpretations

2.3 Bias Error and Variance Error

After we have trained a machine learning model, we can test the model performance by looking at the errors of predictions.
- bias measures how far off the model’s predictions are from the true values on average (systematic error)
- variance measures how much the model’s predictions vary when trained on different datasets (sensitivity to training data)

2.4 Overfitting

If a predictive model learns one particular training dataset too well, it also fits the noise in that dataset. The model becomes too specialised and is therefore more likely to predict poorly on another dataset. This problem is called overfitting.
Overfitting leads to low bias but high variance. This is not ideal because the goal of a supervised learning model is high prediction accuracy on new data.

2.5 Underfitting

At the other extreme, underfitting occurs when a predictive model cannot sufficiently capture the DGP, even on the historical training data.
Underfitting leads to high bias but typically low variance. An underfitting model performs poorly on both training and test data, so it should be avoided.
To mitigate the underfitting problem, we need to select more suitable or more complex models.

2.6 Bias-Variance Trade-off

Increasing model complexity (e.g., adding more layers to a neural network or more branches to a decision tree) typically decreases bias but increases variance. The model fits the training data better but becomes more sensitive to it, leading to overfitting.
Decreasing model complexity (e.g., using a simpler model like linear regression) typically increases bias but decreases variance. The model is more general but may miss underlying patterns in the data, leading to underfitting.
Hence we face a bias-variance trade-off or bias-variance dilemma.
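
The trade-off can be illustrated with a small simulation sketch. Everything here is an assumption made for illustration: the true relationship is $f(x) = \sin(x)$, the noise has standard deviation 1, each training set has 20 observations, and we compare a straight line with a degree-8 polynomial. We repeatedly draw new training sets, refit both models, and record their predictions at one point `x0`.

```r
# Bias-variance sketch on simulated data (all settings are illustrative assumptions)
set.seed(42)
f_true <- function(x) sin(x)   # assumed true relationship
x0     <- 2                    # point at which predictions are compared
n_sim  <- 200                  # number of simulated training sets

# Draw one training set, fit a polynomial of the given degree, predict at x0
predict_at_x0 <- function(degree) {
  x   <- runif(20, min = 0, max = 6)
  y   <- f_true(x) + rnorm(20, sd = 1)
  fit <- lm(y ~ poly(x, degree))
  unname(predict(fit, newdata = data.frame(x = x0)))
}

pred_simple  <- replicate(n_sim, predict_at_x0(1))  # low complexity
pred_complex <- replicate(n_sim, predict_at_x0(8))  # high complexity

summary_table <- data.frame(
  model    = c("degree 1", "degree 8"),
  bias_sq  = c((mean(pred_simple)  - f_true(x0))^2,
               (mean(pred_complex) - f_true(x0))^2),
  variance = c(var(pred_simple), var(pred_complex))
)
print(summary_table)
```

In this setup the simple model should show the larger squared bias and the complex model the larger variance, which is the pattern described above.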

2.7 How to Mitigate Overfitting

To mitigate the overfitting problem, when training predictive models, we use cross-validation. In its simplest form (hold-out validation), we split the full historical data into a training set and a test set.
- A training set (70% - 80% of labelled data): we train the ML model on the training set.
- A test set (20% - 30% of labelled data): we use the trained ML model to make predictions for the test data. Because we also observe the actual outcomes for the test set, we can evaluate the prediction accuracy by comparing the predicted outcomes with the actual outcomes.
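
A training/test split could be sketched in base R as follows. The data frame `data_sim` is simulated and only stands in for the labelled historical data; the 75/25 ratio and the object names `data_train` and `data_test` are illustrative choices.

```r
# Hold-out split sketch on simulated data (a stand-in for labelled historical data)
set.seed(2025)
n <- 1000
data_sim <- data.frame(
  Recency        = sample(0:99, n, replace = TRUE),
  total_spending = round(runif(n, min = 0, max = 2500)),
  Response       = rbinom(n, size = 1, prob = 0.15)
)

# Randomly assign 75% of the rows to the training set, the rest to the test set
train_index <- sample(seq_len(n), size = floor(0.75 * n))
data_train  <- data_sim[train_index, ]
data_test   <- data_sim[-train_index, ]

print(c(train = nrow(data_train), test = nrow(data_test)))
```

The split is random so that the training and test sets are both representative of the full data; no customer appears in both sets.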

Note

For more complicated models with hyper-parameters such as deep learning models, we may even need to split our data into 3 sets (training, validation, and test sets).

3 Decision Tree

3.1 Introduction to Decision Tree

A decision tree is a tree-like structure, which can be used for both classification and regression tasks.

3.2 Business Objective: Predict Customer Response to Marketing Offers

M&S made marketing offers to customers in the data, and the variable Response represents whether or not customers responded to the offer in the previous similar marketing campaign.
Business objective: Based on the historical data data_full, we want to train a decision tree model to predict the outcome variable Response based on Recency and total_spending.
Data collection and cleaning:

Code

pacman::p_load(dplyr, modelsummary)

data_full <- read.csv("data_full.csv") %>%
    mutate(
        total_spending = MntWines + MntFruits + MntMeatProducts +
            MntFishProducts + MntSweetProducts + MntGoldProds
    )

3.3 Implementation of Decision Tree in R

Package rpart provides an efficient implementation of decision trees in R; package rpart.plot provides visualizations of decision trees. The main arguments of rpart() are:
- formula: Response ~ Recency + total_spending means that we want to predict the outcome variable Response based on the explanatory variables Recency and total_spending. In R, we use ~ to separate the outcome variable and the explanatory variables for all supervised learning tasks.
- data: the training dataset to train the model
- method: “class” for classification tasks, “anova” for regression tasks

Code

# Load the necessary packages
pacman::p_load(rpart,rpart.plot)

# The example below shows how to train a decision tree
tree1 <- rpart(
  formula = Response ~ Recency + total_spending, # formula
  data    = data_full,
  method  = "class" # classification task; or 'anova' for regression
  )

# visualize the tree
rpart.plot(tree1)

3.4 How to Measure Split Quality: Classification Tasks

For classification tasks, the goal is to split the data to create nodes that are as “pure” as possible, meaning they contain instances of a single class.
Two common metrics are used to measure the quality of a split: Gini Impurity and Information Gain (based on Entropy). Gini impurity is more commonly used in practice due to its computational efficiency.

3.4.1 Gini Impurity

Formula: $Gini = 1 - \sum_{i=1}^{C} (p_i)^2$, where $p_i$ is the proportion of samples of class $i$.
A Gini score of 0 indicates a perfectly pure node. The algorithm seeks splits that minimize the weighted Gini impurity of the child nodes.

3.5 Numeric Example

Let us start with a dataset of 10 customers, from which we observe X = total spending and Y = Response (1 for response, 0 for no response).

Case 1: Purest

10 customers responded
0 customers did not respond
Gini = $1 - ((10/10)^2 + (0/10)^2) = 0$

Case 2: In-between

7 customers responded
3 customers did not respond
Gini = $1 - ((7/10)^2 + (3/10)^2) = 0.42$

Case 3: Most impure

5 customers responded
5 customers did not respond
Gini = $1 - ((5/10)^2 + (5/10)^2) = 0.5$
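
The three cases can be checked with a few lines of R. The helper `gini_impurity()` is a name introduced here for illustration; it is not a function from rpart.

```r
# Gini impurity of a node, given the number of customers in each class
gini_impurity <- function(counts) {
  p <- counts / sum(counts)   # class proportions p_i
  1 - sum(p^2)
}

gini_case1 <- gini_impurity(c(10, 0))  # 10 responded, 0 did not
gini_case2 <- gini_impurity(c(7, 3))   # 7 responded, 3 did not
gini_case3 <- gini_impurity(c(5, 5))   # 5 responded, 5 did not

print(c(case1 = gini_case1, case2 = gini_case2, case3 = gini_case3))
```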

3.6 Numeric Example: Split

If we split the 10 customers in Case 2 into two child nodes based on total spending at a threshold of 1396:

Child Node 1 (Left)

total_spending < 1396
4 customers total
- 1 responded
- 3 did not respond
Gini Calculation:
- $p_1 = 1/4 = 0.25$
- $p_0 = 3/4 = 0.75$
- Gini = $1 - (0.25^2 + 0.75^2) = 0.375$
- This node is impure.

Child Node 2 (Right)

total_spending >= 1396
6 customers total
- 6 responded
- 0 did not respond
Gini Calculation:
- $p_1 = 6/6 = 1$
- $p_0 = 0/6 = 0$
- Gini = $1 - (1^2 + 0^2) = 0$
- This node is pure.

3.7 Numeric Example: Weighted Gini and Gini Gain

The goal is to find the split that results in the lowest weighted Gini impurity.

Calculate Weighted Gini of the Split
- Weight (Left) = $4 / 10 = 0.4$
- Weight (Right) = $6 / 10 = 0.6$
- Weighted Gini = $(0.4 \times 0.375) + (0.6 \times 0) = 0.15$
Calculate Gini Gain (The Decision)
- Gini Gain = Gini (Parent) - Weighted Gini (Split)
- Gini Gain = $0.42 - 0.15 = 0.27$

Since the Gini Gain is positive, impurity was reduced, making this a good split. The rpart algorithm repeats this for all possible splits and chooses the one with the highest Gini Gain.
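
The weighted Gini and Gini Gain for this split can be reproduced in R. This is a sketch of the arithmetic only, using an illustrative helper `gini_impurity()`; it is not how rpart is implemented internally.

```r
# Gini impurity of a node, given the number of customers in each class
gini_impurity <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

parent <- c(responded = 7, not_responded = 3)  # Case 2: all 10 customers
left   <- c(responded = 1, not_responded = 3)  # total_spending <  1396
right  <- c(responded = 6, not_responded = 0)  # total_spending >= 1396

# Weight each child node by its share of the parent's customers
weighted_gini <- sum(left)  / sum(parent) * gini_impurity(left) +
                 sum(right) / sum(parent) * gini_impurity(right)

gini_gain <- gini_impurity(parent) - weighted_gini
print(c(weighted_gini = weighted_gini, gini_gain = gini_gain))
```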

3.8 (Optional) How to Measure Split Quality: Regression Tasks

For regression tasks, the goal is to split the data to create nodes where the outcome values are as similar as possible.
The most common metric used to measure the quality of a split is the Sum of Squared Errors (SSE).

3.8.1 Sum of Squared Errors (SSE)

Measures the total squared difference between the actual values and the mean value of the outcome variable within a node.
Formula: $SSE = \sum_{i \in \text{node}} (y_i - \bar{y}_{\text{node}})^2$, where $y_i$ is the actual value and $\bar{y}_{\text{node}}$ is the mean value of the outcome in the node.
The algorithm seeks the split that results in the largest reduction in the total SSE of the child nodes compared to the parent node.
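
As a small worked sketch, the SSE reduction of one candidate split can be computed as below. The six spending values are hypothetical and chosen so that the split separates low and high spenders cleanly.

```r
# SSE of a node: squared deviations from the node mean
sse <- function(y) sum((y - mean(y))^2)

# Hypothetical outcome values (e.g., customer spending) in the parent node
y_parent <- c(100, 120, 110, 300, 320, 310)
y_left   <- c(100, 120, 110)   # customers sent to the left child
y_right  <- c(300, 320, 310)   # customers sent to the right child

sse_parent    <- sse(y_parent)
sse_children  <- sse(y_left) + sse(y_right)
sse_reduction <- sse_parent - sse_children

print(c(parent = sse_parent, children = sse_children, reduction = sse_reduction))
```

Here the parent SSE is 60400 and the children's total SSE is 400, so the split removes almost all of the within-node variation.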

3.9 How Decision Tree Works: Step 1

Step 1. The decision tree (DT) will try to split customers into 2 groups based on each unique value of each variable, and see which split can lead to customers being most different in terms of outcome Response.

3.10 How Decision Tree Works: Step 1 (continued)

After this step, DT finds that total spending is the best variable and 1396 is the best cut-off.
DT therefore splits customers into 2 groups based on 1396.
In each node, the 3 numbers are: (1) predicted outcome, (2) predicted probability of outcome being 1, and (3) share of customers in the node
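
The search in Step 1 can be sketched for a single variable as follows. The 10 customers are hypothetical and constructed to match the earlier numeric example. As a simplification, each unique value is tried as a cut-off; the exact candidate cut-offs used by rpart may differ.

```r
# Hypothetical data for 10 customers, consistent with the numeric example
total_spending <- c(200, 450, 800, 1100, 1396, 1500, 1700, 1900, 2100, 2400)
response       <- c(0,   0,   1,   0,    1,    1,    1,    1,    1,    1)

# Gini impurity of a node from its vector of outcomes
gini_impurity <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted Gini of the two child nodes created by a cut-off
weighted_gini <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  (length(left) * gini_impurity(left) +
     length(right) * gini_impurity(right)) / length(y)
}

# Try every unique value as a cut-off; skip the smallest one because it
# would leave the left node empty
candidates <- sort(unique(total_spending))[-1]
scores     <- sapply(candidates, function(v) weighted_gini(total_spending, response, v))

best_threshold <- candidates[which.min(scores)]
gini_gain      <- gini_impurity(response) - min(scores)
print(c(best_threshold = best_threshold, gini_gain = gini_gain))
```

A full decision tree repeats this search over every explanatory variable and then keeps the variable and cut-off with the highest Gini Gain.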

3.11 How Decision Tree Works: Step 2

Step 2. For customers in the left branch (total_spending < 1396), DT will continue to split based on each unique value of each variable, and see which split can result in the customers being most different in terms of Response.

However, DT couldn’t find a cut-off that sufficiently differentiates customers, so DT stops in the left branch.

3.12 How Decision Tree Works: Step 3 …

Step 3. For customers in the right branch (total_spending >= 1396), DT will continue to split based on each unique value of each variable, and see which split can result in the customers being most different in terms of Response.

After this step, DT finds Recency is the best variable and 72 is the best cut-off. DT further splits customers into 2 groups.

Step 4. This process continues until DT determines that there is no need to further split customers.

3.13 How Decision Tree Works: Step 4

Once the tree is fully grown, we can use the tree to make predictions on new customers.
For a new customer, we can follow the tree from the root node to the leaf node, and the predicted outcome is the outcome of the leaf node.
In R, we can use the predict() function to make predictions on new customers, which returns the predicted outcome of the new customers. Note that the test data should have the exact same variable names as the training data.

Code

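# Make predictions on new customers (data_test is a held-out test set)
# Note: predict() for rpart takes the argument newdata, not data
# type = "class" returns the predicted outcome instead of class probabilities
prediction_tree1 <- predict(tree1, newdata = data_test, type = "class")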
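
For a version that runs without the course data, here is a sketch on simulated customers. The data, the response rule (customers spending at least 1396 respond more often), and the object names `data_sim`, `tree_sim`, and `new_customers` are all assumptions for illustration. It shows the two most useful prediction types for a classification tree.

```r
library(rpart)

# Simulated customers (illustrative stand-in for data_full)
set.seed(1)
n <- 500
data_sim <- data.frame(
  Recency        = sample(0:99, n, replace = TRUE),
  total_spending = round(runif(n, min = 0, max = 2500))
)
# Assumed rule: high spenders respond with probability 0.6, others with 0.1
data_sim$Response <- rbinom(n, size = 1,
                            prob = ifelse(data_sim$total_spending >= 1396, 0.6, 0.1))

# Train on 375 customers and treat the remaining 125 as new customers
train_rows    <- sample(seq_len(n), size = 375)
tree_sim      <- rpart(Response ~ Recency + total_spending,
                       data = data_sim[train_rows, ], method = "class")
new_customers <- data_sim[-train_rows, ]

# Predicted class (a factor) and predicted class probabilities (a matrix)
pred_class <- predict(tree_sim, newdata = new_customers, type = "class")
pred_prob  <- predict(tree_sim, newdata = new_customers, type = "prob")
print(head(pred_prob))
```

Each row of `pred_prob` sums to 1, with one column per class.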

3.14 Advantages of Decision Trees

They are very interpretable.
Making predictions is fast.
It’s easy to understand which variables are important in making the prediction. The variables used in the internal nodes (splits) are those that reduce the Gini Impurity/SSE (the split criteria) the most.

4 Prediction Accuracy (Optional)

4.1 Classification Tasks

For classification tasks, we can evaluate model performance using:

Confusion Matrix: A table showing true positives, true negatives, false positives, and false negatives

	Predicted: No	Predicted: Yes
Actual: No	True Negatives	False Positives
Actual: Yes	False Negatives	True Positives

4.2 Classification Tasks (continued)

Based on the confusion matrix, we can further compute the following metrics:

Accuracy: The proportion of correct predictions \[\text{Accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total Predictions}}\]
Precision: Among predicted positives, how many are actually positive \[\text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}\]
Recall (Sensitivity): Among actual positives, how many are correctly predicted \[\text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}\]
F1-Score: Harmonic mean of precision and recall \[\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision + Recall}}\]
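
These metrics can be computed by hand in base R. The `actual` and `predicted` vectors below are hypothetical outcomes for 10 customers, used only to show the arithmetic.

```r
# Hypothetical actual and predicted outcomes for 10 customers (1 = Yes, 0 = No)
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Confusion matrix: rows are actual outcomes, columns are predictions
confusion <- table(Actual = actual, Predicted = predicted)
print(confusion)

tp <- sum(actual == 1 & predicted == 1)  # true positives
tn <- sum(actual == 0 & predicted == 0)  # true negatives
fp <- sum(actual == 0 & predicted == 1)  # false positives
fn <- sum(actual == 1 & predicted == 0)  # false negatives

accuracy  <- (tp + tn) / length(actual)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1))
```

With 3 true positives, 5 true negatives, 1 false positive, and 1 false negative, accuracy is 0.8 and precision, recall, and F1 are all 0.75.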

4.3 Regression Tasks

For regression tasks, we can evaluate model performance using:

Mean Absolute Error (MAE): Average absolute difference between predicted and actual values \[\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\]
Root Mean Square Error (RMSE): Square root of average squared differences \[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\]
Sum of Squared Errors (SSE): Total squared difference between predicted and actual values \[\text{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
Lower MAE/RMSE/SSE values indicate better predictions
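
The three metrics are straightforward to compute in base R. The values below are hypothetical spending figures for four customers.

```r
# Hypothetical actual and predicted spending for 4 customers
actual    <- c(100, 200, 150, 120)
predicted <- c(110, 190, 150, 140)

errors <- actual - predicted

mae  <- mean(abs(errors))       # Mean Absolute Error
rmse <- sqrt(mean(errors^2))    # Root Mean Square Error
sse  <- sum(errors^2)           # Sum of Squared Errors

print(c(MAE = mae, RMSE = rmse, SSE = sse))
```

RMSE penalises the single large error of 20 more heavily than MAE does, which is why RMSE (about 12.25) is larger than MAE (10) here.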

Note

In R, the caret package provides functions to compute these metrics easily. If you are interested, you can explore this DataCamp tutorial: link

4.4 Business Metrics as Evaluation Criteria

In practice, we may also use business metrics such as ROI, profit, or customer lifetime value (CLV) as evaluation criteria for predictive models.
For example, in targeted marketing campaigns, we may want to evaluate predictive models based on the ROI of the marketing campaign when using the model to select target customers. The intuition is that a better predictive model should lead to a higher ROI for the marketing campaign. We will see an example of this in the case study later.
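
A minimal sketch of this idea is shown below. All numbers are hypothetical: the predicted probabilities, the cost of 2 per offer, the profit of 20 per response, and the targeting cut-off of 0.25 are assumptions for illustration and are not taken from the case study.

```r
# Hypothetical test-set customers
predicted_prob <- c(0.05, 0.40, 0.10, 0.80, 0.02, 0.30)  # model's predicted response probability
responded      <- c(0, 1, 0, 1, 0, 0)                    # actual outcome

cost_per_offer      <- 2    # assumed cost of sending one offer
profit_per_response <- 20   # assumed profit from one responding customer

# ROI of a campaign that sends offers to the customers flagged in `targeted`
campaign_roi <- function(targeted) {
  cost   <- sum(targeted) * cost_per_offer
  profit <- sum(responded[targeted]) * profit_per_response
  (profit - cost) / cost
}

roi_blanket  <- campaign_roi(rep(TRUE, length(responded)))  # send offers to everyone
roi_targeted <- campaign_roi(predicted_prob >= 0.25)        # send offers to likely responders only

print(c(blanket = roi_blanket, targeted = roi_targeted))
```

In this toy example the targeted campaign reaches both responders while sending half as many offers, so its ROI is higher than that of blanket marketing.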

5 Random Forest

5.1 Disadvantages of Decision Trees

Single decision trees tend to overfit, resulting in unstable predictions.
Due to this high variance, single decision trees tend to have poor predictive accuracy on new data.

5.2 Random Forest

To overcome the overfitting tendency of a single decision tree, the random forest was developed by Breiman (2001).
- Instead of using all customers, each tree is grown on a random subsample of customers (e.g., 70% of training data)
- Instead of using all features for splitting, each split in a tree considers only a random subset of features (e.g., 3 out of 5 features)
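
The two sources of randomness can be sketched in base R. This only illustrates the sampling step, not a full random forest; the customer count and the three placeholder feature names are hypothetical.

```r
set.seed(888)
n_customers <- 1000
# The last three feature names are placeholders for illustration
features <- c("Recency", "total_spending", "feature_3", "feature_4", "feature_5")

# Random subsample of customers used to grow one tree (70% of the training data)
rows_for_tree <- sample(seq_len(n_customers), size = 0.7 * n_customers)

# Random subset of candidate features (3 out of 5)
candidate_features <- sample(features, size = 3)

print(length(rows_for_tree))
print(candidate_features)
```

Because every tree sees a different random draw, the trees make different mistakes, and combining them reduces the variance of the final prediction.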

5.3 Visualization of Random Forest

For a new customer,

Each tree gives a prediction of the outcome
Random forest takes the average (for regression tasks) or majority vote (for classification tasks) of all trees’ predictions as the final prediction
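
The aggregation step can be sketched as follows, assuming a small forest of 5 trees and 2 new customers. The individual tree predictions are made-up numbers.

```r
# Regression task: hypothetical spending predictions from 5 trees (columns)
# for 2 new customers (rows)
tree_predictions <- matrix(
  c(100, 120, 110,  90, 130,
    200, 220, 210, 190, 230),
  nrow = 2, byrow = TRUE
)
forest_average <- rowMeans(tree_predictions)   # average across trees

# Classification task: hypothetical 0/1 votes from the same 5 trees
tree_votes <- matrix(
  c(1, 1, 0, 1, 0,
    0, 0, 1, 0, 0),
  nrow = 2, byrow = TRUE
)
# A customer is predicted as 1 if more than half of the trees vote 1
forest_vote <- as.integer(rowSums(tree_votes) > ncol(tree_votes) / 2)

print(forest_average)
print(forest_vote)
```

For the first customer, 3 of the 5 trees vote 1, so the forest predicts 1; for the second customer only 1 tree votes 1, so the forest predicts 0.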

5.4 Implementation of Random Forest in R

Package ranger provides implementation of random forest in R.
ranger() is the function in the package to train a random forest; refer to its help function for more details.
The following code shows how to train a random forest consisting of 500 decision trees, where the outcome variable is Response, and the predictors are total_spending and Recency.

Code

pacman::p_load(ranger)
randomforest1 <- ranger(
    formula = Response ~ total_spending + Recency, # formula
    data = data_full, # dataset to train the model
    num.trees = 500, # 500 decision trees
    seed = 888, # fix the random seed so that results can be replicated
    probability = TRUE # to return predicted probabilities
)

5.5 Make Predictions from Random Forest

After we train the predictive model, we can use the predict() function to make predictions
- The 1st argument is the trained model object
- The 2nd argument is the dataset on which to make predictions

Code

# Make predictions on data_full
prediction_rf <- predict(randomforest1, data = data_full)

# prediction_rf is a list object; use $ to extract the predictions.
# Because the forest was trained with probability = TRUE, this is a matrix
# of predicted class probabilities (one column per class)
prediction_rf$predictions

5.6 After-Class Reading

(recommended) Decision tree in R
(recommended) Random forest in R

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Footnotes

1. “All models are wrong, but some are useful” – George Box. As business analysts, we need to use the “wrong models” correctly.

--- date: "`r (first_date+lubridate::dweeks(4))`" title: "Class 9 Supervised Machine Learning and Tree-Based Models" format: beamer: echo: true html: default --- # Supervised Learning ## Learning Objectives - Understand the fundamentals of supervised learning and its key components - Distinguish between classification and regression tasks - Recognize the accuracy-interpretability and bias-variance tradeoffs in machine learning - Learn how to implement and interpret decision trees - Understand random forests and their advantages over single decision trees - Apply cross-validation techniques to mitigate overfitting ## Supervised Learning - A **supervised learning model** is used when we have one or more **explanatory variables** AND **a response variable** and we would like to learn the **underlying true relationship** between the **explanatory** variables and the **response** variable as accurately as possible. ```{r} #| echo: false #| fig-align: 'center' knitr::include_graphics("images/Week 5/supervisedlearning.png") ``` ## Data Generating Process (DGP) We use the following notations for supervised learning tasks: $$ Y = f(X;\theta) + \epsilon $$ - $Y$ is the **response**/**outcome**/**target** variable to be predicted - $X = (X_1,X_2,...,X_p)$ are a set of **explanatory variables**/**features**/**predictors** - $f(X;\theta) + \epsilon$ is the true relationship between $X$ and $Y$, or DGP, which is never known to us[^1]; $\epsilon$ is the **randomness term** or **error term** - $\theta$ represents the set of **parameters** to be learnt from the data [^1]: "All models are wrong, but some are useful" -- George Box. As business analysts, we need to use the "wrong models" correctly. 
## Types of Supervised Learning Algorithms Depending on the type of the **response variable**, supervised learning tasks can be divided into two groups: - **Classification tasks** if the outcome is **categorical** - Whether a customer responds to marketing offers (e.g., 1 for response, 0 for no response) - Whether a customer churns (e.g., 1 for churn, 0 for no churn) - Which product a customer purchases (e.g., 1 for product A, 2 for product B, etc.) - **Regression tasks** if the outcome is **continuous** - Customer total spending in each period (e.g., $100, $200, etc.) - Demand forecasting such as the daily sales of a product (e.g., 100 units, 120 units, etc.) ## Difference between Supervised and Unsupervised Learning \footnotesize | | **Supervised Learning** | **Unsupervised Learning** | |-----------------------|-------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------| | **Description** | Estimate or predict an output based on one or more inputs. | Find structure and relationships from inputs. No "supervising" output. | | **Variables** | Explanatory and Response variables | Explanatory variables only | | **Goal** | (1) predict new values or (2) understand existing relationships between explanatory and response variables | Group observations into clusters based on similarity | | **Types of algorithms** | (1) Regression and (2) Classification | Clustering | # Fundamental Tradeoffs ## Accuracy-Interpretability Trade-off - Simpler models are easier to interpret but typically give lower accuracy - More complex models can give better prediction accuracy but results are harder to interpret ```{r} #| echo: false #| fig-align: 'center' knitr::include_graphics('images/Week 5/AccuracyVersusInterpretability.png') ``` ::: {.callout-note} ### After-Class Reading Due to time constraints, we only cover tree-based models in depth. 
Learn about other ML models in this [video](https://youtu.be/E0Hmnixke2g?si=0ISH4dy2XKkT4uIh). ::: ## Comparison of Classic Supervised Learning Models - Linear regression class models (easy to interpret, low accuracy) - Linear regression coefficients have economic interpretations but prediction accuracy is low - **Tree-based Models (balance between interpretability and accuracy)** - **Decision tree**, **random forest**, and gradient boosting models - Neural-network based models (hard to interpret, high accuracy) - Deep learning only gives estimated weights that have no direct business interpretations ::: {.columns} ::: {.column width='30%'} ```{r} #| echo: false #| fig-align: 'center' #| out-width: "2cm" knitr::include_graphics('images/Week 5/linear regression.png') ``` ::: ::: {.column width='30%'} ```{r} #| echo: false #| fig-align: 'center' #| out-width: "2cm" knitr::include_graphics("images/Week 5/decisiontree.png") ``` ::: ::: {.column width='30%'} ```{r} #| echo: false #| fig-align: 'center' #| out-width: "2cm" knitr::include_graphics("images/Week 5/neural network.jpg") ``` ::: ::: ## Bias Error and Variance Error - After we have trained a machine learning model, we can test the model performance by looking at the errors of predictions. - **bias** measures how far off the model's predictions are from the true values on average (systematic error) - **variance** measures how much the model's predictions vary when trained on different datasets (sensitivity to training data) ## Overfitting - If a predictive model **learns from one single training dataset too well**, then it may be too rigid and specialised and thus have a higher chance of failing to make predictions for another dataset accurately. This problem is called **overfitting**. - Overfitting leads to **low bias** but **high variance**. This is not ideal because with supervised learning models, we want to have higher prediction accuracy for new data. 
```{r} #| echo: false #| fig-align: 'center' #| out-width: "3cm" knitr::include_graphics('images/Week 5/overfitting.png') ``` ## Underfitting - On another extreme, **underfitting** occurs when a predictive model cannot sufficiently capture the DGP even on the historical training data. - Underfitting leads to **high bias** but typically **low variance**. An underfitting model performs poorly on both training and test data, which should be avoided by all means. - To mitigate the underfitting problem, we need to select more suitable or more complex models. ```{r} #| echo: false #| fig-align: 'center' knitr::include_graphics("images/Week 5/OverfittingUnderfitting.png") ``` ## Bias-Variance Trade-off - Increasing model complexity (e.g., adding more layers to a neural network or more branches to a decision tree) typically decreases bias but increases variance. The model fits the training data better but becomes more sensitive to it, leading to overfitting. - Decreasing model complexity (e.g., using a simpler model like linear regression) typically increases bias but decreases variance. The model is more general but may miss underlying patterns in the data, leading to underfitting. - Hence we face a **bias-variance trade-off** or **bias-variance dilemma**. ```{r} #| echo: false #| fig-align: 'center' #| out-width: "50%" knitr::include_graphics('images/Week 5/bias_variance_tradeoff.png') ``` ## How to Mitigate Overfitting - To mitigate the overfitting problem, when training predictive models, we need to use the **cross-validation** technique by splitting the full **historical data** into a **training set** and a **test set**. - **A training set** (70% - 80% of labelled data)**:** we train the ML model based on the training set. - **A test set** (20% - 30% of labelled data)**:** Using the trained ML model from the training data, we can make predictions for the test data. 
However, we do observe the actual outcomes for the test set, so that we can evaluate the prediction accuracy by comparing the predicted outcomes versus the actual outcomes. ```{r} #| echo: false #| fig-align: 'center' #| output-width: "3cm" knitr::include_graphics('images/Week 5/trainingtest.png') ``` ::: {.callout-note} For more complicated models with hyper-parameters such as deep learning models, we may even need to split our data into 3 sets (training, validation, and test sets). ::: # Decision Tree ## Introduction to Decision Tree - A **decision tree** is a tree-like structure, which can be used for both classification and regression tasks. ```{r} #| echo: false #| fig-align: 'center' knitr::include_graphics('images/Week 5/decisiontree.png') ``` ## Business Objective: Predict Customer Response to Marketing Offers - M&S made marketing offers to customers in the data, and the variable `Response` represents whether or not customers responded to the offer in the previous similar marketing campaign. - **Business objective**: Based on the historical data `data_full`, we want to train a decision tree model to predict the outcome variable `Response` based on `Recency` and `total_spending`. - **Data collection and cleaning:** \footnotesize \vspace{1em} ```{r} pacman::p_load(dplyr, modelsummary) data_full <- read.csv("data_full.csv") %>% mutate( total_spending = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds ) ``` ## Implementation of Decision Tree in R - Package `rpart` provides an efficient implementation of decision trees in R; Package `rpart.plot` provides visualizations of decision trees - `formula`: `Response ~ Recency + total_spending` means that we want to predict the outcome variable `Response` based on the explanatory variables `Recency` and `total_spending`. In R, we use `~` to separate the outcome variable and the explanatory variables for all supervised learning tasks. 
- `data`: the training dataset to train the model - `method`: "class" for **classification** tasks, "anova" for **regression** tasks \footnotesize \vspace{1em} ::: {.content-visible when-format="beamer"} ```{r} #| eval: false # Load the necessary packages pacman::p_load(rpart, rpart.plot) # Below example shows how to train a decision tree tree1 <- rpart( formula = Response ~ Recency + total_spending, data = data_full, method = "class" # classification task; or 'anova' for regression ) # visualize the tree rpart.plot(tree1) ``` ::: ::: {.content-hidden when-format="beamer"} ```{r} # Load the necessary packages pacman::p_load(rpart,rpart.plot) # Below example shows how to train a decision tree tree1 <- rpart( formula = Response ~ Recency + total_spending, # formula data = data_full, method = "class" # classification task; or 'anova' for regression ) # visualize the tree rpart.plot(tree1) ``` ::: ## How to Measure Split Quality: Classification Tasks - For classification tasks, the goal is to split the data to create nodes that are as "pure" as possible, meaning they contain instances of a single class. - Two common metrics are used to measure the quality of a split: Gini Impurity and Information Gain (based on Entropy). Gini impurity is more commonly used in practice due to its computational efficiency. ### **Gini Impurity** - Formula: $Gini = 1 - \sum_{i=1}^{C} (p_i)^2$, where $p_i$ is the proportion of samples of class $i$. - A Gini score of 0 indicates a perfectly pure node. The algorithm seeks splits that minimize the weighted Gini impurity of the child nodes. ## Numeric Example Let us start with a dataset of 10 customers, from which we observe X = total spending and Y = Response (1 for response, 0 for no response). 
\vspace{2em} ::::: columns ::: {.column width="33%"} Case 1: Purest - 10 customers responded - 0 customers did not respond - Gini = $1 - ((10/10)^2 + (0/10)^2) = 0$ ::: ::: {.column width="33%"} Case 2: In-between - 7 customers responded - 3 customers did not respond - Gini = $1 - ((7/10)^2 + (3/10)^2) = 0.42$ ::: ::: {.column width="33%"} Case 3: Impurest - 5 customers responded - 5 customers did not respond - Gini = $1 - ((5/10)^2 + (5/10)^2) = 0.5$ ::: ::::: ## Numeric Example: Split If we split the 10 customers in Case 2 into two child nodes based on total spending at a threshold of 1396: \vspace{2em} ::::: columns ::: {.column width="50%"} **Child Node 1 (Left)** - `total_spending < 1396` - 4 customers total - 1 responded - 3 did not respond - **Gini Calculation:** - $p_1 = 1/4 = 0.25$ - $p_0 = 3/4 = 0.75$ - Gini = $1 - (0.25^2 + 0.75^2) = 0.375$ - This node is **impure**. ::: ::: {.column width="50%"} **Child Node 2 (Right)** - `total_spending >= 1396` - 6 customers total - 6 responded - 0 did not respond - **Gini Calculation:** - $p_1 = 6/6 = 1$ - $p_0 = 0/6 = 0$ - Gini = $1 - (1^2 + 0^2) = 0$ - This node is **pure**. ::: ::::: ## Numeric Example: Weighted Gini and Gini Gain The goal is to find the split that results in the lowest weighted Gini impurity. 1. **Calculate Weighted Gini of the Split** - Weight (Left) = $4 / 10 = 0.4$ - Weight (Right) = $6 / 10 = 0.6$ - Weighted Gini = $(0.4 \times 0.375) + (0.6 \times 0) = 0.15$ 2. **Calculate Gini Gain (The Decision)** - Gini Gain = Gini (Parent) - Weighted Gini (Split) - Gini Gain = $0.42 - 0.15 = 0.27$ Since the Gini Gain is positive, impurity was reduced, making this a good split. The `rpart` algorithm repeats this for all possible splits and chooses the one with the **highest Gini Gain**. ## (Optional) How to Measure Split Quality: Regression Tasks - For regression tasks, the goal is to split the data to create nodes where the outcome values are as similar as possible. 
- The most common metric used to measure the quality of a split is the **Sum of Squared Errors (SSE)**. ### **Sum of Squared Errors (SSE)** - Measures the total squared difference between the actual values and the mean value of the outcome variable within a node. - Formula: $SSE = \sum_{i \in \text{node}} (y_i - \bar{y}_{\text{node}})^2$, where $y_i$ is the actual value and $\bar{y}_{\text{node}}$ is the mean value of the outcome in the node. - The algorithm seeks the split that results in the largest reduction in the total SSE of the child nodes compared to the parent node. ## How Decision Tree Works: Step 1 ```{r} #| echo: false #| fig-align: 'center' knitr::include_graphics('images/Week 5/decisiontree1.png') ``` Step 1. The decision tree (DT) will try to split customers into 2 groups based on each unique value of each variable, and see which split can lead to customers being most different in terms of outcome `Response`. ## How Decision Tree Works: Step 1 ```{r} #| echo: false #| fig-align: 'center' knitr::include_graphics('images/Week 5/decisiontree1.png') ``` Step 1. The decision tree (DT) will try to split customers into 2 groups based on each unique value of each variable, and see which split can lead to customers being most different in terms of outcome `Response`. - After this step, DT finds that total spending is the best variable and 1396 is the best cut-off. - DT therefore splits customers into 2 groups based on 1396. - In each node, the 3 numbers are: (1) predicted outcome, (2) predicted probability of outcome being 1, and (3) share of customers in the node ## How Decision Tree Works: Step 2 ```{r} #| echo: false #| fig-align: 'center' knitr::include_graphics('images/Week 5/decisiontree2.png') ``` Step 2. For customers in the left branch (`total_spending` \< 1396), DT will continue to split based on each unique value of each variable, and see which split can result in the customers being most different in terms of `Response`. 

- However, DT couldn't find a cut-off that sufficiently differentiates customers, so DT stops in the left branch.

## How Decision Tree Works: Step 3 ...

```{r}
#| echo: false
#| fig-align: 'center'
knitr::include_graphics('images/Week 5/decisiontree3.png')
```

Step 3. For customers in the right branch (`total_spending >= 1396`), DT again tries to split at each unique value of each variable, and checks which split makes the two groups most different in terms of `Response`.

- After this step, DT finds `Recency` is the best variable and 72 is the best cut-off. DT further splits customers into 2 groups.

Step 4. This process continues until DT determines that there is no need to further split customers.

## How Decision Tree Works: Step 4

- Once the tree is fully grown, we can use the tree to make predictions on new customers.
- For a new customer, we can follow the tree from the root node to the leaf node, and the predicted outcome is the outcome of the leaf node.
- In R, we can use the `predict()` function to make predictions on new customers, which returns the predicted outcome of the new customers. Note that the test data should have the **exact same variable names** as the training data.

```{r}
#| eval: false
# Make predictions on the test data
# For an rpart tree, the new dataset is passed via the `newdata` argument
prediction_tree1 <- predict(tree1, newdata = data_test)
```

## Advantages of Decision Trees

- They are very interpretable.
- Making predictions is fast.
- It's easy to understand which variables are important in making the prediction. The variables used at the internal nodes (splits) are those that reduce the Gini impurity or SSE (the split criteria) the most.
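
## (Optional) Numeric Example: Gini Gain in R

The split criterion used by the tree can be checked by hand. The sketch below reproduces the numbers from the numeric example (7 responders and 3 non-responders, split at `total_spending` of 1396) in base R. The helper `gini_impurity()` and the variable names are hypothetical and written only for illustration; they are not part of `rpart`.

```{r}
# Hypothetical helper (not from rpart): Gini impurity from class counts in a node
gini_impurity <- function(counts) {
  p <- counts / sum(counts)  # class shares in the node
  1 - sum(p^2)
}

# Class counts as c(responded, did not respond), taken from the numeric example
parent <- c(7, 3)
left   <- c(1, 3)  # total_spending < 1396
right  <- c(6, 0)  # total_spending >= 1396

# Weighted Gini of the split: each child is weighted by its share of customers
n <- sum(parent)
weighted_gini <- sum(left) / n * gini_impurity(left) +
  sum(right) / n * gini_impurity(right)

# Gini gain: reduction in impurity relative to the parent node
gini_gain <- gini_impurity(parent) - weighted_gini

c(parent = gini_impurity(parent), weighted = weighted_gini, gain = gini_gain)
```

Changing the counts in `left` and `right` to mimic a different cut-off shows how the gain varies across candidate splits, which is the comparison the tree makes at every node.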
# Prediction Accuracy (Optional)

## Classification Tasks

For classification tasks, we can evaluate model performance using:

- **Confusion Matrix**: A table showing true positives, true negatives, false positives, and false negatives

|                 | **Predicted: No** | **Predicted: Yes** |
|-----------------|-------------------|--------------------|
| **Actual: No**  | True Negatives    | False Positives    |
| **Actual: Yes** | False Negatives   | True Positives     |

## Classification Tasks

Based on the confusion matrix, we can further compute the following metrics:

- **Accuracy**: The proportion of correct predictions

$$\text{Accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total Predictions}}$$

- **Precision**: Among predicted positives, how many are actually positive

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}$$

- **Recall (Sensitivity)**: Among actual positives, how many are correctly predicted

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}$$

- **F1-Score**: Harmonic mean of precision and recall

$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision + Recall}}$$

## Regression Tasks

For regression tasks, we can evaluate model performance using:

- **Mean Absolute Error (MAE)**: Average absolute difference between predicted and actual values

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

- **Root Mean Square Error (RMSE)**: Square root of average squared differences

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

- **Sum of Squared Errors (SSE)**: Total squared difference between predicted and actual values

$$\text{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

- Lower MAE/RMSE/SSE indicate better predictions

::: {.callout-note}
In R, the `caret` package provides functions to compute these metrics easily.
If you are interested, you can explore this DataCamp course: [Machine Learning with caret in R](https://www.datacamp.com/courses/machine-learning-with-caret-in-r)
:::

## Business Metrics as Evaluation Criteria

- In practice, we may also use business metrics such as ROI, profit, or customer lifetime value (CLV) as evaluation criteria for predictive models.
- For example, in targeted marketing campaigns, we may want to evaluate predictive models based on the ROI of the marketing campaign when using the model to select target customers. The intuition is that a better predictive model should lead to a higher ROI for the marketing campaign. We will see an example of this in the case study later.

# Random Forest

## **Disadvantages of Decision Trees**

- Single decision trees tend to overfit, resulting in unstable predictions.
- Due to this high variance, single decision trees tend to have poor predictive accuracy on new data.

## Random Forest

- To overcome the overfitting tendency of a single decision tree, the random forest algorithm was developed [@breiman2001]. A random forest grows many decision trees, each with two sources of randomness:
- Instead of using all customers, each tree is grown on a **subsample** of customers (e.g., 70% of the training data)
- Instead of using all features for splitting, each tree is grown on a **subset** of features (e.g., 3 out of 5 features)

## Visualization of Random Forest

```{r}
#| echo: false
#| fig-align: 'center'
knitr::include_graphics('images/Week 5/randomforest.png')
```

For a new customer,

- Each tree gives a prediction of the outcome
- Random forest takes the average (for regression tasks) or majority vote (for classification tasks) of all trees' predictions as the final prediction

## Implementation of Random Forest in R

- Package `ranger` provides an implementation of random forest in R.
- `ranger()` is the function in the package that trains a random forest; refer to its help page (`?ranger`) for more details.
- The following code shows how to train a random forest consisting of 500 decision trees, where the outcome variable is `Response`, and the predictors are `total_spending` and `Recency`.
\footnotesize

\vspace{1em}

```{r}
pacman::p_load(ranger)
randomforest1 <- ranger(
  formula = Response ~ total_spending + Recency, # outcome ~ predictors
  data = data_full,                              # dataset to train the model
  num.trees = 500,                               # grow 500 decision trees
  seed = 888,                                    # fix the seed for reproducibility
  probability = TRUE                             # return predicted probabilities
)
```

## Make Predictions from Random Forest

- After we train the predictive model, we can use the `predict()` function to make predictions
    - The 1st argument is the trained model object
    - The 2nd argument is the dataset on which to make predictions

\footnotesize

\vspace{1em}

```{r}
#| eval: false
# Make predictions on data_full
prediction_rf <- predict(randomforest1, data = data_full)

# prediction_rf is a list-like object, so use $ to extract the predictions
# With probability = TRUE, this is a matrix of predicted probabilities
# (one column per class) rather than a single numeric vector
prediction_rf$predictions
```

## After-Class Reading

- (recommended) [Decision tree in R](http://uc-r.github.io/regression_trees)
- (recommended) [Random forest in R](http://uc-r.github.io/random_forests)