Class 7 Supervised Learning Basics

Author: Dr Wei Miao

Affiliation: UCL School of Management

Published: October 25, 2023

1 Supervised Learning

1.1 Motivation: Why Supervised Learning for Business?

1.2 Supervised Learning

  • A supervised learning model is used when we have one or more explanatory variables and a response variable, and we would like to learn the relationship between them (the data generating process, or DGP) as accurately as possible.

1.3 Data Generating Process

\[ Y = f(X;\theta) + \epsilon \]

  • \(f\) is the function that characterizes the true relationship between \(X\) and \(Y\), i.e., the DGP, which is never known to us¹
  • \(Y\) is the response or outcome variable to be predicted, e.g., whether a customer responded to our marketing offer.
  • \(X = (X_1,X_2,...,X_p)\) are a set of explanatory variables, sometimes called features or predictors, e.g.,
    • (1) customers’ past purchase history (e.g., spending in each category, frequency of purchase, recency of purchase)
    • (2) demographic variables (e.g., gender, income, age, kids, etc.)
  • \(\theta\) represents the set of function parameters to be trained
  • \(\epsilon\) is the error term (a toy simulation of this DGP is sketched below)
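A toy simulation makes the notation concrete. The sketch below generates data from this DGP in Python, assuming a simple linear \(f\) and normally distributed noise; all choices here (two variables, the parameter values, the noise scale) are purely illustrative.

```python
# A toy simulation of the DGP Y = f(X; theta) + epsilon.
# Assumptions (illustrative only): f is linear, noise is Gaussian.
import numpy as np

rng = np.random.default_rng(42)

n = 1000                                  # number of customers
X = rng.normal(size=(n, 2))               # explanatory variables, e.g., spending and income
theta = np.array([2.0, -1.0])             # true parameters, never observed in practice
epsilon = rng.normal(scale=0.5, size=n)   # error term

Y = X @ theta + epsilon                   # response generated by the DGP
```

In practice we observe only \(X\) and \(Y\); the goal of supervised learning is to recover an estimate of \(f\) (here, of theta) from those observations.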

1.4 Types of Supervised Learning Algorithms

Depending on the type of the response variable, supervised learning tasks can be divided into two groups (a short sketch of each follows the list):

  • Classification tasks if the outcome is categorical
    • Whether a customer responds to marketing offers
    • Whether a customer churns
    • Which product a customer purchases
  • Regression tasks if the outcome is continuous
    • Customer total spending in each period
    • Demand forecasting such as the daily sales of a product
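A minimal sketch of the two task types using scikit-learn on synthetic data; the datasets and models here are illustrative choices, not prescribed by the lecture.

```python
# Illustrative contrast between classification and regression (synthetic data).
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: categorical outcome, e.g., responds to the offer or not
Xc, yc = make_classification(n_samples=500, n_features=5, random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print(clf.predict(Xc[:3]))     # discrete labels, e.g., [0 1 0]

# Regression: continuous outcome, e.g., total spending in a period
Xr, yr = make_regression(n_samples=500, n_features=5, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:3]))     # continuous predicted amounts
```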

1.5 Difference between Supervised and Unsupervised Learning

[Figure: comparison of supervised vs. unsupervised learning. Source: Statology]

2 Fundamental Tradeoffs

2.1 Accuracy-Interpretability Tradeoff

  • Simpler models are easier to interpret but give lower accuracy
  • More complex models can give better prediction accuracy, but their results are hard to interpret

Source: Pichler, Maximilian, & Hartig, Florian (2022). Machine Learning and Deep Learning: A Review for Ecologists.

2.2 Comparison of Classic Supervised Learning Models

  • Linear regression class models (easy to interpret, low accuracy)

    • Linear regression coefficients have economic interpretations, but prediction accuracy is low
  • Tree-based models (a balance between interpretability and accuracy)

    • Decision tree, random forest
  • Neural-network-based models (hard to interpret, high accuracy)

    • Deep learning only gives estimated weights that have no direct business interpretation (see the sketch below)
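The contrast can be seen directly in code. In this minimal sketch on synthetic data (all data and settings are illustrative), the linear model exposes one interpretable coefficient per explanatory variable, the random forest offers only relative feature importances, and the neural network only raw weight matrices.

```python
# Illustrative comparison of interpretability across the three model classes.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=3, noise=5.0, random_state=0)

lin = LinearRegression().fit(X, y)
print(lin.coef_)                 # one economically interpretable effect per variable

rf = RandomForestRegressor(random_state=0).fit(X, y)
print(rf.feature_importances_)   # relative importances only: no sign, no marginal effect

nn = MLPRegressor(max_iter=2000, random_state=0).fit(X, y)
print([w.shape for w in nn.coefs_])  # layers of weights with no direct business meaning
```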

2.3 Bias-Variance Tradeoff

  • After we have trained a machine learning model, loosely speaking, bias is the prediction error of the model on the historical (training) data, while variance is the prediction error of the model on unseen, new data.

  • If a predictive model fits the historical data too well, it ends up capturing the noise in that data rather than the underlying DGP, and thus has a higher chance of failing to predict new data accurately. This problem is called overfitting.

  • Overfitting leads to low bias but high variance; hence the name bias-variance tradeoff, or bias-variance dilemma. The sketch below makes the tradeoff concrete.
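A minimal sketch of the tradeoff, assuming a noisy sine-curve DGP (an illustrative choice): as the polynomial degree grows, the error on the historical (training) data falls while the error on new (test) data eventually rises.

```python
# Illustrative overfitting demo: noisy sine-curve DGP, polynomial fits.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))  # error on historical data
    test_mse = mean_squared_error(y_test, model.predict(X_test))     # error on new data
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```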

3 Overfitting and Underfitting

3.1 How to Mitigate Overfitting

  • To mitigate the overfitting problem, when training predictive models we use the cross-validation technique, in its simplest form a holdout split of the full historical data into a training set and a test set (see the sketch at the end of this subsection).
    • A training set (70%-80% of the labelled data): we train the predictive model on the training set.
    • A test set (20%-30% of the labelled data): we pretend we do not know the outcomes for the test set and generate predictions from the trained model. Since we do in fact observe the actual outcomes for the test set, we can evaluate prediction accuracy by comparing the predicted outcomes against the actual ones.

  • For more complicated models with hyper-parameters, such as deep learning models, we may even need to split the data into three sets (training, validation, and test sets).
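A minimal sketch of the holdout procedure described above, on synthetic labelled data; the dataset and model are illustrative stand-ins.

```python
# Illustrative holdout split: 70% training, 30% test, on synthetic labelled data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Train only on the training set; hold the test set out entirely
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Evaluate by comparing predicted vs. actual outcomes on the held-out test set
print(accuracy_score(y_test, model.predict(X_test)))
```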

3.2 Underfitting

  • Underfitting occurs when a predictive model fails to sufficiently capture the DGP, so it predicts poorly on both historical data and new data.

  • Underfitting leads to high bias as well as high variance. Thus, underfitting is the worst case and should be avoided at all costs.

  • To mitigate the underfitting problem, we need to select more suitable (usually more flexible) models, as the sketch below illustrates.
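A minimal sketch of underfitting, reusing the noisy sine-curve assumption from the overfitting sketch: a straight line misses the DGP on both historical and new data, whereas a more flexible model captures it.

```python
# Illustrative underfitting demo: a straight line vs. a more flexible model
# on the same noisy sine-curve DGP used in the overfitting sketch.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for model in (LinearRegression(), RandomForestRegressor(random_state=1)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          round(mean_squared_error(y_train, model.predict(X_train)), 3),  # high for the line: underfit
          round(mean_squared_error(y_test, model.predict(X_test)), 3))    # high on new data too
```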

Footnotes

  1. “All models are wrong, but some are useful” – George Box. As business analysts, we need to use the “wrong models” correctly.↩︎