Descriptive Analytics for M&S

MSIN0094 Case Study

Author

Affiliation

Wei Miao

UCL School of Management

Published

October 16, 2024

1 The Power of Descriptive Analytics

The amount of data created worldwide has been increasing exponentially over the past decade with some estimates placing the total at over 60 zettabytes as of 2030 (Source: Statista). Data without analytics, however, is of little value to business decision-makers aiming to improve performance and increase growth. It is therefore no surprise that top-tier consulting companies, analytics ﬁrms and business schools have been promoting the positive returns to greater usage of analytics technology. It also explains why an increasing number of data analytics enthusiasts (like Tom and you!) are willing to pay up to £40k tuition fee (that is 10,000 bubble teas!) to join the world’s best MSc Business Analytics program at the UCL School of Management (hmm, it’s now already week 3, too late to ask David for a refund!).

By identifying patterns and trends in massive amounts of data, business analytics enables organizations to make better decisions and improve performance. Descriptive analytics is the simplest and most widely used type of analytics; it is used to generate key performance indicators (KPIs) and metrics for business reports and dashboards. Research shows that, even with the adoption of very simple descriptive analytics, businesses can improve their performance by a large extent — Berman and Israeli (2022) analyze the staggered adoption of a retail analytics dashboard by more than 1,500 e-commerce websites,¹ and find an increase of 4%–10% in average weekly revenues post-adoption. The increase in revenue is not explained by price changes or advertising optimization. Instead, it is consistent with the improvement on customer relationship management, personalization, and prospecting technologies to retailer websites. The adoption and usage of descriptive analytics also increases the diversity of products sold, the number of transactions, the numbers of website visitors and unique customers, and the revenue from repeat customers. These findings are consistent with a complementary effect of descriptive analytics that serve as a monitoring device that helps business analysts control additional martech tools and amplify their value. Without using the descriptive dashboard, companies are unable to reap the benefits associated with these technologies.

In the remaining of this case, we will explore (1) how to conduct data wrangling and data cleaning using R and (2) how to conduct descriptive analytics for M&S to achieve marketing efficiency. You might ask, where is Tom? Well, M&S rewarded Tom with a few days off for his successful calculation of CLV last week, and he is now considering setting up his bubble tea business in Canary Wharf. You are the only one who can help M&S now until Tom’s return!

2 Database Marketing at M&S

We learned in Week 2, how to compute the customer lifetime value for M&S. However, when computing the CLV, we used an “average” approach, which did not consider customer heterogeneity. That is, when considering each component in the CLV formula, such as customer spending in each period, their retention rate, etc., we took the average across all customers, and assumed customers are homogeneous. As a result, every customer would have the same CLV. Nevertheless, this is a strong assumption in practice — every customer is unique and should be treated differently.

The key to successful customer relationship management is to maintain a customer database that tracks detailed customer information, including their demograhpic information and past purchase history. This information would empower marketing analytics team to compute individual CLV for each customer, and conduct individualized targeted marketing.

2.1 Demographic Information

Knowing your consumer is a vital concept of running any business. Is the business selling fertilizer to farmers, apparel to teenage girls, or vacations to senior citizens? The distinctions are readily apparent in this comparison.

Demographics define the qualities of clients. To be successful, business owners must understand the demographics of their clients and the trends or changes that are occurring within those specific traits.

The following demographic information is usually of interest to business managers:

Age: Consumer behavior is strongly influenced by age. Younger consumers are more affluent and willing to spend more on entertainment, fashion, and movies. Seniors spend less on these items; they are less active, spend more time indoors, and require more medical treatment. Additionally, market segments can be defined by age groups. For instance, digital devices such as iPhones are targeted more towards millennials than at seniors. While older adults are increasingly utilizing technology, they remain less digitally savvy than millennials and purchase fewer digital products.

Gender: Gender also matters. Males and females have vastly diverse demands and tastes, which influence their purchasing decisions. As a result, some products are created with a specific gender in mind. Macy’s, Nordstrom, and The Gap all have departments dedicated to teenage girls’ clothes, while Seiko has a specific line of diver watches for men only.

Income: Income has a substantial influence on consumer behavior and product purchases. Middle-income customers make purchases with due regard for the utility of money. They do not have unlimited money to spend, and hence the money spent on one item may be used on something else. On the other hand, consumers with higher incomes tend to be less price sensitive and have a higher willingness to pay.

Education: Consumers’ level of education has an effect on their impressions of the world around them and on the amount of research they conduct prior to making a purchase. Individuals with a higher level of education will spend more time educating themselves before investing their money. Education has an impact on fashion, film, and television programming. Consumers with a higher level of education can be more distrustful of commercials and the facts offered.

M&S has collected rich customer demographic information through its loyalty program, M&S Sparks membership. The data scientist team has the following demographic variables:

ID: Customer’s unique identifier
Year_Birth: Customer’s birth year
Education: Customer’s education level
Marital_Status: Customer’s marital status
Income: Customer’s yearly household income
Kidhome: Number of children in customer’s household
Teenhome: Number of teenagers in customer’s household
Dt_Customer: Date of customer’s enrollment with the company

2.2 Purchase History

“History doesn’t repeat itself, but it often rhymes.”

This popular aphorism, frequently (and perhaps incorrectly) attributed to Mark Twain, is frequently invoked to demonstrate that, while past events do not always provide a clear indication of future events, they do provide valuable context. This sentiment is especially true for marketing managers, where a consumer’s purchase history provides invaluable insight into their future purchasing habits.

M&S’s data engineering team has assembled a cross-sectional customer purchase history data today, with variables including²:

MntWines: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold and jewelry in last 2 years
NumDealsPurchases: Number of purchases made with a discount
NumWebPurchases: Number of purchases made through Ocado’s web site
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s web site in the last month
Complain: 1 if customer complained in the last 2 years, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
Recency: Number of days since customer’s last purchase

3 Data Wrangling

3.1 Data Loading

To work on the datasets, we first need to load the raw data into R. In R, we can use read.csv(filepath) to load the data into R environment.

For your convenience, I have stored the data file on my Dropbox and the download link will be provided. We can directly feed the url links to read.csv() function in R to download and create the dataset in R’s environment.

The data is also provided to you in .csv format, which you can also load from your local disk.

Question 1
Answer

Load the dataset into your R environment. Name the dataset as data_full.

Code

## use read.csv() to download and load the data, assign it to an R data object
## header = T argument is to tell read.csv() to keep the dataset header (first row)

pacman::p_load(dplyr, modelsummary)

data_full <- read.csv("https://www.dropbox.com/scl/fi/2q7ppqtyca0pd3j486osl/data_full.csv?rlkey=gsyk51q27vd1skek4qpn5ikgm&dl=1")

After running the above code blocks, you should see the dataset named data_full in your RStudio environment.

Now, click into the dataset, take a look, and get a sense of how it looks like.

Check all data types in data_full are correct and as expected

Code

# can use str() to get the structure of data
# Lina covered this in the induction week
str(data_full)

'data.frame':   2000 obs. of  22 variables:
 $ ID                 : int  5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
 $ MntWines           : int  635 11 426 11 173 520 235 76 14 28 ...
 $ MntFruits          : int  88 1 49 4 43 42 65 10 0 0 ...
 $ MntMeatProducts    : int  546 6 127 20 118 98 164 56 24 6 ...
 $ MntFishProducts    : int  172 2 111 10 46 0 50 3 3 1 ...
 $ MntSweetProducts   : int  88 1 21 3 27 42 49 1 3 1 ...
 $ MntGoldProds       : int  88 6 42 5 15 14 27 23 2 13 ...
 $ NumDealsPurchases  : int  3 2 1 2 5 2 4 2 1 1 ...
 $ NumWebPurchases    : int  8 1 8 2 5 6 7 4 3 1 ...
 $ NumCatalogPurchases: int  10 1 2 0 3 4 3 0 0 0 ...
 $ NumStorePurchases  : int  4 2 10 4 6 10 7 4 2 0 ...
 $ NumWebVisitsMonth  : int  7 5 4 6 5 6 6 8 9 20 ...
 $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Response           : int  1 0 0 0 0 0 0 0 1 0 ...
 $ Year_Birth         : int  1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
 $ Education          : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
 $ Marital_Status     : chr  "Single" "Single" "Together" "Together" ...
 $ Income             : int  58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
 $ Kidhome            : int  0 1 0 1 1 0 0 1 1 1 ...
 $ Teenhome           : int  0 1 0 0 0 1 1 0 0 1 ...
 $ Dt_Customer        : chr  "04/09/2012" "08/03/2014" "21/08/2013" "10/02/2014" ...
 $ Recency            : int  58 38 26 26 94 16 34 32 19 68 ...

3.3 Data Cleaning

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset. We often need to deal with missing values and outliers. In this case, we will focus on missing values.

Question 3
Answer

Check if any variables in data_full have missing values. If so, how many missing values are there in each variable? Use the variable average to replace missing values in the dataset.

Code

datasummary_skim(data_full, fmt = 3)

Warning: These variables were omitted because they include more than 50 levels:
Dt_Customer.

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
ID	2000	0	5599.172	3242.019	0.000	5492.000	11191.000
MntWines	738	0	306.058	338.276	0.000	176.500	1493.000
MntFruits	157	0	26.358	39.885	0.000	8.000	199.000
MntMeatProducts	532	0	167.885	225.292	0.000	68.000	1725.000
MntFishProducts	179	0	37.614	54.556	0.000	12.000	259.000
MntSweetProducts	175	0	27.495	41.756	0.000	8.000	263.000
MntGoldProds	207	0	43.786	51.736	0.000	24.000	362.000
NumDealsPurchases	15	0	2.321	1.958	0.000	2.000	15.000
NumWebPurchases	15	0	4.082	2.773	0.000	4.000	27.000
NumCatalogPurchases	14	0	2.691	2.956	0.000	2.000	28.000
NumStorePurchases	14	0	5.819	3.256	0.000	5.000	13.000
NumWebVisitsMonth	15	0	5.308	2.452	0.000	6.000	20.000
Complain	2	0	0.010	0.100	0.000	0.000	1.000
Response	2	0	0.150	0.358	0.000	0.000	1.000
Year_Birth	59	0	1968.843	12.020	1893.000	1970.000	1996.000
Income	1783	1	52139.697	21492.372	1730.000	51518.000	162397.000
Kidhome	3	0	0.444	0.540	0.000	0.000	2.000
Teenhome	3	0	0.501	0.546	0.000	0.000	2.000
Recency	100	0	49.165	28.951	0.000	50.000	99.000
		N	%
Education	2n Cycle	185	9.250
	Basic	43	2.150
	Graduation	992	49.600
	Master	327	16.350
	PhD	453	22.650
Marital_Status	Alone	3	0.150
	Divorced	206	10.300
	Married	767	38.350
	Single	436	21.800
	Together	521	26.050
	Widow	67	3.350

We find that income has missing values for a few customers. We can replace the missing values with the average income of all customers.
- Tip: use mutate() and ifelse() functions in dplyr to replace missing values

Code

data_full <- data_full %>%
    mutate(Income = ifelse(is.na(Income), mean(Income, na.rm = T), Income))

4 Preliminary Customer Analysis

Next, once the final dataset is ready, we can proceed to use descriptive analytics to conduct preliminary customer analysis using dplyr.

Descriptive analytics is concerned with summarizing and highlighting patterns in current and historical data in order to assist businesses in comprehending what has occurred thus far. However, it makes no attempt to explain why something occurred or to forecast what may occur in the future. “why something occurred” belongs to the scope of causal inference; “forecast what may occur in the future” falls in the scope of predictive analytics. To answer those questions, analysts must combine descriptive analytics with other types of analytics, which we will learn in later weeks.

Your task is to ‘get to know’ the data_full by conducting some descriptive analytics.

Question 4
Answer

Report the summary statistics of the cleaned data, discuss your business insights (open question; from central tendency and dispersion perspectives)

Code

# remember to load package modelsummary before calling this function
datasummary_skim(data_full)

Warning: These variables were omitted because they include more than 50 levels:
Dt_Customer.

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
ID	2000	0	5599.2	3242.0	0.0	5492.0	11191.0
MntWines	738	0	306.1	338.3	0.0	176.5	1493.0
MntFruits	157	0	26.4	39.9	0.0	8.0	199.0
MntMeatProducts	532	0	167.9	225.3	0.0	68.0	1725.0
MntFishProducts	179	0	37.6	54.6	0.0	12.0	259.0
MntSweetProducts	175	0	27.5	41.8	0.0	8.0	263.0
MntGoldProds	207	0	43.8	51.7	0.0	24.0	362.0
NumDealsPurchases	15	0	2.3	2.0	0.0	2.0	15.0
NumWebPurchases	15	0	4.1	2.8	0.0	4.0	27.0
NumCatalogPurchases	14	0	2.7	3.0	0.0	2.0	28.0
NumStorePurchases	14	0	5.8	3.3	0.0	5.0	13.0
NumWebVisitsMonth	15	0	5.3	2.5	0.0	6.0	20.0
Complain	2	0	0.0	0.1	0.0	0.0	1.0
Response	2	0	0.2	0.4	0.0	0.0	1.0
Year_Birth	59	0	1968.8	12.0	1893.0	1970.0	1996.0
Income	1783	0	52139.7	21400.8	1730.0	51844.5	162397.0
Kidhome	3	0	0.4	0.5	0.0	0.0	2.0
Teenhome	3	0	0.5	0.5	0.0	0.0	2.0
Recency	100	0	49.2	29.0	0.0	50.0	99.0
		N	%
Education	2n Cycle	185	9.2
	Basic	43	2.1
	Graduation	992	49.6
	Master	327	16.4
	PhD	453	22.6
Marital_Status	Alone	3	0.1
	Divorced	206	10.3
	Married	767	38.4
	Single	436	21.8
	Together	521	26.0
	Widow	67	3.4

Question 5
Answer

Mutate a new variable called total_spending, which means the sum of spending on all products by customers. Mute a new variable called Frequency, which means the sum of the number of purchases made through different channels. Provide a correlation table between total_spending and other continuous variables, including Frequency, Complain, Response, Income, Kidhome, Teenhome, Recency.

Code

data_full <- data_full %>%
    mutate(total_spending = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds) %>%
    mutate(Frequency = NumStorePurchases + NumWebPurchases + NumCatalogPurchases)

Code

datasummary_correlation(data_full %>%
    select(total_spending, Complain, Response, Income, Kidhome, Teenhome, Recency))

	total_spending	Complain	Response	Income	Kidhome	Teenhome	Recency
total_spending	1	.	.	.	.	.	.
Complain	-.04	1	.	.	.	.	.
Response	.25	.00	1	.	.	.	.
Income	.79	-.03	.15	1	.	.	.
Kidhome	-.56	.04	-.07	-.51	1	.	.
Teenhome	-.14	.00	-.15	.03	-.03	1	.
Recency	.01	.02	-.20	.00	.00	.02	1

RFM Analysis

RFM analysis is a customer segmentation technique that uses past purchase behavior to divide customers into groups. RFM stands for Recency, Frequency, and Monetary Value. RFM helps divide customers into various categories or clusters to identify customers who are more likely to respond to promotions, as well as those who are more likely to churn. We will learn more about RFM analysis in the subsequent weeks when we talk about segmentation and targeting.

Question 6
Answer

Compare the average total spending across demographic groups. Which group has the highest average spending?
- marital status
- education

Code

data_full %>%
    group_by(Marital_Status) %>% # group by
    summarise(avg_spending = mean(total_spending)) %>% # compute the mean of spending for each group
    ungroup()

Question 7
Answer

Which education group accounts for the largest percentage of customers?
- tip: in dplyr, there is a function called n(), which counts the number of rows in the group after group_by()

Code

data_full %>% # Think about why data_demo? what if we use data_full?
    group_by(Education) %>% # group by Education
    summarise(n_customers_in_group = n()) %>% # compute the number of customers in each education group
    ungroup() %>% # important to do the ungroup after each group_by
    mutate(percent_customer = n_customers_in_group / sum(n_customers_in_group)) %>% # compute the percentage of customers in each group
    arrange(-percent_customer) # descending order

Question 8
Answer

What is the average Total £ spent on wine and total spending by customers with and without kids?
- tip: first mutate a variable called has_kid, which equals 1 if the number of kids in the household is larger than 0, and otherwise 0; then group by this has_kid variable

Code

data_full %>%
    mutate(has_kid = ifelse(Kidhome > 0, 1, 0)) %>%
    group_by(has_kid) %>%
    summarise(
        avg_spending_wine = mean(MntWines, na.rm = T),
        avg_total_spending = mean(total_spending, na.rm = T)
    ) %>%
    ungroup()

5 After-class Exercise

After the class, you can further explore the dataset and conduct more descriptive analytics.

Question 9
Answer

For complainers and non-complainers who have Master’s degrees, find the total number and also the percent of customers who responded to the offer.

Code

data_full %>%
    filter(Education == "Master") %>%
    group_by(Complain) %>%
    summarise(
        n_responder = sum(Response, na.rm = T),
        n_customer = n()
    ) %>%
    ungroup() %>%
    mutate(percent_of_responder = n_responder / n_customer)

Question 10
Answer

Group customers by Marital Status and compare the average spending in each group. What insights can you draw? - Tips: use group_by() + summarise()

Code

data_full %>%
    mutate(total_spending = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds) %>% # mutate total spending
    group_by(Marital_Status) %>%
    summarise(avg_spending = mean(total_spending, na.rm = T)) %>% # compute the mean of spending for each group
    ungroup() %>%
    arrange(-avg_spending)

References

Berman, Ron, and Ayelet Israeli. 2022. “The Value of Descriptive Analytics: Evidence from Online Retailers.” Marketing Science, March, mksc.2022.1352. https://doi.org/10.1287/mksc.2022.1352.

Footnotes

We will cover the difference-in-differences technique to establish causal inference later in the module.↩︎
Cross-sectional data is data collected at a single point in time. It is a snapshot of customers at a particular moment. In contrast, longitudinal data is collected over time, allowing analysts to track changes in customer behavior over time.↩︎

--- title: "Descriptive Analytics for M&S" author: "Wei Miao" thanks: "This case was prepared by Wei Miao, UCL School of Management, University College London for MSIN0094 Marketing Analytics module. This case was developed to provide material for class discussion rather than to illustrate either effective or ineffective handling of a business situation. Names and data may have been disguised or fabricated. Please do not circulate without permission. All copyrights reserved." date: "`r (lubridate::ymd('20241002')+lubridate::dweeks(2))`" subtitle: "MSIN0094 Case Study" institute: UCL School of Management date-format: long format: html: default pdf: number-sections: true toc: false number-depth: 2 fontsize: 9pt linestretch: 1.25 knitr: opts_chunk: echo: false warning: false message: false include: false bibliography: references.bib execute: freeze: auto cache: true editor_options: chunk_output_type: inline --- # The Power of Descriptive Analytics The amount of data created worldwide has been increasing exponentially over the past decade with some estimates placing the total at over 60 zettabytes as of 2030 (Source: Statista). Data without analytics, however, is of little value to business decision-makers aiming to improve performance and increase growth. It is therefore no surprise that top-tier consulting companies, analytics ﬁrms and business schools have been promoting the positive returns to greater usage of analytics technology. It also explains why an increasing number of data analytics enthusiasts (like Tom and you!) are willing to pay up to £40k tuition fee (that is 10,000 bubble teas!) to join the world's best MSc Business Analytics program at the UCL School of Management (hmm, it's now already week 3, too late to ask David for a refund!). By identifying patterns and trends in massive amounts of data, business analytics enables organizations to make better decisions and improve performance. Descriptive analytics is the simplest and most widely used type of analytics; it is used to generate key performance indicators (KPIs) and metrics for business reports and dashboards. Research shows that, even with the adoption of very simple descriptive analytics, businesses can improve their performance by a large extent --- @bermanValueDescriptiveAnalytics2022 analyze the staggered adoption of a retail analytics dashboard by more than 1,500 e-commerce websites,[^1] and find an increase of 4%--10% in average weekly revenues post-adoption. The increase in revenue is not explained by price changes or advertising optimization. Instead, it is consistent with the improvement on customer relationship management, personalization, and prospecting technologies to retailer websites. The adoption and usage of descriptive analytics also increases the diversity of products sold, the number of transactions, the numbers of website visitors and unique customers, and the revenue from repeat customers. These findings are consistent with a complementary effect of descriptive analytics that serve as a monitoring device that helps business analysts control additional martech tools and amplify their value. Without using the descriptive dashboard, companies are unable to reap the benefits associated with these technologies. [^1]: We will cover the difference-in-differences technique to establish causal inference later in the module. In the remaining of this case, we will explore (1) how to conduct data wrangling and data cleaning using R and (2) how to conduct descriptive analytics for M&S to achieve marketing efficiency. You might ask, where is Tom? Well, M&S rewarded Tom with a few days off for his successful calculation of CLV last week, and he is now considering setting up his bubble tea business in Canary Wharf. You are the only one who can help M&S now until Tom's return! # Database Marketing at M&S We learned in Week 2, how to compute the customer lifetime value for M&S. However, when computing the CLV, we used an "average" approach, which did not consider customer heterogeneity. That is, when considering each component in the CLV formula, such as customer spending in each period, their retention rate, etc., we took the average across all customers, and assumed customers are homogeneous. As a result, every customer would have the same CLV. Nevertheless, this is a strong assumption in practice --- every customer is unique and should be treated differently. The key to successful customer relationship management is to maintain a customer database that tracks detailed customer information, including their demograhpic information and past purchase history. This information would empower marketing analytics team to compute individual CLV for each customer, and conduct individualized targeted marketing. ## Demographic Information Knowing your consumer is a vital concept of running any business. Is the business selling fertilizer to farmers, apparel to teenage girls, or vacations to senior citizens? The distinctions are readily apparent in this comparison. Demographics define the qualities of clients. To be successful, business owners must understand the demographics of their clients and the trends or changes that are occurring within those specific traits. The following demographic information is usually of interest to business managers: *Age*: Consumer behavior is strongly influenced by age. Younger consumers are more affluent and willing to spend more on entertainment, fashion, and movies. Seniors spend less on these items; they are less active, spend more time indoors, and require more medical treatment. Additionally, market segments can be defined by age groups. For instance, digital devices such as iPhones are targeted more towards millennials than at seniors. While older adults are increasingly utilizing technology, they remain less digitally savvy than millennials and purchase fewer digital products. *Gender*: Gender also matters. Males and females have vastly diverse demands and tastes, which influence their purchasing decisions. As a result, some products are created with a specific gender in mind. Macy's, Nordstrom, and The Gap all have departments dedicated to teenage girls' clothes, while Seiko has a specific line of diver watches for men only. *Income*: Income has a substantial influence on consumer behavior and product purchases. Middle-income customers make purchases with due regard for the utility of money. They do not have unlimited money to spend, and hence the money spent on one item may be used on something else. On the other hand, consumers with higher incomes tend to be less price sensitive and have a higher willingness to pay. *Education*: Consumers' level of education has an effect on their impressions of the world around them and on the amount of research they conduct prior to making a purchase. Individuals with a higher level of education will spend more time educating themselves before investing their money. Education has an impact on fashion, film, and television programming. Consumers with a higher level of education can be more distrustful of commercials and the facts offered. M&S has collected rich customer demographic information through its loyalty program, M&S Sparks membership. The data scientist team has the following demographic variables: - ID: Customer's unique identifier - Year_Birth: Customer's birth year - Education: Customer's education level - Marital_Status: Customer's marital status - Income: Customer's yearly household income - Kidhome: Number of children in customer's household - Teenhome: Number of teenagers in customer's household - Dt_Customer: Date of customer's enrollment with the company ## Purchase History "History doesn't repeat itself, but it often rhymes." This popular aphorism, frequently (and perhaps incorrectly) attributed to Mark Twain, is frequently invoked to demonstrate that, while past events do not always provide a clear indication of future events, they do provide valuable context. This sentiment is especially true for marketing managers, where a consumer's purchase history provides invaluable insight into their future purchasing habits. M&S's data engineering team has assembled a **cross-sectional** customer purchase history data today, with variables including^[Cross-sectional data is data collected at a single point in time. It is a snapshot of customers at a particular moment. In contrast, **longitudinal** data is collected over time, allowing analysts to track changes in customer behavior over time.]: - MntWines: Amount spent on wine in last 2 years - MntFruits: Amount spent on fruits in last 2 years - MntMeatProducts: Amount spent on meat in last 2 years - MntFishProducts: Amount spent on fish in last 2 years - MntSweetProducts: Amount spent on sweets in last 2 years - MntGoldProds: Amount spent on gold and jewelry in last 2 years - NumDealsPurchases: Number of purchases made with a discount - NumWebPurchases: Number of purchases made through Ocado's web site - NumCatalogPurchases: Number of purchases made using a catalogue - NumStorePurchases: Number of purchases made directly in stores - NumWebVisitsMonth: Number of visits to company's web site in the last month - Complain: 1 if customer complained in the last 2 years, 0 otherwise - Response: 1 if customer accepted the offer in the last campaign, 0 otherwise - Recency: Number of days since customer's last purchase # Data Wrangling ## Data Loading To work on the datasets, we first need to load the raw data into R. In R, we can use `read.csv(filepath)` to load the data into R environment. For your convenience, I have stored the data file on my Dropbox and the download link will be provided. We can directly feed the url links to `read.csv()` function in R to download and create the dataset in R's environment. The data is also provided to you in .csv format, which you can also load from your local disk. ::: {.content-visible when-format='pdf'} ::: {.callout} **Question `r question_index<-question_index+1; question_index-1`** Load the dataset into your R environment. Name the dataset as `data_full`. ::: ::: ::: {.content-visible when-format='html'} ::: {.panel-tabset} ### **Question `r question_index-1`** Load the dataset into your R environment. Name the dataset as `data_full`. ### Answer ```{r} ## use read.csv() to download and load the data, assign it to an R data object ## header = T argument is to tell read.csv() to keep the dataset header (first row) pacman::p_load(dplyr, modelsummary) data_full <- read.csv("https://www.dropbox.com/scl/fi/2q7ppqtyca0pd3j486osl/data_full.csv?rlkey=gsyk51q27vd1skek4qpn5ikgm&dl=1") ``` ::: ::: After running the above code blocks, you should see the dataset named `data_full` in your RStudio environment. Now, click into the dataset, take a look, and get a sense of how it looks like. ## Data Types ::: {.content-visible when-format='pdf'} ::: {.callout} **Question `r question_index<-question_index+1; question_index-1`** Check all data types in `data_full` are correct and as expected ::: ::: ::: {.content-visible when-format='html'} ::: {.panel-tabset} ### **Question `r question_index-1`** Check all data types in `data_full` are correct and as expected ### Answer ```{r} # can use str() to get the structure of data # Lina covered this in the induction week str(data_full) ``` ::: ::: ## Data Cleaning Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset. We often need to deal with missing values and outliers. In this case, we will focus on missing values. ::: {.content-visible when-format='pdf'} ::: {.callout} **Question `r question_index<-question_index+1; question_index-1`** Check if any variables in `data_full` have missing values. If so, how many missing values are there in each variable? Use the variable average to replace missing values in the dataset. ::: ::: ::: {.content-visible when-format='html'} ::: {.panel-tabset} ### **Question `r question_index-1`** Check if any variables in `data_full` have missing values. If so, how many missing values are there in each variable? Use the variable average to replace missing values in the dataset. ### Answer ```{r} datasummary_skim(data_full, fmt = 3) ``` - We find that income has missing values for a few customers. We can replace the missing values with the average income of all customers. - Tip: use `mutate()` and `ifelse()` functions in `dplyr` to replace missing values ```{r} data_full <- data_full %>% mutate(Income = ifelse(is.na(Income), mean(Income, na.rm = T), Income)) ``` ::: ::: # Preliminary Customer Analysis Next, once the final dataset is ready, we can proceed to use descriptive analytics to conduct preliminary customer analysis using `dplyr`. Descriptive analytics is concerned with summarizing and highlighting patterns in current and historical data in order to assist businesses in comprehending what has occurred thus far. However, it makes no attempt to explain why something occurred or to forecast what may occur in the future. "why something occurred" belongs to the scope of causal inference; "forecast what may occur in the future" falls in the scope of predictive analytics. To answer those questions, analysts must combine descriptive analytics with other types of analytics, which we will learn in later weeks. Your task is to 'get to know' the `data_full` by conducting some descriptive analytics. ::: {.content-visible when-format='pdf'} ::: {.callout} **Question `r question_index<-question_index+1; question_index-1`** - Report the summary statistics of the cleaned data, discuss your business insights (open question; from central tendency and dispersion perspectives) ::: ::: ::: {.content-visible when-format='html'} ::: {.panel-tabset} ### **Question `r question_index-1`** - Report the summary statistics of the cleaned data, discuss your business insights (open question; from central tendency and dispersion perspectives) ### Answer ```{r} # remember to load package modelsummary before calling this function datasummary_skim(data_full) ``` ::: ::: ::: {.content-visible when-format='pdf'} ::: {.callout} **Question `r question_index<-question_index+1; question_index-1`** - Mutate a new variable called `total_spending`, which means the sum of spending on all products by customers. - total_spending = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds - Mute a new variable called `Frequency`, which means the sum of the number of purchases made through different channels. - Frequency = NumStorePurchases + NumWebPurchases + NumCatalogPurchases - Provide a correlation table between `total_spending` and other continuous variables, including Frequency, Complain, Response, Income, Kidhome, Teenhome, Recency. ::: ::: ::: {.content-visible when-format='html'} ::: {.panel-tabset} ### **Question `r question_index-1`** - Mutate a new variable called `total_spending`, which means the sum of spending on all products by customers. Mute a new variable called `Frequency`, which means the sum of the number of purchases made through different channels. Provide a correlation table between `total_spending` and other continuous variables, including Frequency, Complain, Response, Income, Kidhome, Teenhome, Recency. ### Answer ```{r} data_full <- data_full %>% mutate(total_spending = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds) %>% mutate(Frequency = NumStorePurchases + NumWebPurchases + NumCatalogPurchases) ``` ```{r} datasummary_correlation(data_full %>% select(total_spending, Complain, Response, Income, Kidhome, Teenhome, Recency)) ``` ::: ::: ::: {.callout-note} ### RFM Analysis RFM analysis is a customer segmentation technique that uses past purchase behavior to divide customers into groups. RFM stands for Recency, Frequency, and Monetary Value. RFM helps divide customers into various categories or clusters to identify customers who are more likely to respond to promotions, as well as those who are more likely to churn. We will learn more about RFM analysis in the subsequent weeks when we talk about segmentation and targeting. ::: ::: {.content-visible when-format='pdf'} ::: {.callout} **Question `r question_index<-question_index+1; question_index-1`** - Compare the average total spending across demographic groups. Which group has the highest average spending? - marital status - education ::: ::: ::: {.content-visible when-format='html'} ::: {.panel-tabset} ### **Question `r question_index-1`** - Compare the average total spending across demographic groups. Which group has the highest average spending? - marital status - education ### Answer ```{r} data_full %>% group_by(Marital_Status) %>% # group by summarise(avg_spending = mean(total_spending)) %>% # compute the mean of spending for each group ungroup() ``` ::: ::: ::: {.content-visible when-format='pdf'} ::: {.callout} **Question `r question_index<-question_index+1; question_index-1`** - Which education group accounts for the largest percentage of customers? - tip: in dplyr, there is a function called `n()`, which counts the number of rows in the group after `group_by()` ::: ::: ::: {.content-visible when-format='html'} ::: {.panel-tabset} ### **Question `r question_index-1`** - Which education group accounts for the largest percentage of customers? - tip: in dplyr, there is a function called `n()`, which counts the number of rows in the group after `group_by()` ### Answer ```{r} data_full %>% # Think about why data_demo? what if we use data_full? group_by(Education) %>% # group by Education summarise(n_customers_in_group = n()) %>% # compute the number of customers in each education group ungroup() %>% # important to do the ungroup after each group_by mutate(percent_customer = n_customers_in_group / sum(n_customers_in_group)) %>% # compute the percentage of customers in each group arrange(-percent_customer) # descending order ``` ::: ::: ::: {.content-visible when-format='pdf'} ::: {.callout} **Question `r question_index<-question_index+1; question_index-1`** - What is the average Total £ spent on wine and total spending by customers with and without kids? - tip: first mutate a variable called `has_kid`, which equals 1 if the number of kids in the household is larger than 0, and otherwise 0; then group by this `has_kid` variable ::: ::: ::: {.content-visible when-format='html'} ::: {.panel-tabset} ### **Question `r question_index-1`** - What is the average Total £ spent on wine and total spending by customers with and without kids? - tip: first mutate a variable called `has_kid`, which equals 1 if the number of kids in the household is larger than 0, and otherwise 0; then group by this `has_kid` variable ### Answer ```{r} data_full %>% mutate(has_kid = ifelse(Kidhome > 0, 1, 0)) %>% group_by(has_kid) %>% summarise( avg_spending_wine = mean(MntWines, na.rm = T), avg_total_spending = mean(total_spending, na.rm = T) ) %>% ungroup() ``` ::: ::: # After-class Exercise After the class, you can further explore the dataset and conduct more descriptive analytics. ::: {.content-visible when-format='pdf'} ::: {.callout} **Question `r question_index<-question_index+1; question_index-1`** For complainers and non-complainers who have Master's degrees, find the total number and also the percent of customers who responded to the offer. ::: ::: ::: {.content-visible when-format='html'} ::: {.panel-tabset} ### **Question `r question_index-1`** For complainers and non-complainers who have Master's degrees, find the total number and also the percent of customers who responded to the offer. ### Answer ```{r} data_full %>% filter(Education == "Master") %>% group_by(Complain) %>% summarise( n_responder = sum(Response, na.rm = T), n_customer = n() ) %>% ungroup() %>% mutate(percent_of_responder = n_responder / n_customer) ``` ::: ::: ::: {.content-visible when-format='pdf'} ::: {.callout} **Question `r question_index<-question_index+1; question_index-1`** Group customers by Marital Status and compare the average spending in each group. What insights can you draw? - Tips: use `group_by()` + `summarise()` ::: ::: ::: {.content-visible when-format='html'} ::: {.panel-tabset} ### **Question `r question_index-1`** Group customers by Marital Status and compare the average spending in each group. What insights can you draw? - Tips: use `group_by()` + `summarise()` ### Answer ```{r} data_full %>% mutate(total_spending = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds) %>% # mutate total spending group_by(Marital_Status) %>% summarise(avg_spending = mean(total_spending, na.rm = T)) %>% # compute the mean of spending for each group ungroup() %>% arrange(-avg_spending) ``` ::: ::: # References {.unnumbered}