Descriptive Analytics for M&S

MSIN0094 Case Study

Author
Affiliation

Wei Miao

UCL School of Management

Published

October 16, 2024

1 The Power of Descriptive Analytics

The amount of data created worldwide has been increasing exponentially over the past decade with some estimates placing the total at over 60 zettabytes as of 2030 (Source: Statista). Data without analytics, however, is of little value to business decision-makers aiming to improve performance and increase growth. It is therefore no surprise that top-tier consulting companies, analytics firms and business schools have been promoting the positive returns to greater usage of analytics technology. It also explains why an increasing number of data analytics enthusiasts (like Tom and you!) are willing to pay up to £40k tuition fee (that is 10,000 bubble teas!) to join the world’s best MSc Business Analytics program at the UCL School of Management (hmm, it’s now already week 3, too late to ask David for a refund!).

By identifying patterns and trends in massive amounts of data, business analytics enables organizations to make better decisions and improve performance. Descriptive analytics is the simplest and most widely used type of analytics; it is used to generate key performance indicators (KPIs) and metrics for business reports and dashboards. Research shows that, even with the adoption of very simple descriptive analytics, businesses can improve their performance by a large extent — Berman and Israeli (2022) analyze the staggered adoption of a retail analytics dashboard by more than 1,500 e-commerce websites,1 and find an increase of 4%–10% in average weekly revenues post-adoption. The increase in revenue is not explained by price changes or advertising optimization. Instead, it is consistent with the improvement on customer relationship management, personalization, and prospecting technologies to retailer websites. The adoption and usage of descriptive analytics also increases the diversity of products sold, the number of transactions, the numbers of website visitors and unique customers, and the revenue from repeat customers. These findings are consistent with a complementary effect of descriptive analytics that serve as a monitoring device that helps business analysts control additional martech tools and amplify their value. Without using the descriptive dashboard, companies are unable to reap the benefits associated with these technologies.

In the remaining of this case, we will explore (1) how to conduct data wrangling and data cleaning using R and (2) how to conduct descriptive analytics for M&S to achieve marketing efficiency. You might ask, where is Tom? Well, M&S rewarded Tom with a few days off for his successful calculation of CLV last week, and he is now considering setting up his bubble tea business in Canary Wharf. You are the only one who can help M&S now until Tom’s return!

2 Database Marketing at M&S

We learned in Week 2, how to compute the customer lifetime value for M&S. However, when computing the CLV, we used an “average” approach, which did not consider customer heterogeneity. That is, when considering each component in the CLV formula, such as customer spending in each period, their retention rate, etc., we took the average across all customers, and assumed customers are homogeneous. As a result, every customer would have the same CLV. Nevertheless, this is a strong assumption in practice — every customer is unique and should be treated differently.

The key to successful customer relationship management is to maintain a customer database that tracks detailed customer information, including their demograhpic information and past purchase history. This information would empower marketing analytics team to compute individual CLV for each customer, and conduct individualized targeted marketing.

2.1 Demographic Information

Knowing your consumer is a vital concept of running any business. Is the business selling fertilizer to farmers, apparel to teenage girls, or vacations to senior citizens? The distinctions are readily apparent in this comparison.

Demographics define the qualities of clients. To be successful, business owners must understand the demographics of their clients and the trends or changes that are occurring within those specific traits.

The following demographic information is usually of interest to business managers:

Age: Consumer behavior is strongly influenced by age. Younger consumers are more affluent and willing to spend more on entertainment, fashion, and movies. Seniors spend less on these items; they are less active, spend more time indoors, and require more medical treatment. Additionally, market segments can be defined by age groups. For instance, digital devices such as iPhones are targeted more towards millennials than at seniors. While older adults are increasingly utilizing technology, they remain less digitally savvy than millennials and purchase fewer digital products.

Gender: Gender also matters. Males and females have vastly diverse demands and tastes, which influence their purchasing decisions. As a result, some products are created with a specific gender in mind. Macy’s, Nordstrom, and The Gap all have departments dedicated to teenage girls’ clothes, while Seiko has a specific line of diver watches for men only.

Income: Income has a substantial influence on consumer behavior and product purchases. Middle-income customers make purchases with due regard for the utility of money. They do not have unlimited money to spend, and hence the money spent on one item may be used on something else. On the other hand, consumers with higher incomes tend to be less price sensitive and have a higher willingness to pay.

Education: Consumers’ level of education has an effect on their impressions of the world around them and on the amount of research they conduct prior to making a purchase. Individuals with a higher level of education will spend more time educating themselves before investing their money. Education has an impact on fashion, film, and television programming. Consumers with a higher level of education can be more distrustful of commercials and the facts offered.

M&S has collected rich customer demographic information through its loyalty program, M&S Sparks membership. The data scientist team has the following demographic variables:

  • ID: Customer’s unique identifier
  • Year_Birth: Customer’s birth year
  • Education: Customer’s education level
  • Marital_Status: Customer’s marital status
  • Income: Customer’s yearly household income
  • Kidhome: Number of children in customer’s household
  • Teenhome: Number of teenagers in customer’s household
  • Dt_Customer: Date of customer’s enrollment with the company

2.2 Purchase History

“History doesn’t repeat itself, but it often rhymes.”

This popular aphorism, frequently (and perhaps incorrectly) attributed to Mark Twain, is frequently invoked to demonstrate that, while past events do not always provide a clear indication of future events, they do provide valuable context. This sentiment is especially true for marketing managers, where a consumer’s purchase history provides invaluable insight into their future purchasing habits.

M&S’s data engineering team has assembled a cross-sectional customer purchase history data today, with variables including2:

  • MntWines: Amount spent on wine in last 2 years
  • MntFruits: Amount spent on fruits in last 2 years
  • MntMeatProducts: Amount spent on meat in last 2 years
  • MntFishProducts: Amount spent on fish in last 2 years
  • MntSweetProducts: Amount spent on sweets in last 2 years
  • MntGoldProds: Amount spent on gold and jewelry in last 2 years
  • NumDealsPurchases: Number of purchases made with a discount
  • NumWebPurchases: Number of purchases made through Ocado’s web site
  • NumCatalogPurchases: Number of purchases made using a catalogue
  • NumStorePurchases: Number of purchases made directly in stores
  • NumWebVisitsMonth: Number of visits to company’s web site in the last month
  • Complain: 1 if customer complained in the last 2 years, 0 otherwise
  • Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  • Recency: Number of days since customer’s last purchase

3 Data Wrangling

3.1 Data Loading

To work on the datasets, we first need to load the raw data into R. In R, we can use read.csv(filepath) to load the data into R environment.

For your convenience, I have stored the data file on my Dropbox and the download link will be provided. We can directly feed the url links to read.csv() function in R to download and create the dataset in R’s environment.

The data is also provided to you in .csv format, which you can also load from your local disk.

Load the dataset into your R environment. Name the dataset as data_full.

Code
## use read.csv() to download and load the data, assign it to an R data object
## header = T argument is to tell read.csv() to keep the dataset header (first row)

pacman::p_load(dplyr, modelsummary)

data_full <- read.csv("https://www.dropbox.com/scl/fi/2q7ppqtyca0pd3j486osl/data_full.csv?rlkey=gsyk51q27vd1skek4qpn5ikgm&dl=1")

After running the above code blocks, you should see the dataset named data_full in your RStudio environment.

Now, click into the dataset, take a look, and get a sense of how it looks like.

3.2 Data Types

Check all data types in data_full are correct and as expected

Code
# can use str() to get the structure of data
# Lina covered this in the induction week
str(data_full)
'data.frame':   2000 obs. of  22 variables:
 $ ID                 : int  5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
 $ MntWines           : int  635 11 426 11 173 520 235 76 14 28 ...
 $ MntFruits          : int  88 1 49 4 43 42 65 10 0 0 ...
 $ MntMeatProducts    : int  546 6 127 20 118 98 164 56 24 6 ...
 $ MntFishProducts    : int  172 2 111 10 46 0 50 3 3 1 ...
 $ MntSweetProducts   : int  88 1 21 3 27 42 49 1 3 1 ...
 $ MntGoldProds       : int  88 6 42 5 15 14 27 23 2 13 ...
 $ NumDealsPurchases  : int  3 2 1 2 5 2 4 2 1 1 ...
 $ NumWebPurchases    : int  8 1 8 2 5 6 7 4 3 1 ...
 $ NumCatalogPurchases: int  10 1 2 0 3 4 3 0 0 0 ...
 $ NumStorePurchases  : int  4 2 10 4 6 10 7 4 2 0 ...
 $ NumWebVisitsMonth  : int  7 5 4 6 5 6 6 8 9 20 ...
 $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Response           : int  1 0 0 0 0 0 0 0 1 0 ...
 $ Year_Birth         : int  1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
 $ Education          : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
 $ Marital_Status     : chr  "Single" "Single" "Together" "Together" ...
 $ Income             : int  58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
 $ Kidhome            : int  0 1 0 1 1 0 0 1 1 1 ...
 $ Teenhome           : int  0 1 0 0 0 1 1 0 0 1 ...
 $ Dt_Customer        : chr  "04/09/2012" "08/03/2014" "21/08/2013" "10/02/2014" ...
 $ Recency            : int  58 38 26 26 94 16 34 32 19 68 ...

3.3 Data Cleaning

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset. We often need to deal with missing values and outliers. In this case, we will focus on missing values.

Check if any variables in data_full have missing values. If so, how many missing values are there in each variable? Use the variable average to replace missing values in the dataset.

Code
datasummary_skim(data_full, fmt = 3)
Warning: These variables were omitted because they include more than 50 levels:
Dt_Customer.
Unique Missing Pct. Mean SD Min Median Max Histogram
ID 2000 0 5599.172 3242.019 0.000 5492.000 11191.000
MntWines 738 0 306.058 338.276 0.000 176.500 1493.000
MntFruits 157 0 26.358 39.885 0.000 8.000 199.000
MntMeatProducts 532 0 167.885 225.292 0.000 68.000 1725.000
MntFishProducts 179 0 37.614 54.556 0.000 12.000 259.000
MntSweetProducts 175 0 27.495 41.756 0.000 8.000 263.000
MntGoldProds 207 0 43.786 51.736 0.000 24.000 362.000
NumDealsPurchases 15 0 2.321 1.958 0.000 2.000 15.000
NumWebPurchases 15 0 4.082 2.773 0.000 4.000 27.000
NumCatalogPurchases 14 0 2.691 2.956 0.000 2.000 28.000
NumStorePurchases 14 0 5.819 3.256 0.000 5.000 13.000
NumWebVisitsMonth 15 0 5.308 2.452 0.000 6.000 20.000
Complain 2 0 0.010 0.100 0.000 0.000 1.000
Response 2 0 0.150 0.358 0.000 0.000 1.000
Year_Birth 59 0 1968.843 12.020 1893.000 1970.000 1996.000
Income 1783 1 52139.697 21492.372 1730.000 51518.000 162397.000
Kidhome 3 0 0.444 0.540 0.000 0.000 2.000
Teenhome 3 0 0.501 0.546 0.000 0.000 2.000
Recency 100 0 49.165 28.951 0.000 50.000 99.000
N %
Education 2n Cycle 185 9.250
Basic 43 2.150
Graduation 992 49.600
Master 327 16.350
PhD 453 22.650
Marital_Status Alone 3 0.150
Divorced 206 10.300
Married 767 38.350
Single 436 21.800
Together 521 26.050
Widow 67 3.350
  • We find that income has missing values for a few customers. We can replace the missing values with the average income of all customers.
    • Tip: use mutate() and ifelse() functions in dplyr to replace missing values
Code
data_full <- data_full %>%
    mutate(Income = ifelse(is.na(Income), mean(Income, na.rm = T), Income))

4 Preliminary Customer Analysis

Next, once the final dataset is ready, we can proceed to use descriptive analytics to conduct preliminary customer analysis using dplyr.

Descriptive analytics is concerned with summarizing and highlighting patterns in current and historical data in order to assist businesses in comprehending what has occurred thus far. However, it makes no attempt to explain why something occurred or to forecast what may occur in the future. “why something occurred” belongs to the scope of causal inference; “forecast what may occur in the future” falls in the scope of predictive analytics. To answer those questions, analysts must combine descriptive analytics with other types of analytics, which we will learn in later weeks.

Your task is to ‘get to know’ the data_full by conducting some descriptive analytics.

  • Report the summary statistics of the cleaned data, discuss your business insights (open question; from central tendency and dispersion perspectives)
Code
# remember to load package modelsummary before calling this function
datasummary_skim(data_full)
Warning: These variables were omitted because they include more than 50 levels:
Dt_Customer.
Unique Missing Pct. Mean SD Min Median Max Histogram
ID 2000 0 5599.2 3242.0 0.0 5492.0 11191.0
MntWines 738 0 306.1 338.3 0.0 176.5 1493.0
MntFruits 157 0 26.4 39.9 0.0 8.0 199.0
MntMeatProducts 532 0 167.9 225.3 0.0 68.0 1725.0
MntFishProducts 179 0 37.6 54.6 0.0 12.0 259.0
MntSweetProducts 175 0 27.5 41.8 0.0 8.0 263.0
MntGoldProds 207 0 43.8 51.7 0.0 24.0 362.0
NumDealsPurchases 15 0 2.3 2.0 0.0 2.0 15.0
NumWebPurchases 15 0 4.1 2.8 0.0 4.0 27.0
NumCatalogPurchases 14 0 2.7 3.0 0.0 2.0 28.0
NumStorePurchases 14 0 5.8 3.3 0.0 5.0 13.0
NumWebVisitsMonth 15 0 5.3 2.5 0.0 6.0 20.0
Complain 2 0 0.0 0.1 0.0 0.0 1.0
Response 2 0 0.2 0.4 0.0 0.0 1.0
Year_Birth 59 0 1968.8 12.0 1893.0 1970.0 1996.0
Income 1783 0 52139.7 21400.8 1730.0 51844.5 162397.0
Kidhome 3 0 0.4 0.5 0.0 0.0 2.0
Teenhome 3 0 0.5 0.5 0.0 0.0 2.0
Recency 100 0 49.2 29.0 0.0 50.0 99.0
N %
Education 2n Cycle 185 9.2
Basic 43 2.1
Graduation 992 49.6
Master 327 16.4
PhD 453 22.6
Marital_Status Alone 3 0.1
Divorced 206 10.3
Married 767 38.4
Single 436 21.8
Together 521 26.0
Widow 67 3.4
  • Mutate a new variable called total_spending, which means the sum of spending on all products by customers. Mute a new variable called Frequency, which means the sum of the number of purchases made through different channels. Provide a correlation table between total_spending and other continuous variables, including Frequency, Complain, Response, Income, Kidhome, Teenhome, Recency.
Code
data_full <- data_full %>%
    mutate(total_spending = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds) %>%
    mutate(Frequency = NumStorePurchases + NumWebPurchases + NumCatalogPurchases)
Code
datasummary_correlation(data_full %>%
    select(total_spending, Complain, Response, Income, Kidhome, Teenhome, Recency))
total_spending Complain Response Income Kidhome Teenhome Recency
total_spending 1 . . . . . .
Complain -.04 1 . . . . .
Response .25 .00 1 . . . .
Income .79 -.03 .15 1 . . .
Kidhome -.56 .04 -.07 -.51 1 . .
Teenhome -.14 .00 -.15 .03 -.03 1 .
Recency .01 .02 -.20 .00 .00 .02 1
RFM Analysis

RFM analysis is a customer segmentation technique that uses past purchase behavior to divide customers into groups. RFM stands for Recency, Frequency, and Monetary Value. RFM helps divide customers into various categories or clusters to identify customers who are more likely to respond to promotions, as well as those who are more likely to churn. We will learn more about RFM analysis in the subsequent weeks when we talk about segmentation and targeting.

  • Compare the average total spending across demographic groups. Which group has the highest average spending?
    • marital status
    • education
Code
data_full %>%
    group_by(Marital_Status) %>% # group by
    summarise(avg_spending = mean(total_spending)) %>% # compute the mean of spending for each group
    ungroup()
  • Which education group accounts for the largest percentage of customers?
    • tip: in dplyr, there is a function called n(), which counts the number of rows in the group after group_by()
Code
data_full %>% # Think about why data_demo? what if we use data_full?
    group_by(Education) %>% # group by Education
    summarise(n_customers_in_group = n()) %>% # compute the number of customers in each education group
    ungroup() %>% # important to do the ungroup after each group_by
    mutate(percent_customer = n_customers_in_group / sum(n_customers_in_group)) %>% # compute the percentage of customers in each group
    arrange(-percent_customer) # descending order
  • What is the average Total £ spent on wine and total spending by customers with and without kids?
    • tip: first mutate a variable called has_kid, which equals 1 if the number of kids in the household is larger than 0, and otherwise 0; then group by this has_kid variable
Code
data_full %>%
    mutate(has_kid = ifelse(Kidhome > 0, 1, 0)) %>%
    group_by(has_kid) %>%
    summarise(
        avg_spending_wine = mean(MntWines, na.rm = T),
        avg_total_spending = mean(total_spending, na.rm = T)
    ) %>%
    ungroup()

5 After-class Exercise

After the class, you can further explore the dataset and conduct more descriptive analytics.

For complainers and non-complainers who have Master’s degrees, find the total number and also the percent of customers who responded to the offer.

Code
data_full %>%
    filter(Education == "Master") %>%
    group_by(Complain) %>%
    summarise(
        n_responder = sum(Response, na.rm = T),
        n_customer = n()
    ) %>%
    ungroup() %>%
    mutate(percent_of_responder = n_responder / n_customer)

Group customers by Marital Status and compare the average spending in each group. What insights can you draw? - Tips: use group_by() + summarise()

Code
data_full %>%
    mutate(total_spending = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds) %>% # mutate total spending
    group_by(Marital_Status) %>%
    summarise(avg_spending = mean(total_spending, na.rm = T)) %>% # compute the mean of spending for each group
    ungroup() %>%
    arrange(-avg_spending)

References

Berman, Ron, and Ayelet Israeli. 2022. “The Value of Descriptive Analytics: Evidence from Online Retailers.” Marketing Science, March, mksc.2022.1352. https://doi.org/10.1287/mksc.2022.1352.

Footnotes

  1. We will cover the difference-in-differences technique to establish causal inference later in the module.↩︎

  2. Cross-sectional data is data collected at a single point in time. It is a snapshot of customers at a particular moment. In contrast, longitudinal data is collected over time, allowing analysts to track changes in customer behavior over time.↩︎