Descriptive Analytics: Preliminary Customer Analysis

MSIN0094 Case Study

Author
Affiliation

Dr Wei Miao

UCL School of Management

Published

October 18, 2023

1 The Power of Descriptive Analytics

The amount of data created worldwide has been increasing exponentially over the past decade with some estimates placing the total at 59 zettabytes as of 2020 (Statista 2020). Data without analytics, however, is of little value to business decision-makers aiming to improve performance and increase growth. It is therefore no surprise that top-tier consulting companies, analytics firms and business schools have been promoting the positive returns to greater usage of analytics technology. It also explains why an increasing number of data analytics enthusiasts are willing to pay up to £40k tuition fee (that is 10,000 bubble teas!) to join the prestigious MSc Business Analytics program at the UCL School of Management (hmm, it’s now already week 3, too late to ask David for a refund!).

By identifying patterns and trends in massive amounts of data, business analytics enables organizations to make better decisions and improve performance. Descriptive analytics is the simplest and most widely used type of analytics; it is used to generate key performance indicators (KPIs) and metrics for business reports and dashboards. The latest research shows that, even with the adoption of very simple descriptive analytics, businesses can improve their performance by a large extent — Berman and Israeli (2022) use the synthetic difference-in-differences method to analyze the staggered adoption of a retail analytics dashboard by more than 1,500 e-commerce websites,1 and find an increase of 4%–10% in average weekly revenues post-adoption. The increase in revenue is not explained by price changes or advertising optimization. Instead, it is consistent with the addition of customer relationship management, personalization, and prospecting technologies to retailer websites. The adoption and usage of descriptive analytics also increases the diversity of products sold, the number of transactions, the numbers of website visitors and unique customers, and the revenue from repeat customers. These findings are consistent with a complementary effect of descriptive analytics that serve as a monitoring device that helps retailers control additional martech tools and amplify their value. Without using the descriptive dashboard, retailers are unable to reap the benefits associated with these technologies.

In practice, businesses use descriptive analytics to assess how well they are performing and if they are on pace to meet business objectives. Business leaders and financial specialists monitor common financial measures generated by descriptive analytics, such as revenue and spending growth on a regular basis. Marketing teams utilize descriptive analytics to analyze the efficacy of marketing campaigns by tracking data such as conversion rates and social media followers, and manage customer relationship by keeping track of customer lifetime values. Manufacturing organizations track indicators such as line throughput and downtime. Descriptive analytics enables everyone in the organization to make more informed decisions that move the business forward. It reveals trends that would otherwise remain buried in raw data, allowing marketing managers to quickly assess how well the firm is operating and identify areas for improvement. Additionally, descriptive analytics enables firms to convey information within departments and to external parties.

In the remaining of this case, we will explore (1) how to consolidate multiple databases from various sources using R and (2) how to conduct preliminary customer analysis using descriptive analytics.

2 Database Marketing at Tesco

We have learned in Week 2, how to compute the customer lifetime value for i-basket, an online grocery store. However, when computing the CLV, we used an “average” approach, which did not consider customer heterogeneity. That is, when considering each component in the CLV formula, such as customer spending in each period, their retention rate, etc., we took the average across all customers, and assumed customers are homogeneous. As a result, every customer would have the same CLV. Nevertheless, this is a strong assumption in practice — every customer is unique and should be treated differently.

The key to successful customer relationship management is to maintain a customer database that tracks detailed customer information, including their demograhpic information and past purchase history. This information would empower marketing analytics team to compute individual CLV for each customer, and conduct individualized targeted marketing.

2.1 Demographic Information

Knowing your consumer is a vital concept of running any business. Is the business selling fertilizer to farmers, apparel to teenage girls, or vacations to senior citizens? The distinctions are readily apparent in this comparison.

Demographics define the qualities of clients. To be successful, business owners must understand the demographics of their clients and the trends or changes that are occurring within those specific traits.

The following demographic information is usually of interest to business managers:

Age: Consumer behavior is strongly influenced by age. Younger consumers are more affluent and willing to spend more on entertainment, fashion, and movies. Seniors spend less on these items; they are less active, spend more time indoors, and require more medical treatment. Additionally, market segments can be defined by age groups. For instance, digital devices such as iPhones are targeted more towards millennials than at seniors. While older adults are increasingly utilizing technology, they remain less digitally savvy than millennials and purchase fewer digital products.

Gender: Gender also matters. Males and females have vastly diverse demands and tastes, which influence their purchasing decisions. As a result, some products are created with a specific gender in mind. Macy’s, Nordstrom, and The Gap all have departments dedicated to teenage girls’ clothes, while Seiko has a specific line of diver watches for men only.

Income: Income has a substantial influence on consumer behavior and product purchases. Middle-income customers make purchases with due regard for the utility of money. They do not have unlimited money to spend, and hence the money spent on one item may be used on something else. On the other hand, consumers with higher incomes tend to be less price sensitive and have a higher willingness to pay.

Education: Consumers’ level of education has an effect on their impressions of the world around them and on the amount of research they conduct prior to making a purchase. Individuals with a higher level of education will spend more time educating themselves before investing their money. Education has an impact on fashion, film, and television programming. Consumers with a higher level of education can be more distrustful of commercials and the facts offered.

Tesco has collected rich customer demographic information through its loyalty program, Tesco Clubcard membership. In the demograhpics.csv dataset, the data scientist team has the following demographic variables:

  • ID: Customer’s unique identifier
  • Year_Birth: Customer’s birth year
  • Education: Customer’s education level
  • Marital_Status: Customer’s marital status
  • Income: Customer’s yearly household income
  • Kidhome: Number of children in customer’s household
  • Teenhome: Number of teenagers in customer’s household
  • Dt_Customer: Date of customer’s enrollment with the company

2.2 Purchase History

“History doesn’t repeat itself, but it often rhymes.”

This popular aphorism, frequently (and perhaps incorrectly) attributed to Mark Twain, is frequently invoked to demonstrate that, while past events do not always provide a clear indication of future events, they do provide valuable context. This sentiment is especially true for marketing managers, where a consumer’s purchase history provides invaluable insight into their future purchasing habits.

Tesco’s data engineering team has assembled a cross-sectional customer purchase history data, with variables including

  • ID: Customer’s unique identifier
  • MntWines: Amount spent on wine in last 2 years
  • MntFruits: Amount spent on fruits in last 2 years
  • MntMeatProducts: Amount spent on meat in last 2 years
  • MntFishProducts: Amount spent on fish in last 2 years
  • MntSweetProducts: Amount spent on sweets in last 2 years
  • NumDealsPurchases: Number of purchases made with a discount
  • NumWebPurchases: Number of purchases made through the company’s web site
  • NumCatalogPurchases: Number of purchases made using a catalogue
  • NumStorePurchases: Number of purchases made directly in stores
  • NumWebVisitsMonth: Number of visits to company’s web site in the last month
  • Complain: 1 if customer complained in the last 2 years, 0 otherwise
  • Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  • Recency: Number of days since customer’s last purchase

3 Data Wrangling

3.1 Data Loading

To work on the datasets, we first need to load the raw data into R. The demographic information data are stored as csv files. In R, we can use read.csv(filepath) to load the data into R environment.

For your convenience, I have stored the demographic.csv and purchase.csv files on my Dropbox. We can directly feed the url links to read.csv() to download and create the dataset.

## use read.csv() to download and load the data, assign it to an R data object
## header = T argument is to tell read.csv() to keep the dataset header (first row)

# Load demograhpic data, and call it data_demo
data_demo <- read.csv(file = "https://www.dropbox.com/s/a0v38lpydls2emy/demographics.csv?dl=1",
                      header = T)

# Load purchase history data, and call it data_purchase
data_purchase <- read.csv(file = "https://www.dropbox.com/s/de435r8zdxydnhg/purchase.csv?dl=1" , header = T)

After running the above code blocks, you should see two datasets in your RStudio environment.

Now, click into each dataset, take a look, and get a sense of how these two datasets look like.

3.2 Data Consolidation

In reality, to accomplish a data analytics task, data scientists often need to collect data from various sources, and assemble them into a larger dataset as needed.

Now we have two Tesco datasets at hand, and we should assemble them into a larger data frame.

  • Join the demographic information data_demo into purchase history data data_purchase. Name the joined data as “data_full”
pacman::p_load(dplyr,modelsummary)
# left join, which must be M:1
data_full <- data_purchase %>%
  left_join(data_demo, by = c("ID" = "ID") )

3.3 Data Types

  • Task: Check all data types in data_full are correct and as expected
# can use str() to get the structure of data
# Lina covered this in the induction week
str(data_full)
'data.frame':   2000 obs. of  22 variables:
 $ ID                 : int  5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
 $ MntWines           : int  635 11 426 11 173 520 235 76 14 28 ...
 $ MntFruits          : int  88 1 49 4 43 42 65 10 0 0 ...
 $ MntMeatProducts    : int  546 6 127 20 118 98 164 56 24 6 ...
 $ MntFishProducts    : int  172 2 111 10 46 0 50 3 3 1 ...
 $ MntSweetProducts   : int  88 1 21 3 27 42 49 1 3 1 ...
 $ MntGoldProds       : int  88 6 42 5 15 14 27 23 2 13 ...
 $ NumDealsPurchases  : int  3 2 1 2 5 2 4 2 1 1 ...
 $ NumWebPurchases    : int  8 1 8 2 5 6 7 4 3 1 ...
 $ NumCatalogPurchases: int  10 1 2 0 3 4 3 0 0 0 ...
 $ NumStorePurchases  : int  4 2 10 4 6 10 7 4 2 0 ...
 $ NumWebVisitsMonth  : int  7 5 4 6 5 6 6 8 9 20 ...
 $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Response           : int  1 0 0 0 0 0 0 0 1 0 ...
 $ Year_Birth         : int  1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
 $ Education          : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
 $ Marital_Status     : chr  "Single" "Single" "Together" "Together" ...
 $ Income             : int  58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
 $ Kidhome            : int  0 1 0 1 1 0 0 1 1 1 ...
 $ Teenhome           : int  0 1 0 0 0 1 1 0 0 1 ...
 $ Dt_Customer        : chr  "04/09/2012" "08/03/2014" "21/08/2013" "10/02/2014" ...
 $ Recency            : int  58 38 26 26 94 16 34 32 19 68 ...

4 Preliminary Customer Analysis

Next, once the final dataset is ready, we can proceed to use descriptive analytics to conduct preliminary customer analysis using dplyr.

Descriptive analytics is concerned with summarizing and highlighting patterns in current and historical data in order to assist businesses in comprehending what has occurred thus far. However, it makes no attempt to explain why something occurred or to forecast what may occur in the future.2 To answer those questions, businesses must combine descriptive analytics with other types of analysis.

Your task is to ‘get to know’ the data by conducting some statistical analysis using Tesco’s customer database.

  1. Provide the summary statistics of the cleaned data. From summary statistics, do you already see any insights? (open question; from central tendency and dispersion perspectives)
# remember to load package modelsummary before calling this function
datasummary_skim(data_full)
Unique (#) Missing (%) Mean SD Min Median Max
ID 2000 0 5599.2 3242.0 0.0 5492.0 11191.0
MntWines 738 0 306.1 338.3 0.0 176.5 1493.0
MntFruits 157 0 26.4 39.9 0.0 8.0 199.0
MntMeatProducts 532 0 167.9 225.3 0.0 68.0 1725.0
MntFishProducts 179 0 37.6 54.6 0.0 12.0 259.0
MntSweetProducts 175 0 27.5 41.8 0.0 8.0 263.0
MntGoldProds 207 0 43.8 51.7 0.0 24.0 362.0
NumDealsPurchases 15 0 2.3 2.0 0.0 2.0 15.0
NumWebPurchases 15 0 4.1 2.8 0.0 4.0 27.0
NumCatalogPurchases 14 0 2.7 3.0 0.0 2.0 28.0
NumStorePurchases 14 0 5.8 3.3 0.0 5.0 13.0
NumWebVisitsMonth 15 0 5.3 2.5 0.0 6.0 20.0
Complain 2 0 0.0 0.1 0.0 0.0 1.0
Response 2 0 0.2 0.4 0.0 0.0 1.0
Year_Birth 59 0 1968.8 12.0 1893.0 1970.0 1996.0
Income 1783 1 52139.7 21492.4 1730.0 51518.0 162397.0
Kidhome 3 0 0.4 0.5 0.0 0.0 2.0
Teenhome 3 0 0.5 0.5 0.0 0.0 2.0
Recency 100 0 49.2 29.0 0.0 50.0 99.0
  1. Group customers by Marital Status and compare the average spending in each group. What insights can you draw?
    • Tips: use group_by() + summarise()
data_full %>%
  mutate(total_spending = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds) %>% # mutate total spending 
  group_by(Marital_Status)%>% # group by 
  summarise(avg_spending = mean(total_spending))%>% # compute the mean of spending for each group
  ungroup() 
Marital_Status avg_spending
Alone 256.6667
Divorced 620.2379
Married 591.8214
Single 606.8578
Together 619.7620
Widow 722.9552
  1. Which education group accounts for the largest percentage of customers? Answering this question using dplyr only.
    • tip: in dplyr, there is a function called n(), which counts the number of rows in the group after group_by()
data_demo%>% # Think about why data_demo? what if we use data_full?
  group_by(Education)%>% # group by Education
  summarise(n_customers_in_group = n()) %>% # compute the number of customers in each education group
  ungroup() %>% # important to do the ungroup after each group_by
  mutate(percent_customer = n_customers_in_group/ sum(n_customers_in_group)) %>% # compute the percentage of customers in each group 
  arrange(-percent_customer) # descending order
Education n_customers_in_group percent_customer
Graduation 992 0.4960
PhD 453 0.2265
Master 327 0.1635
2n Cycle 185 0.0925
Basic 43 0.0215
  1. What is the average Total £ spent on wine and fruit products by customers with and without kids?
    • tip: first mutate a variable called has_kid, which equals 1 if the number of kids in the household is larger than 0, and otherwise 0; then group by this has_kid variable
data_full %>% 
  mutate(has_kid = ifelse(Kidhome>0,1,0)) %>%
  group_by(has_kid) %>%
  summarise(avg_spending_wine = mean(MntWines, na.rm = T),
            avg_spending_fruit = mean(MntFruits, na.rm = T)) %>%
  ungroup()
has_kid avg_spending_wine avg_spending_fruit
0 453.7390 39.61798
1 103.3677 8.15777
  1. For complainers and non-complainers who have Master’s degrees, find the total number and also the percent of customers who responded to the offer.
data_full %>%
  filter(Education == 'Master') %>%
  group_by(Complain) %>%
  summarise(n_responder = sum(Response, na.rm = T),
            n_customer = n()) %>%
  ungroup() %>%
  mutate(percent_of_responder = n_responder / n_customer )
Complain n_responder n_customer percent_of_responder
0 47 325 0.1446154
1 0 2 0.0000000

References

Berman, Ron, and Ayelet Israeli. 2022. “The Value of Descriptive Analytics: Evidence from Online Retailers.” Marketing Science, March, mksc.2022.1352. https://doi.org/10.1287/mksc.2022.1352.

Footnotes

  1. We will cover the difference-in-differences technique to establish causal inference later in the module.↩︎

  2. “why something occurred” belongs to the scope of causal inference; “forecast what may occur in the future” falls in the scope of predictive analytics.↩︎