Class 6 Descriptive Analytics for M&S

Author: Dr Wei Miao
Affiliation: UCL School of Management
Published: October 16, 2024

1 Data Cleaning

1.1 Missing Values

  • In R, missing values are represented by the symbol NA (i.e., not available).

  • Most statistical models cannot handle missing values, so we need to deal with them in R.

  • If there are just a few missing values: remove them from analysis.

  • If there are many missing values: need to replace them with appropriate values:

    • mean/median/imputation
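
As a minimal sketch of the two options above, the base R snippet below counts missing values, drops incomplete rows, and fills gaps with the mean. The `customers` data frame and its `income` column are hypothetical toy data invented for illustration, not part of the class dataset.

```r
# Hypothetical toy data; `income` is an illustrative column name
customers <- data.frame(
  id     = 1:6,
  income = c(52000, NA, 61000, 48000, NA, 75000)
)

# Count the missing values in each column
n_missing <- colSums(is.na(customers))

# Option 1: remove rows where income is missing
customers_complete <- customers[!is.na(customers$income), ]

# Option 2: replace missing incomes with the mean of the observed values
# (median() with na.rm = TRUE works the same way for median imputation)
income_mean   <- mean(customers$income, na.rm = TRUE)
income_median <- median(customers$income, na.rm = TRUE)

customers_imputed <- customers
customers_imputed$income[is.na(customers_imputed$income)] <- income_mean
```

Which option is appropriate depends on how many values are missing and why; mean imputation keeps every row but shrinks the variance of the imputed variable.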

1.2 Outliers

  • Outliers are data points that are significantly different from other data points in the dataset, such as unusually large and small values.

  • Winsorization is a common method to deal with outliers. It replaces each extreme value with the nearest non-extreme value, usually the 1st or 99th percentile (or other thresholds as appropriate).
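
One possible way to winsorize in base R is sketched below. The helper `winsorize()` and the simulated `spending` vector are assumptions made for illustration; the thresholds default to the 1st and 99th percentiles mentioned above.

```r
# Hypothetical helper: cap values below/above the chosen percentiles
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  # Raise values below the lower bound, then lower values above the upper bound
  pmin(pmax(x, bounds[1]), bounds[2])
}

set.seed(1)
# Simulated spending with one unusually large value appended
spending   <- c(rnorm(200, mean = 300, sd = 50), 5000)
spending_w <- winsorize(spending)

range(spending)    # the maximum is the extreme value 5000
range(spending_w)  # the range is now bounded by the 1st and 99th percentiles
```

Note that winsorization keeps every observation and only changes the extreme values, unlike trimming, which would drop them.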

2 Descriptive Analytics

2.1 Two Major Tasks of Descriptive Analytics

  • You can think of descriptive analytics as creating a dashboard to display the key information you would like to know for your business. For instance:
  1. Describe data depending on your business purposes
    • “How much do our customers spend each month on average?”
    • “What percentage of our customers are unprofitable?”
    • “What is the difference between the retention rates across different demographic groups?”
  2. Conduct statistical tests (such as t-tests) for hypothesis testing.
    • Is there any significant difference in the average spending between different age/gender groups?
    • Based on our test mailing, can we conclude that ad-copy A works better than ad-copy B?
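
To make the second task concrete, here is a small sketch of a two-sample t-test in base R. The two spending vectors are simulated and the group labels are hypothetical; with real data you would pass the observed values for each group instead.

```r
set.seed(42)
# Simulated spending for customers who received two hypothetical ad copies
spend_a <- rnorm(100, mean = 55, sd = 10)
spend_b <- rnorm(100, mean = 50, sd = 10)

# Task 1: describe the data
mean_a <- mean(spend_a)
mean_b <- mean(spend_b)

# Task 2: test whether the difference in means is statistically significant
# (t.test() runs Welch's unequal-variance t-test by default)
test_result <- t.test(spend_a, spend_b)
test_result$p.value
```

A small p-value (conventionally below 0.05) would suggest that the difference in average spending is unlikely to be due to chance alone.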

2.2 Example of Descriptive Analytics Dashboard

2.3 Summary Statistics

  • Summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible.

  • There are two main types of summary statistics used in evaluation:

    • measures of central tendency and position: number of observations, mean, min, 25th percentile, median, 75th percentile, max, etc.

    • measures of dispersion: range and standard deviation.

  • It’s important to include a summary statistics table in your dissertation before any statistical analysis!
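
As a quick illustration, the statistics listed above can be computed with base R functions. The `spend` vector is hypothetical data made up for this sketch.

```r
# Hypothetical monthly spending values for seven customers
spend <- c(12, 18, 25, 31, 40, 55, 120)

summary_stats <- c(
  n      = length(spend),                          # number of observations
  mean   = mean(spend),
  min    = min(spend),
  p25    = quantile(spend, 0.25, names = FALSE),   # 25th percentile
  median = median(spend),
  p75    = quantile(spend, 0.75, names = FALSE),   # 75th percentile
  max    = max(spend),
  range  = max(spend) - min(spend),                # dispersion: range
  sd     = sd(spend)                               # dispersion: standard deviation
)

round(summary_stats, 1)
```

Here the mean (43) is well above the median (31) because the single large value of 120 pulls the mean upward, which is one reason to report both.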

2.4 Summary Statistics with R

  • In R, a powerful package for reporting summary statistics is modelsummary.

  • datasummary_skim() is a shortcut for producing basic summary statistics.

  • For more features, refer to the modelsummary package tutorial.

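# Load dplyr (for the %>% pipe) and modelsummary
pacman::p_load(dplyr, modelsummary)
# Summarise the numeric variables in data_full
data_full %>%
  datasummary_skim(type = "numeric")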

Variable             Unique  Missing Pct.     Mean       SD     Min   Median       Max
ID                     2000             0   5599.2   3242.0     0.0   5492.0   11191.0
MntWines                738             0    306.1    338.3     0.0    176.5    1493.0
MntFruits               157             0     26.4     39.9     0.0      8.0     199.0
MntMeatProducts         532             0    167.9    225.3     0.0     68.0    1725.0
MntFishProducts         179             0     37.6     54.6     0.0     12.0     259.0
MntSweetProducts        175             0     27.5     41.8     0.0      8.0     263.0
MntGoldProds            207             0     43.8     51.7     0.0     24.0     362.0
NumDealsPurchases        15             0      2.3      2.0     0.0      2.0      15.0
NumWebPurchases          15             0      4.1      2.8     0.0      4.0      27.0
NumCatalogPurchases      14             0      2.7      3.0     0.0      2.0      28.0
NumStorePurchases        14             0      5.8      3.3     0.0      5.0      13.0
NumWebVisitsMonth        15             0      5.3      2.5     0.0      6.0      20.0
Complain                  2             0      0.0      0.1     0.0      0.0       1.0
Response                  2             0      0.2      0.4     0.0      0.0       1.0
Year_Birth               59             0   1968.8     12.0  1893.0   1970.0    1996.0
Income                 1783             1  52139.7  21492.4  1730.0  51518.0  162397.0
Kidhome                   3             0      0.4      0.5     0.0      0.0       2.0
Teenhome                  3             0      0.5      0.5     0.0      0.0       2.0
Recency                 100             0     49.2     29.0     0.0     50.0      99.0

data_full %>%
  datasummary_skim(type = "categorical")
Warning: These variables were omitted because they include more than 50 levels: Dt_Customer.

Variable        Level           N      %
Education       2n Cycle      185    9.2
                Basic          43    2.1
                Graduation    992   49.6
                Master        327   16.4
                PhD           453   22.6
Marital_Status  Alone           3    0.1
                Divorced      206   10.3
                Married       767   38.4
                Single        436   21.8
                Together      521   26.0
                Widow          67    3.4

3 M&S Descriptive Analytics

Let’s move on to the Quarto document to see how we can apply descriptive analytics to the M&S dataset.