Class 6 Descriptive Analytics for M&S

Author: Dr Wei Miao

Affiliation: UCL School of Management

Published: October 16, 2024

1 Data Cleaning

1.1 Missing Values

In R, missing values are represented by the symbol NA (i.e., not available).
Most statistical models cannot handle missing values, so we need to deal with them before running any analysis.
If there are only a few missing values: remove the affected observations from the analysis.
If there are many missing values: replace them with appropriate values, for example:
- the mean, the median, or a value from another imputation method
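
The two strategies above can be sketched in base R as follows. This is a minimal illustration, not the class code: the data frame `customers` and its values are hypothetical, and median imputation is only one of several possible replacement rules.

```r
# Hypothetical toy data: `income` has two missing values
customers <- data.frame(
  id     = 1:6,
  income = c(52000, NA, 48000, 61000, NA, 39000)
)

# Count the missing values in each column
missing_counts <- colSums(is.na(customers))

# Option 1: drop every row that contains a missing value
customers_complete <- na.omit(customers)

# Option 2: replace missing incomes with the median of the observed incomes
customers_imputed <- customers
income_median <- median(customers$income, na.rm = TRUE)
customers_imputed$income[is.na(customers_imputed$income)] <- income_median
```

Dropping rows keeps only fully observed customers, while imputation keeps all rows at the cost of adding values that were never observed.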

1.2 Outliers

Outliers are data points that are significantly different from other data points in the dataset, such as unusually large and small values.
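Winsorization is a common method to deal with outliers. It replaces extreme values with the nearest non-extreme value: values above an upper threshold, usually the 99th percentile, are set to that percentile, and values below a lower threshold, usually the 1st percentile, are set to that percentile (other thresholds can be used as appropriate).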
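
As a rough sketch of winsorization, the helper below caps a numeric vector at its 1st and 99th percentiles using base R. The function name `winsorize()`, its arguments, and the toy vector `spend` are assumptions made for illustration; they are not part of any package used in this class.

```r
# Hypothetical helper: cap values at the chosen lower and upper percentiles
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  # Percentile thresholds computed from the observed (non-missing) values
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  # Raise values below the lower bound, then lower values above the upper bound
  pmin(pmax(x, bounds[1]), bounds[2])
}

# Hypothetical spending data with one extreme value
spend <- c(1:99, 10000)
spend_winsorized <- winsorize(spend)

# Compare the range before and after winsorization
range(spend)
range(spend_winsorized)
```

Missing values stay missing, because `pmin()` and `pmax()` return NA wherever the input is NA.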

2 Descriptive Analytics

2.1 Two Major Tasks of Descriptive Analytics

You can think of descriptive analytics as creating a dashboard to display the key information you would like to know for your business. For instance:

1. Describe data depending on your business purposes
   - “How much do our customers spend each month on average?”
   - “What percentage of our customers are unprofitable?”
   - “What is the difference between the retention rates across different demographic groups?”
2. Conduct statistical tests (such as t-tests) for hypothesis testing.
   - Is there any significant difference in the average spending between different age/gender groups?
   - Based on our test mailing, can we conclude that ad-copy A works better than ad-copy B?
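
Both tasks can be illustrated with a few lines of base R. The data frame `spending`, its two groups, and all amounts are made up for this sketch; with real data you would substitute your own grouping variable and outcome.

```r
# Hypothetical monthly spending for two customer groups
spending <- data.frame(
  group  = rep(c("A", "B"), each = 8),
  amount = c(52, 48, 61, 55, 47, 58, 50, 53,
             63, 59, 70, 66, 61, 68, 64, 65)
)

# Task 1: describe the data (average spending in each group)
group_means <- tapply(spending$amount, spending$group, mean)

# Task 2: test whether the two group means differ
# (t.test() runs a Welch two-sample t-test by default)
test_result <- t.test(amount ~ group, data = spending)
test_result$p.value
```

A small p-value suggests that the difference in average spending is unlikely to be due to chance alone, under the assumptions of the t-test.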

2.2 Example of Descriptive Analytics Dashboard

[Figure: dashboard screenshot (images/Week 3/YoutubeStudio.png)]

2.3 Summary Statistics

Summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible.
There are two main types of summary statistics used in evaluation:
- measures of central tendency: mean and median.
- measures of dispersion: range (min and max), 25th and 75th percentiles, and standard deviation.
The number of observations is usually reported alongside these measures.
It’s important to include a summary statistics table in your dissertation before any statistical analysis!
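
For a single variable, these statistics can be computed directly in base R. The vector `spend` below is hypothetical and only serves to show which function produces which statistic.

```r
# Hypothetical spending values for seven customers
spend <- c(12, 18, 25, 31, 40, 55, 120)

# Central tendency, dispersion, and sample size in one named vector
summary_stats <- c(
  n      = length(spend),
  mean   = mean(spend),
  median = median(spend),
  sd     = sd(spend),
  min    = min(spend),
  max    = max(spend)
)

# 25th and 75th percentiles
quartiles <- quantile(spend, probs = c(0.25, 0.75), names = FALSE)

summary_stats
quartiles
```

Here the mean lies well above the median because one large value pulls it upward, which is why it helps to report both.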

2.4 Summary Statistics with R

In R, a powerful package for reporting summary statistics is modelsummary.
datasummary_skim() is a shortcut for producing basic summary statistics.
For more features, refer to the package tutorial: https://vincentarelbundock.github.io/modelsummary/articles/datasummary.html

Code

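pacman::p_load(dplyr, modelsummary)  # dplyr supplies the %>% pipe
# Load the class dataset
data_full <- read.csv("https://www.dropbox.com/scl/fi/2q7ppqtyca0pd3j486osl/data_full.csv?rlkey=gsyk51q27vd1skek4qpn5ikgm&dl=1")
# Summary statistics for numeric variables
data_full %>%
  datasummary_skim(type = "numeric")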

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
ID	2000	0	5599.2	3242.0	0.0	5492.0	11191.0
MntWines	738	0	306.1	338.3	0.0	176.5	1493.0
MntFruits	157	0	26.4	39.9	0.0	8.0	199.0
MntMeatProducts	532	0	167.9	225.3	0.0	68.0	1725.0
MntFishProducts	179	0	37.6	54.6	0.0	12.0	259.0
MntSweetProducts	175	0	27.5	41.8	0.0	8.0	263.0
MntGoldProds	207	0	43.8	51.7	0.0	24.0	362.0
NumDealsPurchases	15	0	2.3	2.0	0.0	2.0	15.0
NumWebPurchases	15	0	4.1	2.8	0.0	4.0	27.0
NumCatalogPurchases	14	0	2.7	3.0	0.0	2.0	28.0
NumStorePurchases	14	0	5.8	3.3	0.0	5.0	13.0
NumWebVisitsMonth	15	0	5.3	2.5	0.0	6.0	20.0
Complain	2	0	0.0	0.1	0.0	0.0	1.0
Response	2	0	0.2	0.4	0.0	0.0	1.0
Year_Birth	59	0	1968.8	12.0	1893.0	1970.0	1996.0
Income	1783	1	52139.7	21492.4	1730.0	51518.0	162397.0
Kidhome	3	0	0.4	0.5	0.0	0.0	2.0
Teenhome	3	0	0.5	0.5	0.0	0.0	2.0
Recency	100	0	49.2	29.0	0.0	50.0	99.0

Code

data_full %>%
  datasummary_skim(type = "categorical")

Warning: These variables were omitted because they include more than 50 levels:
Dt_Customer.

		N	%
Education	2n Cycle	185	9.2
	Basic	43	2.1
	Graduation	992	49.6
	Master	327	16.4
	PhD	453	22.6
Marital_Status	Alone	3	0.1
	Divorced	206	10.3
	Married	767	38.4
	Single	436	21.8
	Together	521	26.0
	Widow	67	3.4
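
As a cross-check, the N and % columns of a categorical summary like the one above can be reproduced with base R. The vector `education` below is a hypothetical stand-in, not the class dataset, and `freq_table` is simply a name chosen for this sketch.

```r
# Hypothetical education levels for eight customers
education <- c("PhD", "Master", "Graduation", "Graduation",
               "PhD", "Basic", "Graduation", "Master")

# Frequency of each level, and its share of all observations in percent
counts <- table(education)
pct <- round(100 * prop.table(counts), 1)

# Combine both into one table, one row per level
freq_table <- data.frame(
  level = names(counts),
  N     = as.vector(counts),
  pct   = as.vector(pct)
)
freq_table
```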

3 M&S Descriptive Analytics

Let’s move on to the Quarto document to see how we can apply descriptive analytics to the M&S dataset.
