Class 6 Descriptive Analytics for M&S
1 Data Cleaning
1.1 Missing Values
In R, missing values are represented by the symbol NA (i.e., not available). Most statistical models cannot handle missing values, so we need to deal with them in R.
- If there are only a few missing values, remove the affected observations from the analysis.
- If there are many missing values, replace them with appropriate values, such as the mean or median of the variable, or values from another imputation method.
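As a minimal sketch of the two options above, the following base R code uses a made-up `spending` vector (the variable name and its values are assumptions for illustration, not part of the M&S data). It counts the missing values, drops them, and alternatively replaces them with the median of the observed values.

```r
# Toy spending vector with two missing values (made-up numbers)
spending <- c(120, 85, NA, 240, 60, NA, 150)

# Count how many values are missing
n_missing <- sum(is.na(spending))

# Option 1: drop the missing observations
spending_complete <- spending[!is.na(spending)]

# Option 2: replace missing values with the median of the observed values
spending_imputed <- spending
spending_imputed[is.na(spending_imputed)] <- median(spending, na.rm = TRUE)

n_missing
spending_complete
spending_imputed
```

For a whole data frame, `na.omit()` drops every row that contains at least one NA, which is usually only sensible when few rows are affected.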
1.2 Outliers
Outliers are data points that are significantly different from other data points in the dataset, such as unusually large or small values.
Winsorization is a common method to deal with outliers. It replaces extreme values with the nearest non-extreme value: typically, values above the 99th percentile are set to the 99th percentile and values below the 1st percentile are set to the 1st percentile (other thresholds can be used as appropriate).
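Winsorization can be sketched in a few lines of base R. The helper `winsorize()` below is a hypothetical function written for illustration (it is not from a package), and the `income` values are simulated with one deliberately extreme observation.

```r
# Hypothetical helper: cap values outside the chosen percentiles
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  # Raise values below the lower bound, then lower values above the upper bound
  pmin(pmax(x, bounds[1]), bounds[2])
}

set.seed(1)
# Simulated incomes plus one extreme value
income <- c(rnorm(200, mean = 50000, sd = 10000), 500000)
income_w <- winsorize(income)

max(income)    # the extreme value
max(income_w)  # capped at the 99th percentile of income
```

The number of observations is unchanged; only the values in the tails are pulled in to the chosen thresholds.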
2 Descriptive Analytics
2.1 Two Major Tasks of Descriptive Analytics
You can think of descriptive analytics as creating a dashboard to display the key information you would like to know for your business. The two major tasks are:
- Describe data depending on your business purposes. For instance:
  - “How much do our customers spend each month on average?”
  - “What percentage of our customers are unprofitable?”
  - “What is the difference between the retention rates across different demographic groups?”
- Conduct statistical tests (such as t-tests) for hypothesis testing. For instance:
  - Is there any significant difference in the average spending between different age/gender groups?
  - Based on our test mailing, can we conclude that ad-copy A works better than ad-copy B?
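Both tasks can be illustrated with a short base R sketch. The data below are simulated, and the names `customers`, `age_group`, and `spending` are assumptions chosen for the example. The code first describes average spending per group and then runs a Welch two-sample t-test on the difference in means.

```r
set.seed(42)
# Simulated data: monthly spending for two hypothetical age groups
customers <- data.frame(
  age_group = rep(c("under_40", "40_plus"), each = 100),
  spending  = c(rnorm(100, mean = 300, sd = 80), rnorm(100, mean = 330, sd = 80))
)

# Describe: average spending per group
group_means <- aggregate(spending ~ age_group, data = customers, FUN = mean)

# Test: Welch two-sample t-test of equal mean spending across the two groups
test_result <- t.test(spending ~ age_group, data = customers)

group_means
test_result$p.value
```

A small p-value would suggest that the difference in average spending between the groups is unlikely to be due to chance alone.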
2.2 Example of Descriptive Analytics Dashboard
2.3 Summary Statistics
Summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible.
There are two main types of summary statistics used in evaluation:
- measures of central tendency and location: mean, median, minimum, 25th percentile, 75th percentile, maximum, etc., usually reported alongside the number of observations.
- measures of dispersion: range and standard deviation.
It’s important to include a summary statistics table in your dissertation before any statistical analysis!
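These statistics are all available in base R. The sketch below computes them for a made-up `spending` vector (the values are assumptions for illustration only).

```r
# Made-up monthly spending values for ten customers
spending <- c(12, 45, 30, 8, 150, 60, 22, 75, 40, 18)

stats <- c(
  n      = length(spending),
  mean   = mean(spending),
  sd     = sd(spending),
  min    = min(spending),
  p25    = quantile(spending, 0.25, names = FALSE),
  median = median(spending),
  p75    = quantile(spending, 0.75, names = FALSE),
  max    = max(spending)
)

round(stats, 1)
```

`summary(spending)` gives a quicker overview (minimum, quartiles, mean, maximum), but it does not report the standard deviation or the number of observations.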
2.4 Summary Statistics with R
In R, a powerful package for reporting summary statistics is modelsummary. datasummary_skim() is a shortcut for producing basic summary statistics. For more features, refer to the package tutorial.
Variable | Unique | Missing Pct. | Mean | SD | Min | Median | Max
---|---|---|---|---|---|---|---
ID | 2000 | 0 | 5599.2 | 3242.0 | 0.0 | 5492.0 | 11191.0
MntWines | 738 | 0 | 306.1 | 338.3 | 0.0 | 176.5 | 1493.0
MntFruits | 157 | 0 | 26.4 | 39.9 | 0.0 | 8.0 | 199.0
MntMeatProducts | 532 | 0 | 167.9 | 225.3 | 0.0 | 68.0 | 1725.0
MntFishProducts | 179 | 0 | 37.6 | 54.6 | 0.0 | 12.0 | 259.0
MntSweetProducts | 175 | 0 | 27.5 | 41.8 | 0.0 | 8.0 | 263.0
MntGoldProds | 207 | 0 | 43.8 | 51.7 | 0.0 | 24.0 | 362.0
NumDealsPurchases | 15 | 0 | 2.3 | 2.0 | 0.0 | 2.0 | 15.0
NumWebPurchases | 15 | 0 | 4.1 | 2.8 | 0.0 | 4.0 | 27.0
NumCatalogPurchases | 14 | 0 | 2.7 | 3.0 | 0.0 | 2.0 | 28.0
NumStorePurchases | 14 | 0 | 5.8 | 3.3 | 0.0 | 5.0 | 13.0
NumWebVisitsMonth | 15 | 0 | 5.3 | 2.5 | 0.0 | 6.0 | 20.0
Complain | 2 | 0 | 0.0 | 0.1 | 0.0 | 0.0 | 1.0
Response | 2 | 0 | 0.2 | 0.4 | 0.0 | 0.0 | 1.0
Year_Birth | 59 | 0 | 1968.8 | 12.0 | 1893.0 | 1970.0 | 1996.0
Income | 1783 | 1 | 52139.7 | 21492.4 | 1730.0 | 51518.0 | 162397.0
Kidhome | 3 | 0 | 0.4 | 0.5 | 0.0 | 0.0 | 2.0
Teenhome | 3 | 0 | 0.5 | 0.5 | 0.0 | 0.0 | 2.0
Recency | 100 | 0 | 49.2 | 29.0 | 0.0 | 50.0 | 99.0
Warning: These variables were omitted because they include more than 50 levels: Dt_Customer.
Variable | Level | N | %
---|---|---|---
Education | 2n Cycle | 185 | 9.2
Education | Basic | 43 | 2.1
Education | Graduation | 992 | 49.6
Education | Master | 327 | 16.4
Education | PhD | 453 | 22.6
Marital_Status | Alone | 3 | 0.1
Marital_Status | Divorced | 206 | 10.3
Marital_Status | Married | 767 | 38.4
Marital_Status | Single | 436 | 21.8
Marital_Status | Together | 521 | 26.0
Marital_Status | Widow | 67 | 3.4
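Tables like the two above can be produced with calls along the following lines. This is only a sketch: the `customers` data frame and its values are made up for illustration (a few column names are borrowed from the tables), and the calls are skipped if the modelsummary package is not installed.

```r
# Toy data frame; column names mirror the tables above, values are made up
customers <- data.frame(
  MntWines  = c(100, 250, 0, 75, 310, 40),
  Income    = c(30000, 45000, 52000, 61000, 38000, NA),
  Education = factor(c("Graduation", "Graduation", "PhD", "Master", "PhD", "Basic"))
)

if (requireNamespace("modelsummary", quietly = TRUE)) {
  library(modelsummary)
  # Numeric variables: unique values, missing share, mean, SD, min, median, max
  num_table <- datasummary_skim(customers, type = "numeric")
  # Categorical variables: count and percentage for each level
  cat_table <- datasummary_skim(customers, type = "categorical")
}
```

Printing `num_table` or `cat_table` in a Quarto chunk should render the corresponding table in the output document.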
3 M&S Descriptive Analytics
Let’s move on to the Quarto document to see how we can apply the descriptive analytics to the M&S dataset.