Class 6 Data Wrangling with R & Descriptive Analytics
1 Data Wrangling with R
1.1 Data Frame Basics
- A data frame is the R object that we will deal with most of the time in the MSc programme. You can think of a
data.frameas a spreadsheet.- Each row stands for an observation; usually a record for a customer.
- Each column stands for a variable; each column should have a unique name.
- Each column must contain the same data type, but different columns can store different data types.
1.2 Tidyverse in R
- The
tidyverseis a collection of R packages designed for data science. It includes packages for data manipulation, visualisation, and more.
1.3 Install and Load the dplyr package
- In R, we use the
dplyrpackage for data cleaning and manipulation.1
- Load a CSV-format dataset called
data_fullusingread.csv()
- To browse the whole dataset, simply click the object in the Environment pane.
1.4 First Look at the Dataset
- Which variables does the dataset have? What are the data types of each variable?
Rows: 2,000
Columns: 22
$ ID <int> 5524, 2174, 4141, 6182, 5324, 7446, 965, 6177, 485…
$ MntWines <int> 635, 11, 426, 11, 173, 520, 235, 76, 14, 28, 5, 6,…
$ MntFruits <int> 88, 1, 49, 4, 43, 42, 65, 10, 0, 0, 5, 16, 61, 2, …
$ MntMeatProducts <int> 546, 6, 127, 20, 118, 98, 164, 56, 24, 6, 6, 11, 4…
$ MntFishProducts <int> 172, 2, 111, 10, 46, 0, 50, 3, 3, 1, 0, 11, 225, 3…
$ MntSweetProducts <int> 88, 1, 21, 3, 27, 42, 49, 1, 3, 1, 2, 1, 112, 5, 1…
$ MntGoldProds <int> 88, 6, 42, 5, 15, 14, 27, 23, 2, 13, 1, 16, 30, 14…
$ NumDealsPurchases <int> 3, 2, 1, 2, 5, 2, 4, 2, 1, 1, 1, 1, 1, 3, 1, 1, 3,…
$ NumWebPurchases <int> 8, 1, 8, 2, 5, 6, 7, 4, 3, 1, 1, 2, 3, 6, 1, 7, 3,…
$ NumCatalogPurchases <int> 10, 1, 2, 0, 3, 4, 3, 0, 0, 0, 0, 0, 4, 1, 0, 6, 0…
$ NumStorePurchases <int> 4, 2, 10, 4, 6, 10, 7, 4, 2, 0, 2, 3, 8, 5, 3, 12,…
$ NumWebVisitsMonth <int> 7, 5, 4, 6, 5, 6, 6, 8, 9, 20, 7, 8, 2, 6, 8, 3, 8…
$ Complain <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Response <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
$ Year_Birth <int> 1957, 1954, 1965, 1984, 1981, 1967, 1971, 1985, 19…
$ Education <chr> "Graduation", "Graduation", "Graduation", "Graduat…
$ Marital_Status <chr> "Single", "Single", "Together", "Together", "Marri…
$ Income <int> 58138, 46344, 71613, 26646, 58293, 62513, 55635, 3…
$ Kidhome <int> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,…
$ Teenhome <int> 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
$ Dt_Customer <chr> "04/09/2012", "08/03/2014", "21/08/2013", "10/02/2…
$ Recency <int> 58, 38, 26, 26, 94, 16, 34, 32, 19, 68, 11, 59, 82…
1.5 First Few Rows of the Dataset
- We can use
slice_head()to view the first few rows of the dataset. Similarly,slice_tail()can be used to view the last few rows of the dataset.
1.6 Extract a Single Column
We can use the
$operator to extract a single column from a data frame.For example,
data_full$Incomeextracts theIncomecolumn from thedata_fulldata frame, returning a vector of income values.
1.7 Common Data Wrangling Operations
Filter rows (
filter)Sort rows (
arrange)Select columns (
select)Generate new columns (
mutate)Group-wise aggregation (
group_by)Merge datasets (
join)
1.8 Subset Rows Based on Conditions: filter
- We can use
filter()to select rows that meet certain logical criteria.
- Important: To keep the new subset of data in RStudio, assign it to a new object.
Example: From data_full, find customers who are single
1.9 The Pipe Operator
1.9.1 Pipe Operator
%>% passes the object in front as the first argument of the subsequent function.2
- The tidyverse is designed to work with the pipe operator, using a consistent syntax.
1.10 Example of the Pipe Operator
1.11 Why Do We Need Pipe Operator for Data Wrangling?
- Exercise: find single customers who have a PhD without using the pipe.
- Exercise: find single customers who have a PhD using the pipe.
- The pipe works like a conveyor belt in a factory, passing the intermediate outputs from the previous data wrangling step to the next step for further processing until you finish your data wrangling task.
- This makes your code more readable and easier to debug. You can chain multiple (as many as) data wrangling steps together in a single pipeline.
1.12 Sort Rows: arrange
arrange()orders the rows by the values of selected columns.Ascending order by default; for descending order, put a minus sign in front of the variable.
Allows multiple sorting variables, separated by a comma.
Example: sort customers based on income in descending order.
- Exercise: sort customers based on income in descending order and age in ascending order.
1.13 Generate New Variables: mutate
mutate()generates new variables in the dataset while preserving existing variables.Example: create a new variable named
AgefromYear_Birth.
- Exercise: create a new variable named
totalkids, which is the sum ofKidhomeandTeenhome.
1.14 Aggregation by Groups: group_by
group_by()allows us to aggregate data by group and compute statistics for each group.
- Internally, the dataset is already grouped based on the specified variable(s).
1.15 Aggregation by Groups: group_by() %>% summarise()
- The
summarise()function is typically used aftergroup_by()to create a new data frame of summary statistics for each group. - It collapses each group into a single row.
- Inside
summarise(), you can compute various summary statistics (e.g.,mean()for mean,sum()for sum,n()for count,sd()for standard deviation) for each group. The result is a new data frame with one row per group, containing the grouping variables and the calculated summary statistics.
- What if you replace
summarise()withmutate()?
1.16 Comparison of summarise() and mutate()
mutate()will add a new column to the original data frame. The new column will contain the group-wise calculation, repeated for each row within the group. The number of rows in the data frame remains unchanged.In contrast,
summarise()collapses the data frame to one row per group, containing only the grouping variables and the newly created summary statistics.
1.17 Aggregation by Groups: group_by() Multiple Groups
- You can also group by multiple variables. For example, we can compute the average income for each combination of
Marital_StatusandEducation.
2 Data Cleaning
2.1 Missing Values
In R, missing values are represented by the symbol
NA(i.e., not available).Most statistical models cannot handle missing values, so we need to deal with them in R.
If there are just a few missing values: remove them from analysis.
If there are many missing values: need to collect better data or replace them with appropriate values:
- mean/median/statistical imputation
2.2 Outliers
Outliers are data points that are significantly different from other data points in the dataset, such as unusually large and small values.
Winsorisation is a common method to deal with outliers. It replaces the extreme values with the nearest non-extreme value, usually the 99th or 1st percentile (or other thresholds as appropriate).
3 Descriptive Analytics
3.1 Two Major Tasks of Descriptive Analytics
- You can think of descriptive analytics as creating a dashboard to display the key information you would like to know for your business. For instance:
- Describe the data for your business purposes
- “How much do our customers spend each month on average?”
- “What percentage of our customers are unprofitable?”
- “What is the difference between the retention rates across different demographic groups?”
- Conduct statistical tests (such as t-tests) for hypothesis testing.
- Is there a significant difference in the average spending between different age/gender groups?
- Based on our test mailing, can we conclude that ad-copy A works better than ad-copy B?
3.2 Example of Descriptive Analytics Dashboard
3.3 Summary Statistics
Summary statistics are used to summarise a set of observations, in order to communicate the largest amount of information as simply as possible.
There are two main types of summary statistics used in evaluation:
- measures of central tendency: number of observations, mean, min, 25th percentile, median, 75th percentile, max, etc.
- measures of dispersion: range and standard deviation.
3.4 Summary Statistics with R
In R, a powerful package to report summary statistics is called
modelsummary.datasummary_skim()is a shortcut to compute basic summary statistics.For more features, refer to the package tutorial here
| Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
|---|---|---|---|---|---|---|---|---|
| ID | 2000 | 0 | 5599.2 | 3242.0 | 0.0 | 5492.0 | 11191.0 | |
| MntWines | 738 | 0 | 306.1 | 338.3 | 0.0 | 176.5 | 1493.0 | |
| MntFruits | 157 | 0 | 26.4 | 39.9 | 0.0 | 8.0 | 199.0 | |
| MntMeatProducts | 532 | 0 | 167.9 | 225.3 | 0.0 | 68.0 | 1725.0 | |
| MntFishProducts | 179 | 0 | 37.6 | 54.6 | 0.0 | 12.0 | 259.0 | |
| MntSweetProducts | 175 | 0 | 27.5 | 41.8 | 0.0 | 8.0 | 263.0 | |
| MntGoldProds | 207 | 0 | 43.8 | 51.7 | 0.0 | 24.0 | 362.0 | |
| NumDealsPurchases | 15 | 0 | 2.3 | 2.0 | 0.0 | 2.0 | 15.0 | |
| NumWebPurchases | 15 | 0 | 4.1 | 2.8 | 0.0 | 4.0 | 27.0 | |
| NumCatalogPurchases | 14 | 0 | 2.7 | 3.0 | 0.0 | 2.0 | 28.0 | |
| NumStorePurchases | 14 | 0 | 5.8 | 3.3 | 0.0 | 5.0 | 13.0 | |
| NumWebVisitsMonth | 15 | 0 | 5.3 | 2.5 | 0.0 | 6.0 | 20.0 | |
| Complain | 2 | 0 | 0.0 | 0.1 | 0.0 | 0.0 | 1.0 | |
| Response | 2 | 0 | 0.2 | 0.4 | 0.0 | 0.0 | 1.0 | |
| Year_Birth | 59 | 0 | 1968.8 | 12.0 | 1893.0 | 1970.0 | 1996.0 | |
| Income | 1783 | 1 | 52139.7 | 21492.4 | 1730.0 | 51518.0 | 162397.0 | |
| Kidhome | 3 | 0 | 0.4 | 0.5 | 0.0 | 0.0 | 2.0 | |
| Teenhome | 3 | 0 | 0.5 | 0.5 | 0.0 | 0.0 | 2.0 | |
| Recency | 100 | 0 | 49.2 | 29.0 | 0.0 | 50.0 | 99.0 |
Warning: These variables were omitted because they include more than 50 levels:
Dt_Customer.
| N | % | ||
|---|---|---|---|
| Education | 2n Cycle | 185 | 9.2 |
| Basic | 43 | 2.1 | |
| Graduation | 992 | 49.6 | |
| Master | 327 | 16.4 | |
| PhD | 453 | 22.6 | |
| Marital_Status | Alone | 3 | 0.1 |
| Divorced | 206 | 10.3 | |
| Married | 767 | 38.4 | |
| Single | 436 | 21.8 | |
| Together | 521 | 26.0 | |
| Widow | 67 | 3.4 |
4 M&S Descriptive Analytics
Let’s move on to the Quarto document to see how to apply descriptive analytics to the M&S dataset.
4.1 After-Class
(essential) Cheatsheet for
dplyr. This cheatsheet provides a quick reference for the most commonly used functions in thedplyrpackage. It’s very important to familiarise yourself with these functions as you will use them a lot in your future projects.(optional) Complete the after-class exercise for Week 3. If you still have time, you can also complete the DataCamp exercise on the dplyr package. The link is here.