Class 5 Data Wrangling with R

Author
Affiliation

Dr Wei Miao

UCL School of Management

Published

October 16, 2024

1 Data Analytics Workflow

1.1 Class Objectives

  • Understand the major steps to conduct data analytics. We will use improve the marketing efficiency for the M&S case study as an example.

  • Data collection: Learn how to collect first-hand survey data and how to load second-hand data into R

  • Data cleaning: Learn how to use the dplyr package to clean data

  • Data analysis: Learn how to conduct descriptive analytics for the M&S case study

1.2 Overview of a Data Analytics Project

1.3 Business Objective: Our Business Question in Weeks 3 - 5

Our project for M&S in Weeks 3-5: Help M&S to improve marketing efficiency by improving its ROI on its targeted marketing offers. The project will involve data collection, data cleaning, and data analysis, including both descriptive and prescriptive analytics, to identify the most profitable customers and develop a personalized marketing targeting strategy.

1.4 Business Objective: Example Dissertation Projects in Term 3

  • Burberry provides relevant product recommendations on Burberry.com to facilitate in-session product exploration and to create a more personalised user experience. This project is to develop a new product recommendation system that tailors suggestions to individual users based on their product selections and preferences.

  • The AXA project will explore fraud detection approaches using unsupervised ML including models such as isolation forests. The candidate will develop an understanding of the business problem and our data, formulating hypotheses and testing them. They will build, evaluate, and interpret their ML models.

  • At Waitrose, it’s crucial to balance product availability with minimizing waste by understanding sales rates. Factors like product shelf life, varying sales velocities, promotions, and unexpected trends make it challenging to find a one-size-fits-all solution. Current manual forecasting introduces inaccuracies and process delays. We aim to develop a machine learning algorithm to generate daily product forecasts, integrate with our stock management system, and enable automated, accurate forecasting.

2 Data Collection

2.1 Types of Data by Source

  • Primary Data: Data that are generated by the data analyst through surveys, interviews, experiments, specially designed for understanding and solving the research problem at hand.

  • Secondary Data: Existing data generated by the company’s or consumer’s past activities, as part of organizational record keeping.

Basis for Comparison Primary Data Secondary Data
Meaning Primary data refers to the first-hand data gathered by the analyst. Secondary data means data collected by someone else earlier (usually by the company).
Data New data Historical data in the past
Source Surveys, observations, experiments, questionnaire, personal interviews, etc. Company databases, government publications, websites, books, journal articles, internal records, etc.
Cost Expensive; Very involved and costly Economical; Quick and easy
Collection time Long Short
Specific Always specific to the researcher’s needs. May or may not be specific to the researcher’s needs.

2.2 Types of Data by Structure

  • We often consider 2 dimensions of a dataset: unit and time.
    • Cross-sectional data: data collected at a single point in time.
    • Longitudinal data: data collected over time but may not contain the same individuals.
    • Panel data: data collected over time and contain the same individuals.

2.3 Primary Data: Marketing Surveys

  • A marketing survey is often the easiest and most cost-effective way to collect primary data. We often collect the following variables:

    • purchase intention: how likely a customer will buy a product, helps to predict sales
    • willingness to pay: how much a customer is willing to pay for a product, helps to set the optimal price
    • shopping basket: what products a customer usually buys, helps to cross-sell
    • share of wallet: how much a customer spends on a product category, helps to identify the high-potential customers for market penetration
    • demographics: gender, age, income, education, etc., helps to segment customers
  • Let’s see an example in Mentimeter of how to design a marketing survey!

  • You can design surveys to collect data for your term 1 projects or term 3 dissertation.

2.4 Limitation of Marketing Surveys

  • Hawthorne Effect and Response Bias: Participants may answer in ways they think are socially desirable or expected, rather than their true feelings or behavior.

  • Sampling Bias: The sample may not be representative of the customer population.

  • Fatigue: Long surveys may lead to respondent fatigue, causing rushed or careless answers toward the end. Do not ask too many questions!

3 Data Wrangling with R

3.1 Data Frame Basics

  • Data Frame is the R object that we will deal with most of the time in the MSc program. You can think of data.frame as an Excel spreadsheet.
    • Each row stands for an observation; usually a record for a customer
    • Each column stands for a variable; each column should have a unique name.
    • Each column must contain the same data type, but the different columns can store different data types.

3.2 Install and Load the dplyr package

  • In R, we use the dplyr package for data cleaning and manipulation.1
Code
pacman::p_load(dplyr)
  • Load a csv format dataset called data_full using read.csv()
Code
data_full <- read.csv("https://www.dropbox.com/scl/fi/2q7ppqtyca0pd3j486osl/data_full.csv?rlkey=gsyk51q27vd1skek4qpn5ikgm&dl=1")
  • To browse the whole dataset, we can simply click the dataset in the environment

3.3 First Look at the Dataset

  • What variables do the data have? The data types of each variable?
Code
str(data_full)
'data.frame':   2000 obs. of  22 variables:
 $ ID                 : int  5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
 $ MntWines           : int  635 11 426 11 173 520 235 76 14 28 ...
 $ MntFruits          : int  88 1 49 4 43 42 65 10 0 0 ...
 $ MntMeatProducts    : int  546 6 127 20 118 98 164 56 24 6 ...
 $ MntFishProducts    : int  172 2 111 10 46 0 50 3 3 1 ...
 $ MntSweetProducts   : int  88 1 21 3 27 42 49 1 3 1 ...
 $ MntGoldProds       : int  88 6 42 5 15 14 27 23 2 13 ...
 $ NumDealsPurchases  : int  3 2 1 2 5 2 4 2 1 1 ...
 $ NumWebPurchases    : int  8 1 8 2 5 6 7 4 3 1 ...
 $ NumCatalogPurchases: int  10 1 2 0 3 4 3 0 0 0 ...
 $ NumStorePurchases  : int  4 2 10 4 6 10 7 4 2 0 ...
 $ NumWebVisitsMonth  : int  7 5 4 6 5 6 6 8 9 20 ...
 $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Response           : int  1 0 0 0 0 0 0 0 1 0 ...
 $ Year_Birth         : int  1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
 $ Education          : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
 $ Marital_Status     : chr  "Single" "Single" "Together" "Together" ...
 $ Income             : int  58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
 $ Kidhome            : int  0 1 0 1 1 0 0 1 1 1 ...
 $ Teenhome           : int  0 1 0 0 0 1 1 0 0 1 ...
 $ Dt_Customer        : chr  "04/09/2012" "08/03/2014" "21/08/2013" "10/02/2014" ...
 $ Recency            : int  58 38 26 26 94 16 34 32 19 68 ...

3.4 Common Data Wrangling Operations

  • Filter rows (filter)

  • Sort rows (arrange)

  • Select columns (select)

  • Generate new columns (mutate)

  • Group aggregation (group_by)

  • Merge datasets (join)

3.5 Subset Rows Based on Conditions: filter

  • We can use filter() to select rows that meet certain logical criteria.

  • Important: To store the generated new subset of data in RStudio, we need to assign it to a new object.

Example: From data_full, find customers who are single

Code
# keep only single customers
filter(data_full, Marital_Status == "Single" )

3.6 The Pipe Operator

3.6.1 Pipe Operator

%>% passes the object in front as the first argument of the subsequent function.2

3.7 Example of the Pipe Operator

Code
# without using pipe
filter(data_full, Marital_Status == 'Single')

# with pipe 
data_full %>% filter(Marital_Status == 'Single')

3.8 Why Do We Need Pipe Operator for Data Wrangling?

  • Exercise: find out single customers who have a PhD without using pipe.
Code
# based on data_full, find out customers who are single
data_full_single <- filter(data_full, Marital_Status == "Single")

# based on data_full_single, find out customers who are single and have PhD
data_full_single_PhD <- filter(data_full_single, Education == "PhD")
  • Exercise: find out single customers who have a PhD using pipe.
Code
data_full_single_PhD <- data_full %>%
  filter(Marital_Status == 'Single') %>%
  filter(Education == 'PhD')
  
  ## You can even continue with more filter steps with more pipe operators
  • The pipe works like a conveyor belt in a factory, passing the intermediate outputs from the previous data wrangling step to the next step for further processing until you finish your data wrangling task.

3.9 Sort Rows: arrange

  • arrange() orders the rows by the values of selected columns.

    • ascending order by default; for descending order, put a minus sign in front of the variable.

    • allows multiple sorting variables separated by comma.

  • Example: sort customers based on income in descending order.

Code
data_full %>% 
  arrange(-Income) %>%
  head(10)
  • Exercise: sort customers based on income in descending order and age in ascending order.
Code
data_full %>% 
  mutate(Age = 2023 - Year_Birth) %>%
  arrange(-Income, Age) %>%
  head(10) # only show the first 10 rows using head(10)

3.10 Generate New Variables: mutate

  • mutate() generates new variables in the dataset while preserving existing variables

  • Example: create a new variable named Age from Year_Birth.

Code
data_full %>%
  mutate(Age = 2023 - Year_Birth) %>%
  head(10)
  • Exercise: create a new variable named totalkids, which is the sum of Kidhome and Teenhome.
Code
data_full %>%
  mutate(totalkids = Kidhome + Teenhome) %>%
  head(10)

3.11 Aggregation by Groups: group_by

  • group_by() allows us to aggregate data by group and compute statistics for each group
Code
# group by marital status
data_full %>%
    group_by(Marital_Status) 
  • Internally, the dataset is already grouped based on the specified variable(s).

3.12 Aggregation by Groups: group_by() %>% summarise()

  • After aggregating data, we can use summarise() to compute group-specific statistics for us.
    • Similar to mutate() in generating new variables
    • Different from mutate() in that the new variable is computed based on groups.
Code
# compute the average income for each marital status group
data_full %>%
  group_by(Marital_Status) %>% 
  summarise(avg_income = mean(Income,na.rm = T)) %>%
  ungroup()
  • What if you replace summarise() with mutate()?

3.13 Aggregation by Groups: group_by() Multiple Groups

  • We can have multiple group variables for group_by , such as computing average income for each marital status, education combination
Code
# compute the average income for each marital, education group
data_full %>%
  group_by(Marital_Status,Education) %>% 
  summarise(avg_income = mean(Income,na.rm = T)) %>% 
  ungroup() %>%
  head(5)

3.14 After-Class

  • (essential) Cheatsheet for dplyr. This cheatsheet provides a quick reference for the most commonly used functions in the dplyr package. It’s very important to familiarize yourself with these functions as you will use them a lot in your future projects.

  • (optional) Complete the after-class exercise for Week 3. If you still have time, you can also complete the data camp exercise on the dplyr package. The link is here.

Footnotes

  1. pacman is a package management tool that makes our lives easier by loading multiple packages at once.↩︎

  2. Starting from R 4.0 version, base R introduces the native pipe operator |>↩︎