Class 4 Data Wrangling with R Part I

Author
Affiliation

Dr Wei Miao

UCL School of Management

Published

October 11, 2023

1 Overview

1.1 Class Objectives

  • Understand the major steps to conduct data analytics

  • Data collection: Learn how to collect first-hand data

  • Data cleaning: Learn how to use the dplyr package to collect, load, and clean data

  • Data analysis: Learn how to conduct descriptive analytics

2 Data Analytics Workflow

2.1 Overview

2.2 Collect Data

  • Primary Data: Data that are generated by the researcher himself/herself, surveys, interviews, experiments, specially designed for understanding and solving the research problem at hand.

  • Secondary Data: Existing data generated by the company’s or consumer’s past activities, as part of organizational record keeping.

2.3 Collect Data: Marketing Surveys

  • In a marketing survey, we typically would like to collect the following information from customers:

    • purchase intention

    • willingness to pay (WTP)

    • shopping basket

    • share of wallet (SoW)

    • demographics

  • Let’s see a quick example of how to design a marketing survey!

  • Useful supplementary readings if you need to design marketing surveys for your term 3 dissertation.

3 Data Wrangling with R

3.1 Data Frame Basics

  • Data Frame is the R object that we will deal with most of the time in the MSc program. You can think of data.frame as a spreadsheet in excel

  • Each row stands for an observation

  • Each column stands for a variable; each column should have a unique name.

  • Each column must contain the same data type, but the different columns can store different data types.1

3.2 Install and Load the dplyr package

  • In R, we will be using the dplyr package for data cleaning and manipulation.
install.packages("dplyr")
  • Load the package
library(dplyr)
  • Load a csv format dataset called data_demo using read.csv()
data_demo <- read.csv("https://www.dropbox.com/s/a0v38lpydls2emy/demographics.csv?dl=1")
  • To browse the whole dataset, we can simply click the dataset in the environment

3.3 First Look at the Dataset

  1. What variables do the data have?
names(data_demo)
[1] "ID"             "Year_Birth"     "Education"      "Marital_Status"
[5] "Income"         "Kidhome"        "Teenhome"       "Dt_Customer"   
[9] "Recency"       
  1. What are the types of each variable?
sapply(data_demo,
       class)
            ID     Year_Birth      Education Marital_Status         Income 
     "integer"      "integer"    "character"    "character"      "integer" 
       Kidhome       Teenhome    Dt_Customer        Recency 
     "integer"      "integer"    "character"      "integer" 
  • Tip: We can use a function called str() short for structure.
str(data_demo)
'data.frame':   2000 obs. of  9 variables:
 $ ID            : int  5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
 $ Year_Birth    : int  1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
 $ Education     : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
 $ Marital_Status: chr  "Single" "Single" "Together" "Together" ...
 $ Income        : int  58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
 $ Kidhome       : int  0 1 0 1 1 0 0 1 1 1 ...
 $ Teenhome      : int  0 1 0 0 0 1 1 0 0 1 ...
 $ Dt_Customer   : chr  "04/09/2012" "08/03/2014" "21/08/2013" "10/02/2014" ...
 $ Recency       : int  58 38 26 26 94 16 34 32 19 68 ...

3.4 Common Data Wrangling Operations

  • Select rows (filter)

  • Sort rows (arrange)

  • Select columns (select)

  • Generate new columns (mutate)

  • Group aggregation (group_by)

  • Merge datasets (join)

3.5 Subset Rows Based on Conditions: filter

  • We can use filter() to select rows that meet certain logical criteria.
    • The filter operation results in a new dataset, which is a subset of the original dataset after filtering
    • The number of variables remains the same

  • Important: To store the generated new subset of data in RStudio, we need to assign it to a new object.

Example: From data_demo, find customers who are single

# keep only single customers
data_demo_single <- filter(data_demo, Marital_Status == "Single" )

3.6 The Pipe Operator %>%

3.6.1 Pipe Operator

%>% passes the object in front as the first argument of the subsequent function.

3.7 Example of the Pipe Operator %>%

# without using pipe
filter(data_demo, Marital_Status == 'Single')

# with pipe 
data_demo %>% filter(Marital_Status == 'Single')

3.8 Why Do We Need Pipe Operator for Data Wrangling?

  • Exercise: find out single customers who have a PhD without using pipe.
# based on data_demo, find out customers who are single
data_demo_single <- filter(data_demo, Marital_Status == 'Single')

# based on data_demo_single, find out customers who are single and have PhD
data_demo_single_PhD <- filter(data_demo_single, Education == "PhD")
  • Exercise: find out single customers who have a PhD using pipe.
data_demo_single_PhD <- data_demo %>%
  filter(Marital_Status == 'Single') %>%
  filter(Education == 'PhD') %>%
  head() ## You can even continue with more filter steps
  • The pipe works like a conveyor belt in a factory, passing the intermediate outputs from the previous data wrangling step to the next step for further processing until you finish your data wrangling task.

3.9 Subset Rows Based on Multiple Conditions: filter

  • We can also add multiple criteria using &, |, and ! to represent and, or, and not (induction week)
data_demo %>%
  filter(Marital_Status == 'Single' & 
           Education == 'PhD') %>%
  head()

3.10 Sort Rows: arrange

  • arrange() orders the rows by the values of selected columns.

    • ascending order by default; for descending order, put a minus sign.

    • allows multiple sorting variables separated by comma.

  • Example: sort customers based on marital status in ascending order and number of teens in descending order.

data_demo %>% 
  arrange(Marital_Status, -Teenhome) %>%
  head(10)
  • Exercise: sort customers based on income in descending order.
data_demo %>% 
  arrange(-Income) %>%
  head(10)

3.11 Generate New Variables: mutate

  • mutate() generates new variables in the dataset while preserving existing variables

  • Example: create a new variable named Age from Year_Birth.

data_demo %>%
  mutate(Age = 2023 - Year_Birth) %>%
  head(10)
  • Exercise: create a new variable named totalkids, which is the sum of Kidhome and Teenhome.
data_demo %>%
  mutate(totalkids = Kidhome + Teenhome) %>%
  head(10)

3.12 After-Class Exercise

  • Data camp dplyr exercise

  • Read “Preliminary Customer Analyses” dataset, and try to solve the case questions using the techniques learned today

Footnotes

  1. Compared with matrix, is there any difference despite both being two-dimensional?↩︎