Class 4 Data Wrangling with R Part I
1 Overview
1.1 Class Objectives
Understand the major steps to conduct data analytics
Data collection: Learn how to collect first-hand data
Data cleaning: Learn how to use the
dplyr
package to collect, load, and clean dataData analysis: Learn how to conduct descriptive analytics
2 Data Analytics Workflow
2.1 Overview
2.2 Collect Data
Primary Data: Data that are generated by the researcher himself/herself, surveys, interviews, experiments, specially designed for understanding and solving the research problem at hand.
Secondary Data: Existing data generated by the company’s or consumer’s past activities, as part of organizational record keeping.
2.3 Collect Data: Marketing Surveys
In a marketing survey, we typically would like to collect the following information from customers:
purchase intention
willingness to pay (WTP)
shopping basket
share of wallet (SoW)
demographics
Let’s see a quick example of how to design a marketing survey!
Useful supplementary readings if you need to design marketing surveys for your term 3 dissertation.
3 Data Wrangling with R
3.1 Data Frame Basics
Data Frame is the R object that we will deal with most of the time in the MSc program. You can think of
data.frame
as a spreadsheet in excelEach row stands for an
observation
Each column stands for a
variable
; each column should have a unique name.Each column must contain the same data type, but the different columns can store different data types.1
3.2 Install and Load the dplyr
package
- In R, we will be using the
dplyr
package for data cleaning and manipulation.
- Load the package
- Load a csv format dataset called
data_demo
usingread.csv()
- To browse the whole dataset, we can simply click the dataset in the environment
3.3 First Look at the Dataset
- What variables do the data have?
[1] "ID" "Year_Birth" "Education" "Marital_Status"
[5] "Income" "Kidhome" "Teenhome" "Dt_Customer"
[9] "Recency"
- What are the types of each variable?
ID Year_Birth Education Marital_Status Income
"integer" "integer" "character" "character" "integer"
Kidhome Teenhome Dt_Customer Recency
"integer" "integer" "character" "integer"
- Tip: We can use a function called
str()
short for structure.
'data.frame': 2000 obs. of 9 variables:
$ ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
$ Year_Birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
$ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
$ Marital_Status: chr "Single" "Single" "Together" "Together" ...
$ Income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
$ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
$ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
$ Dt_Customer : chr "04/09/2012" "08/03/2014" "21/08/2013" "10/02/2014" ...
$ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
3.4 Common Data Wrangling Operations
Select rows (
filter
)Sort rows (
arrange
)Select columns (
select
)Generate new columns (
mutate
)Group aggregation (
group_by
)Merge datasets (
join
)
3.5 Subset Rows Based on Conditions: filter
- We can use
filter()
to select rows that meet certain logical criteria.- The filter operation results in a new dataset, which is a subset of the original dataset after filtering
- The number of variables remains the same
- Important: To store the generated new subset of data in RStudio, we need to assign it to a new object.
Example: From data_demo
, find customers who are single
3.6 The Pipe Operator %>%
3.6.1 Pipe Operator
%>%
passes the object in front as the first argument of the subsequent function.
3.7 Example of the Pipe Operator %>%
3.8 Why Do We Need Pipe Operator for Data Wrangling?
- Exercise: find out single customers who have a PhD without using pipe.
- Exercise: find out single customers who have a PhD using pipe.
- The pipe works like a conveyor belt in a factory, passing the intermediate outputs from the previous data wrangling step to the next step for further processing until you finish your data wrangling task.
3.9 Subset Rows Based on Multiple Conditions: filter
- We can also add multiple criteria using
&
,|
, and!
to represent and, or, and not (induction week)
3.10 Sort Rows: arrange
arrange()
orders the rows by the values of selected columns.ascending order by default; for descending order, put a minus sign.
allows multiple sorting variables separated by comma.
Example: sort customers based on marital status in ascending order and number of teens in descending order.
- Exercise: sort customers based on income in descending order.
3.11 Generate New Variables: mutate
mutate()
generates new variables in the dataset while preserving existing variablesExample: create a new variable named
Age
fromYear_Birth
.
- Exercise: create a new variable named
totalkids
, which is the sum ofKidhome
andTeenhome
.
3.12 After-Class Exercise
Data camp dplyr exercise
Read “Preliminary Customer Analyses” dataset, and try to solve the case questions using the techniques learned today
Footnotes
Compared with matrix, is there any difference despite both being two-dimensional?↩︎