'data.frame': 93467 obs. of 7 variables:
$ driver_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ booking_date: chr "2020-04-01" "2020-04-02" "2020-04-03" "2020-04-04" ...
$ is_work : int 0 0 0 0 0 0 0 0 0 0 ...
$ income : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_order : int 0 0 0 0 0 0 0 0 0 0 ...
$ avg_distance: num 0 0 0 0 0 0 0 0 0 0 ...
$ city : chr "g" "g" "g" "g" ...
The Causal Impact of COVID-19 on Ridesharing Using Instrumental Variables
MSIN0094 Case Study
1 Industry Background¹
The sharing economy has been booming in recent years, leading to a rapid increase in jobs in the “gig” economy. According to Hossain (2020), in the US alone, the sharing economy sector has created 6.23 million jobs with 78 million service providers, and 800 million people engage with it. The transportation sector is one of the most salient beneficiaries of the burgeoning sharing economy. For instance, commuting to work by shared bicycle (e.g., Citi Bike) has become an increasingly popular transportation option (Ford et al. 2019). The ride-sharing service (e.g., Uber) allows drivers to enjoy more flexibility in work, which is proven valuable to drivers and has improved capacity utilization (Cramer and Krueger 2016).
However, the COVID-19 pandemic has brought unprecedented disruptions to many industries, and the transportation industry is among the most disrupted. Further, COVID-19 has raised concerns about the survivability of the sharing economy in general. It is reported that gross bookings on Uber rides were down by 75% in the three months through June 2020, and that Lyft’s April ridership was down by 75% from April 2019. Figure 1 shows that some Uber drivers were extremely cautious about their shift decisions and took measures to prevent COVID-19 from spreading.
Unlike the traditional taxi market, where taxi drivers rent vehicles from taxi companies and then directly provide transportation services to consumers, modern ride-sharing platforms typically serve as the matching intermediary between drivers and passengers. Because of this two-sided market structure, the profitability of modern ride-sharing platforms (and of the sharing economy in general) depends heavily on the interdependence, or externality, between the two sides of economic agents (Rysman 2009). Therefore, a ride-sharing platform benefits from the network effect when more drivers work for it. It is thus managerially important for the ride-sharing platform to understand whether COVID-19 has affected drivers’ labor supply patterns and, if so, the magnitude of the effect across drivers and over time.
In this case study, we will answer the above question using the instrumental variable method.
2 Data Sets
2.1 Driver Daily Trip Data
The data science team has aggregated the raw trip-level data into a driver-day level panel dataset. A panel data structure refers to a dataset that includes multiple observations over time for the same subjects or individuals. It combines cross-sectional data (observations at a single point in time) and time series data (observations of a single subject over multiple time periods), thus enabling analysis that captures both individual dynamics and temporal variations.
This first data set summarizes drivers’ daily shifts in April 2020, right when the pandemic began in the UK. The data set consists of a random sample of around 4000 drivers in 3 UK cities (anonymized as g, s, and c).
- Check the data types of each variable
  - Identify any variables that need to be converted
  - Identify all economically meaningful variables
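The checks listed above can be sketched on a small made-up panel. The data frame `toy_driver` and all of its values are assumptions for illustration; only the column names follow the data set described here. The sketch inspects column types and verifies the defining property of a driver-day panel, namely one row per driver per day.

```r
# Toy driver-day panel (made-up values; column names mirror the case data)
toy_driver <- data.frame(
  driver_id    = rep(1:2, each = 3),
  booking_date = rep(c("2020-04-01", "2020-04-02", "2020-04-03"), times = 2),
  is_work      = c(0L, 1L, 0L, 1L, 1L, 0L),
  income       = c(0, 35.5, 0, 20.0, 41.2, 0),
  city         = rep(c("g", "s"), each = 3),
  stringsAsFactors = FALSE
)

# Data type of each column; booking_date is stored as character here
col_types <- sapply(toy_driver, class)
print(col_types)

# A driver-day panel should have exactly one row per driver per day
n_dup <- sum(duplicated(toy_driver[, c("driver_id", "booking_date")]))

# Number of days observed for each driver (equal counts mean a balanced panel)
days_per_driver <- table(toy_driver$driver_id)
print(days_per_driver)
```

On the real data, the same calls would be applied to `data_driver`; a character `booking_date` is the variable that needs converting.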
2.2 Data Wrangling
- Convert the booking date data from characters to date type using the lubridate package.
  - For more details on how to work with date objects in R, refer to this tutorial
pacman::p_load(dplyr, lubridate)  # dplyr provides %>% and mutate()
# ymd(), dmy(), and mdy() convert character strings into Date objects
# lubridate offers many more date helpers; refer to the tutorial for details
data_driver <- data_driver %>%
  mutate(booking_date = ymd(booking_date))
# check the class of booking_date now
class(data_driver$booking_date)
[1] "Date"
- Please report the summary statistics of the trip data.
Table 1 reports the results.
| | Unique (#) | Missing (%) | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| driver_id | 3223 | 0 | 4378.2 | 2503.8 | 1.0 | 4420.0 | 8640.0 |
| is_work | 2 | 0 | 0.2 | 0.4 | 0.0 | 0.0 | 1.0 |
| income | 12127 | 0 | 5.9 | 20.7 | 0.0 | 0.0 | 948.2 |
| n_order | 38 | 0 | 0.9 | 3.0 | 0.0 | 0.0 | 41.0 |
| avg_distance | 9676 | 0 | 1.7 | 5.0 | 0.0 | 0.0 | 181.8 |
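The source does not show the call that produced Table 1; its column layout resembles what a skim-style helper such as `datasummary_skim()` in the modelsummary package reports, but that is a guess. As a minimal sketch in base R, the same statistics can be computed by hand. The data frame `toy_driver` and the helper `summarise_var()` are hypothetical and use made-up values.

```r
# Toy driver-day data (made-up values)
toy_driver <- data.frame(
  is_work = c(0, 1, 0, 1, 1, 0),
  income  = c(0, 35.5, 0, 20, 41.2, 0),
  n_order = c(0, 4, 0, 2, 5, 0)
)

# Summary statistics for one numeric variable, matching Table 1's columns
summarise_var <- function(x) {
  c(Unique  = length(unique(x)),
    Missing = 100 * mean(is.na(x)),
    Mean    = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    Min     = min(x, na.rm = TRUE),
    Median  = median(x, na.rm = TRUE),
    Max     = max(x, na.rm = TRUE))
}

# One row per variable, one column per statistic
summary_table <- t(sapply(toy_driver, summarise_var))
print(round(summary_table, 1))
```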
2.3 COVID-19 Data
To measure the severity of COVID-19, the data science team collected the daily number of new cases in each city from the government database.
- Please check the data types and correct them as needed
  - Tip: convert booking_date into a date type.
'data.frame': 87 obs. of 4 variables:
$ city : chr "g" "g" "g" "g" ...
$ booking_date : chr "2020-04-01" "2020-04-02" "2020-04-03" "2020-04-04" ...
$ new_cases : int 1 1 1 0 3 1 0 2 0 3 ...
$ other_city_new_cases: int 0 0 0 0 0 0 0 0 0 0 ...
- We can plot the trend of COVID-19 cases using the ggplot2 package.
Figure 2 plots the trend of new COVID cases in each city by date.
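A plot of this kind can be sketched as follows. The data frame `toy_covid` and its counts are made up for illustration, and the styling choices are assumptions rather than the settings behind Figure 2; only the column names follow the COVID-19 data set.

```r
library(ggplot2)

# Toy city-day COVID data (made-up counts; column names follow the case data)
toy_covid <- data.frame(
  city         = rep(c("g", "s", "c"), each = 4),
  booking_date = rep(c("2020-04-01", "2020-04-02", "2020-04-03", "2020-04-04"),
                     times = 3),
  new_cases    = c(1, 1, 1, 0, 2, 3, 1, 4, 0, 1, 0, 2)
)

# Convert the character dates so the x-axis becomes a proper date scale
toy_covid$booking_date <- as.Date(toy_covid$booking_date)

# One line per city showing daily new cases over time
p <- ggplot(toy_covid, aes(x = booking_date, y = new_cases, color = city)) +
  geom_line() +
  geom_point() +
  labs(x = "Date", y = "New cases", color = "City")

print(p)
```

Converting `booking_date` before plotting matters: with character dates, ggplot2 would treat each date as a separate category instead of a point on a time axis.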
2.4 Data Wrangling
- Join the two datasets using dplyr
  - Tip: In Weeks 2 and 3, we learned dplyr data joining. We can only do an M:1 or 1:1 left_join().
  - Please observe the data structure of the two datasets and think carefully about how we should do the data join in this case.
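One way to approach the join is sketched below on made-up data. The driver panel has many rows per city-day while the COVID-19 data should have one, so the driver panel goes on the left and the join keys are `city` and `booking_date`. The objects `toy_driver`, `toy_covid`, and `toy_joined` are hypothetical names.

```r
library(dplyr)

# Toy driver-day panel: many rows per (city, booking_date)
toy_driver <- data.frame(
  driver_id    = c(1, 1, 2, 2, 3, 3),
  booking_date = as.Date(rep(c("2020-04-01", "2020-04-02"), times = 3)),
  city         = c("g", "g", "g", "g", "s", "s"),
  is_work      = c(0, 1, 1, 1, 0, 0)
)

# Toy city-day COVID data: exactly one row per (city, booking_date)
toy_covid <- data.frame(
  city         = c("g", "g", "s", "s"),
  booking_date = as.Date(c("2020-04-01", "2020-04-02",
                           "2020-04-01", "2020-04-02")),
  new_cases    = c(1, 3, 0, 2)
)

# The right-hand table must be unique on the join keys for an M:1 join
stopifnot(!any(duplicated(toy_covid[, c("city", "booking_date")])))

# Driver-day rows are the "many" side; city-day rows are the "one" side
toy_joined <- toy_driver %>%
  left_join(toy_covid, by = c("city", "booking_date"))

# An M:1 left join leaves the row count of the left table unchanged
same_rows <- nrow(toy_joined) == nrow(toy_driver)
print(same_rows)
```

Both tables need `booking_date` stored with the same type before joining; a Date column on one side and a character column on the other would not match.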
3 Simple OLS Regressions
3.1 Key Outcome Variables
To facilitate the empirical analysis of drivers’ responses to COVID-19, the data science team has followed the literature (e.g., Farber 2008) and further aggregated trips to a higher level for each driver so that we can measure both the extensive margin (i.e., whether to work) and the intensive margin (i.e., how much to work) of drivers’ labor supply. As the COVID-19 measures vary at the daily level, the team aggregated the trip-level data to the driver-day level. Specifically, Uber cares about the following driver-day level KPI measures, which serve as the dependent variables in the subsequent empirical analysis.
- Whether or not to work, a binary outcome variable which equals 1 if a driver has at least one ride request on the day and 0 otherwise. We can use this variable to measure drivers’ shift decision, i.e., willingness to work on a day, which proxies for the extensive margin of drivers’ labor supply. It is ambiguous ex-ante how the number of new cases affects a driver’s shift decision. On the one hand, more new cases may increase the risk of infection, which decreases drivers’ expected wellbeing and therefore discourages drivers from working on a specific day; on the other hand, fewer drivers on the street suggest less competition among drivers and therefore higher chances of getting a passenger and potentially higher hourly earnings, which may motivate drivers to work. It is important for the ride-sharing company to understand how the severity of COVID-19 affects drivers’ willingness to work, so that the company can adjust its stimulus plans for drivers accordingly.
- Total number of completed orders, which contains three aspects of information that are of policy and managerial interest. First, the variable can proxy for the length of drivers’ daily labor supply. Conditional on working, if a driver decides to work for longer hours, then we expect the driver to have a larger number of orders. Second, the variable contains information on consumer demand. We expect the total number of orders to decrease if there is lower demand for ride-sharing services from consumers due to the COVID-19 outbreak. Finally, the variable can measure the intensity of competition among drivers. Keeping the level of demand fixed, the total number of orders per driver would be larger when there are fewer drivers working on the day. Due to the complexity of the information contained, it is not straightforward ex-ante how the COVID-19 measures affect the total number of orders for individual drivers.
- Earnings. Earnings measure the driver’s income from providing ride-sharing services, which is highly correlated with the number of completed orders and total trip distance. It allows us to directly assess the impact of the COVID-19 on drivers’ financial wellbeing.
- Average trip distance. In our empirical context, drivers cannot reject a booking request once matched with a passenger; therefore, the trip distance is largely determined by passengers. Since passengers may be reluctant to take long-distance trips during the pandemic, we expect a negative impact of the number of new cases on the average trip distance.
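The aggregation from trips to the four driver-day outcomes above can be sketched as follows. The raw trip data are not shown in this case, so the trip-level data frame `toy_trips` and its columns `fare` and `distance` are hypothetical. As a simplification, the sketch flags a working day by at least one completed order, whereas the definition above refers to ride requests.

```r
library(dplyr)

# Toy trip-level data (made-up; `fare` and `distance` are hypothetical columns)
toy_trips <- data.frame(
  driver_id    = c(1, 1, 1, 2),
  booking_date = as.Date(c("2020-04-01", "2020-04-01",
                           "2020-04-03", "2020-04-02")),
  fare         = c(10, 15, 8, 20),
  distance     = c(2.0, 4.0, 1.5, 6.0)
)

# Aggregate trips to one row per driver per working day
toy_worked <- toy_trips %>%
  group_by(driver_id, booking_date) %>%
  summarise(n_order      = n(),
            income       = sum(fare),
            avg_distance = mean(distance),
            .groups = "drop")

# Full driver-by-day grid, so days without any trip also appear
toy_grid <- expand.grid(
  driver_id    = unique(toy_trips$driver_id),
  booking_date = seq(as.Date("2020-04-01"), as.Date("2020-04-03"), by = "day")
)

# Days without trips get zeros; is_work flags days with at least one order
toy_panel <- toy_grid %>%
  left_join(toy_worked, by = c("driver_id", "booking_date")) %>%
  mutate(n_order      = coalesce(n_order, 0L),
         income       = coalesce(income, 0),
         avg_distance = coalesce(avg_distance, 0),
         is_work      = as.integer(n_order > 0)) %>%
  arrange(driver_id, booking_date)

print(toy_panel)
```

Filling the grid explains why most rows in the driver data are zeros: non-working days are recorded explicitly, which is what allows `is_work` to measure the extensive margin.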
3.2 OLS Regression Analyses
To empirically investigate the causal impact of COVID-19 cases on driver behavior, we can use simple OLS regression to regress the labor supply measures of driver \(i\), in city \(j\), on day \(t\) on the COVID-19 measure and other covariates as follows:
\[ Outcome_{ijt}=\beta_0+\alpha NewCases_{ijt}+X\beta+\varepsilon_{ijt} \tag{1}\]
- Please run four univariate simple linear regressions, with the outcomes being the four aforementioned dependent variables and the explanatory variable being new cases only.
The results are reported in Table 2. Interpret the results.
pacman::p_load(fixest, modelsummary)  # modelsummary provides modelsummary()
OLS_is_work <- feols(fml = is_work ~ new_cases,
data = data_driver)
OLS_income <- feols(fml = income ~ new_cases,
data = data_driver)
OLS_n_order <- feols(fml = n_order ~ new_cases,
data = data_driver)
OLS_avg_distance <- feols(fml = avg_distance ~ new_cases,
data = data_driver)
modelsummary(list("Work" = OLS_is_work,
"shift income" = OLS_income,
"# orders" = OLS_n_order,
"avg distance" = OLS_avg_distance),
gof_map = c('nobs','r.squared'),
stars = TRUE)
| | Work | shift income | # orders | avg distance |
|---|---|---|---|---|
| (Intercept) | 0.182*** | 5.957*** | 0.885*** | 1.656*** |
| | (0.001) | (0.071) | (0.010) | (0.017) |
| new_cases | 0.000 | −0.122*** | −0.012* | −0.003 |
| | (0.001) | (0.036) | (0.005) | (0.009) |
| Num.Obs. | 93467 | 93467 | 93467 | 93467 |
| R2 | 0.000 | 0.000 | 0.000 | 0.000 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
3.3 Fixed Effect OLS Regressions
What confounding factors do we need to control for in the above OLS regressions? Specifically, since the data form a panel, what fixed effects do you need to include in the OLS regression?
We first need to include driver fixed effects to control for driver-specific characteristics that may affect drivers’ labor supply patterns. Such characteristics include, but are not limited to, the driver’s socio-demographic characteristics (e.g., gender and age), the driver’s degree of risk aversion, whether a driver is driving full-time or part-time, and the driver’s innate abilities to search for passengers, etc.
For instance, less risk-averse drivers may prefer to work on days when there are more new cases because they expect less competition from peer drivers and potentially higher profitability on such days. Another example is that full-time drivers may be more exposed to the impact of new cases than part-time drivers, because full-time drivers’ income largely comes from providing ride-sharing services via the focal company. Driver fixed effects can mitigate such driver-specific, time-invariant confounding effects and help us obtain more accurate estimates for our focal explanatory variable NewCases.
In addition to driver fixed effects, which remove cross-sectional confounding effects across drivers, we also include time fixed effects in Equation (1) to mitigate inter-temporal confounding effects. We consider time fixed effects at the day level. Moreover, given that the local government in each city may have enacted different policies on fighting COVID-19 and/or stimulating the economy (e.g., subsidizing drivers) during our data period, we further control for city fixed effects.
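How driver fixed effects remove time-invariant confounders can be illustrated with a small simulation. The sketch uses base R `lm()` rather than `feols()`, generic variables `x` and `y`, and made-up parameters (a true slope of 0.5); none of it comes from the case data. It compares pooled OLS, OLS with driver dummies, and the equivalent "within" estimator that demeans each variable by driver.

```r
set.seed(1)

# Simulated driver-day panel (all numbers are made up)
n_driver <- 50
n_day    <- 30
sim <- expand.grid(driver_id = 1:n_driver, day = 1:n_day)

# A time-invariant driver trait that shifts both the regressor and the outcome
driver_trait <- rnorm(n_driver)
sim$x <- 2 + driver_trait[sim$driver_id] + rnorm(nrow(sim))
sim$y <- 10 + 0.5 * sim$x + 3 * driver_trait[sim$driver_id] + rnorm(nrow(sim))

# Pooled OLS ignores the driver trait, so its slope is biased away from 0.5
b_pooled <- coef(lm(y ~ x, data = sim))[["x"]]

# Driver fixed effects entered as dummy variables
b_dummy <- coef(lm(y ~ x + factor(driver_id), data = sim))[["x"]]

# Equivalent within estimator: subtract each driver's mean from x and y
sim$y_dm <- sim$y - ave(sim$y, sim$driver_id)
sim$x_dm <- sim$x - ave(sim$x, sim$driver_id)
b_within <- coef(lm(y_dm ~ x_dm, data = sim))[["x_dm"]]

print(c(pooled = b_pooled, dummy = b_dummy, within = b_within))
```

The dummy-variable and within estimates coincide, which is why fixed effects are often described as comparing each driver with themselves over time. Fixed effects only absorb confounders that are constant within the unit; anything that varies by driver and day remains in the error term.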
FE_is_work <- feols(fml = is_work ~ new_cases|
driver_id + booking_date + city,
data = data_driver)
FE_income <- feols(fml = income ~ new_cases|
driver_id + booking_date + city,
data = data_driver)
FE_n_order <- feols(fml = n_order ~ new_cases|
driver_id + booking_date + city,
data = data_driver)
FE_avg_distance <- feols(fml = avg_distance ~ new_cases|
driver_id + booking_date + city,
data = data_driver)
modelsummary(list("Work" = FE_is_work,
"shift income" = FE_income,
"# orders" = FE_n_order,
"avg distance" = FE_avg_distance),
stars = TRUE,
gof_map = 'nobs')
| | Work | shift income | # orders | avg distance |
|---|---|---|---|---|
| new_cases | 0.000 | 0.029 | 0.003 | 0.002 |
| | (0.000) | (0.022) | (0.003) | (0.008) |
| Num.Obs. | 93467 | 93467 | 93467 | 93467 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
4 Instrumental Variable
4.1 Potential Endogeneity
After including the driver, city, and date fixed effects in Equation (1), the only remaining challenge to obtaining causal inference is the potential endogeneity of NewCases.
Equation (1) could be subject to simultaneity issues because drivers’ labor supply decisions and the number of new cases may be interdependent. On the one hand, drivers may adjust their labor supply according to the number of new cases. On the other hand, prior research has demonstrated the potential effect of mobility on the COVID-19 case growth rate. If a city has a higher volume of private transportation through ride-sharing services, given the highly contagious nature of COVID-19, the city may have a higher number of new cases.
4.2 Instrumental Variables
To tackle the potential endogeneity issue, we use the instrumental variable (IV) method with two-stage least squares (2SLS), leveraging exogenous sources of variation in the explanatory variable that are uncorrelated with the error term in Equation (1). We can potentially select two instrumental variables. The first instrumental variable is imported new cases, which measures the number of infected travelers from overseas in each city as disclosed by the local government. Because the imported cases relate to travelers from overseas, they should be exogenous to local drivers’ labor supply decisions and thus meet the exogeneity requirement. The second instrumental variable is other city new cases, which is the number of new cases confirmed in neighboring cities. Since confirmed cases in other cities should not directly affect the focal city’s ride-sharing market, the variable other city new cases should also satisfy the exogeneity requirement.
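The two requirements on an instrument can be stated compactly. As a sketch, let \(Z_{ijt}\) denote an instrument such as other city new cases; \(Z\) is shorthand introduced here for illustration and does not appear in the original notation. Conditional on the controls and fixed effects in \(X\), a valid instrument should satisfy

\[ \text{Relevance: } Cov(Z_{ijt}, NewCases_{ijt}) \neq 0, \qquad \text{Exogeneity: } Cov(Z_{ijt}, \varepsilon_{ijt}) = 0 \]

With a single instrument, and after partialling out \(X\), the IV estimate of \(\alpha\) in Equation (1) is the ratio

\[ \hat{\alpha}_{IV} = \frac{\widehat{Cov}(Z_{ijt}, Outcome_{ijt})}{\widehat{Cov}(Z_{ijt}, NewCases_{ijt})} \]

Relevance can be checked in the data through the first-stage regression, whereas exogeneity rests on the economic argument above and cannot be tested directly when there is only one instrument.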
4.3 Manual IV Regression
The first-stage regression is specified in Equation (2), where the variables are defined as in Equation (1) and the coefficients are specific to the first stage.
\[ NewCases_{ijt}=\gamma_0+\delta OtherCityNewCases_{ijt}+X\gamma+\nu_{ijt} \tag{2}\]
# Run first stage regression: new_cases ~ other_city_new_cases + controls
IV_is_work_1ststage <- feols(fml = new_cases ~ other_city_new_cases|
driver_id + booking_date + city,
data = data_driver)
# mutate predicted new_cases in data_driver
data_driver <- data_driver %>%
mutate(predicted_new_cases = predict(IV_is_work_1ststage))
In the second stage regression, we regress the outcome variables on the predicted new cases from the 1st stage, controlling for the same set of control variables.
# Run second stage regression: is_work ~ predicted_new_cases + controls
IV_is_work_2ndstage <- feols(fml = is_work ~ predicted_new_cases|
driver_id + booking_date + city,
data = data_driver)
modelsummary(list(IV_is_work_1ststage,IV_is_work_2ndstage),
stars = TRUE)
| | (1) | (2) |
|---|---|---|
| other_city_new_cases | −0.466*** | |
| | (0.012) | |
| predicted_new_cases | | 0.000 |
| | | (0.001) |
| Num.Obs. | 93467 | 93467 |
| R2 | 0.491 | 0.662 |
| R2 Adj. | 0.473 | 0.650 |
| R2 Within | 0.225 | 0.000 |
| R2 Within Adj. | 0.225 | 0.000 |
| AIC | 328117.6 | −7593.9 |
| BIC | 358852.8 | 23141.3 |
| RMSE | 1.35 | 0.22 |
| Std.Errors | by: driver_id | by: driver_id |
| FE: driver_id | X | X |
| FE: booking_date | X | X |
| FE: city | X | X |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
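The logic of the manual two-stage procedure can be checked on simulated data where the true effect is known. The sketch below uses base R `lm()` without fixed effects instead of `feols()`, and every variable and parameter in it (`u`, `z`, `x`, `y`, a true effect of 0.5) is made up. It also shows a caveat of running the second stage by hand: the point estimate is correct, but the standard errors reported by the second-stage regression are not valid 2SLS standard errors. The corrected standard error shown assumes homoskedastic errors and is not the clustered version used in the tables.

```r
set.seed(42)
n <- 5000

# Simulated data (made up): u is an unobserved confounder, z an instrument
u <- rnorm(n)
z <- rnorm(n)
x <- 1 + 0.8 * z + u + rnorm(n)        # endogenous regressor
y <- 2 + 0.5 * x - 1.5 * u + rnorm(n)  # true causal effect of x on y is 0.5

# Naive OLS is biased because x is correlated with the omitted u
b_ols <- coef(lm(y ~ x))[["x"]]

# Stage 1: regress x on z; keep fitted values and the instrument's F statistic
stage1 <- lm(x ~ z)
x_hat  <- fitted(stage1)
f_first_stage <- summary(stage1)$coefficients["z", "t value"]^2

# Stage 2: regress y on the fitted values from stage 1
stage2 <- lm(y ~ x_hat)
b_2sls <- coef(stage2)[["x_hat"]]

# With one instrument, 2SLS equals the ratio of covariances
b_ratio <- cov(z, y) / cov(z, x)

# Stage-2 standard errors use residuals built from x_hat; valid 2SLS
# standard errors use residuals built from the observed x instead
resid_2sls <- y - coef(stage2)[["(Intercept)"]] - b_2sls * x
sigma2     <- sum(resid_2sls^2) / (n - 2)
se_2sls    <- sqrt(sigma2 / sum((x_hat - mean(x_hat))^2))
se_naive   <- summary(stage2)$coefficients["x_hat", "Std. Error"]

print(c(ols = b_ols, two_stage = b_2sls, ratio = b_ratio))
print(c(se_naive = se_naive, se_corrected = se_2sls,
        first_stage_F = f_first_stage))
```

A one-step IV routine handles the standard-error correction internally, which is the practical reason to prefer it over the manual procedure.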
4.4 IV Regression Using feols()
In fact, the feols() function in the fixest package can estimate an IV regression in a single step. We specify the endogenous variable and its instrument in an additional formula part placed after the fixed effects.
IV_is_work <- feols(fml = is_work ~ 1|  # Y ~ exogenous regressors (none here, hence 1)
                      driver_id + booking_date + city|  # fixed effects
                      new_cases ~ other_city_new_cases, # endogenous var ~ IV
                    data = data_driver)
IV_no_order <- feols(fml = n_order ~ 1|
driver_id + booking_date + city|
new_cases ~ other_city_new_cases,
data = data_driver)
IV_income <- feols(fml = income ~ 1|
driver_id + booking_date + city|
new_cases ~ other_city_new_cases,
data = data_driver)
IV_avg_distance <- feols(fml = avg_distance ~ 1|
driver_id + booking_date + city|
new_cases ~ other_city_new_cases,
data = data_driver)
modelsummary(list("Work" = IV_is_work,
"shift income" = IV_income,
"# orders" = IV_no_order,
"avg distance" = IV_avg_distance),
stars = TRUE)
| | Work | shift income | # orders | avg distance |
|---|---|---|---|---|
| fit_new_cases | 0.000 | 0.122* | 0.019** | 0.023 |
| | (0.001) | (0.053) | (0.007) | (0.015) |
| Num.Obs. | 93467 | 93467 | 93467 | 93467 |
| R2 | 0.662 | 0.555 | 0.607 | 0.420 |
| R2 Adj. | 0.650 | 0.539 | 0.593 | 0.399 |
| R2 Within | 0.000 | 0.000 | 0.000 | 0.000 |
| R2 Within Adj. | 0.000 | 0.000 | 0.000 | 0.000 |
| AIC | −7593.5 | 762224.6 | 386801.7 | 523062.0 |
| BIC | 23141.8 | 792959.8 | 417536.9 | 553797.2 |
| RMSE | 0.22 | 13.79 | 1.85 | 3.84 |
| Std.Errors | by: driver_id | by: driver_id | by: driver_id | by: driver_id |
| FE: driver_id | X | X | X | X |
| FE: booking_date | X | X | X | X |
| FE: city | X | X | X | X |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
References
Footnotes
This case was prepared by Wei Miao, UCL School of Management, University College London for the MSIN0094 Marketing Analytics module. This case study is heavily adapted from his research (Wang et al. 2022) “The impact of COVID-19 on the ride-sharing industry and its recovery: Causal evidence from China.” Transportation Research Part A: Policy and Practice 155 (2022): 128-141. This case was developed to provide material for class discussion rather than to illustrate either effective or ineffective handling of a business situation. Names and data may have been disguised or fabricated. Please do not circulate without permission. Copyrights reserved.