The Causal Impact of COVID-19 on Ridesharing Using Instrumental Variables

MSIN0094 Case Study

Author: Dr Wei Miao

Affiliation: UCL School of Management

Published: November 22, 2023

1 Industry Background¹

The sharing economy has been booming in recent years, leading to a rapid increase in jobs in the “gig” economy. According to Hossain (2020), in the US alone, the sharing economy sector has created 6.23 million jobs with 78 million service providers, and 800 million people engage with it. The transportation sector is one of the most salient beneficiaries of the burgeoning sharing economy. For instance, commuting to work by shared bicycle (e.g., Citi Bike) has become an increasingly popular transportation option (Ford et al. 2019). Ride-sharing services (e.g., Uber) allow drivers more flexibility in their work, which has proven valuable to drivers and has improved capacity utilization (Cramer and Krueger 2016).

However, the COVID-19 pandemic has brought unprecedented disruption to many industries, and the transportation industry is among the most disrupted. COVID-19 has also raised concerns about the survivability of the sharing economy in general. It is reported that gross bookings on Uber rides were down by 75% in the three months through June 2020, and that Lyft’s April ridership was down by 75% from April 2019. Figure 1 shows that some Uber drivers were extremely cautious about their shift decisions and took measures to prevent COVID from spreading.

Unlike the traditional taxi market, where taxi drivers rent vehicles from taxi companies and then directly provide transportation services to consumers, modern ride-sharing platforms typically serve as the matching intermediary between drivers and passengers. Because of this two-sided market structure, the profitability of modern ride-sharing platforms (and of the sharing economy in general) depends heavily on the interdependence, or externality, between the two sides of economic agents (Rysman 2009). A ride-sharing platform therefore benefits from the network effect when more drivers work for it. It is thus managerially important for the platform to understand whether COVID-19 has affected drivers’ labor supply patterns and, if so, the magnitude of the effect across drivers and over time.

In this case study, we will answer the above question using the instrumental variable method.

2 Data Sets

2.1 Driver Daily Trip Data

The data science team has aggregated the raw trip-level data into a driver-day level panel data set. A panel data set includes multiple observations over time for the same subjects or individuals. It combines cross-sectional data (observations at a single point in time) and time series data (observations of a single subject over multiple time periods), thus enabling analysis that captures both individual dynamics and temporal variations.

This first data set summarizes each driver’s shift on every day in April 2020, right when the pandemic began in the UK. The data set consists of a random sample of around 4000 drivers in 3 UK cities (anonymized as g, s, and c).
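To make the driver-day panel structure concrete, here is a minimal base R sketch on a hypothetical toy panel (3 drivers observed on 4 days; the object `toy_panel` is illustrative and not part of the case data). It checks the defining property of such a panel: each driver-day pair appears exactly once.

```r
# Hypothetical toy panel: 3 drivers x 4 days (not the case data)
toy_panel <- expand.grid(
  driver_id    = 1:3,
  booking_date = as.Date("2020-04-01") + 0:3
)
toy_panel <- toy_panel[order(toy_panel$driver_id, toy_panel$booking_date), ]

# Cross-sectional dimension (drivers) and time dimension (days)
n_drivers <- length(unique(toy_panel$driver_id))
n_days    <- length(unique(toy_panel$booking_date))

# In a driver-day panel, (driver_id, booking_date) identifies a row uniquely
n_rows <- nrow(toy_panel)
n_keys <- nrow(unique(toy_panel[, c("driver_id", "booking_date")]))

# The panel is balanced if every driver is observed on every day
is_balanced <- all(table(toy_panel$driver_id) == n_days)
```

The same uniqueness check can be run on a real driver-day table before any regression or join that relies on the panel structure.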

Check the data types of each variable
- Identify any variables that need to be converted
- Identify all economically meaningful variables

str(data_driver)

'data.frame':   93467 obs. of  7 variables:
 $ driver_id   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ booking_date: chr  "2020-04-01" "2020-04-02" "2020-04-03" "2020-04-04" ...
 $ is_work     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ income      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ n_order     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ avg_distance: num  0 0 0 0 0 0 0 0 0 0 ...
 $ city        : chr  "g" "g" "g" "g" ...

2.2 Data Wrangling

Convert the booking date data from characters to date type using the lubridate package.
- For more details on how to work on date objects in R, refer to this tutorial

# dplyr supplies %>% and mutate(); lubridate supplies ymd()
pacman::p_load(dplyr, lubridate)

# ymd(), dmy(), and mdy() parse character strings into Date objects
# lubridate can do much more; refer to the tutorial for other uses
data_driver <- data_driver %>%
  mutate(booking_date = ymd(booking_date))

# check the class of booking_date now
class(data_driver$booking_date)

[1] "Date"
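A short illustration of why the conversion matters. This sketch uses base R's as.Date() as a stand-in for the lubridate parsers, and the date strings are made up for the example: comparisons on raw character strings can give the wrong ordering, while Date objects support correct comparison and arithmetic.

```r
# Hypothetical day/month/year strings (not from the case data)
raw_a <- "01/05/2020"  # 1 May 2020
raw_b <- "30/04/2020"  # 30 April 2020

# Character comparison is alphabetical, so 1 May appears "earlier"
string_says_a_first <- raw_a < raw_b

# After parsing, comparison follows the calendar
date_a <- as.Date(raw_a, format = "%d/%m/%Y")
date_b <- as.Date(raw_b, format = "%d/%m/%Y")
date_says_a_first <- date_a < date_b

# Date arithmetic: number of days between two booking dates
parsed <- as.Date(c("2020-04-01", "2020-04-30"), format = "%Y-%m-%d")
span_days <- as.numeric(parsed[2] - parsed[1])

# ISO weekday number (1 = Monday, ..., 7 = Sunday)
weekday_first <- format(parsed[1], "%u")
```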

Please report the summary statistics of the trip data.

Table 1 reports the results.

pacman::p_load(modelsummary)
# report the summary statistics below
datasummary_skim(data_driver)

Table 1: Summary Statistics
	Unique (#)	Mean	SD	Min	Median	Max
driver_id	3223	4378.2	2503.8	1.0	4420.0	8640.0
is_work	2	0.2	0.4	0.0	0.0	1.0
income	12127	5.9	20.7	0.0	0.0	948.2
n_order	38	0.9	3.0	0.0	0.0	41.0
avg_distance	9676	1.7	5.0	0.0	0.0	181.8

2.3 COVID-19 Data

To measure the severity of COVID-19, the data science team collected the daily number of new cases in each city from the government database.

Please check the data types and correct them as needed.
- Tip: booking_date needs to be converted into a date type.

# check the structure of data below
str(data_covid)

'data.frame':   87 obs. of  4 variables:
 $ city                : chr  "g" "g" "g" "g" ...
 $ booking_date        : chr  "2020-04-01" "2020-04-02" "2020-04-03" "2020-04-04" ...
 $ new_cases           : int  1 1 1 0 3 1 0 2 0 3 ...
 $ other_city_new_cases: int  0 0 0 0 0 0 0 0 0 0 ...

# convert the data types of booking_date below

data_covid <- data_covid %>%
  mutate(booking_date = ymd(booking_date))

We can plot the trend of COVID-19 cases using the ggplot2 package.

Figure 2 plots the trend of new COVID cases in each city by date.

pacman::p_load(ggplot2, ggthemes)

ggplot() + 
  geom_line(data = data_covid,
            aes(x = booking_date,
                y = new_cases,
                color = city,
                linetype = city))+
  theme_stata() + 
  xlab('date') + ylab('new cases')

Figure 2: Number of New Cases by City and Date

2.4 Data Wrangling

Join the two datasets using dplyr
- Tip: In Weeks 2 and 3, we learned how to join data with dplyr. We can only do M:1 or 1:1 left_join().
- Please observe the data structure of the two datasets and carefully think about how we should do the data join in this case.

# many driver-day rows match one city-day row in data_covid (M:1 join)
data_driver <- data_driver %>%
  left_join(data_covid, by = c('city', 'booking_date'))
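Before trusting an M:1 join, it helps to confirm that the right-hand table has one row per key and that the join neither adds nor drops rows. The sketch below uses hypothetical toy tables and base R's merge() as a stand-in for left_join(); recent dplyr versions (1.1.0 and later) also accept relationship = "many-to-one" in left_join() to enforce the same expectation.

```r
# Hypothetical toy tables (not the case data)
toy_driver <- data.frame(
  driver_id    = c(1, 1, 2, 2),
  city         = c("g", "g", "s", "s"),
  booking_date = as.Date(c("2020-04-01", "2020-04-02",
                           "2020-04-01", "2020-04-02"))
)
toy_covid <- data.frame(
  city         = c("g", "g", "s", "s"),
  booking_date = as.Date(c("2020-04-01", "2020-04-02",
                           "2020-04-01", "2020-04-02")),
  new_cases    = c(1, 0, 3, 2)
)
keys <- c("city", "booking_date")

# The city-day table must be unique on the join keys for an M:1 join
right_is_unique <- anyDuplicated(toy_covid[, keys]) == 0

# all.x = TRUE keeps every driver-day row, like a left join
joined <- merge(toy_driver, toy_covid, by = keys, all.x = TRUE)

# Sanity checks: same number of rows, and every row found a match
rows_preserved <- nrow(joined) == nrow(toy_driver)
no_unmatched   <- !anyNA(joined$new_cases)
```

If the row count grows after a join, the right-hand table has duplicate keys; if new_cases contains NA, some city-day combinations are missing from the COVID table.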

3 Simple OLS Regressions

3.1 Key Outcome Variables

To facilitate the empirical analysis of drivers’ responses to COVID-19, the data science team has followed the literature (e.g., Farber 2008) and further aggregated trips to a higher level for each driver so that we can measure both the extensive margin (i.e., whether to work) and the intensive margin (i.e., how much to work) of drivers’ labor supply. As the COVID-19 measures vary at the daily level, the team aggregated the trip-level data to the driver-day level. Specifically, Uber cares about the following driver-day level KPI measures, which serve as the dependent variables in the subsequent empirical analysis.

- Whether or not to work. A binary outcome variable that equals 1 if a driver has at least one ride request on the day and 0 otherwise. We can use this variable to measure drivers’ shift decision, i.e., willingness to work on a day, which proxies for the extensive margin of drivers’ labor supply. It is ambiguous ex-ante how the number of new cases affects a driver’s shift decision. On the one hand, more new cases may increase the risk of infection, which decreases drivers’ expected wellbeing and therefore discourages drivers from working on a specific day; on the other hand, fewer drivers on the street means less competition among drivers and therefore higher chances of getting a passenger and potentially higher hourly earnings, which may motivate drivers to work. It is important for the ride-sharing company to understand how the severity of COVID-19 affects drivers’ willingness to work, so that the company can adjust its stimulus plans for drivers accordingly.
- Total number of completed orders. This variable contains three aspects of information that are of policy and managerial interest. First, it can proxy for the length of drivers’ daily labor supply: conditional on working, a driver who decides to work longer hours should complete a larger number of orders. Second, it contains information on consumer demand: we expect the number of orders to decrease if consumer demand for ride-sharing falls because of the COVID-19 outbreak. Finally, it can measure the intensity of competition among drivers: holding the level of demand fixed, each working driver would complete more orders when fewer drivers work on the day. Because of this complexity, it is not straightforward ex-ante how the COVID-19 measures affect the total number of orders for individual drivers.
- Earnings. Earnings measure the driver’s income from providing ride-sharing services, which is highly correlated with the number of completed orders and total trip distance. It allows us to directly assess the impact of COVID-19 on drivers’ financial wellbeing.
- Average trip distance. In our empirical context, drivers cannot reject a booking request once matched with a passenger, so the trip distance is largely determined by passengers. Since passengers may be reluctant to take long-distance trips during the pandemic, we expect a negative impact of the number of new cases on the average trip distance.
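The distinction between the extensive and intensive margins can be made concrete with a small base R sketch. The toy data below is hypothetical, and it assumes for illustration that a driver-day counts as worked when at least one order is completed (the case data defines is_work from ride requests instead).

```r
# Hypothetical driver-day records (not the case data)
toy <- data.frame(
  driver_id = c(1, 1, 2, 2, 3, 3),
  n_order   = c(0, 5, 0, 0, 3, 7)
)

# Assumption for this sketch: worked = at least one completed order
toy$is_work <- as.integer(toy$n_order > 0)

# Extensive margin: share of driver-days on which the driver worked
extensive_margin <- mean(toy$is_work)

# Intensive margin: average orders per day, conditional on working
intensive_margin <- mean(toy$n_order[toy$is_work == 1])

# The unconditional mean mixes both margins together
unconditional_mean <- mean(toy$n_order)
```

In this toy data the unconditional mean equals the product of the two margins, which is why a change in average orders alone cannot tell us whether drivers work less often or work less when they do.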

3.2 OLS regression Analyses

To empirically investigate the causal impact of COVID-19 cases on driver behavior, we can use simple OLS regression to regress the labor supply measures of driver \(i\), in city \(j\), on day \(t\) on the COVID-19 measure and other covariates as follows:

\[ Outcome_{ijt}=\beta_0+\alpha NewCases_{ijt}+X\beta+\varepsilon_{ijt} \tag{1}\]

Please run four univariate simple linear regressions, one for each of the four aforementioned dependent variables, with new cases as the only explanatory variable.

The results are reported in Table 2. Interpret the results.

pacman::p_load(fixest)
OLS_is_work <- feols(fml = is_work ~ new_cases,
      data = data_driver)
OLS_income <- feols(fml = income ~ new_cases,
      data = data_driver)
OLS_n_order <- feols(fml = n_order ~ new_cases,
      data = data_driver)
OLS_avg_distance <- feols(fml = avg_distance ~ new_cases,
      data = data_driver)

modelsummary(list("Work" = OLS_is_work,
                  "shift income" = OLS_income,
                  "# orders" = OLS_n_order,
                  "avg distance" = OLS_avg_distance),
             
             gof_map = c('nobs','r.squared'),
             stars = TRUE)

Table 2: OLS Regression Results
	Work	shift income	# orders	avg distance
(Intercept)	0.182***	5.957***	0.885***	1.656***
	(0.001)	(0.071)	(0.010)	(0.017)
new_cases	0.000	−0.122***	−0.012*	−0.003
	(0.001)	(0.036)	(0.005)	(0.009)
Num.Obs.	93467	93467	93467	93467
R2	0.000	0.000	0.000	0.000
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

3.3 Fixed Effect OLS Regressions

What confounding factors do we need to control for in the above OLS regressions? Specifically, since this is a panel data set, what fixed effects do you need to include in the OLS regression?

We first need to include driver fixed effects to control for driver-specific characteristics that may affect drivers’ labor supply patterns. Such characteristics include, but are not limited to, the driver’s socio-demographic characteristics (e.g., gender and age), the driver’s degree of risk aversion, whether the driver works full-time or part-time, and the driver’s innate ability to search for passengers.

For instance, less risk-averse drivers may prefer to work on days when there are more new cases because they expect less competition from peer drivers and potentially higher profitability on such days. As another example, full-time drivers can be more exposed to the impact of new cases than part-time drivers, because full-time drivers’ income largely comes from providing ride-sharing services via the focal company. Driver fixed effects can mitigate such driver-specific, time-invariant confounding effects and help us obtain more accurate estimates for our focal explanatory variable NewCases.

In addition to driver fixed effects, which remove cross-sectional confounding effects across drivers, we also include time fixed effects in Equation (1) to mitigate inter-temporal confounding effects. We consider time fixed effects at the day level. Moreover, given that the local government in each city may have enacted different policies on fighting COVID-19 and/or stimulating the economy (e.g., subsidizing drivers) during our data period, we further control for city fixed effects.
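A small simulation can show what driver fixed effects buy us. The sketch below is base R only (it does not use feols()), and every name and parameter value is hypothetical: the true effect of x on y is set to -0.5, and an unobserved time-invariant driver trait affects both x and y. Pooled OLS is then biased, while driver dummies, or equivalently demeaning within driver, recover the true effect.

```r
set.seed(1)
n_drivers <- 200
n_days    <- 30
driver_id <- rep(seq_len(n_drivers), each = n_days)

# Unobserved, time-invariant driver trait (e.g., risk attitude)
driver_effect <- rep(rnorm(n_drivers), each = n_days)

# The trait is correlated with the regressor and also shifts the outcome
x <- driver_effect + rnorm(n_drivers * n_days)
y <- -0.5 * x + 3 * driver_effect + rnorm(n_drivers * n_days)

# Pooled OLS ignores the driver trait and is biased
pooled <- coef(lm(y ~ x))[["x"]]

# Driver fixed effects as a full set of driver dummies
dummy <- coef(lm(y ~ x + factor(driver_id)))[["x"]]

# Equivalent within transformation: subtract each driver's own mean
x_within <- x - ave(x, driver_id)
y_within <- y - ave(y, driver_id)
within <- coef(lm(y_within ~ x_within))[["x_within"]]
```

The dummy-variable and within estimates coincide, which is why fixed effects are often described as using only the variation within each driver over time.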

FE_is_work <- feols(fml = is_work ~ new_cases|
                       driver_id + booking_date + city,
                    
      data = data_driver)
FE_income <- feols(fml = income ~ new_cases|
                       driver_id + booking_date + city,
      data = data_driver)
FE_n_order <- feols(fml = n_order ~ new_cases|
                       driver_id + booking_date + city,
      data = data_driver)
FE_avg_distance <- feols(fml = avg_distance ~ new_cases|
                       driver_id + booking_date + city,
      data = data_driver)

modelsummary(list("Work" = FE_is_work,
                  "shift income" = FE_income,
                  "# orders" = FE_n_order,
                  "avg distance" = FE_avg_distance),
             stars = TRUE,
             gof_map = 'nobs')

Table 3: Fixed Effects Regression
	Work	shift income	# orders	avg distance
new_cases	0.000	0.029	0.003	0.002
	(0.000)	(0.022)	(0.003)	(0.008)
Num.Obs.	93467	93467	93467	93467
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

4 Instrumental Variable

4.1 Potential Endogeneity

After including the driver, city, and date fixed effects in Regression (1), the remaining challenge to causal inference is the potential endogeneity of NewCases.

Equation (1) could be subject to simultaneity because drivers’ labor supply decisions and the number of new cases may be interdependent. On the one hand, drivers may adjust their labor supply according to the number of new cases. On the other hand, prior research has demonstrated the potential effect of mobility on the COVID-19 case growth rate: given the highly contagious nature of COVID-19, a city with a higher volume of private transportation through ride-sharing services may see a higher number of new cases.

4.2 Instrumental Variables

To tackle the potential endogeneity issue, we use the instrumental variable (IV) method, estimated by two-stage least squares (2SLS). The method leverages exogenous sources of variation in the explanatory variable that are uncorrelated with the error term in Equation (1). A valid instrument must also be relevant, that is, correlated with NewCases, which the first-stage regression lets us check. We can potentially select two instrumental variables. The first is imported new cases, which measures the number of infected travelers from overseas in each city as disclosed by the local government. Because imported cases relate to travelers from overseas, they should not be driven by local ride-sharing activity and should therefore meet the exogeneity requirement. The second is other city new cases, which is the number of new cases confirmed in neighboring cities. Since confirmed cases in other cities should not directly affect the focal city’s ride-sharing market, this variable should also satisfy the exogeneity requirement. The COVID-19 data set in this case contains only the second instrument (other_city_new_cases).
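A short derivation shows why an instrument must be both exogenous and relevant (correlated with NewCases). It is a sketch under simplifying assumptions: a single instrument, written here as \(Z_{ijt}\) (a symbol introduced only for this illustration), and no covariates beyond the intercept. Taking the covariance of both sides of Equation (1) with \(Z_{ijt}\) gives

\[ Cov(Z_{ijt}, Outcome_{ijt})=\alpha\, Cov(Z_{ijt}, NewCases_{ijt})+Cov(Z_{ijt}, \varepsilon_{ijt}) \]

If the instrument is exogenous, \(Cov(Z_{ijt}, \varepsilon_{ijt})=0\). If it is also relevant, \(Cov(Z_{ijt}, NewCases_{ijt})\neq 0\), so we can divide through and obtain

\[ \alpha=\frac{Cov(Z_{ijt}, Outcome_{ijt})}{Cov(Z_{ijt}, NewCases_{ijt})} \]

By contrast, the OLS slope \(Cov(NewCases_{ijt}, Outcome_{ijt})/Var(NewCases_{ijt})\) equals \(\alpha\) plus \(Cov(NewCases_{ijt}, \varepsilon_{ijt})/Var(NewCases_{ijt})\), which is non-zero whenever NewCases is endogenous.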

4.3 Manual IV Regression

The first-stage regression is specified below in Equation 2, where the definitions of variables are the same as in Equation 1.

\[ NewCases_{ijt}=\beta_0+\alpha OtherCityNewCases_{ijt}+X\beta+\varepsilon_{ijt} \tag{2}\]

# Run first stage regression: new_cases ~ other_city_new_cases + controls
IV_is_work_1ststage <- feols(fml = new_cases ~ other_city_new_cases|
                       driver_id + booking_date + city,
      data = data_driver)

# mutate predicted new_cases in data_driver
data_driver <- data_driver %>%
  mutate(predicted_new_cases = predict(IV_is_work_1ststage))

In the second stage regression, we regress the outcome variables on the predicted new cases from the 1st stage, controlling for the same set of control variables.

# Run second stage regression: is_work ~ predicted_new_cases + controls
IV_is_work_2ndstage <- feols(fml = is_work ~ predicted_new_cases|
                       driver_id + booking_date + city,
      data = data_driver)

modelsummary(list(IV_is_work_1ststage,IV_is_work_2ndstage),
             stars = TRUE)

	(1)	(2)
other_city_new_cases	−0.466***
	(0.012)
predicted_new_cases		0.000
		(0.001)
Num.Obs.	93467	93467
R2	0.491	0.662
R2 Adj.	0.473	0.650
R2 Within	0.225	0.000
R2 Within Adj.	0.225	0.000
AIC	328117.6	−7593.9
BIC	358852.8	23141.3
RMSE	1.35	0.22
Std.Errors	by: driver_id	by: driver_id
FE: driver_id	X	X
FE: booking_date	X	X
FE: city	X	X
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
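The two-stage logic can be checked on simulated data where the true effect is known. The sketch below uses base R's lm() rather than feols(), omits fixed effects, and all names and parameter values are hypothetical: the true effect of x on y is -0.5, u is an unobserved confounder, and z is an instrument that shifts x but affects y only through x.

```r
set.seed(42)
n <- 5000
z <- rnorm(n)                          # instrument
u <- rnorm(n)                          # unobserved confounder
x <- 1 + 0.8 * z + u + rnorm(n)        # endogenous regressor
y <- 2 - 0.5 * x + 2 * u + rnorm(n)    # true effect of x is -0.5

# Naive OLS is biased because x is correlated with u
ols <- coef(lm(y ~ x))[["x"]]

# Stage 1: regress the endogenous variable on the instrument
first_stage <- lm(x ~ z)
x_hat <- fitted(first_stage)

# Relevance check: first-stage F statistic (common rule of thumb: above 10)
first_stage_F <- summary(first_stage)$fstatistic[["value"]]

# Stage 2: regress the outcome on the predicted values from stage 1
tsls <- coef(lm(y ~ x_hat))[["x_hat"]]

# With one instrument, 2SLS equals the ratio of covariances
cov_ratio <- cov(z, y) / cov(z, x)
```

One caveat when running the stages by hand: the standard errors reported by the second-stage regression are not valid, because they ignore that the regressor was itself estimated in the first stage. Dedicated IV routines correct for this.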

4.4 IV Regression Using `feols()`

In fact, the feols() function in the fixest package can estimate an IV regression in a single step. We specify the endogenous variable and its instrument in a second formula, placed after the fixed effects.

IV_is_work <- feols(fml = is_work ~ 1| # Y ~ other vars except endo var
                       driver_id + booking_date + city| # fixed effects
                      new_cases ~ other_city_new_cases,# endo var ~ IV
      data = data_driver)

IV_no_order <- feols(fml = n_order ~ 1|
                       driver_id + booking_date + city|
                      new_cases ~ other_city_new_cases,
      data = data_driver)

IV_income <- feols(fml = income ~ 1|
                       driver_id + booking_date + city|
                      new_cases ~ other_city_new_cases,
      data = data_driver)

IV_avg_distance <- feols(fml = avg_distance ~ 1|
                       driver_id + booking_date + city|
                      new_cases ~ other_city_new_cases,
      data = data_driver)

modelsummary(list("Work" = IV_is_work,
                  "shift income" = IV_income,
                  "# orders" = IV_no_order,
                  "avg distance" = IV_avg_distance),
             stars = TRUE)

	Work	shift income	# orders	avg distance
fit_new_cases	0.000	0.122*	0.019**	0.023
	(0.001)	(0.053)	(0.007)	(0.015)
Num.Obs.	93467	93467	93467	93467
R2	0.662	0.555	0.607	0.420
R2 Adj.	0.650	0.539	0.593	0.399
R2 Within	0.000	0.000	0.000	0.000
R2 Within Adj.	0.000	0.000	0.000	0.000
AIC	−7593.5	762224.6	386801.7	523062.0
BIC	23141.8	792959.8	417536.9	553797.2
RMSE	0.22	13.79	1.85	3.84
Std.Errors	by: driver_id	by: driver_id	by: driver_id	by: driver_id
FE: driver_id	X	X	X	X
FE: booking_date	X	X	X	X
FE: city	X	X	X	X
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

References

Cramer, Judd, and Alan B. Krueger. 2016. “Disruptive Change in the Taxi Business: The Case of Uber.” American Economic Review 106 (5): 177–82. https://doi.org/10.1257/aer.p20161002.

Farber, Henry S. 2008. “Reference-Dependent Preferences and Labor Supply: The Case of New York City Taxi Drivers.” American Economic Review 98 (3): 1069–82.

Ford, Weixing, Jaimie W. Lien, Vladimir V. Mazalov, and Jie Zheng. 2019. “Riding to Wall Street: Determinants of Commute Time Using Citi Bike.” International Journal of Logistics Research and Applications 22 (5): 473–90.

Rysman, Marc. 2009. “The Economics of Two-Sided Markets.” Journal of Economic Perspectives 23 (3): 125–43.

Wang, Wei, Wei Miao, Yongdong Liu, Yiting Deng, and Yunfei Cao. 2022. “The Impact of COVID-19 on the Ride-Sharing Industry and Its Recovery: Causal Evidence from China.” Transportation Research Part A: Policy and Practice 155 (January): 128–41. https://doi.org/10.1016/j.tra.2021.10.005.

Footnotes

1. This case was prepared by Wei Miao, UCL School of Management, University College London, for the MSIN0094 Marketing Analytics module. This case study is heavily adapted from his research (Wang et al. 2022). This case was developed to provide material for class discussion rather than to illustrate either effective or ineffective handling of a business situation. Names and data may have been disguised or fabricated. Please do not circulate without permission. Copyrights reserved.
