Uber_analysis.Rmd

---
title: "Uber analysis"
author: "Jackyieym"
date: "2024-04-20"
output: rmarkdown::github_document
---
```{r, echo = FALSE}
knitr::opts_chunk$set(collapse=TRUE, comment="##", fig.retina=2, fig.path = "README_figs/README-")
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

This project is conduct under the Google Data Analytics Capstone and follows Track B, which is performing own case study. The company I choose is Uber, an American multinational transportation company that provides ride-hailing services, courier services, food delivery, and freight transport. The goal of this project is to analyse how some of the fundemental factors of people affect their usage of Uber and provide insights to Uber's strategy in tailoring their ride-hailing service.

## Data collection

The data used in this project is collected from survey answer. The survey was distributed in the university campus and received 32 responses. The questions asked about the gender, their major, usage of Uber, frequency, price, number of people, tier of services, main alternative, rating on important and improvement factors, including price, safety, convenient and customer services. Due to the data was collected in campus, its respondents are mostly limited to university students.

## Question

Some question are set up for analysis using `chisq.test` 
1. Is gender related to the score of important factors when using Uber In this question, `Importance_Price`, `Importance_Price`, `Importance_Price`, `Importance_Price` are selected for analysis, representing Price, Safety, Convenient and Customer Service respectively to `Gender`. 
For null hypothesis (H0), no relation between gender and the score of important factors
For alternative hypothesis (H1), there is relation between gender and the score of important factors

## Setting up Environment

```{r library, message=FALSE, warning=FALSE, paged.print=FALSE}
library(ggplot2)
library(ggcorrplot)
library(nFactors)
library(GGally)
library(dplyr)
library(patchwork)
library(readr)
```

Load the data file

```{r data, message=FALSE, warning=FALSE,}
UberProjectRawData <- read_csv("./UberProjectRawData.csv")
```
```{r view data, echo=FALSE}
UberProjectRawData
```
## Data cleaning

After filtering out those did not use Uber, columns required for analysis are selected and arrange according to gender.

```{r data clean}
s1<-UberProjectRawData %>% 
  filter(OnlineTaxiService=="Yes") %>% 
  select(2,11,12,13,14) %>% 
  arrange(Gender)
```
```{r show data clean, echo=FALSE}
s1
```

## Data visualisation

To have a quick glance on the data to get the rating by the respondents, we can use bar chart to represent the vote of rating on different factors by genders. Putting four chart together will have a better comparison. 

```{r frequency bar chart, echo=T}
bar_Price<-ggplot(s1, aes(x=Importance_Price, fill=Gender))+
  geom_bar(position = position_dodge(preserve = "single"))
bar_Safety<-ggplot(s1, aes(x=Importance_Safety,fill=Gender))+
  geom_bar(position = position_dodge(preserve = "single"))
bar_Conv<-ggplot(s1, aes(x=Importance_Conv,fill=Gender))+
  geom_bar(position = position_dodge(preserve = "single"))
bar_Cus<-ggplot(s1, aes(x=Importance_Cus, fill=Gender))+
  geom_bar(position = position_dodge(preserve = "single"))
bar_Price+bar_Safety+bar_Conv+bar_Cus+
  plot_annotation("Votes on score of different factors per gender (frequency)", 
                  theme=theme(plot.title=element_text(hjust=0.5)))
```

However, due to some catagory of gender have less response in absolute number compare to other options, their vote on rating may under-represented. Therefore, converting the data into proportion will provide a better representation.

```{r proportion bar chart, echo=T}
port1<-as.data.frame(with(s1, prop.table(table(Gender, Importance_Price), margin=1)))
port2<-as.data.frame(with(s1, prop.table(table(Gender, Importance_Safety), margin=1)))
port3<-as.data.frame(with(s1, prop.table(table(Gender, Importance_Conv), margin=1)))
port4<-as.data.frame(with(s1, prop.table(table(Gender, Importance_Cus), margin=1)))

porp_bar_Price<-ggplot(port1, aes(x=Importance_Price, y=Freq, fill=Gender))+
  geom_col(position="dodge")+
  labs(y="Proportion of count")
porp_bar_Safety<-ggplot(port2, aes(x=Importance_Safety, y=Freq, fill=Gender))+
  geom_col(position="dodge")+
  labs(y="Proportion of count")
porp_bar_Conv<-ggplot(port3, aes(x=Importance_Conv, y=Freq, fill=Gender))+
  geom_col(position="dodge")+
  labs(y="Proportion of count")
porp_bar_Cus<-ggplot(port4, aes(x=Importance_Cus, y=Freq, fill=Gender))+
  geom_col(position="dodge")+
  labs(y="Proportion of count")
porp_bar_Price+porp_bar_Safety+porp_bar_Conv+porp_bar_Cus+
  plot_annotation("Votes on score of different factors per gender (proportion)", 
                  theme=theme(plot.title=element_text(hjust=0.5)))
```

## Testing relationship

Performing the `chisq.test` can provide us the relationship between the gender and the factors

```{r test, echo=TRUE, warning=FALSE}
chisq.test(s1$Gender, s1$Importance_Price)
chisq.test(s1$Gender, s1$Importance_Safety)
chisq.test(s1$Gender, s1$Importance_Conv)
chisq.test(s1$Gender, s1$Importance_Cus)
```

## Conclusion for question 1

Because p-value of all factors is greater than 0.05, H0 is not rejected and meaning there is no relation between gender and the score of important factors.