p8105_hw3_zdz2101

Zelos Zhu 10/7/2018

Load libraries
library(tidyverse)
library(ggplot2)
library(p8105.datasets) 
library(knitr)
library(lubridate)
library(hexbin)
library(ggpubr)

Problem 1

This problem uses the BRFSS data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package.

First, do some data cleaning:

- format the data to use appropriate variable names
- focus on the “Overall Health” topic
- include only responses from “Excellent” to “Poor”
- organize responses as a factor taking levels from “Excellent” to “Poor”

data("brfss_smart2010")
brfss_smart2010 <- brfss_smart2010 %>%
  janitor::clean_names() %>%
  filter(topic == "Overall Health") %>%
  mutate(response = factor(response, levels = c("Poor", "Fair", "Good", "Very good", "Excellent")))

Using this dataset, do or answer the following (commenting on the results of each):

In 2002, which states were observed at 7 locations?

locations_2002 <- brfss_smart2010 %>%
  filter(year == 2002) %>%
  group_by(locationabbr, locationdesc) %>%
  distinct(locationdesc) %>%
  group_by(locationabbr) %>%
  count(locationabbr) %>%
  filter(n == 7)
locations_2002
## # A tibble: 3 x 2
## # Groups:   locationabbr [3]
##   locationabbr     n
##   <chr>        <int>
## 1 CT               7
## 2 FL               7
## 3 NC               7

There are 3 states that were observed at 7 locations: Connecticut, Florida, North Carolina.

Make a “spaghetti plot” that shows the number of observations in each state from 2002 to 2010.

brfss_smart2010 %>%
  group_by(locationabbr, locationdesc, year) %>%
  distinct(locationdesc) %>%
  group_by(locationabbr, year) %>%
  count(locationabbr) %>%
  ggplot(.,aes(x = year, y = n, color = locationabbr)) +
  geom_line() +
  ylab("Number of observed locations")

There seems to be a wild, outlier-like blip for Florida in 2007 that happens again in 2010. Otherwise, it looks like the majority of the states stayed pretty stable throughout the years.

Make a table showing, for the years 2002, 2006, and 2010, the mean and standard deviation of the proportion of “Excellent” responses across locations in NY State.

table_02_06_10 <- brfss_smart2010 %>%
  filter(year %in% c(2002,2006,2010) & locationabbr == "NY" & response == "Excellent") %>%
  group_by(locationabbr, year) %>%
  summarise(excellent_mean = mean(data_value),
            excellent_sd = sd(data_value))
  
kable(table_02_06_10)
| locationabbr | year | excellent_mean | excellent_sd |
|:-------------|-----:|---------------:|-------------:|
| NY           | 2002 |       24.04000 |     4.486424 |
| NY           | 2006 |       22.53333 |     4.000833 |
| NY           | 2010 |       22.70000 |     3.567212 |

The mean proportion of excellent responses in NY State stays somewhere between 22.53% and 24.04%. There also seems to be a trend where the standard deviation decreases over time.

For each year and state, compute the average proportion in each response category (taking the average across locations in a state). Make a five-panel plot that shows, for each response category separately, the distribution of these state-level averages over time.

brfss_smart2010 %>%
  group_by(locationabbr, year, response) %>%
  summarise(avg_prop_response = mean(data_value)) %>%
  ggplot(., aes(x = year, y = avg_prop_response, color = locationabbr)) + 
  geom_line() +
  facet_grid(~response) +
  theme(axis.text.x = element_text(angle = 45),
        legend.position = "bottom") + 
  guides(colour = guide_legend(nrow = 3))
## Warning: Removed 3 rows containing missing values (geom_path).

It seems the average proportion in each response category stays pretty stable throughout the years in most states. With so many states, it gets a little hard to tell which state corresponds to which line in this five-panel plot. Just by eyeballing it, the poor response is generally somewhere between 2%-10%, fair 5%-20%, good 25%-35%, very good 25%-45%, and excellent 10%-30%, respectively.

Problem 2

This problem uses the Instacart data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package (it’s called instacart).

The goal is to do some exploration of this dataset. To that end, write a short description of the dataset, noting the size and structure of the data, describing some key variables, and giving illustrative examples of observations. Then, do or answer the following (commenting on the results of each):

data("instacart")
num_unique_orders <- length(unique(instacart$order_id))
num_unique_products <- length(unique(instacart$product_name))
num_unique_departments <- length(unique(instacart$department))

The instacart dataset has 1384617 observations/rows and 15 variables/columns. There are 131209 unique orders, 39123 unique products offered, and 21 unique departments.

How many aisles are there, and which aisles are the most items ordered from?
num_unique_aisles <- length(unique(instacart$aisle))

most_items_ordered <- instacart %>%
  group_by(aisle) %>%
  count(aisle) %>%
  arrange(desc(n))
head(most_items_ordered)  
## # A tibble: 6 x 2
## # Groups:   aisle [6]
##   aisle                              n
##   <chr>                          <int>
## 1 fresh vegetables              150609
## 2 fresh fruits                  150473
## 3 packaged vegetables fruits     78493
## 4 yogurt                         55240
## 5 packaged cheese                41699
## 6 water seltzer sparkling water  36617

There are 134 aisles. The top 6 aisles that the most items are ordered from are: fresh vegetables, fresh fruits, packaged vegetables fruits, yogurt, packaged cheese, and water seltzer sparkling water. This ranking seems about right; you would expect these "aisles" to take up the most square footage/real estate in grocery stores and supermarkets.

Make a plot that shows the number of items ordered in each aisle. Order aisles sensibly, and organize your plot so others can read it.

aisle_plot <- instacart %>%
  group_by(department, aisle) %>%
  count(aisle) %>%
  arrange(department, aisle) %>%
  ggplot(., aes(x = aisle, y = n, fill = department)) +
  geom_bar(stat="identity") +
  theme(axis.text.y = element_text(size = 16)) +
  ylab("Number of items sold") + 
  coord_flip() +
  facet_wrap(~department, nrow = 7, scales = "free")

knitr::include_graphics("aisle_plot.png")

I faceted the plot by department (and colored it that way as well for viewing pleasure). It is important to note the scales: certain departments like produce sell as many as 150000 items from a single aisle, while the bulk department doesn't exceed 1000 for any particular aisle. I'm surprised the baby formula aisle sells items at such a dramatically higher rate than the other aisles in the babies department. It might be worth one day plotting by proportion, such that each department's aisle proportions add up to 1, to see who is the biggest contributor to each department; a rough sketch of that idea follows.
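As a rough sketch of that proportion idea (not part of the original analysis; it only assumes the instacart data loaded above and standard dplyr verbs), the within-department aisle shares could be computed like this:

# Sketch: share of each department's items contributed by each aisle
aisle_props <- instacart %>%
  count(department, aisle) %>%
  group_by(department) %>%
  mutate(prop = n / sum(n)) %>%   # proportions sum to 1 within each department
  ungroup() %>%
  arrange(department, desc(prop))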

Make a table showing the most popular item from the aisles “baking ingredients”, “dog food care”, and “packaged vegetables fruits”.

most_popular <- instacart %>%
  filter(aisle %in% c("baking ingredients", "dog food care", "packaged vegetables fruits")) %>%
  group_by(aisle) %>%
  count(product_name) %>%
  arrange(aisle, desc(n))

product_key <- tapply(most_popular$n, most_popular$aisle, max) # obtain the maximum count of items sold within each aisle
product_key
##         baking ingredients              dog food care 
##                        499                         30 
## packaged vegetables fruits 
##                       9784
most_popular$product_name[which(most_popular$aisle == names(product_key)[1] & most_popular$n == product_key[1])]
## [1] "Light Brown Sugar"
most_popular$product_name[which(most_popular$aisle == names(product_key)[2] & most_popular$n == product_key[2])]
## [1] "Snack Sticks Chicken & Rice Recipe Dog Treats"
most_popular$product_name[which(most_popular$aisle == names(product_key)[3] & most_popular$n == product_key[3])]
## [1] "Organic Baby Spinach"
most_popular_items <- tibble(aisle=names(product_key),
                              product_name= c(most_popular$product_name[which(most_popular$aisle == names(product_key)[1] & most_popular$n == product_key[1])],
                                              most_popular$product_name[which(most_popular$aisle == names(product_key)[2] & most_popular$n == product_key[2])],
                                              most_popular$product_name[which(most_popular$aisle == names(product_key)[3] & most_popular$n == product_key[3])]),
                              number_sold = as.numeric(product_key))
kable(most_popular_items)
| aisle                      | product_name                                  | number_sold |
|:---------------------------|:----------------------------------------------|------------:|
| baking ingredients         | Light Brown Sugar                             |         499 |
| dog food care              | Snack Sticks Chicken & Rice Recipe Dog Treats |          30 |
| packaged vegetables fruits | Organic Baby Spinach                          |        9784 |

Among baking ingredients, light brown sugar is the most popular item, which isn't all too surprising. When you think "baking" you think sweets and pastries (or at least I do), and sugar is a near-essential ingredient in whatever you make. It is interesting to note how much more often, practically >300x more, Organic Baby Spinach is ordered than the Snack Sticks Chicken & Rice Recipe Dog Treats. I suppose this is good information for deciding how much "space" to dedicate to items/aisles.
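For reference, here is a more compact sketch that should produce the same table (not part of the original analysis; it assumes dplyr 1.0+ for slice_max):

# Alternative sketch: top product per aisle in one pipeline
instacart %>%
  filter(aisle %in% c("baking ingredients", "dog food care", "packaged vegetables fruits")) %>%
  count(aisle, product_name, name = "number_sold") %>%
  group_by(aisle) %>%
  slice_max(number_sold, n = 1) %>%
  ungroup() %>%
  kable()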

Make a table showing the mean hour of the day at which Pink Lady Apples and Coffee Ice Cream are ordered on each day of the week; format this table for human readers (i.e. produce a 2 x 7 table).

desired_products <- instacart %>%
  filter(product_name %in% c("Pink Lady Apples","Coffee Ice Cream")) %>%
  select(order_hour_of_day, order_dow, product_name) %>%
  group_by(product_name, order_dow) %>%
  summarise(mean(order_hour_of_day)) %>%
  spread(., order_dow, `mean(order_hour_of_day)`)

names(desired_products) <- c("Product", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
kable(desired_products)
| Product          |   Sunday |   Monday |  Tuesday | Wednesday | Thursday |   Friday | Saturday |
|:-----------------|---------:|---------:|---------:|----------:|---------:|---------:|---------:|
| Coffee Ice Cream | 13.77419 | 14.31579 | 15.38095 |  15.31818 | 15.21739 | 12.26316 | 13.83333 |
| Pink Lady Apples | 13.44118 | 11.36000 | 11.70213 |  14.25000 | 11.55172 | 12.78431 | 11.93750 |
desired_products_rounded <- round(desired_products[,-1])
kable(cbind(desired_products[,1], desired_products_rounded))
| Product          | Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|:-----------------|-------:|-------:|--------:|----------:|---------:|-------:|---------:|
| Coffee Ice Cream |     14 |     14 |      15 |        15 |       15 |     12 |       14 |
| Pink Lady Apples |     13 |     11 |      12 |        14 |       12 |     13 |       12 |

It is important to note that the first table reports a "mean" hour of the day in military/24-hour time. For example, a mean time of 13.5 would be 1:30pm; in terms of decimals, every 0.25 is equivalent to 15 minutes. However, since the original data were recorded as whole hours, I thought it would be worthwhile to see the resulting table in rounded form (traditional rounding: 0.5 and above rounds up, round down otherwise). Coffee Ice Cream generally sells in the afternoon, a little earlier on Fridays interestingly enough. Pink Lady Apples are also generally sold mid-day, from 11-14 hours (11am - 2pm).
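As a small illustration of that conversion (a hypothetical helper, not part of the original analysis), a decimal mean hour can be turned into clock time like this:

# Hypothetical helper: convert a decimal hour (e.g. 13.77) into HH:MM clock time
decimal_hour_to_hm <- function(h) {
  sprintf("%02d:%02d", as.integer(floor(h)), as.integer(round((h - floor(h)) * 60)))
}
decimal_hour_to_hm(13.77419)  # "13:46", i.e. roughly 1:46pm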

Problem 3

This problem uses the NY NOAA data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package (it’s called ny_noaa).

The goal is to do some exploration of this dataset. To that end, write a short description of the dataset, noting the size and structure of the data, describing some key variables, and indicating the extent to which missing data is an issue. Then, do or answer the following (commenting on the results of each):

Do some data cleaning. Create separate variables for year, month, and day. Ensure observations for temperature, precipitation, and snowfall are given in reasonable units. For snowfall, what are the most commonly observed values? Why?

data("ny_noaa")
ny_noaa <- ny_noaa %>%
  mutate(prcp = prcp/10,
         snow = snow/10,
         snwd = snwd/10,
         tmax = as.numeric(tmax)/10,
         tmin = as.numeric(tmin)/10,
         year = year(date),
         month = month(date),
         day = weekdays(date))

#summary(ny_noaa$date) -- everyday 1/1/81 - 12/31/10

The ny_noaa dataset is quite large: 2595176 observations with 10 variables. A little breakdown of some key variables:

- There are 747 unique station ids in this dataset.
- Dates range over a 30-year span, from January 1st, 1981 to December 31st, 2010.
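For reference, the figures above can be checked with a few quick one-liners (a sketch, assuming the cleaned ny_noaa tibble from above):

# Quick checks behind the figures above
dim(ny_noaa)                # 2595176 rows, 10 columns (after the cleaning step above)
length(unique(ny_noaa$id))  # 747 distinct station ids
range(ny_noaa$date)         # "1981-01-01" to "2010-12-31"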

kable(t(apply(ny_noaa[,3:7], 2, summary)))
|      |  Min. | 1st Qu. | Median |       Mean | 3rd Qu. |   Max. |    NA's |
|:-----|------:|--------:|-------:|-----------:|--------:|-------:|--------:|
| prcp |   0.0 |     0.0 |    0.0 |  2.9823236 |     2.3 | 2286.0 |  145838 |
| snow |  -1.3 |     0.0 |    0.0 |  0.4987025 |     0.0 | 1016.0 |  381221 |
| snwd |   0.0 |     0.0 |    0.0 |  3.7312281 |     0.0 |  919.5 |  591786 |
| tmax | -38.9 |     5.0 |   15.0 | 13.9798384 |    23.3 |   60.0 | 1134358 |
| tmin | -59.4 |    -3.9 |    3.3 |  3.0285374 |    11.1 |   60.0 | 1134420 |

It seems NAs are quite an issue across most of the numerical variables but especially so in the temperature columns, where almost half of the values are NA's.
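To quantify that (a quick sketch, not part of the original analysis), the proportion of missing values per column can be computed directly:

# Sketch: proportion of missing values in each column; tmax and tmin come out around 0.44
round(colMeans(is.na(ny_noaa)), 3)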

ny_noaa %>%
  group_by(snow) %>%
  count(snow)
## # A tibble: 282 x 2
## # Groups:   snow [282]
##     snow       n
##    <dbl>   <int>
##  1  -1.3       1
##  2   0   2008508
##  3   0.3    8790
##  4   0.5    9748
##  5   0.8    9962
##  6   1      5106
##  7   1.3   23095
##  8   1.5    3672
##  9   1.8    3226
## 10   2      4797
## # ... with 272 more rows

The most commonly observed snowfall value is 0, which is unsurprising, considering it does not snow for most of the year.
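As a quick check of just how dominant that value is (a sketch, not in the original analysis):

# Sketch: share of non-missing snowfall readings that are exactly zero
mean(ny_noaa$snow == 0, na.rm = TRUE)  # roughly 0.9, based on the counts shown above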

Make a two-panel plot showing the average temperature in January and in July in each station across years. Is there any observable / interpretable structure? Any outliers?

ny_noaa %>%
  filter(month %in% c(1,7)) %>%
  group_by(year, month, id) %>%
  summarize(mean_max_temp = mean(tmax)) %>%
  ggplot(., aes(x = year, y = mean_max_temp, color = id)) + 
  geom_line() +
  theme(legend.position="none") +
  facet_grid(~month)
## Warning: Removed 6007 rows containing missing values (geom_path).

I would not say there is any particular interpretable structure beyond the fact that temperature varies within stations from year to year, and that on average the weather stays within a roughly 20 degree Celsius window throughout the years for both January and July as a whole. There does seem to be one outlier, though, from one station in the late 80s in July that dipped much lower than usual.

Make a two-panel plot showing (i) tmax vs tmin for the full dataset (note that a scatterplot may not be the best option); and (ii) make a plot showing the distribution of snowfall values greater than 0 and less than 100 separately by year.

funky_temps <- ny_noaa[which(ny_noaa$tmin > ny_noaa$tmax),]  # rows where tmin exceeds tmax

temp_plot <- ggplot(ny_noaa, aes(x = tmin, y = tmax)) + 
  geom_hex() +
  xlim(-60, 60) +
  ylim(-60, 60) +
  geom_abline(intercept = 0, slope = 1, color = "red") 
snow_plot <- ny_noaa %>%
  filter(snow > 0 & snow < 100) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(., aes(x = year, y = snow)) +
  geom_boxplot() + 
  coord_flip()

ggarrange(temp_plot, snow_plot, ncol = 2, nrow = 1)
## Warning: Removed 1136276 rows containing non-finite values (stat_binhex).

## Warning: Removed 5 rows containing missing values (geom_hex).

For the first panel, the tmax vs tmin plot, there were over a million values to be plotted, so it was better to use a density-based plot. It is interesting to see that there are a decent number of values where tmin seems to be greater than tmax (anything to the right of the y = x line). This does not make much sense; in fact there are 1183 observations where this is seen. For the second panel, we see that there are a lot of outliers in all years, but in general the IQR (along with the median) stays pretty consistent throughout the years.
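That 1183 figure presumably comes from the funky_temps subset created above; a one-line check would be:

# Count of physically implausible rows where tmin > tmax
nrow(funky_temps)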