p8105_hw3_zdz2101

Zelos Zhu 10/7/2018

Load libraries
library(tidyverse)
library(ggplot2)
library(p8105.datasets) 
library(knitr)
library(lubridate)
library(hexbin)
library(ggpubr)

Problem 1

This problem uses the BRFSS data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package.

First, do some data cleaning:

- format the data to use appropriate variable names
- focus on the “Overall Health” topic
- include only responses from “Excellent” to “Poor”
- organize responses as a factor taking levels from “Excellent” to “Poor”

data("brfss_smart2010")
brfss_smart2010 <- brfss_smart2010 %>%
  janitor::clean_names() %>%
  filter(topic == "Overall Health") %>%
  mutate(response = factor(response, levels = c("Poor", "Fair", "Good", "Very good", "Excellent")))

Using this dataset, do or answer the following (commenting on the results of each):

In 2002, which states were observed at 7 locations?

locations_2002 <- brfss_smart2010 %>%
  filter(year == 2002) %>%
  group_by(locationabbr, locationdesc) %>%
  distinct(locationdesc) %>%
  group_by(locationabbr) %>%
  count(locationabbr) %>%
  filter(n == 7)
locations_2002
## # A tibble: 3 x 2
## # Groups:   locationabbr [3]
##   locationabbr     n
##   <chr>        <int>
## 1 CT               7
## 2 FL               7
## 3 NC               7

There are 3 states that were observed at 7 locations: Connecticut, Florida, North Carolina.

Make a “spaghetti plot” that shows the number of observations in each state from 2002 to 2010.

brfss_smart2010 %>%
  group_by(locationabbr, locationdesc, year) %>%
  distinct(locationdesc) %>%
  group_by(locationabbr, year) %>%
  count(locationabbr) %>%
  ggplot(.,aes(x = year, y = n, color = locationabbr)) +
  geom_line() +
  ylab("Number of observed locations")

There seems to be a wild, outlier-like blip for Florida in 2007 that happens again in 2010. Otherwise, it looks like the majority of the states stayed pretty stable throughout the years.

Make a table showing, for the years 2002, 2006, and 2010, the mean and standard deviation of the proportion of “Excellent” responses across locations in NY State.

table_02_06_10 <- brfss_smart2010 %>%
  filter(year %in% c(2002,2006,2010) & locationabbr == "NY" & response == "Excellent") %>%
  group_by(locationabbr, year) %>%
  summarise(excellent_mean = mean(data_value),
            excellent_sd = sd(data_value))
  
kable(table_02_06_10)
| locationabbr | year | excellent_mean | excellent_sd |
|:-------------|-----:|---------------:|-------------:|
| NY           | 2002 |       24.04000 |     4.486424 |
| NY           | 2006 |       22.53333 |     4.000833 |
| NY           | 2010 |       22.70000 |     3.567212 |

The mean proportion of excellent responses in NY State stays somewhere between 22.53% and 24.04%. There also seems to be a trend where the standard deviation decreases over time.

For each year and state, compute the average proportion in each response category (taking the average across locations in a state). Make a five-panel plot that shows, for each response category separately, the distribution of these state-level averages over time.

brfss_smart2010 %>%
  group_by(locationabbr, year, response) %>%
  summarise(avg_prop_response = mean(data_value)) %>%
  ggplot(., aes(x = year, y = avg_prop_response, color = locationabbr)) + 
  geom_line() +
  facet_grid(~response) +
  theme(axis.text.x = element_text(angle = 45),
        legend.position = "bottom") + 
  guides(colour = guide_legend(nrow = 3))
## Warning: Removed 3 rows containing missing values (geom_path).

It seems the average proportion in each response category stays pretty stable throughout the years in most states. With so many states, it gets a little hard to tell which state corresponds to which line in this five-panel plot. Just by eyeballing it, the poor response is generally somewhere between 2%-10%, fair 5%-20%, good 25%-35%, very good 25%-45%, and excellent 10%-30%, respectively.

Problem 2

This problem uses the Instacart data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package (it’s called instacart).

The goal is to do some exploration of this dataset. To that end, write a short description of the dataset, noting the size and structure of the data, describing some key variables, and giving illustrative examples of observations. Then, do or answer the following (commenting on the results of each):

data("instacart")
num_unique_orders <- length(unique(instacart$order_id))
num_unique_products <- length(unique(instacart$product_name))
num_unique_departments <- length(unique(instacart$department))

The instacart dataset has 1384617 observations/rows and 15 variables/columns. There are 131209 unique orders, 39123 unique products offered, and 21 unique departments.

How many aisles are there, and which aisles are the most items ordered from?
num_unique_aisles <- length(unique(instacart$aisle))

most_items_ordered <- instacart %>%
  group_by(aisle) %>%
  count(aisle) %>%
  arrange(desc(n))
head(most_items_ordered)  
## # A tibble: 6 x 2
## # Groups:   aisle [6]
##   aisle                              n
##   <chr>                          <int>
## 1 fresh vegetables              150609
## 2 fresh fruits                  150473
## 3 packaged vegetables fruits     78493
## 4 yogurt                         55240
## 5 packaged cheese                41699
## 6 water seltzer sparkling water  36617

There are 134 aisles. The top 6 aisles that the most items are ordered from are: fresh vegetables, fresh fruits, packaged vegetables fruits, yogurt, packaged cheese, and water seltzer sparkling water. This ranking seems about right; you would expect these "aisles" to take up the most square footage/real estate in grocery stores and supermarkets.

Make a plot that shows the number of items ordered in each aisle. Order aisles sensibly, and organize your plot so others can read it.

aisle_plot <- instacart %>%
  group_by(department, aisle) %>%
  count(aisle) %>%
  arrange(department, aisle) %>%
  ggplot(., aes(x = aisle, y = n, fill = department)) +
  geom_bar(stat="identity") +
  theme(axis.text.y = element_text(size = 16)) +
  ylab("Number of items sold") + 
  coord_flip() +
  facet_wrap(~department, nrow = 7, scales = "free")

knitr::include_graphics("aisle_plot.png")

I faceted the plot by department (and colored it that way as well for viewing pleasure). It is important to note the scales: certain departments like produce sell as many as 150000 items from a single aisle, while the bulk department doesn't exceed 1000 for any particular aisle. I'm surprised the baby formula aisle sells items at such a dramatically higher rate than the other aisles in the babies department. It might be worth one day plotting by proportion, such that each department's aisle proportions add up to 1, to see who is the biggest contributor to each department; a rough sketch of that idea follows.
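As a rough sketch of that proportion idea (not part of the original analysis; it only assumes the instacart data loaded above and standard dplyr verbs), the within-department aisle shares could be computed like this:

# Sketch: share of each department's items contributed by each aisle
aisle_props <- instacart %>%
  count(department, aisle) %>%
  group_by(department) %>%
  mutate(prop = n / sum(n)) %>%   # proportions sum to 1 within each department
  ungroup() %>%
  arrange(department, desc(prop))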

Make a table showing the most popular item from the aisles “baking ingredients”, “dog food care”, and “packaged vegetables fruits”.

most_popular <- instacart %>%
  filter(aisle %in% c("baking ingredients", "dog food care", "packaged vegetables fruits")) %>%
  group_by(aisle) %>%
  count(product_name) %>%
  arrange(aisle, desc(n))

product_key <- tapply(most_popular$n, most_popular$aisle, max) # obtain the maximum count of items sold within each aisle
product_key
##         baking ingredients              dog food care 
##                        499                         30 
## packaged vegetables fruits 
##                       9784
most_popular$product_name[which(most_popular$aisle == names(product_key)[1] & most_popular$n == product_key[1])]
## [1] "Light Brown Sugar"
most_popular$product_name[which(most_popular$aisle == names(product_key)[2] & most_popular$n == product_key[2])]
## [1] "Snack Sticks Chicken & Rice Recipe Dog Treats"
most_popular$product_name[which(most_popular$aisle == names(product_key)[3] & most_popular$n == product_key[3])]
## [1] "Organic Baby Spinach"
most_popular_items <- tibble(aisle=names(product_key),
                              product_name= c(most_popular$product_name[which(most_popular$aisle == names(product_key)[1] & most_popular$n == product_key[1])],
                                              most_popular$product_name[which(most_popular$aisle == names(product_key)[2] & most_popular$n == product_key[2])],
                                              most_popular$product_name[which(most_popular$aisle == names(product_key)[3] & most_popular$n == product_key[3])]),
                              number_sold = as.numeric(product_key))
kable(most_popular_items)
| aisle                      | product_name                                  | number_sold |
|:---------------------------|:----------------------------------------------|------------:|
| baking ingredients         | Light Brown Sugar                             |         499 |
| dog food care              | Snack Sticks Chicken & Rice Recipe Dog Treats |          30 |
| packaged vegetables fruits | Organic Baby Spinach                          |        9784 |

Among baking ingredients, light brown sugar is the most popular item, which isn't all too surprising. When you think "baking" you think sweets and pastries (or at least I do), and sugar is a near-essential ingredient in whatever you make. It is interesting to note how much more often, practically >300x more, Organic Baby Spinach is ordered than the Snack Sticks Chicken & Rice Recipe Dog Treats. I suppose this is good information for deciding how much "space" to dedicate to items/aisles.
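For reference, here is a more compact sketch that should produce the same table (not part of the original analysis; it assumes dplyr 1.0+ for slice_max):

# Alternative sketch: top product per aisle in one pipeline
instacart %>%
  filter(aisle %in% c("baking ingredients", "dog food care", "packaged vegetables fruits")) %>%
  count(aisle, product_name, name = "number_sold") %>%
  group_by(aisle) %>%
  slice_max(number_sold, n = 1) %>%
  ungroup() %>%
  kable()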

Make a table showing the mean hour of the day at which Pink Lady Apples and Coffee Ice Cream are ordered on each day of the week; format this table for human readers (i.e. produce a 2 x 7 table).

desired_products <- instacart %>%
  filter(product_name %in% c("Pink Lady Apples","Coffee Ice Cream")) %>%
  select(order_hour_of_day, order_dow, product_name) %>%
  group_by(product_name, order_dow) %>%
  summarise(mean(order_hour_of_day)) %>%
  spread(., order_dow, `mean(order_hour_of_day)`)

names(desired_products) <- c("Product", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
kable(desired_products)
| Product          |   Sunday |   Monday |  Tuesday | Wednesday | Thursday |   Friday | Saturday |
|:-----------------|---------:|---------:|---------:|----------:|---------:|---------:|---------:|
| Coffee Ice Cream | 13.77419 | 14.31579 | 15.38095 |  15.31818 | 15.21739 | 12.26316 | 13.83333 |
| Pink Lady Apples | 13.44118 | 11.36000 | 11.70213 |  14.25000 | 11.55172 | 12.78431 | 11.93750 |
desired_products_rounded <- round(desired_products[,-1])
kable(cbind(desired_products[,1], desired_products_rounded))
| Product          | Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
|:-----------------|-------:|-------:|--------:|----------:|---------:|-------:|---------:|
| Coffee Ice Cream |     14 |     14 |      15 |        15 |       15 |     12 |       14 |
| Pink Lady Apples |     13 |     11 |      12 |        14 |       12 |     13 |       12 |

It is important to note that the first table reports a "mean" hour of the day in military/24-hour time. For example, a mean time of 13.5 would be 1:30pm; in terms of decimals, every 0.25 is equivalent to 15 minutes. However, since the original data were recorded as whole hours, I thought it would be worthwhile to see the resulting table in rounded form (traditional rounding: 0.5 and above rounds up, round down otherwise). Coffee Ice Cream generally sells in the afternoon, a little earlier on Fridays interestingly enough. Pink Lady Apples are also generally sold mid-day, from 11-14 hours (11am - 2pm).
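As a small illustration of that conversion (a hypothetical helper, not part of the original analysis), a decimal mean hour can be turned into clock time like this:

# Hypothetical helper: convert a decimal hour (e.g. 13.77) into HH:MM clock time
decimal_hour_to_hm <- function(h) {
  sprintf("%02d:%02d", as.integer(floor(h)), as.integer(round((h - floor(h)) * 60)))
}
decimal_hour_to_hm(13.77419)  # "13:46", i.e. roughly 1:46pm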

Problem 3

This problem uses the NY NOAA data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package (it’s called ny_noaa).

The goal is to do some exploration of this dataset. To that end, write a short description of the dataset, noting the size and structure of the data, describing some key variables, and indicating the extent to which missing data is an issue. Then, do or answer the following (commenting on the results of each):

Do some data cleaning. Create separate variables for year, month, and day. Ensure observations for temperature, precipitation, and snowfall are given in reasonable units. For snowfall, what are the most commonly observed values? Why?

data("ny_noaa")
ny_noaa <- ny_noaa %>%
  mutate(prcp = prcp/10,
         snow = snow/10,
         snwd = snwd/10,
         tmax = as.numeric(tmax)/10,
         tmin = as.numeric(tmin)/10,
         year = year(date),
         month = month(date),
         day = weekdays(date))

#summary(ny_noaa$date) -- everyday 1/1/81 - 12/31/10

The ny_noaa dataset is quite large: 2595176 observations with 10 variables. A little breakdown of some key variables:

- There are 747 unique station ids in this dataset.
- Dates range over a 30-year span, from January 1st, 1981 to December 31st, 2010.
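For reference, the figures above can be checked with a few quick one-liners (a sketch, assuming the cleaned ny_noaa tibble from above):

# Quick checks behind the figures above
dim(ny_noaa)                # 2595176 rows, 10 columns (after the cleaning step above)
length(unique(ny_noaa$id))  # 747 distinct station ids
range(ny_noaa$date)         # "1981-01-01" to "2010-12-31"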

kable(t(apply(ny_noaa[,3:7], 2, summary)))
|      |  Min. | 1st Qu. | Median |       Mean | 3rd Qu. |   Max. |    NA's |
|:-----|------:|--------:|-------:|-----------:|--------:|-------:|--------:|
| prcp |   0.0 |     0.0 |    0.0 |  2.9823236 |     2.3 | 2286.0 |  145838 |
| snow |  -1.3 |     0.0 |    0.0 |  0.4987025 |     0.0 | 1016.0 |  381221 |
| snwd |   0.0 |     0.0 |    0.0 |  3.7312281 |     0.0 |  919.5 |  591786 |
| tmax | -38.9 |     5.0 |   15.0 | 13.9798384 |    23.3 |   60.0 | 1134358 |
| tmin | -59.4 |    -3.9 |    3.3 |  3.0285374 |    11.1 |   60.0 | 1134420 |

It seems NAs are quite an issue across most of the numerical variables but especially so in the temperature columns, where almost half of the values are NA's.
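To quantify that (a quick sketch, not part of the original analysis), the proportion of missing values per column can be computed directly:

# Sketch: proportion of missing values in each column; tmax and tmin come out around 0.44
round(colMeans(is.na(ny_noaa)), 3)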

ny_noaa %>%
  group_by(snow) %>%
  count(snow)
## # A tibble: 282 x 2
## # Groups:   snow [282]
##     snow       n
##    <dbl>   <int>
##  1  -1.3       1
##  2   0   2008508
##  3   0.3    8790
##  4   0.5    9748
##  5   0.8    9962
##  6   1      5106
##  7   1.3   23095
##  8   1.5    3672
##  9   1.8    3226
## 10   2      4797
## # ... with 272 more rows

The most commonly observed snowfall value is 0, which is unsurprising, considering it does not snow for most of the year.
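As a quick check of just how dominant that value is (a sketch, not in the original analysis):

# Sketch: share of non-missing snowfall readings that are exactly zero
mean(ny_noaa$snow == 0, na.rm = TRUE)  # roughly 0.9, based on the counts shown above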

Make a two-panel plot showing the average temperature in January and in July in each station across years. Is there any observable / interpretable structure? Any outliers?

ny_noaa %>%
  filter(month %in% c(1,7)) %>%
  group_by(year, month, id) %>%
  summarize(mean_max_temp = mean(tmax)) %>%
  ggplot(., aes(x = year, y = mean_max_temp, color = id)) + 
  geom_line() +
  theme(legend.position="none") +
  facet_grid(~month)
## Warning: Removed 6007 rows containing missing values (geom_path).

I would not say there is any particular interpretable structure beyond the fact that temperature varies within stations from year to year, and that on average the weather stays within a roughly 20 degree Celsius window throughout the years for both January and July as a whole. There does seem to be one outlier, though, from one station in the late 80s in July that dipped much lower than usual.

Make a two-panel plot showing (i) tmax vs tmin for the full dataset (note that a scatterplot may not be the best option); and (ii) make a plot showing the distribution of snowfall values greater than 0 and less than 100 separately by year.

funky_temps <- ny_noaa[which(ny_noaa$tmin > ny_noaa$tmax),]  # rows where tmin exceeds tmax

temp_plot <- ggplot(ny_noaa, aes(x = tmin, y = tmax)) + 
  geom_hex() +
  xlim(-60, 60) +
  ylim(-60, 60) +
  geom_abline(intercept = 0, slope = 1, color = "red") 
snow_plot <- ny_noaa %>%
  filter(snow > 0 & snow < 100) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(., aes(x = year, y = snow)) +
  geom_boxplot() + 
  coord_flip()

ggarrange(temp_plot, snow_plot, ncol = 2, nrow = 1)
## Warning: Removed 1136276 rows containing non-finite values (stat_binhex).

## Warning: Removed 5 rows containing missing values (geom_hex).

For the first panel, the tmax vs tmin plot, there were over a million values to be plotted, so it was better to use a density-based plot. It is interesting to see that there are a decent number of values where tmin seems to be greater than tmax (anything to the right of the y = x line). This does not make much sense; in fact there are 1183 observations where this is seen. For the second panel, we see that there are a lot of outliers in all years, but in general the IQR (along with the median) stays pretty consistent throughout the years.
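That 1183 figure presumably comes from the funky_temps subset created above; a one-line check would be:

# Count of physically implausible rows where tmin > tmax
nrow(funky_temps)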