Zelos Zhu 10/7/2018
library(tidyverse)
library(ggplot2)
library(p8105.datasets)
library(knitr)
library(lubridate)
library(hexbin)
library(ggpubr)
This problem uses the BRFSS data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package.
First, do some data cleaning:
- format the data to use appropriate variable names
- focus on the “Overall Health” topic
- include only responses from “Excellent” to “Poor”
- organize responses as a factor taking levels from “Excellent” to “Poor”
data("brfss_smart2010")
brfss_smart2010 <- brfss_smart2010 %>%
janitor::clean_names() %>%
filter(topic == "Overall Health") %>%
mutate(response = factor(response, levels = c("Poor", "Fair", "Good", "Very good", "Excellent")))
locations_2002 <- brfss_smart2010 %>%
filter(year == 2002) %>%
group_by(locationabbr, locationdesc) %>%
distinct(locationdesc) %>%
group_by(locationabbr) %>%
count(locationabbr) %>%
filter(n == 7)
locations_2002
## # A tibble: 3 x 2
## # Groups: locationabbr [3]
## locationabbr n
## <chr> <int>
## 1 CT 7
## 2 FL 7
## 3 NC 7
There are 3 states that were observed at 7 locations: Connecticut, Florida, and North Carolina.
brfss_smart2010 %>%
group_by(locationabbr, locationdesc, year) %>%
distinct(locationdesc) %>%
group_by(locationabbr, year) %>%
count(locationabbr) %>%
ggplot(.,aes(x = year, y = n, color = locationabbr)) +
geom_line() +
ylab("Number of observed locations")
There is a wild, outlier-like spike in the number of observed Florida locations in 2007, and again in 2010. Otherwise, the majority of states stayed fairly stable in their number of observed locations throughout the years.
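To confirm the Florida spikes, one could tabulate the number of distinct Florida locations per year; a minimal sketch (not part of the original analysis):

# Count distinct observed Florida locations per year, largest first
brfss_smart2010 %>%
  filter(locationabbr == "FL") %>%
  distinct(locationdesc, year) %>%
  count(year) %>%
  arrange(desc(n))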
Make a table showing, for the years 2002, 2006, and 2010, the mean and standard deviation of the proportion of “Excellent” responses across locations in NY State.
table_02_06_10 <- brfss_smart2010 %>%
filter(year %in% c(2002,2006,2010) & locationabbr == "NY" & response == "Excellent") %>%
group_by(locationabbr, year) %>%
summarise(excellent_mean = mean(data_value),
excellent_sd = sd(data_value))
kable(table_02_06_10)
locationabbr | year | excellent_mean | excellent_sd |
---|---|---|---|
NY | 2002 | 24.04000 | 4.486424 |
NY | 2006 | 22.53333 | 4.000833 |
NY | 2010 | 22.70000 | 3.567212 |
The mean proportion of “Excellent” responses in NY State stays between 22.53% and 24.04%. The standard deviation appears to decrease over time.
For each year and state, compute the average proportion in each response category (taking the average across locations in a state). Make a five-panel plot that shows, for each response category separately, the distribution of these state-level averages over time.
brfss_smart2010 %>%
group_by(locationabbr, year, response) %>%
summarise(avg_prop_response = mean(data_value)) %>%
ggplot(., aes(x = year, y = avg_prop_response, color = locationabbr)) +
geom_line() +
facet_grid(~response) +
theme(axis.text.x = element_text(angle = 45),
legend.position = "bottom") +
guides(colour = guide_legend(nrow = 3))
## Warning: Removed 3 rows containing missing values (geom_path).
The average proportion in each response category stays fairly stable over the years in most states. With so many states, it is hard to tell which state corresponds to which line in this five-panel plot, but eyeballing it, the “Poor” response is generally somewhere between 2%-10%, “Fair” 5%-20%, “Good” 25%-35%, “Very good” 25%-45%, and “Excellent” 10%-30%.
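To put rough numbers on those eyeballed ranges, one could summarise the same state-level averages by response category; a minimal sketch (not part of the original analysis):

# Observed range of state-level average proportions for each response category
brfss_smart2010 %>%
  group_by(locationabbr, year, response) %>%
  summarise(avg_prop_response = mean(data_value)) %>%
  group_by(response) %>%
  summarise(min_avg = min(avg_prop_response, na.rm = TRUE),
            max_avg = max(avg_prop_response, na.rm = TRUE))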
This problem uses the Instacart data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package (it’s called instacart).
The goal is to do some exploration of this dataset. To that end, write a short description of the dataset, noting the size and structure of the data, describing some key variables, and giving illustrative examples of observations. Then, do or answer the following (commenting on the results of each):
data("instacart")
num_unique_orders <- length(unique(instacart$order_id))
num_unique_products <- length(unique(instacart$product_name))
num_unique_departments <- length(unique(instacart$department))
The instacart dataset has 1384617 observations/rows and 15 variables/columns. There were 131209 unique orders made. There are 39123 unique products offered from 21 unique departments.
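For completeness, the row and column counts quoted above come straight from the data's dimensions (the order, product, and department counts are computed in the chunk above):

dim(instacart)  # 1384617 rows, 15 columns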
num_unique_aisles <- length(unique(instacart$aisle))
most_items_ordered <- instacart %>%
group_by(aisle) %>%
count(aisle) %>%
arrange(desc(n))
head(most_items_ordered)
## # A tibble: 6 x 2
## # Groups: aisle [6]
## aisle n
## <chr> <int>
## 1 fresh vegetables 150609
## 2 fresh fruits 150473
## 3 packaged vegetables fruits 78493
## 4 yogurt 55240
## 5 packaged cheese 41699
## 6 water seltzer sparkling water 36617
There are 134 aisles. The six aisles with the most items ordered are: fresh vegetables, fresh fruits, packaged vegetables fruits, yogurt, packaged cheese, and water seltzer sparkling water. This ranking seems about right: these are the "aisles" you would expect to take up the most square footage/real estate in grocery stores and supermarkets.
Make a plot that shows the number of items ordered in each aisle. Order aisles sensibly, and organize your plot so others can read it.
aisle_plot <- instacart %>%
group_by(department, aisle) %>%
count(aisle) %>%
arrange(department, aisle) %>%
ggplot(., aes(x = aisle, y = n, fill = department)) +
geom_bar(stat="identity") +
theme(axis.text.y = element_text(size = 16)) +
ylab("Number of items sold") +
coord_flip() +
facet_wrap(~department, nrow = 7, scales = "free")
knitr::include_graphics("aisle_plot.png")
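The PNG included here is presumably produced by writing the aisle_plot object to disk before knitting; a minimal sketch using ggsave (the file name matches include_graphics(); the width and height are assumed values, not from the original):

# Save the faceted aisle plot to the PNG referenced above
ggsave("aisle_plot.png", aisle_plot, width = 14, height = 20)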
I faceted the plot by department (and colored it that way as well for easier viewing). It is important to note the scale: certain departments like produce sell as many as 150000 items from a single aisle, while the bulk department doesn't exceed 1000 for any particular aisle. I'm surprised the baby formula aisle sells so much more than the other, similar aisles in the babies department. It might be worth plotting by proportion one day, so that each department's aisle proportions add up to 1, to see which aisle is the biggest contributor to each department.
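A rough sketch of that proportion idea, computing each aisle's share of its department's orders (not part of the original analysis):

# Each aisle's share of its department's ordered items; shares sum to 1 within a department
instacart %>%
  count(department, aisle) %>%
  group_by(department) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(department, desc(prop))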
Make a table showing the most popular item from aisles “baking ingredients”, “dog food care”, and “packaged vegetables fruits”
most_popular <- instacart %>%
filter(aisle %in% c("baking ingredients", "dog food care", "packaged vegetables fruits")) %>%
group_by(aisle) %>%
count(product_name) %>%
arrange(aisle, desc(n))
product_key <- tapply(most_popular$n, most_popular$aisle, max) # maximum count of any single product within each aisle
product_key
## baking ingredients dog food care
## 499 30
## packaged vegetables fruits
## 9784
most_popular$product_name[which(most_popular$aisle == names(product_key)[1] & most_popular$n == product_key[1])]
## [1] "Light Brown Sugar"
most_popular$product_name[which(most_popular$aisle == names(product_key)[2] & most_popular$n == product_key[2])]
## [1] "Snack Sticks Chicken & Rice Recipe Dog Treats"
most_popular$product_name[which(most_popular$aisle == names(product_key)[3] & most_popular$n == product_key[3])]
## [1] "Organic Baby Spinach"
most_popular_items <- tibble(aisle=names(product_key),
product_name= c(most_popular$product_name[which(most_popular$aisle == names(product_key)[1] & most_popular$n == product_key[1])],
most_popular$product_name[which(most_popular$aisle == names(product_key)[2] & most_popular$n == product_key[2])],
most_popular$product_name[which(most_popular$aisle == names(product_key)[3] & most_popular$n == product_key[3])]),
number_sold = as.numeric(product_key))
kable(most_popular_items)
aisle | product_name | number_sold |
---|---|---|
baking ingredients | Light Brown Sugar | 499 |
dog food care | Snack Sticks Chicken & Rice Recipe Dog Treats | 30 |
packaged vegetables fruits | Organic Baby Spinach | 9784 |
Among baking ingredients, light brown sugar is the most popular item, which isn't all too surprising: when you think "baking" you think sweets and pastries (or at least I do), and sugar is a near-essential ingredient in whatever you make. It is also interesting to note how much more the baby spinach is ordered than the snack sticks & rice recipe dog treats, practically more than 300x more. I suppose this is good information for deciding how much "space" to dedicate to items/aisles.
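For reference, the same three-row summary can be obtained more compactly by keeping, within each aisle, the row with the maximum count (an alternative sketch, equivalent to the tapply/which construction above):

# Most popular product per aisle, directly from the most_popular counts
most_popular %>%
  group_by(aisle) %>%
  filter(n == max(n)) %>%
  rename(number_sold = n)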
Make a table showing the mean hour of the day at which Pink Lady Apples and Coffee Ice Cream are ordered on each day of the week; format this table for human readers (i.e. produce a 2 x 7 table).
desired_products <- instacart %>%
filter(product_name %in% c("Pink Lady Apples","Coffee Ice Cream")) %>%
select(order_hour_of_day, order_dow, product_name) %>%
group_by(product_name, order_dow) %>%
summarise(mean_hour = mean(order_hour_of_day)) %>%
spread(order_dow, mean_hour)
names(desired_products) <- c("Product", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
kable(desired_products)
Product | Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
---|---|---|---|---|---|---|---|
Coffee Ice Cream | 13.77419 | 14.31579 | 15.38095 | 15.31818 | 15.21739 | 12.26316 | 13.83333 |
Pink Lady Apples | 13.44118 | 11.36000 | 11.70213 | 14.25000 | 11.55172 | 12.78431 | 11.93750 |
desired_products_rounded <- round(desired_products[,-1])
kable(cbind(desired_products[,1], desired_products_rounded))
Product | Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
---|---|---|---|---|---|---|---|
Coffee Ice Cream | 14 | 14 | 15 | 15 | 15 | 12 | 14 |
Pink Lady Apples | 13 | 11 | 12 | 14 | 12 | 13 | 12 |
It is important to note that the first table reports a "mean" hour of the day in military/24-hour time. For example, a mean time of 13.5 would be 1:30pm; every 0.25 of an hour is equivalent to 15 minutes. However, the original data recorded whole hours, so it is worth seeing the resulting table rounded to the nearest whole hour as well. Coffee Ice Cream generally sells in the afternoon, interestingly a little earlier on Fridays. Pink Lady Apples are also generally sold mid-day, from 11 to 14 hours (11am - 2pm).
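A hypothetical helper (not in the original) to translate a decimal mean hour into a clock time, which makes the first table easier to read:

# Convert a decimal hour such as 13.77 into an "HH:MM" string
hour_to_clock <- function(h) {
  sprintf("%02d:%02d", as.integer(floor(h)), as.integer(round((h - floor(h)) * 60)))
}
hour_to_clock(13.77419)  # "13:46", i.e. roughly 1:46pm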
This problem uses the NY NOAA data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package (it’s called ny_noaa).
The goal is to do some exploration of this dataset. To that end, write a short description of the dataset, noting the size and structure of the data, describing some key variables, and indicating the extent to which missing data is an issue. Then, do or answer the following (commenting on the results of each):
Do some data cleaning. Create separate variables for year, month, and day. Ensure observations for temperature, precipitation, and snowfall are given in reasonable units. For snowfall, what are the most commonly observed values? Why?
data("ny_noaa")
ny_noaa <- ny_noaa %>%
mutate(prcp = prcp/10, # tenths of mm -> mm
snow = snow/10, # mm -> cm
snwd = snwd/10, # mm -> cm
tmax = as.numeric(tmax)/10, # tenths of degrees C -> degrees C
tmin = as.numeric(tmin)/10, # tenths of degrees C -> degrees C
year = year(date),
month = month(date),
day = weekdays(date)) # day of the week
# summary(ny_noaa$date) -- every day from 1/1/81 to 12/31/10
The ny_noaa dataset is quite large: 2595176 observations and 10 variables. A quick breakdown of some key variables:
- There are 747 unique station id's in this dataset.
- Dates span 30 years, from January 1st, 1981 to December 31st, 2010.
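The station count and date range quoted above can be checked directly; a minimal sketch (consistent with the commented-out summary below):

n_distinct(ny_noaa$id)   # 747 distinct station id's
range(ny_noaa$date)      # "1981-01-01" to "2010-12-31"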
kable(t(apply(ny_noaa[,3:7], 2, summary)))
variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
---|---|---|---|---|---|---|---|
prcp | 0.0 | 0.0 | 0.0 | 2.9823236 | 2.3 | 2286.0 | 145838 |
snow | -1.3 | 0.0 | 0.0 | 0.4987025 | 0.0 | 1016.0 | 381221 |
snwd | 0.0 | 0.0 | 0.0 | 3.7312281 | 0.0 | 919.5 | 591786 |
tmax | -38.9 | 5.0 | 15.0 | 13.9798384 | 23.3 | 60.0 | 1134358 |
tmin | -59.4 | -3.9 | 3.3 | 3.0285374 | 11.1 | 60.0 | 1134420 |
It seems NAs are quite an issue across most of the numerical variables but especially so in the temperature columns, where almost half of the values are NA's.
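The share of missing temperature values can be computed directly; a quick sketch (not part of the original analysis):

# Proportion of missing values in the temperature columns
mean(is.na(ny_noaa$tmax))  # about 0.44
mean(is.na(ny_noaa$tmin))  # about 0.44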
ny_noaa %>%
group_by(snow) %>%
count(snow)
## # A tibble: 282 x 2
## # Groups: snow [282]
## snow n
## <dbl> <int>
## 1 -1.3 1
## 2 0 2008508
## 3 0.3 8790
## 4 0.5 9748
## 5 0.8 9962
## 6 1 5106
## 7 1.3 23095
## 8 1.5 3672
## 9 1.8 3226
## 10 2 4797
## # ... with 272 more rows
The most commonly observed snowfall value is 0, which is unsurprising considering that it does not snow on most days of the year.
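Sorting the same counts makes the ranking of common values explicit; a short sketch (not part of the original analysis):

# Most common snowfall values, largest count first
ny_noaa %>%
  count(snow, sort = TRUE) %>%
  head(5)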
Make a two-panel plot showing the average temperature in January and in July in each station across years. Is there any observable / interpretable structure? Any outliers?
ny_noaa %>%
filter(month %in% c(1,7)) %>%
group_by(year, month, id) %>%
summarize(mean_max_temp = mean(tmax)) %>%
ggplot(., aes(x = year, y = mean_max_temp, color = id)) +
geom_line() +
theme(legend.position="none") +
facet_grid(~month)
## Warning: Removed 6007 rows containing missing values (geom_path).
I would not say there is any particular interpretable structure, beyond the fact that temperature varies within stations from year to year and that, on average, the mean maximum temperature stays within a roughly 20 degree Celsius window across the years for both January and July. There does seem to be one outlier, though: a station whose July mean in the late 80s dipped much lower than usual.
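One way to locate that low July outlier would be to rank the station-year means; a sketch (not in the original; na.rm is used here so stations with a few missing days are still included):

# Lowest July station-year mean maximum temperatures
ny_noaa %>%
  filter(month == 7) %>%
  group_by(year, id) %>%
  summarize(mean_max_temp = mean(tmax, na.rm = TRUE)) %>%
  arrange(mean_max_temp) %>%
  head(3)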
Make a two-panel plot showing (i) tmax vs tmin for the full dataset (note that a scatterplot may not be the best option); and (ii) make a plot showing the distribution of snowfall values greater than 0 and less than 100 separately by year.
funky_temps <- ny_noaa[which(ny_noaa$tmin > ny_noaa$tmax),] # observations where tmin exceeds tmax (count referenced below)
temp_plot <- ggplot(ny_noaa, aes(x = tmin, y = tmax)) +
geom_hex() +
xlim(-60, 60) +
ylim(-60, 60) +
geom_abline(intercept = 0, slope = 1, color = "red")
snow_plot <- ny_noaa %>%
filter(snow > 0 & snow < 100) %>%
mutate(year = as.factor(year)) %>%
ggplot(., aes(x = year, y = snow)) +
geom_boxplot() +
coord_flip()
ggarrange(temp_plot, snow_plot, ncol = 2, nrow = 1)
## Warning: Removed 1136276 rows containing non-finite values (stat_binhex).
## Warning: Removed 5 rows containing missing values (geom_hex).
For the first panel, the tmax vs tmin plot, there were over a million values to plot, so a density-based plot was a better option than a scatterplot. It is interesting to see that there is a decent number of values where tmin is greater than tmax (anything to the right of the y = x line); this does not make much sense, and in fact there are 1183 observations where this occurs. For the second panel, we see that there are a lot of outliers in all years, but in general the IQR (along with the median) stays fairly consistent throughout the years.
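The count quoted above comes directly from the funky_temps subset defined earlier:

nrow(funky_temps)  # 1183 observations with tmin > tmax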