weather variables moving to missing over time #24

Closed
ismayc opened this issue Dec 26, 2024 · 4 comments

@ismayc
Contributor

ismayc commented Dec 26, 2024

I've been exploring the weather data frames further and noticed that several columns (temp, dewp, humid, precip, and pressure) have become far less populated over the last 10 years. Here is some analysis:

library(nycflights23)
suppressPackageStartupMessages(library(pnwflights22))
suppressPackageStartupMessages(library(nycflights13))
library(readr)
library(tibble)

# Turn off scientific notation and increase output width
options(scipen = 999, width = 200)

nyc_weather13 <- nycflights13::weather
nyc_weather23 <- nycflights23::weather
pnw_weather22 <- anyflights::get_weather(c("SEA", "PDX"), 2022)
atl_weather21 <- anyflights::get_weather("ATL", 2021)
ord_weather20 <- anyflights::get_weather("ORD", 2020)
lax_weather19 <- anyflights::get_weather("LAX", 2019)
dfw_weather18 <- anyflights::get_weather("DFW", 2018)
den_weather17 <- anyflights::get_weather("DEN", 2017)
phl_weather16 <- anyflights::get_weather("PHL", 2016)
sfo_weather15 <- anyflights::get_weather("SFO", 2015)
phx_weather14 <- anyflights::get_weather("PHX", 2014)

# Determine the percentage of each column that is missing for
# each of the data frames above
nyc_weather13_missing <- colMeans(is.na(nyc_weather13)) * 100
nyc_weather23_missing <- colMeans(is.na(nyc_weather23)) * 100
pnw_weather22_missing <- colMeans(is.na(pnw_weather22)) * 100
atl_weather21_missing <- colMeans(is.na(atl_weather21)) * 100
ord_weather20_missing <- colMeans(is.na(ord_weather20)) * 100
lax_weather19_missing <- colMeans(is.na(lax_weather19)) * 100
dfw_weather18_missing <- colMeans(is.na(dfw_weather18)) * 100
den_weather17_missing <- colMeans(is.na(den_weather17)) * 100
phl_weather16_missing <- colMeans(is.na(phl_weather16)) * 100
sfo_weather15_missing <- colMeans(is.na(sfo_weather15)) * 100
phx_weather14_missing <- colMeans(is.na(phx_weather14)) * 100

# Put these together into a data frame where each row is a
# weather variable and each column is one of the data sets,
# ordered by year
missing_df <- data.frame(
  nyc_weather13 = nyc_weather13_missing,
  phx_weather14 = phx_weather14_missing,
  sfo_weather15 = sfo_weather15_missing,
  phl_weather16 = phl_weather16_missing,
  den_weather17 = den_weather17_missing,
  dfw_weather18 = dfw_weather18_missing,  
  lax_weather19 = lax_weather19_missing,  
  ord_weather20 = ord_weather20_missing,
  atl_weather21 = atl_weather21_missing,
  pnw_weather22 = pnw_weather22_missing,
  nyc_weather23 = nyc_weather23_missing,
  stringsAsFactors = FALSE
)

missing_df
#>            nyc_weather13 phx_weather14 sfo_weather15 phl_weather16 den_weather17 dfw_weather18 lax_weather19 ord_weather20 atl_weather21 pnw_weather22 nyc_weather23
#> origin       0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000
#> year         0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000
#> month        0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000
#> day          0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000
#> hour         0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000
#> temp         0.003829217    0.01145082    0.01148237   65.89014503    98.0769231    98.9922126    99.4391026    83.8566046    99.0155678   97.07481825   97.45077087
#> dewp         0.003829217    0.01145082    0.01148237   65.89014503    98.0769231    98.9922126    99.4391026    83.8566046    99.0155678   97.07481825   97.45077087
#> humid        0.003829217    0.01145082    0.01148237   65.89014503    98.0769231    98.9922126    99.4391026    83.8566046    99.0155678   97.07481825   97.45077087
#> wind_dir     1.761439786    6.79033551    0.72338960    4.01964143     5.5059524    10.8108108     1.4766484     2.3176162     2.7815934    1.60856374    4.65577774
#> wind_speed   0.015316868    0.60689339    0.01148237    3.48292794     5.4372711     6.5620705     0.6753663     0.4452563     0.3434066    0.48657622    3.94214624
#> wind_gust   79.563469271    0.60689339    0.01148237    3.48292794     5.4372711     6.5620705     0.6753663     0.4452563     0.3434066    0.48657622    3.94214624
#> precip       0.000000000    0.00000000    0.00000000   64.39419893    97.1268315    96.6445259    98.1112637    81.2764014    96.5201465   93.06199553   93.92077545
#> pressure    10.449932989    1.13363105    8.55436904   71.29153820    98.7522894    99.5190105    99.5077839    86.2198881    99.1643773   98.00217528   97.81712716
#> visib        0.000000000    0.01145082    0.01148237    0.01141944     0.1488095     0.1030692     0.5952381     0.1484188     0.0000000    0.01717328    0.09158907
#> time_hour    0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000

Created on 2024-12-26 with reprex v2.1.1
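
The per-data-frame repetition in the reprex could be collapsed by keeping the data frames in a named list. This is only a minimal sketch: `weather_list`, `toy_a`, and `toy_b` are hypothetical names, and the values are made up so the example runs without downloading anything.

```r
# Toy stand-ins for the per-year weather data frames (values are made up).
weather_list <- list(
  toy_a = data.frame(temp = c(50, 51, NA, 53), precip = c(0, 0, 0, 0)),
  toy_b = data.frame(temp = c(NA, NA, NA, 60), precip = c(NA, 0, NA, NA))
)

# Percentage of missing values per column; sapply() returns a matrix with
# one row per variable and one column per data set.
missing_df <- as.data.frame(
  sapply(weather_list, function(df) colMeans(is.na(df)) * 100)
)

missing_df
```

With the real data, the list entries would be the `nyc_weather13`, `phx_weather14`, ... objects created above, assuming they all share the same column names.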

@ismayc
Contributor Author

ismayc commented Dec 26, 2024

Digging more into the get_weather_for_station() function, I think the issue stems from how weather_raw is cleaned in these steps:

    # remove duplicates / incompletes
    dplyr::group_by(origin, month, day, hour) %>%
    dplyr::filter(dplyr::row_number() == 1) %>%
    dplyr::ungroup() %>%

[Screenshot 2024-12-26 at 1 52 41 PM]

The screenshot shows that temperature, dew point, and humidity are recorded only once per hour, while the raw data has several rows per hour. Selecting the first row after grouping therefore often keeps a row in which these values are missing.
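
To make the mechanism concrete, here is a small sketch with made-up data. The tibble `weather_raw_toy`, the station code "XYZ", and the `minute` column are assumptions for illustration only; they are not the real structure of `weather_raw`.

```r
library(dplyr)
library(tibble)

# Made-up sub-hourly observations: temp is only reported at minute 51.
weather_raw_toy <- tribble(
  ~origin, ~month, ~day, ~hour, ~minute, ~temp,    ~wind_speed,
  "XYZ",   1,      1,    0,     5,       NA_real_, 7,
  "XYZ",   1,      1,    0,     25,      NA_real_, 8,
  "XYZ",   1,      1,    0,     51,      39.9,     6,
  "XYZ",   1,      1,    1,     10,      NA_real_, 5,
  "XYZ",   1,      1,    1,     51,      39.0,     9
)

# Same de-duplication step as quoted above: keep the first row per hour.
first_rows <- weather_raw_toy %>%
  group_by(origin, month, day, hour) %>%
  filter(row_number() == 1) %>%
  ungroup()

# Both retained rows have a missing temp, although each hour had a reading.
first_rows$temp
```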

@ismayc
Contributor Author

ismayc commented Dec 26, 2024

I think adding these lines just above the "# remove duplicates / incompletes" step does the trick?

    # fill in missing values with the only value given in a particular hour
    dplyr::group_by(time_hour) %>%  # Group data by 'time_hour'
    tidyr::fill(temp, dewp, humid, precip, pressure, .direction = "downup") %>%  # Fill NAs
    dplyr::ungroup() %>% # Ungroup to return to original data structure

[Screenshot 2024-12-26 at 2 22 42 PM]
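
A rough way to check the effect of the proposed fill step is to run it on made-up data before the de-duplication. As before, `weather_raw_toy` and the station code "XYZ" are hypothetical, and the `time_hour` column is built here only so the sketch is self-contained.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up sub-hourly observations: temp is reported once per hour.
weather_raw_toy <- tribble(
  ~origin, ~month, ~day, ~hour, ~temp,    ~wind_speed,
  "XYZ",   1,      1,    0,     NA_real_, 7,
  "XYZ",   1,      1,    0,     NA_real_, 8,
  "XYZ",   1,      1,    0,     39.9,     6,
  "XYZ",   1,      1,    1,     NA_real_, 5,
  "XYZ",   1,      1,    1,     39.0,     9
) %>%
  mutate(time_hour = ISOdatetime(2023, month, day, hour, 0, 0, tz = "UTC"))

filled_first_rows <- weather_raw_toy %>%
  # proposed step: spread the hourly reading across all rows of that hour
  group_by(time_hour) %>%
  fill(temp, .direction = "downup") %>%
  ungroup() %>%
  # existing step: keep the first row per hour
  group_by(origin, month, day, hour) %>%
  filter(row_number() == 1) %>%
  ungroup()

# The first row of each hour now carries that hour's reading.
filled_first_rows$temp
```

An hour with no reading at all would still come out as NA, since there is nothing within the group to fill from.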

@ismayc
Contributor Author

ismayc commented Dec 26, 2024

Tried to fix in PR #25

@simonpcouch
Owner

Resolved in #25. Thanks again!
