weather variables moving to missing over time #24

ismayc · 2024-12-26T19:00:46Z

I've been exploring the weather data frames further and noticed that some columns have become far less populated over the last 10 years. Here is some analysis:

library(nycflights23)
suppressPackageStartupMessages(library(pnwflights22))
suppressPackageStartupMessages(library(nycflights13))
library(readr)
library(tibble)

# Turn off scientific notation and increase output width
options(scipen = 999, width = 200)

nyc_weather13 <- nycflights13::weather
nyc_weather23 <- nycflights23::weather
pnw_weather22 <- anyflights::get_weather(c("SEA", "PDX"), 2022)
atl_weather21 <- anyflights::get_weather("ATL", 2021)
ord_weather20 <- anyflights::get_weather("ORD", 2020)
lax_weather19 <- anyflights::get_weather("LAX", 2019)
dfw_weather18 <- anyflights::get_weather("DFW", 2018)
den_weather17 <- anyflights::get_weather("DEN", 2017)
phl_weather16 <- anyflights::get_weather("PHL", 2016)
sfo_weather15 <- anyflights::get_weather("SFO", 2015)
phx_weather14 <- anyflights::get_weather("PHX", 2014)

# Determine the percentage of missing values in each column of
# each of the data frames above
nyc_weather13_missing <- colMeans(is.na(nyc_weather13)) * 100
nyc_weather23_missing <- colMeans(is.na(nyc_weather23)) * 100
pnw_weather22_missing <- colMeans(is.na(pnw_weather22)) * 100
atl_weather21_missing <- colMeans(is.na(atl_weather21)) * 100
ord_weather20_missing <- colMeans(is.na(ord_weather20)) * 100
lax_weather19_missing <- colMeans(is.na(lax_weather19)) * 100
dfw_weather18_missing <- colMeans(is.na(dfw_weather18)) * 100
den_weather17_missing <- colMeans(is.na(den_weather17)) * 100
phl_weather16_missing <- colMeans(is.na(phl_weather16)) * 100
sfo_weather15_missing <- colMeans(is.na(sfo_weather15)) * 100
phx_weather14_missing <- colMeans(is.na(phx_weather14)) * 100

# Combine these into one data frame with one column per data set
# and one row per weather variable (the row names are the
# column names of the weather data frames)
missing_df <- data.frame(
  nyc_weather13 = nyc_weather13_missing,
  phx_weather14 = phx_weather14_missing,
  sfo_weather15 = sfo_weather15_missing,
  phl_weather16 = phl_weather16_missing,
  den_weather17 = den_weather17_missing,
  dfw_weather18 = dfw_weather18_missing,
  lax_weather19 = lax_weather19_missing,
  ord_weather20 = ord_weather20_missing,
  atl_weather21 = atl_weather21_missing,
  pnw_weather22 = pnw_weather22_missing,
  nyc_weather23 = nyc_weather23_missing,
  stringsAsFactors = FALSE
)

missing_df
#>            nyc_weather13 phx_weather14 sfo_weather15 phl_weather16 den_weather17 dfw_weather18 lax_weather19 ord_weather20 atl_weather21 pnw_weather22 nyc_weather23
#> origin       0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000
#> year         0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000
#> month        0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000
#> day          0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000
#> hour         0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000
#> temp         0.003829217    0.01145082    0.01148237   65.89014503    98.0769231    98.9922126    99.4391026    83.8566046    99.0155678   97.07481825   97.45077087
#> dewp         0.003829217    0.01145082    0.01148237   65.89014503    98.0769231    98.9922126    99.4391026    83.8566046    99.0155678   97.07481825   97.45077087
#> humid        0.003829217    0.01145082    0.01148237   65.89014503    98.0769231    98.9922126    99.4391026    83.8566046    99.0155678   97.07481825   97.45077087
#> wind_dir     1.761439786    6.79033551    0.72338960    4.01964143     5.5059524    10.8108108     1.4766484     2.3176162     2.7815934    1.60856374    4.65577774
#> wind_speed   0.015316868    0.60689339    0.01148237    3.48292794     5.4372711     6.5620705     0.6753663     0.4452563     0.3434066    0.48657622    3.94214624
#> wind_gust   79.563469271    0.60689339    0.01148237    3.48292794     5.4372711     6.5620705     0.6753663     0.4452563     0.3434066    0.48657622    3.94214624
#> precip       0.000000000    0.00000000    0.00000000   64.39419893    97.1268315    96.6445259    98.1112637    81.2764014    96.5201465   93.06199553   93.92077545
#> pressure    10.449932989    1.13363105    8.55436904   71.29153820    98.7522894    99.5190105    99.5077839    86.2198881    99.1643773   98.00217528   97.81712716
#> visib        0.000000000    0.01145082    0.01148237    0.01141944     0.1488095     0.1030692     0.5952381     0.1484188     0.0000000    0.01717328    0.09158907
#> time_hour    0.000000000    0.00000000    0.00000000    0.00000000     0.0000000     0.0000000     0.0000000     0.0000000     0.0000000    0.00000000    0.00000000

Created on 2024-12-26 with reprex v2.1.1
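
As a side note, the per-data-set bookkeeping in the reprex can be collapsed by keeping the data frames in a named list. The sketch below is only an illustration: `weather_list`, `toy_a`, and `toy_b` are hypothetical names, and the values are made up so the example runs without downloading anything.

```r
# Toy stand-ins for the weather data frames (hypothetical, made-up values).
weather_list <- list(
  toy_a = data.frame(temp = c(50, NA, 52, NA), visib = c(10, 10, 9, 10)),
  toy_b = data.frame(temp = c(NA, NA, NA, 61), visib = c(10, NA, 10, 10))
)

# Percentage of missing values per column, for every data frame at once.
# sapply() returns a matrix with one row per variable and one column per
# data set, provided all data frames share the same columns.
missing_pct <- sapply(weather_list, function(df) colMeans(is.na(df)) * 100)
missing_pct
```

Swapping the toy entries for the real data frames (for example `nyc_weather13 = nycflights13::weather`) should give the same layout as `missing_df` above.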


ismayc · 2024-12-26T20:55:30Z

Digging more into the get_weather_for_station() function, I think the issue stems from how weather_raw is cleaned in these steps:

    # remove duplicates / incompletes
    dplyr::group_by(origin, month, day, hour) %>%
    dplyr::filter(dplyr::row_number() == 1) %>%
    dplyr::ungroup() %>%

The attachment shows that temperature, dew point, and humidity are recorded only once per hour. If that reading is not on the first row of its hour, keeping only the first row after grouping drops it, which introduces many missing values.
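
To make the failure mode concrete, here is a minimal sketch on made-up data. The data frame `weather_raw_toy`, its `minute` column, and all values are assumptions for illustration and are not taken from the real `weather_raw`.

```r
library(dplyr)

# Hypothetical raw observations: three reports within the same hour,
# only the last of which carries a temperature reading.
weather_raw_toy <- data.frame(
  origin = "JFK",
  month = 1L, day = 1L, hour = 0L,
  minute = c(5L, 25L, 51L),
  temp = c(NA, NA, 39.0),
  wind_speed = c(10.4, 11.5, 10.4)
)

# Same de-duplication pattern as the cleaning step quoted above.
first_row_only <- weather_raw_toy %>%
  group_by(origin, month, day, hour) %>%
  filter(row_number() == 1) %>%
  ungroup()

# The surviving row is the 00:05 report, so the only temperature
# reading for the hour is lost.
first_row_only$temp
```

In this toy case `wind_speed` survives because every report carries it, which is consistent with the low missingness for the wind columns in the table above.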

ismayc · 2024-12-26T21:22:57Z

I think adding in these lines of code just above # remove duplicates / incompletes does the trick?

    # fill in missing values with the only value given in a particular hour
    dplyr::group_by(time_hour) %>%  # Group data by 'time_hour'
    tidyr::fill(temp, dewp, humid, precip, pressure, .direction = "downup") %>%  # Fill NAs
    dplyr::ungroup() %>% # Ungroup to return to original data structure
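
Here is a small sketch of what that fill step does before de-duplication, again on made-up data. The data frame `toy` and its values are hypothetical, and grouping by `origin` and `time_hour` in the last step is a simplification of the `origin, month, day, hour` grouping used in the package.

```r
library(dplyr)
library(tidyr)

# Two hypothetical hours: three reports in the first, two in the second.
# In each hour only the last report has a temperature.
hours <- as.POSIXct(c("2023-01-01 00:00:00", "2023-01-01 01:00:00"), tz = "UTC")
toy <- data.frame(
  origin = "JFK",
  time_hour = rep(hours, times = c(3, 2)),
  temp = c(NA, NA, 39.0, NA, 41.0)
)

# Within each hour, copy the available reading down and then up,
# so every row in that hour carries it.
filled <- toy %>%
  group_by(time_hour) %>%
  fill(temp, .direction = "downup") %>%
  ungroup()

# Keeping the first row per hour now retains the reading.
hourly <- filled %>%
  group_by(origin, time_hour) %>%
  filter(row_number() == 1) %>%
  ungroup()

hourly$temp
```

Because the fill is done inside `group_by(time_hour)`, a reading is never carried across hours: an hour with no reading at all stays NA.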

ismayc · 2024-12-26T21:56:18Z

Tried to fix in PR #25

simonpcouch · 2025-01-13T15:25:53Z

Resolved in #25. Thanks again!

ismayc mentioned this issue Dec 26, 2024

Updates to weather to remove missing values / adjust tz #25

Merged

simonpcouch closed this as completed Jan 13, 2025
