rstats-tartu/transform-data-with-dplyr

 
 

Data transformation using dplyr (aka five verbs)

Intro

In our previous classes we worked with a small, cleaned-up dataset to go through the steps of creating some of the most common visualization types.

In your workflow you are going to need data visualization at two points: during exploratory data analysis, when you get to know your dataset, and during report preparation, when you communicate what you have found. This is not a two-stop trip; it is more like a roundabout, an iterative process in which you pass these two points multiple times after "tweaking" your data. By "tweaking" I mean data transformation and/or modeling.

You need to transform your data during analysis because in real life you rarely start with a dataset that is in the right form for visualization and modeling. So you will often need to:

  • summarise your data,
  • create new variables,
  • rename variables, or
  • reorder the observations.

We are going to use the dplyr package from the tidyverse to learn how to carry out these tasks.
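
As a rough preview, the four tasks listed above map onto the dplyr functions `summarise()`, `mutate()`, `rename()` and `arrange()`. The sketch below uses a small made-up table; the `tests` object and its column names are hypothetical and are not part of the course dataset.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
tests <- tibble(
  county     = c("Harju", "Tartu", "Harju", "Rapla"),
  n_tests    = c(120, 45, 80, 10),
  n_positive = c(12, 3, 4, 1)
)

result <- tests %>%
  rename(tests_total = n_tests) %>%                       # rename a variable
  mutate(positive_share = n_positive / tests_total) %>%   # create a new variable
  arrange(desc(positive_share))                           # reorder the observations

# Summarise the data: one row per county with totals
summary_tbl <- tests %>%
  group_by(county) %>%
  summarise(n_tests = sum(n_tests), n_positive = sum(n_positive))

result
summary_tbl
```

Each verb takes a data frame as its first argument and returns a new data frame, which is why the steps chain naturally with the pipe.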

Sources

Again, we closely follow the R4DS book, chapter "Data transformation", available from http://r4ds.had.co.nz/transform.html. More examples are available from https://rstats-tartu.github.io/lectures/tidyverse.html#dplyr-ja-selle-viis-verbi

Dataset

The Estonian COVID-19 test data were downloaded from the Estonian Health Board open data portal https://www.terviseamet.ee/et/koroonaviirus/avaandmed and contain positive and negative test results with test dates, along with metadata about each subject's gender, age group, country, and county. The whole dataset was sampled down to 5% of the original data. The dataset was downloaded and prepared using the get_data.R script.

(covid_tests <- readr::read_csv("https://raw.githubusercontent.com/rstats-tartu/transform-data-with-dplyr/main/data/covid_tests.csv"))
#> Rows: 91087 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (5): Gender, AgeGroup, Country, County, ResultValue
#> dbl  (2): wk, yr
#> date (1): StatisticsDate
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 91,087 × 8
#>    Gender AgeGroup Country  County        ResultValue StatisticsDate    wk    yr
#>    <chr>  <chr>    <chr>    <chr>         <chr>       <date>         <dbl> <dbl>
#>  1 M      50-54    Eesti    Harju maakond N           2020-09-16        38  2020
#>  2 M      45-49    Eesti    Põlva maakond N           2021-02-13         7  2021
#>  3 N      40-44    Eesti    Jõgeva maako… N           2020-12-22        51  2020
#>  4 M      80-84    Eesti    Rapla maakond N           2021-03-15        11  2021
#>  5 M      <NA>     Tundmatu <NA>          N           2021-07-11        28  2021
#>  6 N      30-34    Eesti    Harju maakond N           2021-01-05         1  2021
#>  7 N      10-14    Eesti    Harju maakond N           2021-07-01        26  2021
#>  8 N      70-74    Eesti    Harju maakond P           2021-07-12        28  2021
#>  9 M      60-64    Eesti    Tartu maakond N           2021-06-09        23  2021
#> 10 N      35-39    Eesti    Harju maakond N           2020-11-18        47  2020
#> # … with 91,077 more rows

Created on 2021-09-14 by the reprex package (v2.0.1)
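
To give a feel for what a first transformation of this table could look like, here is a minimal sketch that counts tests and positive results per year. To stay self-contained it retypes the `ResultValue` and `yr` columns of the ten rows printed above into a hypothetical `covid_head` tibble instead of downloading the file, and it assumes that `"P"` marks a positive and `"N"` a negative result.

```r
library(dplyr)

# The first ten rows of the printed tibble, retyped by hand (two columns only)
covid_head <- tibble(
  ResultValue = c("N", "N", "N", "N", "N", "N", "N", "P", "N", "N"),
  yr          = c(2020, 2021, 2020, 2021, 2021, 2021, 2021, 2021, 2021, 2020)
)

by_year <- covid_head %>%
  group_by(yr) %>%
  summarise(
    n_tests    = n(),                       # number of tests per year
    n_positive = sum(ResultValue == "P")    # assumption: "P" means positive
  ) %>%
  mutate(positive_share = n_positive / n_tests)

by_year
```

The same pipeline should work on the full `covid_tests` object, since it relies only on the `ResultValue` and `yr` columns shown in the output above.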
