
Consider changing default format for dataframes to arrow or CSV #666

Open
juliasilge opened this issue Oct 28, 2022 · 9 comments
Labels
feature (a feature request or enhancement)

Comments

@juliasilge
Member

We have seen users who write pins using the default format from R, and are then frustrated when their Python colleagues can't read them. We have considered changing to arrow for a long time:

# Might consider switch to arrow in the future

Does arrow have enough usage in the community for this to be reasonable? It would be a much better choice if interoperability is one of the main reasons people use pins (to read with Python).
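
For context, this is what opting into arrow looks like today, compared with the current default (a quick sketch; it assumes the arrow package is installed):

library(pins)

b <- board_temp()

# Current default for data frames is rds
b %>% pin_write(mtcars, "mtcars-default")

# Explicitly choosing arrow instead
b %>% pin_write(mtcars, "mtcars-arrow", type = "arrow")
b %>% pin_read("mtcars-arrow")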

@juliasilge
Member Author

We recently moved arrow to Suggests in #646 so this would likely mean some new users would be prompted to install another package, even when using the defaults.

@machow
Collaborator

machow commented Nov 1, 2022

Adding arrow as a requirement seems like it could introduce some friction (maybe?). I wonder if the audience for pins might lean toward CSV (for example, this pins blog post aims at an audience that is emailing CSVs, so maybe emailing CSV -> stashing CSV with pins might feel like a smaller step?).

(This is me mostly thinking of pins as a very early stepping stone for data versioning / sharing, since I'd personally be very into storing everything in arrow/parquet!)

@iainmwallace

I would suggest CSV as the default. We often share via Connect, and it is frustrating for non-R/Python users when they go to the Connect landing page for a dataset and can't download the file in a format they can easily open or understand.

@juliasilge
Member Author

Reading CSV via read.csv() often has downsides, like type guessing that goes wrong, dates that aren't handled, etc. If we consider changing the default to CSV, would it be better (less surprising overall, easier collaboration with Python folks, etc.) to use vroom for reading and writing?
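
A small illustration of the difference (a sketch; the file contents here are made up):

tmp <- tempfile(fileext = ".csv")
writeLines(c("id,when,score", "1,2022-10-28,3.5", "2,2022-11-01,4"), tmp)

# Base R: types depend on guessing, and 'when' comes back as character
str(utils::read.csv(tmp))

# vroom: types can be declared up front, so 'when' is parsed as a date
str(vroom::vroom(tmp, col_types = vroom::cols(when = vroom::col_date())))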

@juliasilge juliasilge changed the title from "Consider changing default format for dataframes to arrow" to "Consider changing default format for dataframes to arrow or CSV" Nov 4, 2022
@juliasilge juliasilge added the feature label Nov 4, 2022
@wibeasley

I agree about the downsides of CSVs, especially the lack of explicit variable types. When pins saves a CSV, could it save a second file that stores the variable info? Essentially a serialized/dput-ed readr::col_types object?

I don't like having to redefine (a) integer vs. floating point, and (b) factor levels.

If the data is later imported by pins, pins would look for the metadata and use it. But the CSV is still valid and can be read by other programs that don't know how to interpret the "mtcars.readr_col_types" plain-text file. The metadata file isn't critical; it's just optional gravy.
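
A rough sketch of what that sidecar could look like (the file name and workflow here are illustrative only, not an existing pins feature):

library(readr)

# Write the CSV plus a plain-text file holding the compact column spec
write_csv(mtcars, "mtcars.csv")
writeLines(as.character(as.col_spec(mtcars)), "mtcars.readr_col_types")

# Readers that know about the sidecar use it; everyone else just reads the CSV
col_spec <- if (file.exists("mtcars.readr_col_types")) {
  readLines("mtcars.readr_col_types")
} else {
  NULL
}
read_csv("mtcars.csv", col_types = col_spec)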

@juliasilge
Member Author

@wibeasley That is an interesting suggestion! As of now, we would recommend that folks follow this vignette for managing custom formats, like reading CSVs with more control:

library(pins)
library(palmerpenguins)

b <- board_temp()

penguin_col_spec <- as.character(readr::as.col_spec(penguins))
penguin_col_spec
#> [1] "ffddiifi"

b %>% 
  pin_write(
    penguins, 
    "very-nice-penguins",
    type = "csv",
    metadata = list(col_spec = penguin_col_spec)
  )
#> Creating new version '20230223T212321Z-809e9'
#> Writing to pin 'very-nice-penguins'

new_col_spec <- pin_meta(b, "very-nice-penguins")$user$col_spec
pin_download(b, "very-nice-penguins") %>%
  readr::read_csv(col_types = new_col_spec)
#> # A tibble: 344 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
#>    <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
#>  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
#>  2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
#>  3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
#>  4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
#>  5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
#>  6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
#>  7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
#>  8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
#>  9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
#> 10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
#> # … with 334 more rows, and abbreviated variable names ¹​flipper_length_mm,
#> #   ²​body_mass_g

Created on 2023-02-23 with reprex v2.0.2

Those last two bits could be wrapped up in a pin_read_col_spec() helper function for an individual to use, if they always wanted to set up their files this way.
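
One possible shape for such a helper (a sketch only, not an existing pins function; it assumes the pin was written with a col_spec entry in its metadata, as above):

pin_read_col_spec <- function(board, name, ...) {
  # Pull the column spec stored in the pin's user metadata
  col_spec <- pins::pin_meta(board, name)$user$col_spec
  # Download the underlying CSV and read it with that spec
  path <- pins::pin_download(board, name)
  readr::read_csv(path, col_types = col_spec, ...)
}

b %>% pin_read_col_spec("very-nice-penguins")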

@leslem

leslem commented Apr 15, 2024

An argument to specify a CSV reading function (e.g. read.csv, readr::read_csv, or data.table::fread) would be good for the use case that led me to this issue. Or passing arguments on to read.csv would be helpful.

I have a colleague who's writing pins from Python as type='csv', and then I want to read them in R, but with read.csv under the hood I get column names modified (to be syntactic) and column types I don't want. For now I'm going to do pin_download() and then readr::read_csv() to get the data read in the way I'd like.
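
That workaround, spelled out (a sketch; the board and pin name here are hypothetical):

path <- pins::pin_download(board, "shared-from-python")
df <- readr::read_csv(path, show_col_types = FALSE)
# readr keeps the original column names instead of making them syntactic,
# and its type guessing can be overridden via col_types if needed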

@juliasilge
Member Author

For now I'm going to do pin_download() and then readr::read_csv() to get the data read in the way I'd like.

This is definitely the right thing to do for now.

CSV writing can be so cantankerous, especially if you are using R and Python or something else. Have you talked with your colleague about considering switching to parquet? Is there a particular constraint that makes that not a good move?

@juliasilge
Member Author

I was just thinking about the problem reported by @leslem again today, and how it highlights that switching to CSV will not really solve all user pain around this issue.

In rstudio/pins-python#231 @isabelizimm added support for reading .rds files from Python, which means that Python users will be able to read rectangular data written from R with the current default. The rdata package, which powers that PR, works with the binary types of R objects, so it's kind of like a poor man's arrow. It really improves the situation. I would still recommend that R + Python collaborators use parquet, but with that change on the Python side, maybe we don't want to change the default format for dataframes, at least not without a lightweight option for reading/writing parquet.
