
Chunkwise support for read.fst? #269

Open
hope-data-science opened this issue May 21, 2022 · 3 comments

@hope-data-science

Recently I came across a package named chunked. Since fst supports row access via row numbers, I suggest that the read.fst function could support this sort of chunkwise operation. Any thoughts on including this as a new feature? Are there possible solutions?

Thanks.

@MarcusKlik MarcusKlik self-assigned this Nov 16, 2022
@MarcusKlik MarcusKlik added this to the Candidate milestone Nov 16, 2022
@MarcusKlik
Collaborator

Hi @hope-data-science, thanks for posting your request.

I'm interested in learning your specific use case for chunked data: are your calculations memory constrained, or would you like to process chunks in parallel to speed up calculations?

Because fst reads are already multithreaded, using chunks won't do much for the read (and write) times, but it will reduce the memory needed. However, many operations cannot easily be performed on chunks and then correctly combined into an overall result. For example, the median cannot be applied to chunks and then combined later; you would need additional logic for that.

(the same applies to any function where some ordering of column data is needed)
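
To illustrate that point with a quick sketch (not from the thread, just an example on skewed data): the median of per-chunk medians generally differs from the overall median, so chunk results can't simply be combined.

# sketch: the median of chunk medians is not the overall median in general
set.seed(1)
x <- rexp(1000)                           # skewed data makes the gap visible
chunks <- split(x, rep(1:8, each = 125))  # 8 equal-size chunks of 125 values

median(x)                                 # overall median
median(sapply(chunks, median))            # median of the chunk medians (usually differs)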

@MarcusKlik
Collaborator

But indeed, you can use the from and to arguments to read chunks if you want:

library(dplyr)
library(fst)

tmp_file <- tempfile(fileext = "fst")

# write sample fst file
data.frame(
  X = sample(1:100, 1000, replace = TRUE)
) %>%
  write_fst(tmp_file)

# determine chunks
nr_of_chunks <- 8
chunk_size <- metadata_fst(tmp_file)$nrOfRows / nr_of_chunks

# custom function to run on each chunk
my_funct <- function(tbl, chunk) {
  tbl %>%
    summarise(
      Mean = mean(X)
    ) %>%
    mutate(
      Chunk = chunk
    )
}

# run custom function on each chunk
z <- lapply(1:nr_of_chunks, function(chunk, custom_function) {
  y <- read_fst(
    tmp_file,
    from = 1 + (chunk - 1) * chunk_size,
    to = chunk * chunk_size
  )

  custom_function(y, chunk)
}, my_funct) %>%
  bind_rows()

print(z)
#>     Mean Chunk
#> 1 51.680     1
#> 2 46.936     2
#> 3 51.304     3
#> 4 47.824     4
#> 5 52.000     5
#> 6 53.712     6
#> 7 55.440     7
#> 8 51.256     8

From there, how you need to combine the chunks depends on the actual custom function used; in this case:

z %>%
  summarise(
    Mean = mean(Mean)
  )
#>     Mean
#> 1 51.269
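
A caveat on that combination (an assumption worth stating, not spelled out in the thread): the plain mean of the chunk means is exact only because every chunk holds the same number of rows. For unequal chunks, one sketch would be to have the per-chunk function also record a row count and weight by it:

# sketch: variant of my_funct that also records the chunk size
my_funct_weighted <- function(tbl, chunk) {
  tbl %>%
    summarise(
      Mean = mean(X),
      N    = n()        # rows in this chunk
    ) %>%
    mutate(Chunk = chunk)
}

# weighted combination for chunks of unequal size
# (z would need to be built with my_funct_weighted for the N column to exist)
z %>%
  summarise(Mean = weighted.mean(Mean, N))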

@hope-data-science
Author

I think from and to do a good job. I just wonder whether the read speed slows down as from and to change. If it doesn't, I think this issue is done.
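
One way to check that empirically (a sketch, assuming the microbenchmark package is available) is to time identically sized reads taken from different positions in a larger file:

library(fst)
library(microbenchmark)

# hypothetical larger test file with 10 million rows
bench_file <- tempfile(fileext = ".fst")
write_fst(data.frame(X = runif(1e7)), bench_file)

# time same-size chunk reads from the start, middle and end of the file
microbenchmark(
  first  = read_fst(bench_file, from = 1,       to = 1000000),
  middle = read_fst(bench_file, from = 4500001, to = 5500000),
  last   = read_fst(bench_file, from = 9000001, to = 10000000),
  times = 10
)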
