
Chunkwise support for read.fst? #269

Open
hope-data-science opened this issue May 21, 2022 · 3 comments

@hope-data-science

Recently I came across a package named chunked. Since fst supports row access via row numbers, I suggest that the read.fst function could support this sort of chunkwise operation. Any thoughts on including this as a new feature? Are there possible solutions?

Thanks.

@MarcusKlik MarcusKlik self-assigned this Nov 16, 2022
@MarcusKlik MarcusKlik added this to the Candidate milestone Nov 16, 2022
@MarcusKlik
Collaborator

Hi @hope-data-science, thanks for posting your request.

I'm interested in learning your specific use case for chunked data: are your calculations memory constrained, or would you like to process chunks in parallel to speed up calculations?

Because fst reads are already multithreaded, using chunks won't do much for the read (and write) times, but it will reduce the memory needed. However, many operations cannot easily be performed on chunks and then correctly combined into an overall result. For example, the median cannot be applied to chunks and then combined later; you would need additional logic for that.

(the same applies to any function where some ordering of column data is needed)
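
To illustrate that point with a quick sketch (not from the thread, just an example on skewed data): the median of per-chunk medians generally differs from the overall median, so chunk results can't simply be combined.

# sketch: the median of chunk medians is not the overall median in general
set.seed(1)
x <- rexp(1000)                           # skewed data makes the gap visible
chunks <- split(x, rep(1:8, each = 125))  # 8 equal-size chunks of 125 values

median(x)                                 # overall median
median(sapply(chunks, median))            # median of the chunk medians (usually differs)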

@MarcusKlik
Collaborator

But indeed, you can use the from and to arguments to read chunks if you want:

library(dplyr)
library(fst)

tmp_file <- tempfile(fileext = "fst")

# write sample fst file
data.frame(
  X = sample(1:100, 1000, replace = TRUE)
) %>%
  write_fst(tmp_file)

# determine chunks
nr_of_chunks <- 8
chunk_size <- metadata_fst(tmp_file)$nrOfRows / nr_of_chunks

# custom function to run on each chunk
my_funct <- function(tbl, chunk) {
  tbl %>%
    summarise(
      Mean = mean(X)
    ) %>%
    mutate(
      Chunk = chunk
    )
}

# run custom function on each chunk
z <- lapply(1:nr_of_chunks, function(chunk, custom_function) {
  y <- read_fst(
    tmp_file,
    from = 1 + (chunk - 1) * chunk_size,
    to = chunk * chunk_size
  )

  custom_function(y, chunk)
}, my_funct) %>%
  bind_rows()

print(z)
#>     Mean Chunk
#> 1 51.680     1
#> 2 46.936     2
#> 3 51.304     3
#> 4 47.824     4
#> 5 52.000     5
#> 6 53.712     6
#> 7 55.440     7
#> 8 51.256     8

From there, how you need to combine the chunks depends on the actual custom function used; in this case:

z %>%
  summarise(
    Mean = mean(Mean)
  )
#>     Mean
#> 1 51.269
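
A caveat on that combination (an assumption worth stating, not spelled out in the thread): the plain mean of the chunk means is exact only because every chunk holds the same number of rows. For unequal chunks, one sketch would be to have the per-chunk function also record a row count and weight by it:

# sketch: variant of my_funct that also records the chunk size
my_funct_weighted <- function(tbl, chunk) {
  tbl %>%
    summarise(
      Mean = mean(X),
      N    = n()        # rows in this chunk
    ) %>%
    mutate(Chunk = chunk)
}

# weighted combination for chunks of unequal size
# (z would need to be built with my_funct_weighted for the N column to exist)
z %>%
  summarise(Mean = weighted.mean(Mean, N))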

@hope-data-science
Author

I think from and to do a good job. I just wonder whether the read speed slows down as from and to change. If it doesn't, I think this issue is done.
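
One way to check that empirically (a sketch, assuming the microbenchmark package is available) is to time identically sized reads taken from different positions in a larger file:

library(fst)
library(microbenchmark)

# hypothetical larger test file with 10 million rows
bench_file <- tempfile(fileext = ".fst")
write_fst(data.frame(X = runif(1e7)), bench_file)

# time same-size chunk reads from the start, middle and end of the file
microbenchmark(
  first  = read_fst(bench_file, from = 1,       to = 1000000),
  middle = read_fst(bench_file, from = 4500001, to = 5500000),
  last   = read_fst(bench_file, from = 9000001, to = 10000000),
  times = 10
)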
