Is there any indexing possible #239
Comments
Hi @Courvoisier13, yes, the option to add indices to a fst file would be great! To the user, these indices could work like Because the tabular data is on-disk rather than in-memory, the design should be a bit different from the in-memory
So, when the user does:

data.frame(X = 1:10, Y = sample(LETTERS, 10)) %>%
  write_fst("some_data.fst")

# write an index to file 'some_data.ifst1'
ft <- fst("some_data.fst") %>%
  arrange(X)

# subset using index
ft %>%
  filter(X > 3 & X < 7)

an extra file can be created (some_data.ifst1) where the index is stored (2 columns in this case). Subsequent indices would create additional files.

But after this step, the result must be sorted to ensure that these rows can be read sequentially. So instead of using a row vector, we need a reverse-row vector, mapping the original rows to the target rows. For large subsets, the sorting must be done in parallel with the reading.

The good thing is that the original fst file will be preserved and can be copied with or without the index files. And, as you say, subsetting like this will be very fast for small subsets!

Additionally, if we have a good mechanism for indexing, grouping could be a logical next step. Grouping operations could then be performed without actually reordering data in the fst file, which would also be a very powerful feature, I think.

Thanks for your feature request!
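To make the idea above concrete, here is a minimal base R sketch of such a two-column index and the reverse-row mapping. It is only an illustration under assumptions: `build_index` and `index_lookup` are hypothetical helper names (they are not part of fst), an in-memory data frame stands in for the on-disk fst file, and `saveRDS` stands in for whatever format an `.ifst1` file would actually use.

```r
# Hypothetical sketch: none of these helpers exist in the fst package.
# An in-memory data frame stands in for the on-disk table.
df <- data.frame(X = c(5L, 2L, 9L, 1L, 7L, 3L, 10L, 4L, 8L, 6L),
                 Y = LETTERS[1:10], stringsAsFactors = FALSE)

# The index has two columns: the sorted key values and, for each of them,
# the row it occupies in the original (unsorted) table.
build_index <- function(keys) {
  ord <- order(keys)
  data.frame(key = keys[ord], row = ord)
}

# Find rows with lower < key < upper by binary search on the sorted keys.
index_lookup <- function(index, lower, upper) {
  first <- findInterval(lower, index$key) + 1L               # first key > lower
  last  <- findInterval(upper, index$key, left.open = TRUE)  # last key < upper
  if (first > last) return(list(read = integer(0), target = integer(0)))
  rows <- index$row[first:last]   # original rows, in key order
  read <- sort(rows)              # ascending, so the file is read front to back
  # Reverse mapping: position of each row read in the key-ordered result.
  list(read = read, target = match(read, rows))
}

# Store the index in a separate file, leaving the data file untouched.
# (saveRDS is an assumption; the real index format is not specified.)
idx_path <- file.path(tempdir(), "some_data.ifst1")
saveRDS(build_index(df$X), idx_path)
index <- readRDS(idx_path)

hit    <- index_lookup(index, 3, 7)
chunk  <- df[hit$read, ]                # stands in for a sequential disk read
result <- chunk[order(hit$target), ]    # restore the key order afterwards
```

The lookup touches only the small index, and the data rows are then fetched in ascending row order, which is the reason for keeping the reverse mapping rather than a plain row vector.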
Thanks @MarcusKlik for the detailed analysis. The way I understand it is some sort of index "light" to allow for easy filtering on those indices, and next step would be grouping. Looking forward to your updates :)
First, many thanks for fst. Now an essential part of my R workflow. Would this mean that data.table secondary indices would be preserved between fst writes and reads, as with the current data.table key?
Hi @MarcusKlik,
Hi,
I looked and couldn't find an existing issue about this. Is there a possibility or a plan to create indexes for fst files to make searching and filtering faster? It would make the fst::fst function so much more powerful and would avoid having to create folders and split the files to make access to a subset faster.
There are issues about filtering, selecting and speed, but nothing on indexing. I think that would be a game changer.
What an incredibly useful package. Please keep it up.