Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there any indexing possible #239

Open
Courvoisier13 opened this issue Apr 8, 2020 · 4 comments
Open

Is there any indexing possible #239

Courvoisier13 opened this issue Apr 8, 2020 · 4 comments
Assignees
Milestone

Comments

@Courvoisier13
Copy link

Hi,

I looked and I couldn't find if there was an issue about this before. But is there a possibility or a plan to create indexes to fst files to make searching and filtering faster. It would make the fst::fst function so much more powerful and would avoid having to create folders and splitting the files to make access to a subset faster.
There are issues talking about filtering, selecting and speed but nothing on indexing. I think that would be a game changer.

What an incredibly useful package. Please keep it up.

@MarcusKlik
Copy link
Collaborator

MarcusKlik commented Apr 8, 2020

Hi @Courvoisier13, yes, the option to add indices to a fst file would be great!

To the user, these indices could work like data.table's secondary indices and can be stored in two or more extra columns for each index (the keys and corresponding row numbers).

Because the tabular data is on-disk rather than in-memory, the design should be a bit different from the in-memory data.table I think (just thinking out loud here :-))

  • A fst file can be read-only. If the user still wants to compute indices, these must be stored in a separate (fst) file, in some user accessible location. If the user does have write access, it might still be better to store the index separately because in that case the index can be removed more easily (without making a full copy of the original file).
  • For optimal performance, data in the fst file must be accessed 'as sequentially as possible'. Therefore, we need an extra sorting step after the index is read (see below).

So, when the user does:

data.frame(X = 1:10, Y = sample(LETTERS, 10)) %>%
  write_fst("some_data.fst")

# write an index to file 'some_data.ifst1'
ft <- fst("some_data.fst") %>%
  arrange(X)

# subset using index
ft %>%
  filter(X > 3 & X < 7)

an extra file can be created (some_data.ifst1) were the index is stored (2 columns in this case). Subsequent indices would create additional files.
When a subset is requested from the fst file, a binary search is performed on the index file to get a vector of row indexes corresponding to the required subset.

But after this step, the result must be sorted to ensure that these rows can be read sequentially. So instead of using a row vector, we need a reverse-row vector, mapping the original rows to the target rows. The sorting must be done in-parrallel to the reading for large subsets.

The good thing is that the original fst file will be preserved and can be copied with or without the index files. And, as you say, subsetting like this will be very fast for small subsets!

Additionally, if we have a good mechanism for indexing, grouping could be a logical next step. Then, grouping operations could be performed without actually reordering data in the fst file, which would also be a very powerful feature I think.

thanks for your feature request!

@MarcusKlik MarcusKlik added this to the Candidate milestone Apr 8, 2020
@MarcusKlik MarcusKlik self-assigned this Apr 8, 2020
@Courvoisier13
Copy link
Author

Thanks @MarcusKlik for the detailed analysis. The way I understand it is some sort of index "light" to allow for easy filtering on those indices, and next step would be grouping. Looking forward to your updates :)

@jpcirrus
Copy link

First, many thanks for fst. Now an essential part of my R workflow.

Would this mean that data.table secondary indices would be preserved between fst writes and reads, as with the current data.table key?

@liorso
Copy link

liorso commented Jun 11, 2022

Hi @MarcusKlik,
I would like to help with this feature, I think it will be very helpful.
I think the first step should be supporting (by fstcore) retrieving rows by a vector of rows instead of a range of rows.
Would like to hear if you already thought about it and any other advice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants