Compression rate to minimize reading time? #278

Open
frederickluser opened this issue Jul 25, 2023 · 2 comments

@frederickluser

frederickluser commented Jul 25, 2023

Thank you so much for all your great work. I wondered which compression factor would minimize reading time for large files with e.g. 100 million observations, if I'm not concerned about writing time. Do you have any intuition or previous benchmarks from, let's say extreme cases (e.g., compress = 0, 50, 100)?

EDIT: I guess the optimal compression rate also depends on one's hardware. In my case at least, I work on quite a powerful machine: 36 virtual processors, 2.3GHz, 440 GB ...

Any comment highly appreciated. All the best,
Frederic

@MarcusKlik
Collaborator

MarcusKlik commented Dec 1, 2023

Hi @frederickluser, that's an interesting question!

fst uses LZ4 (faster, lower compression ratio) and ZSTD (slower, higher compression ratio) for compression and decompression. In general, the size of your fst file will be smallest for the highest compression settings.

Both compression algorithms take more time to compress when the compression setting is higher, but for decompression time there is almost no difference.

So if you want to write once and read often, your best option is to use the highest compression settings possible. With equal decompression time, the smaller number of bytes that need to be read from disk will shorten your reading times :-)

If you had an infinitely fast disk, the reading time would only be limited by decompression speed, and the actual level selected would probably not matter too much.
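As a rough illustration of that reasoning, here is a back-of-envelope sketch in Python. It is not a benchmark and not part of fst: the function name, the compression ratios and all throughput numbers are made up, and it assumes disk reads and decompression happen one after the other with a decompression speed that does not depend on the compression setting.

```python
def estimated_read_seconds(raw_bytes, compression_ratio,
                           disk_bytes_per_s, decompress_bytes_per_s):
    """Hypothetical read-time model: disk transfer plus decompression.

    Assumes the two phases do not overlap and that decompression
    throughput (in uncompressed bytes per second) is the same for
    every compression setting.
    """
    compressed_bytes = raw_bytes / compression_ratio
    disk_time = compressed_bytes / disk_bytes_per_s       # bytes read from disk
    decompress_time = raw_bytes / decompress_bytes_per_s  # bytes produced in memory
    return disk_time + decompress_time


RAW = 8e9          # 8 GB of uncompressed data (made-up size)
DISK = 2e9         # 2 GB/s disk (made-up speed)
DECOMPRESS = 4e9   # 4 GB/s decompression (made-up speed)

# Finite disk speed: the file with the higher compression ratio reads faster.
low_setting = estimated_read_seconds(RAW, 2, DISK, DECOMPRESS)
high_setting = estimated_read_seconds(RAW, 4, DISK, DECOMPRESS)

# Infinitely fast disk: only decompression remains, so the ratio no longer matters.
low_inf = estimated_read_seconds(RAW, 2, float("inf"), DECOMPRESS)
high_inf = estimated_read_seconds(RAW, 4, float("inf"), DECOMPRESS)

print(low_setting, high_setting, low_inf, high_inf)
```

In this toy model the disk term shrinks as the compression ratio grows while the decompression term stays fixed, which is why higher settings help on a bandwidth-limited disk. Real timings depend on your data and hardware, so measuring on your own files is the only reliable check.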

Hope that helps :-)

(PS: in the README benchmark figure you can also see that with the fast (but limited) disk speed there, more compression leads to higher reading speeds)

@MarcusKlik MarcusKlik self-assigned this Dec 1, 2023
@MarcusKlik MarcusKlik added this to the fst v0.9.10 milestone Dec 1, 2023
@frederickluser
Author

Hey Marcus

Great, thanks a lot for the super informative answer! That is very helpful.

All the best, Frederic
