Stream API / Async Iterator API #40
Hey! Thank you for the suggestion. Yes, that is the plan: to implement an Async Iterator API. It will require a bit of a redesign, I think, but I will have a version up and ready soon.
@thecodrr Awesome! If you need some help, let me know (although I am not sure I will have time to help 😅).
After some "alpha" testing of the sync iterator API, there is a bit of performance degradation. However, I don't think iterator APIs should be used to get a final result: their biggest strength is streamed reading, where you can pause or kill the crawl at any moment. In any case, it is doable, of course, and the changes are minor as well. The only issue is reducing code redundancy without affecting the async or sync implementations. I do not want to move the async/sync APIs over to iterators because that's a very, very real performance tradeoff.

Another thought is that perhaps the current implementation isn't best suited for iterators? I mean, sure, it is fast, but maybe iterators could perform better, or the same, if they were implemented completely differently? Right now there is a huge back & forth which makes us add multiple …
I'm interested in testing out any developments of this. A use case might be searching with early exit. Say I have a monorepo of …

I don't think back-pressure is a concern with globbing: most glob results can easily fit in memory because it's just a bunch of file paths. One million file paths would be on the order of 100 MB. Early exit would be useful to reduce unnecessary memory consumption, but I can't imagine needing to pause the globbing process for memory reasons while doing some processing down the line. So just buffer everything at full speed, and then, when the first result is found, expose an async iterator to pull from the available results. And emit events for the stream.

I'm not sure exactly what the implications of streams vs. iterators are. I did read an article about async iterators being bad for performance, though: https://medium.com/netscape/async-iterators-these-promises-are-killing-my-performance-4767df03d85b. It says batching is needed to optimize performance.
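To make the early-exit use case concrete, here is a minimal sketch. It assumes a hypothetical async-iterable `stream()` method on fdir's builder, which does not exist today; it is only meant to show the shape of the consumer code:

```js
const { fdir } = require("fdir");

// Return the first path matching `pattern`, stopping the crawl as soon as
// it is found. `stream()` is an assumed, not-yet-existing API that yields
// paths one at a time.
async function findFirstMatch(root, pattern) {
  const crawler = new fdir().withFullPaths().crawl(root);
  for await (const path of crawler.stream()) {
    if (pattern.test(path)) {
      return path; // early exit
    }
  }
  return null;
}

findFirstMatch("/path/to/monorepo", /package\.json$/).then(console.log);
```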
There is a built-in method to convert from iterators to streams: https://nodejs.org/api/stream.html#streamreadablefromiterable-options. But maybe there is not as much control over chunking... There is also …

Looks like the standard way moving forward, but it's experimental at present...
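For reference, `Readable.from()` usage looks like this; the `walk` generator below is just a placeholder for whatever iterator the crawler would provide:

```js
const { Readable } = require("stream");

// Placeholder async generator standing in for the crawler's iterator.
async function* walk() {
  yield "/repo/a/index.js";
  yield "/repo/b/index.js";
}

// Readable.from() wraps any (async) iterable in an object-mode stream
// and handles back-pressure for us.
const readable = Readable.from(walk());
readable.on("data", (path) => console.log(path));
```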
For basic streaming support we just need to add an emitter here: lines 144 to 160 in 3598e83.

Or here might be better: lines 65 to 100 in 3598e83.

This would allow us to add debugging information about symlink resolution, which could be important for performance debugging. I would like the ability to log when a file was skipped because it was a symlink and symlink resolution was disabled.
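Purely as a sketch of the "add an emitter" idea, and not fdir's actual internals, the hook could look something like this (the `CrawlerEmitter` name and its methods are made up for illustration):

```js
const { EventEmitter } = require("events");

// Hypothetical: the crawler would call `push()` at the point where it
// currently appends a path to the results array, and `done()` when the
// walk finishes.
class CrawlerEmitter extends EventEmitter {
  push(path) {
    this.emit("item", path);
  }
  done() {
    this.emit("end");
  }
}

// Consumers subscribe instead of waiting for the full result array.
const emitter = new CrawlerEmitter();
emitter.on("item", (path) => console.log("found", path));
emitter.on("end", () => console.log("crawl complete"));
```

A `skipped-symlink` or `debug` event could be emitted the same way to cover the symlink logging mentioned above.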
This doesn't "sound" good for performance, though I haven't tested it. The way the Builder API is structured allows for extensions like these, but I am fearful of cluttering the code to the point where it starts to make less sense. That is why I am planning to refactor the code-base a little to allow "plugins". The core crawling will act as a base for all these extra features like relative paths, filtering, globbing, etc. Each plugin would be fairly independent and won't clutter the code, not to mention that it would allow 3rd-party plugins into the …
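As a purely illustrative sketch of the plugin idea (none of these names exist in fdir), the shape might be something like:

```js
// Illustrative only: a possible shape for a plugin-based crawler core.
class Crawler {
  constructor() {
    this.plugins = [];
  }
  use(plugin) {
    this.plugins.push(plugin);
    return this;
  }
  // Each plugin may transform a path or drop it by returning null.
  handlePath(path) {
    let result = path;
    for (const plugin of this.plugins) {
      if (result === null) break;
      result = plugin.onPath(result);
    }
    return result;
  }
}

// Example plugin: exclude anything under node_modules.
const ignoreNodeModules = {
  onPath: (path) => (path.includes("node_modules") ? null : path),
};

const crawler = new Crawler().use(ignoreNodeModules);
console.log(crawler.handlePath("/repo/node_modules/x.js")); // null
```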
The testing I have done implies the same. In a very crude sense, you could say that the more you …

However, the main benefit of iterators is the ability to work on a single item at a time and bail out when that work is finished (or continue if not). I will have to do more benchmarking to see which approach is best. I do know, however, that pushing lots of strings into an array takes some time, and if that can be swapped for …
An async iterator is probably "bad" for CPU performance, but good for memory usage. This might matter if you're scanning millions of files, although most systems have memory in abundance these days. Presumably, the goal with this feature is to have scripts provide feedback as quickly as possible, as well as maybe to save some memory.

What if we were to collect some number of results before feeding them through? Or collect in time intervals? Or even both?

```js
const files = new fdir()
  .crawl("/path/to/dir")
  .stream(1000, 100); // 1000 entries or 100 milliseconds

for await (const chunk of files) {
  for (const file of chunk) {
    // ...
  }
}
```

The …

If the iterator yields 1000 chunks for a million files, that's unlikely to create either substantial memory or CPU overhead? 🙂
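As a rough illustration of the count-or-time chunking idea above, here is a small sketch, assuming the crawler already exposes an async iterable of individual entries (the `source` parameter is a stand-in for that):

```js
// Batch an async iterable into arrays of at most `maxItems` entries,
// flushing early if `maxMs` milliseconds have passed since the last flush.
// Note: the time check only runs when a new item arrives, which is enough
// for a sketch.
async function* chunked(source, maxItems = 1000, maxMs = 100) {
  let batch = [];
  let deadline = Date.now() + maxMs;
  for await (const item of source) {
    batch.push(item);
    if (batch.length >= maxItems || Date.now() >= deadline) {
      yield batch;
      batch = [];
      deadline = Date.now() + maxMs;
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}
```

A consumer could then write `for await (const chunk of chunked(crawlerIterator)) { ... }`, where `crawlerIterator` is whatever async iterator the library ends up exposing.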
FWIW, fast-glob has globStream, which allows us to read directories in chunks, preventing Node from crashing with OOM when reading large trees.
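For comparison, fast-glob's streaming entry point is `fg.stream()` in current versions (assuming that is what "globStream" refers to here); usage looks roughly like this:

```js
const fg = require("fast-glob");

// fg.stream() returns an object-mode Readable that emits entries as they
// are found, instead of buffering the whole result set in memory.
const stream = fg.stream("**/*.js", { cwd: "/path/to/large/tree" });

stream.on("data", (entry) => console.log(entry));
stream.on("end", () => console.log("done"));
```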
Hello, first of all thanks for this awesome package!!
The documentation says that …
Is it already a work in progress? I would really like to see this.
Also, I want to suggest providing an Async Iterator API instead of a stream API. The reason is that an Async Iterator can be easily converted into a readable stream using into-stream without loss of performance (into-stream automatically handles backpressure and all stream quirks), while the opposite conversion is very nontrivial (actually I think it's impossible, since readable streams start filling their internal buffers as soon as they begin flowing and therefore can't be converted to a one-step-at-a-time async iterator).
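A small sketch of the conversion described above, assuming a version of into-stream that accepts async iterables and CommonJS require (the `listFiles` generator is just a placeholder for the proposed fdir iterator):

```js
const intoStream = require("into-stream");

// Placeholder async generator standing in for the proposed fdir iterator.
async function* listFiles() {
  yield "/repo/src/a.ts";
  yield "/repo/src/b.ts";
}

// Object-mode conversion: the stream pulls from the iterator on demand,
// so back-pressure is handled for us.
const stream = intoStream.object(listFiles());
stream.on("data", (file) => console.log(file));
```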