poor usage of (?) rio slower than sequential read(2) i/o #5
As an outside observer who finds this very fascinating, I would love to throw in a few thoughts. That said, feel free to disregard them :)

The two algorithms seem very different. Let me try to phrase them to make sure I didn't misunderstand what happens. The sequential one alternates between reading and computation, reading and computation, etc. The io_uring one does all the reading, then all the computation: it handles QUEUE_DEPTH parallel reads on consecutive chunks, does the computation, then repeats on more consecutive chunks, and so on. So, to throw out a few theories (from normal to far out there):

File systems and disks (even NVMe disks) are spectacularly good at 100% sequential I/O. Depending on the FS (I don't know Apple's well enough, but I think ZFS does this, for example), they will at times try to pre-read the next few chunks of a sequential access and have them ready and cached in case you need them. So the 100% sequential access in the first example might have an edge here, in being recognized as such and optimized at the FS level. I don't know enough about io_uring's internals but filing […]

The boring part: parallel requests in a purely sequential workload like this are likely to always lose unless you can recover some of the cost of the parallelism by multiplexing work; the example still waits for all requests to be completed. Perhaps moving the sum-loop inside the wait-loop would give a better result? Basically what I fear here is that, the way the code is structured, the process has to […]

On this one I might just be very scared because I recently had a huge performance drop from it, but the summing over the BUFFER_SIZE will probably all reside in the CPU cache from reading to calculation, never having to hit main memory. For the io_uring example, since it is reading 32 times as much, you might hit some cache limitations, especially with the access being non-linear and other things in the background (while reading) also wanting their chunk of the cache.

Then again, as I said, I just got scared by this, so I see cache issues everywhere now 😂

Last, and really far out there: thermals? Read, calculate, read, calculate might give the components a bit more time to cool down and prevent throttling (both the NVMe and the CPU), especially since this is a laptop? OTOH it feels like a very short period, so I'd be surprised by this.

Edited: Perhaps an interesting secondary comparison would be to calculate the sum from N files, once in succession and once in parallel using io_uring? |
I absolutely agree with you; I didn't necessarily expect this either. It actually does turn out that the summing cost was much higher than I expected! I underestimated it dramatically. I just had it there to ensure that both were reading the same number of bytes. Without it, they're both much faster on higher inputs, but I'm still doing some summing, to prevent the compiler from optimizing the reads out. FWIW, I tried to use […] |
Here are some numbers from fio:
io_uring is slightly faster, at least in this test case. |
I'll throw this up on here: the flamegraph of running the test (100 times). Notice it is going through the syscalls. After some digging, io_uring requires SQPOLL to bypass them. I added an updated gist that uses the option, but it fails to run (note: it needs to run as a privileged user) and I'm not sure why. https://gist.github.com/Licenser/45786698fc8a3ad78a957d8618882d0d Perhaps someone can help with a hint as to why it ends up as:
|
Because you need to register files for sqpoll
|
Thanks! |
Why are you waiting for all (QUEUE_DEPTH) CQEs to complete here?: https://gist.github.com/Licenser/45786698fc8a3ad78a957d8618882d0d#file-black_box-rs-L90. |
I had a different implementation that used a VecDeque, with no performance difference. So I wanted to introduce SQPOLL with as few changes to the original as possible, to prevent cross-contamination between changes. I'll try to get the poll to work tomorrow if sirupsen doesn't beat me to it ;) |
SQPOLL won't change anything; I'm easily hitting my NVMe disk's read limit without it. |
It's less about outperforming the file operation than getting in the same ballpark. At the moment the code above is about 2x slower reading the file with rio than the old-fashioned way. |
So I looked into how to register a file; it sounds a lot easier than it is 😂 SQPOLL doesn't seem to be supported at the moment, so I tried to add it in #8, but I am somewhat lost as to why it doesn't work. https://github.com/axboe/liburing/blob/master/man/io_uring_register.2 is the best explanation I found so far, and I think I followed the requirements, but I'm probably missing something sneaky. |
I noticed when working on chrisvest/xxv#18 that if you have small buffers and/or a small queue depth, then rio spends a fair amount of time on the futex calls for servicing the […] |
Hey Tyler, been eagerly following your progress with `rio` and `sled` on Twitter. Been tracking `io_uring` for a while, and decided to finally spend some time programming in it today. I've been working on moving my I/O benchmarks in napkin-math to use `io_uring` on Linux. I started with the sequential test-case, as this seemed easiest to start with (I skimmed through Axboe's io-uring-cp(1) too, since it's […]).

I had some trouble allocating the buffers for all these reads, FWIW; maybe something `rio` could help with. After poking around the API docs, I found a way to get a mutable slice without upsetting the borrow-checker with `unsafe` (🙈).

To my surprise, `io_uring` kept performing worse than simple `read(2)`, even though I'm re-using the buffers. I tried tuning the size of the buffers and the depth of the queue, but to no avail (the settings in the script below are the ones Axboe uses in his `cp` example, so they seem sensible). Do note that I'm fairly inexperienced with Rust, so I may be missing things 👀

I've tried to make this very easy to reproduce. If you throw this gist into `examples/sequential_io.rs` and run it with `cargo run --release --example sequential_io`, you should see something along the lines of:

Edit: Some concerns with the summing overhead, see my comment below, but it shows the same pattern when eliminated.

I'm on `5.3.0-29-generic`. I know I could be more recent, but that's the most recent Ubuntu kernel I could dig up that doesn't require re-compiling perf etc.

Edit: I upgraded to 5.5.1, and results are the same.