Investigate more optimal way to implement CoreGraphics backend #83

ids1024 · 2023-03-31T02:51:23Z

Apparently macOS (and iOS, #43) has a framework called IOSurface for exchanging framebuffers and textures between processes, which sounds similar to the idea behind dmabufs on Linux. I think we should be able to use IOSurfaces for a front and back buffer, and use IOSurfaceGetBaseAddress to get a pointer to write into for no-copy presentation (#65), assuming it can work with the right pixel format.

Or are there issues with this, or a better way?


ids1024 · 2023-04-06T17:36:06Z

http://russbishop.net/cross-process-rendering describes how it is possible to create an IOSurface with a size and format, access it from CPU, and set it as the contents of a CALayer.

How would writing to the IOSurface from CPU perform? It would be good to have someone with a Mac that has a discrete GPU test this.
Synchronization: how do we make sure the surface is no longer in use by the display server when re-using it?

ids1024 · 2023-04-06T17:52:19Z

Or possibly we could just use CGImage as is currently done, but with a CGDataProvider that reads from memory we can mutate? Presumably CGImage/CGDataProvider assume the memory isn't mutated, but we could mutate it once the provider is released. That isn't so simple without any guarantees about when it will be released, though, and we probably can't block waiting for that either.

Edit: See https://developer.apple.com/documentation/coregraphics/cgdataproviderreleasedatacallback:

When Core Graphics no longer needs direct access to your provider data, your function is called. You may safely modify, move, or release your provider data at this time.

ids1024 · 2023-04-06T18:55:13Z

So comparing these:

IOSurface
- With unified memory, this should let us write directly into the memory the GPU reads from, to be truly no-copy. With a discrete GPU, a DMA transfer is required to get it into GPU memory. With integrated graphics on Intel Macs, I think memory wouldn't be "unified", and it may need to copy from the portion of memory allocated to the CPU to the portion allocated to the GPU?
- Not sure how to synchronize, and make sure the IOSurface is no longer in use by the display server.
CGImage with custom CGDataProvider
- Saves the copy currently happening in the softbuffer backend, but CoreGraphics still needs to upload the data to GPU? (Into an IOSurface that is sent to the display server?)
  - Is there any possibility this upload could perform better than CPU access to the IOSurface? Presumably this is worse with unified memory, but maybe not otherwise?
- Clear behavior with a release callback when it is no longer used by CoreGraphics
  - Not sure when it will be released, and we likely need to be prepared to allocate more than 2 buffers, but the current implementation is already allocating a new one every present.

For performance concerns, benchmarking is best. But we'd need a representative benchmark, an implementation of both, and multiple types of hardware.
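To make the "more than 2 buffers" point concrete, here is a minimal sketch of a buffer pool that recycles a buffer only after a release notification arrives. It is plain Rust with no CoreGraphics calls: `BufferPool`, `acquire`, and `release` are hypothetical names, and `release` merely stands in for whatever the real CGDataProvider release callback would do.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical pool of pixel buffers, recycled only once they are released.
/// Cloning the pool shares the same free list, so a clone could be moved
/// into a release callback.
#[derive(Clone, Default)]
struct BufferPool {
    free: Arc<Mutex<Vec<Vec<u32>>>>,
}

impl BufferPool {
    /// Take a free buffer of the right length, or allocate a new one when
    /// every existing buffer is still held by the consumer.
    /// A recycled buffer keeps its old contents.
    fn acquire(&self, len: usize) -> Vec<u32> {
        let mut free = self.free.lock().unwrap();
        // Discard buffers left over from a previous window size.
        free.retain(|b| b.len() == len);
        free.pop().unwrap_or_else(|| vec![0; len])
    }

    /// Stand-in for the release callback: the consumer no longer needs `buf`.
    fn release(&self, buf: Vec<u32>) {
        self.free.lock().unwrap().push(buf);
    }

    /// Number of buffers currently available for reuse.
    fn free_count(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}
```

Because `acquire` falls back to allocating, the pool grows to however many buffers are in flight at once, and shrinks back when the window size changes.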

ids1024 · 2023-04-10T22:15:56Z

Oh, I forgot about buffer stride.

Testing this (#95), it looks like we can't just set the stride to always match the width, so to use IOSurface we'd need to provide a Buffer::stride method. And users of the library would have to consider that.

This would probably also be needed for #42, or if we wanted to use dmabufs instead of shm on Wayland, etc.
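For illustration, this is roughly what stride-aware drawing would look like for a user of the library. It is a sketch under assumptions: `fill_with_stride` is a hypothetical helper, the buffer is a plain `&mut [u32]`, and the stride is counted in pixels rather than bytes. A real `Buffer::stride` method might report it differently.

```rust
/// Fill a pixel buffer whose rows may be padded.
///
/// `stride` is the number of `u32` pixels from the start of one row to the
/// start of the next, so `stride >= width`. Padding pixels are left untouched.
fn fill_with_stride(
    buf: &mut [u32],
    width: usize,
    height: usize,
    stride: usize,
    color: impl Fn(usize, usize) -> u32,
) {
    assert!(stride > 0 && stride >= width, "stride must cover a full row");
    assert!(buf.len() >= stride * height, "buffer too small");
    // Step through the buffer one padded row at a time.
    for (y, row) in buf.chunks_mut(stride).take(height).enumerate() {
        // Only the first `width` pixels of each row are visible.
        for (x, pixel) in row[..width].iter_mut().enumerate() {
            *pixel = color(x, y);
        }
    }
}
```

Code that indexes with `y * width + x` produces a skewed image as soon as the stride differs from the width, which is why exposing it would be an API change that users have to account for.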

LoganDark · 2023-08-24T11:04:41Z

> someone with a Mac that has a discrete GPU

That could be me: I have a Mac right here with an AMD dGPU, as long as IOSurface exists on macOS 10.14.

ids1024 · 2023-08-24T15:40:53Z

https://developer.apple.com/documentation/iosurface says it was introduced in macOS 10.6 (sorry PowerMac G5 users), so that much shouldn't be an issue.

LoganDark · 2023-08-24T15:46:07Z

Great, I could proceed with any of:

- any test branch that (partially) implements this concept; I'd test correctness and performance with a profiler and see if I can make any further improvements
- pointers to reference implementations or other info on how this would go into softbuffer; I could attempt to implement this from scratch into softbuffer and see how it goes
- providing one of the Softbuffer members a remote desktop to my Mac (since I do not use it, and it's already set up with a working Rust + Xcode toolchain); I'd probably want to hop in a voice call and supervise / advise, so then it would be similar to pair programming I suppose

And as a bonus, implementing it all the way back on macOS 10.14 would ensure that softbuffer still works back to at least that version. (No reason why it shouldn't, but it's a personal goal of mine to keep those old intels supported!)

I should be free to do any of those in around an hour :)

LoganDark · 2023-08-24T15:58:02Z

Reading up it looks like you're talking about having to expose a stride, let me introduce: imgref! If softbuffer needs a 0.4.0 for this, I'd be glad to participate in that API redesign since I've worked with these types of signatures somewhat extensively (grumble grumble looks at unreleased pixels competitor). But anyway, take a look at my proposal above and see if anything looks reasonable to you. :)

ids1024 · 2023-08-24T15:58:40Z

#95 has an implementation using IOSurface, which requires an API change to expose stride; it also updates the winit and animation examples to use it. #96 instead copies into an IOSurface on present, which requires no API changes. I did some performance testing of both on M1.

I wonder if there's a good way to automate benchmarking of softbuffer performance.

LoganDark · 2023-08-24T16:01:22Z

> #95 has an implementation using IOSurface, which requires an API change to expose stride; it also updates the winit and animation examples to use it. #96 instead copies into an IOSurface on present, which requires no API changes. I did some performance testing of both on M1.
>
> I wonder if there's a good way to automate benchmarking of softbuffer performance.

I'll check them out. I don't have an M1 to test with, but if you do, that should cover everything. My benchmark method tends to be instrumentation using Instant::now(). It's not perfect, but the margin of error is usually on the order of milliseconds, and copies of large buffers are usually much more expensive than that, so it should be good enough. (I'll figure it out when I have my paws on some local tests.)

Once I have some thoughts I'll leave them on the relevant PR, or here if they affect both or are in general.

LoganDark · 2023-08-25T15:21:58Z

Alright, so based on my testing, for total render times:

- copy-to-iosurface spikes up to 33ms for the first fullscreen frame, then 22ms for each subsequent frame
- master spikes up to 22ms for the first fullscreen frame, then 7ms for each subsequent frame
- iosurface-wip spikes up to 16ms for the first fullscreen frame, then 16ms for each subsequent frame

I think the 16ms might be a fluke here. It makes you think it might be vsync, but it's consistently lower than 16ms for small windows and consistently higher than 16ms for larger-than-screen windows. In fullscreen, it doesn't seem to ever take longer than 18ms or so, but this is still beaten by master's 7ms.

Also, copy-to-iosurface is clearly worthless and should be scrapped, as benchmarks prove that more copies won't help anything. /hj

Here are some more detailed breakdowns per-branch:

master:

buffer: 1600x1200
  resize: 0us
  fill: 6028us
  present: 42us
buffer: 1600x1200
  resize: 0us
  fill: 3973us
  present: 19783us
buffer: 2880x1800
  resize: 0us
  fill: 16733us
  present: 25us
buffer: 1600x1200
  resize: 0us
  fill: 3951us
  present: 20us
buffer: 2880x1800
  resize: 0us
  fill: 10252us
  present: 20us
buffer: 1600x1200
  resize: 0us
  fill: 4080us
  present: 15us

copy-to-iosurface:

buffer: 1600x1200
  resize: 4877us
  fill: 3542us
  present: 4655us
buffer: 1600x1200
  resize: 0us
  fill: 3450us
  present: 1811us
buffer: 2880x1800
  resize: 6757us
  fill: 13651us
  present: 12984us
buffer: 1600x1200
  resize: 108us
  fill: 3606us
  present: 4900us
buffer: 2880x1800
  resize: 879us
  fill: 8938us
  present: 12655us
buffer: 1600x1200
  resize: 114us
  fill: 3368us
  present: 5558us

iosurface-wip:

buffer: 1600x1200
  resize: 0us
  fill: 7580us
  present: 51us
buffer: 1600x1200
  resize: 0us
  fill: 6901us
  present: 736us
buffer: 2880x1800
  resize: 0us
  fill: 17087us
  present: 23us
buffer: 1600x1200
  resize: 0us
  fill: 6369us
  present: 25us
buffer: 2880x1800
  resize: 0us
  fill: 16601us
  present: 27us
buffer: 1600x1200
  resize: 0us
  fill: 6400us
  present: 24us

Now Wait Just A Minute, there's something fishy here.

Let's see:

- copy-to-iosurface, of course, always takes an ungodly amount of time to present, because of course it does. However, the resize and fill times are basically identical to master (makes sense, since they use the same style of managed buffer). copy-to-iosurface is a strict downgrade.
- iosurface-wip has the lowest maximum present time of all of them, just 736μs compared to master's occasional 19783μs (woah) and copy-to-iosurface's 12984μs. However, it has the highest fill time: it somehow takes longer to write into the buffer in the first place.

This makes me wonder if IOSurface is somehow magical! The memory backing it seems to somehow be more expensive than normal memory, perhaps it's some sort of MMIO or something. Anyway, this prompted me to do some more testing. My method of filling buffers quickly is to use rayon to fill it using multiple threads, so let's try that:

master:

buffer: 1600x1200
  resize: 0us
  buffer_mut: 8us
  fill: 4699us
  present: 45us
  total: 4707us
buffer: 1600x1200
  resize: 0us
  buffer_mut: 984us
  fill: 1494us
  present: 19263us
  total: 2479us
buffer: 2880x1800
  resize: 0us
  buffer_mut: 2us
  fill: 8364us
  present: 22us
  total: 8367us
buffer: 1600x1200
  resize: 0us
  buffer_mut: 767us
  fill: 1201us
  present: 24us
  total: 1968us
buffer: 2880x1800
  resize: 0us
  buffer_mut: 2064us
  fill: 2797us
  present: 23us
  total: 4862us
buffer: 1600x1200
  resize: 0us
  buffer_mut: 800us
  fill: 1307us
  present: 14us
  total: 2107us

copy-to-iosurface:

buffer: 1600x1200
  resize: 4791us
  buffer_mut: 0us
  fill: 2454us
  present: 5828us
  total: 7246us
buffer: 1600x1200
  resize: 0us
  buffer_mut: 0us
  fill: 1773us
  present: 1709us
  total: 1773us
buffer: 2880x1800
  resize: 6402us
  buffer_mut: 0us
  fill: 4993us
  present: 12193us
  total: 11396us
buffer: 1600x1200
  resize: 86us
  buffer_mut: 0us
  fill: 1435us
  present: 4449us
  total: 1522us
buffer: 2880x1800
  resize: 879us
  buffer_mut: 0us
  fill: 2782us
  present: 13297us
  total: 3662us
buffer: 1600x1200
  resize: 83us
  buffer_mut: 0us
  fill: 2339us
  present: 4824us
  total: 2423us

iosurface-wip:

buffer: 1600x1200
  resize: 0us
  buffer_mut: 682us
  fill: 4535us
  present: 67us
  total: 5218us
buffer: 1600x1200
  resize: 0us
  buffer_mut: 63us
  fill: 4093us
  present: 454us
  total: 4157us
buffer: 2880x1800
  resize: 0us
  buffer_mut: 133us
  fill: 9323us
  present: 47us
  total: 9456us
buffer: 1600x1200
  resize: 0us
  buffer_mut: 117us
  fill: 3933us
  present: 22us
  total: 4050us
buffer: 2880x1800
  resize: 0us
  buffer_mut: 162us
  fill: 8543us
  present: 35us
  total: 8706us
buffer: 1600x1200
  resize: 0us
  buffer_mut: 87us
  fill: 3817us
  present: 37us
  total: 3905us

Much better?

As far as I can tell, iosurface-wip is the way to go, because it's a lot more consistent than master even if it's slightly slower to write. Meanwhile copy-to-iosurface... yeah. Throw it in the bin, lol
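For anyone who wants to try the multi-threaded fill mentioned above, here is a rough sketch. It is not the code used for these numbers: it swaps rayon for `std::thread::scope` so that it needs no external crate, it assumes rows without padding, and `parallel_fill` is a hypothetical name.

```rust
use std::thread;

/// Fill `buf` (rows of `width` pixels, no padding) using up to `threads`
/// scoped threads, each handling a contiguous band of rows.
fn parallel_fill(
    buf: &mut [u32],
    width: usize,
    threads: usize,
    color: impl Fn(usize, usize) -> u32 + Sync,
) {
    assert!(width > 0 && threads > 0);
    assert!(buf.len() % width == 0, "buffer must hold whole rows");
    let height = buf.len() / width;
    // Rows per band, rounded up so that every row is covered.
    let rows_per_band = ((height + threads - 1) / threads).max(1);
    let color = &color;
    thread::scope(|s| {
        for (band, chunk) in buf.chunks_mut(rows_per_band * width).enumerate() {
            // Each thread owns a disjoint slice, so no locking is needed.
            s.spawn(move || {
                let first_row = band * rows_per_band;
                for (dy, row) in chunk.chunks_mut(width).enumerate() {
                    for (x, pixel) in row.iter_mut().enumerate() {
                        *pixel = color(x, first_row + dy);
                    }
                }
            });
        }
    });
}
```

Spawning fresh threads every frame has its own cost, so a thread pool such as rayon's is likely the better choice for real measurements.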

lunixbochs · 2023-08-29T06:20:56Z

ideally these are also tested on apple silicon, to see how it behaves with the unified gpu memory

LoganDark · 2023-08-29T06:22:13Z

> ideally these are also tested on apple silicon, to see how it behaves with the unified gpu memory

Of course, I was assuming that @ids1024 (or someone else) would get back to me with comparisons on ASi to see if iosurface-wip really is the best choice for both, but it seems like that hasn't happened yet.

lunixbochs · 2023-08-29T06:28:34Z

what's the easiest way to repro your test?

LoganDark · 2023-08-29T06:37:08Z

> what's the easiest way to repro your test?

instrument the code with some Instant::now()s, then eprintln!("took {}us", (b - a).as_micros()); at the end of the frame. I don't have an exact diff
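Since there is no exact diff, here is a minimal sketch of that kind of instrumentation. The helper names `timed` and `format_frame` are made up for illustration, and the output shape simply mimics the logs earlier in the thread.

```rust
use std::time::{Duration, Instant};

/// Run `f`, returning its result together with how long it took.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

/// Format one frame's measurements in the same shape as the logs above.
fn format_frame(width: u32, height: u32, phases: &[(&str, Duration)]) -> String {
    let mut out = format!("buffer: {width}x{height}\n");
    for (name, elapsed) in phases {
        out.push_str(&format!("  {name}: {}us\n", elapsed.as_micros()));
    }
    out
}
```

In a render loop, each phase (resize, fill, present) would be wrapped in `timed`, and the result passed to `eprint!` at the end of the frame.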

lmglmg · 2024-02-02T13:17:29Z

On the master branch, the winit example consumes a very large amount of memory when continuously resized. This issue seems to be fixed on the iosurface-wip branch. I tested this on an M1 Mac.

ids1024 mentioned this issue Apr 5, 2023

set_buffer is slow #18

Open

notgull mentioned this issue Dec 24, 2023

Fps benchmark speed #189

Closed

madsmtm added the CoreGraphics macOS/iOS/tvOS/watchOS/visionOS backend label Apr 30, 2024

madsmtm changed the title ~~Investigate more optimal way to implement macOS backend~~ Investigate more optimal way to implement macOS/iOS backend Aug 26, 2024

madsmtm changed the title ~~Investigate more optimal way to implement macOS/iOS backend~~ Investigate more optimal way to implement CoreGraphics backend Aug 26, 2024
