Minutes 2019 11 25

GPU Web 2019-11-25

Chair: Corentin

Scribe: Austin

Location: Google Meet

TL;DR

#501 array buffer creation benchmark & Immediate uploads and friends #426, #481, #491
- Discussion about the uploadBuffer closure and what it entails for the programming model.
- Discussion about yet another idea: fallible buffer mapping, where developers synchronize themselves (or the mapping fails).

Tentative agenda

#501 array buffer creation benchmark & Immediate uploads and friends #426, #481, #491
Racy buffer mapping
Agenda for next meeting

Attendance

Google
- Austin Eng
- Corentin Wallez
- Ryan Harrisson
Microsoft
- Chas Boyd
- Damyan Pepper
- Rafael Cintron
Mozilla
- Dzmitry Malyshau
- Jeff Gilbert
Mehmet Oguz Derin

Some news

DM: Some news: Landing changes on both FF and Servo to implement WebGPU. Technically we have twice as many votes now!
CW: We’re integrating on ChromeOS as well. Starting to maybe look at Android. Browser integration will be more difficult than just getting Dawn running.

#501 array buffer creation benchmark & Immediate uploads and friends #426, #481, #491

CW: Last week, we were talking about two overloads (only considering buffers) on GPUUploadQueue, which are setBufferSubData and uploadBuffer -> ArrayBuffer. Difficulties found with uploadBuffer:
- GC Pressure
- Something related to multi queue depending on how it works
- Unclear interactions in multi-threaded environments. We definitely want to submit on one thread and upload on another. You can’t detach ArrayBuffers cross-thread in Web engines. This is because, imagine in JS, you’re in a hot loop with an ArrayBuffer. If it’s detached cross-thread you’ll deref a nullptr. So if we make upload buffers on one thread, and then submit them on another, then you get a data race. Perhaps one solution is that you have to explicitly detach all upload buffers before submitting.
DM: Closure idea that CW brought up. Could be a good direction. something like: q.uploadBuffer(arrayBuffer => { … });
CW: Crucial problem with the staging belt, if you want a staging belt that still has a fast path, is that we can’t control the lifetime of ArrayBuffers unless we scope the usage of the ArrayBuffer. Here it is, use it now. The closure is a way to force the scoping. The closure is immediately called with the ArrayBuffer, and at the end of the closure, it’s immediately detached.
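A minimal sketch of the scoping CW describes, to make the lifetime rule concrete. Everything here is hypothetical: `MockQueue`, `ScopedStaging`, and the `uploadBuffer(size, fill)` signature are illustration-only names, not the proposed API. A real implementation would hand out an ArrayBuffer and detach it; this sketch models detachment with a handle that refuses access once the closure returns.

```typescript
// Hypothetical stand-in for the ArrayBuffer passed to the closure.
// Detachment is modeled by dropping the backing bytes.
class ScopedStaging {
  private bytes: Uint8Array | null;

  constructor(size: number) {
    this.bytes = new Uint8Array(size);
  }

  // Write into the staging memory; fails once the scope has ended.
  set(data: ArrayLike<number>, offset: number = 0): void {
    if (this.bytes === null) {
      throw new Error("staging memory is detached");
    }
    this.bytes.set(data, offset);
  }

  // Ends the scope: returns the written bytes and blocks further access.
  detach(): Uint8Array {
    if (this.bytes === null) {
      throw new Error("staging memory is already detached");
    }
    const written = this.bytes;
    this.bytes = null;
    return written;
  }
}

// Hypothetical queue: calls the closure immediately, detaches right after.
class MockQueue {
  readonly uploads: Uint8Array[] = [];

  uploadBuffer(size: number, fill: (staging: ScopedStaging) => void): void {
    const staging = new ScopedStaging(size);
    let filled = false;
    try {
      fill(staging);
      filled = true;
    } finally {
      // Detach even if the closure throws, so nothing outlives the scope.
      const written = staging.detach();
      if (filled) {
        this.uploads.push(written);
      }
    }
  }
}
```

If the closure stashes the handle and writes to it later, the write throws, which is the property the closure design is meant to guarantee.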
DM: There’s some room for not calling it immediately. It needs just a bit of time to request staging space from the server. We can make the closure wait just a little for the client to get the staging area. It relies on fewer heuristics about what’s going on. If you return immediately, you can’t guarantee it points to the staging belt. If you operate on a closure, you can guarantee it better.
CW: To be clear, two designs using the closure: Immediate calling, and calling it maybe soon-ish (not the same stack frame). I think it’s always possible to get more shared memory in multiprocess browsers. The only possibility that might not be covered is sending a handle cross process and mapping.
AE: Think it’ll work on Intel with VK_external_memory_dma_buf
RC: Where in the issue #491 is the thing we’re talking about?
CW: Last week we were talking about proposal version 2. Now, we’re talking about something like v2 & my-last-comment. DM is suggesting that the closure may not be immediately called.
AE: When is it called by? before the next submit?
DM: Yes.
CW: What could happen is that implementations queue the queue submits. It doesn’t happen immediately, but after the Web engine has resolved all the closures. If this is fairly fast because we’re waiting for some more memory, it could be okay. And everything happens in a nice linear order.
DM: The closure is executed on the queue timeline. Which makes sense because you make the closure on the queue timeline.
CW: I disagree, because fences
DM: No but that’s GPU timeline, not queue timeline.
RC: So did we agree then that we’re going to have both a mapping solution and setBufferSubData and uploadBuffer? all three?
JG: That’s the proposal. In my mind, this is getting too complicated. Maybe there is a super cool solution in here, but I think there’s room to do something simpler if we say you can synchronously map buffers, and sometimes it’s fallible.
DM: Sorry Jeff, that’s not the proposal. I would say the mapping today is not needed. The proposal is only about the two new things and doesn’t consider the existing mapping. Proposing that there’s only setSubBuffer and uploadBuffer (maybe w/ closure)
RC: Then we have this reserveUploadSpace as well?
DM: Don’t think we should go there now. It can be chunked or have a representation that’s not expressible at this level.
CW: Unfortunately, my opinion is the contrary of that. Sorry. Buffer mapping + setSubData seems the least spaghetti.
JG: What need do you see for setSubData or is it convenient?
CW: Very much a convenience function.
JG: I asked because one of the paradoxes of minimizing copies is that, depending on whether you have a static resource or are generating the resource, either mapping or copying can be the most efficient solution. If you only have one, then one path will have an extra copy -- unless everything collapses down to a mapping anyway. One of the advantages of setSubData is that it could let you copy into an idle buffer.
CW: Never was on our mind.
DM: That was on my mind. There’s no reason we shouldn’t do that if you can directly get the pointer.
JG: Isn’t it always managed mode on Metal?
DM: There is a shared mode.
CW: I remember being surprised that there’s an extra copy when Safari does mapping. We should ask them.
RC: So then closure aside, when you call uploadBuffer and the buffer is returned to you, can you use the GPUBuffer destination while an uploadBuffer is outstanding? or must you detach first?
DM: Considering CW’s concerns, you should have to detach first. Closures let us do this.
RC: Let’s say you call uploadBuffer on a couple of buffers. On discrete hardware, could the returned buffer be a pointer into some GPU staging memory?
DM: Yes, that’s the optimization we want.
RC: Are you going to round-robin or make a new one each time?
DM: That’s the staging belt that I wanted underneath. Some ring-buffer inside.
RC: What happens if you call uploadBuffer twice, but then detach the second one but not the first one, or vice versa. Are there ring buffer problems?
DM: No the order is established by the queue operations. You just have to detach first.
RC: Suppose they leave a hole where they don’t detach something in the middle. Is that a problem?
DM: Use case we don’t want to do. Now we are forcing the users to detach everything every time. It would be a validation error on the queue, even single threaded.
RC: < more about holding things not detached >
CW: Unclear. Could be only the buffers that are used in the submit.
RC: Right, that seems reasonable, but it could mess up the backing ring-buffer idea.
DM: That’s not the use case we want to support. They should not be holding upload buffers for multiple frames.
RC: May not be a use case, but clueless developers may still do it. Should be an error or something to prevent.
CW: Sounds like conclusions: 1) validate buffers aren’t used. 2) to preserve staging belt, we need to validate all previous upload buffers have been detached. Even the ones you’re not using in the submit.
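The second conclusion can be sketched as a check at submit time. This is an assumption-laden mock, not the proposed API: `MockUploadQueue`, `detach(id)`, and the numeric ids are invented for illustration, and only conclusion 2 (all previous upload buffers must be detached) is modeled.

```typescript
// Hypothetical queue that tracks which upload buffers are still attached.
class MockUploadQueue {
  private outstanding: number[] = [];
  private nextId = 0;

  // Hands out an upload buffer, represented here only by an id.
  uploadBuffer(): number {
    const id = this.nextId++;
    this.outstanding.push(id);
    return id;
  }

  // Marks the upload buffer with this id as detached.
  detach(id: number): void {
    this.outstanding = this.outstanding.filter((other) => other !== id);
  }

  // Fails if any upload buffer is still attached, even one that this
  // submit does not use, so the staging belt never has a hole in it.
  submit(): void {
    if (this.outstanding.length > 0) {
      throw new Error(
        "validation error: upload buffers still attached: " +
          this.outstanding.join(", ")
      );
    }
  }
}
```

Detaching the second buffer but not the first, as in RC's question, still fails validation in this model until both are detached.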
DP: If you gave the app control of managing its own staging belt, then you can do what Rafael is describing.
CW: createBufferMapped?
JG: Something I did put up is a polyfill doing this with createBufferMapped. Users could then do their own, even with weird cross-frame timings. The one thing there is that I’m not sure it’s possible to support a zero-copy with the API we have today.
RC: Well the one we have today is all async everywhere.
JG: Yes, and we did that because it’s easy to implement client-side. And to eliminate racy non-portable state and validation issues when mapping stuff and submitting it while in use.
JG: If we’re going to pick a failure case, mapping-didn’t-happen-and-you-should-handle-it is probably one of the most straightforward ones we can do. You can do the async thing we have. Or you have copy bloat. Many apps are going to be trying to hit the zero-copy path. Every time they fall off it is a paper cut. If we had fallible mapping then you could implement this staging belt uploadBuffer concept with it.
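A rough sketch of how an app might handle the mapping-didn't-happen case JG describes. `MockBuffer`, `tryMapWrite`, and `inUseByGpu` are hypothetical names for illustration; no such method exists in the API discussed here. The fallback shown is the copy path, one of the two alternatives mentioned.

```typescript
// Hypothetical buffer whose synchronous mapping can fail.
class MockBuffer {
  inUseByGpu = false;
  readonly contents: Uint8Array;

  constructor(size: number) {
    this.contents = new Uint8Array(size);
  }

  // Returns a writable view, or null when the buffer is busy.
  tryMapWrite(): Uint8Array | null {
    return this.inUseByGpu ? null : this.contents;
  }
}

type UploadPath = "mapped" | "staged-copy";

// Tries the zero-copy path first; on failure, keeps a copy of the data
// to be uploaded later (the "copy bloat" path).
function upload(
  dst: MockBuffer,
  src: Uint8Array,
  stagedCopies: Uint8Array[]
): UploadPath {
  const mapping = dst.tryMapWrite();
  if (mapping !== null) {
    mapping.set(src);
    return "mapped";
  }
  stagedCopies.push(src.slice());
  return "staged-copy";
}
```

The return value makes the fall-off visible to the caller, so an app can notice when it misses the zero-copy path instead of paying for the copy silently.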
CW: Imagine we do this fallible mapping. In a cross-process browser that can’t do cross-process-mapping. When someone does a tiny mapping of a big buffer, then you’re flushing a ton of shmem for a small update.
JG: I think that is why it’s usually mapBufferRange.
CW: That’s why I think COPY_SRC only with MAP_WRITE makes sense. What we could also do is have a UMA extension.
JG: That is almost what D3D12 does. The normal path sends you through UploadHeaps and stuff, but you can opt into Custom heaps where available.
AE: What if we did mapBufferRange but no overlaps or only one at a time.
DM: Very possible, if there’s multiple at a time, then it’s like the BufferViews but restricted to uploads only.
CW: Not buffer views because the state tracking is simpler. There are a number of outstanding mappings, and when you unmap all of them are gone.
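The state tracking AE and CW describe could look roughly like the following. This is only a sketch under assumptions: `MockMappableBuffer`, `mapRange`, and `isMapped` are made-up names, and rejecting overlaps by returning `false` is one possible behavior, not an agreed design.

```typescript
interface MappedRange {
  offset: number;
  size: number;
}

// Hypothetical buffer that allows several non-overlapping mapped ranges
// and drops all of them on a single unmap.
class MockMappableBuffer {
  private ranges: MappedRange[] = [];

  constructor(readonly size: number) {}

  // Returns false if the range is out of bounds or overlaps an
  // outstanding mapping; otherwise records it and returns true.
  mapRange(offset: number, size: number): boolean {
    if (offset < 0 || size <= 0 || offset + size > this.size) {
      return false;
    }
    for (const r of this.ranges) {
      // Half-open ranges [offset, offset + size) overlap test.
      if (offset < r.offset + r.size && r.offset < offset + size) {
        return false;
      }
    }
    this.ranges.push({ offset, size });
    return true;
  }

  // True while any mapping is outstanding.
  get isMapped(): boolean {
    return this.ranges.length > 0;
  }

  // Unmapping drops every outstanding range at once.
  unmap(): void {
    this.ranges = [];
  }
}
```

Because the only states are "some ranges mapped" and "nothing mapped", validation elsewhere in the API would only need to consult `isMapped`.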
JG: Oh, so there’s two possibilities. You can map subranges, but the buffer can’t be used anywhere else in the API while any one mapping exists.
DM: This is something the closure-based queue uploads can provide. Don’t have to know on the client-side if the range is in use… can check whether the range is currently in use by the GPU. If it is in use, it’ll fill in the staging belt and issue a copy.
JG: So silently copy-bloat.
CW: As a data-point, a round trip takes about 1 ms.
JG: That’s why in WebGL we piggyback that tracking on top of user fences. It’s the trick of: if the user can’t observe it, it hasn’t happened yet.
CW: Okay so we have like four different solutions to the same problem, and variants of all of them.
DM: Should write them all down and the problems each has.
- Queue transfers uploadBuffer without closure
- Queue transfers uploadBuffer with closure
- Existing mapping - can have user-staging-belt, maybe uma-extension, map-write with other usages? with subranges?
JG: AI to start a hack md on this. Will send the link later today.
CW: RC what do you think?
RC: Part of me thinks for v1, bufferSubData is the only thing, and for v2 an extension or module can be added. All the other ones have some problems.
CW: SetSubData has problems too.
CW: WebKit is doing a memcpy even if the driver is in-process.

Racy buffer mapping

PR burndown

#496: Add [SameObject] to several more attributes.
- Approved
#498: Add [EnforceRange] to all integers.
- JG: By default if you have a u32 in WebIDL and pass +infinity it is coerced somehow. EnforceRange produces an exception in that case. It isn’t slow to add this because in JITed code it is clear whether the value is a u32.
- JG: Only GPULimits might not need to be enforced. Name is gross but c’est la vie.
- CW: Semantics approved and bikeshed on the name.
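To illustrate the difference JG describes, here is a small sketch that mimics the two conversions for a 32-bit unsigned integer in plain TypeScript. It follows my understanding of the WebIDL conversion rules and is not the browsers' binding code; both function names are made up for this example.

```typescript
// Default conversion: non-finite values become 0, the fraction is
// dropped, and the result wraps modulo 2^32. `>>> 0` does all of this.
function toU32Default(value: number): number {
  return value >>> 0;
}

// [EnforceRange]-style conversion: non-finite or out-of-range values
// throw instead of being silently coerced.
function toU32EnforceRange(value: number): number {
  if (!isFinite(value)) {
    throw new TypeError("value is not a finite number");
  }
  // Truncate toward zero.
  const truncated = value < 0 ? Math.ceil(value) : Math.floor(value);
  if (truncated < 0 || truncated > 0xffffffff) {
    throw new TypeError("value is outside the u32 range");
  }
  // Adding 0 turns -0 into +0.
  return truncated + 0;
}
```

With the default conversion, passing `Infinity` or `-1` quietly yields a valid-looking number, whereas the enforcing variant surfaces both as exceptions.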

Agenda for next meeting

Comprehensive analysis of buffer mapping and uploads
Make an agenda for meeting with Khronos
- RC: Is the purpose to talk about SPIRV and how to use it? Or is it how W3C and Khronos can work together nicely?
- CW: No, it is about SPIRV as a main topic. Exactly what about it is still TBD.
