
GPU Web 2020 03 02


Chair: Corentin / Dean

Scribe: Ken / Austin

Location: Google Meet

Tentative agenda

  • Buffer upload
  • Keeping data on-chip
  • PR Burndown
  • Agenda for next meeting

TL;DR

  • Buffer upload
    • Discussion on the amount of client state tracking for the different proposals.
    • There will need to be an explicit release because there's no guarantee of when the JS GC will run.
    • More discussion; we made progress on understanding the problem space, but there is no concrete way forward yet.
  • Investigation: Managing on-chip memory
    • Myles introduced the proposal to the group.
    • Proposed for the core API, not an extension, to increase adoption and make content faster for free on mobile.
    • Concern that it is too limiting even for deferred renderers.
  • PR burndown
    • #532 Add read-only and write-only storage textures as new binding types
      • Merged
    • #554 Add size to setIndexBuffer/setVertexBuffer.
      • Concern about robust buffer access.
      • Folks to take another look at it.

Attendance

  • Apple
    • Dean Jackson
    • Justin Fan
    • Myles C. Maxfield
  • Google
    • Austin Eng
    • Corentin Wallez
    • James Darpinian
    • Ken Russell
    • Shrek Shao
    • Ryan Harrison
  • Intel
  • Microsoft
    • Damyan Pepper
    • Rafael Cintron
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
  • Mehmet Oguz Derin
  • Timo de Kort

Administrative items!

  • CW: First WGSL meeting tomorrow! 11am Pacific.
  • MM: This might not be the final schedule for the meeting. We’ll decide tomorrow.
  • CW: F2F meeting in Phoenix - May. Due to COVID-19 we might have to decide to cancel. We’ll keep an eye on it.

Buffer upload

  • JG: I think the recent discussion has been really good. Two things:
    • Agreeing on how much required overhead tracking would have
    • Whether roundtrips are required for some implementations
  • MM: let’s discuss the second one first. Any impl that doesn’t want to map something till JS calls map(), and has a GPU process, would require round-trip to GPU process, which would be async.
  • CW: true, if we want to make sure the buffer is 100% not in use by the GPU, and assuming the buffer is in use except when it isn’t. Possible for client side to be pessimistic about when the buffer’s in use.
  • MM: other consideration: perhaps not wanting to make copy on reads, but maybe for writes.
  • JG: Need round trip because data isn’t ready? I’m assuming many impls will have shared memory region for mappable buffers that’s mapped into both the parent and child processes.
  • MM: so the actual map() call needs to tell the GPU process to map it.
  • JG: could leave it mapped all the time.
  • MM: Don't think it's possible to map buffers at all times because it increases the number of pinned pages, which, per the kernel team, is bad (less freedom to relocate pages). Natural result is that mapping shouldn't happen by the browser until JS asks for it.
  • KR: In remote GPU process implementations, a map call might not call the driver's map. It can just allocate a shmem. In WebGL JG found a smart design for map read where there is no blocking <...>
  • JG: Basically, knowing which fences guarantee the completion of which operations lets you do this. If you enqueue a copy from a buffer to a buffer and then map it, you know you can map it as soon as the fence passes. Once you know the fence has passed, it can be immediately mapped.
  • MM: think that’s not true in a GPU process arch. Have to ask the GPU process to do the mapping.
  • CW: In a single-queue world, for mapping, it’s doable (With extra tracking)
  • MM: the GPU process has to do the map. Web process has to tell it, please perform the map operation.
  • JG: In WebGL, it’s done eagerly. As soon as you finish the set of operations that copy into a buffer that might be read back, we copy it into shmem.
  • MM: sounds like a lot of extra copying.
  • JG: no way to do it without a usage migration flag. “Map this async”.
  • MM: think another way to do this is mapWriteAsync, which we have today.
  • JG: What if I told you that if you did this, it does what most content wants to do, and it doesn’t need a round-trip anymore? That’s why WebGL does it.
  • DM: so you’re suggesting if the impl doesn’t want a lot of pinned mapped pages, once work is done by GPU, the GPU process maps it and lets the client access it whenever it needs?
  • JG: I wasn’t talking about the pinned pages question. I was talking about whether or not we need a round-trip in the map-read case.
  • CW: so you’re saying, by eagerly copying data and sending it to the content process, we can avoid the round-trip, and the explicit transition by the application.
  • KR: This is only done for buffers marked with MAP_READ usage in WebGL, which is a strong hint you want data back from it. When the client waits for a fence, it's the processing of the fence in the GPU process that then optimistically populates the buffer data in shmem. (See the readback sketch at the end of this section.)
  • MM: For us, it wouldn’t be a hint. It would affect the shape of the API. More of a contract, not a hint. If you mark it as mappable, and you issue a command into a command stream, and a fence after the command, then the implementation MUST transfer the results of the operation back to the Web process before the fence promise is resolved.
  • KR: Maybe that wouldn’t be that bad if you say you can’t mapReadAsync without waiting on a fence -- maybe produce an error.
  • MM: The thing I’m worried about is people putting mapping usages on resources when they don’t need it. Copying the entire contents of resources is a lot of copying.
  • JG: the changed contents of resources, not the whole resource.
  • MM: then it’s just a diff.
  • JG: No diff to compute - just the range.
  • CW: Alternative is that MAP_READ resources are only TRANSFER_DST. For readback: mapping async doesn't exist in native APIs, but mapReadAsync is less "eww" than mapWriteAsync, which is more alien.
  • JG: agree.
  • CW: think the harder problem to solve is mapWriteAsync. It’s more alien to native developers. mapReadAsync is a less critical use case.
  • JG: My recommendation for that: if you want to immediately start uploading data to the GPU, createBufferMapped with TRANSFER_SRC, copy out of it, then delete the buffer. Only problem with that is that people might not delete it properly. (See the sketch at the end of this section.)
  • MM: right. Manual retain/release is an antipattern on the web.
  • JG: Don’t have a ton of options here. It makes things really nice: give you what you want to do, makes sense. The concern is allocating too many things an OOM, but that should be a fairly easy warning.
  • KR: We have to definitively say that for GPU resources, the developer has to do timely release. We’ve run into this with ImageBitmaps as well. Let’s focus on the mechanism and not on whether explicit release is needed.
  • MM: I’m absolutely worried about this problem. One reason: in Metal it’s impossible to release a buffer. The only way to do so is to let its retain count go to 0.
  • JG: cycle counted or refcounted?
  • MM: refcounted.
  • JG: When a refcount goes to 0, the object immediately goes away. That's not true in JavaScript.
  • MM: not arguing against the delete call. Arguing against: the most natural way to write code to record commands & upload data requires manual retain/release.
  • CW: In a way, the most natural way is GPUQueue.writeToBuffer
  • JG: But that has problems. It’s because we have these complications that we have to pick designs that aren’t as nice. We’re trying to make the API as nice as it can be.
  • MM: one proposal for as nice as it can be: keep mapWriteAsync / mapReadAsync, and add DM's proposal "writeToBuffer / writeToTexture". Can do optimal transfers for the other kind of upload too.
  • JG: You have to do all the tracking anyway.
  • CW: why do we have to do tracking?
  • MM: DM’s proposal requires no tracking.
  • CW: GPUQueue.writeToBuffer.
  • JG: DM had two big proposals here. setSubData and Queue based uploads.
  • DM: the other one (queue based uploads) got closed.
  • CW: setSubData, now called writeToBuffer, doesn't require additional tracking. It can be 2-copy on a bunch of systems. (See the sketch at the end of this section.)
  • JG: to do it optimally requires the tracking. You have to know whether the buffer's ready. Impossible to hit the fast paths without knowing that.
  • CW: If you need to know that the buffer is ready: in the path where you know the buffer is no longer in use, for a multiprocess arch, it would be JS -> shmem, then shmem -> GPU.
  • JG: unless the GPU resource was already mapped cross-process, sure.
  • CW: But that seems scary to do in general, for a lot of resources. Other option: in the content process, create a shmem region, copy into it, send it to the GPU process, which wraps it as a VkBuffer or D3D12 or MTL thing, and then copies that into the GPU resource. Still a 2-copy path. No need to do state tracking on the client side.
  • JG: yes, but there’s an even more minimal path which does require the state tracking.
  • MM: I’m not allergic to state tracking.
  • JG: But everyone needs to agree on whether it’s fine because it shapes what design we pick.
  • CW: I’m not sure what state tracking you’re talking about.
  • JG: client-side tracking of mappable resources.
  • CW: I thought you were talking about state tracking for mapping buffers.
  • JG: if you want to do minimal number of copies you have to do state tracking.
  • MM: has anyone actually verified it’s possible to do a 3-way map between GPU, GPU process and web process?
  • CW: I didn’t but I did ask the D3D team about this on Discord, whether we can map a createAnonymousFile shared across-process with D3D12 and he said yes. But we should probably check.
  • DM: JG, the tracking you’re referring to: for each operation we have to know which regions of mappable buffers it uses?
  • JG: yes.
  • DM: trying to specify it is a little bit strange. What exactly is used may not be obvious to the user. Something we intuitively understand, but when I specify it: if your draw calls reference it, is it used?
  • JG: If it’s within the range currently bound for the draw call, it’s in use. If you submit such a command buffer until you know it’s completed via fencing, then it’s in use.
  • KR: Would it be able to be more conservative..
  • JG: Don’t think we need more conservative tracking that that. Right now with mapRead/WriteAsync, we don’t need a specification about when a buffer is use. It’s all hidden and you’re supposed to wait around until it’s ready.
  • KR: It’s difficult to understand which regions of vertex buffers are going to be referenced by draw calls, and almost impossible for indirect draw calls.
  • JG: I have a PR for subrange of a buffer for use with draw calls. If you have that, then you’re limited to that range.
  • CW: just so I understand: you’re asking about state tracking on the client side for guaranteed-efficient mapping? So at any time you can say map(), but it’ll be efficient if you waited for a fence?
  • JG: all of them.
  • AE: it applies to all of them.
  • JG: The axiom of needing buffer tracking to do a minimal amount of copies is true for all of the copies. In the fallible map option, it is exposed to JS.
  • CW: but we want to have an API that allows 10K draw calls per frame, and it does that detailed buffer tracking on the content side?
  • JG: You’re tracking subranges
  • CW: even worse.
  • JG: you’re mapping your vertex buffers. Why would you do that? You’re on a non-UMA system. Can’t access the buffers anyway.
  • CW: What are the usage restrictions then?
  • JG: You can have any you want, but some will have perf impact. That’s why we have them.
  • CW: I’m on an Intel system with UMA. Sounds amazing that I can make all my buffers MAP_WRITE. Then run on another system and it’s awful.
  • JG: manipulating buffers in flight is a terrible thing. Map, change, draw over and over again - not efficient. People will want to try this, it’ll be slow, and we’ll explain why.
  • DM: Not if the GPU is using a different range from what they’re drawing from. Perfectly fine on Intel.
  • KR: Already, native apps are using a massive persistently mapped buffer to do subregion uploads and all the fencing they need to ensure no racing. This pattern will not work by design (today). Is it too much of a stretch to need to double/triple buffer their vertex data so that they can use a different one while the other buffer is in flight?
  • CW: For uniform buffers, the question came up before. Problem: they're baked into BindGroups. Hard to change frame to frame because you have to recreate the BindGroups. Same issue with RenderBundles: the GPU address is baked into the RenderBundle. Can't just change it - have to re-encode the RenderBundle, or triple-buffer the RenderBundles.
  • JG: One thing that does help: you effectively have a MAP_WRITE usage, and a vertex-pushing usage. Then your shared memory is static from the client standpoint. Even if enqueuing copies on the GPU, you have the map available on the client. Expose it only if the lack of race conditions in the API allows it. In many of these cases there are warnings we can provide that steer people away from some of these patterns, like allocating everything with all usage flags. Certain patterns of usage with the MAP_READ or MAP_WRITE usage flags should warn. We can / should establish best practices for these sorts of things.
  • CW: What’s the gain we have with internal state tracking versus or in addition to mapAsync?
  • JG: mapAsync today doesn't really need client-side state tracking.
  • CW: what’s the improvement relative to mapAsync?
  • JG: The improvement is: say I rendered some stuff, then injected a fence. Later I check the fence, and I know that I can reuse the buffers again. Directly map into them, write data for the next frame or render again.
  • CW: so first, that requires persistently mapping / keeping them pinned. Also, why can’t you do this with mapAsync?
  • JG: have to enqueue them, no way to cancel them.
  • CW: you can cancel the Promises. Still pay the GC cost.
  • JG: you can batch this - another advantage. Multiple buffers -> you don't need multiple promises. These are different ways of having the same guarantees, modulo the GC and batching. There's definitely user desire for being able to map things immediately, instead of having to wait for the microtask boundary / collect all the promises. I've already implemented the thing people want - fallible mapping - on top of mapWriteAsync. It's possible to do in a polyfill. The question is what we have in our core API. I think it's fine if we just end up with these async ones. Not ideal; it generates garbage, and people don't want the garbage.
  • CW: there was this proposal a while ago - Kai’s? - mapAsync, you could say “.then()”, the buffer would act like a Promise. No additional garbage.
  • JG: good progress, let’s talk about the other agenda item.
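
A minimal sketch of the fence-gated readback pattern discussed above, written against the draft WebGPU API of the time (mapReadAsync, defaultQueue); the names `device`, `src`, and `size` are assumed to exist in surrounding code:

```js
// Create a readback buffer; MAP_READ is the strong hint that data will
// be read back (COPY_DST was called TRANSFER_DST in older drafts).
const readback = device.createBuffer({
  size,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});

// Enqueue the copy from the GPU-side buffer into the readback buffer.
const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(src, 0, readback, 0, size);
device.defaultQueue.submit([encoder.finish()]);

// In the eager-copy scheme described above, the GPU process can populate
// shmem while processing the copy/fence, so this map can resolve without
// an extra round trip.
const data = await readback.mapReadAsync();
console.log(new Uint8Array(data));
readback.unmap();
```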
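
A sketch of the upload pattern JG recommends (createBufferMapped, copy, explicit destroy), again using draft-era names; `device`, `dstBuffer`, and `data` (a Uint8Array) are assumptions:

```js
// Draft-era createBufferMapped returns the buffer plus a mapped ArrayBuffer.
const [staging, mapping] = device.createBufferMapped({
  size: data.byteLength,
  usage: GPUBufferUsage.COPY_SRC, // called TRANSFER_SRC in the discussion
});
new Uint8Array(mapping).set(data);
staging.unmap();

// Copy out of the staging buffer into the real GPU resource.
const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(staging, 0, dstBuffer, 0, data.byteLength);
device.defaultQueue.submit([encoder.finish()]);

// The "delete it properly" step: JS GC gives no timeliness guarantee,
// so the release has to be explicit.
staging.destroy();
```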
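
For contrast, a sketch of the writeToBuffer shape under discussion (DM's proposal, formerly setSubData); the exact signature is an assumption, not from the minutes:

```js
// The implementation stages the bytes itself (JS -> shmem, shmem -> GPU:
// the 2-copy path), so no client-side state tracking is needed.
device.defaultQueue.writeToBuffer(dstBuffer, /* bufferOffset */ 0, data);
```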

Investigation: Managing on-chip memory

  • CW: like Vulkan renderpasses
  • MM: I’ll introduce it. People probably haven’t read it yet.
  • MM: important parts:
  • MM: there are tiling GPUs, pretty popular. Important that content work well on tiling GPUs. Already 2 native APIs that have some facilities for making content work well on this class of GPUs. However they both have this problem - they’re opt-in. By default these facilities are not present, have to explicitly ask for them. For people making websites, they’ll develop on desktop, content will work fine, and won’t have considered the possibility of these tiling GPUs. Pretty bad for perf on the mobile GPUs.
  • MM: wanted to make proposal that fulfilled requirements.
  • MM: 1) should work everywhere (tiling and non-tiling GPUs) with no perf detriment.
  • MM: 2) implementable on top of Vulkan subpasses and Metal ImageBlocks.
  • MM: 3) should be part of the core API. Then devs won’t have to check to see if it’s enabled. Think it’s important for the success of this proposal.
  • MM: then also: been stated by CW and unofficially by others that Vulkan subpasses proposal is too scary. Didn’t want to just reproduce that. OTOH, Metal ImageBlocks are strictly more powerful than Vk subpasses.
  • MM: Proposal: Start with the least powerful thing, Vk subpasses; make it as simple as possible; try to add some performance characteristics like lazy clearing that would reduce memory bandwidth; and put that in the core API. If we did this, it would lead to better perf on tiling GPUs, and authors would be more likely to use it.
  • MM: we can talk about the details in another meeting, unless people have read it and want to comment.
  • CW: One question: Why (3) ? Versus being an extension.
  • MM: good question. Answer is: want to increase the chance that developers will actually use it. Don’t want them to have to think about whether to use it & then have 2 different code paths. # of extensions increasing -> exponential increase in number of code paths. Think it will lead to increased adoption of this feature.
  • MM: if we did make it optional -> evidence is that it wouldn’t be commonly used. A feature not commonly used effectively doesn’t exist.
  • CW: Sort of means things shouldn't be extensions, or should "just work" even if more restricted than they need to be.
  • MM: that’s my general philosophy. Places where that doesn’t work. There are extensions adding capabilities that can’t be polyfilled. Those, I’m OK with adding as extensions. Philosophically, we should have as few extensions as possible.
  • CW: Thank you, I have more comments, but on the details of how it maps to Vulkan which we can work through offline.
  • DM: my first impression is that having 2 subpasses is too limiting. For gbuffer in particular, would like separate pass for SSBOs . Order-independent transparency would have multiple passes as well.
  • MM: that’s good feedback. Not refuting, but: one of the goals of this is to add an API that’s not so complicated that devs wouldn’t be scared. Finding a middle ground is tricky.
  • RC: if we put this in core, will devs not use subpasses, and just do regular renderpasses and it’ll work great on desktop and poorly on tiled?
  • MM: yes. Go to the top of that issue. In the top post, there's a bar graph. The middle bar describes what you mentioned. The right bar is the perf you'd get if you use this feature.
  • DP: Looking at the proposal, what if there are no intermediates?
  • MM: then you get today’s behavior. No intermediates, single pass. nextSubpass() function would be an error.
  • DP: so there’s no incentive for anyone to use this except for improved perf on platform they may not have access to.
  • MM: any proposal shouldn’t make it required to have multiple passes. Suggestion is a weak one, but still stronger than today, where you have to feature-detect it.
  • DP: hopefully it’ll incentivize people to do the right thing. Get the benefit for free.
  • MM: suggestions welcome.
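
Purely illustrative sketch of what the proposal's shape might look like; only nextSubpass() is named in the discussion, and every other name here (attachments, pipelines, descriptor fields) is hypothetical:

```js
// Hypothetical two-subpass deferred-shading pass; nothing here except
// nextSubpass() comes from the actual proposal.
const pass = encoder.beginRenderPass({
  colorAttachments: [finalColorAttachment],
  // hypothetical: intermediates that stay in tile memory, enabling
  // lazy allocation/clears and saving memory bandwidth
});

// Subpass 0: write the G-buffer intermediates.
pass.setPipeline(gbufferPipeline);
pass.draw(gbufferVertexCount, 1, 0, 0);

// Named in the discussion; an error if the pass has no intermediates.
pass.nextSubpass();

// Subpass 1: read the intermediates at the same pixel, write final color.
pass.setPipeline(lightingPipeline);
pass.draw(3, 1, 0, 0); // fullscreen triangle
pass.endPass();
```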

PR Burndown

  • #532 Add read-only and write-only storage textures as new binding types
    • Basically what we discussed at the F2F, without renamings. Any concerns?
    • JG: think it’s waiting on Justin for review.
    • JF: did I not already review this?
    • CW: you can look offline, if it looks OK please press merge.
    • JF: I already reviewed. Merged.
  • #554 Add size to setIndexBuffer/setVertexBuffer.
    • CW: DM helpfully pointed out that this makes the API consistent. Everywhere else buffers are used we have a size.
    • JG: maybe we shouldn’t mention in the spec that it’s useful for things not in the spec. If we remove that statement we can merge it.
    • CW: yes, that was the only concern stated.
    • JG: basically, size defaults to 0, and size == 0 means use the rest of the buffer. (See the sketch at the end of this section.)
    • RC: does this impact the amount of validation needed in other parts of the API?
    • JG: we have to do that for everything except this anyway.
    • CW: this also might interact with robust buffer access. Maybe in Vk, if you use robust buffer access and don't specify the whole size of the buffer, you get crazy values.
    • JG: that might be true. Think probably true for other usages.
    • CW: does setVertexBuffer require a size in Vulkan?
    • RC: in D3D12 it does sometimes. Root constants just take offsets. If we just use that then we have to do size checking in the shader.
    • JG: we have stricter validation.
    • CW: in Vulkan if we want to rely on robust buffer access behavior in the driver, binding vertex buffers only takes buffer and offset, wouldn’t get clamping to the size.
    • DM: it’s specified in a way that it doesn’t make a difference. Even if bound to range it can still return stuff elsewhere in the buffer.
    • JG: OK, I’ll fix it up. Who should review?
    • CW: I’ll review.
    • JF: functions that access this take vertex count, so that implies the amount of data referenced from the vertex buffer.
    • JG: if you wanted to accumulate the affected range for validation.
    • DM: also indirect calls.
    • JG: yes, those don’t do this.
    • JG: I’ll look at this more.
  • #557 Synchronization section of the spec
    • NOTREACHED()
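
A sketch of the #554 shape discussed above, with draft-era signatures; the buffer names and offsets are illustrative:

```js
// size == 0 means "from offset to the end of the buffer".
pass.setIndexBuffer(indexBuffer, /* offset */ 0, /* size */ 0);
pass.setVertexBuffer(0, vertexBuffer, /* offset */ 256, /* size */ 0);
pass.drawIndexed(indexCount, 1, 0, 0, 0);
```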

Agenda for next meeting

  • CW: we just open-sourced the first version of the WGSL compiler https://dawn.googlesource.com/tint as of 9 minutes ago!
  • JG: this is WGSL to SPIR-V?
  • CW: yes. But it’s very WIP. FYI.
  • Buffer mapping again?
    • JG: yes.
  • JG: same agenda as this meeting.
  • AE: #552? Sampler creation requires knowing whether a compare function will be used?
    • JG: is that a PR we can burn down today?
    • AE: it’s an issue, could make it a PR.
    • CW: might be issue on other APIs too.