GPU Web 2020 03 30

Chair: Corentin

Scribe: Ken / Corentin / Austin

Location: Google Meet

Tentative agenda

Query API #614 and #619 and #656
Data upload #605, #649 and #650
PR burndown
Agenda for next meeting

Attendance

Apple
- Dean Jackson
- Fil Pizlo
- Justin Fan
- Myles C. Maxfield
- Robin Morisset
- Theresa O'Connor
Google
- Austin Eng
- Corentin Wallez
- Dan Sinclair
- David Neto
- James Darpinian
- John Kessenich
- Kai Ninomiya
- Ken Russell
- Shrek Shao
- Ryan Harrisson
- Brandon Jones
Intel
- Brandon Jones
- Bryan Bernhart
- Yunchao He
Microsoft
- Chas Boyd
- Damyan Pepper
- Rafael Cintron
Mozilla
- Dzmitry Malyshau
- Jeff Gilbert
Joshua Groves
Matijs Toonen
Mehmet Oguz Derin
Pelle Johnsen
Timo de Kort

Administrative

CW: W3C wants to push forward with the charter. I can take over the PR if you like.
- He did a pull request on your pull request.
- DJ: looking now.

Query API #614 and #619 and #656

CW: any concerns with occlusion query side of the API?
MM: point I'd like to make is about timestamps, not occlusion queries.
DM: there's an argument for precise occlusion queries, but it's an extension. Extension adds an optional argument to the function. What if there are multiple extensions adding optional arguments?
CW: maybe "extensions" should be "features" per Kai's suggestion. Do we see any other arguments being added to occlusion queries?
DM: no
JG: would it be an error to make it a dictionary with options? Might be better because it's more future-proof. Easy to turn a call with an argument into a dictionary.
DM: I'm not worried about this case, but the precedent it sets for other extensions. Ability to modify API is a privilege of extensions added right now. Would be better if extensions would all follow the same rules, e.g. extending dictionaries.
MM: we shouldn't add optional dictionaries to every call.
KN: adding an optional dictionary later is non-breaking. In this case I'd have the call without an extra parameter. I'm mostly in favor of optional dictionary argument. Shouldn't have significant overhead if you're using the overload that doesn't provide it.
CW: if we add the version that doesn't have the optional dictionary that's fine, and figure out in parallel the structure of extensions?
MM: this PR seems to be adding both pipeline statistics and timestamp queries? Not super confident in how these should be added right now.
KN: I agree, there's already a comment on there.
CW: for occlusion queries: seems mostly good? Deal with precise occlusion queries, what that means for extensions, etc. separately?
MM: I propose that we merge the occlusion query PR now, and we can make additions / changes in the future, to unblock things.
CW: can be merged as soon as the precise stuff and pipeline statistics, etc. are removed, correct?
MM: yes.
CW: Kai can you make suggestions on the PR about what to change?
KN: yes. We remove precise?
CW: and timestamps and pipeline stats.
KN: Sounds good.
MM: thinking about approach we discussed last call about 2 different types of timestamp queries. One, in core, useful but not super useful. Second, fine-grained, can ask per-draw-call info. Thinking about draw calls on tiling GPUs; they're all collapsed. Doesn't make sense to take timestamp of a single draw call there. Only whole render pass. If we have optional timestamp / duration API, would have to be at the pass boundaries rather than around individual draw calls.
CW: is there any API on Metal for iOS allowing timing per-tile?
MM: there isn't. More of a general point rather than specific implementability point. Don't know how Vulkan does this. What's state of the art for timestamps on Vulkan?
JG: they're optional I think.
MM: so tiling GPUs don't have them?
CW: I don't know how it would work on tiling GPUs.
JG: that's what we fond that's why disjoint timer queries didn't work on tiler GPUs
KR: <missed something> and could return an invalid value because they couldn't
JG: What about 0 bits of timestamps that are technically allowed?
KR: That's an option and even some macOSes gave 0 bits of precision. Disabled in Chrome for this for portability. Disjoint timers give you some duration, but it's unclear exactly what it does.
MM: another possibility: keep start/end duration queries. Allow those to drift to begin/end of pass. A little deceitful, but always an option.
DP: it's up to interpretation what the start/end is of a draw call. No problem leaving this more ambiguous on a tiling GPU.
MM: if we go with this approach it'd be valuable to tell the author that if they draw a lot of things inside a pass then it won't do what you expect.
DJ: if it's up for interpretation and unreliable then what's the point of adding this vs. just render pass measurement?
CW: I think it's for a tool that does rough time measurements of various things. Human eyeballs things and sees that things look wrong. If computer uses it for something, more problematic.
DJ: wouldn't one use be, measure something at start of render pass and make some determination later?
DP: could imagine someone doing this at startup to scale workload appropriately. Having said that, Dean's point about putting it around render pass seems enough.
MM: just wanted to make a note, wanted to amend last week's conversation.

Data upload #605 #649 and #650

JG: two PRs. MapBuffer with sub-ranges (not on queue any more). Other is embedded in CommandBuffer upload path. Couple matrices you'd like to update for draw calls. Small, and small copies are free, so embed them in cmdbuf. Could interleave these updates with your draw calls. This is a way to make small content easy, where # copies doesn't matter as much. Also impls can optimize this - embed things directly in the cmdbuf, not creating entire shmem segment for it.
JG: thoughts on small updates in cmdbuf? One weird thing is we put a hard limit on it. "Don't use it for general purpose uploads", per Vulkan spec. Don't use it for vertex uploads. Use map primitives for that.
CW: think the limitation comes more from specific hardware, wanting to use certain hardware for the Vulkan method.
RC: size limitations aside, is this similar to setSubData?
JG: sort of. Instead of being in global mapping or queue timeline, it's in the cmd buffer timeline. If you enqueue buffer update into cmd decoder, and don't submit the command encoder, it floats around and might never get uploaded to the GPU. Idea is, I have draw calls and need them to embed this matrix data.
CW: my concern is that it's not a pit of success - you use WebGPU, I'm going to upload some data. Two pitfalls: 1) blows up if you use too much data, 2) want to upload a giant model, larger than 64K. I'll just use uploadBuffer many times.
JG: it's pretty easy to detect this. Like in WebGL, we will point out performance warnings. If you have an upload slowly growing over time, I argue it pushes you back into the pit of success. Your use case has grown beyond this API, and you should redo this in the right way, using mapping.
DM: I don't understand how it pushes you into the pit of success. You load one mesh, it works, and load another model, it doesn't work.
JG: depends on definition of success. In this API, predictable performance is key. Might as well upload this because you don't have many vertices. Fine until you hit this limit. That's a way to push you onto the correct path. Similar to putting all textures into a texture atlas, but then your texture gets too big. We report something in the console. User files bug against engine developers, they find that models too large need different upload path, and they fix it.
JG: could we rely on warnings? i.e., you can upload an unbounded amount of data this way, we just warn about it.
MM: some APIs have limits on what can go in the cmdbuf and what needs to go out-of-band. Impl that supports only small amount of in-band data promoting it to shmem sounds OK to me.
JG: we could leave it to impls to add warnings. Think it's a good place in the spec to say that you should generate warnings if people are using this as primary upload path.
CW: then why have writeForBuffer without warnings? We can warn users trying to upload large models with this API, could do the same thing with writeToBuffer.
JG: value added here in my proposal is that embedding it in the cmdbuf does something materially different. writeBuffer is something that can be polyfilled on mapBuffer.
JG: can polyfill functionality but not the behavior. The thing this lets impls do is embed data to upload into the cmdbuf. Definitely something Vulkan has an entry point for. If small enough, could happen in other APIs to. In worst case, becomes copyToBuffer. But can move data from client (content process) to server (GPU process) in the cmdbuf.
CW: I don't see the material difference.
JG: it's in the cmdbuf. Convenient because it's right next to the other data. WebGL impls do this today.
CW: in remoting impl of WebGPU there is a cmd buf, and queue submits go through there. writeToBuffer, you can inline the data too.
JG: that's on the queue, not the cmdbuf. Our queue submits will probably not have shmem. We'll do IPC calls for that. We won't have cmdbuf of cmdbufs.
MM: not too many comments. Wanted to say, in response to Corentin's question on #605: "please also describe if multiple calls to mapReadAsync before unmap", etc. I have no opinion, whatever you think is best.
CW: I don't have strong opinion either. Don't think subrange is super useful so any behavior seems fine.
MM: out of 3 proposals today are we trying to pick one? What's the option?
JG: pick one mapping proposal, and optionally the cmdbuf inline uploads. #650 is optional. One of the other two then.
CW: either #605, new mapWithSubrange, and choice of no immediate upload, writeBuffer, or updateBuffer.
JG: discussing mapBuffer with subranges.
JG: requires add'l tracking: not just whole buffer mapped/unmapped, but subranges. Typically should be small number of ranges. N=3, maybe: one buffer, and as you finish with different frames you map e.g. 2 frames behind, 1/3 the buffer used for uploading, as ring buffer, instead of multiple buffers. Not like reading ~4 bytes out of a larger buffer.
CW: several concerns. #605 addresses concern that mapWriteAsync you want to be able to map a sub-range, and only that part gets moved cross-process for GPU upload.
JG: this is like #605 but mapping the whole buffer every time. OK, so you want partial flushes of data?
CW: yes, although you also want to avoid data races so you can have GPU-visible buffer in content process.
JG: what if we took that part from #605 and subrange tracking from #649? Have heard from partners like Unity that they want this.
CW: have we really had this concrete feedback?
JG: it's common practice in Vulkan. I can reach out to them. "Borrowing" mappings which should apply pretty cleanly in their architecture. Also satisfies one of Ken's concerns where you just want to read back a small section of a larger buffer.
CW: think that was Myles' concern.
CW: my concerns about subrange tracking: it's useful if you are able to use a mappable buffer as something other than just copy-src or destination. Only useful to have subrange tracking if you want to use buffers as uniform or index. Imagine I write content on a laptop with UMA GPU and it works. Then you move your page to a discrete GPU and it behaves badly because it's reading the vertex buffer over PCIE and it's not fast.
JG: how does #605 address this?
CW: because mappable usages can't be used as other kinds of sources. Like the D3D12 heaps. Default heap that lives on the GPU and has fast access.
JG: I asked a while ago for #605 with subrange tracking. Didn't get that, so I made this proposal. Think some of the concerns #605 addresses... my main question, if we took the parts out of this you don't like, added subrange tracking to #605, what do you think?
CW: we have a footgun where people can use mappable memory for more than just transfers.
JG: I'm saying put subranges in #605, what are your concerns?
CW: do you allow more than COPY_SRC for MAP_WRITE memory?
JG: does #605 allow that?
CW: no.
JG: OK. then what are concerns with adding subrange tracking?
CW: they're not useful.
JG: ...or they want to do upload queues.
JG: I'm going to copy #605, add subranges to it, and we'll discuss at the next meeting.
CW: #605 with subranges add complexity and not value.
JG: I thought I told you the concerns it addresses, like reading back small amounts of buffers.
KR: Fundamentally different semantic between mapping the buffer or not at all. And having the buffer partially mappable means a lot of client side tracking.
CW: To be fair this is not a concern in #649 because there is a promise going on so you can go back to the service side.
RC: Sounds like one of you says “I think it’s useful” and the other says “I don’t”. Could Jeff provide example use-cases and Corentin can say why he thinks they don’t need this?
RC: One use case could be if you have one big buffer and want to use that simultaneously.
CW: Difficult because we’re trying to design an API for high performance game engines trying to target the web. And as much as browsers are kind of close to game engines, we don’t know 100% how they work and what they want. We should reach out to engines interested and ask them specifically about that question. Unity, Roblox, others.
CW: I think RC is right; we all have our beliefs about what engines want / don’t want, and it’s difficult to change beliefs without data.
JG: Surprising to me to think that clients of our API would not want to map part of buffers. I feel like if I were to ask that, they would say of course.
RC: As a remedial question, for #605, there’s getMappedRange with an offset and a size. How is that different from the range tracking that JG wants?
CW: In #605, it’s either completely mapped or completely unmapped. After that you can request a specific range which is just an indicator for what to flush.
RC: I see, and Jeff is asking that you can still use part of a buffer in a submit.
CW: In #649 you async request of subranges of a buffer to be mapped, and there’s tracking of what is and is not mapped. And exactly as you say, on a submit you can only use subranges that are currently unmapped.
CW: JG is correct that with a few changes, #649 is basically the same as #605 except with subrange mapping.
DM: If we keep the restriction of only staging buffers, then the scenario where the GPU wants to use it and you have it mapped.. I thought the idea of if we have a bunch of 1MB chunks, you can just grab the next one. If they’re just staging only, I don’t think it’s useful to have the subrange tracking.

PR burndown

Add callback entrypoints for per-frame Promise entrypoints. #626
- JG: one concern with Promise-based APIs is Promise generation. Instead offer callback based API, can construct Promises easily from them. Callbacks are no-garbage. Probably useful to some clients.
- #626 adds equivalent entry points to some async Promise-based entry points allowing callbacks to be used. Pick what you want.
- CW: no major concerns but I'm not JS expert
- KN: I suspect people will want it. Little inclined to leave them out until we have real test cases showing garbage is a problem for these entry points we don't expect to be called frequently. Little unsure about adding these because really think these should produce very small amount of garbage per frame, and if JS engine can't handle small amt of garbage per frame, we should get the JS engine fixed. But don't know how to get those kinds of problems fixed, or even identified. But I do anticipate having to add these kinds of entry points at some point.
- JG: would be nice if we didn't have to care about this small stuff, but have gotten some feedback from users that they don't want to generate even a single byte of garbage per frame.
- CW: there will be a lot of other places. Have to generate at least some garbage with command encoders, for example.
- JG: this is just a partial solution.
- KN: I don't think this helps with mapping because the returned ArrayBuffer is already garbage. Maybe leave that out?
- BJ: using callbacks vs. promises to avoid garbage isn't hard, but the way people use callbacks in JS will produce just as much garbage, because people tend to use inline functions and those generate garbage per call.
- JG: that's why I left the Promise entry points in.
- CW: did you have the same feedback for VR content? Producing any garbage per frame was risky?
- BJ: if it's useful I can go into feedback from TAG and other ergonomics folks. Fairly applicable. In WebVR we had a C-like structure. Allocate frame-like object, populate it with latest data every frame. Got dinged badly by TAG. Their statement in general was what Kai said: garbage is part of the web, you shouldn't bend over backwards to avoid it. If this garbage is problematic it's the engine's problem rather than the API's. We rearchitected the API to return Promises, etc. I'd be surprised if you didn't get the same kind of feedback here. In the grand scheme of things though WebXR is a fairly small number of API calls per frame.
- RC: I was on the receiving end of some of that feedback. If you're doing it once per frame it's probably fine. The garbage that's 1000x per frame is the kind you want to avoid for JS engines. I do find that this PR is touching per-frame APIs but the dictionaries we pass into beginRenderPass aren't modified.
- JG / KN: you can reuse those.
- JG: this is based on a lof of feedback from early days of asm.js / WASM. People were excited because it avoids garbage generation by design. Micro-opts of avoiding little bits of garbage is trying to get back to that goal.
- JG: this was supposed to be mostly an example of how we could do this. Don't have to adopt it today.
- KN: we'd have to deal with RenderPassEncoders, etc. too.

Agenda for next meeting

Data uploads
Query API update?
PR Burndown
- Add default for arguments of draw and drawIndirect. #632
- Add pipeline shader module error enumeration. #646
- Add validation rules for copyBufferToTexture #648

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU Web 2020 03 30

Tentative agenda

Attendance

Administrative

Query API #614 and #619 and #656

Data upload #605 #649 and #650

PR burndown

Agenda for next meeting

Clone this wiki locally