
Minutes 2019 09 27

GPU Web 2019-09-27 New Orleans F2F Day 2

Chair: Corentin

Scribe: Austin, Kai, Ken

Location: New Orleans

TL;DR

  • TAG review
    • Would be nice to have a first review and then a second one closer to shipping.
    • Should be able to jump the queue.
    • Apple would like a TAG review on shading language
      • Would be nice to have a joint document for it.
  • Swapchain image usage and API (see investigation and proposal)
    • Performance might be better if we only allowed rendering to swapchain textures
    • Discussion around the Web needing to be able to copy from the swapchain image.
    • Consensus to keep existing API but make storage disallowed.
  • Exact semantics of GPUBuffer/Texture.destroy()
    • Consensus to make destroy() queue the freeing of GPU memory
    • Submitting commands using destroyed resources is an error.
    • Still valid to create a texture view / bindgroups on destroyed texture.
    • AI: CW to write spec text for this.
  • Multithreading
    • It isn’t clear how multithreading would work with WASM the way apps expect it to (without yielding to the browser).
      • Could use some form of share handle?
      • A solution would need to not expose the GC.
      • The SharedObjectTable could be a solution but it doesn’t exist.
    • AI: Kai to write stuff down for SharedObjectTable, all to reach out to their JS engine folks.
    • Sharing whole object hierarchies would be useful but more difficult, Babylon.js said they’d be interested in it at the W3C games workshop.
  • Default for pipeline and bindgroup layouts #446
    • Alternative proposals: bindgroup.getLayout or pipeline.getBindGroupLayout.
    • AI: CW to write an updated proposal.
  • GPUShaderStageBit.NONE seems useless #193
    • Consensus: remove NONE but still allow 0 to be passed.
    • Need to discuss things on a case by case basis, like 0-sized buffers.
  • Out of bounds drawcalls.
    • Being able to noop drawcalls would make emulation of robust buffer access much easier.
    • Worry that this will be slower than taking advantage of hardware robust buffer access.
    • Consensus that we should at least allow nooping, unclear if we have consensus to also allow RBA.
  • Initial extensions we want for WebGPUv1.
    • Interest in float16 SSBO, texture compression and subgroups.
    • Extensions can be discussed just with an explainer and a meeting agenda item.
    • They would live in the main spec, but we might move some to separate docs.
  • Review of new GPULimits entries
    • Approved.
  • Next F2F
    • Tentatively 2 days of WebGPU F2F week of Feb 10 in Redmond
  • Vertex buffer set API and offsets.
    • Discussion of whether batching and ArrayBuffer improve perf.
  • PR burndown:
    • #387 Add 'all' to GPUErrorFilter
      • Closed
    • #433 Clarify popErrorScope() rejects with AbortError if the device is lost
      • Leave this open.
    • #434 Coordinate systems.
      • LGTMed
    • #440 Use Uint32Array for dynamic offsets
      • Should have two overloads, one with a sequence and the other with a Uint32Array
    • #444 The Type must not be the identifier of the same or another typedef
      • LGTMed
  • Draft working group charter
    • W3C would like to send advance notice that we are planning to make a CG.
    • AI: everyone to make lawyers look at the charter.

Tentative agenda

  • Timeline with regards to W3C TAG review(s)
  • Swapchain images usage and API (see investigation and proposal)
  • Exact semantics of GPUBuffer/Texture.destroy() (see this discussion)
  • Multithreading! (re:)
  • Defaults for Pipeline and BindGroup layouts #446
  • Spec and CTS workshop (after lunch?)
  • GPUShaderStageBit.NONE seems useless #193
  • Out of bounds drawcalls discussion (final act? was missing MM last time)
  • Initial extensions we want for WebGPU v1 (texture compression, subgroup, …?)
  • Review new GPULimits entries?
  • Vertex buffer set API and offsets (see discussion)
  • JF's question about HTMLCanvas.getContext()
  • PR burndown
  • Working Group draft charter
  • [YOUR ITEM HERE]
  • Agenda for next meeting

Attendance

  • Apple
    • Dean Jackson
    • Justin Fan
    • Myles C. Maxfield
  • Google
    • Austin Eng
    • Corentin Wallez
    • James Darpinian
    • Kai Ninomiya
    • Ken Russell
    • Shrek Shao
    • Ryan Harrisson
  • Intel
    • Yunchao He
  • Microsoft
    • Chas Boyd
    • Rafael Cintron
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert

TAG Review

  • DM: First review soon so we can get first feedback and have re-review closer to MVP, for example enums could change a lot of things
  • JG: Do we have enough?
  • MM: Probably enough, TAG has also asked for a distinct issue about shading languages.
  • MM: They are interested in the different options.
  • CW: Was told there is a review queue of 3 months
  • MM: Can skip to the front.
  • DM: they probably want to know whether it’s new object / same object return, ArrayBuffers returned, when Promises are resolved, etc.
  • ...
  • JG: Do you think that an explainer is a lot of work?
  • RC: Think it will be a lot of work. You’re explaining how to use the API to people that don’t do graphics for a living.
  • JG: Seems laudable to explain graphics to them but not necessary.
  • RC: https://github.com/immersive-web/webxr/blob/master/explainer.md
  • CW: Seems like it could be a lot of work, so need to figure out who will do it.
  • RC: In WebXR it’s the spec editors that maintain the explainer.
  • ….
  • DM: Added this to the agenda because was adding JS bindings in Firefox and TAG representative reached out.
  • DM: Wondering if we can skip the queue.
  • JG: I think it’s likely. Think TAG will be very interested.
  • JG: I think the W3C would love to have a graphics API.
  • CW: Concretely, what do we need to do?
  • DM: Go through our spec, find the parts that are underdocumented and related to DOM interop, and extend the spec on those items. Then write an explainer.
  • CW/DM: Swapchain, mapping, error model
  • JG: Anything non-webby, e.g. places we avoid using dictionaries and arrays.
  • JG: Can explain those if asked.
  • CW: JG to go through spec and point out things that need to be discussed.
  • CW: Do this, distribute items to people, write explainers, send TAG review.
  • DM: Make clear it’s not a final spec and they should only focus on core bits/high-level, scrutinize later.
  • RC: Can be as high level as “here’s drawing a triangle” - not too abstract.
  • CW: Myles, you mentioned TAG review for shading language?
  • MM: Just a thought. Just that they’ve asked for it.
  • MM: Also think I’m not sufficient to produce such a document.
  • JG: We’ll have to collaborate.
  • MM: We don’t actually have to do a review, they just said they were interested.
  • JG: Can do something [more informal]... Get one TAG representative involved.
  • MM: Also, Apple feels we should have a TAG review for the shading language.
  • [several]: Would be nice to talk to them in person at a TPAC but there won’t be one for a year.
  • MM: TAG has weekly meetings and we can send it anytime.
  • JG: One of the docs for first TAG review should at least say “what and why” for the shading language.
  • DM: Should we wait until TPAC for shading language proposal/review?
  • MM: No, should be before TPAC.

Swapchain

  • Swapchain images usage and API (see investigation and proposal)
  • DM: …
  • DM: For example, on metal there are framebufferOnly images. It says they can be faster for rendering. On D3D12, you can’t use them as shader textures (which is undocumented), and there is a cost to copying from it if you don’t have Shader Input usage specified. On Vulkan, the only guaranteed bit is the Color Attachment bit.
  • DM: Table at the bottom of the issue I compiled. Have the option of not even using native swapchains, but this would be more inefficient because it involves a copy. We should aim for the fastest case.
  • CW: Likely for us that we’ll use that happy performance path with overlays on various systems. We won’t rely on the Vulkan swapchain. On Mac we use CAMetalLayer
  • RC: DXGISwapChains on Windows, but can’t synchronize with the webpage. Otherwise, we use Chrome’s compositor.
  • CW: Point is, for happy path for fullscreen context, likely all browsers will use overlays and never Vulkan swapchains. When in the page, we use either overlay or browser compositor.
  • MM: if Chrome compositor uses DX11 and WebGPU uses D3D12 how do they interop?
  • RC: KeyedSharedMutex can share between the two.
  • DM: Can you clarify the bit about using overlays instead of Vulkan swapchain?
  • CW: for example: on ChromeOS, or Linux with DMABufs: (hw gpu memory allocation that’s good for passing images around, including to display controller), pass ptr to dmabuf to display controller and it scans out directly. Also how we play videos. macOS has equivalent, CALayers, promoted to overlays via Core Animation. We saw significant power savings using Core Animation. We only use swap chains when using the compositor. When going through browser compositor it’s a regular texture and we don’t have these restrictions.
  • DM: What about Windows Vulkan? You mentioned Windows uses dma_buf
  • CW: I’m less confident in this. If I remember, we can import DXGI surfaces into Vulkan drivers (or DXGI swap chain images) which can be presented as overlays.
  • CW: that’s not to say I disagree with replacing textures with texture views. I failed to add more data to the issue before this meeting.
  • DM: What about Android?
  • CW: AHardwareBuffer and dmabuf.
  • DM: You’re saying we may guarantee more usage flags by never using Vulkan swapchains.
  • DM: I don’t know whether we’re going to make this choice right now. Settling on render target is a safe place for us and users. If we want to enable more we do so.
  • CW: Second note I wanted to make is on the Web if you render to an OffscreenCanvas, which WebGPU will be able to do, you can always tear off the canvas backbuffer with transferToImageBitmap, and then you can do whatever you want with it. So no matter what, the textures have to be copyable-from. If swapchain image is not readable, we can’t do that.
  • DM: Is there a way to make this a non-requirement? As in, if you make the canvas with certain flags, you’re not able to get the Canvas data [with transferToImageBitmap] or you get black? Getting the image data should not constrain our fast path.
  • CW: The first part was that it doesn’t. I don’t think anyone will use Vulkan swapchain for the fast path, meaning this restriction does not restrict the fast path.
  • DM: Other APIs have restrictions whether you’re copying from swapchain image or not. If we support getImageData(/transferToImageBitmap). On Mac that means we can never set framebufferOnly.
  • CW: Wouldn’t we just use IOSurface?
  • MM: I’d like to get numbers about what the difference is between framebufferOnly and not before we rely on the flag being always true or always false.
  • DM: Okay, so we will measure it first before constraining it [on Metal].
  • MM: if it turns out setting framebuffer-only doesn’t affect performance then we can be more flexible.
  • CW: Making the swapchain return a TextureView instead of a Texture, irrespective of whether it helps us beyond the fast path: The only thing it prevents us from doing is copying to the texture.
  • DM: doesn’t prevent you from sampling from the swapchain.
  • CW: right. Prevents you copying from/to the texture.
  • DM: think it would be a cleaner api. otherwise we expect users to do the same thing and we don’t get the benefit. today they have to create the texture and texture view to render, and that’s redundant.
  • DM: is there any objection to returning the texture view by getCurrentTexture()?
  • RC: to be clear: reason we want to restrict to a view is that we don’t want people to sample from it?
  • DM: cleaner api. and they can sample from it if we want to allow it. bind group creation takes view, not texture itself. texture object only allows us to do the copy operations.
  • MM: i don’t understand the benefit. We’re still going to have arbitrary rules, this texture view can be used in this case and not in this case.
  • DM: these rules fit into today’s usage semantics. There are no extra rules. If you create a texture with a particular usage you can only use views created from that texture in particular uses.
  • KN: I think that if we want to go this direction it interacts with ideas about swap chain. Think we want to make another type that represents this, that extends TextureView. If it has different semantics than TextureView we can make it a different type. For example, you can destroy a textureview that comes from a SwapChain. But you can’t do that with normal textureviews.
  • MM: you can’t destroy textures from swapchains in Metal.
  • KN: just imagining that it’d release the reference from JS. Then on platforms that have an implementation that supports this we can release the memory when you destroy. I have more on this for another day.
  • MM: am I understanding that the counter-proposal is that you have additional usage flags that you spec during texture creation?
  • DM: we have current approach: getCurrentTexture() returns a texture. Have an option to return a TextureView. Have an option to return something deriving from TextureView. In all cases, the usage can still be provided by user in createSwapChain. That’s the TextureViewUsage. I like that idea of deriving from TextureView if you want certain semantics for those images.
  • MM: those textures have less functionality than other textures.
  • DM: no different semantics from regular TextureView.
  • MM: can’t sample from it, etc.
  • DM: same semantics as other TextureViews. Depends on usage spec’d when creating original resource. Doesn’t have less semantics.
  • RC: understand that returning TV allows sampling. Main thing we’re preventing is copying. Why?
  • DM: think it’s not very useful.
  • KN: can’t you also prevent copying by not spec’ing that usage when creating?
  • JG: if it’s useful then why remove it?
  • MM: if these textures really have same semantics as normal ones, they should be indistinguishable.
  • JG: only diff is ability to do unordered access in D3D12.
  • MM: that has nothing to do with copying
  • JG: it’s a texture that’s slightly different than a normal texture.
  • MM: this mechanism is completely different.
  • DM: you give the user a texture, 99% of users will create a view. Every frame. Something that needs to be destroyed.
  • MM: are you talking about performance?
  • DM: I’m talking about a cleaner API.
  • JG: GC in general is to be avoided. That’s one of the explicit requests we’ve received from ISVs.
  • MM: the link between this object and frequency of GC remains unspecified.
  • JG: if you gen objects every frame that’ll cause GC. Min GC time slice is 10 ms for some things. That’s a problem for graphics. A problem that ISVs have identified and complained to us about. If you didn’t create new objects and reused the same TextureView...might work, would be weird.
  • DM: that would be desired.
  • JG: the limitation that you can’t do copies because those reference Textures is an issue. There are situations where you want to copy back out.
  • MM: I’m trying to think of a case.
  • JG: this is the way we implement preserveDrawingBuffer:true in WebGL. We draw into surface, copy from surface into next draw buffer, then use this buffer later.
  • KN: I think we want to avoid adding preserveDrawingBuffer:true, make users do it themselves, and would want to be able to copy.
  • DM: they can still sample. Is this a blocker?
  • JG: then you have to spin up the rendering engine and not just use the copy queue.
  • DM: we don’t have a copy queue.
  • JG: we don’t spec one.
  • KN: I don’t see why we should force users to write a lot of code to do a blit using a shader.
  • DM: we also previously spec’d a blitFramebuffer.
  • MM: can you implement copy actually?
  • KN: they render into their own frame and copy into the swapchain.
  • RC: they’ll have to submit the copy before end of frame.
  • CW: even in native apps where they have proper swapchain control, don’t think you can sample from something being used by the swapchain, and presented. native apps also have to do this. in vulkan, need to put the vkimage in a special layout just for presenting, can’t do anything else with it. point is: having this mechanism for apps to do their own preserveDrawingBuffer doesn’t put them at a disadvantage compared to native vk apps.
  • KN: aside from copying: the switch to textureview removes the ability to copy and making the api slightly simpler.
  • DM: correct.
  • KN: OK with restricting the usages. I don’t like the idea of restricting ourselves in API design by not allowing copies, ever.
  • DM: note: if you want to copy from image you rendered, might as well render to different target and copy.
  • JG: it’s slower.
  • DM: there’s a copy somewhere anyway. Not slower.
  • JG: when you’ve written to output buffer, then submit it, you prefer to do W-R-R. You have to stall if doing R-W-R. Don’t want to wait until both copies (from one to the other) are done.
  • RC: have to wait until rendering is done
  • JG: have to do that anyway. Render to intermediary, then copy to output buffer, then sample output buffer, have to wait for both render and copy.
  • DM: you’re talking about internal impl in FF. User perspective: if they do copy from image they rendered, we can’t submit earlier. No advantage from user perspective.
  • KN: still avoids wait between R-W. One is W-R-R. The other is R-W-R: two steps.
  • JG: it’ll be uncommon to drop a frame when you’re done rendering and copy to output buffer. Still, overhead.
  • DM: I think I see Kai’s [Jeff’s] point. Not sure if it’s a case we optimize for. Is there a high perf app that needs to copy per frame?
  • JG: more a problem on lower memory bandwidth HW where copies are more expensive. Would be nice to not block the frame between render finishing, and presentation, with this copy. It’s an optimization I do in FF’s presentation for WebGL, would like user apps to be able to do so, too.
  • DM: so impl can present without waiting for user initiated copy?
  • JG: yes.
  • DM: seems intrusive but possible. No more pending writes. Fence injected, wait on the fence to complete. Don’t wait for rendering to finish. Fence is before copy, not after.
  • KN: not reordering way we submit to API; but where the fences/barriers are. Optimized in driver impl, not ours.
  • DM: in our impl today, user makes submission, we record cmdbuf, insert semaphore at end. Sem will fire after copy anyway. What you’re saying could be implemented if we tried really hard.
  • JG: It is. The ability to not block present on the copy is useful.
  • CW: it improves latency.
  • DM: you have to separate cmdbuf if you want to fire your sem earlier.
  • JG: potentially. Or recognize this pattern somehow; pulling cmdbuf apart in impl.
  • CW: or, simpler, if app submits large cmdbuf with everything, and smaller with the copy cmd. We know SwapChain isn’t used.
  • DM: is this responding to a real use case?
  • JG: corresponds to real use case in FF today.
  • DM: can you describe in more detail?
  • JG: don’t want to.
  • CW: for temporal AA for example?
  • DM: you would render into your own texture, not the swapchain.
  • JG: would be nice to not have to wait on the extra copy before present.
  • DM: not sure I’m super convinced, but not much excitement about change in the group, so we can proceed to other items and drop this.
  • MM: if you can provide evidence that the extra object affects perf / GC that’ll be compelling.
  • JG: I don’t think that’s on us because I trust the ISVs. Enough have approached us about this that I trust them.
  • KN: every ISV has mentioned this. That said, I don’t think this is the way to solve it; we can do better.
  • KN: do you want to make a different pull request?
  • JG: we should only reject the ones we can’t handle. Storage is it.
  • DM: yes, that’s probably the way to go. Storage unavailable, everything else possible.
  • MM: how will app know tex comes from framebuffer? Presumably we should know this texture can be used as UAV but that one can’t?
  • KN: that’s already exposed. SwapChain configured with usages.
  • DM: could have sync or async errors happen when developer requests swap chain with usage “storage”.
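
A minimal sketch of the swap chain flow discussed above, using the API shapes of the time (configureSwapChain, getCurrentTexture, OUTPUT_ATTACHMENT); exact names and usage flags were still in flux, so this is illustrative rather than normative. It shows the per-frame getCurrentTexture().createView() pattern DM calls redundant, and the application-managed preserveDrawingBuffer-style copy that needs copy usage on swap chain textures (STORAGE being the one usage the group agreed to disallow). appColorTexture and canvas are assumed placeholders.

```js
// Illustrative only: names (configureSwapChain, OUTPUT_ATTACHMENT, ...) follow
// the draft API at the time and may change.
const swapChain = context.configureSwapChain({
  device,
  format: "bgra8unorm",
  usage: GPUTextureUsage.OUTPUT_ATTACHMENT | GPUTextureUsage.COPY_DST,
});

function frame() {
  // Common path: getCurrentTexture() returns a GPUTexture today, and the app
  // creates a view every frame just to render to it -- the redundancy DM
  // points out. Returning a view (or a view subtype) directly would remove
  // the createView() call.
  const backbuffer = swapChain.getCurrentTexture();
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginRenderPass({
    colorAttachments: [{
      attachment: backbuffer.createView(),
      loadValue: [0, 0, 0, 1],
      storeOp: "store",
    }],
  });
  pass.endPass();

  // Application-managed preserveDrawingBuffer (the case JG raises): render
  // into the app's own appColorTexture instead, then copy it into the swap
  // chain image. This is why copy usage on swap chain textures matters.
  // encoder.copyTextureToTexture(
  //   { texture: appColorTexture },
  //   { texture: backbuffer },
  //   [canvas.width, canvas.height, 1]);

  device.getQueue().submit([encoder.finish()]);
  requestAnimationFrame(frame);
}
```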

[break]

Exact semantics of GPUBuffer/Texture.destroy()

(see this discussion)

  • CW: what happens when you call texture/buffer destroy()?
  • CW: Basically my understanding of GPUBuffer/Texture.destroy, suddenly any usage of the texture becomes invalid. For example if you have a command buffer, submitting it is an error. Motivation is the implementation can free as soon as it’s safe to do so. Guaranteed there will be no new commands that act on the memory.
  • DM: My understanding was it invalidated the handle but it stays alive as long as other objects hold onto it, such as bindgroups.
  • JG: sounds philosophical not technical
  • KN: Is technical.
  • CW: In DM’s understanding, if you have a CB ‘c’ holding onto the texture, then destroy the texture, it becomes an error to submit ‘c’.
  • RC: refcount vs. other
  • JG: could do either way
  • KN: technical problem with the one DM thought it was: you now rely on the garbage collector to wait for GPU memory to be freed. Can’t free memory yourself.
  • JG: webgl works that way.
  • CW: webgl you can do deleteTexture.
  • JG: ah right, because we have parallel refcounting. delete says, i can no longer add to the internal refcount by doing js commands.
  • RC: in webgl you can destroy all objects referring to things. in our api, is that true? can you destroy all objects? if not, we have the gc problem.
  • KN: that’s what we have right now.
  • CW: that’s why texture.destroy() preventing further usage helps. Does internal refcounting, but no more internal content.
  • RC: you’re advocating that texture.destroy(), then queue.submit() anything referring to that texture, being an error?
  • CW: Yes. This allows the GPU to queue the reclamation of the memory as soon as you do destroy. Otherwise no way for JS to control the deallocation and it has to rely on GC.
  • JG: It depends on what can hold onto these objects. If the limit is only CBs can hold onto them...
  • AE: bindgroups, texture views, command buffers.
  • DM: render bundles.
  • CW: pass objects. different encoders.
  • JG: some of these are OK. theoretically, if only thing that could hold on to it was cmdbuf, and cmdbufs were one shot, and you submit cmdbuf, you’re done.
  • KN: think the main issue is bind groups but there are others. don’t see why we shouldn’t adopt same destruction semantics as webgl.
  • DM: two things here: we seem to be drawing line between, we’re freeing video memory vs. other gpu resources. pipelines can’t be destroyed. texture views can theoretically also be associated with other bits of memory. have to draw distinction if we want users to be able to better control video memory. with these destroy() semantics, there are no truly immutable objects. that thing might become invalid due to dependent resources being destroyed.
  • MM: think this is “opting in to complexity”. if you call delete() you have to see if your app still works.
  • JG: we’re going to handle at least some objects as cloneable / serializable. make them originally, they’re immutable, and you can share them with workers. have to be careful: “except for destruction, that’s immutable” is probably not what we want.
  • KN: i was also proposing buffers be deletable and they also have mutable state, which is mapping.
  • JG: idea: buffer.destroy(), buffers in flight are OK. After it’s executed, we reclaim memory. Or buffer.destroy() doesn’t work if buffers in flight? Simplest, destroy() says I’m not using this any more.
  • KN: in Dawn, when we build cmdbuf we look at every buffer / texture usage, put them in list. Don’t worry about whether they’re destroyed. then when submitted we go through list.
  • JF: we do the same thing in safari. metal keeps its own references to things in flight. we can destroy things from the webgpu api side but it won’t be freed until metal releases the resource.
  • DM: so you plan to take no advantage of destroy semantics?
  • JF: talking about both jeff’s and kai’s ideas.
  • JG: what if you destroy texture, then try to create textureview?
  • KN: invalid textureview.
  • JG: might make more sense if you can always create textureview.
  • CW: get valid textureview either way?
  • KN: i see.
  • JG: valid view to invalid texture. they’re all adapter objects.
  • KN: views might have a backing api concept that you have to null out.
  • JG: internally.
  • DM: don’t think it makes sense that creating something from destroyed buffer is different from creating something and destroying resource immediately afterward.
  • CW: Agree.
  • RC: Summary?
  • JG: Destroy marks the object as internally null. Only things already submitted are invalid. You won’t be able to submit new things.
  • RC: Can destroy in frame 1 and use in frame 10?
  • JG: ...
  • RC: As soon as you call destroy, any subsequent operations with the texture (directly or indirectly) will not work.
  • JG: Different types. Command buffer submission would fail. Mapping would fail. Follow on proposal is that creating the TextureView would succeed. destroy->createView would be the same as createView->destroy. You couldn’t actually use the texture view.
  • JG: Avoids raciness between threads, for example.
  • Discussion about whether objects can be shared between threads.
  • KN: my proposal, objects fully shared, with internal synchronization.
  • RC: let’s discuss separately. For destroy(), the ordering makes sense.
  • Resolved:
    • Early destroy()
    • You can create views from destroyed objects
      • Can create views to destroyed textures
      • Can create bind groups referencing destroyed textures
  • KN: putting destroyed objects in bindgroups won’t cause error immediately? Question of whether it’s internally null or not.
  • CW: for all platform objects, textureview / bindgroup, you can destroy the resource without destroying all things that point to it. they’re all pointers. not sure about vkImageView.
  • JG: internally, i think we’re going to say that we free all the driver objects.
  • CW: if we can avoid looking at all textureviews pointing to that texture i think we will.
  • DM: think you can destroy in any order as long as you don’t depend on destroyed resources.
  • CW: on D3D I’m sure you can destroy resources without use-after-frees. Not sure about Metal argument buffers but think they’re the same.
  • RC: SRVs aren't refcounted.
  • CW: if you destroy resource the srv references that’s ok.
  • RC: if you use the SRV after that’s a problem.
  • CW: agree.
  • JG/RC: Unfortunately no refcounting in JS.
  • JF: this means that if we call texture.destroy() it doesn’t immediately invalidate all texture views?
  • JG: they’re implicitly invalidated.
  • DM: how do we spec that? they become internally null, but we introduce a new state?
  • KN: when submitting cmdbuf we walk the whole object graph. bindgroups, texture views, textures, etc. all textures / buffers, check if destroyed / internally null.
  • DM: wondering about wording of that state?
  • KN: textureview has internal slot with texture.
  • JG: we talk about internal nullability. is internally destroyed a different state? think it’s just internally null. Unclear when it’ll be internally null but not destroyed.
  • KN: created with wrong args, out of memory, etc.
  • JF: can we say, impl detail whether you want to make it internally null, or check at submit time?
  • JG: as long as no functionality change, sure.
  • KN: difference of whether we internally null it out, set a flag etc.
  • MM: would benefit from a lot of tests.
  • CW: think francois added a bunch of validation.
  • KN: don’t know; will see.
  • RC: only reservation: in future, if we have renderbundles hold on to platform-level bundles / cmd lists. pain when you destroy some random texture. Have to find places it’s pointed to.
  • KN: when you build render bundle, have to build list of used buffers / textures. then when you submit it you add it to the cmdbuf’s. then check cmdbuf’s during submit.
  • RC: could do. Other thing: in webgl, every other little object is destroyable. I’m fine with how it is.
  • DM: in our impl we have to do a simpler thing to track usage so we’re fine with it.
  • JG: who writes the spec PR? me?
  • CW: I think I can take it during spec workshop.
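
A small usage sketch of the destroy() semantics resolved above, assuming draft API names of the time; this is illustrative, not spec text. someBindGroupLayout and commandBufferUsingBindGroup are placeholders.

```js
// Illustrative sketch of the resolved destroy() semantics (not spec text).
const texture = device.createTexture({
  size: [256, 256, 1],
  format: "rgba8unorm",
  usage: GPUTextureUsage.SAMPLED | GPUTextureUsage.COPY_DST,
});

// destroy() lets the implementation queue the GPU memory for reclamation as
// soon as already-submitted work completes, without waiting for JS GC.
texture.destroy();

// Still valid: creating a view (or a bind group referencing it). The result is
// the same as createView() followed immediately by destroy().
const view = texture.createView();
const bindGroup = device.createBindGroup({
  layout: someBindGroupLayout, // placeholder
  bindings: [{ binding: 0, resource: view }],
});

// Error: submitting new work that references the destroyed texture. At
// queue.submit() time the implementation walks the command buffer's referenced
// buffers/textures (through bind groups, views, bundles) and rejects any that
// are destroyed / internally null.
device.getQueue().submit([commandBufferUsingBindGroup]); // validation error
```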

Multithreading! (re:)

  • KN: everything should be shared, except encoders.
  • DM: should cmdbufs be shared?
  • KN: complex, but yes.
  • DM: maybe cmdbufs should be transferable instead?
  • JG: esp ported apps having to manage things across threads is really gross if it’s transferable. many / most native apps are written with idea that things are shared across threads because they’re just stored in memory. when transferable / serializable / etc., have to do that propagation when you create it. maybe not hugely better.
  • KN: there are also more serious issues with transferring / serializing things in general. in order to get anything from one thread to another, the receiving thread has to yield for onmessage. in principle, maybe this is ok for js, even if a pain - can always do await, etc. but this is a huge pain for webassembly. say you have wasm app written against webgpu.h. have threads, look like pthreads. they create a buffer on one thread, take that handle, use it on another thread to create a bind group. not possible today if they have to receive objects async.
  • KN: not obvious how to solve this. pretty complicated. you can say, every object has an integer id now. cons up an object on any thread, but you don’t know when you can gc any more. except maybe if you get an integer not corresponding to an object you get an error. or a platform-level thing or webgpu-level thing where you can say, here’s an object and i put it into a queue and i can get it out synchronously. how generic it is depends on how much we want to convince tc39 to do it.
  • JG: similar to wild idea i was going to suggest: like d3d does resource sharing, OpenSharedResource based on a handle. shared-truth object. seems like only way to solve this for wasm.
  • KN: might be possible to do something specialized. we add a handle integer to every object we want to share. i started sketching out on my computer. that handle, you can do device.createBuffer(), pass in the handle. reconstitutes it. always makes a new js object. affects gc. depending on when gc runs you may get the thing or a different thing. technically possible.
  • MM: in your example, why would anyone want to share gpubuffer descriptors?
  • KN: not descriptor; buffer. if you have a buffer on one thread, on other thread you have a device, and device.reconstituteBuffer() from integer id.
  • MM: ah, a way to have communication without yielding.
  • KN: yes. a hack.
  • RC: how does it avoid yielding? receive int through sharedarraybuffer?
  • KN: yes. Use case is WASM.
  • RC: not sure i’m in favor of something where gc is observable. works on one browser and not another.
  • KN: me either. breaks when gc algo changes. think we need something specialized to webgpu. you have some thing on the device. register this: id, object. put in table. on other end, take out, deregister object. or, do same thing at the general web platform level. that’s the only solution we’ve thought of that doesn’t expose the GC.
  • JG: one way we can do this: tie lifetime of resources to lifetime of device. lifetime of device can’t be observable. if device in both workers: can’t observe gc of individual objects. when you do createBufferFromShareHandle, you do it regardless of …
  • KN: I thought of that too. we never GC anything. we can GC the wrappers; those things don’t reproduce the wrapper. but if you drop a GPU buffer we can never GC it. or if there was a getHandle() we can mark it as never GC this. we don’t expect pages to frequently recreate devices so no opportunity to gc.
  • JG: no proposal for SharedTableObject?
  • KN: think so, but not sure where they are.
  • CW: we talked about it with our JS team multiple times. first they said we’re insane, then maybe interesting. unclear. not sure about any proposal to tc39.
  • JG: i’ll ask.
  • MM: we shouldn’t build upon technology that doesn’t exist.
  • JG: we can exert pressure for someone else to solve it.
  • CW: TypedArrays were created in Khronos for WebGL’s purpose then moved to TC39.
  • KR: Yep, that would probably work for SharedObjectTables too, need to make sure we solve the problem where object tables never shrink.
  • KN: I think the principled solution here is one that exists in TC39 and not here, we just require it. It would be one where you can see the object in the table so it’s clear it’s being referenced. That way, there’s no garbage collection magic happening.
  • KR: WebGPU is unique among current Web APIs in wanting more concurrent access to underlying resources from multiple threads, and synchronization is a problem because the JS level doesn’t have the sufficient primitives. So WebGPU will have implicitly internal synchronization and you can’t just take an old object from a Web Worker and stick it in this table. That needs to be spec’ed from the object type.
  • KN: Might involve generalizing what we already have which sort of clones the reference. If we generalize it and have an IDL-extended attribute that does that, an attribute on the type, which says when you clone it it’s referenced from multiple threads. Thing with that attribute can go in this table. Either the table is flat, or nested structure would be replicated across threads. Would be convenient for use cases, but the nested part could be added later on.
  • CW: Adding dictionaries and JS objects in there seems much more difficult.
  • KN: Don’t think so. When you write to it it serializes, when you read it deserializes. Anyway, that’s not important for now. The top-level would be key value pairs. You can put it in and take it out. And delete it with table.deleteShared…()
  • RC: So the value is the GPUBuffer and the key is a number?
  • KN: String or a number or something.
  • RC: So register on thread A, and as long as it’s registered, thread B can take it. What happens when thread A deregisters, can thread B still use it?
  • KN: Yes, but thread B can’t take it out again.
  • MM: Would like to discuss with others who know more about this. and see it written somewhere.
  • KN: yes, i’ll write this down. we should also reach out to others internally.
  • RC: this is just for wasm?
  • KN: not just. if you want to write any sync code that does this - either because of c code, or synchronous inside rAF - or any other reason, avoiding races, etc. - you’d want to use this. a platform-level thing for sharing things between workers.
  • RC: ok. but js folks have other ways of sharing things between threads.
  • KN: asynchronously. you can do the same in wasm. or emscripten’s asyncify.
  • KR: But the way modern graphics APIs work is that they encode commands for this frame in parallel on multiple threads, so you need some form of shared access or you have to split your engine in the middle.
  • KN: Yep
  • DM: you’re suggesting adding small piece of api to JS to interop with this shared table? people will not have to use it, they just write JS?
  • KN: they don’t have to use it if they want to be async.
  • DM: emscripten impl will use it all the time?
  • CW: not necessarily all the time, but would be in the emscripten runtime.
  • KN: could do async. let’s say someone wants to use dawn as a HAL. don’t want them to think, i’m going to transfer this now. or receive this now, to tell emscripten the right thing. rather, they have a handle to the GPUBuffer, and share it how they want.
  • CW: people looking to port to webgpu, using webgpu-native impl directly, or doing d3d12 and having d3d12 shim on top of webgpu, there’s no concept of asynchrony in native world that matches what js does. apps do their own event loop.
  • MM: this shared table is being proposed in addition to postmessage, not replacing it?
  • KN: yes. these things would still be postMessage-able. this thing would work for SharedArrayBuffer, and objects tagged with the new attribute.
  • MM: obvious problem with putting mutex in every object is thrashing. maybe one object has a lot of thrashing and should be transferred. theoretically. is that something that could be changed retroactively? if we make objects transferable we can always make that sharable in the future, but can’t go the other way.
  • KN: i agree.
  • CW: yes. if you look at webgpu today, all objects are immutable except for buffer and texture that have some piece of internal state (destroy unmapping), and encoders which are not sharable.
  • KN: and device which can be destroyed.
  • CW: device has state and sub-objects
  • MM: so thesis is that there will only be small handful of locks present?
  • CW: yes, because most objects immutable. buffers / textures, the need for locks are operations not done very often.
  • KN: need the lock in the submit - coordinating mapped state / destroyed state between threads. only need it when doing those things and not putting it in a bindgroup, etc. submit, map, destroy.
  • CW: can see how this is super-easy for remoting implementations because there’s a funnel. for impl like safari where commands are executed immediately, can see how this makes things difficult.
  • KN: this would make us do our state tracking and mutexes client side. (CW shakes head) KN: why not?
  • CW: tiny bit of it.
  • KN: we need mapped / destroyed state.
  • CW: interaction between buffer mapping & queue submit. have to do lot of validation. lock buffers during whole duration, do submit, then buffers can be mapped again. easy to do on remoting impl. on single-process impl it’s more difficult. queue submit will lock buffers for non-trivial amount of time that are used on other threads.
  • KR: Could probably do some multi-level locking. Queue submits could be expensive but you could do atomic operations that would make it cheap.
  • MM: we want to try whatever option, but we expect to come back with perf results. we expect you to come back with the same thing. so we think this group should be flexible.
  • KN: makes sense.
  • CW: AIs:
    • We have an aspiration to do something
    • Have to do more
    • Kai needs to write stuff down
    • We need to reach out to JS engine folks to discuss SharedObjectTable, and maybe TC39
  • JG: Asked my JS people … and … they said the solution to our sharing problem could be to have a synchronous poll operation for postMessage, which would be easier to spec and more palatable to people.
  • JG: It would make it possible to implement something like “OpenSharedResource”.
  • JG: Still agree that it would be interesting to pursue a dictionary of serializable things. Lots of cool uses.
  • KN: At the W3C games workshop, David Catuhe from Babylon.js gave a talk saying they need real threading. They would love to put their scenegraph in the shared object table, and would only pay the cost of putting it in there only once.
  • KR: How can this work?
  • MM: FP made a blog post about it explaining it. https://webkit.org/blog/7846/concurrent-javascript-it-can-work/
  • RC: Concurrent Javascript is a dream but sharing a scenegraph with objects that have pointers to each other etc is a lot more difficult.
  • KN: The serialization algorithm should handle cycles. It’s just matter of saying that when you write it it gets serialized and when you read it gets deserialized. But it’s optimized.
  • RC: So what happens if you set the scenegraph to null?
  • JG: If you put an object in the share table, it serializes it in the table and if you try to read from it it deserializes. You get your own copy. Editing is reading the whole thing, edit, and re-serializing.
  • RC: So for us you put the buffer in the shared table and get it from a different thread. Then you destroy it in one thread. Is it dead or does it need to be destroyed in all threads.
  • JG: Destroy would act on an internal object that’s shared by ref between threads.
  • RC: Seems worrisome because you are exposing the OS scheduling.
  • JG: Already the case.
  • JG: Idea of a shared queue object you can serialize / deserialize from. Similar to map but different. Maps have atomic style guarantees on accesses while the queue has acquire release.
  • KN: Like postMessage, but you get as many channels as you want.
  • CW: It means you need coordination between two threads, whereas with the shared table you can just eagerly put objects in the table and not need any synchronization.
  • RC: Not sure serialization will work for David’s case. He complains about this case with postMessage serializing / deserializing. He doesn’t need it to be synchronous.
  • JG: One of the ideas is that it’s spec’ed as serialize/deserialize. But if you can provide the same behavior, you can do something faster.
  • RC: That means you need to do something like copy-on-write, etc.
  • CW: Great that we’re thinking about this but it’s outside the scope of WebGPU. Anything else on multithreading?
  • DM: To clarify: It was suggested we aggressively poll? Won’t help the WASM case, correct?
  • CW: Yes, that was my point earlier.
  • JG: The way I would do this as a polyfill for a ported API, is when you create any texture, you immediately broadcast to everyone. If anyone wants to access it later, they call openSharedResource or something to figure out if you have to poll to pick up the broadcasted object.
  • CW: Problems if you create threads after resource creation.
  • JG: Not if you hook into thread creation.
  • MM: sounds harder to use.
  • JG: would be tricky. think would still help people port apps.
  • CW: tradeoff. how easy / applicable it is, vs. how much we have to convince tc39.
  • JG: that’s part of the tradeoff. i’d rather have dictionary of serializable things. already have preliminary “that sounds reasonable” for polling messages. but think we should think more about shared dictionary thing.
  • RC: presumably if you make gputexture/buffer, bindgroup, stuff bindgroup in cmdlist, then lose reference, js thing cleans it up, you can still use any bindgroup / cmdlist if you haven’t called destroy.
  • CW: yes. gc is not observable.
  • RC: ok. possible to do similar thing with this? only if you’ve destroyed on all threads is the thing destroyed?
  • CW: not sure i got that. can aggregate gc on multiple threads be observable?
  • RC: saying, keep internal refcount.
  • JG: per-worker. is proposal: destroyTexture() only destroys texture on that worker? we serialize them to different workers to share, and destroytexture() on this thread means you can’t submit it on that worker?
  • CW: when you say destroy you don’t mean gputexture.destroy()? that works on internal thing.
  • JG: RC’s saying, instead of operating on shared global, it works on the per-worker object.
  • KN: then we can’t release resource. could be internal object, have internal refcount how many threads it’s accessible on, decrement refcount on destroy.
  • CW: I think destroy on any thread should post the texture queue thing.
  • JG: that’s my preference. that’s sort of implementable on top of RC’s proposal if you add more.
  • KN: it matches the native model better. not that it matters. if you’re in C there aren’t implicit references from each native thread. don’t care what thread things are happening on. you don’t see the reference operation happen; don’t want to have to see the release operation happen.
  • MM: I agree with Kai. If I call destroy I want it to be destroyed.
  • RC: so we’re signing up for debugging multithreaded webgpu problems with the cpu scheduler in the future.
  • DM: It would affect us anyway via “mapped” state.
  • RC: Not if things are Transferable only.
  • CW: Still have races between e.g. receiving postMessage messages. Also SAB.
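
A purely hypothetical sketch of the shared object table idea Kai describes above; none of these names (sharedTable, set(), get()) exist anywhere, and the real proposal still needs to be written down. The point is only to show how a wasm-style thread could obtain a WebGPU object synchronously, without yielding back to the browser to receive a postMessage. someBindGroupLayout is a placeholder.

```js
// Hypothetical: `sharedTable` and its set()/get() do not exist; this only
// illustrates the synchronous-sharing shape discussed above.
const flags = new Int32Array(new SharedArrayBuffer(4));

// Thread A: create the buffer and register it under a key in the table.
const buffer = device.createBuffer({
  size: 1024,
  usage: GPUBufferUsage.STORAGE,
});
sharedTable.set("particleBuffer", buffer); // hypothetical registration
Atomics.store(flags, 0, 1);                // signal readiness via the SAB
Atomics.notify(flags, 0);

// Thread B (e.g. a wasm pthread in a worker): block on the SAB, then pull the
// object out synchronously -- a reference to the same device-level object, so
// GC of individual JS wrappers never becomes observable.
Atomics.wait(flags, 0, 0);
const sameBuffer = sharedTable.get("particleBuffer"); // hypothetical lookup
const bindGroup = device.createBindGroup({
  layout: someBindGroupLayout, // placeholder
  bindings: [{ binding: 0, resource: { buffer: sameBuffer } }],
});
```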

Defaults for Pipeline and BindGroup layouts #446

  • (explanation and clarifying discussion)
  • MM: Alternative: bindgroup.getLayout()
  • CW: Alternative: pipeline.getBindGroupLayout(n)
  • MM: Why your proposal over your alternative?
  • CW: It eliminates the concept of “layout” completely. You don’t ever have to touch a layout. Even the error would say the layouts mismatch. As a developer who hasn’t used GPUBindGroupLayout, you can still understand what that means. It hides layouts completely from the user.
  • DM: I see the problem being that you’re writing the same thing from two different sources and verifying it very late. Did you consider providing a shader module for bind group creation?
  • CW: I would be worried about this because shader module resources depend on which entry point or which specialization constant when/if we have that.
  • DM: Didn’t we establish specialization constants to not affect interface?
  • CW: Okay, well entrypoint.
  • MM: Would be a big problem where you put all shader source in one file and the one file is the shader module. Just repeating what CW said.
  • DM: You could provide the entry point.
  • KN: Pretty much the pipeline then.
  • CW: If we can ask a pipeline for its bind group layout at position N, we might want to tighten visibility. The reason why it’s ALL in the proposal is that it’s easier to match. If you can request a specific position, we can tighten that.
  • DM: What about buffers in createBindGroup(). The user specifies ..
  • CW: In the proposal, the idea is that buffers passed to implicit layout bind groups, they need to have exactly one of UNIFORM or STORAGE so we can deduce. Kind of a wart that makes the flow from pipeline to bind group more appealing.
  • DM: Okay, so it’s expected to have exactly one in #446.
  • CW: Same for textures. SAMPLED or STORAGE. Though it’s probably less important for them.
  • DM: readonly storage?
  • CW: In #446 no. Though if we have pipeline -> bind group, we could deduce readonly from the shader.
  • KR: Is the inference of the bind group layout from the shader something that the application has to query or is that internal? Would there be synchronous blocking on shader translation?
  • CW: No, at least in our implementation. We already have places where we create objects from objects without blocking. It’s kind of opaque reflection which is okay for remoting applications
  • MM: So if you ask the pipeline for the layouts, in order to not block you have to return success even if you asked for a layout that isn’t present or is invalid.
  • CW: Would be internally an error?
  • MM: Pipeline has three layouts, next line asks for the 4th?
  • CW: Empty layout, or it could be an asynchronous operation? It’s for usability so I don’t think it’s too critical to be synchronous.
  • RC: Does visibility ALL have performance implication?
  • CW: Certainly
  • MM: perf numbers?
  • CW: Would be nice to have proper data eventually
  • MM: I want to look into it.
  • DM: It affects how we bind resources. If it’s not visible, we avoid a call to setBuffer
  • MM: In creation of the bind group, not at draw call time.
  • DM: No, I’m talking about setBindGroup at recording time. If it’s only the vertex shader, it’s one call in Metal. If it’s multiple then there are multiple calls.
  • MM: … (using argument buffers)
  • CW: Our implementation doesn’t use argument buffers so, it matters
  • DM: Same for us.
  • CW: Okay, I think the defaults need polish. I will write down a proposal for pipeline -> bind group flow. Should I write for bind group -> pipeline? or create pipeline layout from bind group descriptor..?
  • DM: Don’t think so.
  • MM: I think this is a good general direction. Keep going!!
  • DM: It’s worth noting that this is specifically for a single pipeline case. It’s introducing a difficult usability jump. No layouts to suddenly all the layouts. Not significant for performance. It is specifically a convenience feature which takes more implementation work.
  • MM: True, and a big engine wouldn’t use this. OTOH, for someone writing WebGPU code to show something, then this makes perfect sense.
  • CW: Trying to learn the API, simple use case, etc… If you’re Babylon.js and have a ton of shader reflection, it’s not useful. But for a one-off app, it’s very useful.
  • DM: Okay, then, was it considered to be polyfilled on top with userspace code?
  • JG: Sounds pretty hard. We’ll end up with readonly things, and then you’ll have to parse the shader, etc.
  • CW: render pipeline -> bind group flow could be polyfillable but you need shader reflection in the JS app.
  • JG: I do sort of like the idea of pipeline layout optional and imputed from the shaders. You do sort of have to know what order the implied layout was, but people are usually writing trivial bind group layouts.
  • CW: And you know it exactly from “set =” annotations.
  • JG: I’d like to see more investigation.
  • CW: Okay, somewhat receptive. I’ll write more details and flows.
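
A sketch of the pipeline → bind group flow CW is proposing to write up: the app never constructs a GPUBindGroupLayout; it creates the pipeline without an explicit layout and asks it for the layout inferred from the shaders. getBindGroupLayout() and the default-layout behavior are proposals here, not settled API, and descriptor field names follow the draft of the time. vsModule, fsModule, and uniformBuffer are assumed placeholders.

```js
// Proposal sketch: default layouts + pipeline.getBindGroupLayout(n).
// Field names (vertexStage, bindings, ...) follow the draft IDL at the time.
const pipeline = device.createRenderPipeline({
  // No `layout`: the implementation infers bind group layouts from the shader
  // modules (visibility defaulting to ALL unless it can be tightened).
  vertexStage: { module: vsModule, entryPoint: "main" },
  fragmentStage: { module: fsModule, entryPoint: "main" },
  primitiveTopology: "triangle-list",
  colorStates: [{ format: "bgra8unorm" }],
});

// The developer never touches GPUBindGroupLayout: ask the pipeline for the
// layout it deduced for group index 0 and build the bind group against it.
const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  bindings: [{ binding: 0, resource: { buffer: uniformBuffer } }],
});
```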

GPUShaderStageBit.NONE seems useless #193

  • CW: comes up once in a while. For our bitfields, do we want to allow NONE? Pros: if you create things programmatically, you’d start with NONE and or things into it.
  • JG: you could replace NONE by 0. Initialize your variables to 0.
  • CW: could. If no stage uses it, the value you’d end up with is NONE. Useless to have a NONE binding. But it will make apps produce an error if they do something wrong.
  • JG: isn’t this value just 0?
  • CW: but if you pass 0 as visibility is that an error? #405
  • JG: orthogonal to “none” bit.
  • MM: topic “A”: do we provide a term called “none”? “B”: if you provide NONE to BindGroupLayout what happens?
  • CW: #193 is do we need NONE? #405 is, is it allowed?
  • JG: I think getting rid of things that are 0 is fine.
  • CW: it does exist for color write mask.
  • MM: should exist. Problem in other web APIs: if the standard later allows a new field, existing apps that want to broadcast everywhere won’t cover it unless there’s an ALL field.
  • KN: doesn’t make sense. Shader would have to be put in slot.
  • JG: ALL is convenient. NONE seems useless. Could also polyfill this.
  • KN: vertex, fragment or compute.
  • MM: can have a bindgroup visible to all 3, use in this pipeline and this other pipeline
  • RC: if we had an ALL…
  • MM: ALL would be the union of all the others.
  • RC: argument for it being 0xfffffff would be, forward evolution.
  • CW: what about 0 meaning all stages. ;)
  • MM: please no.
  • JG: i could put together a PR removing the “none”s.
  • CW: think removing this for bitmask=0 makes sense.
  • MM: is anyone enthusiastic for this?
  • JG: I’m for getting rid of none.
  • CW: i don’t care about presence of none. but do care about whether it’s allowed for visibility. think it should probably be allowed.
  • DM: 1) allow creation of bindgroups? 2) require binding when we use them?
  • DM: can create bindgroup for compute only and consider not requiring it when constructing render pipeline. i’d prefer rules to be simple: if pipeline layout requires bindgroup then you have to bind a bindgroup. for none, still have to create a bindgroup and bind it.
  • CW: talking about NONE bindings.
  • MM: to fill binding with visibility NONE, have to create dummy buffer and assign it.
  • DM: on some APIs you’d have to do workarounds.
  • CW: i think that’s a fair argument for disallowing NONE.
  • DM: i’m fine with that. have feelings about making pipeline code and bind group code match. if we don’t allow NONE for either, i’m fine
  • CW: ok, so we disallow NONE in shader stage bit?
  • MM: yes?
  • JG: can you create something with no valid usage?
  • CW: or buffer of size 0.
  • JG: that makes more sense to me.
  • MM: D3D doesn’t let you allocate buffer of size 0.
  • JG: i’ve had use for 0-size buffers in the past. No use for piece of memory neither mapped for read or write.
  • CW: seems difficult discussion. lunch break.
  • DM: For visibility, there’s no workarounds we would need to support 0, so why don’t we support it?
  • JG: “Invisible” bind groups.
  • CW: Invisible bind groups sounds good.
  • DM: Resource usages fits into a similar bin? No workarounds needed
  • JG: Create something with no valid usage. I think I agree that it’s not a good idea to make something with no usage, but it doesn’t matter if we let you.
  • DM: Zero size buffers?
  • MM: If we’re going to say it’s an officially supported part of the API, the visibility value should actually be a named constant. If it’s not a constant, it shouldn’t work.
  • JG: Not that much pressure in my mind; it’s harmless. It’s an extra step for us at the spec level to generate an error for no usage flags. The spec is simpler if we allow you to have no flags. And every other part of the spec validates a flag is there. I don’t think it’s necessary for NONE to be named just because it’s possible.
  • KN: It’s also part of the API that these are integers in bitfield format. If you don’t have any it’s 0 and you might as well just type that.
  • MM: Don’t feel strongly. Mild disapproval but it’s okay.
  • CW: Okay, I think the general mild opinion is to remove NONE and allow 0. Resolved.
  • DM: Empty bind group layouts and zero size buffers/textures?
  • CW: I’m fine with allowing both of these.
  • DM: I want zero size buffers/textures. empty bgl is a bigger thing.
  • CW: Okay I think we can discuss this case by case basis.
  • RC: Unordered access on swap chain textures is disallowed because they require implicit synchronization so that any writes are done before any reads happen.
  • DM: Isn’t this controlled by the resource state?
  • RC: Yea, if try to make a UAV it is an error.
  • DM: If the problem is synchronization, can’t you wait for it to finish and then transfer state, etc?
  • RC: Needs to be synchronized with DirectX itself and other APIs so there’s more restriction.
  • MM: So the only way for D3D12 to modify it is to render into it?
  • DM: Can’t copy?
  • RC: Yes you can copy
  • JG: API-level synchronization? sounds like there’s some level of validation tracking that’s deliberately not done with render passes and compute passes with UAV that Vulkan takes care of with layout transitions.
  • MM: We can think about this more, but it won’t change the rules. let’s move on.
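
To make the resolution concrete, a small example assuming the GPUShaderStageBit constants of the time: there is no named NONE, but 0 remains a valid bitfield value, so code that builds visibility programmatically just starts from 0 and ORs stages in. The usedIn* booleans are placeholders.

```js
// No GPUShaderStageBit.NONE constant, but 0 is still accepted as "no stages".
let visibility = 0;
if (usedInVertexShader)   visibility |= GPUShaderStageBit.VERTEX;
if (usedInFragmentShader) visibility |= GPUShaderStageBit.FRAGMENT;
if (usedInComputeShader)  visibility |= GPUShaderStageBit.COMPUTE;

const bindGroupLayout = device.createBindGroupLayout({
  bindings: [{ binding: 0, visibility, type: "uniform-buffer" }],
});
```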

Spec editing workshop

Out-of-Bounds Draw Calls

  • KN: we revisited the topic of whether the spec should, for now, allow draw calls that go out of bounds via vertex fetch to no-op the whole draw call rather than clamp vertex fetches.
  • KN: It’s much easier to check the indices, etc. (though it’s slow) and discard the entire draw call, rather than transform all of the vertex input into a combination of buffer bindings and shader code to implement all of it. It’s a pretty big chunk of work. It’s very valuable for the spec to allow this for now even if we don’t do it long term. Myles / Apple weren’t there for that phone call.
  • MM: we agree with what you said, that it should be allowed to no-op the draw call. Another option: copy the index buffer. We’re at least interested to see the perf impact of that.
  • KN: copying the index buffer is tricky. Both validating and copying the index buffer mean doing this once per draw call. Doesn’t make you no-op things, but would like to leave the options open for now.
  • MM: yes.
  • KN: would be good to have the answer, what should we do for our implementations?
  • MM: we don’t know yet what option we need to take, so leaving both open for now is fine.
  • DM: having this high-frequency requirement - per-draw-call basis - when you bind the index buffer you provide the range.
  • MM: index buffers can have 2 different formats - 16 or 32 bit.
  • DM: you can compute both.
  • KN: right. Been a long time since thought about impls. Biggest reason for no-op vs. clamp: it’s difficult to do the copy-and-clamp more often than once per draw call. Check can be done if we have some stored data about ranges of values in index buffer. Can do without checking every single value. For example, with segment tree.
  • KR: Compute shader produces index buffer?
  • KN: any checking we do would be on the GPU.
  • MM: another tenet: all validation that we do on the GPU has to be in a compute pass; it can’t be run in the middle of a render pass. So you have to hoist the validation earlier.
  • KN: (clarification). Have to make sure you don’t raise issues during reordering.
  • MM: can’t compute index buffer by writing into UAV in render pass for example.
  • CW: it’s easy to do this b/c you can’t modify vertex/index buffers in render pass. Not like you’ll have to start/stop render pass.
  • DM: does anyone else have a feeling that this is a huge amount of work for a niche use case? Could we consider shipping an MVP where these are only backed by CPU-provided data?
  • KN: think the best approach would be to have a poorly-optimized compute impl and a well-implemented CPU impl. Doesn’t lock out the use case, and I don’t think it’s that hard to do a first-cut GPU impl of the validation. Think it’s worth allowing it. Maybe worth restricting the API.
  • MM: afaict nobody’s implemented this yet. So if Q is about implementability we don’t have the experience yet. One way to make a convincing argument is to try it.
  • CW: Idan (our intern) tried something similar. First index of drawIndexedIndirect. Idan made a GL buffer and wrote a value into it. Seemed tractable.
  • CW: sounds like consensus is, allow no-oping of the draw calls.
  • KN: was there another related issue to this?
  • CW: yes, 3 or 4.
  • CW: what about reporting errors?
  • MM: how would compute shader report this?
  • CW: if the compute shader detects a problem, it writes the value 1 to an SSBO which is read back later (see the sketch at the end of this section).
  • JG: seems less efficient than what WebGL does. WebGL relies on robust buffer access.
  • CW: ours would rely on that where available where it’s known good. Metal doesn’t have this so we need to do something there.
  • MM: on impls that do no-ops, they’d report the error.
  • JG: right, robust buffer access impl wouldn’t report it. Now we have more divergence.
  • KR: points about precedent in WebGL.
  • CW: without robust buffer access we already know if there’s an error.
  • MM: I would prefer to not have the divergent behavior.
  • KN: I agree.
  • JG: Ken’s right in his assessment of WebGL behavior. I don’t mind. Would be somewhat sad if we ship with this option even in the MVP. Sounds like we’re trying to embed prototype-phase investigation into the spec. The goal should be to implement something like robust buffer access everywhere. If that’s the goal, that should be the spec, not a different behavior we’re planning to get rid of.
  • KN: I’d like clamping behavior on vertex fetch but just because we think it’s implementable doesn’t guarantee it’ll work well.
  • KN: we could go the other way later. I closed this issue earlier, but reopened it.
  • JG: worried we kick the can down the road further. We should try to not no-op draw calls. People would have sophisticated impls that rely on it.
  • MM: I don’t agree with the characterization that we want robust buffer access on every platform API. Different APIs have different goals / desires.
  • JG: logic: fastest behavior that’s safe and robust on many platforms is robust buffer behavior. Spec should allow that. Would like to pick just that, or just no-op’ing. Think it becomes worse when you go indirect.
  • CW: my concern: on Metal, no RBA. Have to do programmable vertex pulling. Myles’ (?) investigation shows that has worse performance.
  • MM: I made two points - not based on experimentation, but talking with Metal team. 1) on some hardware, it’s faster. Someone suggested I test on some hardware. 2) flexibility for developers. And that we had explicit requests from large game engines to integrate this into Metal.
  • DJ: You can’t really easily test if robust buffer access is broken. It doesn’t say what it does. It says what it might do.
  • CW: Robust buffer access for vertex and index buffers is not easily testable. The OOB behavior -- you just give the start, not the end. It should be easier to test on uniform buffers because you pass both the start and end.
  • KN: I thought we had workarounds to turn it off
  • DJ: I have a vague memory that turning robust buffer access on was slower than rewriting for WebGL.
  • MM: Is the plan that Chrome will have a list of bad robust buffer access implementations?
  • KR / CW: Yes, a blacklist
  • KR: Testing this can be hard. Zeros or “any memory owned”.
  • JG: In Firefox, we’ve never had to blacklist. We haven’t looked super hard though.
  • MM: Does Firefox use untrusted Vulkan code anywhere?
  • JG: No Vulkan code in Firefox right now.
  • CW: OpenGL extension
  • KR: My recollection is it was on a mobile GPU.
  • JG: I think there was a recent device that didn’t support ARB_robustness? (Samsung). But it didn’t even advertise it.
  • JG: If we expect to have to implement this without robustness, does it change our decision?
  • CW: Yes, it means we have to compile a pipeline twice to support it. Once for indirect, once for not indirect, etc. I’d rather no-op draw calls and tighten the spec later.
  • JG: Slightly palatable to allow either, but not happy about it. I would be much less happy if we required no-op.
  • CW: That’s not what the discussion is proposing.
  • DM: To clarify, on D3D12 setVertexBuffers.. offsets or offsets and range? If it’s just offsets, you could have uninitialized memory in the larger buffer.
  • KN: I think we got an answer that the clamping is within the resource, not the heap.
  • RC: The view has the size, so it has enough. The only thing that is more complicated is constant buffers. Sometimes constant buffers are promoted to root. But there’s flags to prevent that. Also [something else] that we don’t have in WebGPU. You can’t index off the end of a constant buffer.
  • CW: No, you can’t, but we need a separate discussion about binding a buffer in a bind group that’s too small.
  • MM: Didn’t you say there’s flags to prevent promotion?
  • RC: Yes. (problem solved)
  • KN: Related Issue 311 that MM posted about what happens when the bound buffer is too small for the struct. We agreed to resolve it if MM agreed with us. Robin said we can transform to 0 initialized instead of clamping.
  • MM: Why was that the agreement?
  • KN: It’s hard otherwise.
  • MM: It’s also hard to rewrite shaders to make it return 0 (conditional assignment).
  • CW: Maybe robust buffer access on D3D12 or Vulkan protects us, on Metal there’s no hardware or driver support for it.
  • KR: WebGL turns those into errors (per draw call check).
  • CW: We could do that, but we should avoid that as much as possible. Maybe on Metal because you can do pointer arithmetic you can transform stuff in the shader to do it.
  • MM: Ran into this when implementing WSL. For big complicated expressions, it is quite hard. It’s not impossible, but a significant amount of work to rewrite the shader to do this. OTOH, it would mean if you don’t rewrite, it’s observable (no-op means nothing is drawn), and the earliest time we can do it is at the draw call itself which is a runtime cost.
  • CW: Maybe GPU-side validation with argument buffers on Metal.
  • MM: We do that, yes. argument buffer is pointer, length, pointer, length, … our preference is for allowing both to see if it’s even reasonable to do the shader rewriting. If it’s a month of work, maybe we can come back and talk about it again. But until then, we’d prefer to no-op without transforming.
  • CW: Sounds good
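
A minimal sketch of the no-op strategy above, shown as a CPU-side check at encode time (the function and field names are hypothetical). It only covers the part that can be checked without reading index data, i.e. whether the draw’s index range fits in the bound index buffer; checking the index values themselves against the vertex buffers is the part that needs a GPU compute pass (or stored min/max data) as discussed above.

```js
// Hypothetical encoder-side helper: if a drawIndexed call would read past the
// end of the bound index buffer, skip (no-op) the whole draw instead of
// clamping individual vertex fetches.
function encodeDrawIndexed(pass, state, indexCount, instanceCount = 1, firstIndex = 0) {
  const bytesPerIndex = state.indexFormat === 'uint16' ? 2 : 4;
  const lastByteNeeded =
    state.indexBufferOffset + (firstIndex + indexCount) * bytesPerIndex;

  if (lastByteNeeded > state.indexBufferSize) {
    // Out of bounds: the entire draw becomes a no-op. A GPU-side version of
    // this validation would run in a compute pass hoisted before the render
    // pass and, for example, zero out an indirect buffer or set an error flag
    // in an SSBO so the draw does nothing.
    return;
  }
  pass.drawIndexed(indexCount, instanceCount, firstIndex, 0, 0);
}
```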

Initial extensions we want for WebGPU v1 (texture compression, subgroup, …?)

  • CW: Initial extensions for v1! People are implementing extensions now. Francois added a PR for BC texture compression. Want to discuss what would be okay to put into the spec today before v1.
  • CW: We’d like to add BC compression, add it to the extension dictionary, and add the enums. Yes?
  • RC: Could be good practice for doing extensions.
  • CW: As long as there’s spec there doesn’t seem to be disagreement?
  • CW: subgroups?
  • MM: Tell me more
  • CW: Intel is interested in implementing subgroups and investigating and spec’ing.
  • MM: Ballot thing?
  • CW: Ballot, shuffle, etc. TBD what they will be called in WebGPU.
  • JG: Are these things we’ve rejected from the core spec?
  • CW: Yes, they are optional
  • MM: Hardware support
  • MM: How are you judging now vs later?
  • CW: bring-your-own-extension
  • MM: I wrote multiple proposals for other extensions. I just want to know what the criteria is. If it’s just whoever brings it to the table, that’s fine.
  • CW: For texture compression, one is important because we decided the core spec would have text about block compression but not have any one format required. It’s important for completeness to have one of these. Really it just adds enums, etc.; kind of special.
  • CW: subgroups and f16, someone brought it up and ideally apply to a lot of hardware. tangible benefits. It means tessellation probably fits well? Not sure how much hardware support the other investigations have.
  • MM: I just want to know what the criteria are.
  • JG: Just put it on the agenda.
  • CW: ex.) if you brought an ImageBlock extension, …
  • YH: If we have extension proposals, should we write spec for them?
  • CW: I think it’s fair to bring it for discussion with an explainer and not a full spec. Good to have feedback early.
  • MM: For substantial feature requests, likely the group will have substantive feedback. If there’s a full document it might be wasted effort as it is modified.
  • KN: Most extensions are going to have hardware features which is why it’s an extension so there will be other details to investigate and discuss.
  • MM: I should also mention that I should be more familiar with the particular subgroup proposal. I know it’s big for performance, and if it’s also interoperable, that’s a big plus for us.
  • CW: Yes, they should all be very interoperable.
  • KN: There’s #78 about subgroup ops.
  • CW: Always good to discuss extensions. The more portable and simple, the better. Things like raytracing are very difficult… and probably not interoperable.
  • CW: Q: should extensions live in the main spec?
  • KN: I think so.
  • YH: that’s my question. WebGL style is: Extension A, separate spec, references native extension, simple description about new enum and function, with sample code? that’s it?
  • KN: I want them all to be in the same doc
  • CW: I’m happy to have them in the same doc but worried that e.g. NVIDIA will write a raytracing extension that’s huge.
  • KN: maybe no vendor extensions.
  • CW: agree.
  • MM: when I read the vulkan spec that has all the extensions in it I get swamped.
  • KN: I’d like to be able to have something where we can turn on/off individual extensions. List of checkboxes, I can check what I want to see.
  • MM: if you implemented that it’d be wonderful.
  • DJ: it’s implementable.
  • KN: question is how text diffs work.
  • JG: my second favorite approach is all one doc. We have a lot of WebGL extensions but not that many WebGL 2 extensions. In same web spec style we’ll talk about extensions before we merge them. Since we have the “firewall” - we only have extensions we want - there probably won’t be too many to swamp the main spec.
  • CW: We can say we’re happy to have extensions in the main spec, but we reserve the right to exclude it.
  • KN: makes sense. Extension starts as detailed explainer? Can move into the extension / main spec when ready?
  • CW: for Intel’s efforts on subgroups, fp16, texture compression - texture compression can have spec work be done. That’s the main part left.
  • MM: this particular proposal talks about how to check if there are subgroup operations, but not if the ops are the same between the 3 APIs and a support matrix.
  • KN: more investigation needed.
  • CW: an investigation was done and the support matrix isn’t that nice.
  • MM: TBD then.
  • JG: you asking whether we say, “this is how block-compressed textures work”?
  • CW: yes.
  • JG: yes, if we can fit all of them.
  • KN: ASTC’s the most different.
  • CW: non-power-of-two block sizes.
  • JG: though it’s weird size, still same concepts.
  • MM: would be good to be able to tell what to do with an unknown compressed texture format.
  • CW: let’s defer that discussion.
  • CW: on our side we’d like to push these 3 extensions (compression, subgroups, fp16 storage) for around v1 time. Anyone else with extensions for around v1 timeframe?
  • MM: dneto@ asked a while ago about removing the sub-4-byte types from the language due to races. Want to get confirmation to add them.
  • CW / DJ: These 16-bit floats are incredibly useful. Makes workloads not memory-bound.
  • KN: 16-bit floats are useful for removing fixed function vertex input.
  • MM: there are no additional extensions we plan to discuss today.
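
A sketch of how one of these optional features might surface to applications, assuming extensions are requested by name at device creation (the descriptor shape and the string 'textureCompressionBC' are placeholders; settling them is exactly the spec work discussed above):

```js
// Placeholder names throughout; only the overall flow (adapter -> request an
// optional extension -> device) reflects what is being discussed above.
async function initWithBC() {
  const adapter = await navigator.gpu.requestAdapter();

  const device = await adapter.requestDevice({
    extensions: ['textureCompressionBC'],  // hypothetical extension name
  });

  // If the extension was granted, the BC formats would become valid
  // GPUTextureFormat values for createTexture(); otherwise using them is an error.
  return device;
}
```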

Review of new GPULimits entries

  • JF: combined two investigations. First () about number of dynamic buffers you can bind to pipeline/bindgroup; second, max bind group resources of every other type ()
  • JF: some specify these per-shader-stage (Metal), others per max numbers of resources you can set on any pipeline layout. Tried best to make that as clear as possible in the constant names.
  • KN: think the names are great. If we can simplify them it’d be nice.
  • CW: names are great. The limits themselves are gross.
  • JF: maxDynamicUniformBuffersPerPipelineLayout:
    • Pretty much only looked at by Shaobo, Dzmitry and me. Conclusion, 8 dynamic uniform, 4 storage buffers per pipeline layout.
    • CW: the limits come from Vulkan. Have to be careful about D3D12’s root signature requirement.
    • MM: is there a min size of the root signature in D3D12?
    • RC: yes. Docs say how big the whole thing is, and the size of each that you put in.
  • JF: max sampled textures per shader stage: comes from Vulkan. Don’t know if we can make it more if there are enough Vk impls out there.
    • CW: we can look at the distribution.
    • YH: are these dynamically queryable?
    • JF: in the individual APIs they’re queryable.
    • KN: in the beginning I think we should use these fixed limits for the moment. Adapter can return them, can default to default values, and then pass to device creation.
  • JF: storage buffers / textures: looks weird. Looks like Vulkan, across all shader stages, can only store 8.
  • CW: But since we only have two right now, the limit of 8 looks okay for now.
  • JF: Due to D3D12, might want to change maxStorageBuffersPerPipelineLayout… to maxStorageBuffersPerShaderStage = 4 instead.
  • CW: Looking at these numbers, it seems fairly limited. Not that big, unfortunately.
  • JW: maxUniformBuffersPerShaderStage is also because of Vulkan.
  • KN: We would hope most lightweight apps will never need more than these? so they will be portable by default.
  • RC: No max texture size?
  • KN: Not yet, this PR is about binding.
  • CW: One question, does the dynamicUniformBuffers count for the uniformBuffersPerShaderStage?
  • JF: Yes, they count together.
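
A sketch of the GPULimits entries under review, using the names from JF’s investigation; the per-shader-stage values marked “assumed” are not in these minutes and are placeholders taken from the Vulkan minimums mentioned above:

```js
// Values discussed above (8 dynamic uniform / 4 dynamic storage buffers per
// pipeline layout, 4 storage buffers per shader stage); the rest are assumed.
const defaultLimits = {
  maxDynamicUniformBuffersPerPipelineLayout: 8,
  maxDynamicStorageBuffersPerPipelineLayout: 4,
  maxStorageBuffersPerShaderStage: 4,
  maxSampledTexturesPerShaderStage: 16,  // assumed (Vulkan minimum)
  maxUniformBuffersPerShaderStage: 12,   // assumed (Vulkan minimum)
};

// Per KN above: the adapter can report its limits, defaults apply if none are
// requested, and requested limits are passed at device creation, e.g.:
//   const device = await adapter.requestDevice({ limits: defaultLimits });
```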

Next F2F

  • CW: Time of next F2F? Either in December or the New Year?
  • DJ: Apple can host given enough notice.
  • JG: Mozilla in MTV is not large enough.
  • DJ: Next Khronos F2F is in Barcelona. We usually have WebGL there.
  • CW: Would be happy to host in Paris, but maybe not next time.
  • DJ: I know we can host meetings in Ireland. Not sure the process.
  • KR: Personally, I’d vote next time for California.
  • JG: Microsoft?
  • RH: would be happy to host.
  • MM / YH / JF: happy with Seattle.
  • CW: sounds like people would like to stay on west coast. Let’s try for that.
  • DJ: next year? (all agree)
  • JF: Khronos F2F is Feb 3-7.
  • CW: Jan ~15 or so?
  • (general agreement)
  • RC: what about March?
  • Current thought: 2 days of WebGPU, week of Feb 10, about possibly week or so after Khronos F2F. No WebGL until Phoenix.

Vertex buffer set API and offsets (see discussion #421)

  • KN: how to set default offsets for vertex buffers? Right now has required sequence of offsets. If you don’t have any, have to put list of zeros.
  • KN: could have list of (buffer, size) pairs - may not be great.
  • KN: setVertexBuffer takes GPUBuffer and offset defaulting to 0. 1 call per vertex buffer.
  • KN: other option: any offsets you don’t provide are 0. Overload, maybe a typed array of some kind. Tough because buffer sizes are 64 bits.
  • MM: why typed array?
  • KN: easier to deal with in browser, and easier to interop with wasm.
  • MM: is this performance or ergonomics?
  • KN: both. If ergonomics, would say array of (buffer, offset) pairs. Might not even be most ergonomic.
  • MM: is there an option 3?
  • KN: there are ideas.
  • MM: how about today’s thing, except unspecified values are 0, so array can be too short?
  • KN: default to empty array? valid option. not great for bindings perf. would like to have arraybuffer / typed array overload. would work and not be too awkward.
  • AE: You still have the sequence of GPUBuffers already.
  • KN: yes, can get rid of with multiple calls, not sure if that’s a win. API-wise, I’d …
  • CW: people have said to use interleaved vertex buffers anyway. Maybe sequence isn’t necessary.
  • KN: I think it’s the most obvious. don’t think the sequence is that necessary. can make the arrays not produce garbage.
  • MM: how impactful will this be on perf?
  • KN: probably not that much.
  • MM: ergonomics is great, but adding a typed array to the mix would be terrible for ergonomics.
  • KN: don’t agree.
  • JG: also don’t agree.
  • MM: if we’re worried about perf, this would get people to make lots of setVertexBuffer calls, which will hurt perf.
  • AE: I commented: converting array to 64-bit array is slower than multiple calls.
  • KN: for us currently in Chrome.
  • JG: changing offset length to 32-bit?
  • AE: that’s bind group not vertex buffers.
  • AE: this is JS time. The bindings time.
  • MM: not that meaningful.
  • CW: because you have the cost of recording multiple commands.
  • MM: not timing amount of time it takes to call into Metal, but IPC code to receive messages.
  • AE: just to send the message.
  • KN: what’s wrong with that? want to reduce costs.
  • MM: would like to see perf numbers we can agree on.
  • DJ: we should measure this too.
  • CW: Q to Justin: do you apply vertex buffers immediately, or lazy application of the whole range?
  • JF: for setVertexBuffers? think immediately. but might not be best way to do it.
  • AE: I think we do batch them. When we call into the driver it won’t matter if we do setVertexBuffer vs. Buffers.
  • MM: have to consider batching infrastructure itself. Not clobber each other, etc.
  • CW: how to benchmark?
  • KN: we’d be pretty happy with either setVertexBuffer, or setVertexBuffers where you can set a fewer number of offsets.
  • MM: if you can provide perf evidence that would be compelling.
  • KN: think it’d be one of those two, or maybe something with a typed array.
  • RC: anything that avoids object allocation will be good.
  • KN: offshoot question about dynamic offsets in setBindGroup.
  • AE: we discussed, there’s a PR.
  • KN: #440. let’s skip for now.
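
A sketch contrasting the two shapes KN says would both be acceptable above (method signatures are illustrative, not settled):

```js
// Option A: one setVertexBuffer call per buffer, offset defaulting to 0.
pass.setVertexBuffer(0, positionBuffer);       // slot 0, offset 0
pass.setVertexBuffer(1, normalBuffer, 256);    // slot 1, explicit offset

// Option B: keep today's batched setVertexBuffers, but let the offsets
// sequence be shorter than the buffers sequence (missing offsets default to 0).
pass.setVertexBuffers(0, [positionBuffer, normalBuffer], [0, 256]);
pass.setVertexBuffers(0, [positionBuffer, normalBuffer]);  // all offsets 0
```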

JF's question about HTMLCanvas.getContext()

  • JF: when did we change getContext(‘gpu’) to ‘gpu-present’?
  • CW: WebXR’s doing it. agree not the best rationale. around the last f2f i think. no strong opinion.
  • JF: don’t think it matters that much. wondering who needs to agree on it
  • MM: do we think there’ll be more than one string to pass to getContext?
  • CW: based on WebGL, no.
  • MM: so what should the string be?
  • CW: actually, maybe gpu-webxr-presentation. not sure what that’ll look like.
  • DJ: think that becomes “magic” that happens. GPU context knows it’s being called internally by WebXR. or options you pass to the getContext call.
  • KN: at some point working on swap chains, thought about WebXR
  • MM: if nobody knows why it’s gpu-present, let’s call it gpu
  • KN: it’s because it’s not a context. it’s only for presentation. it’s a more precise name.
  • MM: most people calling this don’t care about it. they just want to get their webgpu on the screen.
  • KN: i don’t see a reason to not be precise.
  • DJ: think it’s about time we as a group raise the issue on Immersive Web to add WebGPU contexts.
  • CW: also should raise issue on HTML spec where canvas element lives.
  • MM: WHATWG standardizes these. they live in the HTML spec.
  • RC: recently webxr folks removed presentation context. didn’t have time to do it. think it’s coming back. want to mirror what’s in the headset on the webpage.
  • MM: have discussed this longer than it merits. i withdraw my proposal.
  • CW: we should still raise issues on these two other groups to integrate webgpu with them. can discuss the exact string inside the whatwg.
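
For reference, a sketch of the call in question; the string itself is exactly what’s being debated, and 'gpu-present' here just mirrors the wording above rather than a settled name:

```js
// KN's rationale above: what comes back is not a full rendering context, it is
// only used for presentation (setting up the swap chain), hence the more
// precise name than plain 'gpu'.
const presentationContext = canvas.getContext('gpu-present');
```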

PR burndown

  • #387 Add 'all' to GPUErrorFilter
    • MM: like to defer to Justin
    • JF: no strong opinion. case where we differentiate validation errors based off subcategories, may be more annoying, but at that point we could decide to add the ‘all’
    • JG: why is there “none”?
    • JF: measuring non-queue stuff to complete.
    • KN: original use: around createPipeline.
    • RC: not just about measuring.
    • KN: knowing when it’s finished. so you don’t stall on using pipeline not compiled yet.
    • JG: you’d usually use a fence.
    • CW: this API has too many fences already.
    • KN: maybe should do something about that. doesn’t change rationale that i don’t think there’s a use case for “all” even if we add more types.
    • JG: difference between filter:all vs. filter:none. confusing implications.
    • CW: more like a selection. not a filter.
    • MM: that’s good feedback.
    • KN: can be improved.
    • CW: conclusion?
    • MM: nobody willing to shepherd it, so …
    • JG: if there’s a way to catch the event. UncaughtError? would be nice to have just one way to do it. that’s probably enough. think we can keep what we have now.
    • CW: seems most people have mild negative opinion.
    • MM: closing.
  • #433 Clarify popErrorScope() rejects with AbortError if the device is lost
    • KN: says what the error type is, that’s all. is AbortError right?
    • RC: is AbortError existing?
    • KN: yes. in web idl spec.
    • RC: seems fine.
    • KN: other options are?
    • RC: can we make our own errors?
    • KN: maybe. not really. don’t think types of dom exceptions are very meaningful.
    • CW: ship it.
    • KN: other option is OperationError. very generic. using everywhere else.
    • AE: should pick same thing elsewhere. if map buffer, get device lost for example.
    • KN: AbortError isn’t clear, but e.g. do fetch, then didn’t want to do it anymore. not sure.
    • JG: can we make that request?
    • KN: yes, ok. leave this open and do that.
  • MM: oldest PR is from almost a year ago. it’s really interesting and potentially far-reaching in its implications.
    • Hoping we can get to a point where we can talk about it in the future.
  • #434 Coordinate systems.
    • YH: the concern is wording: how to describe it appropriately.
    • KN: would rather merge and fix than trying to get it perfect the first time.
    • CW: hard to talk about precisely. would be happy to merge this. i volunteer to fix the text.
    • KN: OK. everyone happy to merge? justin?
    • JF: yes.
    • CW: ship it.
  • #440 Use Uint32Array for dynamic offsets
    • MM: why different from vertex buffers?
    • AE: in Vulkan these offsets are 32-bit. Because they’re 32-bit we can put them in a Uint32Array. Based on investigations in WebGL in both Chrome and WebKit this is faster.
    • MM: how much faster?
    • AE: I don’t remember right now.
    • KN: main reason this differs from setVertexBuffers is there’s no sequence of GPUBindGroup here.
    • JG: and that we’re changing this to 32-bit. Might as well accept typed array. could theoretically be two steps. change to 32-bit first.
    • KN: could do that.
    • JG: then, do we want to make it uint32array.
    • KN: other issue: people working on emscripten will ask, to avoid creating garbage, could we add offset and length argument like in webgl 2 to avoid creating new views into typed arrays. probably the right thing to do here if we’re going to allow uint32array.
    • AE: i can update that, what does myles think?
    • DJ: i see the reason but think it’s gross.
    • JG: what about sequence or uint32array?
    • DJ: i’m more happy with that.
    • JF: i vote for sequence or.
    • KN: so two overloads: one taking a sequence, one taking a Uint32Array with start and length (see the sketch after this list).
    • CW: start and length are optional?
    • DJ: both default to 0.
    • MM: whoever writes the spec text should make it clear that those are offsets into the Uint32Array.
  • #444 The Type must not be the identifier of the same or another typedef
    • MM: this is stuff that JS developers don’t type.
    • KN: right. according to IDL rules the IDL is wrong.
    • CW: what does this change?
    • AE: he was changing something in the idl and the generator complained.
    • MM: OK.
    • CW: ship it.
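
A sketch of the two setBindGroup overloads agreed on for #440 above (argument names are illustrative; per MM, the start and length arguments are indices into the Uint32Array, not byte offsets):

```js
// Overload 1: a plain sequence of dynamic offsets (the ergonomic path).
pass.setBindGroup(0, bindGroup, [256, 512]);

// Overload 2: a Uint32Array plus start and length, so Emscripten/WASM callers
// can reuse one array every frame without creating garbage views.
const dynamicOffsets = new Uint32Array([256, 512, 1024, 0]);
pass.setBindGroup(0, bindGroup, dynamicOffsets, /* start */ 0, /* length */ 2);
```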

Draft Working Group Charter

https://htmlpreview.github.io/?https://github.com/grorg/admin/blob/wg-charter-draft/wg-charter.html

  • DJ: asking whether they can send an advance notice to the advisory committee that the gpuweb CG is asking to form a WG, and here’s the charter.
  • DJ: they like to send advance warning of these things so they can get a lot of the discussion out of the way before a formal vote. when it’s an official proposal you have formal messages, processes, etc. Also, even in their own companies, can be surprised it’s happening.
  • JG: yes, inside Mozilla we get these.
  • CW: AI is for all companies here to get their lawyers to look at the charter.
  • DJ: I did.
  • CW: I didn’t.
  • DJ: and also the AC rep. dbaron@ is Mozilla’s.
  • KR: TV Raman is Google’s.
  • RC: if I say yes what happens?
  • DJ: doesn’t say we will.
  • JG: are we converting to WG? intended structure is still the same as today. majority of input comes from the cg. wg will have a slightly different colored hat on.
  • CW: this is not just written in the charter, but we resolved this last f2f.
  • DJ: W3C agrees that the WG is a rubber stamping mechanism. changes come from cg, not wg. reason we need the wg is we get a tighter IP agreement - first, if we want to publish on w3c recommendation track instead of just draft note. doesn’t stop us from implementing anyway. royalty-free license trigger when you go on this track. in cg, you’re only on the hook for the contributions you personally made to the community group.
  • RC: if we all say yes then a wg will be formed.
  • DJ: yes, that’s my idea.
  • ...more discussion…
  • DJ: some groups in w3c work this way but never called out as clearly in the charter.
  • CW: we should do this for ourselves and because it’ll make the w3c happy.
  • DJ: I have in the charter that the discussions are meant to be public.

Agenda for next meeting
