Minutes 2021 08 02

GPU Web 2021-08-02

Chair: Jeff

Scribe: Ken

Location: Google Meet

Tentative agenda

Stencil reads are S,S,S,S or S,0,0,1. #1978
Add multi-draw-indirect feature. #1949
Consider limiting buffer sizes to multiples of 4 bytes #1999
Request for a GPUCommandEncoder fillBuffer operation #1968 also #28
Problem with the default queue #1977
(stretch) Should we constrain the location of user input-output stage variables WGSL #1962
(stretch) Does compositingAlphaMode affect using the canvas as an image source? #1847
Agenda for next meeting

Attendance

Apple
- Myles C. Maxfield
Google
- Austin Eng
- Kai Ninomiya
- Ken Russell
- Shrek Shao
Microsoft
- Rafael Cintron
Mozilla
- Dzmitry Malyshau
- Jeff Gilbert
Michael Shannon
Mehmet Oguz Derin

CTS update

(outstanding tasks: https://bugs.chromium.org/p/dawn/issues/list?q=Type%3DTask-CTS&can=2)
KN: Making decent progress. Shrek's working on reftests now. Can't run them in Chromium yet. Working on this to make sure canvas resizing behavior works properly.
KN: working on device lost tests. Basic but exciting stuff./
KN: link above has tests being tracked in Dawn's issue tracker. Feel free to look there for test tasks if you can pick them up. Please reach out if you'd like help; happy to.

Administrivia

TPAC coming up?
MM: it's in October.
- 18 to 29 October 2021

Stencil reads are S,S,S,S or S,0,0,1. #1978

PTAL
See whether this satisfies people
"Implementation-defined unspecified value" - is that the wording to use? (Seems yes)
KN: LGTM.
MM: is the table above it normative? One place says S,S,S,S or S,0,0,1, and the other says S,X,X,X.
JG: I need to fix that, thanks.
MM: assuming table is modified to match the prose, LGTM.

Add multi-draw-indirect feature. #1949

KN: two features here. FirstInstanceIndirect - can use a value other than 0 for firstInstance field of indirect draw. Other one - multi-draw-indirect.
KN: we don't know what the venn diagram of these two features looks like. They are separate features in Vulkan (I believe). But Vk's the only place where it gets to that level of granularity.
MS: technically multi-draw's 2 features as well. multi-draw, and draw indirect count. Only separated in Vulkan.
MM: all of these are optional right?
KN: yes.
MM: one option is, lump it into one feature. If backend doesn't support any of the requirements, it doesn't get the WebGPU feature.
KN: yes. Unfortunately we don't know hardware landscape for these features yet. Hard to make decision. Multi-draw's probably a subset of FirstInstanceIndirect. Desktop & newer mobile devices probably have these features.
MS: from my research - FirstInstance & multi-draw, outside of Metal (Metal only supports FirstInstance) - usually come together. multi-draw's not that useful without FirstInstance. DrawIndirectCount not as widely supported on mobile.
MM: could say - regardless of today's breakdown, it'll make sense in the future to lump these together.
KN: another thing we could do - make them all separate features now. In the future, we could coalesce them together. If it turns out to be important for the current hardware landscape.
DM: not strictly about hardware. If FirstInstance separated out, browsers could implement without hardware support. But if we bundle FirstInstance with the other two, browsers can't polyfill this.
JG: Didn't WebGL polyfill this? need to rebind instance vertex buffer. Can polyfill that.
KN: WebGL doesn't have indirect calls. Can't modify the data on the GPU.
MS: with RenderBundles, if you have FirstInstance, the user can polyfill multi-draw. But without FirstInstance, you can't do it. Can't dynamically change where your instancing is inside a render bundle.
JG: two features then?
KN: sounds good. Any chance we'll want to put multi-draw in core, and say it'll be emulated?
JG: maybe, don't have to decide that today.
MM: user can polyfill it as well as the browser can. Think the "supported" signal is meaningful.
MS: not very useful without FirstInstance anyways, and you can't polyfill that.
KN: come up a few times in the past - there are some features where it'd be useful for us to polyfill, and for app to know whether feature's emulated or not. This could be one of them if we want to go down the emulation route at some point.
JG: sounds like we have consensus for today's direction. Discuss emulated/not in the future. 2 features for now.
MM: clarifying: this follows same out-of-bounds behavior as single-draw indirect, right? i.e., set count to 0 for out of range draw calls? Related: for single draw indirect, the single invalid draw call's count would be set to 0. In multi-draw, would we no-op the whole multi-draw-indirect? Don't have to decide this now, but spec text needs to be written.
JG: the way this is written in GL you write the most generic one first. like DrawIndirectInstanced. Then add on multi-draw, FirstInstance. So it's clear where you can no-op things.
MM: no opinion on the behavior, we just need to spec it.

Consider limiting buffer sizes to multiples of 4 bytes #1999

KN: context for this: have a bunch of places where buffers that aren't 4-byte multiples don't work. MapAsync - has to be 4 byte multiple. Defaults to size of buffer. If buffer's not 4-byte multiple, mapping fails. MappedAtCreation - 4 byte requirement. CopyBufferToBuffer too. There are things you can do with buffers that aren't multiples of 4. CopyBufferToTexture, etc. But mostly not useful, and collides with corner cases in other parts of spec. Can also add complexity to initialization (implementation) - can't use CopyBufferToBuffer because can be restricted to 4 byte multiples. What do people think about saying buffers need to be multiples of 4 bytes? We can take out this limitation later.
MM: we expect authors to use float16 a lot, so would expect the buffers to be multiples of 2 bytes, not 4.
KN: that's a reason why it's useful to be able to do things with multiples of 2 bytes - but there are a lot of things you can't do. Have to figure out how to initialize it for example.
MM: compute shaders?
KN: you have to do something weird to get the data in. Use a compute shader, or texture copy, to get the data in. Or, pad your buffer an extra 2 bytes and you wouldn't have problems.
MM: or, your initial upload uses float32s, and in your run loop the data sits in float16 buffers. That would be reasonable.
KN: yes.
JG: what if we did limit to 4 bytes, and then see user feedback?
MM: I agree with Corentin's concern. There's no platform reason for this. Additional restrictions beyond the platform. App could work, then data changes slightly and it breaks.
KN: I don't know if this is the complete list of operations needing 4 bytes. What if we required multiples of 4 bytes for map usages?
MM: what's the goal? Simplify validation code in browser?
KN: goal's to make it less frustrating. Frustrating to set size of buffer to arbitrary input, then I can't map it later because it's not 4-byte aligned size. Expose the realistic limitation earlier.
MM: don't quite understand. Reason limitation exists is how the map operation works. Creation doesn't require it. It's correct to expose the restriction at buffer mapping time.
KN: also, yes, making things simpler in the browser. We do need to initialize our buffers. Not something the underlying APIs are set up to do with these buffers. Might need special cases to init the last 1-3 bytes. Or, round all buffer sizes up to 4 byte sizes and use regular clearing - but through implementation, have to fake the buffer size being 1-3 bytes smaller than it is.
MM: in Metal there's a difference between mmap and malloc. Buffer creation's more like mmap. We'll have to implement malloc layer on top of mmap. Possible but unlikely for 2 of the users' buffers to live inside the same Metal buffer. We'll have to do this tracking ourselves anyway.
KN: I don't know whether we sub allocate actual buffers like that. We do for staging buffers (writeBuffer, writeTexture). It's a good point.
JG: sounds like there isn't universal appetite for this. Close it as interesting idea, return to it later?
KN: ok. One thing I'm not satisfied about - if you create buffers with a certain size and then try to map them, mapping will fail. Round down mapping size / round up allocated size?
MM: you're proposing if app says they want 3 bytes long, and browser allocates 4 byte buffer and lets you use the 4th byte?
KN: essentially yes.
MM: difference between requested and actual size.
KN: you wouldn't be able to use it when you bind as resource, or use as vertex buffer. But allocation would round up. Then you can map it. Extra byte wouldn't even need to be in the resource itself.
MM: so you map for write, write the last byte, map for reading, read the last byte, it might be different.
KN: you can't make a buffer both MAP_READ and MAP_WRITE.
AI: KN to file an issue with a more constrained test case and concern.
KN: will think about whether to close this issue.

Request for a GPUCommandEncoder fillBuffer operation #1968 also #28

Various options, including doing neither.
MM: think this is easy, either first or second option.
JG: seems nicer if you can fill things with either 1's or 0's. Need byte values then. DWORD increments. Weird, probably tolerable. But gets into problem with non-multiple-of-4 lengths.
MM: does need to be multiple of 4. Trying to fill 3 byte buffer with 1 DWORD would go out of range on Vulkan.
MS: setting by bytes won't do it for 1's.
DM: may need to check Vk semantics. Needs to give you a way to init buffers. non-4-length buffers you wouldn't be able to init at all.
MM: separate out q of offset/length being multiple of 4.
DM: I'd be OK with zeroing out. Already have to do that. On D3D12 we have 0-filled buffer, we just copy from that. Allowing arbitrary fill values will cause problems.
MM: you'd need to use compute shader? ::ClearUnorderedAccessView?
DM: need temporary descriptor, make its lifetime the command buffer that created it; storage usage on all DST buffers.
RC: ClearUnorderedAccessViewUint is also less efficient, has perf cost on some hardware. If we have 0-clear guarantee on all buffers, is this on second usage of the buffer that you want it cleared?
MM: yes. App wants contents set to 0 but they don't want to write compute shader.
MM: if group doesn't want to do it that's fine too.
JG: let's just do zeros. If we want to do it, would be nice if app didn't have to do it too. Everyone holding on to big list of 0's.
MM: if app's biggest buffer is GB large, do you need parallel buffer that big?
DM: we just issue multiple copies.
RC: or we detect you're copying into the whole thing so we can bypass it.
MM: let's do the zeroing thing.
JG: is that zeros with a subrange?
DM: yes, a subrange.
JG: ok let's spec it.
MM: I'll open separate issue about whether subrange has to be multiples of 4.
JG: don't know whether we have that limitation, but OK to open an issue.

Problem with the default queue #1977

DM: circular reference. Q references resources used on GPU, device references Q because it's the default Q. Resources reference Q (?). Not the end of the world, circular references. One contributor was very unhappy about this semantic. In multi-Q world, handing work to a Q that you don't schedule causes problems. Can figure out which is the default Q, but nothing in operations say that default Q's used. Create buffer, map, fill and unmap. Nothing about default Q is used, but it may need to be behind the scenes.
DM: maybe we can consider having the Qs explicit where they're used?
MM: question about circular dependency. Understand the links you drew, but circular dependency only exists when resource is in use? When resource not in use, resource doesn't retain device any more, so cycle is broken? If so, this is intentional - resources are retained while they're in use.
JG: for question of things operating on a queue, creating command encoder or unmapping buffer, do feel it would be nice to make it explicit that that's happening. Would like to avoid case of later adding multiple Qs. device.thing == queue.thing on the default Q. No special-casing. Then only reason to have it not be explicit is if we were to add multiple Qs.
MM: makes sense to add explicit Q option. App has big chunk of work and wants to do it on one queue, should be possible to make it do that instead of doing some of the work on another Q just because it's the default one. But think it should default to the "default" queue.
JG: do you think it's a problem for apps to have to type the extra couple characters?
MM: yes, they shouldn't have to think about it at all. We bless one of the Qs.
JG: don't know if we bless a single Q. Just make the 1 Q we have do the default work. It's just that it exists.
MM: agree it has no special powers, but it is treated specially.
JG: what I don't love pedagogically - there's concept of gradually adding complexity. Would like these things to not be surprises to end users.
MM: agree on the philosophy. Saying you have to have a Q makes total sense, just disagreement of whether user should have to type characters to say it.
DM: we could make multi-Q a feature. You opt in, then it'd enable those queues as fields in the descriptors that you have to specify. If you're single-Q they don't exist. Everything uses the default Q. But at device creation, opt into multiple Qs, then you have to specify it for every operation that uses a Q.
JG: adds complexity to polyfilling. Polyfills need to know whether they have to write the field.
MM: in JS you could always write the field and it'd be ignored.
JG: unmap takes 2 objects. buffer, and Q. command buffer creation - feels weird for that to be on the device, but it's on the Q, and only on the Q because you specify that in the options field. q.createCommandEncoder? But not something you could do optionally.
MM: DM's suggestion seems reasonable. Opting into multiple queues, you have to tell us which Q you want the work done on.
JG: think that makes it more confusing. But no way around this except to have createCommandEncoder both on Q and device.
DM: discussion was on CreateCommandEncoderDescriptor having the Q. Not discussing adding a new method.
JG: I'm proposing that. Most obvious way to create something on the Q is not to add field to descriptor, but to say q.createCommandEncoder. What do we do for texture views? Is it texture.createview or device.createTextureView?
DM: it's texture.createView.
JG: so that's the reason you'd want to say q.createCommandEncoder.
MM: what if we say this takes dictionary with optional Q, and creating command encoder can be on the Q itself?
MS: can be or must be?
MM: must be.
JG: device.queue.createCommandEncoder?
MM: my thought about it being optional is the unmap case. But can imagine it would matter in some cases.
RC: the only thing you use a Q for is submitting. So if creating command encoder you give it a Q, does that mean you can only submit on those Qs?
DM: yes.
MM: in Metal you submit the command buffer.
RC: D3D, you submit on Q.
JG: D3D has more validation.
RC: if we require ppl to spec Q when they make command encoder, does that allow impls to make the command buffer more special / optimized? Otherwise you know everything you need to submit buffer on a queue.
KN: ..
MM: there's no option to create a command buffer without knowing which queue you'll submit on.
RC: does association make things more efficient?
KN: encoders are tied to queues. When you create an encoder it's tied to it. Just a question of exposing that fact. "This goes on the default Q unless you tell us otherwise."
MM: different Qs might have different instruction sets. Copying buffer to buffer might have different representation of hardware commands. Encoder needs to be queue-specific.
MS; since it's Q-specific then creating it without specifying the Q makes it less obvious that you need to submit to the same Q.
KR: concerns about multithreaded command encoding and MT access to Qs.
KN: think we would have this. Encoders can't be transferred / serialized. Only command buffers can be transferred. But you can have multithreaded access to a Q, including submitting to it from multiple threads.
RC: if you need the Q at the beginning, maybe you should submit against the command buffer.
KR: talking about changing basic aspects of the API. Think Corentin should be present before making changes like this.
JG: think we can still make a change like this.
KN: think we should discuss this carefully, but probably not too difficult for users to adapt.

(stretch) Should we constrain the location of user input-output stage variables WGSL #1962

(stretch) Does compositingAlphaMode affect using the canvas as an image source? #1847

Agenda for next meeting

CW: won't be there next week, but still discussion between empty and "null" bind group layout, and how you unbind stuff from the pipeline? will be a pain point.
- wart that we need to fix sooner rather than later.
- MM: we discussed that 4 years ago if I remember correctly. But no idea if the consensus we reached will apply to today's spec.
- CW: we didn't have a spec to record the decision in.
- DM: fix can be 1 line in the spec. BGL is the same, etc. Please do fix it. We want it too.
- CW: maybe I'll make a PR but won't be there to defend it.
- MM: there are other design options. "unbind" this bind group from this slot.
Add indirect-first-instance feature (#2022)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minutes 2021 08 02

GPU Web 2021-08-02

Tentative agenda

Attendance

CTS update

Administrivia

Stencil reads are S,S,S,S or S,0,0,1. #1978

Add multi-draw-indirect feature. #1949

Consider limiting buffer sizes to multiples of 4 bytes #1999

Request for a GPUCommandEncoder fillBuffer operation #1968 also #28

Problem with the default queue #1977

(stretch) Should we constrain the location of user input-output stage variables WGSL #1962

(stretch) Does compositingAlphaMode affect using the canvas as an image source? #1847

Agenda for next meeting

Clone this wiki locally