GPU Web 2020 08 03

Chair: Corentin

Scribe: Ken and Austin

Location: Google Meet

Tentative agenda

Subgroup operations #954 (Oguz?)
Support Rgb9e5 packed float texture format #957 (Connor? Corentin?)
Ordering of promise resolution. #962 (Corentin?)
Multisample Coverage on Metal #959 (Tomasz? ~~Kai?~~ Corentin?)
PR burndown
- Change rowsPerImage to count in blocks #958 (Dzmitry)
- Others?
Agenda for next meeting

Attendance

Apple
- Dean Jackson
- Myles C. Maxfield
Google
- Austin Eng
- Brandon Jones
- Corentin Wallez
- Idan Raiter
- Kai Ninomiya
- Ken Russell
Microsoft
- Damyan Pepper
Mozilla
- Dzmitry Malyshau
- Jeff Gilbert
Connor Fitzgerald
Matijs Toonen
Mehmet Oguz Derin
Michael Shannon
Timo de Kort

Administrivia

Was call for comments / review on charter for WebGPU working group
Have been 17 companies that answered on it. Everyone’s in favor of moving forward. Should be able to get this done soon.
CW: One comment (from memory) about abuse being possible to mine bitcoins. Had question of whether it would be possible to detect abuse vs. for example machine learning. Just this comment to address.
TPAC fall coming up. There will be a breakout week. Maybe we could propose a WebGPU subject there? Don’t think the deadline for this is any time soon. Need to decide whether this group will decide at virtual TPAC - think we decided no because we have our own meetings. But if we have a WebGPU breakout meeting it would be nice to synchronize.
- Open topics; can be talk, discussion around new topic, interaction between various APIs, etc.
- At TPAC the topics are usually suggested at the beginning of the breakout day.
- DM: that sounds good.
- DM: multi-queue. :)
- CW: you can, if you want, but the intent is to interact with / socialize with other members of the W3C.
- MM: yes, the audience is the entire W3C group. 99% don’t know what a GPU is. :)
- CW: I promised a multi-queue proposal soon, but hasn’t been done yet.
CW: I’ll be out next Monday, Dean can you organize meeting?
- DJ: yup!

Subgroup operations #954 (Oguz?)

CW: David and I have added comments on two different issues to think about with subgroups.
- MD: should we expose subgroup size? think it makes sense to not, because API support for it isn’t uniform everywhere. WGSL part can be better discussed in that call.
CW: fixed size subgroup vs. not: talked about this last week. Doing investigation, seems we can not expose the subgroup size on the device, because Metal doesn’t have this, and the size can change between pipelines. Can’t expose on pipelines, because neither D3D nor Vk exposes the size per pipeline. Eg., if on Intel, 2 pipelines, won’t know which one has subgroup size of 8 or 16.
CW: if want to have subgroup support, can’t expose it at the API level; has to be a shader constant.
CW: question of whether the shader constant could be a specialization constant, when WGSL has that, so you can size arrays, etc. based on it. But for pure API side, seems not possible.
MM: for Vk if you want the subgroup size, is the only way to ask in the shader?
CW: 3 ways. Unextended Vk, use specialization constant in shader, or use builtin value (just a specialization constant). Can ask at compile time or runtime. There’s an extension that lets you control / query the subgroup size. And a final extension that allows you to query it the same way as in Metal; let compiler decide, and then query it.
MM: for the first option, as a spec. constant, is this something the runtime sets and author queries, or something the author sets?
CW: think of it like a macro #define in C. You don’t know what it is.
MM: so not exposed at API, only in shading language?
CW: yes.
MM: ok, sounds like that’s what we need to do then.
CW: as first step in the API, won’t be able to query subgroup size. Worth talking about fixed subgroup size as well, as a second step. Wouldn’t be the first extension introducing subgroups. But a followup would let you control it.
CW: based on experience report from team that was using lots of subgroups, sounds like this would be useful.
MD: isn’t fixed subgroup size essentially doing what Metal does manually? Doesn’t Metal handle this behind the scenes?
CW: meant you can control the subgroup size as the developer, and you know before compiling pipeline what the subgroup size will be.
MD: possible for that fixed subgroup extension to reject your request?
CW: that extension would allow you to control. If pipeline’s created, it’ll be the subgroup size you requested. Caveat: if shader uses too many registers, shared memory, other impl details, shader compiler could bail out and say “I can’t”. Most workloads should not hit this though.
MD: understood.
CW: have this already right now though. You can induce so much register pressure that shader compiler will bail out.
MD: seems to me you can add this as a second step, maybe subgroup control, seems even more niche than exposed subgroup size. Maybe we’ll get that exposed size in D3D, and we already have it in Vk extensions.
CW: question then, how does querying of subgroup size help in WebGPU as it is today?
MD: I wasn’t going to put it in the host side. Metal says it helps compute a better dispatch size - that was motivation for including it. Not enough support right now. Maybe WebGPU can choose a better dispatch size itself.
CW: makes sense. Diff btwn Metal and WebGPU dispatch: in Metal it controls both local and total size. Number items in workgroup, and number of workgroups. Want to size one to be a multiple of other. In WebGPU the number of workgroups is fixed at pipeline creation time, so maybe less useful.
MD: I’ll update the PR, remove the host side, and leave the device operations.
CW: sounds good.
DM: is it portable to find out from the shader which subgroup size you have?
CW: in every API you can query the size. Don’t know about the specialization constant part though. That’s possible in Vk, not sure about others. Querying about runtime is available in all APIs.
DM: so we can expose it everywhere with the understanding that host side can’t control it but shader can alter behavior based on the subgroup size?
CW: yes. Things like saving GPRs on AMD hardware can be done this way.

Support Rgb9e5 packed float texture format #957 (Connor? Corentin?)

CF: commonly used for HDR textures. Combined exponent among all color channels. Every API supports it back to old OpenGL, so doubt there are any compatibility issues. Vk naming is different from others, but binary compatibility should be the same.
MM: consider operation where I take a Float3 and pack them into this format. Imagine there are multiple representations of the result that can occur for any input?
CF: yes.
MM: e.g. different average between the 3. 1) Do graphics cards use the same packing? 2) Should we care?
CF: as far as I’m aware - no evidence GPUs behave significantly differently. Think they shouldn’t - it’s an inherently lossy format. It’d generally be given to us by the user. Can’t render into it, only read from it, so data has to come from the user.
KN: that’s what I was thinking. Don’t know whether you can write to this in any way - e.g., use it as a storage texture. If essentially a read-only format - it should be philosophically - then that problem would not be an issue.
MM: do we have other read-only formats?
CW: we have compressed textures.
KN: I think of this as a compressed texture format, though it’s not a block based format.
CF: I think that’s a reasonable approach to take.
CW: I wrote test cases for this, and at corner cases, it seems all the APIs’ behaviors match.
MM: what matches?
CW: this format is CopySrc, CopyDest, sampleable. Copying to it and sampling as a texture, the values you get in a Float3 / Float4 match among all APIs. Decoding of bits by GPU is portable / consistent.
MM: cool. When you copy into it you don’t pass a Float3, right? You pass the compressed thing?
CW: yes.
MM: I’m convinced, let’s add it.
CW: also, supported by all APIs we care about, no caveats.
CW: great, let’s have the format! CF can you add the format to the spec?
CF: yes. Are there any catches around read-only-ness I have to worry about?
KN: we don’t have a texture capability table yet.
MM: at least add a TODO about spec’ing it read-only.
KN: can use ISSUE:## as a comment.
CW: thank you!

Ordering of promise resolution. #962 (Corentin?)

CW: would be nice for developers if there were some guarantees about order of resolution. In particular things tied to a queue. Mapping, signaling, timing of GPU command buffer. Would be nice if they resolved in queue order in JS. Then if something resolves you know that all previous ones were resolved too.
MM: makes sense to have a few of these requirements. Don’t think it makes sense to have a total ordering. E.g., requestAdapter while other adapter’s in the middle of a fence operation. But having some requirements sounds reasonable.
CW: ordering requirements would be, queue operations’ Promises have to be resolved in order they were put on queue. On Device we have several promises, including GpuDeviceLost and popErrorScope. Should be clear that popErrorScope should resolve in the order you pop them.
MM: if you map a buffer that’s in use by the GPU, and have a fence that’ll be resolved when the buffer’s unmapped, then … might not want to delay the fence until the data transfer finishes.
CW: why is there a data transfer between GPU and CPU?
MM: if you’re readAsync’ing.
CW: buffer in GPU process should be in shared memory. Readable by CPU.
KN: have to do GPU-to-shmem copy, if there is one.
CW: the fence, when you call mapAsync, you both give intent that you want to map, and get Promise. If fence comes later, any operation needed to make buffer accessible on CPU would already be enqueued.
KN: true, that GPU-to-CPU copy has to be well ordered w.r.t. rest of queue.
MM: can you expand on that?
KN: think we’re both still thinking. :)
CW: imagine there’s operation that has to happen for buffer content to be visible to CPU. Then that operation, you can enquque when you receive the mapAsync call. In timeline: JS, mapAsync called, then signal(). GPU process: mapAsync received, turns into some visibility operation / copies / etc. on queue. Fence signal comes on same queue after mapping operations. When fence passed, these operations will be resolved. Probably need to associate … already done operations for mapAsync when fence is signaled.
KN: still CPU side operations. Could be copies that need to happen on the CPU that don’t need to be ordered relative to the queue.
MM: think I understand what Kai said. IF you have GPU buffer and shmem between GPU and web process, if you want to do MAP_READ, there’s a GPU side op (set visibility flags) and CPU memcpy into the shmem.
KN: yes, GPU-to-CPU memcpy, and possibly also CPU-CPU memcpy if you don’t use shmem.
MM: would be good to know if you could signal fence before one or both of those are completed.
KN: reliability here is probably more imp’t than reducing latency. Portability win is worth more, IMO.
DM: maybe it’d help to understand better in a situation where ordering matters. What kidn of error are you trying to prevent?
CW: some power users could put a bunch of mapAsyncs and wait on a fence. They don’t need the rest of the Promises. Once fence completed, they can mapRange all they want because guaranteed that mapAsyncs are all completed (if they succeeded). That’s one example. Might be more of these in the future. Once fence passed, you know that timing of command buffers have passed and they’re all available.
KN: don’t need to cater to someone doing it that way; they could Promise.all() them. But for portability, they might rely on that, and then we have to ensure it works.
DM: I see. People’d call getMappedRange after certain point and assume it works everywhere. Not possible to guarantee they waited on the Promise and didn’t just drop it?
CW: right.
MM: I’m willing to accept this. Kai’s logic makes sense.
KN: we really wanted to allow this and wanted to be portable, we could come up with a design. Unnecessary though. We could put getMappedRange on an object and pass it to you in the Promise, to make sure you waited for the result. Is this worth the small latency improvement? I don’t think so.
MM: what’s the cost of doing that?
KN: added API complexity.
MM: I think it’s trading complexity
KN: maybe so.
DM: we wouldn’t be able to use the mappedAtCreation flag.
MM: good point, that’s very important to continue to work.
CW: API is less composable without getMappedRange. Imp’t that to access it you only need the buffer. State leaking because you need the ranges to be disjoint, but maybe it doesn’t matter as much.
MM: can I nominate Kai to write some tests for this? Where web developers might rely on this ordering.
KN: yes, I think I already volunteered myself for this by writing the test plan for it.
CW: think we convinced ourselves that we should have a total ordering of ops on queues, Promises that are tied to queue timeline. What about ordering between popErrorScopes and/or device.lost()?
KN: I think that’s necessary as well. You can see that there’s a device lost and drop your remaining promises, or don’t handle errors that come in. Don’t think it costs us anything to add that ordering.
MM: ordering is, all pop Promises happen before device lost occurs?
KN: yes, and they might resolve with DEVICE_LOST or something.
CW: if you have bunch of popErrorScopes, they resolve in the order. If device lost, everything before it resolves, and everything after has device lost error. (??)
MM: agree, we probably don’t want to keep these promises around forever.
CW: think that resolves this item. There are a bunch of other Promises, but they probably don’t need ordering guarantees.
MM: if I popErrorScope and then request shader compilation, then pop another error scope, is that guaranteed to be in order?
KN: I think it’s not necessary.
MM: seems this naturally solves the problem with buffer mapping. For popErrorScope and compilation info, the thing comes along with the resolved Promise.
KN: yes, and a bunch of other ones give you the thing you were waiting for.
CW: sounds like good way to decide what should have ordering and what shouldn’t.
AI: KN: type up the consensus into the issue.

Multisample Coverage on Metal #959 (Tomasz? Kai? Corentin?)

CW: in WebGPU there are 3 features that interact with sample mask of fragment being written:
- coverage coming from triangle
- sample mask on pipeline
- sample mask builtin set in the shader
- alpha to coverage enabled, controls sample mask based on how much alpha you have on first render target
CW: in WebGPU we say the sample mask output on pipeline, output from shader, and alpha to coverage are mutually exclusive. Restriction comes from Metal.
- DM: I don’t think we agreed on that. Thought we’d allow both, and do the work in the shader. Logic was: Metal defines what alpha to coverage is, and if user writes SV_Coverage it won’t induce much extra overhead.
- CW: that would make sense, but section 10.3.4 of the spec says “if that builtin’s not statically used by the shader, and if alpha-to-coverage is enabled, then shader output mask becomes the alpha-to-coverage output mask”. Sounds like alpha to coverage supersedes the shader builtin.
- DM: we shouldn’t supersede it. Either disallowed to be used together, or allow them to be used together. Disallowing at the start is simpler for us - less code to inject in the shaders.
- CW: disallowing seems fine. Allowing seems OK as well. alpha-to-coverage injected in shader might be different from what the driver does itself. Don’t know whether alpha-to-coverage needs to be invariant between pipelines, for example.
CW: everyone OK with disallowing alpha-to-coverage and samplemask to be used together?
- MM: sounds good.

PR burndown

Change rowsPerImage to count in blocks #958 (Dzmitry)
- CW: somewhat controversial.
- DM: we have 2 fields, the rowsPerImage and BytesPerRow. Different semantics for what is a row. One is row of blocks, one is row of texels. My intention is to fix this. Either consistent definitions of rows, or different namings to not confuse people.
- DM: think it’s fine to change semantics of rowsPerImage. People using it should be careful about what they’re doing.
- CW: I’m concerned about changing rowsPerImage to be in blocks rather than texels, because think it should be in the same units as origin, …, GPU descriptor size. Don’t want to move the whole API to talk in terms of blocks, would be a lot of overhead. Having rowsPerImage be consistent with other things that talk about image sizes.
- CF: I ran into this while implementing compressed textures for WGPU. Blocks for rowsPerImage - every other case, texture-to-texture copies, or texture side of texture-to-buffer copy, they deal with an opaque concept of pixels. Don’t know exact format, layout in memory, etc. But in this view, I’m copying blocks of texels from here to there. In texture side, can be all texels, even though it has to be block-aligned, even though you’re doing something more opaque than writing to a buffer.
- CW: In GpuTextureCopyView would there be a rowsPerImage in blocks but origin in texels? … wait, this is part of BufferCopyView.
- DM: different structure. This one doesn’t have origin.
- CW: ok. I have fewer concerns.
- DM: GpuTextureCopyView has origin - but it’s not the structure we’re talking about. Texture data layout.
- CW: I was confused. This means Gpu texture layout talks only in terms of blocks. Others talk about texels. More consistent. I withdraw my concern. What do others think?
- KN: I like the idea of it being in blocks from a structural standpoint since that’s what it is. I’m neutral on how it’s defined because people will deal with it either way.
- CW: OK, guess we resolve it at the editors’ meeting.
- DM: OK.
- CW: that’s the logical place to bikeshed the naming.
Others?
Precisely define depth24plus as "either depth24unorm or depth32float" #938
- KN: think we agreed a while ago to change the definition of this to be literally either of these. All this does is change the spec to say this, and expands the notes to say what the differences are, practically.
- DM: sounds good.
- MM: sounds fine.
- CW: since everyone’s fine with it we can just merge it.
MM: looking back at notes about subgroup operations. Did we resolve to accept the PR and merge it in? Is there more to do?
- CW: for the API side that’s it, there’s nothing to do.
- CW: because it’s a shading language PR, and it needs 2 stamps, we need a WGSL-side stamp.
- MM: OK. I can comment about that on the issue.

Agenda for next meeting

DM: the PR I made about the texture binding. #970
KN: maybe: Set negative GPURenderPassColorAttachmentDescriptor.loadValue for unsigned integer formats #965
CW: guess right now topics come when you’re implementing a feature.
KN: New test development process, and invitation to contribute
- KN: I’m planning to send information about the test planning and development process. I’ll probably talk about that at the beginning of next week’s meeting.
CW: since there are few topics for next week’s meetings, and because some people are off, maybe we should just cancel it and gather topics for 2 weeks out?
MM: +1
KN: fine with me.
CW: OK, canceling next week’s meeting, will call for agenda items for 2 weeks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU Web 2020 08 03

Tentative agenda

Attendance

Administrivia

Subgroup operations #954 (Oguz?)

Support Rgb9e5 packed float texture format #957 (Connor? Corentin?)

Ordering of promise resolution. #962 (Corentin?)

Multisample Coverage on Metal #959 (Tomasz? Kai? Corentin?)

PR burndown

Agenda for next meeting

Clone this wiki locally