GPU Web 2020 07 27

Chair: Corentin

Scribe: Ken

Location: Google Meet

Tentative agenda

Call for participation in WebGPU NAPI (Node.js etc.) discussions
Mapping defaults (just a PSA?) (Kai) #951
Multisampling (Dzmitry) #932
Subgroup operations (Oguz?) #954
PR burndown
- Make BindGroupLayout parameters' default/undefined consistent #949
- Use out-of-band messaging for the mipmap and array layer count #945
- (Migrated from #489) Expose limits on the adapter #495
Agenda for next meeting

Attendance

Apple
- Dean Jackson
- Myles C. Maxfield
Google
- Austin Eng
- Brandon Jones
- Corentin Wallez
- Idan Raiter
- James Darpinian
- Kai Ninomiya
- Ken Russell
Kings Distributed Systems
- Daniel Desjardins
- Dominic Cerisano
- Hamada Gasmallah
- Wes Garland
Microsoft
- Rafael Cintron
Mozilla
- Dzmitry Malyshau
- Jeff Gilbert
Matijs Toonen
Mehmet Oguz Derin
Michael Shannon
Timo de Kort

Administrative

FD (email): formal call for review of the GPU for the Web Working Group Charter is still ongoing until the end of the week
KN: Call for participation: Let me know if you’re interested in participating in a discussion group or biweeklyish meetings for WebGPU in JS in native, for example in Node.js via N-API. Note this effort is outside W3C.

Mapping defaults (just a PSA?) (Kai) #951

KN: fixing a few issues with in-band messaging for sizes, which we didn’t intend to do
This removes the default of 0 for mapAsync, changes it to “undefined”, and remove the restriction that 0 can’t be passed to mapAsync. Now returns a 0-sized ArrayBuffer.
- MM: sounds good.

Multisampling (Dzmitry) #932

DM: we have the spec text for working with multisampled targets, but don’t spec what sample counts are available. We can choose the level of detail needed for the spec. In Vk, you ask for a mask, and then specify the particular format, MS count, ask what I can do with this image format. Very fine grained format query.
DM: first approx: MS of 4 is what we support now. It is supported everywhere.
DM: this PR goes one step farther. Here’s the mask of samples supported for all the render targets. Requires all the render targets support this. Also doesn’t expose all the device capabilities. In D3D12 some formats have restricted supports compared to others.
DM: it will match Metal. Will possibly allow on D3D12 to say all formats support MSAA 8X if our internal queries support it.
DM: compromise between full per-format query and the others.
MM: GPU size 64, are there some devices out there supporting 8 billion samples?
CW: we could make SwiftShader support that :)
DM: might be too much.
MM: I think it should be size 32.
DM: that’s fine. Could be size 8 if we want to.
CW: have some concerns there’s a real limitation in HW out there. e.g. 128-bit formats used with sample count 8, takes huge amount of space. Devices out there have limited space per pixel in the render pass. If we start adding MS formats we might run into limitations of overall number of bits per pixel. There’s such a limitation in Metal. Counting number of bits in render target stuff. We’re going to open a huge can of worms.
JG: we might have to already. Large formats, single-sampled, and you have a lot of attachments - that might be enough. For some of the updates to add more stuff to Metal, we started getting FRAMEBUFFER_UNSUPPORTED in strange places in OpenGL. Strange limitations in total bytes per pixel.
MM: right. CW is your thought that you don’t want to have more than 4x multisampling because the spec gets complex?
CW: let me re explain. Concerns that limitation in D3D12 that you can’t 8x 128-bit formats because it’s bigger than one of the device’s per-pixel limit. Not sure we can expose the multisampling capabilities in a simple mask like Dzmitry proposes.
JG: WebGL 2.0, we tell you what your max samples is globally, but can then query a list of which sample levels are supported for each given format. It’s not the same for all formats.
KR: I’d like to highlight that capability too. WebGL allows asking sample counts per format, and I don’t think we should forget that it’s needed there. WebGPU should expose some way to find out the supported counts for a format.
JG: for users that just want e.g. 4x MSAA, we can give you “some amount” of multisampling based on what you requested. Thinking, could we do it in a way we auto-figure out which number of samples we’re using without having people check explicitly whether 4 or 8 is supported?
CW: wont’ there be issues in apps that do fancy things with gl_SampleMask?
JG: yes, right, never mind - that’s what Dzmitry brought up.
MS : might affect OIT algorithms too.
MM: what if there’s a way to ask the system, give me some multisamples, and the system figures that out? And also a way to ask for a specific number of samples?
KR: I’d be in favor of adding a per-format query so you know exactly what is supported.
JG: Should be easy to polyfill “I want all of the multisampling” by asking for the list and then using the highest one.
DM: Jeff’s suggestion is reasonable. If you’re never going to construct a view that you bind in a bind group, then the sample count doesn’t actually matter so much.
JG: I worry about two different classes. Unsized sample counts & sized sample counts. Can I bind these to different parts of the shader pipeline? Use sample masks? Inability to hide this from the shader level makes this less attractive to me.
MM: instead of 2 different classes (known / unknown sample counts) can we have a different constructor which returns a different value set, and you can query it?
CW: probably needs to be async.
JG: We could do that, but fundamentally that’s just the inverse of asking for how many samples are allowed, and calling max on the list. Would require no other changes on our part, assuming we’re surfacing the full list of samples supported/
KR: If you do that, and the call of support is implicit and synchronous… we had a silly bug in Chrome where if you called MultisampledRenderBuffer… it would be a blocking synchronous call which slows start up.
JG: Well because this one is not state based and is adapter/device based, it could be forwarded to the client on initialization and done fully in the client synchronously.
CW: Right now we have limits on the adapter. You’re not going to want to opt into per-format sample counts. How about: adapter gives you what sample counts it supports. You have to opt into the additional sample counts, for all the textures that support them, and the adapter also tells you what formats the counts are for. .. and then you still need to check per texture.
JG: Feels like diminishing returns in giving a tool that’s useful in a somewhat narrow setting. We don’t expect a lot of cases where the max count is not 8. Opting into a lower value doesn’t seem useful given that the main porting hazard will be some devices not supporting it for some formats. Would work better for a tiered system, but it seems like too much data to put all into limits.
MM: Haven’t done work looking into tiered limits yet.
JG: Similar in WebGL world.
DD: Similar for our GPU compute systems. We have to try to bucket our WebGL stuff into tiers. Too many combinations for our developers to handle. Encourage any kind of introspection, but a bucketed solution seems like a more useful design.
CW: Okay, how do we move forward then? It would be nice to have feature levels, but we don’t have them and lack data.
JG: Think we don’t need the perfect solution right now. We just need the tools to iterate for now, and expose something to query the sample counts for a format.
MM: Didn’t we just say tiers were a good idea?
CW: Yea, but we’d like to think about the # of devices or time spent using WebGPU, and we don’t have that right now.
JG: There’s also another aspect: the query and having tiers are not mutually exclusive and actually feel complementary. Tiers/bits-based it becomes hard to determine ex. how much memory I can allocate. Even if we have tiers, we should also be able to reflect what the tiers mean in code. That way it’s not only documented somewhere.
CW: in a sense, a tier’s the same as, at device creation, going into limits/extension thing and doing a min/max in a bunch of places.
MM: this is one more place where device geometry can be exposed to fingerprinters.
JG: with tiers a requirement, doesn’t exclude querying the limits of the tiers.
CW: Also don’t want to prevent UAs from doing whatever coalescing they want. WebGPU will have tiers, but a browser may insert tiers in the middle or coalesce them.
JG: that’s less orthogonal to this question. If we choose how many samples these formats can have based on a tier system, but have tiers be UA defined, there’s not really a way for user apps to opt into one of the lower tiers that’s also supported. Except if there are UA-specific queries. Sounds like a limits object with a map of sample lists that we do expect people to copy - which is probably OK.
MM: how’s an app supposed to pick the right tier if they’ve never seen that tier before?
JG: reflection.
CW: I wasn’t dismissing the idea of tiers defined by this group. Also the app can know that they can check if there’s sample count 32 because one vendor told us we have it. A side-tier that can be queried off the GpuAdapter. …
DM: Started talking about tiers in the wrong context. They’re important but not need for MSAA. Corentin raised a point that wasn’t investigated. Limitation of a render pass, a single target, etc.? Let’s do this first and then see. Maybe it’s a limit like maximumBitsPerTarget. Much simpler than having a per-format query and more portable too.
MM: +1.
CW: you’d be happy to say, we don’t support sample count 8 on these devices that don’t have support for every single format?
DM: No it’s the opposite. We would say it is supported, but we would say there’s a maximum number of bytesPerTexel
JG: we can only do that if it’s compatible with all of the restrictions of the APIs. If some APIs like OpenGL can give you “whatever” for the list of supported samples - even if they can support 2/4x MSAA 128-bit targets, it might not support 8x for that target. Need to make sure we match our actual restrictions for our underlying APIs with possibly some over-restriction if it seems useful.
MM: idea is we should investigate what those restrictions are so we can make a good design.
JG: we know some of the restrictions already.
CW: Jeff and I have a concern that this is hardware-dependent and hard to know.
MS: doing it by bits won’t work for e.g. Depth24Plus. No defined size, won’t know what that maps to. You won’t know the bits.
JG: it’s useful even if we don’t go that path to get this info. We’ll have to reject these invalid draw targets even if we can’t do this simplification. If we want to block this multisampling support on this investigation we can. Think we know what we can expose in the API based on our understanding today. This investigation could be viewed as an iteration step.
DM: if we merge the PR first, follow up with an investigation, and add one more limit later, it’ll change only what the sample mask limit of the impl is. We can land it separately later. I’m concerned about per-format queries. Opens a can of worms. Then people will ask, why don’t you tell us about render target usage? Storage texture usage? API will become less portable.

Subgroup operations (Oguz?) #954

MD: host side (exposure of this support as an extension on adapter query time) and shader side
MD: these ops have been in HW for a long time now. Getting more and more supported.
MD: when doing GPU compute they’re useful for global reduction, or reductions taking into account localities.
MD: from host side: exposes subgroups as one single extension instead of banding it to different HW. Case for this in WebGPU: if we have multiple extensions it’s out-of-context wrt the rest of the API. HW can be split into two: permutation support and reduction support. All permutation hardware is getting reduction support. This is HW that doesn’t make much use of SIMD group operations. Excludes quad operations, compute-only, but on host API, have to reject given shader sources. Another q: where will we expose the subgroup size? If on compute pipeline obj, easy to implement for all OSs. Most restricting case, exists on the compute pipeline. In Metal might be good reasons for this - maybe they’ve experimented with HW that dispatches kernels in different bands.
CW: maybe Metal has good reason to expose subgroup size on per-pipeline basis rather than per-device.
MM: can’t comment on that.
CW: in Vk, there’s no guarantee that all compute pipelines have same subgroup size. Just a constant in the shader. Intel HW can do 8, 16, or 32 depending on stuff. In Vk they came up with an ext to say “I want you to do 16 and only 16”. In WebGPU I think we should go with a fixed subgroup size per device.
JG: I’d assume you’d want a dynamic size for performance reasons.
CW: the subgroup group at NVIDIA - you really want to know the exact sizes of things, how many invocations are in your subgroup. They could not do anything on Intel until they had this control.
MM: that control doesn’t exist in Metal though. Nothing on the device that tells you what the size is.
JG: feels like “as simple as you can do but no simpler” exploration.
MM: I think Oguz did that exploration. This proposal is as simple as it can get.
MD: yes. Also on DX side it’ll be minimum rather than countMinimum. Good to have it dynamic.
CW: I have concerns that this might be really difficult to work with. On platforms we care about it’s probably OK if we don’t have the control we want you don’t get subgroup support.
MM: saying Metal doesn’t have subgroup extension is thumbs down.
CW: on Metal, we can make per-GPU assumptions. NVIDIA = X, AMD = Y, Apple GPU = Z. Easier to have hardcoded values on Apple platforms.
MM: I can’t say anything specific but on Metal this is on the compute pipeline as an intentional choice
MD: also in Metal there’s nothing hardcoded about subgroup size, only band of subgroup ops supported.
CW: OK, so pipeline creation has to be async so we can get this data back. createReadyPipeline is OK but not the other one.
MM: I agree.
CW: language design: I’ve experienced the pain of the team next door - we should figure out how to make this workable while editing WGSL shaders.
DM: is the query that Metal has compatible with Intel’s behavior?
MM: Intel’s behavior of having some compute pipelines have different answers?
DM: since the HW doesn’t have a fixed subgroup size. Can we impl on Intel HW on non-Metal? On Vk, can we get the behavior behind the Metal impl on Intel?
CW: if you get the extension that lets you limit the subgroup size, yes. Shipping in drivers for 1.5 years.
MM: you’re asking, is there a way for the Vk impl to tell the developer what it thinks the width should be?
CW: the extension lets the Vk driver tell you what width it picked. If you force it to e.g. 16, …
MM: so with this extension the proposal Oguz made would work great. ?
CW: it would be implementable, yes.
MM: and without the extension you’d just return the same value everywhere?
CW: without the extension you can’t know what the true width is.
MM: thought it was part of the physical adapter?
CW: no, you pass in add’l data, please give me this subgroup size etc.
MM: I mean without the extension. Doesn’t device tell you what the width is?
CW: no. It gives you min/max.
MM: surprising.
CW: might have to re-research. Any place we can impl the thing where adapter says “I’m 16 always”, we can also implement this with the async query. Still have concerns about usability. Could ask for a couple of people to ask why it’s a usability concern.
CW: so, for this extension, next step for host side: better investigation of what we can do in Vk.
MM: are you going to take that AI?
CW: no, sorry. Think Intel Shanghai had an investigation already. We could ask them to put it up.
CW: what other next steps?
MD: in either case think imp’t step is to reject things if hardware doesn’t support subgroups. Can reject at pipeline creation time. But pipeline creation errors are more likely validation errors. More strict rejection is at shader module creation time. Clarification on this would be the next good step.
MM: request: think this can be done at shader module creation time. Entry points are annotated. Can transitively follow each function call - if there’s a func call to one of these ops, you can trace every entry point that could hit it, and so whether it’s valid. Also the extension is at the … level. Shader module creation comes from the device. Think you can do that validation too. So think all validation can be done at shader module creation time.
MD: yes, think that’s correct. Shader module creation could be the best place. Any more thought could help.
CW: sounds good.
MM: who’s going to do the AI? Don’t want this to stall.
CW: I’ll reach out to Intel Shanghai. See if they can post it. I’ll take the AI to give the AI to someone else.

PR burndown

Make BindGroupLayout parameters' default/undefined consistent #949
Use out-of-band messaging for the mipmap and array layer count #945
(Migrated from #489) Expose limits on the adapter #495

Agenda for next meeting

More subgroups?
Do we carry over both subgroups and multisample?
- DM: subgroups, yes, b/c of the deadline. MS needs an investigation before we can make progress.
MM: all 3 APIs now have raytracing. Wanted to take group’s temperature before doing a big investigation. Think there’s a set of common functionality.
- CW: think it might be a bit early. Felix Mayer also did a big investigation because he implemented WebGPU raytracing on D3D12 and Vk. Worried about adding big extensions before we release WebGPU.
- KN: don’t think it’s a bad idea to do the investigation, but think we shouldn’t prioritize it. Lot of other things to do.
CW: feature levels more?
CW: please go through the issues you opened, see if they still need discussion, put Needs-Discussion label on Github if so.
CW: feels like we’re getting to the boring part of the standard where we need to figure out the details, it’s good, but think we need to talk about those smaller issues.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU Web 2020 07 27

Tentative agenda

Attendance

Administrative

Mapping defaults (just a PSA?) (Kai) #951

Multisampling (Dzmitry) #932

Subgroup operations (Oguz?) #954

PR burndown

Agenda for next meeting

Clone this wiki locally