
Minutes 2019 05 15


GPU Web 2019-05-15 Mountain View F2F Day 1

Chair: Corentin

Scribe: Ken

Location: Google HQ in Mountain View, California

TL;DR

  • Status updates:
    • Apple: API mostly implemented with MSL, subset of WHLSL implemented.
    • Google: Most of API implemented and wired on Mac. Showed some demos.
    • Microsoft: Helping with Dawn / Chromium
    • Mozilla: Good progress on wgpu-native, looking at Gecko integration.
    • Intel presented a recap of their efforts
    • Early adopter feedback:
      • Various missing features (storage textures, polygon fill mode, etc.)
      • Users coming from GL need explainers of synchronization and timelines; otherwise the API is nicer than GL. Some operations are more complicated than GL (loading a sprite).
      • Babylon.js: mainly that loading textures is hard because Chromium doesn’t load from ImageBitmap yet. ShaderModule reflection would be nice.
      • TF.js: will need to know GPU architecture to choose correct shader.
  • Organizational stuff
    • WG formation
      • Still agreement that the WG should just stamp drafts from the CG.
      • Michael Smith detailed the process of creating a WG at W3C.
      • Need to think about the WG stamped extensions as well.
    • Snapshot should focus on API, not features
    • CTS
      • CTS in a separate repo.
      • Google suggests porting the Dawn tests as baseline testing, but there is concern that Dawn shouldn’t be the benchmark.
    • Spec writing and test plan
      • Spec editor appointed by chairs, but they will do a formal call. This is an editor role to wrangle contributions and structure the spec.
      • Suggestion to develop test plan at the same time as the spec
    • Website, logo, …
      • Preference for webgpu.dev over webgpu.io
      • Discussion around choice of community forum. Agreement that the forum isn’t a place for spec development.
  • WebGPU Compat
    • Reach of WebGPU is smaller than initially thought, in part because of slow adoption of Vulkan on Android.
    • Discussion of mechanisms by which WebGPU could also reach OpenGL ES 3.1 and D3D11
    • No agreement reached on whether / how this WebGPU compat should happen.
  • Command buffer reusability
    • Discussion of Hans-Kristian’s blog post on frame contexts; doesn’t seem a great fit for WebGPU.
    • Add GPUEncoderDescriptor.reusable #294
      • Would help engines bake part of their scenegraph.
      • Concern that this would remove the motivation to optimize the JS / web-engine boundary and act as a crutch.
      • Revisit this with a benchmark.
    • GPURenderBundle #301
      • General agreement that WebGPU should have that feature.
      • Discussion of which commands should be allowed in bundles.
      • Discussion on the cost of creating Argument buffers in Metal and optimizations
  • Bind group mutability
    • Discussion of which timeline they mutate on, and how it could be implemented.
    • Deferred post-MVP.
  • Fingerprinting and exposing GPU geometry
    • Safari will eventually mask the GL_RENDERER for WebGPU.
    • Feedback from WebGL developers was that the renderer string is crucial.
    • Could only expose the renderer string in some trusted context with permission transfer.
  • Essential compute geometry
    • Need to expose shared memory size and warp size, for example.
    • Other issue is that maximum local size can depend on the compute shader and not all APIs expose the information.
    • Needs investigation.
  • RGBX / BGRX formats
    • Could save one copy when compositing WebGPU content on some OSes
    • Postponed until after MVP

Action Items:

  • Everyone:
    • Contact your legal team and confirm you could join the WebGPU WG.
  • Dean:
    • Update the WG charter
  • Dean / Corentin:
    • Call for comments on the WG charter
    • Make a call for chairs of the WG
    • Call for spec editor
  • Corentin:
    • More detailed investigation on WebGPU-Compat.

Tentative agenda

  • Status updates
  • Feedback from early (native) adopters (Dzmitry)
  • Feedback from early (Web) adopters (Corentin)
  • Organizational stuff
    • WG formation (Dean)
    • Snapshot + CTS + spec writing (Corentin)
    • Website + logo + community forums (Corentin)
  • WebGPU-compat (Corentin)
    • Extending WebGPU to D3D11 and OpenGL ES
    • texture / sampler split or combination
  • Command buffer reusability and frame contexts (Dzmitry)
    • Including performance numbers already gathered
  • Case for mutability of bind groups (Dzmitry)
  • Fingerprinting and exposing some GPU geometry (Corentin)
  • BGRX / RGBX formats (Corentin)

Attendance

  • Apple
    • Dean Jackson
    • Justin Fan
    • Myles C. Maxfield
    • Robin Morisset
  • Google
    • Austin Eng
    • Corentin Wallez
    • David Neto
    • Kai Ninomiya
    • Ken Russell
    • Shrek Shao
    • Ryan Harrison
    • Victor Miura
    • Idan Raiter
  • Intel
    • Brandon Jones
    • Yunchao He
  • Microsoft
    • Chas Boyd
    • Rafael Cintron
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
  • Joshua Groves
  • Michael Smith
  • Timo de Kort

Status updates

Apple:

  • DJ: JF implemented a large amount of the API. Implementation can accept MSL and a small subset of WHLSL. Goal to make the WHLSL compiler good enough to do useful things.
  • DJ: Don’t want people to try out with MSL.
  • DJ: Not planning on showing demos while they use MSL.
  • DJ: No error handling yet and some missing features.

Google:

  • CW: Most of the API is implemented in Blink and wired up on Mac to Dawn+Metal.
  • CW: Missing: error handling (error scope), just print to console
    • DJ: Why not Windows?
    • CW: Easier to do system integration on Mac with IOSurface since we already use it extensively in Chrome. Plus Metal is easier with implicit synchronization. RC et al. to look at Windows.
  • CW: Missing: debug stuff, some buffer mapping, usage tracking per subresource, other random things
  • CW: Demos!
  • CW: Babylon.js - WebGPU port. Mostly done by Babylon.js folks with some help over chat. Speedup right now comes from reusable command buffers because they’re bottlenecked in JavaScript. WebGL 12 FPS, WebGPU 40 FPS. Bottlenecked on setting the pipeline.
    • DJ: how performant is shaderc / wasm? Basically free?
    • CW: not been a visible bottleneck.
    • KN: it’s pretty fast.
    • CW: probably more expensive to compile shaderc.wasm
    • MM: if they’re using 10 shaders why are they setting pipeline state millions of times per second?
    • KN: like the Vulkan Gnomes demo, where the objects looked the same but they were actually different.
    • MM: so this is not what a real app would do.
    • CW: correct, synthetic benchmark.
  • DM: it’s exposing something to clients that wasn’t available before.
    • JG / CW: discussion about yesterday’s WebGL F2F chat about potentially recording display lists. Also multi-draw.
  • CW: TF.js. Ported parts to WebGPU with compute shaders. Works. We weren’t able to benchmark gains in the browser because of bottlenecks somewhere else. Shaders should be 70% faster than current ones due to use of shared memory.
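
For reference, the error-scope mechanism CW lists as missing is the proposed WebGPU error-handling design. A minimal sketch, using the names the feature later shipped under:

```typescript
// Minimal error-scope sketch (names as they later shipped; at the time of
// this meeting the design was still a proposal).
async function checkedCreateBuffer(device: GPUDevice): Promise<GPUBuffer> {
  device.pushErrorScope("validation");
  const buffer = device.createBuffer({
    size: 256,
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
  });
  const error = await device.popErrorScope(); // first captured error, or null
  if (error !== null) {
    console.warn(`createBuffer failed validation: ${error.message}`);
  }
  return buffer;
}
```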

Microsoft:

  • RC: Have one fulltime dev on Dawn. Working on lazy-resource initialization.
  • RC: Intel developers: Bryan working on memory allocators, Brandon working on render passes and hooking up PIX in Dawn.
  • RC: Natasha implemented new API in PIX to be able to use multiple versions of the DLL.

Mozilla:

  • DM: good progress on native implementation (wgpu-native). Supports a similar feature set to Dawn now. All of the API that’s in the sketch is supported. Texture sub-resources not tracked properly. Debug labels not there. Error handling is basic. Work of integrating it into Gecko is in early stages. Can show native samples that we have.
    • DM: cube, shadow, road (raytraced terrain with polygonal models on top)
    • Working on D3D11, 12, Vulkan, Metal on macOS / iOS

Intel:

Early adopter pain points

  • DM: feedback from me and 3 users:
    • Missing storage textures. Why don’t we have them yet? Did they disappear?
      • CW: no reason to not have them. We need them.
      • MM: agreed.
      • CW: question of naming.
      • DM: may need to define set of formats that are compatible.
      • CW: should definitely be in the API.
    • Depth formats are very limited right now. 32-bit float, and one with stencil. Might want non-high-precision depth with stencil. Extending to 64 bit is wasteful. Emulate D24S8 on platforms that don’t have it natively?
      • MM: Metal doesn’t have it on some platforms.
      • CW: That one’s nice because it packs well in memory. On Vulkan there are some issues where if you clear just depth or stencil there are memory readbacks and it’s slow.
      • KR: ran into this in WebGL multiple times.
      • DM: just expressing concern over this. No concrete proposal.
      • MM: problem is that they use too much memory. Don’t want to emulate with something that costs memory.
      • RC: could be a good extension.
      • CW: if we can emulate it and it’s only imperceptibly slower on some platforms that would be good.
      • DM: it’s on Metal on macOS. Just not iOS.
    • Texel buffer views. Filed pull request to see if we can add them. Should be a separate topic on the agenda.
      • MM: why are they asking for this?
      • DM: accessing buffer contents as texture from shader, so shader can work without knowing whether it comes from buffer or texture.
      • MM: convenience so they don’t have to write new shaders? Or do they get perf from fixed function units?
      • DM: if you do bilinear sampling I assume you get higher perf.
      • DM: texel buffer views are in all the APIs and have been there a while. People expect they’re available.
    • Polygon fill modes. People want wireframe.
      • CW: that’s not supported on all Vulkan.
      • DM: clearly something that people use. May think about what platforms don’t support it, and can we emulate it (at least for lines, maybe not points).
      • KR: what about wide lines? (trolling, sorry)
    • Vertex formats are limited. No unsigned bytes or unsigned shorts. 3-component byte and short formats are also missing.
      • KN: we took them out because of D3D. Intel investigated.
      • CW: there’s no universal uchar, etc.
      • DM: it’s fairly easy for people to work around. If Dawn clients don’t complain about this it’s not a big deal.
      • CW: might be a way to emulate this on platforms that don’t support it. Could be tricky.
    • Some users coming from GL background. Confused, what can they store and reuse between draw calls? What is the synchronization? Need better understanding of GPU timeline, synchronization, etc. How would mapping buffers work?
      • CW: Francois Beaufort wants to help write articles, and maybe the spec.
      • DM: one common q. I have a vertex buffer, how do I update it?
      • DJ: it can be written as a wrapper pretty easily I hear.
      • DM: creating a sprite is really complicated. ~8 steps.
      • JG: right, but creating 10,000 of them with different properties is only 10 steps.
      • CW: it’s feedback.
      • MM: we’ve heard similar feedback.
      • KN: we’re adding copy ImageBitmap to texture. Would eliminate ~3 of the steps.
    • API feels very nice in multithreaded environment. A bit like OpenGL, this time done right. Nice to end on a positive note.
      • CW: are your customers running wgpu-rs in a multithreaded environment?
      • DM: yes, and it works.
  • Babylon.js
    • CW: main feedback is loading textures is hard.
      1. Chrome doesn’t implement the new copy from ImageBitmap proposal
      2. rowPitch 256 constraint
      3. lack of generateMipmap
    • In the end, they ended up spinning up a WebGL context, upload texture, generateMipmap, and then readPixels.
      • Could make life easier by providing sample utility function for generating mipmaps.
      • DM: you were previously wondering whether we really need that large rowPitch alignment.
      • RC: they complained about the 256-byte alignment to me too. Working around it in JS involved a lot of copies and was painful.
      • DM: isn’t it covered by the ImageBitmap API?
      • RC: yes, but if you do it directly from JS arrays, it’s difficult.
      • MM: yet another convenience we could / should provide to developers.
      • CW: you can do it in compute shaders, but we should discuss this.
      • KN: I’d like to see the feedback after we have ImageBitmap uploading.
      • RC: agree.
      • CW: I’d also like to provide a compute shader generateMipmap.
      • JG: generateMipmap is historically poor because lots of things aren’t mandated. Scaling algorithm often not gamma correct. Usually like “I don’t care that much” but it’s explicitly a convenience function and not the best version of what it should be.
      • KN: we could make a good version and they can also modify it.
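
For reference, the ImageBitmap upload path discussed above, written with the names that eventually shipped (the proposal of the time called it copyImageBitmapToTexture, so treat the details as post-hoc):

```typescript
// Sketch of uploading an ImageBitmap to a texture; this is the later
// copyExternalImageToTexture shape, standing in for the 2019 proposal.
async function uploadImage(device: GPUDevice, url: string): Promise<GPUTexture> {
  const bitmap = await createImageBitmap(await (await fetch(url)).blob());
  const texture = device.createTexture({
    size: [bitmap.width, bitmap.height],
    format: "rgba8unorm",
    // RENDER_ATTACHMENT is required by the external-image copy path.
    usage: GPUTextureUsage.COPY_DST |
           GPUTextureUsage.TEXTURE_BINDING |
           GPUTextureUsage.RENDER_ATTACHMENT,
  });
  device.queue.copyExternalImageToTexture(
    { source: bitmap },
    { texture },
    [bitmap.width, bitmap.height]);
  return texture;
}
```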
    • lack of combined texture and samplers
      • CW: used to this in GL. When modifying GLSL code for WebGPU they took every single texture in GLSL, split it into a texture / sampler pair, and #define’d the previous name to sampler2D(texture, sampler) (see the sketch below).
      • DM: I think it’s a bad design if we do it. It is convenient, but it puts them on the wrong path. It puts them on the wrong path for performance.
      • CW: I agree but it’s the feedback we received.
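
A sketch of the #define trick CW describes, for Vulkan-style GLSL; the helper and identifier names are made up for illustration:

```typescript
// Hypothetical helper showing the Babylon.js rewrite: each combined GLSL
// sampler2D becomes a separate texture and sampler binding, and a #define
// recombines them at use sites so shader bodies stay unchanged.
function splitCombinedSampler(name: string, set: number, binding: number): string {
  return `
layout(set = ${set}, binding = ${binding})     uniform texture2D ${name}_tex;
layout(set = ${set}, binding = ${binding + 1}) uniform sampler   ${name}_smp;
#define ${name} sampler2D(${name}_tex, ${name}_smp)
`;
}
```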
    • (need a requirement that ImageBitmap support flipY, colorSpaceConversion, and premultiplyAlpha; see the sketch below this item)
      • CW: if we do an ImageBitmap thing, we have to make sure ImageBitmap creation can do all of these.
      • KR: they’re in the spec, we’re working on feature detecting them, but we really need all browsers to support these options.
      • JG: by the time we ship WebGPU we’ll support these.
      • DJ: Safari will support these.
      • MM: color spaces is just a small handful of enums?
        • KR: yes. It’s either that it does the browser’s native colorspace conversion or it completely ignores it and provides the raw pixel data. (Used for normal maps, gloss maps, etc.)
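
The three ImageBitmapOptions in question are already in the HTML spec; the ask above is that every browser honor them. A minimal sketch:

```typescript
// These option names are in the HTML spec today; the discussion above is
// about all browsers actually implementing them before WebGPU ships.
async function decodeForUpload(blob: Blob): Promise<ImageBitmap> {
  return createImageBitmap(blob, {
    imageOrientation: "flipY",     // GL-style bottom-left origin
    premultiplyAlpha: "none",      // keep raw alpha (normal maps, gloss maps)
    colorSpaceConversion: "none",  // raw pixel data, no color management
  });
}
```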
    • ShaderModule reflection
      • CW: in WebGL world, everything’s referenced by string name, so their whole engine is set up referring to things by name. Reflection would help them a lot.
      • MM: we’ve heard the same thing. Every modern OpenGL tutorial on the web recommends not looking things up by name, and providing explicit locations in the shader. Then you don’t have to make queries. Should we really go against all these best practices?
      • CW: it’s just convenient for people to do these things.
      • KR: one difference would be that this info would be on the browser side, not the driver side, so would be faster to query.
      • CW: SPIRV-Cross provides both the SPIR-V output and a JSON file containing reflection data. Maybe not best for the browser to ingest this.
      • MM: think we should ultimately offer it, but maybe warn against it. Maybe just take the OpenGL route.
  • TF.js:
    • KN: you need to select among several shaders. Knowing the hardware parameters doesn’t tell you how to select it. Need to know the HW architecture. Not to say we shouldn’t expose them, but knowing how to programmatically turn those numbers into something that runs fast on an architecture is hard.
    • CW: what optimizers want is to know what GPU architecture they run on. How shared memory banks are accessed, the way workgroups are numbered is transposed between AMD and NVIDIA, etc.
    • MM: we’re talking about more than threadgroups, etc. Isn’t this an antipattern? People write websites for Chrome, that’s not healthy.
    • KN: do you have a solution for it?
    • CB: this is a big distinction between typical web scenarios and where you need max performance. If you want to write portable code, don’t superoptimize it.
    • KN: right. It’ll still work.
    • CW: let’s bundle more of these discussions into “device limits”. When optimizing shaders we had to know we were optimizing for AMD for example.
    • KN: the boilerplate for doing compute is a lot less than for doing rendering.
  • Other early adopter
    • Persistently read-only mapped + use on GPU if read-only (deduplicate data on mobile?)
    • CW: worthy goal to deduplicate memory, but not sure how well it’ll work in practice, even on UMA architectures. Need different cache behavior on CPU. Someone talked about this, not sure how much time to invest in it. Think it’s not possible to do in a performant manner.
    • MM: also you didn’t mention the API complexity.
    • KN: if you have a buffer by definition read-only then maybe you could use it with MAP_READ as well as other things. Have been thinking about taking usages away after creation.
    • CW: just something that came up. Don’t have to worry about this for MVP.
    • JG: What if we had read-only or write-only ArrayBuffers?
    • KR: Experience from Java is that will be difficult to implement in a performant manner.
    • RM: Is there a different layout for the readonly array?
    • KR: The VM didn’t have the opportunity to optimize at the callsite if there were multiple implementations of the interface.

Organizational stuff

  • WG formation (Dean)
  • DJ: I drafted the WG charter. No comments that I’ve seen since San Diego in January. If all we need to do is draft a charter and then convince our reps to sponsor it, I think we’re ready. One thing: we in this room decided that we effectively want to work in a living standard way, so the CG does the work and publish its version of the spec, and the WG would occasionally make the formal standard version of it. This is the way some other groups in the W3C work.
    • MS: that’s the way the WebAssembly group has been operating for some time and it’s worked out well. No problems with how that’s been done. Existence proof already.
    • DJ: don’t remember now whether I did that at the time, whether I copied text from the wasm charter.
    • KN: I think you did.
    • CW: think all we need to do is include that in the charter.
    • DJ: I don’t know the next step. The AC reps that are present, encourage the creation of a new WG? What happens then? Who needs to decide whether we can propose formation of the group?
    • MS: internally to W3C, we need to get it to the management team for review. After that, the next step is to go to the AC. It’s time consuming because of process steps: generously, minimum 2 months, but realistically more time. A good outside goal is to have it done by TPAC; that seems realistic at this point. Not sure about the group’s plans for meeting at TPAC, but if you had a WG (or even not), you could get space to meet there.
      • Not a requirement, but feedback from others on the W3C leadership team is to make sure that the CG really wants a WG (nobody objecting to having a WG in parallel to the CG). E.g., no idea that the WG’s going to replace the work of the CG, so that you can keep both groups. Might want to consider a call for consensus about whether you want the work to proceed on creating a WG. If you do have agreement, and have that on record, I can point people to that later.
      • Then I can take the charter Dean’s produced and move it forward in the internal steps for W3C. When we get further along in approval, we need to have someone who’s agreed to be chair of the WG. Anybody who would commit to being a chair would have to get agreement from management / legal at their company, and that can take time. Don’t need to have that decided before we move the charter along, though.
      • If there are any issues around patents - anybody who believes or thinks there might be a possibility that their company has essential claims the spec might infringe on - better to know earlier rather than later. Once work moves into WG, any org that joins the WG automatically gives implementers of the spec a royalty-free license to patents related to the spec. If you don’t want to do that there’s a whole other process. As far as I’ve heard there aren’t any patent issues.
    • DJ: I think that other than double-checking the charter’s using the same terms as the wasm charter, all we need is consensus here, and Mike can then take it.
    • CW: additional homework is to contact our legal teams and see if we can join the WG. Also folks should look to propose a chair for the WG.
    • DJ: hopefully the chair of the WG doesn’t have much to do. Just move the finished document through the steps the W3C needs.
    • DJ: fwiw Apple’s always wanted to have a WG to publish the spec, as well as a CG, and would be happy to join the WG.
    • CW: I assume we would be happy to join too, just have to check with lawyers. Let’s agree on the charter and talk with lawyers to make sure no roadblocks, then once everybody is happy, let’s move forward.
    • RC: we can all ship stuff even if there’s no WG, right?
    • DJ: right. Just running the risk that might be stepping on essential claims from other CG member companies, until the WG publishes the spec.
    • CW: Khronos also has ARB_ extensions ratified for similar reasons. Might want WG to ratify the core spec and some extensions.
    • DJ: think they’d just be classified as new documents in the process. Possibly frustrating: the publication process produces a CR, and any comments on the Candidate Recommendation must be addressed. It then goes back to the CG for resolution. Want that feedback as early as possible. Want it to be a stamping process, not a discussion process. We should say in the charter that the WG isn’t going to come make big changes to the spec. The other consideration is how the WHATWG and W3C are interacting.
    • MS: that’s because of the history of acrimony and disagreement of the HTML work, but that’s not an issue in this group. Think this is simply the same setup as wasm, so can borrow from them. Also, wasm charter is the product of some lawyer review, so the language is probably correct already there. Don’t recommend getting super creative. Doesn’t need to be super detailed; don’t have to have everything anticipated.
    • CW: so, next steps:
      • Dean to update with any wasm charter chunks.
      • Everyone comments on it, make sure everyone’s happy with it.
      • KR: I think you want to email the mailing list and get documentation that everyone’s happy forming the WG.
      • MS: we don’t really have anything in place that says you have to have 100% agreement on moving forward with the WG.
      • DJ: if they’re a W3C member they have the option to vote no on the charter when it comes up for AC review.
      • MS: right, and it’ll take a while after it is announced. I’ll just wait to hear from you all. Then I can hustle on it. Should try to have a sense of urgency even though there’s no hard deadline, since the process wheels turn pretty slowly.
      • DJ: once we think the charter’s correct, we’ll give everyone a chance to comment while we talk to lawyers and AC reps. Then give it to Mike and send email to internal list saying here’s what we intend to do, and solicit any comments. Do we want to talk about meeting at TPAC? Geographically, would be difficult for most members.
      • CW: the W3C reached out and we said we wouldn’t meet.
      • DJ: I don’t think it’s worth the travel. Better to focus on what we’re doing right now.
    • Next F2F?
      • CW: after the summer?
      • KR: WebGL WG will meet in New Orleans.
      • CW: we could co-host something.
      • DJ: problem was no room to host.
      • CW: we will figure something out. Could book a meeting room in a hotel. Since most people are in WebGL WG it would be best to meet there.
      • KN: September 23-27
      • CW: let’s aim for New Orleans for the week of the F2F.
      • KR: want to encourage meeting during the same week as the Khronos F2F.
      • DN: we can also influence the meeting of the Khronos WGs to be best aligned for WebGL
    • DJ: also we wanted to set up a Khronos liaison?
      • KR: I think the ball was in Khronos’ court to nominate someone / some people to span the orgs.
      • DJ: so Neil has contacted the W3C? Discussed how to not have Khronos IP leak?
      • KR: not sure, we should reopen the conversation. Would be good to have HW folks at the table.
      • KN: W3C members: don’t see Qualcomm, ARM, AMD, NVIDIA. Just Intel.
  • Snapshot (Corentin)
    • CW: Kai wants to make more changes to the CTS harness? The harness was useful when starting the impl in Chrome. How do we feel about taking a snapshot and calling it MVP1, Alpha1, etc.? How many more changes to the interface itself?
      • DM: would like to add storage textures at min. Maybe texel buffers.
      • CW: some form of command buffer reuse / bundles.
      • MM: think bundles are interesting. Would like to discuss.
      • CW: we should decide whether we want it in the MVP or not.
      • DM: we want people to have a good way to move to WebGPU today. They’re not missing bundles.
      • CW: agree but bundles seem like an important feature.
      • RC: do we want to add all the buffer views to make uploads better?
      • CW: no, it’ll be very complicated, we should do it afterward. Will take weeks of discussions to get proper design.
      • KN: for the snapshot, changes are more important than features.
      • MM: does snapshot have ImageBitmap texture uploads?
      • KN: it should. We’ll merge.
      • JF: is the tuple notation for createBufferMapped valid IDL?
      • KN: no. I changed it to sequence<any>
      • DM: why is it not a dictionary?
      • KN: it could be. Went through the IDL and made it compile. It’s a typedef.
      • CW: think there are some cosmetic changes, making things not terrible IDL. There is the WebGPU snapshot tracker. Almost everything is green. We don’t need to solve everything at this F2F, but in the next couple weeks it would be great if all the changes were in that spreadsheet. Need to have a CTS that works on multiple implementations.
      • DM: I’ll work on storage textures and texel buffer views so we can decide if we want them.
      • CW: SGTM.
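
For reference, the createBufferMapped shape JF and KN are discussing, per the then-current sketch IDL (it was later replaced by mappedAtCreation):

```typescript
// createBufferMapped returned a [GPUBuffer, ArrayBuffer] pair, the tuple
// that prompted the "is this valid IDL?" question above. Sketch only; the
// method no longer exists, hence the loose typing.
function makeVertexBuffer(device: any /* 2019-era GPUDevice */): GPUBuffer {
  const [buffer, mapping] = device.createBufferMapped({
    size: 16,
    usage: GPUBufferUsage.VERTEX,
  });
  new Float32Array(mapping).set([0, 1, 2, 3]); // write while mapped
  buffer.unmap(); // contents become visible to the GPU
  return buffer;
}
```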
  • CTS (Corentin)
    • CW: We’re going to port all of our Dawn end2end_tests to JavaScript to start. Then can add more.
    • DJ: would suggest dumping things into wpt, though wpt is annoying because you have to check out the whole thing.
    • KN: I’d like it to be in a separate repo.
    • DJ: would prefer to have a separate repo so the issues are more clearly about the CTS vs. the spec.
    • KN: OK.
    • MM: assume the tests should be testing sentences or fragments of sentences in the spec?
    • CW: noting that at this time there are 0 tests and no spec. The Dawn tests provide baseline feature coverage. Should be non-controversial. Blend modes, etc. Really just trying to help other impls.
    • MM: don’t want to benchmark based on whether we’re compatible with Chrome.
    • CW: only want to have non-controversial tests.
    • KN: we take a bunch of end2end tests, put them in a folder in the CTS, disable controversial ones, and as we write spec text we update the test plan. Then we put tests in the actual CTS folder.
    • MM: suggest the spec and test suite should be developed, if not in lock step, then near the same time.
    • CW: Kai’s suggesting the spec and test plan should be developed simultaneously, so we don’t lose anything. Many sentences in the spec will interact with others in the spec.
    • MM: regardless of what shading language is picked there will be many tests for it. Want to propose they’re in the same harness.
    • CW: sounds fine. There are many SPIR-V tests, for example? Could take tests from the Vulkan CTS?
    • DN: recommend doing them new. They mostly come from GLSL and glslang.
    • CW: because the API calls for shading language tests are basically all the same, suggest we have them in a data format. Should handle ~99% of the shading language tests.
    • MM: so not duplicate the JS API calls. That’s fine. As long as it’s possible to write a shading language test outside that harness.
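
A sketch of the data-format idea CW floats: since the JS API boilerplate is identical across shading-language tests, each case could be pure data fed to one harness. All field names here are hypothetical:

```typescript
// Hypothetical data-driven shader test entry; one harness would run the
// same createShaderModule path for every case, covering ~99% of
// shading-language tests without bespoke JS.
interface ShaderLanguageCase {
  name: string;
  stage: "vertex" | "fragment" | "compute";
  source: string;               // shader text, in whichever language is chosen
  expectCompileError: boolean;  // should validation reject this module?
}

const cases: ShaderLanguageCase[] = [
  { name: "minimal_compute", stage: "compute", source: "/* shader text */", expectCompileError: false },
];
```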
    • CW: is there appetite for us to port the dawn end2end tests? It’s extra work. Would it help you all write implementations?
    • MM: probably valuable. My only concerns are procedural. There needs to be a process where these tests can be modified / disabled.
    • KN: there is - pull requests and agenda items. Shouldn’t be a problem.
    • CW: if anyone thinks a test is controversial the default can be to disable them.
    • RC: would like to submit the tests a few at a time.
    • CW: we just can’t wait for all impls to review these tests. Reviews have taken too long.
    • KN: think we should work on just getting the tests in, and then liberally land PRs to disable things.
    • CW: how many are we talking about?
    • MM: sounds pretty good - 200 tests are not that big.
    • CW: actually the validation tests are larger. 400-500 tests.
    • Discussion about putting the ported Dawn end2end tests in a separate folder.
    • VM: do you think you’ll be able to tell which tests you don’t like by reviewing them, or by running them and finding they don’t work and you don’t want to support them?
    • MM: there will also be tests we fail but we think are good tests.
    • CW: OK. We’ll start making PRs and see if the process is working. At first we’ll ask impls to vote on the tests. Let’s assume Edge and Chromium are the same impl.
    • MM: FYI, Apple wants to contribute, and intends to write tests.
    • CW: the Dawn tests are nice but not comprehensive.
  • Spec writing (Corentin and Kai)
    • CW: we should send an email and call for spec editors.
    • MM: they’re technically appointed by chairs.
    • CW: we’ll do a formal call for spec editors in the group. First job will be to segment out parts of the spec and assign them to people to write. Don’t expect spec editor will write everything by themselves.
    • JG: do we have clear idea of how spec editor differs from productive person in the WG? In WebGL WG, Dean and I are co-editors and we write things sometimes.
    • DJ: spec editor is the one doing the final merge into the spec, for example. Maybe last edits before integration.
    • MM: at least in CSS, editors will periodically reorganize / reword sentences, for example. More project management and editorial function.
    • DN: isn’t that true once you have an initial draft?
    • DJ: agree. We just need to get content in there that’s mostly right, and then the editor or the group could go through it.
    • CW: would like someone to manage the project of writing the spec. Need to section off parts and make sure someone writes them.
    • MM: mailing list conversation?
    • CW: yes. I’ll call for spec editors.
  • Test plan (Corentin)
    • CW: the idea is that the test plan is prose, describing how the feature will be tested. It will also have a check box: has this test been implemented?
    • JG: these would have to be medium-level?
    • CW: for example: “test that the buffer cannot be used while it’s mapped, for all different usages” (see the sketch below).
    • DM: sounds like a Github issue.
    • CW: it’s done with the test change, so you can review it along with it.
    • KN: and you can see the spec text that was added for the test plan addition.
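
A sketch of what CW’s example test-plan entry could look like as an actual test, written against later API names (mapAsync, error scopes) for concreteness; the harness around it is omitted:

```typescript
// "Test that the buffer cannot be used while it's mapped": one illustrative
// case out of the "all different usages" the plan entry calls for.
async function testMappedBufferIsUnusable(device: GPUDevice): Promise<void> {
  const src = device.createBuffer({
    size: 4,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
  });
  const dst = device.createBuffer({ size: 4, usage: GPUBufferUsage.COPY_DST });
  await src.mapAsync(GPUMapMode.WRITE); // src is now mapped

  device.pushErrorScope("validation");
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(src, 0, dst, 0, 4); // uses a mapped buffer
  device.queue.submit([encoder.finish()]);
  if ((await device.popErrorScope()) === null) {
    throw new Error("expected a validation error for the mapped buffer");
  }
}
```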
  • Website + logo + community forums (Corentin)
    • CW: I set up webgpu.io but that’s not conflict-free. We also have webgpu.dev (and webgpu.graphics).
    • DJ: webgpu.dev is fine.
    • Bikeshedding.
    • Agreement to use .dev. Will redirect others.
    • CW: need a logo. webgpu.glitch.me
    • Bikeshedding.
    • CW: if you feel strongly please provide something.
    • CW: need a community forum. Did we agree people are using Discourse?
    • JG: how is that different from Github issues?
    • DJ: this is people asking questions.
    • KN: I imagine this like webgl-dev-list.
    • KR: we used to have Khronos forums for WebGL, and they bit-rotted and became flamefests. Were ultimately shut down.
    • DM: the cool kids use Discourse. Comparing mailing lists to Discourse is like IRC to Slack.
    • There’s also a Reddit group.
    • DM: Reddit is a closed platform we don’t have control over. We can host Discourse ourselves.
    • JG: it’d be a big disadvantage to host it ourselves.
    • KN: we can pay them to do it.
    • DM: the forum doesn’t have to be the official place.
    • KN: think that’s fine.
    • JF: do you need a username to view Discourse posts?
    • DM: no.
    • RC: would be fine to link to it from w3.org, no?
    • DM: someone (WICG) clearly bought a Discourse instance. You can have a channel?
    • CW: OK, people can agree to set up an instance.
    • JG: just want to make sure we won’t be discussing the spec there.
    • CW: right. Just support forum, community discussions. No bug reports.
    • MM: will there be moderators?
    • DM: there will be a code of conduct and some moderation will probably be required.
    • JG: for general questions, stack overflow will be a natural area.
    • DM: any interest in a chat forum?
    • CW: IRC would be nice.
    • DM: Mozilla’s moving away from IRC.
    • VM: we adopted Slack for Chromium.
    • CW: we could do that and either pay for it or not.
    • MM: does Slack require invitations for people to join?
    • CW: no.
    • VM: though, service level requires you to pay if you want to retain history.

WebGPU compat

  • CW: D3D12 and Vulkan don’t cover the majority of the Windows / Android / etc. ecosystems.
  • CW: suggestion: we do WebGPU compat. Separate from main WebGPU.
  • CW: an additional set of restrictions on the main WebGPU spec that make WebGPU runnable on OpenGL ES 3.1 and D3D11, as much as possible, and could require emulation (potentially expensive) of certain features.
    • App can request webgpu compat adapter at adapter creation time.
    • Has additional validation.
    • If on Safari, that fails, and they can ask for regular WebGPU device.
    • If you ask for webgpu compat on a browser that supports it, you will get the restrictions, so you can use it to develop.
    • WebGPU compat code is valid WebGPU code. Your app also runs on real WebGPU, but with lower overhead if it supports it.
  • MM: why do you expect that such an adapter creation would fail on macOS?
    • CW: Safari doesn’t need to care about webgpu compat, because you can assume Metal is present.
    • MM: but if the website says give me webgpu-compat and fails ungracefully, it looks like Safari’s broken.
  • CW: if you ask for webgpu-compat in Chrome, you’ll always get the additional overhead. We expect devs to try WebGPU first, then WebGPU compat.
  • VM: as alternative, could say, I request WebGPU compat, and you get back a full WebGPU context maybe with a flag.
  • CW: could have two flags, one including “force WebGPU compat for testing”.
  • VM: Jeff brought this up, saying WebGPU compat “will be” WebGPU. Can the features above that feature level be exposed in a different way?
  • JG: if you do this then there’ll implicitly be a WebGPU1 and WebGPU2.
  • CW: we assume we’ll remove WebGPU compat.
  • DJ: you can never remove something from browsers.
  • JG: how much can you implement on top of WebGL 2?
  • CW: doesn’t have compute shaders.
  • KN: also storage buffers, etc.
  • DJ: these are systems that don’t support Vulkan? Compute shaders in OpenGL / D3D, but not Vulkan or D3D12?
  • CW: it’s a big portion. Also # of Windows 7 / 8 systems that have Vulkan and D3D12 is very low (compared to what we’d hoped).
  • RC: what about systems that have D3D11 and no compute shaders?
  • CW: bin them. We should guarantee we have compute shaders in WebGPU.
  • RC: any system shipped before 2009 will have D3D11 but no compute shaders.
  • VM: they’d be the same.
  • CW: except WebGPU compat has fewer workarounds.
  • DJ: less work for them to use WebGPU compat?
  • CW: using WebGPU compat is easier. ANGLE backend is 50K lines of code. Dawn backend for Vulkan is 6K lines.
  • RC: only concern: having to emulate things. What kinds of things?
  • CW: the list:
    • Most hardware has native support for separate textures and samplers, and combined textures and samplers can be more expensive. D3D11 and all modern APIs use separate texture and samplers. Forcing WebGPU core to use combined textures and samplers would make the API less flexible and potentially slower.
    • Same thing for texture views and in addition to losing flexibility it will require draw-time validation that texture views pointing to the same texture are all the same. It can be optimized with dirty bits but is about the level of complexity expected for the whole draw-time validation in WebGPU core.
    • Same thing as separate textures and samplers for cubemaps being views on 2D array textures.
  • JG: strict upper bound will be WebGL 2.0 activation.
  • VM: we see ~90% activation for WebGL 2.0 in Chrome.
  • MM: this adds stuff to the shading language. Has to understand what this combined texture + sampler object is.
  • Discussion about textures / samplers.
  • MM: we should not hold back the development of the non-compat mode.
  • DM: can you say, each texture must only be used with one sampler in the program?
  • CW: also has to be statically determinable.
  • MM: don’t know if you can do that without making a new type.
  • CW: there are a few compat problems but we can certainly add restrictions to the compat spec to make them compatible with D3D11, etc. Point is that this is independent of core WebGPU spec, and information only flows from WebGPU to WebGPU compat, not the other way around. I think we should first ship the MVP and then worry about WebGPU compat.
  • MM: worried about: if we make WebGPU compat that works on many more computers than “real” WebGPU, then WebGPU compat will be the “real” WebGPU and this CG will have to live with more limitations than we’d hoped for.
    • VM: that’s true. But the adoption curve for hardware supporting “real” WebGPU right now is too low. Devs only target something if it’s 90% penetration. There’s hardware shipped which doesn’t support this and the adoption curve for replacement is too slow.
    • CW: the intent would be to un-ship WebGPU compat later.
    • MM: I’m worried we won’t be able to delete it because too many apps will ask for the compat thing.
    • CW: we can eventually return the real WebGPU.
    • MM: if we start working on compat, upgradeability should be a requirement.
    • KN: think we can do that.
    • CW: it can be by design. Only adds validation rules. Doesn’t change the structure. If it passes WebGPU compat validation it’ll also pass WebGPU validation.
    • DJ: if this set of machines is really that large then maybe we’re targeting the wrong level of API.
  • CW: imagine we want texture views because everything except OpenGL has them. But these things are properties of the texture object. For WebGPU to be compatible, you say that for every texture, every view, draw calls must not use the same texture through multiple views, but validating this is expensive.
  • Discussion about performance characteristics and returning WebGPU core instead of compat.
  • DM: this is the portability concern.
  • CW: we’ll find some way to make it go through the Devtools.
  • CW: If an app requests WebGPU compat, we can return WebGPU core. But the app must be able to require compat for testing.
  • MM: does that need to be a standard?
  • CW: yes.
  • RM: could be a flag in the browser, like a command line flag. Would answer concern that people would lose this in production.
  • VM: you have to be able to request webgpu-compat in production. If you ask for WebGPU and your device only supports compat, have to return compat.
  • CW: Safari has to worry about webgpu compat insofar as supporting WebGPU, and returning WebGPU for webgpu-compat.
  • RM: clarifying?
      1. Give me full WebGPU and fail if not possible.
      2. Give me full WebGPU and if not possible give me compat.
      3. Test for the case where I got compat.
  • CW: exactly.
  • RM: the first 2 can be in the API, chosen by the developer. Have a flag in the browser for testing (3).
  • CW: yes. Perfect. Only thing that changes for core WebGPU: you say “compat: true” in the adapter creation settings (sketched below). Just don’t fail if that arg comes in.
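
A sketch of the request flow RM enumerates; the compat option is hypothetical, standing in for whatever knob the group would standardize:

```typescript
// Hypothetical request flow for WebGPU compat; "compat" is not a real
// adapter option, it is illustrative only.
const adapter =
  (await navigator.gpu.requestAdapter()) ??                       // 1. full WebGPU
  (await navigator.gpu.requestAdapter({ compat: true } as any));  // 2. fall back to compat
// 3. Per RM, forcing the compat restrictions on capable hardware for testing
//    would be a browser flag, not an API knob.
```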
  • DM: people don’t test with extra flags. What you’re proposing goes against that.
  • KN: for individual small things that’s true. But they’ll know that my app doesn’t work on 75% of my market. This is a big thing that seriously impacts portability of their app.
  • MM: thought Corentin’s list was 3 little things?
  • CW: yes, but impact on systems you can reach is big.
  • VM: Kai’s saying that you only need to be aware of 2 major modes.
  • DM: if everyone did compat validation would it be a better API?
  • CW: disagree because it’s not future looking.
  • KN: we can take out the validation. If it weren’t for the overhead we’d just have the validation and remove it later.
  • DM: unsure about the validation cost. Not sure to me that you have less overhead.
  • RC: sounds like you want to see how much validation cost there is.
  • CW: do you want combined textures and samplers in WebGPU in the long term?
  • ...
  • DJ: think more likely that people will develop on Safari, ask for WebGPU compat, app works, and they’ll ship it, because Safari didn’t implement the WebGPU compat validation.
  • What about phrasing this as an extension?
  • VM: if I don’t need separate samplers then maybe I don’t request that feature. If I run into a case where my code needs it, then I can turn it on.
  • DJ: and by doing that I cut out some of the market.
  • VM: yes.
  • DJ: that’s sort of what I was proposing.
  • RC: so all the things Corentin pointed out, we could turn into WebGPU extensions?
  • Discussion about whether we would enforce the compat restrictions right now, before MVP, and phrase them in terms of enableable extensions.
  • VM: everything has a lifespan and have to draw the line somewhere. Later there will be cool WebGPU features.
  • CW: and they’ll be extensions.
  • VM: if group cares about market reach then maybe we should design for it now.
  • DM: let me talk about background. Not the first time this has come up. There’s VkPortability. Answering several questions. We don’t know all of the limitations after ~1 year of investigation. Maybe will be similar for D3D11 / OpenGL. Many times, asked, why didn’t the Vulkan spec think about this?
  • VM: Vk folks added some features that prevented running on ARM’s Mali chips. Should have been able to veto. Was a mistake. A restricted Vulkan is likely the solution.
  • DM: I agree with Dean that if we think OpenGL ES on Android is the target audience, we should design our feature set appropriately.
  • DJ: it seems to me that we’re really designing WebGPU2, and we really want to release WebGPU1 for best portability. At Apple we don’t care that much because it’s not a significant portion. Would still be bad for the industry if people were asking for webgpu-compat, we gave them webgpu, and apps have to still test on other hardware.
  • RC: in D3D11 there’s also the concept of feature levels. Could ask for FL 10 and not get compute shaders. We’re signing up to do the same thing with this.
  • DJ: one solution: if you did require webgpu-compat, if only we could make the perf equally bad everywhere.
  • MM: not going to slow our browser down.
  • KR: are there ways to make the compat validation faster?
  • CW: many of these things are orthogonal. OpenGL ES is baroque though. We will add validation in a lot of places. This texture format not being renderable, for example. In WebGPU a lot of validation can be precomputed, but we’ll add a lot of code to check for these various OpenGL ES cases. Adding CPU overhead where it’s not needed.
  • CW: it’s a lot of work to go through OpenGL ES spec and figure out whether the things are compatible with WebGPU.
  • YH: comment about WebGPU support and WebGL 2.0 / ES 3.1.
  • CW: ultimately won’t need the compat validation. Was afraid that Safari would reject this because of unnecessary slowdowns.
  • DJ: it’s not just that, but people writing apps which are WebGPU compat, where they’d lose perf even if it’s available on the device. Think we should consider telling people to just use WebGL.
  • VM: we have a strange dichotomy where compute isn’t available.
  • MM: this is a short term problem.
  • VM: it’s a 5-year timeframe.
  • DJ: in that timeframe will they have WebGL 2.0 + compute?
  • CW: you’d have to write a WebGL 2.0 Compute and WebGPU backend both.
  • MM: reason you’d use WebGPU is for the additional perf.
  • VM: you’d still get perf because it’s a better API, prevalidated, and everywhere you have Metal / DX12 / Vulkan it’d be faster.
  • DJ: are we sure that someone writing an app to WebGPU in compat mode will be faster than a WebGL app?
  • VM: yes, assuming things like cmdbuf reuse.
  • CW: most likely yes. I don’t see why not. Maybe niche OpenGL features they were using. BlitFramebuffer and have to do it themselves.
  • DJ: where does that come from?
  • CW: prevalidation. Draw call validation in ANGLE is insane. Binding groups of resources together. Pipeline compilation. No state leakage between diff parts of your app. Can reuse cmdbufs. Map buffers directly. There are a huge number of advantages from using WebGPU on OpenGL ES 3.1.
  • DJ: compute shaders are important, but I was talking about WebGL 2.0 + compute.
  • VM: WebGL 2.0 + compute will be a niche feature too, because of availability.
  • DJ: to do it we’d need a Metal backend for WebGL.
  • DJ: we’re still targeting an important set of devices that support compute, so theoretically they’d need to support WebGL 2.0 + compute.
  • KN: every device you’d write to would either have the ability to do WebGL 2.0 + compute, or WebGPU compat.
  • CW: then nobody would write a WebGPU backend.
  • DJ: that’s one of the solutions. Not a great one.
  • CW: then WebGPU adoption suffers. If you want to write an app to run everywhere, you have to write WebGL 2.0 (assuming Safari ships it). Right now it’s WebGL 1.0. People will continue to write WebGL 1.0 for best adoption.
  • MM: procedurally we’d probably have to change the charter.
  • DJ: charter says we’re writing on top of Vulkan, DX12 and Metal.
  • CW: theoretically people could do WebGPU compat outside this group.
  • DJ: to make an informed decision we’d need to know the decisions we have to make. How much perf are we giving up? How many apps are using these features? I accept the argument that there’s a large number of devices that can’t run full WebGPU.
  • KR: I don’t think it’s worth worrying about Windows machines before 2009 that don’t have compute support. I think if we are realistic about the market we’re missing (“Android”)...
  • KR: Could we get ES 3.2 extensions that add the missing features for WebGPU?
  • VM: In practice these systems don’t update drivers.
  • VM: 3.1 is basically the bleeding edge...
  • KR: About 35% have 3.2.
| OpenGL ES version | Distribution |
| ----------------- | ------------ |
| 2.0               | 21.1%        |
| 3.0               | 29.8%        |
| 3.1               | 13.6%        |
| 3.2               | 35.5%        |

Source: https://developer.android.com/about/dashboards#OpenGL
  • VM: we don’t have Apple’s ecosystem. Takes time for drivers to penetrate into the market.
  • KR: have to think that it’d be possible to roll out OpenGL ES extensions.
  • CW: time to ship those extensions would be the same as to ship Vulkan.
  • DJ: problem is that Google has shipped extensions for Android before, but can’t get vendors to push them to older phones.
  • CW: can only force people to push it like AHardwareBuffer interop. Otherwise there’s no hope.
  • VM: you’re talking about many years. Could go through spec process at Khronos, vendors build it, and make it a requirement.
  • CW: WebGPU that targets Vk / D3D12, will have trouble shipping on older Windows systems (we don’t worry about that much), and the Android market, where Vulkan’s coming but isn’t there yet.
  • DJ: you’re still providing the same outcome in terms of API. You request one thing and subtract features, or you ask for the other and add features. Could have called it WebGPU 1 and 2, or 1 and 0. End result will be, there will be a version of WebGPU that does one thing, and another which does more and is faster.
  • VM: it really is more features. If you don’t need separate texture views it’ll run as fast as Metal or Vulkan.
  • RM: thought it would be slower because of extra validation.
  • VM: true but it’s not clear how much slower.
  • CW: that’s why I suggest moving WebGPU forward, and looking at supporting older systems later.
  • KN: NetMarketShare says Windows 7 is 44.6% of the Windows market (and decreasing).
  • VM: think it’s worth thinking about a few years out. Android ES 3.1 availability is on the rise.
  • CW: look at WebGL 2.0. Hard to get people to adopt it because it doesn’t run on Mac (iOS). People wouldn’t adopt WebGPU for the same reason.
  • DJ: solution is to get Apple to adopt WebGL 2.0. Problem here is that the hardware doesn’t run it. Have to reduce the API.
  • CW: but we don’t want to induce constraints on this CG.
  • RM: if we offer something that runs on everything, and say it’s important, it doesn’t matter what people call it, they’ll target it.
  • VM: idea is that you don’t change the API. Just impose restrictions. Later you can just ignore the restrictions.
  • MM: in that case I’m not sure we can do it without adding a new object texture+sampler, and we don’t want to do that.
  • DJ: I think there’s a question of whether we’re going to need API changes for this. Need to know the options.
  • VM: want to say too, combined sampler isn’t vastly difficult. It’s a simplification. The old option can be implemented on the new stuff.
  • MM: it’d just suck if we’re still dealing with those concepts 10 years from now.
  • VM: we’d still have to support it, but if the app never used separate samplers in the first place, would it matter?
  • MM: for maintenance. Want an API in 10 years that doesn’t look like OpenGL.
  • DM: separate image samplers were added in DX10.
  • DJ: I think we need to do the work, the only question is how we do the work.
  • CW: few ways:
    • Add restrictions to WebGPU core
    • We restrict / subset WebGPU for 2 feature levels
    • Are these the only two ways?
  • MM: this process could take 4 years and then it doesn’t matter anymore :D
  • DJ: we’ll end up at the same place no matter what the naming. Want to know more exactly what the restrictions are.
  • MM: Dawn has an OpenGL backend?
  • CW: it requires OpenGL 4.4 and isn’t conformant.
  • CW: I can run it on ES 3.1 and come back with additional required validation.
  • CW: I’ll:
    • Look into the constraints
    • See what validation has to be added to WebGPU core to make those disappear
    • See how the WebGPU API design would be affected if we did this as the core.

Command buffer reusability and frame contexts (Dzmitry)

  • Add GPUEncoderDescriptor.reusable #294
  • DM: this is sort of related to Hans-Kristian’s blog, and his Granite renderer
  • DM: his claim: reusable command buffers are not very useful. For resource tracking, we don’t know at record time when the cmdbuf will be submitted. Would have to relate the cmdbuf to a set of resources, etc. H-K suggests a different abstraction: a FrameContext at the API level, backed by a CommandPool, DescriptorPool, tracking of resources, etc. Typically one per frame. Open the context, record, and resources used will be associated with the context. Have to submit all of these cmdbufs before recording the next FrameContext.
  • MM: Context opens. Record cmdbuf. Submit cmdbuf. Context closes. ?
  • MM: can’t reuse old stuff?
  • DM: right. Would FrameContext reduce tracking overhead for resources?
  • CW: you can just have strong refs to resources inside cmdbuf.
  • MM: why is this better? What does it buy?
  • KR: reducing fragmentation in memory pools?
  • DM: yes, resetting pools wholesale. Natural association of cmdpool with FrameContext. Our pools will be fragmented. Free individual cmdbufs instead of resetting pools. This is Vulkan vkCommandPool. Opaque from our point of view. To use efficiently and reduce fragmentation in driver, it’s good to reset the pool as a whole.
  • CW: it’s good to avoid mallocs/frees. Vulkan CommandPools are just linear allocators. When you recycle them you use the same allocations.
  • DM: if you free an individual command buffer it’ll have to do some defragmentation in the command pool.
  • CW: HK also said in the blog post that one way to ensure textures are never freed before they’re used is strong refs. But since you have to submit the cmdbuf before the next frame, you don’t need the strong refs, so you save reference counting in the backends.
  • DM: it’s more that you tell the texture internally that it’s FrameContext number 200 it’s using, so easier to figure out when it can be freed.
  • DM: something to think about. Not personally convinced.
  • MM: something we’ve run into?
  • DM: haven’t profiled it.
  • MM: let’s profile it before adding to the API.
  • CW: the CommandPool recycling seems internal to the API.
  • DM: need a contract with the user.
  • CW: one CommandPool per cmdbuf?
  • DM: don’t think that’ll work well. Let’s profile it.
  • RC: what about command buffer recording on workers? What’s the “frame” context in that case?
  • CW: makes the API less composable to have a FrameContext and have everything be a part of it.
  • VM: you could have the concept of a set of cmdbufs / reference trackers that you release together.
  • DM: sounds similar to FrameContext.
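
Purely illustrative: DM’s description of Hans-Kristian’s FrameContext, transliterated into WebGPU-ish terms. No such object exists or is proposed in the API:

```typescript
// Illustrative FrameContext shape: all commands recorded during a frame come
// from one pool that is reset wholesale, and every encoder must be submitted
// before the next frame begins.
interface FrameContext {
  begin(): void;                              // reset backing command/descriptor pools
  createCommandEncoder(): GPUCommandEncoder;  // allocates from this frame's pool
  end(queue: GPUQueue): void;                 // all encoders submitted; frame N's
                                              // resources are released when N completes
}
```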
  • RC: should we talk about the opposite of this? Reusable cmdbufs?
  • CW: there’s pull request “Add GPUEncoderDescriptor.reusable #294”.
    • This is important for one big reason: JS is a slower language, and there are costs to going across the JS / engine bindings. The fewer crossings, the better.
    • The Babylon.js example didn’t have better performance than WebGL because of the JS overhead. Then we added reusable cmdbufs and it sped up. They’d bake parts of their scene graph into cmdbufs.
  • KR: Let’s not jump to that conclusion. We’ve been seeing the performance hit of calling from JS into the browser engine for 10 years. Now we have an API designed to allow lots of small draw operations in an efficient manner, and we’ve got clear evidence that the JS to browser overhead is restrictive.
  • CW: say you have an engine and most things are static, don’t want to walk through it each time.
  • KR: we have measurements.
  • AE: I did SetBindGroup + Draw, 20,000 times. If you just do the bindings, it takes 2 ms. Kind of bad, because doing the implementation (recording the commands) in JS takes 3 ms, and for us to turn those into a MTLCommandBuffer takes 1.15 ms. We can’t record commands fast enough to be able to submit.
  • MM: so JS bindings have less throughput than recording Metal commands?
  • AE: yes.
  • CW: with this you can factor your scene graph traversal, batching, etc.
  • MM: different browsers have different bindings costs. Putting this new concept in is basically permanent. Babylon wrote some code that does some JS. Not clear if they wrote the best JS, etc.
  • CW: don’t see how this is controversial. All APIs have the option of reusable cmdbufs except Metal.
  • DM: HK’s experience is Vk-based. My experience with Metal is different. Profiling Dota 2 in Metal, lot of time spent in command recording, which HK says is free in his blog post.
  • VM: the issue of how an app has to be written to be free is important. To only have the cost of the API, I have to build my own display list.
  • KR: It’s not just wasm. We’ve clearly highlighted bindings problems in at least one browser. We’ve reached the limit of what we can accept (if we can’t make WebGPU faster than WebGL). We need to push on the JS engine folks to improve this.
  • RM: I wonder how big the functions we’re calling in C++ are. It could be possible to inline some of the functions into the JS side.
  • KR: Right now the cost of getting out of JS is in one place - argument marshalling. It’s not optimised in JS.
  • RM: Change the calling convention?
  • CW: Mark some of the IDL functions as “fast call” and get the arguments a particular way (e.g. if we know the types are correct).
  • KR: This wouldn’t need an extension to WebIDL. They’d use annotations. The browser engines just need to do the work. I don’t want an implementation detail to inform API decisions.
  • VM: Getting our bindings faster is good either way. But we should still discuss if there is a way to get a large improvement via an API change.
  • CW: e.g. a CAD application that walks the scene graph, serializes its commands into its own buffer, and then looks in that buffer to do the draws. If we are suggesting that all applications do this, then we should do it for them.
  • KR: Fair point. Except that we might hit a point where we want to make a single change to an existing recording.
  • CW: We won’t do that.
  • VM: Babylon got a huge speedup by doing one bindings call rather than many. I think it is a valid optimisation.
  • KR: Is this violating the way people use low-level APIs? What does Dota do?
  • DM: They don’t.
  • KR: What about consoles?
  • VM: PS3 didn’t reuse them.
  • CW: I’ve not worked on consoles but they define a binary format.
  • DM: the engine I worked on had its own cmdbuf representation but if we gave them one they wouldn’t necessarily use it.
  • VM: data point: Antoine’s worked with Vulkan, and in general cmdbuf reuse is not recommended, because instead of generating the most optimal single cmdbuf representation, the driver has to make something that can be patched. We’re still sending all the commands to Vulkan / Metal / etc. Still getting the 100x.
  • DM: if it’s reusable it’s the responsibility of the driver to make it run fast on the GPU.
  • DJ: unless by putting in the patching hooks it can’t optimize.
  • VM: if you think of both the client and driver side…
  • MM: GPUs can overwrite bytes when they know they can; e.g. specialization constants.
  • VM: engines today are focusing more on multithreading. Maybe you’re using 4 cores but also want to use the GPU fully.
  • DM: important question about what the test did.
  • CW: JS went from 70 ms per frame to nothing. We were blocked on the GPU process recording commands. 400,000 SetPipelines per second.
  • VM: the app was designed to change pipelines that many times.
  • MM: were the first and second pipelines the same?
  • CW: no. First we don’t do caching. App had 10 pipelines and trees would randomly be one of them.
  • MM: if you aren’t caching then are we looking at the right example?
  • KR: We don’t have to look at this example. Can look at Austin’s - very rudimentary, tells us the time going into bindings alone.
  • VM: the goal of the trees example was to show switching pipelines, which WebGL can’t do efficiently.
  • CW: an app walking through the state and recording the commands will always be faster than one which doesn’t.
  • MM: if the native app can do it then we should be able to too.
  • JG: rerecording also requires revalidation.
  • CW: the language is slower. Developers have less time to optimize their apps. Good luck with data-oriented optimization in JS.
  • KR: People trying to do ECS in JS. When making WebGL, there was no way to get data through efficiently so we added TypedArrays.
  • CW: Agree we need to optimize our bindings. But we’d still force the app to go through its 500 objects every frame instead of ahead of time. Dynamic stuff still needs optimized bindings, but that’s orthogonal to factoring work.
  • KR: Can still factor work. Instead of doing pointer chasing (probably source of slowness), can record in JS if they want. Worried that taking this API shortcut will make bindings overhead seem less important. But optimizing bindings is still needed.
  • MM: maybe we should work on bindings perf and revisit this.
  • VM: maybe we should try removing Dawn’s validation and measure with and without. If we can show that for a native app, it’s better to reuse command buffers, that’s a data point.
  • RM: If you have a benchmark you can share, it would be good to make sure the browsers are all optimizing the same thing [e.g. bindings].
  • MM: do you want microbenchmark or Unity?
  • RM: Ideally both.
  • AE: Have a benchmark with reuse vs no-reuse.
  • MM: do we want to talk about bundles as distinct from reusable command buffers?
  • VM: Data on Austin’s test
    • 100,000 draw calls
    • Without reuse: JS 17.6 ms, Dawn 41.2 ms
    • With reuse: JS 0.3 ms, Dawn 14.4 ms
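
A minimal sketch of the baking pattern under discussion, assuming the hypothetical `reusable` flag from PR #294 and the (equally hypothetical) ability to re-submit the same command buffer every frame; the `Drawable` type and pass descriptor are illustrative, not shipped API:

```typescript
// Hypothetical: `reusable` (PR #294) and multi-submit of a GPUCommandBuffer
// are proposal-stage ideas; `Drawable` is an illustrative app-side type.
interface Drawable {
  pipeline: GPURenderPipeline;
  bindGroup: GPUBindGroup;
  vertexCount: number;
}

// Bake the static part of the scene graph once, paying the JS-to-engine
// binding cost a single time instead of every frame.
function bakeStaticScene(
  device: GPUDevice,
  passDesc: GPURenderPassDescriptor,
  scene: Drawable[]
): GPUCommandBuffer {
  const encoder = device.createCommandEncoder({ reusable: true } as any);
  const pass = encoder.beginRenderPass(passDesc);
  for (const d of scene) {
    pass.setPipeline(d.pipeline);
    pass.setBindGroup(0, d.bindGroup);
    pass.draw(d.vertexCount);
  }
  pass.end();
  return encoder.finish();
}

// Per frame: one bindings crossing to submit, no scene-graph walk in JS.
function frame(queue: GPUQueue, baked: GPUCommandBuffer) {
  queue.submit([baked]);
}
```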

Add GPURenderBundle #301

  • DM: if we have reusable bundles then that may replace reusable command buffers.
  • CW: if we have one we should have the other.
  • ...discussion…
  • CW: idea: on the device you can say “create render bundle encoder”. Record a bunch of commands. Finish. Then you have a render bundle. All you can do with it is execute it as part of a render pass. You have to give it render pass compatibility information. In the API, there’s GPUCommandBuffer and GPURenderPassEncoder; now add GPURenderBundleEncoder (see the sketch at the end of this section). Good for:
    • Reuse
    • Allows splitting the work to encode a render pass between multiple workers, so your giant opaque pass doesn’t have to be in a single worker.
  • MM: from this PR it looks like everything a regular render pass encoder can do, so can the bundle encoder. What about restrictions in native APIs?
  • CW: what are the Metal restrictions?
  • MM: can only set buffers which have indirect buffers. Can draw. That’s it.
  • MM: checking the restrictions.
  • https://developer.apple.com/documentation/metal/mtlindirectrendercommand
    • setting buffers
    • setting pipeline states
    • draw
  • RC: what if the native APIs don’t support something?
  • CW: let implementations deal with that in order to have universal reusability. If native impl can’t handle a certain command then split the bundle into two. In D3D12 / Metal / Vulkan you’re encouraged to not have too big a bundle. About a dozen draws for example.
  • RC: sounds like it wouldn’t help Babylon.js much.
  • CW: they can reuse them.
  • RC: so, divided by 12?
  • CW: my opinion in this PR is that RenderBundles are a WebGPU abstraction that doesn’t necessarily have to match the underlying abstraction. In this version, you can put anything you want in it and impl handles it. Goal is to handle parallel recording and easy reuse on the app’s side.
  • MM: is eventual thought that if every native API has a facility and they’re similar, it would be unfortunate if we can’t use it?
  • CW: they’re different.
  • DM: could consider a subset.
  • CW: Vulkan secondary command buffers can do anything.
  • DM: I’m talking about a subset.
  • CW: why force the app to split themselves?
  • DM: if you want to save JS overhead, bundles solve that. Most of the work will be in render passes. Reducing that by a factor of 20 would help?
  • ...
  • MM: think that would be a good addition to the API. Had some discussion internally.
  • CW: do people agree on RenderBundles supporting the equivalent of SetBuffers in Metal, plus SetBindGroup, SetPipeline, and Draw?
  • DM: samplers?
  • CW: part of the bind group.
  • MM: one more thing: creating a buffer in Metal is sometimes expensive. Argument buffers are buffers. So creating bind groups can be expensive. We can use arg buffers as bind groups, or not. Need both paths. If we will have bundles, and in there you can change a bind group, that has to be an arg buffer.
  • CW: preventing setting resources in a bundle might be too much of a cost.
  • DM: think you must use arg buffers only if bundles are supported.
  • MM: might force us down a slow path for bind group creation.
  • DM: in WebGPU itself, resource creation etc. can take time.
  • MM: our BindGroup creation can be 1-10x slower than other platforms’.
  • DM: you might say, change binding 5 in this BindGroup to this thing. Mutating a BindGroup. In your case, the same buffer’s used, not making a new one.
  • MM: to use Bundle API would need to create arg buffer, and that’s the cost.
  • CW: you can amortize the cost of arg buffer creation if app can mutate bind groups cheaply.
  • MM: Apple should do more investigation to see if we would be required to pay high perf costs to use bundles.
    • MM: example of allocating BindGroup, then later recording bundle and having to back-allocate an argument buffer, and that’d be expensive.
    • DM: seems to me that bind group creation can be slow.
    • MM: agree.
    • CW: what’s expensive in arg buffer creation?
    • MM: GPU allocation.
    • CW: similar to problems with Vulkan’s DescriptorPools. Can amortize the cost of allocation.
    • MM: worth trying. Think it could work.
    • DM: to expand on that, you don’t need a new allocation. Can use an old one and a different region.
  • CW: so, yes bundles?
  • MM: yes.
  • CW: things that can’t go in bundles:
    • blend color
    • stencil reference
    • scissor
    • DrawIndirect
    • ??
  • CW: in bundles:
    • AI: CW: add a table here.
    • AI: DM: put table in PR #286.
  • MM: good idea.
  • CW: in WorkerContext: GpuRenderBundles would be transferable.
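
A sketch of the render bundle shape from PR #301 as discussed above. The encoder methods shown match the agreed command set (SetPipeline / SetBindGroup / Draw); the `Drawable` type and the attachment format are illustrative assumptions:

```typescript
interface Drawable {
  pipeline: GPURenderPipeline;
  bindGroup: GPUBindGroup;
  vertexCount: number;
}

function bakeBundle(device: GPUDevice, drawables: Drawable[]): GPURenderBundle {
  // The bundle encoder is created with render pass compatibility
  // information (here: the color attachment formats).
  const bundleEncoder = device.createRenderBundleEncoder({
    colorFormats: ["bgra8unorm"],
  });
  for (const d of drawables) {
    bundleEncoder.setPipeline(d.pipeline);
    bundleEncoder.setBindGroup(0, d.bindGroup);
    bundleEncoder.draw(d.vertexCount);
  }
  return bundleEncoder.finish();
}

// Inside any compatible render pass, the whole bundle replays with a single
// call, and can be re-executed across passes and frames:
//   pass.executeBundles([bundle]);
```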

Case for mutability of bind groups (Dzmitry)

  • DM: currently bind groups are immutable. All native APIs support some sort of mutability. This is something we could easily expose. Questions: should we? What are the use cases? An impl would have a descriptor heap and map the bind group to one or more heaps.
  • RC: there are perf implications. You can say, this descriptor’s data’s static. Then some IHVs say they can loop unroll your shader. Otherwise they don’t so the shader’s slow.
  • MM: they do that after pipeline creation? Cool.
  • RC: have to be careful. If people go crazy with this they lose optimizations.
  • MM: presumably there’s a perf penalty, because if there weren’t you wouldn’t need bind groups.
  • DM: there’d still be a reason to have bind groups, because you can bind a lot of things in one command.
  • MM: not that we shouldn’t have made them immutable, there’s just a cost in mutating them. Justin and I haven’t tried mutating an arg buffer yet, btw. Should proceed under assumption that you can.
  • CW: think ability to mutate bind groups is great because it favors reusability. Can repurpose a bundle and change just the uniform buffer, texture, etc. Removes some pressure because creating bind group will be most common device op. Reusing bind groups sounds good.
  • MM: any concern about cmdbufs being in flight that reference bind groups?
  • DM: yes. The device timeline question is open. If we give it the semantics of happening after all submissions, it will be confusing. Some cmdbufs are being recorded while submissions are ongoing, with one half before the change and one half after.
  • CW: sounds similar to how buffers work.
  • MM: if it’s an enqueued command that addresses the sync issues.
  • DM: how would synchronization work?
  • CW: was thinking it’s beneficial if we reused command buffers. Now not sure. For sync aspect: should just version BindGroups.
  • DM: so record half of cmdbuf with one state?
  • CW: bind groups are objects containing content. Whether updating them is a device or enqueued operation is to be discussed. Suggest that impls can version bind groups: when you update a bind group, it gets another allocation somewhere and the pointer to the allocation is replaced with a different one (a sketch follows at the end of this section).
  • DM: does that happen instantly?
  • MM: you’re saying it’s a copy?
  • CW: yes. Idea is, linear allocator. Based on layout, can have BindGroups. CPU-side state tracking. UpdateBindGroup would find a free spot and point there.
  • MM: and copy the contents to the new one.
  • DM: I was thinking about partial content.
  • CW: then I’m not sure. If partial, then have to copy. If still doing allocations, why is it useful? Would be helpful with reusable command buffers. Can change the content of the object and re-send the same commands.
  • MM: not sure that’d work in Metal. In Metal, if you record one of these bundles you’d probably want to create a device indirect argument encoder, and every WebGPU call should call into the native API one by one.
  • CW: Metal’s great because you can copy buffer to buffer, and that’s synchronized properly. Guarantees correct use of indirect arg buffer.
  • MM: I meant, if I record a bundle and you say, bind this BindGroup, we’ll resolve that to the Metal BindGroup and bind it. Then want to redirect it to a new one. The existing command won’t be updated.
  • CW: In Metal you could replace that with a CopyBufferToBuffer. From staging arg buffer to arg buffer for BindGroup.
  • MM: so fix it up in postprocessing.
  • CW: not sure. The arg buffer’s position in memory would be fixed; stage copies into it. Like setSubData, except “setSubDescriptor”. Never mind.
  • DM: unclear if that’ll work.
  • CW: also need more info on synch guarantees of Metal. Use arg buffer, copy into it, use it again - synchronized properly?
  • MM: no, because buffers are just buffers in Metal.
  • DM: you have to tell the cmdbuf that there are resources you’re using.
  • MM: if GPU is running a shader accessing these resources, you can’t just read/write to that indirection buffer.
  • CW: say RenderPassEncoder, then BlitEncoder, then new RenderPassEncoder. Synchronized properly?
  • MM: yes. You can copy any of these things around.
  • CW: then you can do something smart in your impl.
  • DM: don’t think that’ll work. Have to call UseResource on contents.
  • MM: it’s getting nebulous for me.
  • CW: I see the point but not why it’s a problem.
  • RC: should we do this post-MVP?
  • CW: yes. JS devs can right now do everything this can do. For now they don’t need it.
  • KN: I think that creating BindGroups was a massive cost in TF.js. 0.4 ms per BindGroup. Pretty sure it’s our implementation.
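
A purely hypothetical sketch of the versioning idea above. WebGPU bind groups are immutable and `updateBindGroup` does not exist; this only illustrates the copy-on-update implementation strategy discussed:

```typescript
// Hypothetical copy-on-update versioning: rather than mutating a bind group
// in place (which would race with in-flight command buffers), each update
// allocates a fresh GPUBindGroup and repoints; old versions stay valid
// until the GPU has finished with them.
interface VersionedBindGroup {
  layout: GPUBindGroupLayout;
  current: GPUBindGroup; // points at the latest allocation
}

function updateBindGroup(
  device: GPUDevice,
  vbg: VersionedBindGroup,
  entries: GPUBindGroupEntry[]
): void {
  vbg.current = device.createBindGroup({ layout: vbg.layout, entries });
}
```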

Fingerprinting and exposing some GPU geometry (Corentin)

  • Device limits, in particular, like the max texture size (Dzmitry)
  • RC: what Safari implementations mask the renderer string?
  • DJ: all of them on iOS, and none of them on macOS. But I expect my pushback will eventually fail, and they’ll be masked on macOS too. I have argued that the variety of hardware on macOS is greater than on iOS.
  • MM: and iOS has no bugs. :)
  • DJ: I also reached out to the Google Maps team, and they said they really only needed it for Windows hardware. Even apple.com uses this string to turn certain WebGL features on/off, and they were disappointed when we masked it. On Twitter, a handful of people said they were using it, but it’s unclear if their world would end if we removed it. Nobody admitted to using it for fingerprinting. We do know people are using it for fingerprinting.
  • RC: questions about Firefox privacy / anti-fingerprinting mode?
  • RC: when will that ship?
  • JG: soon. Experience will be, if you flip this pref, they should get the masked renderer string.
  • MM: for WebGL or other APIs?
  • JG: for the whole browser.
  • CW: that’s the side of the users, let’s see from the developer side.
  • Feedback from WebGL developers on usage of the device string (Ken)
  • KR: Figma does 2D rendering using WebGL; they found tons of bugs and need the renderer string to render correctly. They need to know what the GPU / driver is so they can work around bugs before the browser is able to.
  • KR: Imagine if us browsers couldn’t work around driver bugs.
  • KR: We tried to optimize something on ChromeOS and broke apps. WebGL devs were only able to work around the bug thanks to debug renderer info.
  • KR: Photopea uses zero extensions and still ran into bugs in some browsers.
  • KR: Other major WebGL consumers are using it. Chrome would break major users if we didn’t expose this extension. This was essential to launching Google Maps in WebGL and they could find the few GPUs that broke. Not using it too much anymore but other high-end users still need it to achieve correctness and speed.
  • KN: think it should be in the core WebGPU API, but you can put whatever masked info you want into it.
  • MM: what would we put in?
  • CW: Metal device string. The unmasked renderer string == the Metal device name and your OS version, unless you have a custom driver.
  • MM: something analogous in the other APIs?
  • CW: yes. (D3D, Vulkan) Vk has physical device name. We pushed for having extended driver info and driver version, etc.
  • KN: we have PCI ID on desktop.
  • CW: D3D12 has everything too.
  • RC: it has the name too.
  • CW: driver version might be more complex?
  • RC: yes. It’s doable. ANGLE doesn’t put that in the debug renderer string.
  • VM: we don’t give the full renderer device ID, just the OpenGL driver’s.
  • RC: as part of this I asked the privacy person on Edge. There are features on the web platform like microphone access, where if you’re the top-level page, you get a permission dialog, but in an iframe, it’s refused. top-level can delegate access to an iframe. Maybe we can do a similar thing for this, just not have a permission prompt. top-level gets access, not iframes, and access can be delegated.
  • MM: we’d want to be able to control that more, e.g. for the entire WebGPU API. Different browsers may have different strategies of what they expose where.
  • JG: can we say, querying this is in the core API, but browsers can decide what to expose?
  • CW: would be good to have permission transfer to iframes.
  • JG: agree in general.
  • CW: for RequestAdapterList as well?
  • JG: yes. Stuff about wavefronts, etc.
  • CW: if we have permission transfer, and expose renderer string, then what if we expose to the main windows, all the adapters with all their bells & whistles? And to iframes, only a single adapter, no name, exact WebGPU spec limits, no extensions, and main page can pass gpu permissions to iframes?
  • RC: why can’t we keep RequestAdapter still with hints?
  • CW: Spectre.
  • MM: think we should have normative text in the spec saying the renderer string may or may not be meaningful.
  • KN: it could be nullable.
  • MM: we might want to put wrong information in there.
  • KN: kind of like it as nullable, but contents of string should be non-specified.
  • CW: if you have a fancy new feature and are browsing the web, the UA could spoof it.
  • MM: think policy of what things are in what frames should be not in the spec.
  • CW: how about permissions token?
  • MM: yes, that should be in the spec.
  • KN: we’ll need the token somewhere.
  • VM: or take advantage of same-origin iframes, etc.
  • CW: that’s a UA decision.
  • KN: not sure where the spec for that is, but think that’s in it.
  • CW: so we have GPUAdapter.name, which is the renderer string? It can be anything the UA wants? And we add the token for permission delegation? (A sketch follows at the end of this section.)
  • VM: then UAs can do anything they want with that bit. Expose warp size, etc.
  • <insert discussion about essential compute geometry information>
  • RC: in general: I don’t think we should change the API to return every adapter. Prefer “max power”, “min performance”, etc. There’s a DXGI API where you can enumerate adapters by power preference. We don’t want web developers searching the string for “nvidia”.
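
A sketch of what consuming a possibly-masked adapter name could look like from the application side. `GPUAdapter.name` reflects the proposal above; the masking policy, the regex, and `configureRenderer` are illustrative assumptions:

```typescript
// The name may be a real renderer string ("AMD Radeon Pro 560"), a generic
// string, or empty, depending on UA policy and the embedding context
// (top-level page vs. iframe, permission delegation, etc.).
declare function configureRenderer(opts: { avoidKnownBug: boolean }): void; // app-side stub

async function pickWorkarounds(): Promise<void> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return;
  const name = (adapter as { name?: string }).name ?? "";
  // Engines can use the string to enable driver-bug workarounds, but must
  // keep a correct (if slower) fallback for when the string is masked.
  const suspectDriver = /example-buggy-gpu/i.test(name); // illustrative check
  configureRenderer({ avoidKnownBug: suspectDriver });
}
```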

Essential compute geometry information

  • CW: now we should discuss essential compute geometry information. Shared memory size, warp size, max subgroup sizes.
  • YH: register numbers?
  • CW: that’d be shared memory probably.
  • MM: we could do that better. Instead of telling them number of registers, we could tell them a yes or a no.
  • KR: then you’d have to binary search shaders?
  • MM: yes.
  • RC: kind of agree with Ken about that.
  • VM: there are a lot of different geometry constraints for shaders. You can instead inspect the pipeline. WebGPU could tell you the ideal dispatch size.
  • CW: what’s important is the local workgroup size, and that’s a property of the shader, not the dispatch. Local is in the shader, global is in the dispatch. That’s the number of threads.
  • VM: is that a GLSL thing?
  • MM: no, D3D and Vulkan have it.
  • VM: CUDA and OpenCL don’t have it.
  • CW: that’s a constraint in D3D12 and Vulkan. You have a box that uses shared memory, and then tile that box throughout the global workgroup size. You’d better know the wavefront size and the amount of shmem per execution unit. We need to come up with our own names for this, because I just used 3 different vendors’ names in the same sentence.
  • VM: when optimizing compute shaders, you don’t know how many registers you’ll use until you compile the shader.
  • CW: don’t think we can expose this. Have to put their shader in Radeon Analyzer.
  • VM: if you could compile and then get a suggested size, but that would require 2 compiles.
  • DJ: what should we do on Metal where we don’t have this info?
  • KN: you can get the max workgroup size for a given pipeline.
  • MM: there are a bunch of methods on the Metal Device but that doesn’t say anything about the particular program. Can tell you max number threads per threadgroup.
  • CW: and shmem size.
    • maxThreadgroupMemoryLength: The maximum threadgroup memory available, in bytes.
    • maxThreadsPerThreadgroup: The maximum number of threads along each dimension of a threadgroup.
  • VM: in WebGPU you have to compile a pipeline, but you have to specify the geometry (the local size) in the shader to know whether it’ll compile on Metal?
  • KN: it says 1024, but after you compile, it’s 256.
  • CW: threads per thread group are on the Device.
  • KN: pipeline can have even lower threads per thread group.
  • VM: it can limit number of threads you can dispatch.
  • DM: is this threadExecutionWidth?
  • https://developer.apple.com/documentation/metal/mtlcomputepipelinestate?language=objc
  • DJ: The MTLComputePipelineState can provide different limits, and you only know this after you’ve built the pipeline.
  • CW: that’s unfortunate because if app makes a fat pipeline they can only do 4 threads per threadgroup on some GPUs.
  • KN: how do we validate this? Table said 1024, so I did 1024, and it reset the GPU. Reopened in Xcode, and said “you exceeded max total threads per threadgroup on this pipeline”.
  • CW: someone should look at this and decide what we should do to design an API.
  • MM: one requirement: some info will inform your decision about threads per threadgroup. How does the app account for it?
  • KN: that was my question [much] earlier.
  • VM: you’ll have to specify shader, compile once, and re-adjust.
  • CW: impls will have to check that maxTotalThreadsPerThreadgroup is > 0.
  • MM: in Metal, threadsPerThreadgroup is not in the shader. In the 2-step model the shader won’t have this info.
  • CW: warrants investigation of what Vk and D3D12 are doing.
  • KN: don’t remember specialization constants but the work group size can be a specialization constant.
  • CW: you’d probably fail pipeline compilation if you went over.
  • MM: so you do a binary search?
  • DN: you can do queries later, e.g. what’s a recommended size.
  • MM: sounds like a good approach and we should investigate.
  • DJ: can compile and then ask, what’s a good dispatch size.
  • MM: you do a specialization constant.
  • DN: in the CL model you aren’t compiled all the way down to a pipeline.
  • DJ: and Vulkan doesn’t do that?
  • DN: once you’ve built the pipeline you either succeed or fail pipeline creation.
  • MM: can you have all pipeline creations done redundantly?
  • VM: app would get info back and say, I’m going to do 64x4. Have to be able to query results of the first compilation.
  • DN: you can’t query, just see whether compilation succeeded.
  • MM: Vulkan people have to figure out how to make this work. Our implementation does 1x1x1.
  • JF: Our implementation queries the compiled pipeline state for maxTotalThreadsPerThreadGroup and divides that by threadExecutionWidth (to calculate the dimensions of a threadgroup; restated as a sketch at the end of this section).
  • VM: needs more investigation on Vulkan.
  • MM: not just trying to make things faster but also have semantic meaning. Impl can’t just choose what it wants.
  • CW: going to be a fun investigation to see what backend APIs expose and make something useful to the application.
  • DM: I’d like to look into this.
  • VM: D3D and GL will also have issues. You have to specify the workgroup size ahead of time; it just fails if it can’t compile.
  • <jumping back to previous topic for a minute>
  • CB: we’re trying to get all hardware architectures to adopt a standard lane width of 32. Number of maxThreadGroups, etc., hasn’t changed from D3D11. Lots of GPUs with optimal sizes nowhere near the max. For many, the max threads per thread group is 256.
  • MM: you suggesting to have numbers in the WebGPU spec and tell developers to use them?
  • CB: just giving you info.
  • CW: can I tell in D3D what the occupancy might be?
  • CB: there’s metadata that comes back from the GPU. Think it’s exposed through PIX, not the API.
  • VM: need a Vulkan extension to mimic that. :)
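
The Metal-backed heuristic JF describes, restated as a sketch. Real code would query `MTLComputePipelineState` from native code; the TypeScript below only illustrates the arithmetic, with illustrative names:

```typescript
// Metal only reports these limits after the pipeline is compiled.
interface CompiledComputePipelineInfo {
  maxTotalThreadsPerThreadgroup: number; // e.g. 256 for a fat pipeline
  threadExecutionWidth: number;          // SIMD width, e.g. 32
}

// Width = one full SIMD group; height = whatever still fits under the
// pipeline's post-compilation thread limit. For 256 / 32 this yields a
// [32, 8, 1] threadgroup.
function threadgroupSizeFor(
  info: CompiledComputePipelineInfo
): [number, number, number] {
  const width = info.threadExecutionWidth;
  const height = Math.floor(info.maxTotalThreadsPerThreadgroup / width);
  return [width, Math.max(1, height), 1];
}
```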

BGRX / RGBX formats (Corentin)

  • CW: we removed these because Metal doesn’t have them. Why add them back?
    • Discussion about issues with IOSurfaces always having an alpha channel.
  • JG: we could duplicate pipelines and mask out the alpha channel.
  • CW: suggest having 3-channel color formats, and the only thing you can use them for is output attachment. The SwapChain tells you, “I want BGRX”. You compile the pipeline for BGRX. Then, because of render pass and pipeline compatibility, the alpha channel of your texture is untouched (see the sketch at the end of this section).
  • KR: is there any leak through the API of this information?
  • CW: no. OutputAttachments are write-only.
  • DM: will we have more like this?
  • VM: we have to read from these, no?
  • KR: not the API. Just the browser.
  • RC: can the app sample from it?
  • CW: no.
  • MM: how does WebGL deal?
  • KR: back buffer is not a resource the API can see.
  • JG: little conflict between alpha:false and opaque.
  • KR: not really. alpha:false implies opaque.
  • JG: do you want draw call time overhead or compositing overhead?
  • CW: I want none, that’s why I propose BGRX.
  • DM: why not say, this is a custom SwapChain format?
  • CW: could say that, and say that those can only be used as output attachments.
  • RC: would we ever use FP16?
  • KN: can’t be secret. Have to be able to identify each format in the application. Depends on the canvas’s formats.
  • JG: can we support BGRX formats in general?
  • MM: blit has to be implemented as compute shader, for example.
  • KN: doesn’t require shader rewriting.
  • CW: read-only except for output attachment. Could fix up the alpha channel to opaque.
  • JG: why would you need that?
  • CW: a copy from buffer to RGBX becomes a copy from buffer to RGBA on Metal. If the data has non-1.0 alpha, then it’s not RGBX anymore but something else.
  • JG: because you can copy directly to the SwapChain output?
  • CW: yes. It’s just a texture; we haven’t spec’ed a subset of usages. When using the swap chain, you want the texture to have certain usages.
  • MM: can we implement opaque CALayers by putting something opaque behind it?
  • KR: don’t think so. These are issues in the OS compositors.
  • KN: issues with superluminal values too.
  • VM: can we change the blend mode of the CALayer?
  • KR: not sure. opaque:true doesn’t work for IOSurface-backed layers.
  • Discussion about how feasible this is.
  • MM: I’d like to take half a day and see whether I can make this work with Core Animation.
  • CW: we’ll have answers from the DirectComposition and IOSurface oracles.
  • DJ: what about Android?
  • CW: we can ask Google folks too.
  • KR: have we determined that there are workarounds?
  • CW: yes.
  • CW: we all need to investigate but this is post-MVP. We can work around this.
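
A sketch of the BGRX proposal’s shape. `bgrx8unorm` is not a real WebGPU format and the canvas configuration API shown here is a later shape; the point is only that an X-channel swap chain format would be restricted to output attachment usage:

```typescript
declare const device: GPUDevice;
declare const canvas: HTMLCanvasElement;

const context = canvas.getContext("webgpu") as GPUCanvasContext;
// Hypothetical X-channel format; cast needed because it isn't in the spec.
const format = "bgrx8unorm" as unknown as GPUTextureFormat;

context.configure({
  device,
  format,
  // Output attachment only: the texture is write-only from the API's point
  // of view, so the unwritten X/alpha channel can never leak back to the app.
  usage: GPUTextureUsage.RENDER_ATTACHMENT,
});

// Pipelines targeting the swap chain would be compiled against the same BGRX
// format, so render pass / pipeline compatibility keeps alpha untouched.
```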