
Minutes 2022 04 06


GPU Web 2022-04-06 (APAC-timed)

Chair: KG

Scribe: KG, KN

Location: Google Meet

Tentative agenda

  • Administrivia
  • CTS Update
  • “Tacit Resolution” Queue
  • FP16 Extension against spec #2696
  • Are duplicate timestamp locations allowed on the same pass? #2662
  • Define WebCodecs interaction in an extension proposal #2720
  • Streaming implementations and indirect draws/dispatches #2189
  • (stretch) Consider putting most of the writeTexture validation on the device timeline. #2680
  • Agenda for next meeting

Attendance

  • Apple
    • Dan Glastonbury
    • Myles C. Maxfield
  • Google
    • Austin Eng
    • Kai Ninomiya
  • Intel
    • Hao Li
    • Jiawei Shao
    • Ningxin Hu
    • Shaobo Yan
    • Yang Gu
    • Zhaoming Jiang
  • Microsoft
    • Rafael Cintron
  • Mozilla
    • Kelsey Gilbert
  • Mehmet Oguz Derin
  • Michael Shannon

Administrivia

CTS Update

  • KN: Not much to report. Testing ongoing. Lots of contributions from Intel. Our team is continuing to work on things. Working on our implementation of how we run the tests, but maybe not super interesting to others.
  • MM: We’ve integrated the CTS into CI, and all but 3 are failing (this is good)
  • MM: Our test runner doesn’t support listing variants inside tests. So when we imported the suite, I needed to generate ~1,000 test HTML files, each of which runs an individual test.
  • KN: Seems reasonable.
  • KG: That’s what we expect to do too.

“Tacit Resolution” Queue

FP16 Extension against spec #2696

  • KN: Current shader flag is “shader-f16”. Do we want to bikeshed it?
  • KG: Is it called “f16”?
  • KN: It’s “enable f16;” in WGSL, but “shader-f16” on the API side.
  • MM: Why do we have the API-side flag?
  • KN: For feature detection
  • MM: Can’t you just try to compile a module for feature detection?
  • KN: Yes but I don’t want that to be the way it ought to be done. I think there’s value in being able to (easily) test with all features disabled, so just the base featureset, for portability testing.
  • Let’s create a shader-f16 feature on the API side (in addition to the WGSL enable directive)
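A minimal sketch of the resolution above, written against today’s WebGPU API shape rather than the 2022 draft (the function name and shader body are illustrative): feature-detect “shader-f16” on the adapter, request it on the device, and pair it with the WGSL enable directive.

```typescript
// Sketch: pair the "shader-f16" API feature with the WGSL "enable f16;" directive.
async function makeF16Module(): Promise<GPUShaderModule | null> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter || !adapter.features.has('shader-f16')) {
    return null; // base feature set only; fall back to f32 shaders
  }
  const device = await adapter.requestDevice({
    requiredFeatures: ['shader-f16'],
  });
  return device.createShaderModule({
    code: `
      enable f16;
      @fragment fn main() -> @location(0) vec4<f32> {
        let half_val: f16 = 0.5h;
        return vec4<f32>(f32(half_val), 0.0, 0.0, 1.0);
      }
    `,
  });
}
```

This also reflects KN’s portability point: an app can detect the feature up front instead of compiling a module just to see whether it fails.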

Are duplicate timestamp locations allowed on the same pass? #2662

  • HL: We found that some Metal devices can’t have duplicate locations like this. I like choice #3, because I also think it’s not necessary to get the timestamp in the same location many times.
  • MM: No opinion on how to solve this, of the three choices.
  • KN: If there were a compelling use case for it, we’d want to do #2 if possible.
  • MM: It’s possible.
  • KN: But given that I don’t see that use case right now, #3 is probably our best option. [A sketch of the resulting rule follows below.]
  • MM: I think I’m on the hook for this PR then. I intend to implement this approximately now.
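A sketch of the rule under choice #3, using the array-of-writes timestampWrites shape from the proposal at the time of this meeting (the descriptor shape has since changed, hence the loose cast): each location may appear at most once per pass.

```typescript
// Sketch of choice #3: duplicate timestamp locations in one pass are a validation error.
function beginTimedPass(encoder: GPUCommandEncoder,
                        querySet: GPUQuerySet): GPUComputePassEncoder {
  const descriptor = {
    timestampWrites: [
      { querySet, queryIndex: 0, location: 'beginning' },
      { querySet, queryIndex: 1, location: 'end' },
      // Repeating 'beginning' or 'end' here would fail validation under #3,
      // rather than writing the same location twice.
    ],
  };
  // Cast because current type definitions use a different timestampWrites shape.
  return encoder.beginComputePass(descriptor as unknown as GPUComputePassDescriptor);
}
```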

Define WebCodecs interaction in an extension proposal #2720

  • To be driven by Intel (Shaobo or Ningxin, if present); there are some inputs from their WebML efforts
  • SY: Recently we worked on a fully-GPU video use case example. Due to the lack of interop between VideoFrame and WebGPU, the performance is very bad. Many use cases, including WebRTC, need this interop.
  • NH: webmachinelearning/webnn#226 Integration with real-time video processing
  • NH: Conversation happening in Web Machine Learning CG. Video inputs very likely to come in as GPU resources. More and more video processing uses machine learning models so it’s important that we can interoperate between video, WebNN, WebGPU.
  • NH: Overview of the use case: the MediaCapture-Transform API gives a VideoFrame for each frame of video. Then, to do processing on the GPU, you want to bring in that VideoFrame. Example: background blur, which is popular in video conferencing. … can be done in a fragment shader, segmentation can be done in a compute shader [with ML], then composited and sent to be shown on the screen or sent to the network via WebRTC.
  • KG: Thanks, useful overview. Sympathize with the need for fast paths …
  • KG: Cautious about implementing too many fast paths too quickly - each one has to be implemented and maintained. Firefox also doesn’t have much of a WebCodecs implementation yet, so probably no one in our org who can speak fully to it.
  • KG: Haven’t had consensus on when we want to put this in. Seems like something we’re going to really want around a year from now…. If we ship and the best you can do is use an ImageBitmap, then it’s probably not going to be so much slower than that. Challenge is we want this eventually, but not yet.
  • KN: The shape of this right now assumes that if an impl has WC and WebGPU, then the interop will just work, with no separate enable.
  • NH:
  • MM: KG, you were discussing not adding too many fast paths too early. Is that an argument against this PR as it stands, or a wish that the PR would come later?
  • KG: I’d rather not have to think about this yet.
  • KN:
  • KG: This is effectively an idea for how the interaction works that may work correctly in Chromium, but without other implementations and more time/expertise to think about this, how it fits into Gecko’s architecture, etc., I can’t give any good design input about whether this lifetime is OK (can we make it live long enough?).
  • MM: At least two possible conclusions: Chrome marches ahead and ships WebCodecs-WebGPU interop and Chrome’s implementation is what the spec ends up saying. The other possibility is that the other implementors say please slow down so we can have input and Chrome agrees to do so.
  • KG: People can ship whatever they want, but actual compatibility and caniuse have sway. There is also industry pressure not to ship things until they’re on a standards track. So I don’t want to cede and say that, just because you can ignore our rules, I’m going to rubber-stamp potential spec language. So where I’m at now is: I can’t prevent Chrome from shipping something, but I can’t agree on a particular API now.
  • MM: Can look at this as interacting with the process document I wrote recently, which requires 2 browsers.
  • KN: Hard for me to represent this since I have different opinions on shipping things than what’s already happened in WebCodecs in Chrome.
  • KG: Ignoring all context, boils down to that I can’t be sure this proposal works.
  • NH:
  • KG:
  • MM: WebKit has not come out and said WebCodecs as a whole is a good idea, which affects our priority for anything relating to it.
  • KG: Was said that this would
  • KN: In principle, this seems like something that should go through standardization, either independently, or as part of the WC spec. We would want it to be feature-detectable, and would want to do an origin trial for it. That feels like a lot of process for something so small. I don’t know how to reconcile the ideal standardization process with the state of things today.
  • MM: We do have experience that says if you support two apis, users expect interop to work. Even if you’re supposed to feature-detect whether interop works, users don’t try it.
  • KN: I hope we can make this at least easy to feature-detect. Could make a separate importVideoFrame entry point.
  • KG: This aims to make this the most efficient solution. We were more open to standardizing VideoFrame in WebGL because texSubImage2D/etc just copies the frame into a texture. [Easier to be confident that we can somehow get data from the VideoFrame into the GL texture.] The tricky part here is in trying to spec something we’re confident in being zero-copy.
  • KN:
  • MM: To restate, I think there’s a fear that if we make this API, and we know it’s zero-copy in Chrome, it may end up not being optimal in other browsers when they implement it, and now we’re stuck with one API that’s not optimal, and have to make a new one that can be optimal on all browsers.
  • KG: We can add VideoFrame to [copyExternalImageToTexture] more easily. [Sketched after this section.]
  • KN: (finding this to be a very compelling argument, standardizing a copy instead is the only way we’ve thought of to resolve the tension between the way the two APIs are being standardized)
  • NH: Think that’s usable. Can start with the API and optimize internally, which is good for existing users. But it is still a copy (it’s in the name), so think users will eventually need to change their code in order to optimize.
  • MM: Reasonable to implement and see if you can optimize to do it without a copy.
  • KN: copyExternalImageToTexture very much not designed to be zero-copy. GPUTexture is a user-allocated resource and could be bound into a bunch of stuff. We’d have to swap out its backing texture to make that work.
  • KG: Let’s consider: post-v1, ideally wait until we have WC experience, and instead work on optimizing our one-copy paths.
  • KN: Sounds good to me, but I want to talk with the rest of my team.
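A sketch of the copy-based path discussed above, assuming VideoFrame were accepted as a copyExternalImageToTexture source (it was not at the time of this meeting; the function name and texture parameters are illustrative). This is the one-copy shape, not the zero-copy import the PR proposes.

```typescript
// Sketch: import a WebCodecs VideoFrame into a GPUTexture with one explicit copy.
function uploadVideoFrame(device: GPUDevice, frame: VideoFrame): GPUTexture {
  const size = [frame.codedWidth, frame.codedHeight];
  const texture = device.createTexture({
    size,
    format: 'rgba8unorm',
    // copyExternalImageToTexture requires COPY_DST and RENDER_ATTACHMENT usage.
    usage: GPUTextureUsage.COPY_DST |
           GPUTextureUsage.RENDER_ATTACHMENT |
           GPUTextureUsage.TEXTURE_BINDING,
  });
  device.queue.copyExternalImageToTexture({ source: frame }, { texture }, size);
  frame.close(); // the copy is queued; release the decoder resource
  return texture;
}
```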

Streaming implementations and indirect draws/dispatches #2189

  • MM: Having a good discussion with AE on the bug.
  • AE: I posted numbers to bound how big an indirect command buffer would be for one of the approaches. E.g. support 6 billion draws, and chunk things so that we don’t need to make too many resources all the time.
  • MM: The approach you described is like each buffer has the same size… I think we want to say the buffers grow (a different allocation strategy) so that the number of buffers doesn’t grow without bound. I did some investigation: with the smallest draw call that I could write, draw calls started taking about a second to execute at around 50 million draw calls. I wanted to find out whether there is a limit, i.e. will it always run in finite time even with unbounded draw call counts. That’s concerning. We might want to have a limit for the max number of draw calls per pass (e.g. 1 million, maybe). I think that might be sufficient. I want to find out if we need to compile the validation shader just once, or if we need to re-compile it for each pass. And if we need to recompile, is that fast enough?
  • AE: I think we won’t need to recompile. Dawn does it in a different way, because we don’t do the streaming thing: we do our validation in a compute shader. [Sketched after this section.] For the draw call limit: even without needing to do validation of the draws, would we still want e.g. 10 million draws to just immediately lose the device?
  • MM: I think more GPUs support preemption now than they used to. Ideally that means that if your draw calls take a long time, we don’t actually need to limit things strictly. E.g. even with some ten-second calls, the device is still responsive.
  • AE: I think either a 6B or 1M draw call limit is probably fine.
  • MM: Oh also, I made the smallest program I could that would draw, and that took seconds. If we maxed out the number of draw calls, we could probably just split the passes. Specifically, splitting a pass is worrisome for small passes, but free for large passes.
  • AE: You’ll also need to change storeOp:dontcare [i.e. “discard”] to storeOp:store, and you’d need to know to do that ahead of time.
  • KN: So we’re looking to add a draw-calls-per-render-pass limit?
  • MM: Yeah I think that’s where we’re ending up here.
  • KG: Is this something where we might support an optional render pass option for “I expect to have up to this many draw calls”?
  • <time limit hit>
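For context on the validation approach AE mentions, a hedged sketch (not Dawn’s actual code; all names are illustrative) of validating indirect draw arguments in a compute shader: each invocation checks one set of drawIndirect args against a limit and zeroes out-of-range draws so the later indirect draw becomes a no-op.

```typescript
// Illustrative WGSL for compute-shader validation of indirect draw arguments.
const validationWGSL = `
  struct DrawArgs {
    vertex_count: u32,
    instance_count: u32,
    first_vertex: u32,
    first_instance: u32,
  }
  @group(0) @binding(0) var<storage, read_write> draws: array<DrawArgs>;
  @group(0) @binding(1) var<uniform> max_vertices: u32;

  @compute @workgroup_size(64)
  fn validate(@builtin(global_invocation_id) gid: vec3<u32>) {
    if (gid.x >= arrayLength(&draws)) { return; }
    // Neutralize draws whose vertex range exceeds what validation allows.
    if (draws[gid.x].first_vertex + draws[gid.x].vertex_count > max_vertices) {
      draws[gid.x].vertex_count = 0u;
      draws[gid.x].instance_count = 0u;
    }
  }
`;
```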

(stretch) Consider putting most of the writeTexture validation on the device timeline. #2680

Agenda for next meeting
