GPU Web 2024 05 15

GPU Web WG 2024-05-15

Chair: CW

Scribe: JB

Location: Google Meet

Tentative agenda

Administrivia
CTS Update
“Tacit Resolution” Queue
Add extended range #4500 (comment)
"get a copy of the image contents of a context" does not handle case where canvas is unconfigured #4606
Should command buffers be invalidated by an invalid submit? #4599
Should there be a limit on the number of open GPUCommandEncoder instances? #4622
[editorial] Consider if there's a much better name for "image copies" #4573
(stretch) Add synchronous GPUAdapter.info #4550
Agenda for next meeting

Attendance

Apple
- Mike Wyrzykowski
Google
- Brandon Jones
- Corentin Wallez
- Kai Ninomiya
- Ken Russell
- Stephen White
Microsoft
- Rafael Cintron
Mozilla
- Jim Blandy
- Kelsey Gilbert
Nvidia
- Anders Leino
Myles Maxfield

Administrivia

CTS Update

Nothing major. Please run the new WGSL tests!

“Tacit Resolution” Queue

Empty!

Add extended range #4500 (comment)

KN: The CSS WG is still discussing whether clamping is required, so this PR loosens the verbiage to be more lenient so it can match what CSS does eventually.
KN: The two changes are the enums names that say "project" and the text saying it is projected in the color space, but not necessarily by clamping.
KG: Will review offline.

"get a copy of the image contents of a context" does not handle case where canvas is unconfigured #4606

KN: This is just a corner case. We have a “null pointer exception” in the spec text, in effect. When we try to get the picture out of the canvas, we haven’t defined what happens when the canvas hasn’t been configured at all. So we don’t know what should happen. Blank image? Transparent image? We know the size, because that’s associated with the canvas. I think it should probably be a blank transparent image, which is equivalent to nothing at all when composited
KG: It would be nice if there were not a new intermediate state for this, and do whatever the canvas would have done if you hadn’t gotten a context from it.
KN: Good point. That’s almost certainly a blank transparent image the size of the canvas, but I can work out the details.
(General consensus to make it behave like a canvas that hasn’t had a context attached yet)

Should command buffers be invalidated by an invalid submit? #4599

BJ: This is something that came up as we were doing our spec revisions. Looking through the logic for the spec versus Chrome, Chrome would invalidate every buffer you submit, but the spec says that if one command buffer is invalid, the remainder of submitted buffers remain valid. There seems to be general agreement that once you’ve submitted a list of command buffers, whether the submit succeeds or not, all the command buffers should be invalidated. points out that the specified behavior means you may have a bunch of valid command buffers lying around that you weren’t expecting.
BJ: Myles points out that this might incentivize people to make lots of little commits, which is an antipattern. A more effective remedy for them would be to check the command buffers’ validity ahead of time by popping the error scope.
KG: Is there a general consensus?
CW: Seems like it.
KG: If we wanted to solve the problem that Myles points out, we could specify that they all get maximally consumed: if there’s a submission failure in the first command buffer, it keeps trying to consume them until it’s done with the list.
BJ: I think that’s what’s being proposed. If any of them are invalid, then all are invalid
CW: Paraphrasing kelsey: command buffers are checked for duplication
KG: It’s a two-layer thing. The main principle is that this is an operation that consumes something. If it were memory management, it would be essential to make this clear. Fundamentally, this kind of operation should consume the buffer and take it back as much as possible, even if it fails, because it leaves the caller in a difficult state to interpret. Past that, we need to consider how to avoid discouraging less-than-ideal behavior.
MM: I’m not trying to stop the train. In the code blocks in the issue, in the first code block the execution is different between the two examples. Command buffers are generally not going to be independent. So I think we should just continue with this.
CONSENSUS: All command buffers passed to any submit are invalidated.

Should there be a limit on the number of open GPUCommandEncoder instances? #4622

MW: If we do not have this, then I don’t think we can encode any work on the encoders until the call to submit, because there’s the possibility that the website creates a bunch of encoders, but then fails to submit them, then we’ll block because we don’t have enough available command buffers.
CW: I want to avoid global validation. But Chromium encodes on submit, so this isn’t an issue for us. But also, OOM and device loss are always valid ways out of the situation. If that were the choice, then you’d probably increase the command limit with the appropriate Metal API call. Alternatively, you could reserve a few command buffers, then you could switch to encode-at-submit once you run out. But that would be more complexity (two paths). If you could set a limit of, say, 1000, then developers might never hit it. 60 seems borderline, middleware might consume a lot, or complex pages. So we should try to raise the limit substantially higher than 60, to make the global limit more acceptable.
MW: we don’t really see people hitting this limit, although games ported from Vulkan could indeed hit it. So some GPUs could have a higher limit. But there’s a substantial performance hit on Metal from having a high limit.
CW: People might not hit the limit in development, and only discover it on the user’s machines. I’m not sure that we need a configurable limit.
MW: There’s no limit on the device, it’s a limit on the Metal queue, which is presumably tab-specific. We can’t just create more Metal queues, because the driver wouldn’t be able to force the ordering of command buffers.
MM: Could you use shared events?
MW: I’d need to investigate that, it might be a potential option.
MM: Perhaps the reason not many apps have hit the limit is that there’s no mac with 64 cores. Such PCs do exist. So a limit of 64 is too low
KG: It’s worth mentioning that Josh Groves says that wgpu was actually hitting the 64 limit in the wild, and raised the limit to 2000.
MW: Why does the number of cores have any relevance?
MM: A common pattern is to have each thread recording a different command buffer
CW: We can’t do this sharing on the web right now.
CW: Room: what if we had it configurable, but required to be at least 1000
MW: Anything greater than 64 will have consequences on Metal mobile devices. Your app will get “jetsam”ed: your Safari tab will get killed. Command buffers are totally not free. This is why the proposal from Apple’s side is to keep the limit at 64, but let developers raise it if they’re willing to accept the consequences.
BJ: This same limit wouldn’t apply to render bundles, would it?
MW: No, those don’t create their own command buffer.
KG: Many people won’t realize they’re going to run into it, they’ll run into it only after they ship, and we’re going to be introducing issues where the systems wouldn’t otherwise have any problem. We’d be saving some tabs from being process-killed, but we’re increasing developer pain with this validation.
MM: Isn’t that an argument for applying validation consistently everywhere, and applying the limit to everyone?
KG: The place I see these happen is not the case you’d see on a developer machine until you’d created a test case that matches what the user is seeing. Sometimes that’s hard, users are doing strange things. But we still have to debug the problem, even when they’re doing something weird. But games will work on CI and then fail on user machines
KN: Depending on how this is implemented, it might be a new case of limit, depending on what maximum we expose. If we wanted to let you ask for higher, but with a higher risk of getting your process killed, that’d be a new thing for the spec. It might kill the whole browser, which wouldn’t be good.
KN: Secondly, I would very strongly prefer not to have global limits. I don’t like the complexity of the two-path solution, but if it lets us avoid a global limit, I think it’s worth it. Or, we could allow OOM on command buffer creation. But it would probably break the whole app.
BJ: If this does become a limit, we’re going to have to make some arbitrary choices. There’s fundamentally no limit in Chrome’s implementation, so we’re going to pick a high limit.
BJ: It depends on how we as a group view this particular limit. Is it like running out of texture memory: there’s plenty around, but you can still easily create a desktop app that creates much more than a mobile app could handle. And in that case we clearly don’t want to put a limit on texture creation - that’s best handled with OOM. Do we view this as a similar situation? I think we can. But I also feel like a two-tier encoding system would make a lot of difference here. Don’t want to be insensitive to imposing an engineering load on other teams.
JB: Would it be possible for Metal to treat all command buffers as bundle and at submission treat them as bundles?
MW: No: bundles don’t support all the commands that command buffers do
MM: Could we compromise on the limit?
MW: Anything over 64 starts to have an impact.
BJ: I don’t think it’s too terrible for us to push for best practices that discourage lots of outstanding queues. Even in Chrome, where we’re not actively writing to command buffers as people record, we’ll still accumulate memory, so it’s not a terrible idea to discourage profligate command buffer usage. “Are you leaking?” warning
CW: It seems like we have a large panel of options. The option BJ was describing: Browser should probably warn people when they have a lot of command buffers open - far below the limit, maybe 10. And that would help developers avoid getting surprised by what happens on the users’ machines. Give devs advance warning.
KG: This seems parallel to classic OOM: application works great on desktop, yet falls over on mobile. Even allocating 4G of memory seems like it’ll attract attention from the oom killer. This is seem as tolerable/necessary even if not ideal, and people cope with it. These feels right on the edge between “we should validate this for portability” and “we should let each device do what it’s capable of”
MW: We could do this silently, the downside is that it would work on higher-end machines but not lower-end machines or mobile devices. I agree we have the problem in general. The limit would help ensure more consistent behavior.
MM: BJ earlier asked what the difference is between this and running out of texture memory. First, there’s a developer expectation difference: developers understand running out of texture memory. Second, from the discussion in the issue, it seems like developers do hit this issue more often than they run out of memory. So I think there is a difference between the two situations.
KG: I agree there’s a difference, just not enough of a difference to merit a hard limit.
MW: The upside to the hard or configurable limit is that it would push adoption of moderated command buffer usage, which improves portability.
KG: Sometimes it’s useful to embed certain amounts of complexity in our implementations because we only have three WebGPU implementations, as opposed to thousands of content pages. It’s better for browser vendors to solve the problem than to foist it off on web devs.
CW: In WebKit, is it just a matter of the amount of engineering? Not having the two-tier thing? Or is there a fundamental reason to never do it? Say, “Virtual calls are expensive”?
MW: The engineering aspect is solvable, but it introduces a latency when you replay, and a bit more memory pressure.
CW: Those consequences would only happen in the cases that would run into the limit anyway.
MM: IF you were writing a native Metal app, you wouldn’t record your commands into a memory buffer. You would simply raise the limit on the Metal Queue.
CW: But we don’t have a fixed workload. We can’t anticipate the web page’s workload.
MM: You might be able to, if the queue doesn’t support enough command buffers, then you could make a new queue with a larger capacity and then slowly migrate over to the new queue. So you’d gradually raise the queue limit
MW: That’s an option. Pages will still get killed, and we’ll have to give them feedback that they shouldn’t encode so many buffers.
MM: Would adding a limit provide that feedback?
MW: Yes, because you’d get the feedback immediately, regardless of the machine
CW: You wouldn’t necessarily get the workload that provokes the failure.
CW: Would it makes sense to say, we can have OOM on CommandBuffer.finish? And have warnings in browser that discourage making lots of command buffers? And take MM’s suggestion of gradually increasing the queue size. Would that be a good outcome?
MW: We’d like the OOM on createCommandEncoder. This would be a solution, at least.
MM: There are two different priorities: one: create a system where the browser doesn’t crash. Two: support applications that need more than 64 but less than the device’s limit.
CW: I’d like to suggest that kind of many-bandaid solution. Seems like a good tradeoff between making applications care about a hard limit, while at the same time highly likely to work in WebKit, while also steering developers into not using too many command buffers. A palatable tradeoff?
MW: The downside is that they’ll not check for this OOM exception instead of the limit - which people also won’t check for
KN: Would the OOM error be a stopgap until you’ve done something in the implementation to paper over it?
MW: We’d have to see if we even want to ever allocate more than 64. It may just never work on iOS. So we’d very much prefer that the app simply doesn’t allow using more than the hard limit of command buffers.
MM: Would you prefer that it be impossible for the app to request the 65th command buffer?
MW: Yes.
KG: It was mentioned that in Metal, you can request a higher limit when you create a Queue. Could we surface that limit somehow in WebGPU itself?
MM: WebGPU doesn’t have multiple command queues right now
KN: It’d go in the device descriptor.
MM: Okay, so device descriptor specifies a command buffer in flight cap. What happens if the content surpasses that?
KG: Things will fail.
AL: It seems like these shouldn’t be so expensive. Why are they so expensive?
MW: We’re informed that it slows down notifications in the kernel, and notifications for buffer mapping and work completion get worse latency. I could get some numbers.
AL: Is that in-flight command buffers? Why is it different from having them in a memory buffer?
MW: No, the limit includes non-submitted buffers
MM: This isn’t just normal memory: it’s pinned memory that the GPU needs access to. And there’s actually a fixed-size table in the kernel.
AL: Couldn’t we record buffers in memory.
CW: Hardware is weird.
AL: Drivers always let you pretend you have more things than you really do
MM: This was answered above. Latency, memory consumption.
CW: We can’t change the way drivers behave. Ideally the drivers would be different, but we have to take them as they are.
KG: I think it’s worth considering whether it would be possible to surface the native-level workaround for this.
MM: What does the group think about the proposal where there’s a minimum command buffer count, and if the app exceeds that, it may or may not be validated?
CW: That could be surfaces as OOM or device loss. So, yes, but I don’t know if it really addresses the concern
AL: If you pick 64, doesn’t that mean you can’t use half your threads on a machine with more cores?
CW: Yes, and it’s even worse because command buffers in flight contribute to the count, too.
KG: If it’s tolerable in the native ecosystem to make people request higher limits in advance, why wouldn’t it be okay in WebGPU too?
MM: If the application sets that number to 100, then the likelihood that they’ll get OOM if they use fewer than 100 is very low.
CW: It seems like we didn’t find consensus this time. Let’s try to continue discussion in the issue and think about our options.
Continue discussion in subsequent meetings

[editorial] Consider if there's a much better name for "image copies" #4573

(stretch) Add synchronous GPUAdapter.info #4550

Agenda for next meeting

Seems like maybe not a lot of topics for next week. Skip next week? Consider a Pacific time meeting the following week, May 29th?
Discussion about feature level API and Compat (no issue yet?)
Hosting community discussions on advanced features like ray-tracing and bindless?
Items to revisit on Pacific time
- Add optional feature clip-distances (if there are new questions) #4588
- Add optional feature dual_source_blending #4621
- Push constant proposal #4612
Should there be a limit on the number of open GPUCommandEncoder instances? #4622
[editorial] Consider if there's a much better name for "image copies" #4573
(stretch) Add synchronous GPUAdapter.info #4550

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU Web 2024 05 15

GPU Web WG 2024-05-15

Tentative agenda

Attendance

Administrivia

CTS Update

“Tacit Resolution” Queue

Add extended range #4500 (comment)

"get a copy of the image contents of a context" does not handle case where canvas is unconfigured #4606

Should command buffers be invalidated by an invalid submit? #4599

Should there be a limit on the number of open GPUCommandEncoder instances? #4622

[editorial] Consider if there's a much better name for "image copies" #4573

(stretch) Add synchronous GPUAdapter.info #4550

Agenda for next meeting

Clone this wiki locally