Minutes 2019 12 16

GPU Web 2019-12-16

Chair: Corentin

Scribe: Ken / Austin

Location: Google Meet

TL;DR

  • Data uploads
    • Discussion of #506 (synchronous mapping) and #511 (guaranteed efficient)
      • Concern about potential for races or amount of client-side state tracking needed.
      • With the assumption that developers will write helpers/wrappers, we should pick the simplest solution spec-wise.
    • More discussion about “setBufferSubData”: whether it makes sense to add it for developer friendliness vs. how it could need to be optimized later down the line.
      • Discussion about producing a helper library, but that would require a charter change if the group does it.
  • PR Burndown
    • Merged #516: Remove createBufferMappedAsync
    • #517 make imageHeight optional: discuss more offline
    • #518 changes extensions to be a sequence of strings: discuss more offline.

Tentative agenda

  • Data uploads
  • PR burndown
  • Agenda for next meeting

Attendance

  • Apple
    • Dean Jackson
    • Justin Fan
    • Myles C. Maxfield
  • Google
    • Austin Eng
    • Corentin Wallez
    • James Darpinian
    • Kai Ninomiya
    • Ken Russell
    • Shrek Shao
  • Microsoft
    • Chas Boyd
    • Damyan Pepper
    • Rafael Cintron
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
  • Mehmet Oguz Derin
  • Timo de Kort

Administrative items

  • No meetings until next year (next one is 6th of January)
  • Someone made a PR on the CTS because they found the fence tests weren’t testing that things complete in finite time
    • They’re working on a software implementation in TypeScript (!)

Data uploads

  • #501, #426, #481, #491, #511, #509
  • CW: there was some progress on #509 - setSubData PR.
  • CW: there’s decent agreement on things there. Looking at it:
    • Most parties can agree to some form of setSubData, so hopefully we can put this in.
    • Then there are more complex but more efficient paths we can discuss.
    • JG: you can certainly vote on this, but I vote against. I don’t think it’s the right path.
    • DM: maybe we can talk more about the right path to get in sync.
  • CW: OK, we can talk about the more complex path of mapping / upload buffer.
  • DM: I’d like Jeff to describe what the two approaches are. They seem almost the same.
  • JG: They are almost the same. Talking about #506.
  • JG: each chooses a different constraint to break. The synchronous one makes sure mapping is still infallible; you always get a result back, and it inherits one set of constraints from that. The guaranteed-efficient one can return null. It’s fallible, so you get a different set of constraints.
    • There’s been some disappointment with any approach that’s fallible / can introduce race conditions.
    • This is the difference between the two (#506 vs #511)
  • CW: #506 is synchronous mapping but it can be fallible. #511 is something closer to WebGL’s efficient map_read - where if you look at fences as the app, you’re guaranteed you’re on the fast path.
  • JG: cool thing about sync (#506) - your data is always there. You don’t need a zeroing pass. Synchronous one has a constraint that makes this simple. If you can map it on the CPU then you can’t write it on the GPU. When you unmap, at worst you’re enqueueing a copy into that buffer.
  • JG: (#511) Guaranteed efficient: I like this because it’s the primitive I most think about. People will most be trying to approximate this. People want to map directly with minimal number of copies. Guaranteed efficient mapping lets you do this. Obvious when it fails. Someone could make a mistake about not checking fences properly. Fails loudly and clearly. Other ones which have the infallibility constraint just get slower - the webby solution, but people using close-to-the-metal APIs generally don’t want this behavior.
  • JG: overarching point: a lot of these primitives can implement the other versions with minimal polyfills. You can make the async wrappers for map_async around efficient mapping; that covers most of the proposals - except setSubData, which is the most limited way of doing uploads. It’s OK for that purpose - but I’d rather not have the base primitive have these limitations. Would like you to be able to build your own upload queues.
  • CW: Very much agree advanced users will do their own UploadQueue, which is why setSubData is not enough. The problem is that an application may try to create a race, and you as the spec need to prevent it from doing so. Synchronous buffer mapping does have some race conditions, or stalls, or breaks some Web APIs. Guaranteed efficient also has some race conditions / client-side tracking.
  • JG: the client-side tracking is fairly trivial. I reimplemented it in JS and it’s not very long.
  • CW: not saying it’s lots of lines of code; just a lot of data to keep on the client side.
  • JG: Not a lot of data. One field per queue, per fence, and per buffer. (See the tracking sketch at the end of this section.)
  • CW: in guaranteed-efficient mapping, the proposal says something like “...when buffers are used ... they’re unmapped,” and “executing, and when their fence id is updated...” - you have to do this tracking on the client side. Per bind group, you have to remember all buffers in use. Same for render bundles, etc.
  • JG: Okay, but that doesn’t seem bad. We’re sort of in a local minimum right now, where the amount of data we have to track on the client side is really low -- which is cool, but it does limit us. If we do track some data on the client side, then we can do more.
  • CW: question: if I create a bindgroup containing a buffer; i do a bunch of stuff leaving the bindgroup in error state; and put things in a queue. Does the buffer get uploaded? You’re in error state.
  • JG: No, it’s submitted, it’s just internally null. Submit on the server-side is a no-op but it’s still submitted.
  • CW: so in the spec we’ll have client- and server-side stuff?
  • JG: don’t think we need it. Same primitives still apply. Submit is an error on the server side, but on the client you can’t tell whether it’s an error or not till you hear from the server. And the only way to wait on the client is on a fence.
  • CW: But basically that means, in the spec, if the bind group is an error, we still need to return one that contains a list of buffers it has. And when we do a queue submit, even if it’s an error, we need it to have side effects. Look at all the command buffers and buffers referenced by it, update their fence ids, then do validation, and if it fails… that’s weird.
  • JG: true. But for this cost, you get the benefit that mappings are always direct. We’ve established in this investigation there’s no perfect way to do this. There are different constraints, and tradeoffs. We can’t pick something that breaks zero constraints.
  • CW: In your opinion, each of the different versions can implement all the others -- maybe with an extra copy here or there.
  • JG: with some overhead, yes.
  • CW: people caring about perf will implement the best for them. So we can pick the version that’s simplest to specify. People will implement the staging bit they need.
  • JG: OK. Then I think we can close all these PRs, because the simplest thing is what we already have.
  • MM: are we discussing in addition to setSubData, or replacing setSubData?
  • JG: want to know your feelings on both.
  • MM: If it’s in addition, then I don’t have much of an opinion. We have two desires: making data races hard, and giving developers a way to upload data without too much head-scratching. If we have two solutions - one that’s easy and one that’s hard - we’re willing to roughly accept any details of the hard one. Does that make sense?
  • JG: can I ask, how good do you want the easy mode to be? We already have a relatively easy mode with createBufferMapped.
  • MM: I opened up this conversation months ago because I think what we have now is not sufficient for developers. The common case for new developers, uploading a small amount of data per frame, is: allocate a new mapped buffer, and then forget to destroy it.
  • JG: I don’t think that’s any different from people using any other API poorly. I don’t feel a compulsion to make this API as-easy-to-use-as-possible. I think we should make it somewhat easy, and no easier.
  • CW: if you look at WebGPU you need some graphics knowledge, but I would argue it’s relatively easy to use.
  • MM: what we have today is easier to use than WebGL.
  • JG: I don’t think so. I’ve written both and I don’t think it’s easier. I’ve implemented a polyfill of WebGPU as of March on top of WebGL 2.0. BindGroups are a mess.
  • CW: that’s why we have getBindGroupLayout. :) There’s an idea that WebGPU is easy-ish. Let’s not debate vs. WebGL. It’s easy-ish like Metal is easy-ish. BindGroups are a mess, and immediate data uploads.
  • MM: I think I can add a little bit. When we released Metal, we heard feedback from many that using Metal is easier to use than OpenGL ES. I realize that’s not the specific topic of discussion, but it is at least possible to create an API that’s easier to use.
  • JG: at the same time, I don’t want to put Three.js in the browser and then try to make it fast. These are all constraints we’re working against; I have a different opinion about how much we should dull the safety scissors because we give them to end users. It’s pretty simple now because we’ve worked a long time on it. Also because we’ve trimmed it down to an MVP. Sort of easy to use now, but I don’t think the main users of this are going to be amateurs. I’m not working on this API for the amateurs; we already have APIs for the amateurs and other people make them (Three.js, Babylon.js). It’s great. In the spirit of providing a new and novel platform, I want to make it possible to do things, rather than impossible to mess things up. When you try to make these things hard to mess up you get other problems: buffer bloat, copy bloat, etc.
  • CW: There’s an example, which is Metal: it’s considered a simple API to use, and it’s still a powerful API.
  • JG: it is, but there’s not a lot of choice on that platform.
  • CW: the point is it’s a powerful API that’s simple to use.
  • JG: But it has a lot of the problems we’re trying to avoid. It has races. If we had the same constraints as Metal, we could provide racy APIs. But we can’t. We have more constraints.
  • CW: what we’re trying to get at is, setSubData (or whatever) is a small concession to make for newcomers to the API. We might have a different view of how many of these newcomers there will be.
  • MM: I think it’s worth mentioning that -- let’s pretend that the only users of this API are going to be full time 3D graphics devs working with this type of API. Those developers probably wouldn’t call setSubData. So the world you’re worried about where there are heuristics to make the browsers compete wouldn’t actually happen. We can sort of wager that if we provide both of them, and you’re right, that’s a win. If we’re wrong, then we’re providing something to a population that needs it -- and that’s also a win.
  • JG: but all this time on the web: we make these APIs, people publish benchmarks, and then we have to burn down perf problems in badly designed benchmarks. These generate months of work. There’s little indication that these help real world apps. If you introduce setSubData, the thing people compare between browsers isn’t Dota, Call of Duty, etc.; it’s Chrome Experiments, mostly implemented by amateurs. A lot of the time when I see something performing better in one browser than another, it’s because it’s implemented poorly. If we have setSubData, either people won’t touch it, or even the pros will say “this is good enough”, and then we’ll get a Firefox browser bug, and we have to spend engineering months to fix that slow path, or convince someone to use something else.
  • CW: Same issue if people don’t have setSubData and they use createBufferMapped.
  • JG: in one of those it’s deterministic and clear what happens.
  • MM: The bloat (# buffers created) kicks in, and that’s not deterministic. Depends on how the GC is implemented.
  • JG: people also create ArrayBuffers each frame before they realize they shouldn’t do that. Don’t want to adjust our API for the people that allocate megabytes and megabytes of storage per frame.
  • MM: Also don’t want to adjust our API because of the fear someone will write a bad benchmark. I want to provide a good service to our users.
  • JG: Ideally I’d like to see APIs that are deterministic in ways like guaranteed efficient mapping. It’s very clear when it’s going to work. In this proposal, even if the thing is ready, we never give it to you early. We only give it to you if we can prove that it’s ready. In that way, it’s deterministic.
  • CW: it’s the same thing for the other buffer mapping proposals - they’re all deterministic.
  • JG: they’re not because they rely on implicit fast paths that aren’t observable until you check for them. Whether you hit the fast path is dependent on browser internals.
  • CW: we’re getting into a new problem domain now. You were concerned about browser perf problems with setSubData, now we’re in the same situation with buffer mapping?
  • JG: Don’t think that’s what I was trying to say. I think setSubData goes too far in a dangerous direction to help users and will end up hurting a number of them, and definitely costing us development time on our ends to tune our heuristics to match whoever has the biggest browsers.
  • KN: I’m not confident that even with an advanced app / polyfill, they’ll get the best perf out of other primitives. If you have ~3 matrices, I think it’ll be fastest to do memcpys - they cost almost nothing for small data. I think there are situations where setSubData is the most performant, unless we do extensive other heuristics on the other buffer mapping paths. I don’t think any of these paths is immune to having to do weird heuristics.
  • JG: Some of these I don’t see as having heuristics for them. I don’t know how you’d put that into efficient mapping. It enforces best practice, and you’re forced to come face to face with it.
  • CW: efficient mapping keeps a shadow copy of things.
  • JG: synchronous mapping does.
  • CW: vs. client-side tracking, with fences / data races / etc. Going back up the stack: out of the more complicated solutions, I think we can convince ourselves that the current solution is the simplest because it’s airtight. It may not have synchrony, but it also doesn’t need extra copies, etc. to avoid data races. And I agree that mapAsync is not for the “Fruit Ninjas”. For those developers we want something better, and createBufferMapped is not it. I think we can go the extra mile; we need something for the Fruit Ninjas, and there will be a lot of them. Dzmitry sees people building random games on wgpu-rs. People are itching for a modern API they can use.
  • JG: It sounds like you’re saying you want them to use the entrypoint that they’re not supposed to use, but it’s easy.
  • CW: When I learned graphics, it was WebGL, and that was great because it was simpler than ES. I think a lot of people are going to be the same. They’re going to learn graphics with WebGPU and will start using setSubData and can grow to use mapAsync when they realize it’s a bottleneck.
  • JG: the problem is they won’t realize it’s a bottleneck because it’ll be fast on some browsers and not others.
  • KR: What about a standard-library staging belt? This would be what the WG recommends to everyone new coming to WebGPU: built on top of the mapping primitives, and we guarantee it’ll perform well on all the browser implementations. I personally don’t think we should have hidden copies and shadows. If we can eliminate those by providing a standard user-space library, then that seems like a big win. (See the staging-belt sketch at the end of this section.)
  • MM: is there an example of this ever working in the past?
  • JG: Yes. One for WebGL is the matrix libraries used in some of the demos in the WebGL repo. A lot of people use that and continue to use it for their matrix stuff. It’s not embedded in the API.
  • CW: I think it’s orthogonal to graphics APIs - since 2004 or so.
  • MM: procedural thing: if we do this we’ll probably have to change the charter.
  • JG: to say we’re producing a library?
  • KR: I’m just saying we should open source and recommend something if you're used to doing synchronous data uploads.
  • DM: I’m not sure what shadow copies you’re referring to. There are no more shadow copies in setSubData than in createBufferMapped.
  • JG: there are extra shadow copies when the buffer isn’t ready to be written into.
  • CW: setSubData is always pipelined.
  • JG: so it’s always slow.
  • CW: it’s always more copies than doing the sync thing.
  • JG: is that mandated? If not then it’ll happen client side.
  • CW: setSubData mandates that the copy’s pipelined.
  • JG: if a sufficiently advanced browser realizes the buffer’s ready to be copied into, it’ll happen client side, and be faster.
  • MM: how does the browser know no work will be submitted after setSubData but before enqueued that changes the result?
  • JG: this is queued-up setSubData?
  • MM: I see.
  • JG: I can nearly guarantee your impl will try to do this within ~12 months.
  • CW: we have so many things to do I can guarantee we won’t.
  • JG: It’ll be not important until someone’s MotionMark uses it.
  • KR: or some well-publicized application
  • DM: ...
  • JG: setSubData is not an optimal solution. I foresee it as something people will want to, and continue to, use because it’s not async. Then we’ll have to implement heuristics on top of it, and client side tracking will come home to roost because that’s how you implement fast buffer mapping. This is a real thing that real apps suffer with a lot.
  • DM: So we are not talking about stalls, right? We’re talking about whether or not there will be a copy of the data.
  • JG: Right. For instance, when we do video-to-texture optimizations, those are mandatory now, because you can’t copy 4K textures twice. This sort of copy minimization is something we will have to do.
  • DM: Does this apply to setSubData? I fail to see how we’d end up in the same situation.
  • JG: when you’re trying to upload a lot of streaming data to the GPU.
  • DM: So for videos, it goes through the staging all the time because they’re always in use.
  • CW: Jeff’s point was that there are some optimizations that become necessary because the size of the data grows, or because there are benchmarks that play videos for 7+ hours straight, so everyone has to reach the battery performance of the leader. If someone optimizes setSubData then everyone will have to.
  • JG: you’re allowed to not think this will happen, but I think it will happen and that’s one of the couple of reasons I don’t want setSubData. It’s fine for the WG to add it but I’ll voice against it.
  • CW: it’s nice when things are unanimous.
  • DM: I think a situation where we can tweak the impl without breaking the API isn’t as bad as you paint it. I’d like it if we could continue to optimize this post v1.
  • JG: The nice part for not having setSubData is that we never have to do that. It’s always done optimally by a dev choosing which polyfill to use. It’s always the same efficiency in every browser.
  • MM: I thought Kai just said that some cases setSubData would be faster.
  • KN: I think so - it’s my intuition - up to 1KB, or something.
  • RC: in this synchronous buffer-mapped world, you have to do the OnCompletion thing, which returns a Promise. Then do you have to manage Promises in Jeff’s proposal?
  • JG: for the synchronous one?
  • RC: have to wait on the OnCompletion thing, which returns a Promise.
  • JG: In order to ensure minimal copies. If you don’t care then you don’t have to wait on it.
  • CW: 10 minutes left. Lack of consensus here. Let’s move on to PR burndown.
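For illustration, a minimal sketch of the client-side tracking Jeff describes for guaranteed-efficient mapping (one field per queue, per fence, and per buffer). The names (TrackedQueue, lastUseSerial, tryMapWrite, etc.) are hypothetical and this is not the text of #511; it only shows the bookkeeping pattern under discussion.

```ts
// Hypothetical sketch of client-side tracking for "guaranteed efficient"
// mapping (#511). All names here are illustrative, not from the proposal.

type Serial = number;

class TrackedQueue {
  submitSerial: Serial = 0;     // bumped on every queue submit
  completedSerial: Serial = 0;  // advanced when a fence completion is observed
}

class TrackedBuffer {
  lastUseSerial: Serial = 0;    // last submit that referenced this buffer
}

// On submit: for every buffer referenced by the command buffers (including
// via bind groups and render bundles), remember the serial of this submit.
function onSubmit(queue: TrackedQueue, usedBuffers: TrackedBuffer[]): Serial {
  queue.submitSerial++;
  for (const b of usedBuffers) {
    b.lastUseSerial = queue.submitSerial;
  }
  return queue.submitSerial;
}

// When the app observes a fence value, the client learns which submits
// have completed on the GPU.
function onFenceCompleted(queue: TrackedQueue, serial: Serial): void {
  queue.completedSerial = Math.max(queue.completedSerial, serial);
}

// Guaranteed-efficient mapping succeeds only if the client can prove the
// GPU is done with the buffer; otherwise it fails loudly.
function tryMapWrite(queue: TrackedQueue, buffer: TrackedBuffer): boolean {
  return buffer.lastUseSerial <= queue.completedSerial;
}
```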
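Similarly, a rough sketch of the kind of standard-library “staging belt” Ken suggests, built on top of primitives that already exist. It assumes the 2019-era API shape where device.createBufferMapped() returns a [GPUBuffer, ArrayBuffer] pair; the helper name and signature are illustrative, not a committed design.

```ts
// Hypothetical setSubData-like helper in the spirit of the staging-belt idea.
// Assumes the 2019-era createBufferMapped(); dst must have COPY_DST usage.
function uploadToBuffer(
  device: GPUDevice,
  encoder: GPUCommandEncoder,
  dst: GPUBuffer,
  dstOffset: number,
  data: Uint8Array
): void {
  // Create a staging buffer that is born mapped, copy the data into it,
  // then enqueue a GPU-side copy into the destination buffer.
  const [staging, mapping] = device.createBufferMapped({
    size: data.byteLength,
    usage: GPUBufferUsage.COPY_SRC,
  });
  new Uint8Array(mapping).set(data);
  staging.unmap();
  encoder.copyBufferToBuffer(staging, 0, dst, dstOffset, data.byteLength);
  // A real "belt" would recycle its staging buffers across frames instead of
  // creating one per upload and leaving it to garbage collection.
}
```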

PR burndown

  • JG: half of our PRs are buffer mapping.
  • CW: we’ll skip these.
  • CW: there’s the programming model specification PR. Please take a look; if it looks OK let’s merge it.
  • #516
    • JG: sounds good to me
    • CW: it was creating complexity for async mapping
    • RC: LGTM
    • JF: thumbs up
  • #517
    • ImageHeight optional - zero is only valid if the depth of the copy is 1 (see the sketch after this list)
    • DM: should we default to the texture height instead of it only being allowed to be omitted when the copy depth is 1?
    • CW: if it’s texture height then if you have a buffer that’s the right size for your copy and a large texture, then you could have a buffer overrun.
    • JG: this is also the same as the default for other APIs, like OpenGL. I like it.
    • KN: you like having it default to be invalid, or default to the size of the texture? In the spec text I wrote, 0 is only valid if the copy doesn’t actually use it. I could go either way but like this default.
    • CW: need to discuss more offline?
    • KN: sure.
  • There was a PR for a depth16 unorm format, but it’s not supported on Metal so it was closed.
  • #518
    • Working around a limitation where it’s hard for IDL to expose this kind of object; also differences between “false” and “undefined”. (See the extensions sketch after this list.)
    • DM: how is this better than having a method which returns extensions?
    • KN: on the GPUDeviceDescriptor, it replaces the dictionary with an array, so if there are unknown entries, the browser can reject it instead of silently ignoring your request. Technically, this could be done with an object as well: take in an object with a bunch of keys and have the browser reject unknown keys.
    • DM: I see.
    • KN: people can review this offline.
  • #519
    • Renaming ImageHeight to something else
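To illustrate the #517 rule being discussed (imageHeight may be omitted, or zero, only when the copy is one image deep), here is a hypothetical pair of buffer-to-texture copies. The member names (rowPitch, imageHeight, arrayLayer) follow the 2019-era drafts and are assumptions here, not spec text.

```ts
// Hypothetical illustration of the #517 default: imageHeight omitted (or 0)
// is only valid when the copy depth is 1.
function exampleCopies(
  encoder: GPUCommandEncoder,
  srcBuffer: GPUBuffer,
  dstTexture: GPUTexture,
  dstArrayTexture: GPUTexture
): void {
  // Valid under the proposal: a depth-1 copy does not need imageHeight.
  encoder.copyBufferToTexture(
    { buffer: srcBuffer, offset: 0, rowPitch: 256 /* imageHeight omitted */ },
    { texture: dstTexture, mipLevel: 0, arrayLayer: 0, origin: { x: 0, y: 0, z: 0 } },
    { width: 64, height: 64, depth: 1 }
  );

  // Requires imageHeight: with depth > 1 the copy must know how tall each
  // image in the buffer is, so omitting it (or passing 0) would be invalid.
  encoder.copyBufferToTexture(
    { buffer: srcBuffer, offset: 0, rowPitch: 256, imageHeight: 64 },
    { texture: dstArrayTexture, mipLevel: 0, arrayLayer: 0, origin: { x: 0, y: 0, z: 0 } },
    { width: 64, height: 64, depth: 4 }
  );
}
```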
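And a sketch of the two request shapes compared in #518. The extension name used here (anisotropicFiltering / "anisotropic-filtering") is only an illustration of the pattern, and AdapterLike is a local stand-in type, not the real interface.

```ts
// Minimal local stand-in so the sketch is self-contained; the real
// GPUAdapter/GPUDevice interfaces are richer.
interface AdapterLike {
  requestDevice(descriptor: object): Promise<unknown>;
}

async function requestWithExtensions(adapter: AdapterLike) {
  // Dictionary-of-booleans shape (pre-#518): WebIDL silently drops unknown
  // dictionary members, so a typo or unsupported extension is ignored.
  const deviceA = await adapter.requestDevice({
    extensions: { anisotropicFiltering: true },
  });

  // Sequence-of-strings shape (#518): an entry the browser doesn't recognize
  // can be rejected outright instead of being silently dropped.
  const deviceB = await adapter.requestDevice({
    extensions: ["anisotropic-filtering"],
  });
  return [deviceA, deviceB];
}
```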

Agenda for next meeting

  • MM: what’s the path forward for buffer mapping? If we talk about this for another week will that help?
  • RC: we can ponder over the holidays what we want to do.
  • MM: One thing that might help is that if all the interested parties can consider what evidence, if presented, can make them reconsider. And then that can generate a bunch of tasks for us to see if that can be presented.
  • CW: yes, that’s good homework. Would still like to have buffer mapping maybe not as the primary subject of next meeting, but as a secondary subject.
  • CW: Kai wanted to suggest a change in how we do adapter discovery. It doesn’t provide more fingerprinting surface and is easier for developers to work with. We could do that.
  • MM: Think it’s important to have a plan for the keeping-data-on-chip problem, because one of the solutions might involve fundamentally changing how passes work. I think it’s worth talking about that sooner rather than later. Maybe in a month I’ll have a proposal?
  • CW: would be nice to see. Kind of a scary subject because it’ll be very complicated.
  • KN: yes, but exciting.
  • MM: worth talking about?
  • KN: I think so.
  • CW: I’m worried that the solution will be super complex.
  • CW: can put these two items tentatively on the agenda, then buffer mapping. If other things come up please add.