Minutes 2018 02 21

GPU Web 2018-02-21

Chair: Corentin

Scribe: Ken

Location: Google Hangout

Minutes from last meeting

TL;DR

Feedback from Roblox https://github.com/gpuweb/gpuweb/issues/48
- Resource allocation: Roblox saw big gains from suballocating themselves
  - Suballocation pros: knows which resources have the same lifetime and allocate once. Cheaper big unclear why, need more data
  - Suballocation cons: exposes platforms differences, aliasing
  - Consensus: suballocation could be useful but tricky, can add after MVP
- Implicit UAV barriers in Metal vs. explicit ones
  - Roblox found that the implicit UAV barriers in Metal drops perf.
  - Happens when two dispatches use the same buffer, Metal implicitly makes the second wait for the first. Explicit barrier in D3D12 and Vulkan.
  - Portability concern about application relying on overlapping dispatches.
  - Ideas: synchronization happens at compute pass boundary, SSBOReadOnly “usage”.
- Memory barriers
  - Vk mem barriers too hard for everyone, D3D barriers are better. Echos feedback heard internally by IHVs at Google.
  - Need more feedback from ISVs
  - Mapped D3D barriers to Vk is “not hard”
Opt-in non-portable behavior
- Application can decide to have non-portable behavior, in core spec or extension.
- Concern that if some big player starts relying on the extension, all browser will need the extension, making the API non-portable.
- Extensions are for things that are strictly better if you have than if you don’t (render to float in WebGL).

Tentative agenda

Buffer mapping, esp. persistently mapped buffers
Agenda for next meeting

Attendance

Apple
Myles C. Maxfield
Google
- Corentin Wallez
- Kai Ninomiya
- Ken Russell
Microsoft
- Chas Boyd
- Rafael Cintron
Mozilla
- Dzmitry Malyshau
- Jeff Gilbert
- Markus Siglreithmaier
Yandex
- Kirill Dmitrenko

Feedback from ISV (Roblox)

Makes a big mobile game engine / universe for kids
https://github.com/gpuweb/gpuweb/issues/48#issuecomment-367364417
DM: Arseny is TD at Roblox, responsible for all client-side rendering, optimize well, works on Metal 1.0, Vulkan, D3D12
Asked a few questions about overhead of various APIs. Got pretty detailed answers on most topics.
From what got: we seem generally on the right track.
On the issues where we don’t know what to do (multiple queues, resource allocations), they’ve given us good clues as to what to expect and pursue.
Resource allocation
- Esp. for resource allocation: relying on the driver does not work. They have to allocate large blocks and subdivide them themselves.
- Different memory types of Vulkan didn’t pose a big problem for them. They have a fairly generic solution which picks the right memory type and works with it.
- CW: would be good to know the source of overhead when the driver takes over resource allocation. What could it be doing worse than the application?
- DM: not a lot of detail on why the driver was slow. Can ask Apple for clarification.
- MM: can’t tell you off the top of my head. Could do research.
- JG: maybe driver has to use heuristics.
- CW: don’t disagree. Would still be nice to have more data on this. Feels like they can easily sub-allocate and manage their own memory. An implementation of WebGPU should be able to do the same unless they’re doing something very application specific.
- MM: worth mentioning that the allocations in a web app’s point of view come from the browser. Rather than the browser passing all calls to the driver, could do something in the middle.
- JG: why to prefer sub-allocation: one big synchronization point with all of these APIs is having to check resource allocations for success. Sub-allocations always succeed as long as they pass our validation.
- MM: even with fragmentation? Your resource can get fragmented in the same way all memory can.
- JG: if you’re choosing where in the heap you put your resource there’s no fragmentation you don’t know about.
- KN: these sub-allocations don’t prevent you from making overlapping sub-allocations.
- MM: the aliasing possibilities are a little scary.
- CW: since that’s a big reason for suballocation, if we want to ..
- JG: without explicit suballocation – with implicit suballocation – you have fragmentation issues in our pseudo-driver. Otherwise you have to round-trip to the driver.
- KN: if we implement a sub-allocation strategy in the browser, and in the driver we do allocations and try to figure out where to put the sub-allocations, we can make it fast to figure out whether you got a sub-allocation because we can track the allocations in the client side.
- JG: if it’s automatically handled by the server, we’d need heuristics in the browser to figure out where to place things.
- MM: a little surprised that you think an app could do better at hiding fragmentation than the browser.
- JG: browsers can have arena allocations for textures of certain size, and some other strategy for other objects.
- CW: depending on app: you can know to allocate all buffers, textures, etc. for one “chunk”/”game level” at once, and destroy all of it at once.
- JG: D3D12 has concept of placed / (managed?) resources. Create a mini-heap just for that object. Or place heap in a certain spot.
- CW: in D3D12 and Vulkan, the number of heaps is limited by WDDM. Residency is per memory object / memory heap. Naive solution of browser creating a mini-heap per resource definitely does not work. Would run into limits very quickly.
- JG: ok. that’s what the driver conceptually says it’s doing, anyway. That style seems to be something you can do.
- CW: CreateManagedResource, CreateCommittedResource (creates mini-heap, not fast), CreatePlacedResource (placed explicitly inside memory heap).
- CW: one easy solution that would work: expose memory types, tell app “create memory heap and place resources in it”. even in Vulkan, resources you place in heap are opaque, and maybe don’t even alias inside the heap even if you tell them to.
- CW: that’s one solution that could work. Concern: it’s hard to use, somewhat non-portable (exposes differences in system), and aliasing is hard/scary. That’s why not everyone is comfortable with it.
- MM: good start. Point: app won’t check for allocations to succeed if they come out of a heap: that poses a problem if the allocation might not actually end up in the heap, since the allocation will fail.
  - More simple: add heaps if needed in version 1, and leave them out of MVP?
  - Browser can try to do sub-allocations if it wants to? If it’s not sufficient, can add it later?
- DM: if we do sub-allocations in the browser right now, an app library can theoretically do the same thing. Don’t see why we should put that in the browser.
- MM: that’s what I was thinking. If every API that takes resources also takes (offset, length), then apps can do it themselves.
- JG: that’s just doing suballocations, right?
- MM: yes
- JG: suballocations aren’t strictly needed for an MVP, but they are a thing we should be looking forward to doing in the API.
- MM: agree. Valuable to consider, but don’t need to be in the MVP.
- DM: is it generally possible to release a library inside the browser which we intend to move to userspace later?
- MM: not really. Web usually describes thing as a JS library and then considers folding it into the browser. For example: position:sticky.
- DM: was thinking, formalize API, and consider moving it out.
- JG: can model implicit allocation on top of explicit allocation, but not vice versa.
- DM: moving the implicit out, once we support explicit allocation in the API.
- CW: that’s something we should consider doing. When you move these things out you have to spec explicit allocation at the same time. Same thing as spec’ing explicit instead of implicit.
- JG: conclusion: for now, resource sub-allocation not required for MVP
- CW: would like to have more investigation on this.
- JG: defer for now.
- DM: need feedback from Apple. Why is Metal driver not allocating efficiently?
Multiple Queues
- DM: this is another mystery why Metal can not do async compute under the hood.
- CW: text doesn’t talk about multiple queues. If you’re in a compute pass and issue two dispatches that use the same resources, there’s an implicit WaitForIdle in between. Don’t think we’re talking about multiple queues.
- DM: how would you run two dispatches in parallel w/o multiple queues?
- CW: example about dispatching two workloads
- CW: different workloads can access same SSBO in Vulkan. In Metal, if you issue two dispatches that refer to the same resource, the driver waits for the first one to complete before starting the second.
- DM: so, not about multiple queues, as much as about memory barriers / UAV barriers.
- JG: possible to use implicit resources to work around this?
- MM: don’t know. Totally feasible that the driver can see past what you’re doing and see that you’re aliasing.
- CW: WebGPU can’t guarantee that your two dispatches can run in parallel, because they can’t on Metal. Should we have explicit SSBO barrier or make it implicit as in Metal?
- MM: just because they could execute at the same time doesn’t mean they will.
- DM: if no semaphores between queues: should run at the same time.
- MM: not required per spec.
- CW: think this is not about multiple queues, but about SSBO / UAV barriers. That’s something we have to decide: do we have UAV barriers or will they be implicit?
- DM: in your model, if we implicitly see a UAV buffer, do we insert a barrier like Metal does?
- MM: right, because there’s a race.
- DM: there’s inherently a race -- it’s an UAV.
- MM: how far should we go in protecting against these?
- CW: personally, don’t have enough data to know. Somewhere between explicit UAV barriers and fully implicit.
- MM: should there be a barrier, either implicit or explicit?
- CW: can’t prove correctness of code using SSBOs.
- MM: concerning case: trying to implement a mutex by having one dispatch modify a buffer and a second one which tries to read that value. If the dispatches can execute in parallel, eventually both dispatches will complete. But if they can’t, then they will never complete, and that puts Metal at a disadvantage.
  - While loop which polls this shared variable and expects it to get updated by the other dispatch
  - CW: you’re saying that a kernel could be written to require parallel execution and that’s non-portable
  - MM: yes.
- DM: have numbers showing double-digit percentage performance improvements, at the cost of a slight amount of portability.
- DM: there will certainly be ways for users to mess up. This isn’t the only one.
- CW: while researching NXT, thought: using SSBOs, can screw yourself up, even in the same dispatch. So we didn’t worry about the multiple dispatch case.
- RC: how does Metal look and see that it has to do certain things in sequence?
- MM: good question.
  - 1. Metal 1.0: application attaches every resource, CPU inserts barriers.
  - 1. Metal 2.0: GPU-driven rendering where shader attaches resources: you can mess yourself up.
- RC: does Metal check reading from output / writing to input?
- CW: in Metal, buffer bindings per stage, like D3D11. Can look at shader and have bitfields of things written and read. Set bits on dispatch and compare to see if a previously written resource is being read.
- RC: so Metal driver can say, I know what resources you’re using and they’re mutually exclusive?
- MM: not quite right. Metal shaders use raw pointers, and you don’t know what resource they’re pointing to. If two subsequent dispatches have a nonempty intersection of attached resources, they get a barrier.
- RC: so even though you explicitly attach things you can still mess things up due to use of pointers?
- MM: no. Metal says, in dispatch #1, you have rsrcs A, B, C. Doesn’t know you’re reading/writing. Dispatch #2: C, D, E attached. Because those sets intersect, it says it has to insert a barrier.
- RC: so it assumes the worst from you?
- MM: yes.
- CW: In NXT, buffers have a storage usage / storage mode. More and more, it becomes clearer that we should have STORAGE_READONLY. In compute, can tell system that buffers are only for reading.
- MM: think that’s a good direction for this API to go. All currently active shading language suggestions can figure out which resources are read from or written to. API can do the right thing.
- DM: another proposal: what if we take advantage of compute pass notion, and say that everything inside a pass is not synchronized by default?
- CW: interesting idea. Basically the same thing as having compute passes and UAV barriers, just expressed differently.
  - Difference between this and UAV barrier seems ergonomic.
- DM: we already have passes. We were debating whether to have separate compute and blit passes.
- MM: though we agreed to have compute passes.
- DM: so we can have that semantic.
- MM: compare to render passes, where what you proposed is already the case. Barriers would occur at draw call boundaries. Probably need more time to think.
- CW: interesting point.
Memory barriers
- DM: all memory barriers suck. D3D’s are slightly better. Vulkan memory barriers are gotten wrong by users, driver developers, library developers, and probably others.
- Don’t have numbers to know what they’re providing, but they are hard and still being investigated.
- MM: perhaps agree: if we have memory barriers then can we say that they’ll be simpler than Vulkan’s?
- CW: multiple presentations from Android team have been pointing out that barriers are slowing things down because they don’t know how to use them, and the validation layers are long.
- CW: if we have memory barriers, I think they should not be as expressive as Vulkan’s.
- DM: more feedback: was inquiring about frame graphs from Yuri O’Donnell from Frostbite. Was hoping framegraphs would address mem barriers’ usability.
  - It’s not what they’re solving at least at the moment.
  - Framegraphs mostly just work as a nice conceptual abstraction, not as an efficient barrier abstraction.
  - ...to throw more stones into memory barriers.
- CW: not sure if mentioned: don’t think WebGPU should provide a framegraph API.
- DM: agreed, that wasn’t proposed.
- DM: would like to get more opinions from ISVs on memory barriers, as well as some perf numbers. But agree that fully explicit memory barriers do not seem feasible for us.
- CW: we’ll have to validate memory barriers in order to guarantee that you can’t see other peoples’ data. Validating full memory barriers is more work than validating simpler ones.
- JG: if we had Vulkan style, you could validate them and pass them to the driver. Less expressive: you’d have to generate the more expressive ones.
- CW: that’s what we do in NXT. We have roughly D3D12 barriers, and generating Vulkan barriers from them was relatively simple.
- MM: how many switch statements?
- CW: (large handwave) not very many.
- DM: does barrier insertion code work with multiple queues?
- CW: single queue right now, but that’s because multiple queues require semaphores and that’s where the complexity would come from.
- MM: you’re saying that the transitions will all be serialized.
- CW: saying complexity of mapping D3D12 barriers to Vulkan’s is mostly orthogonal to multiple queues.
- DM: the complexity doesn’t come from D3D12 / Vulkan, but because NXT doesn’t have the source state.
- CW: without focusing on NXT’s design...submissions are well ordered.
- JG: multiple queues -> dependency graph traversal.
- CW: no, Vulkan requires you to submit to queues in the right order.
- JG: that’s what I meant. Have to ensure that all prerequisites for a graph node will be finished by the time you need them.
- CW: yes, agree, that’s what I meant.
- DM: seems easy to convert when you’re serializing to a single queue.
- CW: NXT only has the target state, and derives the from and to states. Those are still D3D12 style. Doesn’t matter that NXT knows the from state.
Shading languages
- DM: Arseny expressed big concern about using MSL. Doesn’t think it’s a good idea to use it. What’s interesting is that they’re converging on a SPIR-V based pipeline even though they’re using HLSL.
- MM: hard to take a conclusion from that.
Resource binding
- DM: their performance showed significant slowdown of Metal because they were using Metal 1.0 without argument buffers. Considers the Vulkan binding model to be a good one for us.
- DM: we previously said we have an implementation of descriptor sets backed by Metal argument buffers. Any concern about efficiency? Should we do it the other way around?
- CW: argument buffers are from GPU driven rendering, and can produce them from inside shaders as well.
- MM: there are two of these, including being able to switch two resources at once.
- CW: argument buffers seem more descriptive than descriptor sets.
- MM: thought we agreed to use descriptor sets already. Metal 1.0 will be slow; fast on Metal 2.0.
Administrative issues
- DM: we can split this feedback into the sub-issues filed.
- CW: not sure what’s best. Don’t have issues for some of these.
- MM: think we shouldn’t split it up because it’ll make it harder to find the feedback, but do whatever you want.
CW: great to have ISV feedback. Should more proactively reach out. Thanks for doing it. How did you reach out to Roblox?
- DM: was just referred to Yuri.

Buffer mapping, esp. persistently mapped buffers

Not discussed

Opt-in non-portable behavior

DM: feasible for an API to provide a less portable layer, much like Rust has unsafe? Dereference pointers. What if the API that are marked as non-portable? e.g. non-coherent memory flushes for persistent mappings.
CW: maybe as an extension.
MM: Apple would be opposed to this. Portability is defined by the least portable part of your API, not the most portable part.
DM: but portability is not like security. People will look for the less portable parts. People will not be shocked when something isn’t portable.
JG: helps with responding to bug reports. What did you use? “The non-portable API.”
MM: solution is to not create the non-portable API in the first place.
CW: that’s why an extension might be a good solution.
MM: this is something we’ve thought about a fair amount. Feel that function calls should be in the API or not - no extensions. If the call’s in an extension but the web relies on that extension, it should be in core.
JG: think it helps. Having an extension lets you add features without requiring they’re present everywhere. For example, render-to-float-texture. Where it’s not available, it’s not ready yet.
MM: in that case your web page is broken and it’s a browser bug.
JG: opt-in is useful. Developer when making it or choosing how to design their app knows which extensions they’re using. Even the errors are descriptive. Lets you polyfill when you don’t have full support.
CW: that assumes more extensions are better. Render-to-float is better than not being able to. But adding nonportable behavior extensions isn’t guaranteed to be a good idea.
MM: it’s either part of the web platform or not. Whether it’s part of extensions doesn’t matter.
CW: these expose hardware features. Exposing more of these features is better. Expose them when you can.
JG: guarantee that we don’t want to expose non-portable behavior if it’s not beneficial.
JG: arguing against a hard line: supported everywhere or it’s not.
MM: just arguing against adding an extension exposing non-portable behavior makes the web worse.
KR: Maybe make a proposal of what is in the non-portable extension and what isn’t. In WebGL there was a split between mobile and non-mobile (e.g. NPOT) and developers were not happy about it. Pushing back on developers caused them to fix their apps to use portable behavior (e.g. POT) which helped their apps be more portable.

Agenda for next meeting

More about buffer mapping, esp. persistently mapped buffers.
Based on this feedback: come back to some of these topics.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minutes 2018 02 21

GPU Web 2018-02-21

Minutes from last meeting

TL;DR

Tentative agenda

Attendance

Feedback from ISV (Roblox)

Buffer mapping, esp. persistently mapped buffers

Opt-in non-portable behavior

Agenda for next meeting

Clone this wiki locally