
Minutes 2018 07 11


GPU Web 2018-07-11

Chair:

Scribe:

Location: Google Hangout

TL;DR

  • Synchronization guarantees
    • Concern about compute barriers causing costly stalls on big GPUs.
    • Implementation can track resource ranges for more optimal barriers
    • Consensus:
      • Dispatches race with themselves
      • No cross dispatch races
      • UAVs race in a render pass
      • Opt-in races for compute deferred for discussion post-MVP
    • Suggestion to have barrier hints as the “best of explicit and implicit worlds”.
    • Discussion of whether we need compute passes at all, same for the blit encoder
  • Render target aliasing
    • Concern that a simple stack allocator isn’t enough; probably need more information from the application upfront.
    • Discuss next meeting.

Tentative agenda

  • Synchronization guarantees of WebGPU (PR 60)
  • Render target aliasing
  • Agenda for next meeting

Attendance

  • Apple
    • Dean Jackson
    • Myles C. Maxfield
  • Google
    • Corentin Wallez
    • Dan Sinclair
    • James Darpinian
    • Kai Ninomiya
    • Ken Russell
  • Microsoft
    • Chas Boyd
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
  • Elviss Strazdiņš
  • Joshua Groves
  • Tyler Larson

Administrative

  • Scheduling conflicts between some attendees (from Google and Intel in particular)
  • MM: we are inside the 3-month window for reservations at Apple’s campus.
    • Need to coordinate with the WebGL meeting.

Synchronization guarantees of WebGPU (PR 60)

  • What are the sync guarantees of UAVs inside compute passes?
  • JG: the way the PR is phrased, inside a render pass, the ordering of accesses to storage isn’t guaranteed by the API. But inside a compute pass, subsequent dispatch commands are well-defined. Inside a single dispatch command, accesses to a UAV aren’t well-defined.
  • CW: we can’t ever make accesses to UAVs inside a single dispatch well-ordered. Inside a compute pass, what do we do?
      1. No guarantees. You have to end / start a new compute pass.
      2. Guarantee that new dispatches can’t race with old ones.
      3. Validation enforces you can’t use the same storage twice in one compute pass.
    • Validation is a bit annoying because people will have to end/start compute passes all the time.
  • MM: in Metal, subsequent dispatches are well ordered. Concerned that people will write their code on a Mac, it will work on macOS, and nowhere else.
  • CW: for Metal, we’re also targeting iOS heavily, and there’s a small GPU there. You don’t lose much by enforcing synchronization because there’s low occupancy anyway. On a desktop GPU the same synchronization costs more because you lose more occupancy than on a mobile chip.
  • JG: can’t we insert barriers between each dispatch call that touches a storage resource?
  • CW: this enforces well-ordered-ness. On AMD if you have a bunch of small dispatches which touch disjoint regions of a storage buffer, you’d lose a lot of performance.
  • JG: could we make smaller-sized views into these objects which you attach to the actual storage resource, so we can see that they’re disjoint and not emit the barriers? Like breaking up your writes into chunks.
  • CW: if we have implicit barriers then we can do this; it’s an implementation detail. But it’s something that would leak into the performance of applications.
  • DM: not an impl detail if you make a view on the API side with a range.
  • CW: this is already the case for buffers, texture storage (texture view).
  • DM: we already specify the range of a buffer for a UAV...okay.
  • CW: yes, every API does this except maybe Metal where you give an offset and maybe not the size. But this is part of robust buffer access as well.
  • JG: implementations of the API would need those specified ranges to do the disjoint-range checking.
  • MM: so impl will have to have enough smarts to insert automatic barriers?
  • CW: I expect we’ll get yelled at by AAA game developers, at which point we’ll have to give them a way to do their own barriers for compute jobs.
  • JG: sounds like the most palatable approach. Opt-in where you can inject your own barriers.
  • MM: we would verify these?
  • JG: for safety, but not raciness.
  • CW: think of an algorithm where you use atomic operations, atomic exchange, etc. to make your thing racy but correct, and you need all the performance. Adding stuff to linked lists, for example. But for MVP, think it’s OK to have only well-ordered-ness across dispatches in compute passes.
  • DM: from API complexity standpoint why do we need compute passes?
  • CW: in a compute pass you can’t use a buffer both as storage and uniform, for example.
  • DM: it’s users restricting themselves? What does it give us, if we can accept raw dispatch calls and order them?
  • CW: the only barriers you would have are memory read/write barriers.
  • MM: D3D has a difference between UAV and usage barriers, for example. With this model you could guarantee only UAV barriers are inserted.
  • DM: seems like an impl detail of what barriers can be inserted. UAV barriers might well be more expensive than the others. If we are ordering the dispatch calls then the compute pass becomes rather useless.
  • CW: disagree. Tells you when you have to care vs. don’t care about UAV barriers.
    • BeginRenderPass -> iteration loop -> back.
    • Same thing for compute passes.
    • Only need to care about UAV barriers at the topmost level, along with usage transitions, etc. In compute pass iteration, track which resources were released when, insert barriers there. Scoping makes it easier to reason about things.
  • MM: disallows dispatches in the middle of render passes. Have to end render pass before starting compute passes.
  • CW: D3D12 allows you to spray dispatches wherever, but Vulkan requires you to not do dispatches in render passes.
  • DM: we think of UAV barriers at a higher level. Insert them between dispatch calls in your compute pass.
  • CW: at the end of a compute pass we can forget about most state from before. At the top level, with copies + render pass invocations + random dispatches, you need a lot of context to understand the state of resources. But in a compute pass, everything’s in the right layout.
  • DM: but you still have to check their layouts.
  • CW: during command buffer validation, know what’s going to be used with each usage in each pass.
  • DM: seems like a tiny implementation detail to do this per pass rather than per dispatch.
  • MM: helps with Metal compute encoders. Have to switch encoder between compute and 3D work.
  • DM: you also have the blit encoder, didn’t we remove that?
  • CW: would like it to be top-level.
  • JG: one benefit: with a compute pass, you can batch your barriers. Could we just allow people to volunteer their barriers, and we do all the usage migration if they don’t? And if they do, they can do it eagerly. The impl fixes things for you if you don’t get it right.
  • CW: it’s feasible. Not sure if it’s a good idea to mix implicit + explicit barriers. Right now our impl transitions things just before they’re used, but we’d want to travel back in time and put the transition in the earliest barrier batch so the resource is ready far ahead of when it’s used. If the app provides the points where it wants the transition, you sort of have to do it at that point.
  • JG: heuristics can be useful but make perf hard to estimate. Have to retune heuristics for new test cases. Batching yourself could be attractive rather than a sophisticated heuristic.
  • CW: yes, we had this debate about explicit vs. implicit.
  • JG: one of the desires was that implicit should be acceptable. But explicit can be useful. Would be nice to have best of both worlds. If we didn’t have to have a separate compute pass construct, would be nice.
  • MM: trying to understand idea. So semantically, barriers would have no behavior change, because if you mess them up, API would fix them for you?
  • JG: yes. API would tell you “you’re migrating this, but not using it in the new form, perf warning” for example.
  • MM: conformant implementation could do nothing with those barriers?
  • CW: yes.
  • MM: need to think this over.
  • JG: seems we’re tentatively happy to go down the base-implicit-barrier path. But it comes down to: push the PR forward as it is.
  • CW: yes, we’re pretty much in that state. Only question:
      1. Do we have compute passes, or no concept of compute passes?
      • JG: could be a different PR.
      • CW: interacts with this one though.
      2. This PR talks about opting into races post-MVP. What do people think of this?
      • JG: I’m for this.
      • DM: pass plus a flag is more complex than otherwise.
      • JG: you could have a flag on the whole encoder.
      • Encoders are command buffer builders.
      • DM: where did you get this terminology?
      • CW: Metal
      • JG: currently, you create an encoder, encode into it, and when you finalize, it gives you a command buffer.
      • DM: this is not Metal terminology. In Metal, the encoder is the pass. You encode a pass into a command buffer and then commit the buffer.
      • DM: why do we need to invent our own terminology?
      • CW: in Vulkan, the command buffer is both the recording and submission part. Less type safe.
      • JG: we could call this CommandBufferFactory.
      • CW: if you don’t have compute passes then the flag would be on the command buffer encoder / builder / whatever.
      • CW: we can visit this later. Don’t want it for the MVP.
      • DM: can we remove the flag from the PR and proceed?
      • CW: how about I tag it as maybe-post-MVP-feature?
      • JG: yup
  • Consensus (illustrated in the sketch at the end of this section):
    • Dispatches race with themselves
    • In MVP, dispatches do not race with each other, impl can be as smart as possible to avoid these
    • UAVs race within render passes
  • Still discussing:
    • Compute passes at all?
    • Post-MVP can we opt into races?
  • DM: do you have a case for transfer / blit passes to exist?
    • MM: I’ll ask about this.
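
To make the consensus concrete, here is a minimal sketch in today’s WebGPU JavaScript API (which postdates this meeting, so the names differ from what the group was discussing at the time). The shader, buffer size, and dispatch counts are made up for illustration; the point is only that two dispatches in one compute pass touching the same storage buffer are ordered by the implementation, with no explicit barrier calls.

```ts
// Sketch only: assumes a browser with WebGPU support.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

// A storage buffer both dispatches will touch.
const buf = device.createBuffer({
  size: 256,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});

// One trivial pipeline used for both dispatches; the second dispatch reads
// what the first one wrote.
const module = device.createShaderModule({
  code: `
    @group(0) @binding(0) var<storage, read_write> data: array<u32>;
    @compute @workgroup_size(1)
    fn main(@builtin(global_invocation_id) id: vec3<u32>) {
      data[id.x] = data[id.x] + 1u;
    }`,
});
const pipeline = device.createComputePipeline({
  layout: "auto",
  compute: { module, entryPoint: "main" },
});

// Bind group entries can bind a sub-range of a buffer ({ buffer, offset, size }),
// which is the range information DM and CW refer to above.
const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [{ binding: 0, resource: { buffer: buf, offset: 0, size: 256 } }],
});

const encoder = device.createCommandEncoder(); // the "command buffer builder" / encoder
const pass = encoder.beginComputePass();       // a compute pass scopes the dispatches
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(64); // first dispatch writes the buffer...
pass.dispatchWorkgroups(64); // ...second dispatch sees those writes: no cross-dispatch races
pass.end();
device.queue.submit([encoder.finish()]);       // finishing the encoder yields the command buffer
```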

Render target aliasing

  • CW: feedback from Metal team is that aliasing isn’t that useful except for render targets, in order to save memory when you have large frame graphs.
  • CW: the suggestion was to have some form of special allocator that reuses render target memory without reallocating (a strawman sketch follows at the end of this section).
  • MM: propose we delay this discussion – didn’t have time to follow up on it.
  • CW: one thing I was thinking about: in a frame graph, you have render targets A through E, and passes that read/write these various resources. It’s arbitrary which render targets have to live alongside each other. Feels like a stack allocator is not good enough; you can’t keep information that’s not at the beginning of the stack.
  • MM: is this maybe why Vulkan’s subpasses inside renderpasses require all this info up front?
  • CW: yes…
  • MM: actually it can’t be, because the programmer has to do all this themselves.
  • CW: render targets are allocated inside tile memory; between subpasses, the tile memory is aliased differently between render targets.
  • JG: it’s trying to do all the tile memory optimizations for you in a frame graph. Subpass dependencies let the implementation figure out the best combination to keep things in tile memory.
  • MM: convo with Metal team was not about perf, but about memory savings.
  • CW: my concern is that your frame graph can be arbitrarily complex. A scheme where you reuse memory fast may not work well. Fragmentation, etc. Unless the app specifies the whole graph up front.
  • MM: think it’ll have to be: programmer can describe how they want it to happen, and we either do it, or tell them why they did it wrong.
    • Right now, all render targets live forever. Render graph can reference anything.
    • If we want aliasing, someone may want to use a render target we’ve already reused.
    • App does something bad, we should report error.
    • Need to spec up front restrictions that future rendering will abide by.
  • DM: is fragmentation a really big concern? Allocation patterns would seem to allow easy reuse.
  • CW: need to think about it more. Have to look at Frostbite’s frame graph presentation from GDC.
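
As a strawman for the special-allocator suggestion above, here is a hypothetical application-side sketch in TypeScript (using today’s WebGPU type names such as GPUDevice and GPUTextureDescriptor; the pool itself is not part of any API). It only recycles whole textures by matching descriptors, which is the “reuse render target memory without reallocating” idea in its simplest form; true memory aliasing, and the fragmentation concern CW raises, would need the whole frame graph declared up front.

```ts
// Hypothetical pool for transient render targets: acquire() hands out a texture
// matching the descriptor, reusing an idle one from earlier in the frame when
// possible; release() marks it reusable once the last pass using it is recorded.
class TransientTargetPool {
  private free = new Map<string, GPUTexture[]>();

  constructor(private device: GPUDevice) {}

  private key(desc: GPUTextureDescriptor): string {
    return JSON.stringify([desc.size, desc.format, desc.usage, desc.sampleCount]);
  }

  acquire(desc: GPUTextureDescriptor): GPUTexture {
    const idle = this.free.get(this.key(desc));
    if (idle && idle.length > 0) return idle.pop()!; // reuse an existing allocation
    return this.device.createTexture(desc);          // otherwise allocate a new one
  }

  release(desc: GPUTextureDescriptor, tex: GPUTexture): void {
    const idle = this.free.get(this.key(desc)) ?? [];
    idle.push(tex);                                   // available for a later pass this frame
    this.free.set(this.key(desc), idle);
  }
}
```

Compared with a stack allocator, this pool does not care in what order targets are retired, which is the flexibility CW is asking for, but it also does nothing to pack allocations tightly.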

Other topics

  • CB: thinking about SIGGRAPH BOF on HLSL. Can we mention that there is a WebGPU group as part of W3C?
    • MM: yes. The group is public. We should promote these.
    • KN: the minutes for these meetings are public. Have statements on record that we’re looking at HLSL.

Agenda for next meeting

  • CW: Pass types, transfer and compute needed at all?
  • MM: follow up on aliasing of render targets
  • CW: work on scheduling of F2F, coordinating with WebGL WG
  • KN (added later): https://github.com/gpuweb/gpuweb/pull/62