
Minutes 2019 10 28


GPU Web 2019-10-28

Chair: Corentin

Scribe: Ken / Kai

Location: Google Meet

TL;DR

  • #411 + #469 consensus to keep vertex input.
    • AI: KN to update #469
  • Buffer views: little progress.
  • Programmable blending #442
    • Raster order views allow accessing either a “framebuffer attachment” or a UAV such that overlapping fragment shader invocations access it disjointly, in rasterization order.
    • Has to be an extension as it is an extension in all APIs.
  • Managing on-chip memory #435
    • Metal has image blocks, and Vulkan has input attachments and multi-pass render passes for this.
    • Desire not to have a proliferation of backing-API-specific extensions.
    • Unclear what shape an API should take to target both Metal and Vulkan while remaining nice to use.

Tentative agenda

  • Vertex input #411 + #469
  • Buffer views
  • Multi-queue
  • Advanced feature investigation
  • PR burndown
  • Agenda for next meeting

Attendance

  • Apple
    • Dean Jackson
    • Justin Fan
    • Myles C. Maxfield
  • Google
    • Austin Eng
    • Corentin Wallez
    • James Darpinian
    • Kai Ninomiya
    • Ken Russell
    • Shrek Shao
  • Intel
    • Yunchao He
  • Microsoft
    • Chas Boyd
    • Damyan Pepper
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
  • Mehmet Oguz Derin

Administrivia

  • RC wanted to confirm F2F dates. Microsoft can host.
  • Wanted to confirm Feb 11-13 for the next F2F.
  • WebGL one day, WebGPU two days.
  • Sound fine with everyone?
    • JF: works for me.
    • CW: points out that people might have to travel the Friday of Valentine’s Day.
    • KN: could make the WebGL F2F short, and do it the third day to end early.
    • KR: sounds fine.
  • CW: OK, will tell Rafael this is fine and he’ll book a room.
  • CW: there were a bunch of PRs with action items that didn’t get done before the deadline, so this might be a short meeting.

Vertex input #411 + #469

  • CW: One part: do we need vertex input at all, or do we do programmable vertex input instead?
  • CW: if we do have vertex input, Kai had some clarifications to the naming conventions that made it easier to understand.
    • CW: there was an action item for Kai to rename something?
    • KN: thought it was still up in the air.
  • MM: I’d like to propose that we do not remove vertex input.
    • CW: I agree.
    • MM: why should this be done?
  • DM: we’re discussing whether this should be added, rather than removing something. You don’t really require vertex buffers to draw.
  • MM: two arguments we’re missing:
      1. Adding additional flexibility is something we’ve heard 3rd-party authors ask for. They want fewer shaders, not affected by vertex format. Specializing per vertex format increases the number of shader variants they need, esp. compared to other APIs.
      2. Regarding the benchmark I (MM) posted: on some platforms, vertex fetch is faster. If web authors describe vertex fetch separately from the shader, then the impl can decide whether to do vertex fetch in the shader if the HW benefits from it. If authors have to put vertex pulling inside the vertex shader, it’s harder / impossible to do that. Better for performance and ergonomics.
  • DM: counterpoints:
      1. You’d still have to create a different pipeline. Whether the shader changes is just a matter of small convenience.
      • MM: it’s not a perf goal. The goal is asset management: number of on-disk files, shaders that have to be generated by an offline compilation process.
      2. It’s good to have numbers, but this benchmark is very artificial. Don’t think this one benchmark should govern our decision. Hard to believe a real-world program would go slower based on vertex fetches. The question of whether an impl would switch back and forth... if we have vertex attributes we won’t be messing with the binding model.
      • KN: I was thinking the same. If the HW prefers one to the other, the driver should be doing it for us. We shouldn’t be doing the transformation anyway. The driver will do it if it’s best for the hardware.
  • DM: I can keep defending this position, but I think there’ll be an uproar if we remove vertex attributes.
  • MM: what do you think Damyan?
  • DP: performance-wise, PC ecosystem’s moving toward vertex pulling. One big challenge is for tooling. Arbitrary pulling from shader makes it almost impossible. Can’t e.g. visualize a mesh if we don’t know the vertex format.
  • MM: think the industry’s moving toward vertex pulling doesn’t really affect this proposal. Options are (a) require vertex pulling and (b) allow vertex pulling.
  • DM: I don’t understand how the tools let you visualize meshes.
  • DP: you know which one’s the position.
  • DM: how?
  • DP: does WebGPU not do this? In HLSL / D3D you always have semantics associated with vertex attributes.
  • CW: in WebGPU there are no semantics. There’s just the special “output” position.
  • DM: that’s just the output. No semantics for inputs.
  • CW: other tools like RenderDoc let you visualize some attributes in some ways, e.g. “assume this is a 3D mesh”.
  • DM: could do the same thing with SSBO that’s bound.
  • MM: need vertex fetch annotations about how to look at the buffers. A tool visualizing SSBO would be out-of-band. Having annotations in the program might be more elegant.
  • CW: maybe you could get away with having no vertex input, but it sounds like everyone agrees that they’re useful and we should have them in WebGPU (the sketch below contrasts the two approaches). Sounds like we have consensus to keep vertex input?
    • JG: more or less. I think we should unfortunately keep it, but it was a cool exploration.
  • CW: now that we agree on vertex input, let’s look at #469.
  • KN: I was supposed to change the name. Thought we would merge it after the name change.
  • CW: OK.
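
A hedged sketch contrasting the two approaches discussed above. The names follow the WebGPU and WGSL specs as they were later standardized (not the 2019 IDL being renamed in #469), and `device` and `shaderModule` are assumed to exist; treat it as illustration, not the API of the time.

```js
// (a) Fixed-function vertex input: the pipeline descriptor declares the
// vertex format, so one shader can serve many formats by swapping layouts.
const pipeline = device.createRenderPipeline({
  layout: 'auto',
  vertex: {
    module: shaderModule,
    entryPoint: 'vs_main',
    buffers: [{
      arrayStride: 12,  // tightly packed float32x3 positions
      attributes: [{ shaderLocation: 0, offset: 0, format: 'float32x3' }],
    }],
  },
  fragment: {
    module: shaderModule,
    entryPoint: 'fs_main',
    targets: [{ format: 'bgra8unorm' }],
  },
});

// (b) Vertex pulling: no vertex state at all; the shader fetches from a
// storage buffer itself, baking the vertex format into the shader source.
const pullingWGSL = `
  @group(0) @binding(0) var<storage, read> positions : array<f32>;

  @vertex fn vs_main(@builtin(vertex_index) i : u32) -> @builtin(position) vec4f {
    return vec4f(positions[3u*i], positions[3u*i + 1u], positions[3u*i + 2u], 1.0);
  }`;
```

MM’s argument is that (a) leaves the implementation free to lower the fetch either way per hardware, while (b) pins the choice inside the shader.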

Buffer views

  • CW: upon writing the proposal, it turned out much worse than I thought. Mateusz came up with something that might be workable. I’ll make an updated proposal based on what he said. Maybe discuss next week?
    • DM: do you want to share what troubles you encountered?
  • CW: idea: you could split a buffer into sub-buffers. The buffer would be in a “partitioned” state: a tree of depth 1 where only the children are usable. Now suppose two children are mapped: can you merge them into a single mapped range? Or you want to use dynamic offsets into the whole buffer, using parts of the buffer in different ways, e.g. implementing a ring buffer, with dynamic offsets while different regions are mapped. Bunch of interesting problems: how do you specify a bind group on the whole buffer? Have to check that the dynamic offset is inside an unmapped region. Problem: the large thing is a GpuBuffer. The other idea is that the larger thing is a GpuMemoryPool that gives you Buffers. (A sketch of the hazard follows below.)
  • DM: don’t we have this problem with texture views today? Instead of offset, we have 3 different coordinates.
  • CW: the difference with textures is that there’s only one state. A texture starts “available”, then goes to “destroyed”.
  • DM: but it can be used differently in diff parts of the program.
  • CW: that’s an impl detail.
  • DM: so without mapping the perceived state is simpler.
  • CW: yes. It’s more complex in the underlying APIs because of layout transitions. The WebGPU state is a lot simpler.
  • Thanks to Corentin for doing this investigation so far and mentioning the pitfalls.
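
A minimal sketch of the hazard Corentin described. Everything here is hypothetical: `partition()`, the sub-buffer objects, and this usage combination exist in no API or proposal text; the sketch only makes the validation problem concrete.

```js
// Hypothetical: split one buffer into a depth-1 tree of sub-buffers, with
// only the children usable afterwards.
const buffer = device.createBuffer({
  size: 4096,
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.MAP_WRITE,  // illustrative only
});
const [front, back] = buffer.partition([0, 2048]);  // hypothetical API

await front.mapAsync(GPUMapMode.WRITE);  // one child mapped for writing...

// ...while a draw uses a bind group created on the WHOLE parent buffer with
// a dynamic offset (the ring-buffer pattern). Validation would have to prove,
// per draw, that the offset lands entirely inside an unmapped child.
pass.setBindGroup(0, wholeBufferBindGroup, [dynamicOffset]);
```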

Multi-queue

  • JG: postpone till next week.
  • MM: gentle weekly prod.

Advanced feature: Programmable blending #442

  • MM: keeping data on-chip, and programmable blending.
  • MM: I’d like to hear what Jeff has to say about programmable blending. https://github.com/gpuweb/gpuweb/issues/442
  • JG: seemed like, of the six, it’s one of the best matches of tractable and useful. Keeping things on-chip sounded like a more complicated discussion.
  • MM: couple of features. Sometimes you can’t do programmable blending via the framebuffer, but you can via UAV reads/writes. Vulkan has an extension that can achieve that?
  • MM: general options:
      1. Don’t have the feature.
      2. Have the feature, and expose it to web content like a framebuffer, even though it isn’t one under the hood. Slower.
      3. Expose it the way Metal / D3D do: another object you can do ordered reads/writes to.
  • CW: to clarify: (2) would say, on the framebuffer / renderpass, “these attachments will be a RasterOrderView”? In the shader, instead of values like the fragment color being initialized to 0, it’s initialized to the previous value?
  • MM: yes. Trick is: both renderpass and PSO have to know what’s going on so they can coordinate.
  • CW: do you know if there are validation constraints in D3D/Metal where RasterOrderView has to match size of framebuffer?
  • MM: can’t talk about D3D, but in Metal the size of the resource is distinct from which threads block which.
  • CB: kind of thing that might be enforced by existing drivers. We’re not seeing much adoption of this in the AAA space, feature-wise.
  • MM: any thoughts why that might be?
  • CB: maybe it’s not been promoted as an available feature. Wasn’t in some current consoles, too. Not sure. Will be supported more going forward.
  • MM: interesting case. The feature is something that people have asked for, but this particular solution isn’t. We might want to pick a fourth option nobody’s picked yet. Don’t want to expose a new object type.
  • CB: not sure how you want to expose it. Long-term, being able to preinitialize a variable in your shader with the value of the current fragment and operate on it is likely the way we want to go in the shading language, but mobile devices have work to do to make that come true. Today it’s a kind of view. That’s what a group is in Metal?
  • MM: yes. Another piece: already have some restrictions. If texture bound to framebuffer you can’t read/write in shader. I think we want those restrictions. If you want to treat texture as programmable blended texture, want to have framebuffer-like restrictions.
  • CW: makes sense. Hard to know perf penalty of doing framebuffer-side shader interlock thing. Shader-side, you widen the region during which pixel shaders can’t run at the same time on the same fragment. OK for mobile, scarier for desktop.
  • MM: did similar investigation inside the managing on-chip memory issue. Compared using regular framebuffer stores to using UAV writes. These UAV writes don’t use rasterizer order groups, just regular writes. 30-40% decline. Rasterizer order groups will make it worse. Pretty big hit for programmable blending.
  • JG: devsh.graphicsprogramming@ commented, one reason why you might see worse performance with UAVs is that there are certain fast paths for fragment output that are lost with UAVs. Raster Order Groups might retain that?
  • MM: in Metal/D3D a ROG requires UAV writes.
  • CW: in Metal / D3D12, ROGs don’t require you to read/write from the same pixel.
  • MM: you annotate the object that this is a ROG. Threads get synchronized: those that target the same framebuffer output (pixel, whatever), not the same location in the texture you’re reading / writing. Accesses that get synchronized are every read/write to the resource. If you have two threads that access same FB location, first one gets to execute all reads/writes across resource before second thread can do any of its accesses.
  • CB: essentially introduces barriers that will affect performance.
  • CW: seems this is implementable pretty efficiently on top of image blocks, in the APIs that have those (currently, only Metal). Maybe best to expose it as a renderpipeline + BeginRenderPass thing, which says, this attachment is raster ordered? This way it’s implementable efficiently on tilers? Should this sort of thing be an extension?
  • MM: has to be an extension. It is an extension in all 3 APIs.
  • CW: agree.
  • DM: if it’s in the render pass then how does the pipeline know about the needed shader changes?
  • MM: author has to know and say it up front.
  • MM: author has to tell the PSO that the output is actually an ROG one, and tell the renderpass that this particular output is a ROG one. At the time the pipeline state is set in the renderpass, that call has to make sure they match.
  • CW: then the fragment shader output variable will look pre-populated with the value from the previous pixel (see the sketch at the end of this section).
  • DM: will we have problems with limits if we treat attachment as resource, part of binding model?
  • MM: I didn’t consider that. What are Vulkan’s limits? 4 bind groups, how much in each bind group?
  • CW: 4 bind groups, each has 16? bindings?
  • MM: even if single bind group can have 100 bindings, probably can’t reserve one because of lifetime issues. Has to have reserved field inside, makes it pretty tough.
  • DM: yes.
  • CW: seems there’s nothing in Vulkan allowing implementing that this way right now.
  • MM: possible to do it another way?
  • CW: aliasing the current framebuffer? Or using it as a sub-pass input in addition to output, and putting in subpass barriers before every draw?
  • MM: is that well defined?
  • CW: definitely corner case in the spec. Wouldn’t trust driver to get it right.
  • MM: one action item: someone who knows Vulkan better should investigate it.
  • CW: I can take this. We have experts on render passes I can reach out to.
  • MM: overall direction I’m hearing: we should try to implement this like a framebuffer.
  • CW: sounds good.
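
A hedged sketch of what option (2) above could look like. No such WebGPU extension exists; the `rasterOrdered` flag and the pre-populated fragment output are hypothetical names invented here to show the shape, in which both the render pass and the pipeline opt in so the two can be validated against each other.

```js
// Hypothetical: flag an attachment as raster-ordered when beginning the pass.
const pass = encoder.beginRenderPass({
  colorAttachments: [{
    view: colorView,
    loadOp: 'load',
    storeOp: 'store',
    rasterOrdered: true,  // hypothetical flag, not in any spec
  }],
});
pass.setPipeline(blendPipeline);  // pipeline must declare the same attachment as raster-ordered

// Hypothetical shader side, as WGSL-flavored pseudocode: the fragment output
// arrives pre-populated with the attachment's current value instead of being
// write-only, so blending becomes plain shader arithmetic:
//
//   @fragment fn blend(@location(0) src : vec4f,
//                      @builtin(dst_color) dst : vec4f)  // hypothetical builtin
//                      -> @location(0) vec4f {
//     return src + dst * (1.0 - src.a);  // the "over" operator, in the shader
//   }
```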

Advanced feature: Managing on-chip memory #435

  • MM: relevant for deferred renderers.
  • MM: where the second pass doesn’t have to read from a neighborhood in the first pass.
  • MM: If the second pass only has to read from one specific location in the first pass, big speedups for tilers. Intermediate data can stay on-chip.
  • MM: made a little test. On an iPhone (forget which one), the perf improvement was 40% for one intermediate g-buffer plane. Pretty big one though; RGB16F. Deferred renderers mostly have more than one plane, so this estimate is conservative. (A sketch of today’s two-pass pattern follows at the end of this section.)
  • MM: significant performance we’re leaving on the table for tiled renderers.
  • MM: anything controversial so far? (No.)
  • MM: D3D12 has no explicit support for keeping data on chip?
  • DP: renderpasses is the solution for that.
  • MM: found the docs for renderpasses. Talked about how with RPs you can avoid loads/stores from main memory. But nothing about keeping data on-chip so that it guaranteed never leaves.
  • DP: we’re mostly talking about discrete GPUs. Main memory = system RAM. There’s local memory. Tilers = on-chip memory. (not sure I minuted this right)
  • MM: Vk has subpasses. Interesting b/c later subpasses have special function in shader to read output of previous subpass. No x/y coords because it can only read the same pixel. Metal has explicit control. Declare variable, this one lives on chip. Annotate it in both first and second shader; same annotation = memory persists.
  • MM: thoughts so far?
  • CW: problem with the Vk model: it’s hard to use correctly:
      1. Vk doesn’t have a notion of “this is guaranteed on-chip memory”; you only give hints to the driver.
      2. Apps have to declare an object graph, which makes it complicated to use. Instead of just attachment compatibility, you have this whole renderpass / multi-pass graph.
    • At least some effort by one title on mobile to use it, but that’s about it.
  • MM: Metal approach works great, big perf improvements. On the other hand, it requires significant changes to the shader. Have to deal with this memory yourself. Have to know the pixel depth of textures. Also suffers from the problem that it’s complicated to set up. Other thing important to Apple: don’t want to make 100 different extensions for one particular device. Want content to work in as many environments as we can. Having one extension working the same way Metal’s ImageBlocks work would probably not be a good solution. Most developers would forget it’s there. Would like all WebGPU apps to get the 40% speedup. Think we need to come up with a new solution different from the existing two, that’s easy for authors to turn on and get right.
  • CW: definitely agree that it would be sad to have extensions targeting only one API. On the other hand there are two mobile APIs, Metal and Vk, and I’m concerned that ecosystem hasn’t converged on a single solution. Vulkan render passes/loadop/storeop widely used, but subpasses not yet.
  • MM: The question this group should be considering here is: do we think this feature is worth investigating more? Do we want to try to get this 40% perf improvement? If yes, additional research is needed. But if we want to pursue, we can work on it more. I think 40% is a big deal and that we should try to pursue.
  • JG: I think we should try to pursue it. It is complicated.
  • MM: I haven’t written a benchmark on Vulkan because it’s hard. Is it true that using subpasses will keep data on-chip? Can we expect this kind of perf improvement on Android too?
  • CW: intent of multipass renderpasses is definitely to keep data on chip. That’s why the feature was designed. Designed to mesh well with desktop tilers with less explicit control over where memory lives. Definitely perf improvement should be there. Way back when I remember writing code exercising multipass renderpasses and it was difficult. I have very mixed feelings about it. Perf improvements are available, but will be hard to make it work.
  • MM: that answers the question. Should come up with a good looking API that gives you the benefit. Can’t promise anything soon but can try to think about this and come up with reasonable design.
  • CW: I’ll investigate more internally and ask about what other people at Google think about this kind of workload.
  • MM: OK. Think that’s all I’m prepared to talk about on these 2 issues.
  • DM: there is evidence of the perf improvement from Vulkan subpasses, linked from the issue.
  • MM: got it. ARM results from benchmarks.
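
To make the potential win concrete, a minimal sketch of the two-pass deferred pattern under discussion, using the WebGPU API shape as later standardized (which postdates this meeting). Today the G-buffer round-trips through memory between the passes even though pass 2 reads only its own pixel; that round trip is what subpasses / image blocks can eliminate.

```js
// Pass 1: write the G-buffer. storeOp: 'store' forces the tile contents out
// to memory at the end of the pass.
const gbufferPass = encoder.beginRenderPass({
  colorAttachments: [{
    view: gbufferView,  // e.g. one rgba16float G-buffer plane
    clearValue: { r: 0, g: 0, b: 0, a: 0 },
    loadOp: 'clear',
    storeOp: 'store',
  }],
});
// ... draw the scene geometry ...
gbufferPass.end();

// Pass 2: full-screen lighting. The G-buffer comes back in as a sampled
// texture, re-loaded from memory, even though each fragment only reads the
// texel at its own coordinates.
const lightingPass = encoder.beginRenderPass({
  colorAttachments: [{ view: backbufferView, loadOp: 'clear', storeOp: 'store' }],
});
lightingPass.setPipeline(lightingPipeline);
lightingPass.setBindGroup(0, gbufferBindGroup);  // binds gbufferView as a texture
lightingPass.draw(3);  // full-screen triangle
lightingPass.end();
```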

PR burndown

  • CW: nothing much since last week.
  • CW: spec editors: should we discuss initial GpuBuffer.destroy() text?
    • KN: don’t think we as the editors have additional input on it. If there’s more input, we can discuss it in this group. This is what we agreed upon at the F2F. Assuming everyone’s happy with what we agreed at the F2F, there’s nothing to discuss here.
    • CW: ok, can review offline.
  • CW: initial spec text for navigator.gpu. Small functional change split off?
    • DM: if there is a basis for the change, I just ask the basis be formalized somewhere. Don’t want this PR to be changed.

Agenda for next meeting

  • CW: updated proposal for buffer partitioning / buffer views.
  • Multi-queue.
  • KN: I’ll do the vertex input renaming, don’t have to discuss.
  • JG: volunteers to discuss issues?
  • JG: sparse resources?
    • MM: they’re super complicated on 2 of the 3 APIs. The other is more restrictive. May be able to discuss it.
  • JG: tessellation?
    • MM: can talk about.
  • CW: feel free to volunteer other issues. I’ll volunteer #426, clarify buffer usage validation rules.