Minutes 2018 09 10

GPU Web 2018-09-10

Chair: Corentin

Scribe: Ken

Location: Google Meet

TL;DR

  • REGISTER FOR THE F2F
  • Push constants
    • Discussion on whether it could be an implementation detail or not.
    • Some usages of Metal’s setBytes could be slow
    • Suggestion to not have any size restriction and back it with a UBO if needed.
  • Aesthetics of copies
    • Agreement to go forward with the interface
    • Debate on whether native APIs constraints should be exposed or not.
    • Benchmarking would help reach a resolution, but it is difficult to get good data.
  • Details of command buffers
    • Agreement on Dzmitry’s proposal
    • Discussion whether top-level encoder and command buffer should be split concepts
  • Promisify getAdapter and createDevice
    • CW to gather some data.

Tentative agenda

  • Push constants #75
  • Aesthetics of copies #69
  • Details of command buffers (reusability, encoder vs. command buffer) #77
  • Promisify getAdapter and createDevice #67
  • “Don’t care” mechanics (stretch)
  • Agenda for next meeting

Attendance

  • Apple
    • Dean Jackson
    • Myles C. Maxfield
    • Thomas Denney
  • Google
    • Corentin Wallez
    • Dan Sinclair
    • Kai Ninomiya
    • Ken Russell
  • Intel
    • Yunchao He
  • Microsoft
    • Rafael Cintron
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
    • Markus Siglreithmaier
  • Elviss Strazdiņš
  • Joshua Groves

Administrative Notes

  • CW: Dean and I have to implement the CLA, but it’s basically there. If you were waiting to contribute under that license, it’s there now.
  • If you want to attend the WebGPU F2F on Thursday September 27, please contact Myles (or add your name to the wiki).

Push constants #75

  • CW: Intel did a nice investigation. Only a few details remain that people discussed there.
    • Do people have concerns with the concept of push constants?
  • MM: would be helpful to have an executive summary.
  • CW: inside big GPUs you have a small amount of memory you can write to very quickly. Usually used to store pointers to bind groups. Can also put indices in it. This can be per-draw data which is very fast to update. (See the sketch at the end of this section.)
  • MM: mechanism that lets a shader get data from the CPU?
  • CW: yes. Think of it as a very small number of OpenGL uniforms.
  • MM: doesn’t allow you to do anything you couldn’t do before?
  • CW: true
  • MM: rationale for MVP then?
  • CW: easy to implement. Orthogonal to rest of API. Reason to include: part of the binding model, so part of the core API. Every API implements some form of push constants.
  • MM: so, if we want to include it, we want it in the MVP. For a GPU driver / low level framework, makes sense to expose every single concept. But for something as high level as browser framework, not sure exposing it as API concept makes sense. Not saying we shouldn’t use push constants, and not arguing against performance. But if it’s expressible with the WebGPU API and can be used under the hood, doesn’t seem necessary to expose.
  • CW: the way you give data to a shader is via uniform buffers / storage buffers. These cannot be emulated on top of push constants. Inherently, push constants are CPU data put into the command buffer, so you can’t reference GPU data from push constants. Can’t get faster performance on workloads “if browser can use push constants” – that doesn’t work.
  • JG: push constants aren’t something we can use in browser backends heuristically. If we don’t expose it at the API surface, then we don’t let people take advantage of this concept.
  • MM: why does it have to be exposed at the API level?
  • JG: you can’t detect things that way.
  • CW: push constants are a way to give numeric constants to the shader. The only way to give data to shaders otherwise is uniform buffers. So the only way to make them faster is to make uniform buffers faster – no way to “just” do that. Push constants are CPU-side data sent to the shader.
  • MM: in Metal this isn’t how it works. If runtime knows the size and binding location of each uniform buffer, then why couldn’t it say “this one is a good candidate for push constants”?
  • CW: data for uniform buffers lives on the GPU and could have been produced by the GPU. Push constants come from the CPU.
  • MM: for data coming from the CPU to the GPU, those could be implemented by push constants.
  • CW: correct
  • MM: so for these usages let’s use push constants.
  • CW: it’s not possible to detect that you’re giving the shader data coming directly from the CPU. The data comes from a uniform buffer and you don’t know where it came from.
  • JG: you’d have to guess the usage.
  • MM: for an API using Metal’s setBytes call, you’d use constants.
  • JG: are you saying we should expose this under another form?
  • MM: we can have a new API which suggests that a different usage pattern is in use. Distinction between push constants and regular uniform buffers, where push constants have restrictions (small, alignment, etc.). Regular buffers don’t have this. Shame to add these restrictions.
  • JG: these restrictions add value.
  • MM: you can describe a data upload to the GPU and detect that it’s a good candidate.
  • JG: it would be an estimate. Don’t want this API to be doing estimation.
  • CW: we could have OpenGL-style uniforms and that could be implemented in the way you’re suggesting. But that would be WebGL 3.0 and we don’t want to do that.
  • MM: proposing a new API but not having any restrictions on size or alignment.
  • JG: so you’re proposing an API which could be both fast / not fast?
  • MM: yes, the user chooses and can try to conform to “fast” restrictions.
  • JG: concerned about needing to match certain restrictions in order to make things faster.
  • DJ: have this issue with Metal anyway. You don’t know what the restrictions are in advance.
  • JG: think this is a mistake. The other APIs don’t have this restriction.
  • CW: if there are restrictions in Metal, we would be fine including those restrictions inside WebGPU. They might not be Metal restrictions but would make Metal backends faster.
  • JG: if there’s a desire to propose a different API please propose it in an issue.
  • DJ: we’ll talk with the Metal team again. We know what the restrictions are but they’re non-public so we can’t talk about them now. We’ll get back to you.
  • CW: don’t quite understand why that affects the exposure of the concept of push constants.
  • JG: sounds like Metal already has the kinds of restrictions Myles was talking about. Better to maybe have explicit and implicit restrictions which would knock you off the fast path.
  • DJ: that’s right. If we clearly document the restrictions then it might be fast on Vulkan/D3D12 and not on Metal.
  • JG: could you please document those restrictions? Don’t want to hold back the API and make it a guessing game.
  • CW: if you could persuade the Metal team to provide the official restrictions that would be helpful.
  • KR: would restrict porting existing applications that use this concept.
  • DJ: we’re proposing we don’t include this concept.
  • DM: you’re saying we should not put any restriction on the constants, and have them be either CPU or GPU based?
  • MM: correct.
  • DM: if Metal has certain restrictions on what lengths, etc. are fast, that doesn’t sound that bad.
  • JG: don’t think this is a good reason to make it harder to understand perf characteristics on other backends. Falling off fast paths was one of the reasons OpenGL was suboptimal. We should try to do better than that. Part of good API design is having predictable performance.
  • MM: we can continue discussing this but suggest we defer until we can say more.
  • CW: could you make a proposal and make an issue?
  • MM: yes
  • CW: this is sort of whether and how WebGPU should be opinionated.
  • DM: one thing missing: we more or less agree … we don’t have a clear best use case for other APIs than Metal either. The situation isn’t clear on those APIs. So we might as well not have a limit.
  • JG: there are good suggestions in the bug, and lukewarm approval for pull request. Surprised this discussion is taking this long.
  • DM: 128 is a soft limit in D3D12. We could perhaps do better.
  • CW: let’s put it back on next week’s agenda.
  • RC: Q for Dean: PR has links to setBytes/setVertexBytes/setFragmentBytes. Are they saying not to use those?
  • MM: the behavior of that call is implementation-dependent. Some drivers will do some things and some will do others.
  • RC: does that include being slow?
  • MM: different drivers do different optimizations but they behave as specified.
  • CW: Vulkan and D3D12 were built with IHVs at the table so push constants should perform well there. And hopefully then they’ll perform well on Metal.
  • CW: let’s think about it more and come back to it next week.
  • CW: if Apple could run the proposal for constraints on push constants by the Metal team that would be helpful. Seems it would be the fast path on every driver because they’re fast on D3D12 and Vulkan.
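
To make the distinction concrete, here is a minimal TypeScript sketch, assuming a hypothetical setPushConstants method. WebGPU had no agreed push-constant API at this point, so every name below is illustrative only.

```ts
// Hypothetical sketch: setPushConstants and these interface names are not
// part of any agreed WebGPU API; they only illustrate the concept.

type GPUBindGroup = object; // placeholder for the real bind group type

interface PassEncoderWithPushConstants {
  // Push constants: a small amount of CPU-side data recorded directly into
  // the command buffer. Very fast to update per draw, but it cannot
  // reference data produced on the GPU.
  setPushConstants(offsetInBytes: number, data: ArrayBufferView): void;
  // Uniform/storage buffers via bind groups: the data lives in GPU memory
  // and may have been written by the GPU itself, which is why a browser
  // cannot silently rewrite such bindings as push constants.
  setBindGroup(index: number, bindGroup: GPUBindGroup): void;
  draw(vertexCount: number): void;
}

// Typical per-draw loop: a tiny, frequently-changing value (an object index
// here) goes in push constants; bulk data stays in buffers.
function drawObjects(
  pass: PassEncoderWithPushConstants,
  objects: { index: number; bindGroup: GPUBindGroup }[],
): void {
  for (const obj of objects) {
    pass.setBindGroup(0, obj.bindGroup);
    pass.setPushConstants(0, new Uint32Array([obj.index]));
    pass.draw(3);
  }
}
```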

Aesthetics of copies #69

  • CW: seems similar to push constants. Q is: do we want to have constraints so we’re always on the fast path? Or have behavior where sometimes you’re on the fast path and sometimes on the slow one?
  • MM: we need to know how much of a perf cliff this is, to know what flavor of API we want.
  • CW: seems that mobile (iOS) would be a place where if you go slow path you’d have much worse performance, or at least power consumption.
  • MM: one of us (or I) can do a bunch of benchmarks and see what the difference is between compute shader and blit. Then can make a decision together.
  • JG: could be hard to benchmark; are multiple engines involved? How loaded are they?
  • CW: is there an open-source mobile Vulkan driver?
  • KN: open-source NVIDIA driver might exist on mobile?
  • MM: if the requirement for this investigation is “buy a couple Android phones” seems fine.
  • JG: seems it’ll be hard to benchmark. Will have to evaluate whether those benchmarks are valid, etc.
  • MM: don’t think we have to figure out an exact percentage, just the rough magnitude: 2x, 5x, 10x, etc.
  • JG: might only show up when you’re loaded – e.g. compute engine is loaded but blit engine is free. Then it might matter a lot.
  • MM: sounds like you’re the best one to implement this benchmark.
  • JG: or we inherit the restrictions coming from these platform APIs.
  • CW: trying to see if we can come up with a benchmark using e.g. OpenGL or WebGL
  • JG: feels like, if we require these things to be aligned to 256 bytes, everything is fast. Annoying to have those requirements, but everything would work and be fast. (See the sketch at the end of this section.)
  • MM: have to understand that we’re providing a web API. Typed arrays don’t have this; postMessage doesn’t either.
  • KN: think for a high-performance application it’s worth the small amount of engineering overhead to make sure something’s aligned.
  • JG: produce errors early and often when unaligned data’s provided.
  • MM: hard to believe that people are arguing against collecting data.
  • JG: I can spend 3 days coming up with something or we can do what the docs tell us to do.
  • CW: given infinite engineering we should back everything by data. But we don’t have infinite engineering so for MVP we can follow the recommendations.
  • JG / MM: discussion about difficulty of writing representative benchmarks.
  • MM: info I’d like to gather is not 1% vs 2%, but rather general background information like 2x, 5x, 10x slower. The native API docs are not the same as web API docs.
  • JG: based on my experience writing a web graphics API for 7 years I think it’s fine to include these restrictions.
  • KN: even if the benchmark results tell us there isn’t a perf hit we can’t believe it. If it tells us there’s a big hit then we know where we stand.
  • CW: or even different hardware. Remember looking at the open-source Vulkan AMD driver and they implement it with a compute shader if the blit’s big enough. This is specific to AMD which has good compute capabilities. May not be true for Qualcomm, etc. Will be hard to create benchmark.
  • JG: the fact that these restrictions show up in both D3D12 and Vulkan means that they’re needed.
  • MM: not arguing that we shouldn’t have this, just that we should gather information and come back to this.
  • CW: if you want to … I’m worried that it will not give actionable results.
  • JG: don’t want you to go do a bunch of benchmarking and come back with no good result.
  • CW: the title of this agenda item is “aesthetics of copies” and we haven’t touched on that at all.
  • JG: think it’s fine now. Don’t see a way to make it better.
  • MM: think the user interface is fine.
  • JG: it’s pretty gross but sort of intractable grossness.
  • RC: one word of caution: in WebXR API we had a “Point” type. Got some flak “Why don’t you use DOMPoint / DOMMatrix”?
  • JG: oh no
  • RC: that was my reaction too. If we get that reaction then maybe we should add those parameters or add more parameters to these copy functions. (Objection might be, “but that’s a rectangle!”)
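
For reference, a minimal sketch of what an alignment restriction like the one discussed above would mean for applications. The 256-byte row-pitch rule is an assumption here, borrowed from D3D12’s D3D12_TEXTURE_DATA_PITCH_ALIGNMENT; WebGPU had not adopted any particular value.

```ts
const ROW_PITCH_ALIGNMENT = 256; // assumed restriction, matching D3D12

function alignUp(value: number, alignment: number): number {
  return Math.ceil(value / alignment) * alignment;
}

// Repack tightly-packed RGBA8 pixel rows so that each row starts on a
// 256-byte boundary, keeping a buffer-to-texture copy on the fast path
// under such a restriction.
function packForCopy(
  pixels: Uint8Array,
  width: number,
  height: number,
): { data: Uint8Array; rowPitch: number } {
  const tightRow = width * 4; // 4 bytes per RGBA8 texel
  const rowPitch = alignUp(tightRow, ROW_PITCH_ALIGNMENT);
  const data = new Uint8Array(rowPitch * height);
  for (let y = 0; y < height; y++) {
    data.set(pixels.subarray(y * tightRow, (y + 1) * tightRow), y * rowPitch);
  }
  return { data, rowPitch };
}
```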

Details of command buffers #77

  • CW: DM made a pull request and people mostly agree. There were some naming concerns, and the question of whether the encoding object should be different from the built command buffer.
    • CW: think there are benefits to having separate encoder and built command buffer objects. Makes it clear you can share command buffers between workers but not command encoders.
    • In the type system, you have the concept of making command buffers on multiple threads, finishing them on those threads, and submitting them to another thread for queue submit.
  • RC: is the command buffer read-only once you have it?
  • CW: yes, once you have a command buffer it’s read only.
  • MM: every instance of a command buffer has one instance of an encoder?
  • KN: the encoder creates command buffers.
  • MM: how do I make one command buffer with both a compute and render pass in it?
  • CW: Device, I want a command encoder. Get a render pass encoder, do stuff. Similar for compute encoder. Then ask top-level encoder, give me the command buffer. Can’t do anything with that cmdbuf any more. At that point, you have a cmdbuf which is finished and immutable. (See the sketch at the end of this section.)
  • MM: impl will receive these command buffers, should one of these be a platform command buffer?
  • CW: on D3D12 and Vulkan, the command encoder will own a D3D12/Vulkan command buffer. When you say compile / end, it transfers ownership of that command buffer into the WebGPU CommandBuffer.
  • RC: no way to take one CommandBuffer, do some encoding with one type of encoder, finish that, and then do some more encoding with a different type?
  • CW: you can do that. There are 3 encoders, the top-level one and the pass-level ones.
  • DM: it is confusing though.
  • MM: I get an object representing a platform cmdbuf (...)
  • KN: you don’t get a cmdbuf until the very end.
  • MM: it’s confusing.
  • KN: it’s similar to what Metal has; only difference is there’s a finish() at the end.
  • JG: the PR we’re talking about makes it sound like there’s no CommandBufferEncoder any more.
  • KN: not sure now.
  • CW: PR everyone agreed to generally. In addition to that, should we separate the encoding and command buffer concepts? But maybe we can land that PR and make a follow-on one.
  • RC: so what’s in the PR now isn’t what we want?
  • KN: arguably.
  • JG: easiest path forward is to change the least amount. Could drop the change which gets rid of the command encoder.
  • MM: I bring up the distinction because Metal does occlusion queries on the command buffer, not the command encoder.
  • CW: interesting. How about we land the PR as is right now, and then make another PR so we know what we’re discussing?
  • JG: in the spirit of not thrashing concepts around, think it’s better to not get rid of CommandEncoder in this patch. If we land this as is and try to land a patch to reintroduce CommandEncoder it’ll be confusing.
  • CW: so could you update the PR to do that?
  • DM: would rather land the PR in the state that it is now, and maybe reintroduce the concept, under a different name.
  • KN: don’t think it makes that much of a difference.
  • DM: one of the main points of top level CommandEncoder is to let type system move things to separate thread. Do we really need that?
  • CW: yes. The way you take advantage of multithreading in these new APIs is that you encode in parallel and send object to other thread for submission.
  • DM: Metal doesn’t send.
  • MM: Metal has two ways to do multithreading.
  • DM: the other way you say commit() and it submits it to the queue.
  • CW: the difference in Metal is that you can ship encoders to multiple threads. Other APIs create encoders.
  • DM: Metal: multiple command buffers to multiple threads, commit them in whatever order you want.
  • MM: I see. You mean the submit function. Some APIs you can submit to a queue on multiple threads, but some have to send the cmdbuf to a single thread.
  • DM: exactly.
  • CW: think this would make it harder to order command buffers.
  • DM: it’s not like we’re inventing this model.
  • MM: in the model where you pass the object from one thread to another, the order you receive them is as undefined as in submitting multithreaded.
  • CW: when you submit you know for example which node in your scene graph they were associated with.
  • CW: let’s commit this PR and make a follow-on one for additional discussions.
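
A minimal sketch of the encoding flow described above. Method names (createCommandEncoder, beginRenderPass, finish) are placeholders for whatever the PR settles on; the point is the split between mutable encoders and the immutable command buffer.

```ts
// Hypothetical interfaces mirroring the proposal's shape, not a final API.
interface GPUCommandBuffer { readonly label?: string } // read-only once created
interface GPURenderPassEncoder { endPass(): void }
interface GPUComputePassEncoder { endPass(): void }
interface GPUCommandEncoder {
  beginRenderPass(): GPURenderPassEncoder;
  beginComputePass(): GPUComputePassEncoder;
  // Consumes the encoder and yields the finished, immutable command buffer;
  // nothing more can be recorded through this encoder afterwards.
  finish(): GPUCommandBuffer;
}
interface GPUDevice { createCommandEncoder(): GPUCommandEncoder }
interface GPUQueue { submit(buffers: GPUCommandBuffer[]): void }

// One command buffer containing both a render pass and a compute pass:
function encodeFrame(device: GPUDevice, queue: GPUQueue): void {
  const encoder = device.createCommandEncoder();

  const renderPass = encoder.beginRenderPass();
  // ... record draws ...
  renderPass.endPass();

  const computePass = encoder.beginComputePass();
  // ... record dispatches ...
  computePass.endPass();

  const commandBuffer = encoder.finish(); // immutable from this point on
  queue.submit([commandBuffer]); // submission could happen on another thread
}
```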

Promisify getAdapter and createDevice #67

  • CW: these can be slow, so let’s make them Promises. (See the sketch at the end of this section.)
  • DM: we need numbers. How slow is it?
  • JG: can’t speak to Vulkan and others, but in WebGL, device creation is slow enough that people do want promises.
  • CW: I can get data on CreateDevice. For GetAdapters, when we try to add instrumentation to Chrome to gather data, we found it was slow.
  • KR: on dual-GPU MacBook Pros for example context creation is slow enough that we would have liked to make Canvas.getContext async.
  • CW: you at least need to load D3D12.
  • DM: if the issue with D3D12 is that you need a device, let’s measure D3D12.
  • CW: I know Vulkan CreateDevice is slow. Can provide numbers. Think GetAdapter is slow too. I’ll investigate.
  • MM: thought in Chrome, these devices were all in the GPU process?
  • KR: But we need to do synchronous IPC to return the value.
  • KR: We measured that the cost of D3D11 device creation is at least 100 ms. We virtualize devices, so it’s not an issue in Chrome, but touching the platform device can be slow, so it could be worth making this async. Especially if we want to give the user more control over which GPU content runs on: all browsers will need to destroy / recreate resources, which can take many milliseconds. We regret not doing that in WebGL for errors in context creation; it left a spec wart. Probably better to make getAdapter return a promise.
  • KN: DM made the point, why can’t it just fail later if we have internal nullability?
  • CW: because we need the value, the extensions available on the device, etc. They’re not in the Maybe monad.
  • JG: you could do that, but part of setup is figuring out what your constraints are. You’ll be synchronizing right away.
  • DM: do you mean physical or logical device?
  • CW: they’re the same
  • DM: then we don’t need to wait
  • CW: but we need to wait for the physical adapter which is synchronous.
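
A minimal sketch of the promisified shape proposed in #67, with placeholder signatures (descriptor arguments elided, entry point assumed):

```ts
// Placeholder interfaces; only the Promise-returning shape is the point.
interface GPUDevice { /* limits, extensions, queues, ... */ }
interface GPUAdapter {
  // Device creation can touch the platform API (e.g. loading D3D12 or a
  // Vulkan ICD), so it returns a promise instead of blocking.
  createDevice(): Promise<GPUDevice>;
}
declare const gpu: {
  // Enumerating adapters may be slow too (e.g. waking a discrete GPU).
  getAdapter(): Promise<GPUAdapter>;
};

async function init(): Promise<GPUDevice> {
  const adapter = await gpu.getAdapter();
  return adapter.createDevice();
}
```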

“Don’t care” mechanics (stretch)

  • (ran out of time)

Agenda for next meeting

  • Making an agenda for the F2F
  • Push constants again
  • Command buffer vs. encoder
  • Promisify-ing stuff
  • Copies, if we have data
  • MM: Intel’s been doing great investigations, we could discuss those.
  • CW: let’s look at Intel’s investigations first, to see if we need to gather more data.