GPU Web 2020 09 21

Chair: Corentin

Scribe: Austin

Location: Google Meet

Tentative agenda

Multiqueue (Dzmitry, Corentin)
- Multi-Queue Investigation #1065
- Strawman Multi-Queue Proposal #1066
- Multi-queue proposal with explicit transfers #1073
Method of ensuring GPUShaderModules can contain MTLLibraries #1064 (Myles)
Query API: Disallow resolve the unused queries or resolve them to 0? #1072 (Corentin?)
Should clear values for integers lose precision, or be emulated on D3D12? #1085 (Kai)
Copying of stencil aspect on Metal #652 (comment) (Austin/Kai)
PR burndown
- createBindGroup: Require a superset of the layout's bindings #1061 (Corentin)
- Add filtered texture and sampler binding types #1076 (Dzmitry)
- GPUColor: remove sequence overload from the union #1079 (Kai)
- ~~GPUOrigin/Extent: remove dictionary overload from the union #1081 (Kai)~~ Not ready
Agenda for next meeting

Attendance

Apple
- Myles C. Maxfield
Google
- Austin Eng
- Brandon Jones
- Corentin Wallez
- James Darpinian
- Kai Ninomiya
Kings Distributed Systems
- Daniel Desjardins
- Dominic Cerisano
Microsoft
- Damyan Pepper
- Rafael Cintron
Mozilla
- Dzmitry Malyshau
- Jeff Gilbert
Matijs Toonen
Mehmet Oguz Derin

Administrivia

CW: AI to send a poll for the VF2F.

Multiqueue (Dzmitry, Corentin)

Multi-Queue Investigation #1065
Strawman Multi-Queue Proposal #1066
Multi-queue proposal with explicit transfers #1073
DM: The way different native APIs see synchronization - Vulkan is most explicit. Makes sense to talk about Vulkan with the idea that if it works there it can work on D3D12 and Metal. In Vulkan, it can either be concurrent or exclusive. In exclusive mode, it can only be on one queue at a time. In concurrent, there are restrictions, particularly for textures and is a problem for performance. Can’t use certain compression or layouts. For exclusive access, the releasing queue needs a pipeline barrier, then a semaphore, and then another pipeline barrier on the acquiring queue.
JG: The releasing pipeline barrier is only if you want to keep the content of the texture?
DM: Yes.
DM: Goal was to see how much API we require to support running on multiple queues. The answer to that is actually none. We can do fully implicit synchronization of resources, and that’s what the strawman API is based on. HOwever, there’s a problem that if we try to implicitly handover a resource from old to new, we need a submission just for that pipeline barrier, and that’s an unfortunate cost we should not pay. The act of submission in Vulkan is considered costly.
DM: To address that, the strawman proposal has an API for explicit handover of textures when the command buffer is done. This allows us to issue pipeline barriers as we encode the command buffers.
MM: How about D3D and Metal?
DM: On D3D12, buffers don’t have the concurrent flag. They are always concurrent. Textures need to be synchronized in order to start using them on a different queue. They need to go through COMMON layout (decay). But you need to use a D3DFence between the submissions.
CW: To be slightly more explicit, you need a state transition to the common state if you want to use the resource on a COPY queue. It is much more “dumb” - doesn’t have shader cores.
RC: You have to make textures with SIMULTANEOUS_ACCESS flag as well if you want to use it on multiple queues.
CW: Lot of discussion about how copy queues would work, but it seems like the proposal would work with D3D12 copy queues. And it can be accessed as readonly on multiple queues.
CW: On Metal, it’s not clear how things work, because you can create multiple queues and synchronize them with Events, but it’s not clear how/if resources should be transitioned to multiple queues, and if you can use a resource on multiple queues simultaneously as readonly.
MM: I can take an AI to do that.
CW: Metal doesn’t have any kind of transition state, so the assumption is that the runtime does it for us, so that’s why we haven’t spent much thought on it. Any additional information would be super helpful.
DM: Two main questions: whether we require explicit synchronization - ownership transfer, and whether we allow textures to be simultaneously used on different queues.
DM: Corentin’s proposal is different on both of those points.
CW: DM’s proposal is fully implicit. There’s no concurrent access for textures.
CW: My proposal has explicit transitions. The key idea is that the transfer of ownership queue-to-queue could be tied to the fence primitive which also does synchronization. If developers are going to use multiple queues, they really want to know that things will run in parallel, so the synchronization should be very explicit. A nice way to do this is to tie ownership transfer to the sync primitive (fence).
CW: If you want to transfer, you transfer the resource to another queue and simultaneously signal the fence. Then when you want to use the resource on another queue, you have to wait on the fence.
CW: For concurrent resources, you can transition the resource to the “device” or “undefined” queue, and then any queue that wants to use it readonly can use it if it waits.
CW: A bit ugly for transfer out of concurrent.
CW: Main idea here is keeping the synchronization graph up front to the developer. The nice thing about this though is that it doesn’t require changing the API for the single-queue use-case. So we can defer this if we want.
MM: I think there’s some desire for multi queue sooner than later.
JG: Would be nice to avoid a situation where there are two separate ways to use the API. Would like to guarantee that using with a single queue is a subset of using with multiple queues.
CW: Yes, both proposals here have that, as long as you define the concept of a “default” queue that can do everything.
MM: One thing that I appreciated is that you just wrote a verbal program - describing what code would look like. DM can you do the same to describe what calls an author would make if they wanted to use a resource on one queue and then on another?
DM: They would just submit on one and then submit on the other. It’s implicit so they don’t need to do anything.
MM: But yes, multi queue would be nice because it’s a killer application of WebGPU.
DM: Talked about this with JG as well, and this is an area where we do need feedback from ISVs.
RC: Is the killer app to use multiple queues or the same queue from multiple threads.
CW: The GPUQueue is supposed to be shallow cloneable through postMessage, so you can have the same GPUQueue on multiple workers and submits are internally synchronized. Is that enough or does it need proper multi queue?
JG: Both are desirable. Async compute is something people talk about and we’re seeing more use in the AAA space as IHV support comes online.
KN: My understanding as well is that multi-queue is super important for copy queues b.c. it uses a separate hardware path. It is fundamentally exposing hardware architecture, not actually multi-threading. It could help with that, but it’s mostly there to expose the hardware.
MM: That makes sense, separating threading from multi-queue.
MM: If the purpose of multi queue is to have async compute, async copy, that’s something Metal already has built-in. Our software is smart enough to split up stuff that’s not dependent and run them in parallel.
CW: That’s something that came up as well - why can’t WebGPU just do that magic too?
CW: I think the difficulty is we don’t know how big a task is and if it’s worth offloading. I would ask if you have insights into the Metal driver, but idk if you can share that.
MM: We will do that for all tasks that could be independent.
MM: (all public). Our iOS devices have two separate hardware “channels” one is for 3D is one for compute, and they run totally asynchronously. So when you create a MTLComputeEncoder it’ll run on the compute channel, and when you create a MTLRenderEncoder, it’ll go on the render channel.
CW: Okay so compute is always async.
MM: Unless we determine a data dependency and insert stalls. That’s how it works on iOS. Desktop is obviously different. The way this is exposed to devs is that if you instrument an app using the Instruments app, there’s two separate timelines so you can see the independent scheduling of work blocks.
CW: Okay, thanks for the explanation.
JG: Difference in focus of the API. Vulkan is very explicit so you can make all the right or poor choices for yourself. Metal relinquishes some control in exchange for things being pretty easy and normal cases work well. That’s the common fundamental friction between the two approaches. It’s sort of up to us to determine where we fall in the solution space.
MM / CW: Yea I don’t want to debate the philosophy.
JG: I think that’s an important lens to look through as we think about how to design the API. We’re at the tail end up understanding what the existing APIs do, and now we have to think about how we solve the problem.
CW: iOS has two separate channels but that also heavily influences the design of Metal. WebGPU has to run on all hardware so it’s different. Metal targets specific hardware.
MM: I feel like I have to push back on some things you said.. anyway we should stick to specifics and not design philosophies or design intentions.
JG: Okay, but we have to think about the intention of our API.
CW:
1. Should we have multi queue in the first version of the API? If no, then maybe we can defer, if yes, then:
```
2. Implicit/explicit transitions
```
```
3. Concurrent textures ?
```
MM: What is the benefit of explicit synchronization? Is it more performant or something else?
CW: As a dev you know the code can run in parallel. If implicit, the dev doesn’t know for sure, and worse small changes to the code could add unintentional synchronization points. Being explicit has advantages in that the dev knows for sure that it’s parallel.
MM: Does the developer have to write it twice? once for parallel compute, and one for not?
CW: Part of the decision of how we expose it. Personally I think they should declare how many queues they want, and they get exactly that. If the hardware doesn’t have all that, then we virtualize it.
MM: Sounds like your guarantee about the dev knowing that it will run in parallel isn’t a guarantee if you virtualize.
CW: Right, I guess the guarantee is if there can be parallelism, then they definitely do take advantage of it.
MM: Doesn’t implicit do that too?
JG: But they can’t footgun themselves if it’s explicit. If it’s implicit, it’s easier to accidentally change something that de-parallelizes a lot of things.
MM: In that example you just gave, in the implicit model they would add a line of code and a bunch of sync would happen. In the explicit model, they would realize there’s a lot of extra sync and they would have to write code to do it, or they would get undefined results.
CW: They would get a deterministic error.
JG: modulo race conditions.
CW: Deterministic because when you do an ownership transfer, the internal state atomically changes so it can’t be used until another queue acquires it. On submit we would validate that all resources are usable on the queue.
JG: I do worry that in the fully explicit, very strict model, is that a developer error could happen on some machines but not all, it could work well on system A, but not on system B..
MM: Say you have a big app and lots of events firing. In some of them, you’re executing WebGPU code. If two events get flipped - like keyboard press versus XHR request, your app might not handle that correctly, and you need to test and handle all the different orderings correctly.
CW: Yes - and that relates to discussion about how long it takes for a map request to come back.
RC: Did we ever decide if someone can use a resource from multiple threads at a time? In order to pull this off, they need to postMessage the queue, fence, and resource to another thread, and also use the queue transfer APIs.
KN: There’s two things: in the current proposal objects are shallow-copyable and can be used on multiple threads. You would copy resources to both threads at the beginning of time, and use it in each thread separately. Synchronization would be internal. This is orthogonal to multi-queue though. You can run into race conditions between different threads without multi-queue, and you can use multi-queue from a single thread. They don’t impact much.
RC: Any object can be transferred?
CW: The idea is that almost any can be transferred - except maybe encoders.
CW: Okay, this is good discussion. In the spirit of time, I’d like to move on to whether we need this for MVP.
MM: Okay, the killer app I was talking about is multiple CPU threads. For us, it’s not super important because we’ll get the benefit no matter what.
CW: On our side, it looks like a can of worms and difficult to implement, test, and specify correctly. So we’d rather postpone it. Like subgroups.
JG: Difference to me is that subgroups are their own feature, and multi-queue is like it’s own primitive.
CW: That’s why these investigations were important - they do seem strictly additive.
DM: We’d want feedback before v1. Maybe it would be that implicit sync is hard to get parallelism. In wgpu, we want to implement soon-ish because our devs are very excited about async compute.
MM: What’s the policy in wgpu for experimental or un-shipping features?
DM: If it’s supported on more than one API we consider adding it as an extension and then suggest it for a web extension.
DM: We’d like to see it in MVP.
CW: So wgpu is volunteering to do a prototype and get feedback from users?
MM: Yes, I think that could be a good path as well.
DM: I think most of the code would have to exist whether its implicit or explicit. But if we at least know the group will put it in MVP then it’s more worth the investment.
CW: On our side, it’s clear it’ll happen at some point, though we feel pressure to ship and it may take a while to get it correct.
MM / JG: I don’t think so.
DM: Even if it does, we’ll still be bottlenecked by WGSL, so..
CW: Okay, we’ll need to think about this more.
DC: I’d like to add that many-to-many thread-to-queue relationship would make the GPU an efficient AShMem - asynchronous shared memory. That’s really great for ML.
MM: Do ML frameworks like tensorflow use multiple queues natively?
DC: They would if they could. Like batching matmul operations.
DC: Efficiency is found in not having to transfer lots of data on the CPU.
CW: On the CPU, you wouldn’t access this unless you map buffers. The CPU would take exclusive ownership of the memory.
DC: The idea is to have the CPU use as little mem as possible.
MM: So one thread uses one queue and another thread uses another queue, and both queues see the same memory. Looks like data is transferred but it really isn’t. Is that a good description of ASHMem?
DC: Yes.

Method of ensuring GPUShaderModules can contain MTLLibraries #1064 (Myles)

Query API: Disallow resolving unused queries or resolve them to 0? #1072 (Corentin?)

CW: Is it allowed and just write zeros, or is it disallowed?
MM: Why wouldn’t it just be zero? It’s just as if you started/ended with nothing in between.
CW: That was our suggestion as well - any others? That might involve a copy of zeros into a buffer instead of resolving.
MM: Can we know which of those two paths we would take ahead of time?
CW: would know at ResolveQuerySet time.
MM: Okay, so the answer is yes.
CW: Might involve a compute encoder anyway to convert ticks to nanoseconds.
DM: Could be problematic if you have 4 billion queries.
CW: There could be a maxQuerySetSize (suggested elsewhere) - or it would be a bitset.
MM: We could have minimum maximums. CSS does this a lot.

Should clear values for integers lose precision, or be emulated on D3D12? #1085 (Kai)

KN: Not all integer texture clear values are possible on D3D12, because clear values must be represented by floats. Three options:
- Round double clear values to float on all backends, or just expose as float in IDL.
- Emulate unrepresentable clears on D3D12 with a fullscreen quad.
- Only allow integer texture clears to “safe” integers (-224 to 224). (Allows us to pick either of the above, later, as a non-breaking change.)

Copying of stencil aspect on Metal #652 (comment) (Austin/Kai)

(See also #873, which this is blocking, but we won’t plan to discuss it today.)

PR burndown

createBindGroup: Require a superset of the layout's bindings #1061 (Corentin) *
*
*
*
Add filtered texture and sampler binding types #1076 (Dzmitry) *
*
*
*
GPUColor: remove sequence overload from the union #1079 (Kai) *
*
*
*

Agenda for next meeting

Multiqueue
ShaderModules and MTLLibraries
Same agenda.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU Web 2020 09 21

Tentative agenda

Attendance

Administrivia

Multiqueue (Dzmitry, Corentin)

Method of ensuring GPUShaderModules can contain MTLLibraries #1064 (Myles)

Query API: Disallow resolving unused queries or resolve them to 0? #1072 (Corentin?)

Should clear values for integers lose precision, or be emulated on D3D12? #1085 (Kai)

Copying of stencil aspect on Metal #652 (comment) (Austin/Kai)

PR burndown

Agenda for next meeting

Clone this wiki locally