Minutes 2018 10 08

GPU Web 2018-10-08

Chair: Corentin

Scribe: Dean

Location: Google Meet

Minutes from last meeting

Resolutions

Resolution: Accept pull request 94, with edits from CW based on the notes here.

Resolution: Concerns about sync access to resources with multiple queues, hopefully make it automatic.

Resolution: Try to avoid having the browser expose the geometry of the hardware for multi-queue.

Tentative agenda

Multi queue
Numerical fences proposal #94
Command buffer submission model
Extension validation
Agenda for next meeting

Attendance

Apple
- Dean Jackson
- Justin Fan
- Myles C. Maxfield
Google
- Corentin Wallez
- James Darpinian
- Kai Ninomiya
- Austin Eng
Intel
- Aleksandar Stojilijkovic
Microsoft
- Chas Boyd
- Rafael Cintron
Mozilla
- Jeff Gilbert
Yandex
- Kirill Dmitrenko
Elviss Strazdiņš
Mehmet Oguz Derin

Numerical fences proposal #94

CW: Have people looked at this?
MM: The latest version of Metal has a MTLEvent type that can be used to do synchronisation between GPU/CPU/etc. Uses a 64 bit value. Documentation says it is until reaches the value. https://developer.apple.com/documentation/metal/mtlevent
CW: Sounds a lot like the D3D12 less than.
MM: I will write a test.
CW: What prompted Metal to add this?
MM: We’ve been adding new features to our Metal Performance Shaders. Over the past year the new features have included ray tracing and using multiple GPUs. I think this was the use case.
CW: So every API (other than Vulkan) has numerical fences. So the proposal in #94 is really close to Metal/D3D, except a fence can only be signalled by one queue. Comments?
(silence while people read the pull request :)
???: How do you avoid deadlocks?
CW: If you have a numerical fence you need protection for deadlocks. E.g. wait on 5 and then never signal 5. So you have an internal state on the fence that keeps track of the greatest ever signalled value. When you wait you have to block on at least the greatest value.
MM: How does that work with the deferred command buffer?
CW: The signal operation is on the Q, and thus executed immediately.
MM: If I have a wait/signal pair, I have to call the signal operation first.
CW: Yes, so the runtime can prove that the wait will exit.
JG/MM: Sounds good.
RC: What is the rationale for it only being on a single Q?
CW: Since there is no sync btw Qs, you might be able to break the monotonicity between the Qs. Easiest solution was to force signals to be on a single Q.
MM: What’s the problem with having 5 before 4?
JG: You could go from 0 to 5 to 4, which isn’t monotonic.
CW: Depending on ordering, you could block on the wait on 5, even though it happened, because you’re back at 4.
JG: The other solution would be to forbid ever setting a value to a lower number.
CW: How to implement?
JG: Would be virtual.
MM: I like CW’s solution.
CW: I don’t see a reason to have signals on multiple Qs.
MM: Could you transfer ownership?
KN: What’s wrong with two signals?
(nothing)
JG: This sounds good for MVP. See if people complain.
RC: I thought the mono-incr rule means that putting things on the same Q…
JG: The fence knows the largest value it’s signalled, and refuses to wait on any value that is higher.
RC: Why does a single Q solve this?
JG: It solves non-monotonicity if you can simple on multiple queues.
KN: If you signal 4 on one queue, and 5 on the other, then wait(5) makes it look like both have finished when maybe only one finished. (??: Not exactly the problem.)
MM: Why should fences be monotonically increasing?
CW: Because it makes sure that we can validate the wait will work.
JG: Otherwise deadlocks can happen. Absent of monotonicity, you can’t guarantee that your fences will execute.
RC: Makes sense. Another reason to do it on a Q is that if we have Qs on multple threads, it would be messy.
MM: Fine.
CW: I’ll tidy up the proposal.
JG: When you’re on the same Q and you signal 5, then signal 4, is it an error?
CW: Yes. A JS exception will fire. It’s in the docs I wrote.
JG: We might want to revisit this if we share Qs across threads. That might produce a racy error.
CW: Yes, interesting. I’d like ISV’s to try this out first.
RC: So this Q signal is GPU side. Is there a CPU side signal?
CW: No. Because linux cannot wait on a value before you signal. You’d have to enqueue a wait and tell it to not start the work before the signal. This is a linux limitation.
JG: You’re supposed to get around that by not submitting until you want it to run.
JG: It lets you presubmit. Make a note of that concern.
CW: Should we rename fences to timeline and create them from queues?
RC: With WebGPU fence, you signal a fence and you return a fence?
JG: In the prose proposal, there is a createFence() and a signal() that returns void.
CW: Oops. My mistake.

Resolution: Accept pull request 94, with edits from CW based on the notes here.

Multi queue #95

JG: I don’t have a fully fleshed out proposal. But enough to talk about.
JG: Briefly, surface the concept of different types of Qs. In Metal, you have access to all Q types because everything is virtualized. https://developer.apple.com/documentation/metal/mtldevice/1433388-makecommandqueue In D3D you can create Qs at will, arbitrary number by type (Graphics, Compute, Copy - all Graphics can do Compute, and everything can do Copy) https://docs.microsoft.com/en-us/windows/desktop/api/d3d12/ne-d3d12-d3d12_command_list_type. Vulkan lets you create Q families (that can have Graphics+Copy). https://www.khronos.org/registry/vulkan/specs/1.1-extensions/man/html/VkQueueFlagBits.html
JG: In the sketch IDL I just added booleans for supporting Graphics and Compute. If you don’t set them, you get just a Copy.
JG: I made it so they are created at device creation time. A seq<Q> that you can access. You can grab Q-0.
CW: How do we know not to create too many Qs for the device?
JG: Implementation detail for now. In the newest version I have marked some TBDs, which say the implementation will decide. E.g. in Metal you’d create one of each. Same with D3D12. Vulkan is slightly more tricky - could examine the hardware to decide what to do.
JG: You could also let the app opt-in at device creation time, if it knows better. E.g. i want 4 graphics Qs.
JG: The other TBD, in Vulkan (but not MTL or D3D) is that you can have Q families that ….(missed). These concurrently accessed resources can be slower than others. I don’t see a way to do a good job on the migration without exposing API.
CW: Similarly, I’m a bit worried about the validation in the passes will make it harder. I didn’t find a way to do it.
JG: (missed - sounded like a Cylon)
RC: I will check on D3D.
JG: That is most of what I can talk about. I did add a WAIT function, that should be merged with pull request 94. In order to synch, you need to be able to wait on fences. I’ll update this PR first.
JG: I’ll come back with a more detailed discussion of transferring between Qs.
CW: For the time being, I see it that the proposal can’t work because you can be reading at the same time as writing.
JG: Yeah. This is similar to buffer access. The same problem is at play. We need a sketch API that will help us get to a solution for how to do this.
CW: One thing that I’d like to suggest for multiple Qs, is that each time we have this, we default to Q-0. So that most users can just say map the buffer, and it works.
JD: How important is multiple Qs?
CW: Doom 2016 suggested that multiple Qs helped performance.
MM: Is this necessary for the MVP?
CW: It interacts with buffer mapping and the sync model.
CW: For the MVP we can just expose the single Q. But we should at least know it will work out.
MM: Let’s say the system detects the programmer has done something wrong (e.g. read when writing). Should we decide if that’s an error, or if the system should insert WAITs?
CW: Thought about having the system insert WAITs, but I thought about it and it sounds very hard.
JG: Similar to fences. Try to forbid bad behaviour.
MM: The issue isn’t that it is invalid, it’s that it can’t be run right now. You either run it later or reject.
JG: I think we’d run it later.
MM: Yeah, let’s pursue that and revisit it if necessary.
CW: I don’t think we’ll find a solution.
JG: Everything decays into a lock-step list.
CW: Right, I don’t think we’ll get better than that.
MM: If the programmer gets it right, they’ll take advantage.
CW: It’s very hard to validate and prove you can be concurrent.
MM: The browser is exposing a certain number of Qs? (Not the app requests a number)
JG: Yes.
MM: So what would Metal do, where it has infinite Qs?
JG: On MTL, you’d expose a Graphics and Compute Q, and maybe a Blit Q. But the implementation would decide how many to expose)
RC: And some hardware would expose e.g. 3 graphics Qs?
JG: Yes.
CW: I have concerns that this will make applications adapt to the hardware, rather than do what makes most sense by default. I think the application should describe what Qs it wants, and we either give them that, or virtualise on top of that.
MM: My thought it similar to CW. If the meaning here is so they could do parallel work. It’s not clear what this API means in Metal. If it means we should examine the GPU to decide, then we’d need to use secret APIs to determine what the GPU supports.
JG: The developers usually just make a educated guess by looking at the hardware.
CW: The browser could decide. But if you expect parallelism, it might break down.
MM: I think most of the use cases are going to be creating one compute and one graphics. Why not create some cookie cutter proposals?
JG: Mine had one cookie cutter. Or do you mean for all hardware?
MM: Yes.
JG: I think that makes more API friction. Yes, most developers won’t use multiple Qs. But I think that most users will be on engines that do this multi-Q intelligence for them, so that should be possible. E.g. Epic will get this right.
RC: So what does Epic do to decide how many compute Qs to request?
JG: I think they have hard-coded data in their engine.
MM: Don’t like exposing the hardware id to the Web.
JG: We already do that for WebGL.
MM: Can we not do that for WebGPU?
DJ: We did it for Google Maps, but they’ve since said they don’t use it as much any more.
JG: It allows developers to work around bugs on particular hardware.
JD: I was on the Maps team when we requested this, and we absolutely needed it to work around some bugs.
MM: Why not leave it out for now, until it is necessary?
KN: WebGL already has it, so it doesn’t hurt anything. We can replace it with a fixed string later if we want to remove it.
JG: It’s necessary now.
RC: It’s already on the WebGPU IDL, under adapter.name.
JG: Because it is there, certain ISVs will have tables that help them know how many Qs they can use.
CW: Would like the application to opt-in to a specific queue geometry, would have tables used at device creation time.
MM: That works fine for Metal, because we can say we have everything.
CW: And our implementation will do something similar, and just virtualize.
MM: It could be like the video element where you ask it if it can play something, and you get back either “no” or “maybe”.

Resolution: Concerns about sync access to resources, hopefully make it automatic.

Resolution: Try to avoid having the browser expose the geometry of the hardware

Specification

MD: When will we start working on a real specification?
CW: We’ll probably stick with just the IDL for the moment. When that is stable, we’ll start on a real specification.

Not discussed this meeting

Command buffer submission model
Extension validation and errors/exceptions
https://gist.github.com/kainino0x/c6588632fa3e655572b8dbe93ac2210f

Agenda for next meeting

Command buffer submission model
Extension validation
AI CW: write proposal for don’tcare/discard.
AI KN: more concrete proposal for exceptions/extension validation
(stretch) push constants investigation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly