
Minutes 2018 01 24


GPU Web 2018-01-24

Chair: Corentin

Scribe: Ken (and Corentin)

Location: Google Hangout

TL;DR

  • Administrative
    • WebGPU F2F April 24th in Montreal
    • CLA is agreed upon, Dean will present it next meeting
  • Synchronous API for buffer readbacks (Jeff’s explanation of WebGL’s API)
    • Goal in WebGL: read back data without pipeline stalls. Was able to retrofit these semantics in WebGL with the GL_STREAM_READ buffer usage and fences checked by the application. No attempt to add an async-map primitive.
    • Desire to have something more explicit than WebGL’s API as it is too easy for a developer to break the async readback path.
    • The majority of Web APIs are non-blocking, so WebGPU probably should be too.
    • If the async map requires a specific range of the buffer, implementations that need to IPC the data can avoid sending the whole buffer.
    • Debate on a polling API vs a promise-based one.
    • Consensus:
      • No GPU/CPU data races
      • What you receive from the API is deterministic
      • Should be non-blocking, either poll-based or promise-based

Tentative agenda

  • Synchronous APIs and upload / downloads
  • Agenda for next meeting

Attendance

  • Apple
  • Dean Jackson
  • Myles C. Maxfield
  • Google
    • Corentin Wallez
    • Kai Ninomiya
    • Ken Russell
  • Microsoft
    • Ben Constable
    • Chas Boyd
    • Rafael Cintron
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
    • Markus Siglreithmaier
  • Yandex
    • Kirill Dmitrenko
  • Elviss Strazdiņš
  • Joshua Groves

Admin

  • CW: WebGL face-to-face will be April 25 in Montreal. WebGPU F2F April 24, hosted by Google Montreal. I will send an email after this meeting asking for attendee confirmation.
  • DJ: All lawyers seem to have agreed on CLA, should be able to present what they agreed on to the group next meeting.

Synchronous APIs and upload / downloads

  • http://jdashg.github.io/misc/async-gpu-downloads
  • JG: The original desire is to get data back from the GPU / driver / GPU process
  • JG: distinction between discrete / UMA architectures
    • Also command buffer client/server, different processes in Chromium
    • Desire: download data back to the client without incurring a blocking command where you have to wait for us to send a signal to the driver, have the driver wait for the buffer to be ready to be read back, and then send the result back.
    • Open question was: how can we make this process asynchronous and not allow pipeline stalls?
    • Enqueuing writes into the driver, telling it “clear this memory buffer to something”, read it back and give back the results.
    • Copy that buffer we want to read back from into a CPU side one, but actually enqueuing the copy. Enqueue a fence allowing us to know once that copy is complete, and we’re ready to map the CPU-side buffer.
    • Reason you need the fence: no other way to ask the OpenGL implementation whether the copy is complete.
    • No attempt to add an async map-buffer primitive; it would be incompatible with OpenGL.
    • Synchronous API would be particularly bad in Chromium’s multi-process architecture.
    • Then can query the fence and ask it whether the copy to the client-side buffer is complete. Can read the data back out of it and unmap it.
    • Provided some example code based on OpenGL ES 3.0 primitives (see the sketch after this list).
    • Key to this: create a client-side buffer you can copy into. It’s specified as a usage hint for the buffer saying “please put this buffer in CPU-side memory”. Then you can map and read out of it.
    • Concept is mirrored in D3D11 and Vulkan where you have device memory which might or might not be mappable. Also, whether you can map them and expect the result to be client-side memory.
    • CW: D3D has a readback heap; Metal has a way to download stuff too?
      • Some APIs let you eat the synchronization stall if you really want them right now.
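For reference, a minimal sketch of the readback pattern JG describes, written against the WebGL 2 / OpenGL ES 3.0 API (the helper names are illustrative, not from any spec):

```js
// Kick off an async readback: copy into a STREAM_READ buffer and fence it.
function startDownload(gl, srcBuffer, srcOffset, size) {
  const dst = gl.createBuffer();
  gl.bindBuffer(gl.COPY_WRITE_BUFFER, dst);
  gl.bufferData(gl.COPY_WRITE_BUFFER, size, gl.STREAM_READ); // "CPU-side" usage hint

  gl.bindBuffer(gl.COPY_READ_BUFFER, srcBuffer);
  gl.copyBufferSubData(gl.COPY_READ_BUFFER, gl.COPY_WRITE_BUFFER, srcOffset, 0, size);

  const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0); // signaled when the copy is done
  gl.flush(); // ensure the copy and the fence are actually submitted
  return { dst, sync, size };
}

// Poll the fence; read the data back only once the copy has completed.
function pollDownload(gl, { dst, sync, size }) {
  if (gl.clientWaitSync(sync, 0, 0) === gl.TIMEOUT_EXPIRED) {
    return null; // not ready yet; try again next frame
  }
  const data = new Uint8Array(size);
  gl.bindBuffer(gl.COPY_WRITE_BUFFER, dst);
  gl.getBufferSubData(gl.COPY_WRITE_BUFFER, 0, data); // should not stall now
  gl.deleteSync(sync);
  gl.deleteBuffer(dst);
  return data;
}
```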
  • CW: do you have a proposal of what this would look like in WebGPU?
  • JG: question was whether we’d want to expose these copies. An advantage of doing so, as opposed to a more complex primitive like callbacks or promises: you can see what’s going on and make choices.
  • Enqueuing a copy into locally mappable memory and knowing when that copy’s complete are sufficient primitives to allow building new primitives on top of them.
  • We’ve used this set of primitives to allow you to do asynchronous downloads from the GPU.
  • CW: think in WebGPU we want to not have blocking APIs. Web tries to avoid them. So the Map operation (or whatever gives the data back) should have a way to say “not right now” – i.e., not ready.
  • JG: you could say, if you will incur a pipeline stall, then return null from the map operation. JG’s opinion is that blocking operations do have a place in the API – I’m an outlier – but just a data point.
  • CW: is proper CPU/GPU synchronization guaranteed?
  • MM: in this model just proposed, you schedule the transfer, add fence, poll fence, then map. Say the fence is triggered but then the GPU progresses and modifies the buffer. Then the CPU and GPU are using the same buffer?
  • JG: yes. Firefox implements warnings on pipeline stalls. Writing to the buffer sets the most recent synchronization stage. If you write to a buffer, set a fence, and write to the buffer again… have to make sure all writes to the buffer are complete.
  • MM: schedule a transfer, enqueue a fence, then schedule another write into the buffer. Then wait on the fence. Sounds like, second transfer has to have some synchronization so the CPU is done with it? Ownership transfer is a requirement.
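Spelled out as a command timeline, the hazard MM describes looks like this (pseudocode; the queue and fence names are hypothetical):

```js
queue.enqueueCopy(gpuBuffer, stagingBuffer); // transfer #1 into the staging buffer
const fence = queue.insertFence();           // signals once transfer #1 is done
queue.enqueueWrite(stagingBuffer, newData);  // a second write into the same buffer

// Later the fence signals: transfer #1 is complete, but the GPU may now be
// writing into stagingBuffer again. Mapping it here without further
// synchronization would be a CPU/GPU data race; hence the requirement that
// ownership of the buffer be explicitly transferred.
```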
  • JG: think this makes sense.
  • CW: WebGL prevents you from using a buffer that’s mapped.
  • KR: Jeff’s work for the WebGL WG is about a readback primitive, GetBufferSubData rather than MapBuffer. All the work to make the readback non-blocking is an optimistic caching policy; if it fails, it falls back to the synchronous readback. In the case Myles presented, the wait-for-fence operation must detect the second write to the buffer and invalidate the cache. In the WebGPU model with buffer mapping, the mapping would have to fail. In WebGPU the enqueued readbacks would be known to the fence, as would whether there is another readback.
  • MM: This makes the buffer mapping code complicated for developers.
  • KN: agree with Myles that there’s a problem - you’ll only find out your program’s wrong if there’s a race condition. Think we may need better guarantees.
  • CW: NXT’s guarantees are a bit stronger than some people would like. Have a primitive, MapReadAsync. In this model, you need to transition the buffer to using as a Map target. Becomes unavailable for anything else, including writes. Buffer’s locked to that usage, can’t transition from it. At that point, you can either map, or map fails because it took too long. Then unmap. Have the guarantees that you can’t mess up.
  • MM: think in general that’s the approach Apple would like to see. Idea that there is some bit of state whether this buffer is able to be used by the CPU or GPU, and never any confusion, is great. Little subtlety about how that mechanism is involved with usage transitions. Would like to get away from that. But main point is that synchronization can be handled in a transfer of ownership style.
  • KN: It is one of the many cases where doing things implicitly would be more confusing to devs than explicit. If app developer is putting in transitions, and they do it directly, it’s obvious why. If implicit, and for example you write / insert fence / write again, then basically implicitly you’ll see your map fail because it wasn’t able to put it into the right mode, or map it without blocking. A lot more confusing than explicit transitions.
  • CW: either that, or would take forever for the Map to resolve, because you keep enqueuing more work.
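A hypothetical sketch of the NXT-style flow CW describes above; the exact names (transitionUsage, MAP_READ, mapReadAsync) are illustrative, not the actual NXT API:

```js
// Lock the buffer to the "map read" usage; any GPU use is now rejected.
buffer.transitionUsage(BufferUsage.MAP_READ);

// The callback runs once the GPU is done with the buffer (or on failure).
buffer.mapReadAsync(0, buffer.size, (status, arrayBuffer) => {
  if (status !== 'success') return; // e.g. the map took too long
  consume(arrayBuffer);             // safe: the GPU cannot touch the buffer
  buffer.unmap();                   // release the mapping
});
```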
  • RC: KR, in the case of WebGL2, do you mean that getBufferSubData would fail or block?
  • KR: It would block because we had to follow OpenGL ES semantics in WebGL.
  • RC: equivalent API would fail in WebGPU.
  • JG: We warn against stalls in Firefox so it means it is possible to detect when the user is doing something wrong.
  • JG: part of what makes this easy / possible in WebGL is that WebGL is a single command queue. Know that once you hit a fence that says “all previous commands are complete”, we know that all previous copies are complete. Know that all previous writes are done by the time the fence is hit.
  • CW: detecting things like this in WebGL seems feasible because: 1) not so worried about overhead, 2) single thread submitting commands.
  • JG: no way to know which buffer you’re intending to read back to. Just saying that the fence will satisfy all previous writes to the buffer.
  • CW: multithreading makes it more complex in WebGPU. Can enqueue from multiple threads.
  • KR: In WebGL, overloading the semantics of fences to make the readback more async is still completely conformant to the ES semantics, but it is shoehorning a solution. Native APIs work with fences, but that is not how things work when remoting the API. Google suggested an API where the async readback would be explicit and we proposed a lot of revisions of it. We ended up at Jeff’s API, but there are downsides: for example, we need to eagerly maintain shadow copies of all the buffers (in their entirety). Would like to see WebGPU be explicit about the range of data that will be read back. The shadow copies are necessary in Chrome to share data between the GPU process and the renderer process; without them you’d be back to the blocking version of GetBufferSubData.
  • MM: Is this description of fencing for WebGL, or for Chrome’s implementation?
  • KR: We were trying to get async readback because it is a pain point of WebGL. We tried hard to get it. There are two options: 1) something new and explicit, or 2) use the WebGL 2 API as it is now, where if devs do things wrong, they block.
  • MM: So WebGL will have fences?
  • KR: WebGL 2 already has fences. We’re just specifying that if you do EVERYTHING the right way, then the readback is non-blocking on all implementations.
  • JG: WebGL’s approach here was sort of a way to hint to the implementation that you intend to do async readbacks from the GPU. Shaped a little weirdly because of the way the API works, but this is the way the usage hints work under the covers in the underlying APIs.
  • KR: For explicit APIs, especially since we can spec what we want, is there agreement that we want to be more explicit about the range of data being copied?
  • JG: I think yes, and this falls out naturally when you decide where buffers live (e.g. in client-side memory). If you can guarantee that the copy is validated then that’s the ideal solution. It’s explicit, does require you to specify the destination. If you want to use Promises, you’d need to write a library.
  • CW: ignoring Promises. Sounds good. Want to prevent data races between the CPU / GPU, and not block. Has to not return anything useful until the GPU’s done with the buffer.
  • JG: the way GLES does this is that when you’ve Mapped the buffer, you can’t enqueue any writes to it while it’s Mapped.
  • CW: GLES has Mapped and Unmapped state. Mapped -> CPU can use it, Unmapped -> GPU can use it. Would need a third state, “Being Mapped”, where GPU can’t access it and neither can CPU.
  • JG: wouldn’t characterize it this way. That would be saying “try to map”. Here you’re saying: map the buffer; that call transitions the buffer, and then you can’t make any more calls against it.
  • CW: this is hard. You would need to tag all buffers used by a particular command buffer and say that they’re a particular generation.
  • MM: earlier, Ken described two approaches; the other one was an async call that does the download for you. Why not pursue that?
  • JG: it’s not necessary
  • CW: we think it’s not necessary either. The only downside is a little more GC pressure. Might like this better in Chrome. “Create a mapped buffer for this small range”. Would like to avoid transferring the entire buffer every time if the app isn’t going to use it.
  • JG: with explicit copy/fence: saying I intend to read back something of this certain size. You explicitly malloc a client side buffer for the portion you intend to read back, enqueue copy of the data, then map the data, then delete the local buffer.
  • MM: so if you’re already going to make a copy then why not have a single call that returns a Promise?
  • JG: you can write one on top of the other (see the sketch below).
  • MM: in the Promise callback you’d implement your own Fence object, and set it there.
  • JG: that’s not the way the underlying APIs work. In normal CPU work, to transfer memory, you allocate memory, copy into shared memory, and copy back.
  • MM: everything on the web is message passing. Async primitive saying “upload or download this data”. Example: web workers. Model: make one call, and in the callback your data is ready. Other way is complex. Hope your maps don’t fail. If everything goes right, you get your data.
  • CW: Jeff’s proposal is less magic than what WebGL is doing.
  • JG: if you have the guarantees about things being allocated in a client-side buffer and exactly the right amount of stuff is copied, then you’re doing exactly the amount of work you want.
    • Allocate client-side area; enqueue client-side copy; wait for it to complete. Whether it’s wrapped in a Promise is sugar on top.
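The Promise “sugar on top” could look something like the following; tryMapRead here is a stand-in for whatever fallible, non-blocking map primitive the API ends up with, not a proposed name:

```js
// Wrap a hypothetical poll-based map primitive in a Promise.
function mapReadAsPromise(buffer, offset, size) {
  return new Promise((resolve) => {
    const poll = () => {
      const data = buffer.tryMapRead(offset, size); // null while the GPU still owns it
      if (data !== null) {
        resolve(data);
      } else {
        requestAnimationFrame(poll); // re-check next frame
      }
    };
    poll();
  });
}
```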
  • CW: don’t think Myles is suggesting that we skip the allocation of client-side memory. Only the question of whether we repeatedly poll.
  • KR: The magic in WebGL would go away in this proposal: you explicitly say 1) I have a CPU-visible region, 2) enqueue a copy to it, 3) now that the data is ready, give it to me as a typed array. We assume we guarantee there is no data race.
  • MM: The key part is that MapBuffer is the transition of the buffer between CPU-usable and GPU-usable. In this proposal two events happen: 1) the fence gets resolved, 2) you call Map(). Having this window in the middle doesn’t satisfy the constraint of unambiguous ownership.
  • JG: since we control the entry point and know what was enqueued, we know whether the buffer’s going to be written into and that there’s a race.
  • MM: the browser knows the current state.
  • CW: you can specify exactly how it works but it’s still magic to the developer. Call Map and maybe it works?
  • JG: what’s the counter-proposal? If you have an explicit transition it’s still fallible.
  • CW: it’s not fallible. You know when the transition happens what the state is. The transition’s atomic; once it’s done, nobody can use the buffer for anything except Map/Unmap.
  • CW: transition’s not fallible; you put a memory barrier in.
  • KN: not a question of whether the transition is fallible. Question is: what if you transition to mappable usage, and then map it immediately rather than waiting for something?
  • MM: that’s what I was asking. One solution: the API could return a Promise which gives you that ArrayBuffer.
  • JG: that prevents optimizations like coalescing readbacks. Do you have an API saying “read back this data from multiple buffers”?
  • MM: impl is free to coalesce these at its own schedule.
  • CW: each of us that has an opinion writes some code, semi-spec, and diagrams. Let’s focus on what we agree on:
    • No data race between CPU and GPU for mapped buffers.
    • Deterministic what data you’re getting from the API (the GPU’s data, zeros, ...).
    • Should be non-blocking: either a Promise-based API, or a Map that can report “not ready” or fail instead of blocking.
  • MM / JG: agree
  • KR: Need to discuss persistently mapped buffers as pushed by Dzmitry

Agenda for next meeting

  • Continue this conversation. Need writeups.
  • API shapes
  • Persistently mapped buffers?