Minutes 2018 04 11

GPU Web 2018-04-11

Chair: Corentin

Scribe: Ken

Location: Google Hangout

TL;DR

  • Promiseable vs. pollable PR
    • Adds WebGPUFence.wait(), which polls and can also block in workers. On the UI thread, the fence status only transitions at a microtask boundary.
    • No disagreement.
  • Google’s error handling proposal PR
    • Similar to WebGL, where the return value of createBuffer might not point to a valid object. WebGPU objects can be either valid or invalid.
    • During development, tools provide synchronous errors; in production, there is a device error log that can be queried asynchronously for telemetry and fallback.
    • Simplest model for async error handling because it “looks synchronous” for valid code, only the error handling is asynchronous.
    • OOM in resource creation can be handled by the application explicitly.
    • Question whether bind group creation can lead to recoverable OOM.
    • Proposal is a lot to take at once, the group will need to come back to it.

Tentative agenda

  • Promiseable vs. pollable
  • (Stretch) error handling
  • Agenda for next meeting

Attendance

  • Apple
    • Dean Jackson
  • Google
    • Corentin Wallez
    • James Darpinian
    • John Kessenich
    • Kai Ninomiya
    • Ken Russell
  • Microsoft
    • Chas Boyd
  • Mozilla
    • Dzmitry Malyshau
  • Yandex
    • Kirill Dmitrenko
  • Joshua Groves

Administrative stuff

  • CW: Got a message from the lawyers: they are meeting soon to resolve nitpicks
  • DJ: We should get people into the vendor mailing list.
    • CW: Got someone from Unity and Epic
    • DM: Will send an email about it to Roblox
  • Next meeting to plan the agenda of the F2F

Promiseable vs. pollable

  • CW: Jeff Gilbert put up a pull request about fences; we suggest a poll-and-wait mechanism (see the sketch at the end of this section)
    • Except that on the UI thread the timeout has to be zero, and within a single microtask the status of the fence can’t change.
  • DM: think this goes in line with what Jeff was saying. Think he will agree with it.
  • DJ: seems fine.
  • CW: being able to wait in workers and having a fence’s status change without JS giving control back to the browser... is this OK? This departs from the norm of the web
  • DJ: think there are justifiable reasons to go beyond the norm of the web in this area.
  • CW: talked with Rafael about it; he seems to be reasonably happy with it
  • CW: let’s finish the conversation on Github and then merge this pull request
  • CW: how do we want to extend this to other parts of the API?
    • Keep this dual wait/promise system for other parts of the API (like errors)?
  • DM: don’t think we should. For errors, asked Luke Wagner. Got feedback which was sent over. The Promise would work if not used on strictly hot paths. So: what are the hot paths? And, we originally objected to Promises because they would be hard to port to WASM. Do we have more objections to Promises, because they’re the JS way of doing things?
  • CW: when you get an error it’s already “too late”, so seems OK to use Promises for Errors.
  • DM: was really talking about object creation.
  • CW: that’s part of the error handling proposal we have up for discussion.
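
A minimal sketch, assuming the shape discussed above, of what the dual poll/wait fence could look like from JavaScript. Every name here (insertFence, wait, the timeout convention) is a placeholder for whatever the pull request settles on, not an agreed API.

```js
// Hypothetical sketch of the poll/wait fence discussed above.
// insertFence() and wait(timeoutMs) are placeholder names.

// In a worker: a blocking wait with a timeout is allowed.
function workerWaitForGpu(queue) {
  const fence = queue.insertFence();        // assumed way to obtain a fence
  const signaled = fence.wait(16 /* ms */); // may block the worker up to 16 ms
  if (!signaled) {
    // Timed out; the fence can be polled again later.
  }
  return signaled;
}

// On the UI thread: only a zero timeout (a poll) is allowed, and the
// observed status may only change at a microtask boundary.
function mainThreadPoll(fence) {
  return fence.wait(0); // never blocks; true once the fence has signaled
}
```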

Error handling

  • CW: https://github.com/gpuweb/gpuweb/pull/53/files
  • CW: error handling is in 4 categories
      1. Synchronous, during development.
      2. My app had an error and I can’t really recover, but I want data to send to a server or crash reporting mechanism.
      3. I’m creating this resource and I want to be able to recover from out-of-memory.
      4. The type of error where you want to show “Oops, WebGPU failed” and fall back to some other API.
  • First part of proposal:
    • Device has an error log which can be queried async.
    • Every error goes into the log. Querying the log returns a Promise with a list of recent errors.
    • This handles the telemetry case (2) above.
  • DJ: what about object creation errors?
    • CW: would rather not have every object allocation that can fail return a Promise, because in most cases there’s no error. And if there’s no error, you can use the object right away.
    • CW: each object in our WebGPU proposal is Valid or Invalid. You just have a handle to the object. If you use Valid objects, everything works as expected. If you use an Invalid object, Invalidity propagates to newly created objects. Use of Invalid objects also results in errors going to the error log.
  • CW: for blend states, you can’t tell whether the allocation of the blend state worked and took effect successfully. But for Buffers and Textures, the Promise tells you what happened.
  • DJ: so you’ll never get a Null object?
    • CW: correct.
  • DJ: And if you want to test whether it worked, you check the Device error log? Or if you created something where you don’t know if it’s valid, you have to use it before checking the error log?
    • CW: you can, for Buffers and Textures.
    • DJ: and you check that by checking the device error log?
    • CW: you could, but that’s not the intended way. During development, developers use the synchronous error checking. But after you ship, you can’t tell whether objects were created successfully, except for Buffers and Textures.
  • CW: that’s the high-level idea. Alternative: WebGPUObjectStatus on every object. Seems to add overhead to the API without many benefits.
    • During development, make sure your code’s correct.
    • In production, only objects you can check are valid/invalid are Buffers and Textures where you opted in to the checking.
    • Opt in with a creation flag. “Want to be able to recover from failure.”
    • Read it from the status of the object rather than the error log. Error log is still only used for reporting.
  • DJ: would like some sample code showing production vs. development. Don’t quite understand what you do between the two. Understand what you do for textures and buffers, but not other objects.
  • CW: no code difference between production and development. Only difference is a device creation option or something in DevTools which gives you synchronous errors. (See the first sketch at the end of this section.)
  • DJ: so you’re suggesting having it be more or less automatic when DevTools is open.
  • KN: yes. Not totally automatic, but something like that.
  • CW: code between production and development is pretty much the same, and doesn’t care about error handling. Passes validation. Context loss is a different thing. Then, later, the application can fetch the error log. Application can send telemetry to the server if it finds errors being generated.
  • DJ: ok.
  • KN: big question for everyone: can you think of use cases for error handling that aren’t covered by this design? Use cases where it’s useful to catch an error immediately? Where an error log and fallible allocations aren’t enough?
  • DM: does seem to me that this is enough to handle errors. But it introduces complexity, i.e., weaker typing in the objects. Not clear if we need this complexity. Is there prior art in web APIs to have such weakly typed objects? They’re essentially nullable. They just contain errors.
  • KN: WebGL and OpenGL work in similar ways. Operations that cause errors are no-ops.
  • CW: or, for example, you have a number which represents a texture, and TexImage2D on it failed. Then you have a number referring to a texture which has nothing in it.
  • CW: don’t think it’s weaker typing. From a functional programming standpoint it’s the Maybe monad. It might be a strange concept but isn’t weaker.
  • DM: the Maybe monad is weaker typed than real objects. Since most people here work in the WebGL WG, maybe it was considered successful. But I don’t consider the WebGL error model successful.
  • CW: a problem with the WebGL error model is that it doesn’t propagate, so the following operations might succeed. Also it’s synchronous, which doesn’t work. Can’t call glGetError each frame because it’s too expensive when running in Chrome.
  • KN: the idea where an object might or might not be valid seems useful. It’s just the error reporting in WebGL which seems broken.
  • CW: if we take it as an axiom that we want to handle errors asynchronously, this seems the simplest alternative. Alternative might be that objects are Promises that are resolved later, but this is super hard to port existing content to.
  • DM: this group is designing a web API that doesn’t copy any existing native API. Concern about native APIs’ structure doesn’t seem relevant.
  • KN: the concern is, how would you port C code.
  • CW: can’t ask Unreal engine to port to a new kind of backend where things are done over multiple frames because of Promises.
  • DM: disagree; could be done in the backend with enough restructuring.
  • CW: can’t convince Epic Games to do this massive restructuring.
  • DM: you’re async already. Doesn’t make sense to create things quickly all the time.
  • KN: it’s async because it doesn’t contain your data yet, but it’s synchronous because once you’ve created the object you can try to use it. I don’t see a design which works where we have CreateBuffer return a Promise<WebGPUBuffer>. You need to be able to do things with the object handles (Buffers, etc.) that you don’t know are valid yet.
  • DM: I don’t think anyone’s proposing synchronous resource creation. For Promise-like resource creation, I expect you’d use “then()...then()”. There’s always this async operation of filling up the resource.
  • CW: not true. Look at a game engine like Unreal. They have some upload buffers. But if they run out, they have a few choices: (1) throw out the uploads till the next frame; (2) (didn’t get it); (3) allocate a new Buffer, add it to the staging buffers, map it, and upload.
  • CW: it is super important to be able to do operations “right now”.
  • DM: you’d consider this “hot path” something we have to handle efficiently, just because they didn’t make a pool large enough?
  • CW: yes, definitely. For example, doing any kind of one-off operation. There is nothing that prevents us from giving the developer that. You can just pipeline everything. It would be best if the web API didn’t force applications to handle asynchrony when they don’t actually need to.
  • DM: you can hide it, but it still is asynchronous. In my mind, exposing the asynchrony is better, because the developer may have a more sophisticated way to handle it.
  • CW: how would an app handle asynchrony better?
  • DM: maybe avoid constant checks if something is valid or not.
  • DM: mentioning a couple of frames later with regard to resource creation. If you’re writing an Unreal backend and you make a request for a resource, it’s not going to come 2 frames later.
  • CW: it will come 1 frame later. End of the microtask.
  • KN: an engine will not typically relinquish control until it’s done with the frame – if ever. You don’t expect an engine to execute multiple tasks per frame. Instead it kicks off one per frame, like requestAnimationFrame. The app won’t typically say, “here’s my main task, and I kick off a couple of new tasks, and these need to complete before the next frame”.
  • DM: can’t really object to that; I haven’t worked with rAF.
  • KR: Comes back to the API being remotable. We studied D3D12 and Vk, and they are very synchronous where every operation returns a result. We are trying to come up with an API that targets these modern APIs but is remotable. OpenGL is basically this; it was remotable from the start. The only issue is that glGetError was synchronous. We are trying to make the API remotable but still keep a synchronous programming model, otherwise we’ll get things like 30-deep promises. A synchronous model also mirrors what applications do already with Vulkan / D3D12. That’s why a handle-based model is good: it is remotable and matches what people do already. Asynchrony would be a big programming model change and hard to teach / use.
  • KN: one of the tenets is: the app can pretend that everything’s happening synchronously in the API. It makes the programming more tractable. All the native APIs work that way.
  • DM: thanks for that. What would you say – I like the option of having something fallible vs. not fallible. Clients typically expect things to work properly, or explode. What do you think if we had fallible versions of buffer and texture creation which returned Promises, and then had non-fallible versions which return Buffers and Textures and the world blows up if they fail?
  • KR: We thought about this a lot. Don’t know if applications will be able to handle OOM. Maybe? There are OpenGL applications out there that handle it, probably. But right now in Chrome handling OOM is hard to do on all platforms. Because of limits we <...> it was really hard to handle OOM nicely. In WebGL OOM is basically a blow-up. Hope WebGPU can do better by keeping similar handle semantics but maybe being able to recover from OOM. That’s why this proposal fuses fallible and infallible allocations (see the second sketch at the end of this section). We should try to see if we can test this with some of the WebGL stress tests. Applications should be able to try and recover from OOM. Having UE/Unity split init and allocation would be a huge change.
  • DM: so when the user’s supposed to handle the errors in this proposal, they still have a Promise which tells them the error?
  • CW: yes
  • DM: that doesn’t seem that different from a Promise-based allocation API.
  • JD: the difference is that you get the Texture object right away, rather than waiting for the Promise first.
  • DM: nobody is going to wait on the Promise.
  • JD: then you have to make all the APIs take Promises or the real object.
  • DM: why?
  • JD: you want to upload a texture. Have an API taking a Texture*. Now all you have is a Promise.
  • DM: was thinking you would do .then() and then upload.
  • JD: that’s waiting for the Promise to resolve.
  • KN: example case study: on a given frame you render the entire frame, then also check whether the user has pressed F12 and wants to take a screenshot. If you don’t have a texture to save the thing into, you have to allocate one. Want to copy into that texture. If you have to wait for a Promise, you can’t do the copy operation. And if you wait, then you don’t have the data around anymore. But if you pipeline things, you can insert the copy early enough to capture the correct screen contents.
  • CW: that’s an argument for handles. Not a situation where the app is going to try to recover from errors.
  • KN: it could recover in some way. But will still fail to capture the correct frame.
  • DM: the argument about the gap between resource allocation and usage, even if you do things in the then() block, has gotten through to me. I need to reevaluate the proposal.
  • CW: still think that maybe special-casing Buffers and Textures for fallibility is not the right call. Might be better to be able to ask all objects if they’re valid.
  • KN: don’t know. I’m split on this issue. Think that it’s preferable to only have recoverable object creation on objects where it’s useful, rather than sticking it on every object. It’s a complicated topic and I don’t know the answer. If you’re debugging you want a layer telling you synchronously that an error occurred during object creation. Otherwise you won’t know that your descriptor object failed to create.
  • KN: agree there’s a benefit to making the API uniform, but think the simplicity of having fewer objects with this mechanism outweighs it.
  • CW: any other questions about our proposal? Mainly, clarifications?
  • DM: just to clarify: in NXT descriptor sets, allocation is not fallible?
  • KN: you could pass invalid arguments. But we don’t care to allow the application itself to recover from such a situation, because that’s a bug in the application. It’s useful for a developer to see that error, but not useful for an application to recover from it.
  • CW: but we haven’t thought about apps trying to use tons of descriptor sets.
  • KN: may be more object types we want this API attached to rather than just buffers and textures.
  • CW: let’s go back to Github and discuss. Counter-proposals, etc.
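
In response to the request above for sample code showing production vs. development: a rough sketch of how the error-log part of the proposal might be used, assuming a hypothetical device.popErrorLog() entry point that resolves with recent errors. The application code itself is identical in development (where DevTools or a device-creation option surfaces errors synchronously) and in production (where the log is drained occasionally for telemetry).

```js
// Hypothetical sketch of the error-log part of the proposal (PR #53).
// popErrorLog() is a placeholder name for "query the device error log
// asynchronously"; createRenderPipeline stands in for any fallible creation.

function createPipeline(device) {
  // Same code in development and production. If the descriptor is invalid,
  // the returned object is Invalid, an error goes to the device error log,
  // and invalidity propagates to anything created from this object.
  return device.createRenderPipeline({ /* ...descriptor... */ });
}

// Occasionally (not every frame) drain the error log for telemetry.
async function reportErrors(device) {
  const errors = await device.popErrorLog(); // resolves with recent errors
  if (errors.length > 0) {
    navigator.sendBeacon('/telemetry', JSON.stringify(errors));
  }
}
```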
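
And a rough sketch of the opt-in fallible creation path together with the F12 screenshot case study above, again with made-up names (the recoverable flag, status(), and the copy/submit calls are all assumptions): the handle comes back immediately, so the copy can be recorded this frame, and the application only learns later, via a Promise, whether the allocation succeeded.

```js
// Hypothetical sketch of opt-in fallible texture creation and the
// "press F12 to take a screenshot" case study. All names are placeholders.

function captureScreenshot(device, queue, framebufferTexture) {
  // The handle is returned synchronously even though allocation may fail.
  const screenshotTex = device.createTexture({
    /* size, format, usage, ... */
    recoverable: true, // hypothetical opt-in flag for fallible creation
  });

  // The copy can be recorded right away, this frame, without waiting on a
  // Promise -- this is the pipelining argument made above.
  const commands = device.createCommandBuffer();
  commands.copyTextureToTexture(framebufferTexture, screenshotTex);
  queue.submit([commands]);

  // Later, ask whether the allocation actually succeeded and recover if not.
  screenshotTex.status().then((status) => {
    if (status === 'out-of-memory') {
      // Recover: drop the screenshot, free other resources, or retry.
    }
  });
}
```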

Agenda for next meeting

  • Plan agenda of the F2F
  • Revisit error handling discussions