
Minutes 2019 07 22


GPU Web 2019-07-22

Chair: Corentin / Dean

Scribe: Ken / Austin / Kai

Location: Google Meet

TL;DR

  • We have spec editors!
  • Discard to 0 #358
    • Concern that discard sounds cheap when actually the operation is a clear.
    • Consensus to call it clear and call out in the spec that it is expected to be faster than a real clear.
  • Multi threading
    • MM described multithreading use cases seen by the Metal team: parallel pipeline compilation, parallel texture upload for LOD streaming, freeing up the main thread.
      • For commands people use either the parallel render command encoder, encode separately and submit in order, or use multiple devices/queues
    • Discussions of #354
      • Problems around objects with state being passed around (including views to objects) and complications for buffer mapping.
  • PR Burndown
    • load ops and clear values #284 / #283 / #268
    • Consensus to go with loadValue being “load” or a GPUColor.

Tentative agenda

  • Discard to 0
  • Compressed texture copies
  • Multi threading
  • Multi queue
  • PR Burndown
  • Agenda for next meeting

Attendance

  • Apple
    • Dean Jackson
    • Justin Fan
    • Myles C. Maxfield
  • Google
    • Austin Eng
    • Corentin Wallez
    • Idan Raiter
    • James Darpinian
    • Kai Ninomiya
    • Ken Russell
    • Shrek Shao
  • Intel
    • Yunchao He
  • Microsoft
    • Chas Boyd
    • Damyan Pepper
    • Rafael Cintron
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
  • Mehmet Oguz Derin
  • Timo de Kort

Administrative stuff

  • We have 3 spec editors: Justin Fan, Dzmitry Malyshau, and Kai Ninomiya. (Yay!)
  • CW: 3 is sort of a large number that might dilute responsibilities. Please keep each other accountable.

Discard to 0

  • CW: #358
  • CW: Adds discard operation. Debate we’ve had in the past. Resolved to allow discard-to-0. Lazy clear. Lots of disagreement in that PR, should go through again.
  • DP: I was one of the objectors. Point was general user confusion. Developer wants to discard. Clear-to-zero is an implementation detail about how the runtime satisfies security requirements.
  • JG: that’s what we do in WebGL.
  • CW: since all impls will be doing lazy zero clearing then why not call it discard to 0?
  • MM: are we discussing name or semantics?
  • JG: would like to leave open opportunity to not clear to 0. In WebGL the approach we have is this. Discard is natural, what we want, but impl detail and safety detail about clearing to 0. Would like users to clear to 0 in the same way we’d like them to not rely on our impl zeroing resources if they don’t do that.
  • CW: we can educate users, but do we agree the spec is “as-if” the contents will be cleared to zero? Non-normative spec detail, impls are supposed to use discard. Have extension, have uninitialized data?
  • JG: there’s “action X” the action we want to take and “action Y” the one we have to do right now. Right now we’re naming it for Y instead of X. In the PR I think it’s in the wrong direction.
  • CW: what about in WebGPU 2 when it’s safe, should we add something called discard?
  • JG: would be possible on some impls today. This PR closes that door.
  • RC: are you against more than just the name? Discard, and if you use something you discarded without clearing it, is it a validation error or should it be auto-cleared?
  • JG: ideally, would like to offer safe discard semantics. Might be preserved, might be zeros, or might be something else from your (logical) device. All “safe”. Would like to generate a warning if we see users are likely at risk for relying on impl defined behavior. This is discard-to-0, but “as if” means it’s like clear.
  • MM: sounds like you’re arguing for a third value. Safety is important, portability is also important. What you said is safe but not portable. Giving authors the choice of safety+portability rather than just safety is important.
  • JG: perf considerations too.
  • CW: we expect impls to use discard+clear op so should be fast.
  • JG: could be native extensions down the line which make these things safe in the future. Don’t have same compulsion toward portability as safety.
  • MM: would you object to a 3rd value?
  • JG: I think we only need 2 values if we say it’s discard.
  • CW: and the contents are 0s?
  • JG: undefined.
  • KN: and we warn if you use it?
  • JG: yes.
  • JG: what do we do if textures are initialized but not given data?
  • JG: say they start uninitialized. In WebGL we say they’re filled with zeros.
  • MM: same thing, unobservably different.
  • JG: if state tracking doesn’t pick up on some usage patterns and causes undesirable clears, that’s bad. Happens in WebGL today.
  • CW: what if we call it discard-to-0. Add another thing “discard” which offers the undefined behavior. Warnings are an impl detail.
  • KN: sounds like you want for it to be an error to use something uninitialized.
  • CW: that increases the state of the API a lot. Texture, for every subresource, can be in error state or not.
  • JG: we have this in WebGL and it’s not prohibitive.
  • CW: it’s a problem for developers. More state that can produce errors, more likely app breaks if user does something different.
  • JG: similar, it would just be filled with zeros instead. Would limp bad applications along longer, not encourage them to be written correctly in the first place.
  • JG: if it clears to zero I think it should be called “clear” and clear to 0. Not say “discard” at all. It’s like it’s discarded, then cleared on the next read.
  • CW: then what do we call it? clear-to-0?
  • JG: just clear.
  • MM: I’m fine if we call it “clear” and say that it’s expected to be faster than a real clear.
  • RC: sounds fine to me.
  • DP: happy with that.
  • CW: I’ll update that to be clear.
  • JG: thank you.

Compressed texture copies

  • CW: agreed the operations on compressed textures would be spec’ed, and the actual formats would be extensions. Intel has done BC textures in Dawn, have investigation into texture copies.
  • https://github.com/gpuweb/gpuweb/issues/363
  • CW: describes the validation constraints for copies involving compressed textures. Please look and see if it meets expectations. (Rough sketch of the block-alignment idea below.)
  • DM: looked through it, gave thumbs up, no criticisms to approach, seems reasonable. Let’s discuss when everyone’s looked at it.
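
A rough sketch of the kind of block-alignment check #363 describes, assuming 4x4 block-compressed (BC) formats; the names below (validateCompressedCopy, Origin2D, Extent2D) are illustrative stand-ins, not API from the issue:

```ts
// Hypothetical validation for copies involving BC-compressed textures
// (4x4 texel blocks). The real constraints are spelled out in issue #363;
// this only illustrates the general shape of the rules.
const BLOCK_WIDTH = 4;
const BLOCK_HEIGHT = 4;

interface Origin2D { x: number; y: number; }
interface Extent2D { width: number; height: number; }

function validateCompressedCopy(origin: Origin2D, size: Extent2D, mipSize: Extent2D): void {
  // Copy origins must sit on block boundaries.
  if (origin.x % BLOCK_WIDTH !== 0 || origin.y % BLOCK_HEIGHT !== 0) {
    throw new Error("copy origin must be block-aligned");
  }
  // Copy extents must be whole blocks, except where the copy runs to the
  // edge of a mip level whose size is not a multiple of the block size.
  const toRightEdge = origin.x + size.width === mipSize.width;
  const toBottomEdge = origin.y + size.height === mipSize.height;
  if (size.width % BLOCK_WIDTH !== 0 && !toRightEdge) {
    throw new Error("copy width must be a multiple of the block width");
  }
  if (size.height % BLOCK_HEIGHT !== 0 && !toBottomEdge) {
    throw new Error("copy height must be a multiple of the block height");
  }
}
```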

Multi threading

  • #354
  • KN: MM you ready to discuss?
  • MM: think so. Have some interesting findings. Did research on common multithreading patterns in Metal. Don’t know much about Vulkan and D3D. These are use cases.
  • MM: first: turns out that recording command buffers is fast. Not usually the bottleneck.
  • MM: most patterns in Metal apps using MT are not for traversing diff parts of the scene graph and recording cmdbufs independently. Most uses are: parallel compilations, parallel creation of pipeline state object encoders, parallel texture uploads (including different mip levels). Having all Metal code running on a different thread to free up the main thread. All important. Should consider them all as first-class citizens.
  • MM: in terms of issuing commands to the GPU, 3 patterns. MTLParallelRenderCommandEncoder: better than 1 thread per cmd buf and going wide. Another is encoding & submitting cmdbufs independently: encode in arbitrary order, submit in desired order. Also common for apps to use multiple devices and farm out diff things to diff ones. Also common to have multiple queues.
  • MM: splitting into priorities: parallel compiles / texture uploads / PSO creations are most important. After that there’s a bunch of others.
  • CW: thanks for overview.
  • KN: thanks.
  • CW: parallel compilation is something WebGL struggled with until recently. Current API for WebGPU doesn’t seem to help with parallel compiles right now. Could be impl detail.
  • JG: natural thing to do - if we had cross-thread WebGL contexts - would have been to do the shader compile synchronously from a background thread. This is common in state of the art OpenGL applications. Huge memcpys for texture uploads, etc., and have those go in background.
  • KN: would be easy to do parallel shader compilation in all alternatives. Only alternative 2 works for uploading diff mip levels simultaneously.
  • JG: Transferable is like moving, Serialization is a copy?
  • CW: yes.
  • CW: for tex uploads on background threads, do they issue the commands for it or just prepare everything?
  • MM: not sure.
  • DM: what kind of developers were surveyed?
  • MM: 2 sources. Don’t want to give names but we have a lot of Metal developers in Apple. Second source was the ecosystem outreach team who talks with third-party developers.
  • CW: seems parallel compile operation should be easy as long as we give devs a way to see them happening. Tex uploads, have to figure out right path. Map buffers in different worker?
  • KN: yes.
  • JG: you mentioned uploading separate mip levels. Was pattern, have background thread uploading mip levels in series, or uploading mip levels concurrently?
  • MM: it’s a trick to get loading screens shorter. Upload lower qual mip levels first to get things started. While playing in background, upload higher mip levels.
  • RC: from convs with D3D team that’s common. Networking, take a bunch of textures, maybe compose together, then upload lower mips first. I can see this when I play games.
  • JG: definitely tricker than background compilation.
  • RC: also seeing from devs lots of shader compilation in background threads. It’s DXIL to native binaries, too.
  • JG: do we want to address this in multiple parts? Immutable resource creation from multi workers would be easiest.
  • KN: sounds good to me.
  • KN: think it makes sense to think about easier cases first, like shader compilation. When writing mine, was thinking about resource creation on background threads, shaders, etc. Shaders was the easiest one. 3 alternatives (a rough sketch of the first, for shader compilation on a worker, is at the end of this section):
      1. All objects are Movable between threads, but no object can be referenced from multiple threads at the same time. The exception is Device, which is always internally synchronized. For shader compilation: on a background thread, create the shader module, pipeline, etc., then transfer the final object back to the thread it will be used on. Despite being restrictive it works fine for shader compilation.
      2. Everything can be referenced from multiple threads. Always a new reference, never moving things. Easy for shader compilation.
      • JG: that’s what most native apps are working with / expecting.
      • KN: Yeah, except usually with no internal synchronization guarantees.
      3. For shaders and pipelines, same as (1), because they’re immutable. For other resources, more complicated.
  • JG: what happens if you have texture view into texture and then transfer texture to diff thread?
    • KN: under (1). Here’s why this is annoying. I did it with Buffer and BindGroup. Two objects, point at each other, so if you Transfer you either have to Transfer both or break the association. For this reason I think (1) isn’t good.
    • JD: but if you serialize you have race conditions?
    • KN: yes, that’s the problem with (2).
    • MM: well behaved app has to have locking mechanism (via postMessage), which worker is OK to run a command at a given time?
    • KN: for most part think you won’t see these dependencies. Only if you’re trying to init a buffer from multi threads at the same time.
    • CW: all creation stuff would work fine. GPU commands would work fine; app would pull cmdbufs before submission. Scary thing is e.g. buffer submission state machine. App could have sync problems there.
    • JD: concerned about driver sync problems too.
    • JG: shouldn’t be able to.
    • KN: we’ll have to do all sync inside impl. Have to be some guarantee about what we pass into driver. High-level data races between threads.
    • JG: similar to what we have now, but with locking overhead. May have drivers where you don’t need to do that.
    • KN: for easy case (shader pipelines) both options work fine. Unlikely / impossible to mess up because it’s immutable upon creation.
    • JG: my favorite is (2). Shared mutable state, if you mess with it without locking you get races.
    • KN: without significantly more investigation it’s certainly the most obvious one which actually works.
    • MM: In (1) would we have to add new API to transfer objects?
    • KN: no, it’s just postMessage and transfer list.
    • MM: in (2) then you have to post message it?
    • KN: no. (1) transferable. (2) not transferable. To get to worker, just postMessage them. Transfer list is a postMessage arg.
    • CW: certain kinds of objects are copied by reference, not value. SharedArrayBuffer’s the only one currently.
    • KN: and in (2) these objects would work the same way.
    • RC: where do data races come in?
    • CW: mapBuffer and unmapping from other thread. Need CPU side tracking on objects. Mainly done for buffers right now. Likely impls will lock on queue submit.
    • KN: likely you’ll have one or the other write into the buffer.
    • DM: but textures are not CPU accessible. Why immutable?
    • KN: e.g. 2 different cmdbufs that write in to texture and submit them.
    • DM: so it’s the submit race? Lot of things going on there which we’re not discussing for purpose of transfer / copy between threads. Think it’s mainly fences / buffers. Race where 1 thread tries to map, second tries to get the buffer. Unlikely we’ll see this.
    • KN: other question of whether this could even pass TAG review, because like SAB, it exposes actual parallelism to JS, which is usually not allowed.
    • RC: texture sub resources aside, do we care about multiple threads writing to a single buffer?
    • CW: does buffer mapping return ArrayBuffer or SharedArrayBuffer?
    • JG: part of the question is, do you have a writable buffer and a reference to the buffer.
    • CW: we have BufferView proposal we have to get back to someday. Could handle Rafael’s case. Or have buffer mapping return SABs.
    • RC: sub-resource thing could be handled with new API to split texture into sub-resources. For buffers, is it necessary for these to be true, multithreaded, serialized, or just let people use multiple buffers?
    • MM: kind of like idea of matching TextureViews and BufferViews (though the latter doesn’t exist yet.)
    • RC: so let views be Transferable?
    • MM: right. Threads A and B could have 2 views that don’t intersect. Then they can read/write etc.
    • CW: mutation of texture contents can only happen on the GPU right now. Would all have to go through queue submit, which is likely internally synchronized.
    • MM: point I’m making is, for good MT story with buffers, need BufferViews.
    • KN: sounds to me like you want a buffer, pull out a bunch of BufferViews, but can’t have intersecting BufferViews, then transfer all the views to another thread.
    • JG: what if the Buffer itself were transferable? Don’t invalidate bind groups? Just invalidate on its host thread?
    • KN: transfer back before BindGroup starts working again?
    • KN: still then have races between mapping and queue submit.
    • CW: then might as well have Buffer Cloneable.
    • CW: concurrency is hard. Maybe let’s think about this more offline.
    • MM: one use case: treat GPUBuffer as SharedArrayBuffer.
    • KN: Corentin had the Map operation return SAB. Same idea.
    • MM: that tries to subsume lots of complexity into the ArrayBuffer design itself. Maybe try to not do that. Either the whole buffer or pieces are accessible only from one thread, but don’t let GPU buffers have SAB semantics. Seems hard.
    • RC: I agree with Myles.
    • JG: tradeoff between impl and app complexity. If Buffers aren’t shared mutable state it’ll be a porting concern. Ideal if we could solve this in a simple shared way. In interim, maybe land the immutable object creation stuff we know we can do.
    • CW: you suggest GPUBuffer can’t be mapped on more than one thread?
    • JG: maybe don’t allow that for now. Only immutable object creation.
    • DM: createBufferMapped, we need on background threads.
    • JG: thinking, instead of holding up obj creation on threads, let those go through.
    • CW: what about allowing creation of buffers that aren’t mapped?
    • JG: interesting.
    • RC: If we want to optimize just shader creation, then we can do that.
    • KN: thought about that. Doing it in WebGL.
    • JG: I want to not do that in WebGL. Seems like a worse solution to me.
    • KN: usually lot of other processing involved. Shader preprocessing steps. Compilation or transpilation to spirv/whlsl. Bunch of work, not subsumed by that API. Have to do it on a separate thread.
    • DM: why are we talking about mapping from multiple threads at all? Don’t think we need that. Model of buffer use is extensible to multiple threads. Don’t need to map a unit from multiple threads, etc. Does someone see a good use case?
    • KN: see advantage. Multithreaded texture decoder. Solved with MapBuffer returning SAB.
    • MM: Question is whether doing all the work from multiple threads and then copying from the SharedArrayBuffer into the ArrayBuffer is significantly worse than that. (Rough sketch of this pattern at the end of this section.)
    • CW: in your impl that mapped buffer could / should be GPU memory.
    • MM: increasing # of copies, decreasing complexity of impl.
    • KN: Don’t think it increases complexity. We already have SAB.
    • CW: OK. Concurrency not something we have to solve right now. Maybe discuss offline, what’s transferable when, discuss more on issue.
    • JG: would like to put on agenda for next week.
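
A rough sketch of alternative (1) applied to parallel shader/pipeline compilation, as referenced above. The descriptor shape and the Transferable mechanics are assumptions about the proposal in #354, not settled API; `device` and `myShaderSource` are assumed to exist from setup:

```ts
// compile-worker.ts -- hypothetical sketch of alternative (1): the device is
// internally synchronized and may be referenced from any thread, while other
// objects are owned by one thread at a time and moved via the transfer list.
onmessage = (ev: MessageEvent) => {
  const device: GPUDevice = ev.data.device;
  const module = device.createShaderModule({ code: ev.data.code });
  const pipeline = device.createRenderPipeline({
    vertexStage: { module, entryPoint: "main" },
    // ...layout, fragment stage, and fixed-function state elided...
  } as any); // descriptor shape follows the 2019 sketch IDL only loosely
  // Alternative (1): hand ownership of the finished, immutable pipeline back.
  // In a worker global scope the second postMessage argument is the transfer list.
  postMessage({ pipeline }, [pipeline as unknown as Transferable]);
};

// main.ts -- kick off compilation on the worker and receive the pipeline back.
declare const device: GPUDevice;        // assumed: obtained during setup
declare const myShaderSource: string;   // assumed: the shader text to compile
const worker = new Worker("compile-worker.js");
worker.postMessage({ device, code: myShaderSource });
worker.onmessage = (ev) => {
  const pipeline: GPURenderPipeline = ev.data.pipeline; // usable on this thread from here on
  // ...encode render passes with `pipeline`...
};
```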
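A rough sketch of the pattern MM suggests instead of giving GPU buffers SharedArrayBuffer semantics: workers decode in parallel into a SAB, then one thread pays a single extra copy into the ArrayBuffer obtained from mapping. The two `declare`s stand in for things the app already has, and `getMappedArrayBuffer` is a placeholder for however mapping ends up being spelled; nothing here is settled WebGPU API:

```ts
declare const textureByteLength: number;               // size of the decoded image, assumed known
declare function getMappedArrayBuffer(): ArrayBuffer;  // placeholder for the mapped staging buffer

async function decodeAndStage(): Promise<void> {
  const decoded = new SharedArrayBuffer(textureByteLength);
  const workers = [new Worker("decode.js"), new Worker("decode.js")];
  await Promise.all(workers.map((w, i) => new Promise<void>((resolve) => {
    w.onmessage = () => resolve();
    // Each worker writes its half of the image directly into the shared memory.
    w.postMessage({ buffer: decoded, offset: (i * textureByteLength) / 2 });
  })));
  // One extra copy, SAB -> mapped ArrayBuffer; then the buffer can be unmapped and used.
  new Uint8Array(getMappedArrayBuffer()).set(new Uint8Array(decoded));
}
```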

Multi queue

  • Not discussed.

PR Burndown

  • KN: Would like to prioritize load ops and clear values #284 / #283 / #268
  • LoadOp / Clear / default attachment in attachment descriptors
  • KN: been sitting here a long while. Would like to figure out what we’re doing with load values / LoadOps. #283, merge the load value and LoadOp together into a single field (rough sketch at the end of this section). #268, adding defaults for StoreOps. #284, where DM suggested associating the texture with a clear color value.
  • KN: open for a long time, would like to get through it.
  • CW: #283 seems really nice.
  • KN: I like that one. Need to fix it, it’s wrong right now. Supposed to merge some other clear values too.
  • JG: don’t like the opaque-to-black load op.
  • KN: don’t care about that one.
  • JG: if never any more load ops except load / clear, could do nullable thing.
  • KN: fixed the PR.
  • JG: way it’s written now, could say it loads 0, or actually loads. Could say, depth clear value defaults to null, and if not there, we load? And if clear value, we clear to that first? Right now it’s weird because it takes either float or string. Strange you can tell it load, zero, or here’s the value you want to load.
  • KN: I’m OK with “load” or color / float / u32.
  • JG: If only useful values are load or user-specified clear value, then we can just center on those.
  • CW: what about DontCare?
  • JG: that’s the only other concern.
  • KN: for this, it would work. Load / DontCare / value.
  • CW: think it’s great to force users to specify what they want for this.
  • KN: could also force them to type “null”.
  • DM: was discussed in #276. Rejected. Opposite of what user thinks. If they want some extra work, they have to type extra symbols. This was the opposite. I’m in favor of Kai’s open PR.
  • KN: Don’t think it’s weird to union string and value type. In the JS world outside the Web platform, people do this all the time and it’s easy enough to detect between the two.
  • JG: fair. Only recommendation: the only load op should be “load”. Remove “opaque-black” and “zero”.
  • KN: Fine with me.
  • CW: Sounds good.
  • RC: Was going to suggest what Jeff said.
  • RC: if we get into trouble with people then could just have the enum be something different in the dictionary. Want it to be “load”, but add a different color.
  • KN: that’s where we were heading.
  • JG: That’s fair. I’d like to try this. It’s a cool idea to embed the load value in the LoadOp.
  • RC: suppose somebody puts “0”.
  • JG: think it coerces into a number.
  • KN: depends on WebIDL and think it’s not allowed.
  • MM: did we have a way to specify a depth clear value before? The only changes to it in the patch are removals.
  • CW: refresh. Kai rebased, removes clearColor/Depth/Stencil.
  • MM: Think this is fine; still want defaults and there’s still a patch for it.
  • KN: still allows for defaults.
  • CW: OK, can talk about patch with defaults next time.
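
A rough sketch of the direction agreed on for #283: LoadOp and the clear value merged into one member that is either the string “load” or a clear value (a GPUColor for color attachments). Only that much is from the discussion; the member names (loadValue, depthLoadValue, stencilLoadValue), the attachment field names, and the declared views are illustrative guesses, not final spec text:

```ts
// Hypothetical render pass descriptor under the merged LoadOp / clear-value scheme,
// assuming WebGPU typings for the texture views only.
declare const colorTextureView: GPUTextureView;
declare const depthTextureView: GPUTextureView;

const renderPassDescriptor = {
  colorAttachments: [{
    attachment: colorTextureView,
    loadValue: { r: 0, g: 0, b: 0, a: 1 },  // a GPUColor means "clear to this color"...
    // loadValue: "load",                   // ...while the string "load" keeps existing contents
    storeOp: "store",
  }],
  depthStencilAttachment: {
    attachment: depthTextureView,
    depthLoadValue: 1.0,                    // float clear value for depth
    stencilLoadValue: "load",               // or load the existing stencil contents
    depthStoreOp: "store",
    stencilStoreOp: "store",
  },
};
```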

Agenda for next meeting

  • Compressed texture copies?
  • Multi threading
  • CW: coordinate spaces, winding, texture flipping. We’ll come up with investigation, would like to discuss in group.
  • CW: ok, let’s cancel next week (absences from Google, Microsoft).
  • (later) KN: Didn’t get to #284, would still like to