GPU Web 2020 05 04

Chair: Corentin

Scribe: Ken / Austin

Location: Google Meet

Tentative agenda

  • Specification for GPUBuffer mapping #708
  • writeBuffer #509 #734
  • Copying of depth24plus formats #652
  • Should we have a core "stencil8" format? #306
  • PR burndown
    • Add GPUPipelineBase.getBindGroupLayout. #543
    • Add pipeline shader module error enumeration. #646
    • Document pipelines #724
    • Add validation rules on copyTextureToTexture #725
    • Merge GPUTextureCopyView.arrayLayer into .origin.z #730
  • Agenda for next meeting

Attendance

  • Apple
    • Dean Jackson
    • Myles C. Maxfield
  • Google
    • Austin Eng
    • Brandon Jones
    • Corentin Wallez
    • James Darpinian
    • Kai Ninomiya
    • Ken Russell
  • Microsoft
    • Damyan Pepper
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
  • Matijs Toonen
  • Mehmet Oguz Derin
  • Timo de Kort

Specification for GPUBuffer mapping #708

  • CW: PR follows the original Bikeshed. Couple of issues:
    • getMappedRange can return multiple ArrayBuffers at different offsets. There are some questions about TypedArray views.
    • DM: we were told that creation of TypedArray view requires alignment of type. Any ArrayBuffer needs to be able to create any view, so has to have biggest alignment of all types. This is 8 bytes currently. Corentin pointed out this is not future-proof. Don't have a solution to that yet.
    • CW: more specifically: if you do new Float32Array on an ArrayBuffer, the offset in bytes must be a multiple of 8. new Float32Array(buffer, offsetInBytes, sizeInElements).
    • CW: require that getMappedRange take offset modulo 8?
    • CW: or, the mapped range can be an unaligned ArrayBuffer? Change the JS spec so that the ArrayBuffer's misalignment plus the view's offset is what has to be aligned?
    • KR: Would lean toward requiring the offset to be a multiple of 8
    • JG: me too
    • CW: concerns from JavaScriptCore side / WebKit?
    • DJ: no.
    • CW: and if JS comes up with bigger-aligned typed arrays? Then maybe we can do the alignment / unalignment later.
    • KN: Float32Array requires alignment of 4 bytes IIRC.
    • CW: Float64Array requires 8-byte alignment.
    • KN: other option: throw exception if you try to create view that isn't sufficiently aligned. Making Uint16Array might work, making Uint32Array might not.
    • JG: how would you figure that out from the JS side?
    • KN: we'd say offset 0 into the buffer is always sufficiently (8-byte) aligned. Any offset you add to that - like 4 - would constrain the types of views you can make.
    • CW: Jeff’s concern is that you get an ArrayBuffer, and you don’t know what the alignment is.
    • KN: you know because you requested it.
    • JG: it's hidden though. Utility functions that already exist that assume ArrayBuffers are 8-byte aligned at beginning would stop working.
    • KN: True, but they wouldn’t break in an unreliable or confusing way. Someone is responsible for having created that array buffer via mapping.
    • JG: it's true, but feels risky to me, and feels safer to say you can only map views on 8-byte boundaries for now.
    • CW: Goes back to usual point: do the more restrictive thing now, and we can relax more in the future.
    • KN: yes.
    • JG: could be cool to not have this restriction but don't think people will be sad if we do have it. If we don't have this restriction we'll have to have more detailed talks with TC39.
    • CW: we should make a list of all the talks we want to have with TC39
    • CW: RESOLVED: require 8-byte alignment for now. See later about relaxing this if necessary. (A short usage sketch follows at the end of this section.)
    • DM: for both offsets and sizes?
    • CW: don't think it's necessary. You can already create ArrayBuffers of size 7.
    • KN: that already works. If you have ArrayBuffer of size 1 (or 7) and try to make Float32Array you get an exception.
      • new Float32Array(new ArrayBuffer(7)) // RangeError
      • new Float32Array(new ArrayBuffer(7), undefined, 1) // OK
    • JG: Do we need to care about alignments for the purposes of accessing an underlying mapping from an API? Like if you make a 3-byte mapping and try to access it, is that okay on all of our backends?
    • CW: We’re talking about the arguments for getMappedRange. I don’t think it matters there, but for the parameters of mapAsync, we probably need a 4-byte alignment like we do for buffers. On Metal, B2B copies need to be 4-byte aligned.
    • JG: did we allow overlapping ArrayBuffers?
    • CW: overlapping getMappedRange? Yes. DM talked with SpiderMonkey folks, they said it's OK.
    • DM: overlapping's OK, as long as it's aligned.
    • CW: we also require 4-byte alignment for mapAsync call arguments because of buffer-to-buffer copies.
    • AE: we should check if overlapping's OK with the V8 team. We ran into an issue where two ArrayBuffers with the same base address get a crash in V8. If they're overlapping and have the same start (like 0..8 and 0..12), that could be a problem.
      • KN: there's a map keyed on the base address in V8.
      • MM: isn't this a legitimate bug in the engine?
      • KN: no, you can ordinarily only create multiple views, not multiple ArrayBuffers pointing to the same address.
      • JG: please check with V8 team. If we have to say no overlaps, that's probably fine too.
      • CW: you can essentially memcpy in JS by doing Uint8Array.set(something else). What if you memcpy 2 different ArrayBuffers pointing at the same data? Is that valid? Do the JS engines handle that?
      • JG: 2 different ArrayBufferViews that use set()...yes. Decays into memmove. But different ArrayBuffers pointing to same memory...nobody's thought about that.
      • CW: have to talk more with JS people then.
      • DM: what was motivation for allowing intersections anyway?
      • CW: "why not"?
      • JG: good motivation. :)
      • CW: if it works then it's easier to spec, implement, and less restrictive. If can't work, let's not do it.
      • JG: it does sound like more work than originally thought. Even just given existing feedback from V8 team, maybe we should restrict to disjoint mapped subranges.
      • CW: can keep talking with JS folks, see if we want to allow this in the future.
    • CW: DM also pointed out that the existing spec in Bikeshed has a weird corner case: if you map for reading, then write to the ArrayBuffer you're supposed to only read from, and then map again, the data you wrote just disappears. The spec is written like this because nothing gets updated. This introduces an issue where there has to be at least 1 copy when you read back, because we have to copy the whole range of data, unmodified by JS, back to JS. Other corner cases exist; all result in a little overhead. In review, couldn't figure out how to change this to either allow writes by JS to be seen by JS again without additional copies / complexity, or optimize it to allow writes. This is another corner case.
    • MM: have we discussed inventing read-only ArrayBuffer? WebGL created typed arrays.
    • JG: the original typed array conformance tests were in the WebGL conformance tests. I talked with our JS team; they think it could be possible. A little gun-shy about it. I got a "maybe".
    • MM: I can talk with our JS team. That would be the ultimate solution to this problem.
    • JG: it's something we keep running up against. Would be nice to have this as a primitive.
    • KR: One caveat: in Java, there was a 10x hit if you made any buffer read-only. That's because Java's class hierarchy optimizations were different from JavaScript's. But this is the reason we didn't put read-only into the original spec. We can revisit it with our JS teams. It's also the reason for the separation between TypedArray views and DataViews, in terms of in-memory modification vs. I/O use cases.
    • MM: Is there more description on that?
    • KR: I can write that up.
    • RC: Are there similar things in the other way? For the destination ones, if you map more than one time, the stuff stays there?
    • CW: For MapWrite, the only data you see is what JS put there.
    • RC: Okay, so it starts at zero the first time, and every time you map again, you see what was last there.
    • JG: What if we forbid Read without Write? Then you can write into it and it’s fine.
    • RC: what do you mean by that?
    • JG: if you map for read, you also have to map for write, because of the restriction coming back from ArrayBuffers.
    • KN: then it's only COPY_DST, so if you write data into it it doesn't matter.
    • CW: when you unmap, we have to ship the data over to the GPU, especially if not over the fast path. That requires going through a shmem upload. Depends on where you put the tradeoff. On the path where you can wrap shmem, it would be 0-copy; in the case where you map shmem, it would normally be 0-copy, but we don't want JS to be able to see things later, so it's 1-copy. If you force MAP_READ to also be MAP_WRITE, then in the case where you can't map shmem you have 2 copies: one from GPU->renderer, then a copy to update the data that's in the GPU process when you unmap. The tradeoff as currently written in the PR is that in both cases you have 1 copy. Alt tradeoff: when you can't map shmem you have 2 copies; in the case where you can wrap it you have 0.
    • KR: Requiring MapWrite when you MapRead seems like it discards the purpose of having the separation.
    • JG:
    • KR: We can’t tell whether they wrote an individual byte in the mapping, that’s not practical.
    • JG: We can for small amounts.
    • CW: But then you do a memcmp and hit all the cachelines.
    • CW: My suggestion is that we keep the semantic as-is and see if we can do readonly ArrayBuffer or something better. Keep it for now and have this one unnecessary copy when you can wrap shmem.
    • MM: I should say it’s important we don’t forget about writeBuffer.
    • CW: it's the next agenda item.
  • CW: one more thing: under review, DM had a concern that having a single mapAsync wasn't clear. At the call point you don't know whether it was mapped for reading or writing. Suggestion to split it into mapReadAsync / mapWriteAsync. Or, mapAsync with a READ/WRITE flag. In the future, also desynchronized, discard. I think it could take a lot of time to discuss this in the meeting, so let's move on. What do you all think about landing the PR as is, opening another discussion about this specific point, and then modifying the Bikeshed accordingly?
  • JG: land the mapping PR?
  • CW: #708, yes, without the map flags (mapRead vs. mapWrite async, no flags for now). We can all study it and raise issues.
  • JG: think the official way to usually do this is to have a branch and have PRs against that. When we’re all happy with the branch we merge it in.
  • CW: we haven't done that in this group so far.
  • JG: it's the normal way feature branches work in many git flows, so suggesting it here. People who look at our spec and say, that's what's going to happen (they shouldn't do this, but...) feels a little weird to land something that's not ready.
  • KN: I prefer to land this with a feature.
  • MM: I think it’s pretty important that writeBuffer comes along with this, and splitting them up makes me uncomfortable.
  • KN: from the editor's perspective it's hard to have multiple in-flight requests against the spec.
  • CW: Let’s discuss writeBuffer now. The PR can stay in flight for more weeks I guess.
  • KN: I just opened #747 about things we should talk with TC39 about.
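
A minimal sketch of the resulting flow (illustrative only, not discussed verbatim in the meeting): it assumes the mapAsync()/getMappedRange() shape from PR #708 with no mode flags, per the discussion above, and the 8-byte offset alignment just resolved; exact names may still change before the PR lands. Inside an async function:

    await buffer.mapAsync();                      // resolves once the buffer is mapped
    const range = buffer.getMappedRange(8, 256);  // offset must be a multiple of 8
    const f64 = new Float64Array(range);          // OK: the range starts 8-byte aligned
    const f32 = new Float32Array(range, 4);       // OK: Float32Array only needs 4-byte alignment
    // new Float64Array(range, 4) would throw a RangeError: 4 is not a multiple of 8.
    buffer.unmap();                               // detaches any views created on the range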

writeBuffer #509 #734

  • CW: two issues: PR from DM which adds writeBuffer - maybe stale by now. Second, investigation by Austin into how Chromium would implement writeToBuffer, esp. issues around shmem and minimizing copies. It's how we see the end state of writeToBuffer optimizations.
  • AE: has been discussed a lot in other settings. Most of the APIs have a mechanism to create a shared file mapping in the renderer process and open it in some way with e.g. the Metal driver or the D3D12 API. That's the bulk of how we'd do the upload mechanism. Allocate large chunks of this shared mapping, import it, and sub-allocate out of it. writeToBuffer gives you a sub-allocation of a ring buffer. One copy from the TypedArray view into the chunk. Send a message to the GPU process: want to copy this into the destination buffer. Queues up copies. Submits them before your next submit, as if they'd happened immediately. After submission, the impl checks some fences to see if the copies are complete; when they are, it signals the ring buffer that those chunks are available for reuse.
  • CW: crucial point is that the ring buffer's open at same time in renderer process and in GPU process as a buffer. Write once in renderer, and on GPU process side, you encode CopyBufferToBuffer, and that's it. Signaling, etc.., but only these two copies on either end, because you can open shmem as a GPU buffer.
  • JG: would love to see this as something userland apps could do outside of WebGPU. That's the point I'd prefer to get to. I still have concerns about writeBuffer that I think are intractable - I don't prefer it as an API solution.
  • DM: 1) to Austin's point - I think we'd use the same logic for createBufferMapped. Essentially, when creating a buffer with data and the buffer isn't mappable, you allocate shmem and copy into it. Should be possible to reuse the infrastructure. Whether or not writeBuffer exists, we'll have this infrastructure.
  • CW: Austin pointed out 1 difference is that with writeBuffer you have ring buffer and signaling back and forth. With createBufferMapped you wouldn't do ring buffer because you don't know how long it'd be mapped.
  • DM: you'd expect user wouldn't hold on to the mapping for long.
  • JG: Not sure that’s a fair assumption to make given the shape of the API. I can see cases where they’re streaming data into a mapping that’s coming over the network. There’s no way to tell you that it may be around for a while.
  • DM: could be. Even in this case it wouldn't break. You'd just have more shmems open at one time. Wouldn't break entire uploading infrastructure.
  • CW: comes down to user agent heuristics.
  • DM: Want to point to the proximity of these 2 semantics. 2) JG's concern about writeBuffer is legit IMO. Some big application comes to us saying we need this particular thing to be faster, and on mobile, for example, we write to the buffer directly. Then there's pressure on all of us to do the same thing, even though it's not spec'd. If we had a way to prevent this scenario I think it would solve the concern. I suggest: if that happens - if we do get asked to write to buffers directly - then we gather together and discuss how it happens. For example, the buffer has to be map-writeable; say what the alignment has to be for writes to be direct; tracking of GPU usage. We should gather together and describe spec changes instead of doing hidden accelerated code paths.
  • CW: Imagine that happens: wouldn’t the rule essentially amount to mapAsync?
  • DM: With the client side tracking?
  • CW: Ah I see.
  • MM: in the specific situation - dialogue between browser vendor and large engine - wouldn't the browser just tell the engine to use the other API?
  • CW: that's probably what we'd do. Just did this with bgfx today. I reviewed the code and suggested exactly that.
  • DM: the other API doesn't solve this. We're talking about writing to part of buffer while GPU is using other part of buffer. Point is - if we do want to do this, let's agree to do it in a process other than trying to hack the impl to do this implicitly.
  • MM: That sounds fine to me. Having additional communication about which stars have to align to get on the fast path would be reasonable.
  • CW: there's track record on doing this - on Chromium side there was getMappedRangeAsync (?) - because a lot of customers wanted it and Chromium started doing prototypes. Discussion came to WebGL WG, and nicer, more consistent API was found. Hopefully we can do the same thing here.
  • KR: That absolutely happens. Jeff found it could be implemented with existing primitives, Kai modified his proposal, and it turned out great.
  • JG: Part of the success was figuring out how to offer the functionality without expanding the API surface. Would like to do the same thing here.
  • DM: that comes to a different philosophical conflict - do we want most apps to use a helper library to access the API? Approach Jeff's suggesting is to only have the powerful primitive and build on top of it. My preference is to have something people can use right away.
  • MM: On our platforms, we’ve deprecated OpenGL and created an API that at least we believe is both powerful and easy to use and easy to learn. I hope we can do the same in WebGPU.
  • JG: Appreciate that feedback, but what I'm concerned about is something that doesn't exist for Metal or OpenGL on Mac: pressure to adjust implementations based on outside influence. People don't have other choices for Metal drivers or OpenGL-on-Mac drivers.
  • CW: if you look at evolution of Metal API, it certainly seems there's feedback from developers on other systems, used to having other primitives, that adds features to Metal. Explicit resource management was not in first version of Metal. Probably came because of developers wanting to do things more manually / performantly.
  • JG: From an implementation standpoint, it's safer and more reliable to have only one way to do something. Adding additional paths is more maintenance cost. … People say "your solution didn't work for me, can you modify it and make it more powerful?" I'd like to avoid that here. I understand people still want writeBuffer. I don't agree, but I don't think it's worth continuing the disagreement. Everyone else thinks it's valuable, so I think we should just move forward with it.
  • CW: thanks Jeff for moving in the spirit of compromise. We all agree that writeBuffer is the easy-to-use, good-enough primitive, but we have to do a good job of educating users about the efficient paths. Just because the spec has writeBuffer doesn't mean we should steer people toward it; we should steer them toward mapAsync instead.
  • JG: It does seem weird to me to add something where the best practices don’t use it. I think it’s valuable to have that if we do have writeBuffer, but want to make sure that’s understood.
  • CW: I think it is.
  • DM: we wouldn't say, don't use it. We'd say, if you want more control, you can go this other path.
  • RC: is it the case that writeToBuffer would be more performant than mapAsync?
  • CW: I'd consider it an implementation bug if it's consistently more performant.
  • RC: so it's a helper function through and through.
  • JG: I expect it to be more performant. When our V1 of WebGPU comes out, I fully expect writeBuffer to be more performant in some cases.
  • RC: Why are we telling people to use it?
  • CW: on Chrome's impl, once we implement what Austin described, I don't understand why writeToBuffer would be more performant than mapAsync. Should be consistently slower, or equal.
  • MM: my intuition is that it should have similar perf. Would be interesting to do perf comparisons.
  • CW: (to RC) the expectation on the Chromium side is that we'll have perf tests for these things, and the two lines shouldn't cross. Or there's a bug in the mapAsync impl.
  • RC: OK. Is it possible to impl writeBuffer in JS only?
  • CW: not as efficiently as the browser.
  • MM: you can createBufferMapped, enqueue a copy, then delete the buffer. That creates a new buffer, which isn't what writeBuffer does under the hood. So, no, you can't do exactly what writeBuffer does, but you can write a different program that does what writeBuffer does. (A sketch of this approach follows at the end of this section.)
  • JG: I did something like this for the staging belt polyfill.
  • MM: difference is you need a data structure of pre-mapped buffers. One API design goal was to be able to upload for current frame.
  • JG: the belt polyfill in JS has that. You have a bunch of pre-mapped buffers, and if you run out, it does createBufferMapped. I believe I've written it.
  • MM: I'm trying to say it's not possible without that data structure.
  • JG: I think we still disagree. You can use that library and phrase it in terms of writeToBuffer, and it'll look like and behave like writeToBuffer.
  • CW: in Austin's doc, if you squint it's like a ring buffer of buffers, but you can race with the GPU. You don't have GC overhead, etc. Need to assume that the data you have becomes magically visible on the GPU. Looks conceptually similar. JS doesn't have access to all these bells and whistles.
  • RC: seems that writeToBuffer's better and we should have people use it, right?
  • CW: no - if you know that each frame you upload up to 4 MB of uniforms, the correct thing to do is, each frame, have a 4 MB buffer mapped and copy it to your uniforms. That should be faster, all the time, than a bunch of writeToBuffers. If you can structure your app around it, mapAsync is the better option.
  • RC: OK.
  • MM: did we get a resolution?
  • CW: I think we did 15 min ago.
  • JG: any comments I made after 15 min ago are just to add color. From a group project standpoint I think we should move this forward.
  • DM: I'll update the pull request and we'll proceed with the editors on how to land it?
  • CW: I think the semantics of writeToBuffer are fairly easy, so we don't need another go-round with the group. writeToTexture is a different thing we haven't talked about.
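
A sketch of the polyfill approach described above (illustrative only): emulating writeBuffer by creating a fresh mapped staging buffer, enqueuing a buffer-to-buffer copy, and destroying the staging buffer. Names follow the spec as of this meeting (createBufferMapped, defaultQueue) and may differ from what eventually ships; a real "staging belt" helper would recycle buffers instead of allocating one per call:

    function writeBufferPolyfill(device, dstBuffer, dstOffset, srcTypedArray) {
      const size = srcTypedArray.byteLength;      // assumed 4-byte aligned, as copies require
      // Staging buffer that starts out mapped for writing.
      const [staging, mapping] = device.createBufferMapped({
        size,
        usage: GPUBufferUsage.COPY_SRC,
      });
      new Uint8Array(mapping).set(
          new Uint8Array(srcTypedArray.buffer, srcTypedArray.byteOffset, size));
      staging.unmap();
      // Enqueue the copy into the destination buffer (which needs COPY_DST usage).
      const encoder = device.createCommandEncoder();
      encoder.copyBufferToBuffer(staging, 0, dstBuffer, dstOffset, size);
      device.defaultQueue.submit([encoder.finish()]);
      staging.destroy();                          // a staging belt would reuse this instead
    }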

WGSL recap

  • DJ: There are some WGSL discussions that might benefit from more input. e.g. some where the group basically agrees on the concept, but hasn’t yet agreed on the specifics.
  • DJ: Please feel free to give opinions.
  • MM: should we every month or two have a roundup of what's been happening in WGSL and talk about it on the WebGPU call?
  • DJ: most of the people are the same so don't want to duplicate effort.
  • CW: people that want to follow loosely should probably just follow what DJ sends in meeting announcements.
  • MM: OK. We don't have to talk about this more.

PR burndown

  • Add GPUPipelineBase.getBindGroupLayout. #543
    • Vaguely agreed on this during the last F2F. This is the real PR with more spec text. Can't just land this with editors. (A short usage sketch follows at the end of this list.)
  • Add pipeline shader module error enumeration. #646
  • Document pipelines #724
  • Add validation rules on copyTextureToTexture #725
  • Merge GPUTextureCopyView.arrayLayer into .origin.z #730
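
For reference, an illustrative sketch of what #543's getBindGroupLayout enables (not discussed in detail in the meeting): creating a bind group from a pipeline's derived layout instead of an explicitly created GPUBindGroupLayout. The bind group descriptor field names below are assumptions; they were still in flux at the time:

    const layout = computePipeline.getBindGroupLayout(0);  // layout for bind group index 0
    const bindGroup = device.createBindGroup({
      layout,                                              // derived from the pipeline
      entries: [{ binding: 0, resource: { buffer: uniformBuffer } }],
    });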

Agenda for next meeting

  • Let's discuss topics between WGSL and WebGPU API.
  • shader module enumeration
  • JG had a couple of PRs to discuss here.
    • JG: source maps. Anything straddling language and API. WGSL group is like a Technical Sub-Group.
    • JG: also shader compiler messages.
  • MM: being able to supply header files to compilation.
  • DJ: Justin's away this week so you can ask me or Myles about any of these.