GPU Web 2020 03 23

Chair: Corentin

Scribe: Ken

Location: Google Meet

TL;DR

Query API #614 and #619
- Agreement that the unit for timestamps should be the same everywhere
- It's not possible to have timestamps immediately available on the GPU on iOS so there needs to be an API different than writeTimestamp.
- Discussion around delta vs. absolute timestamps
Managing on-chip memory #435
- Concerns voiced on the issues and in the meeting, and agreement to drop this for now.
- Agreement that some form of tile management is very important (core, or extension)
Data upload (#605 is somewhat new)
- The new proposal splits read and write map calls.
- Still concerns about needing subrange map read
PR burndown
- Add definitions about copy commands with textures #623
  - No discussion on the PR itself but agreement that not all PRs need to be discussed in the calls.

Tentative agenda

Query API #614 and #619
Managing on-chip memory #435
Data upload (#605 is somewhat new)
PR burndown
Agenda for next meeting

Attendance

Apple
- Justin Fan
- Myles C. Maxfield
Google
- Austin Eng
- Corentin Wallez
- Kai Ninomiya
- Ken Russell
- Shrek Shao
- Ryan Harrisson
Intel
- Yunchao He
Microsoft
- Damyan Pepper
- Rafael Cintron
Mozilla
- Dzmitry Malyshau
- Jeff Gilbert
Mehmet Oguz Derin
Timo de Kort

Administrative items

CW: There’s a thread about adding more GLSL-based CTS tests temporarily.
- MM: I’ll reply by end of day
- CW: Think internally we talked about how we could put it in a branch and then convert to WGSL for master. Otherwise, (DJ is not here), there was an update on the charter. Francois made an update on his PR.

Query API #614 and #619

YH: a few open issues for the investigation:
YH: do we need begin/end query for timestamp query? Maybe can just return timestamp delta, or just the timestamp?
MM: last time we talked about this, was a concern that absolute timestamp numbers might give authors incorrect results. Any counter argument to that?
YH: correct.
CW: my concern: queries are grouped into a single concept, but they’re all different. Pipeline stats and occlusion are HW counters. Somewhat similar. Timestamp is different - I want data about something now. Putting everything in the same API doesn’t necessarily work well, may make timestamps more complicated.
YH: this was in a meeting after the F2F. If we return timestamp directly, few issues:
- Some APIs return ticks, some times. For developer, fragmentation issue.
MM: I agree that having the meaning of timestamp counters different on different platforms would be unfortunate for the web. Think there needs to be a bake step to e.g. run a compute shader to convert ticks to timestamps or vice versa.
CW: agree. Delta-proposal already requires compute shaders.
YH: yes.
CW: so that doesn’t really change the need for compute shaders.
YH: but if it’s done by the impl, web developer doesn’t need to handle the fragmentation.
MM: should we talk about beginTimeGPU/endTimeGPU?
MM: our concern is: it would be great if timestamp queries weren’t optional in WebGPU. They’re probably the most imp’t queries, exist in all 3 APIs. If non-optional, should be implementable on top of gpuStartTime/gpuEndTime, that’s only available on iOS, and it should be able to use these facilities.
AE: we could have two levels of timestamp queries, one on cmdbuffer and one at higher level.
MM: sounds great.
JG: why is this useful?
AE: so you can figure out how long cmd buffer took to run, and maybe finer grained-stuff like render passes.
YH: are they different APIs?
AE: one could be in core.
JG: I’d prefer to use the same thing if we can. Same API for both use cases.
DM: didn’t we clarify that Vulkan doesn’t support timestamp queries at all? Can be zero?
MM: but in reality do all Vulkan devices support more than 0 bits?
CW: not sure. Certainly in OpenGL ES some devices decided not to support them.
KR: yes, that’s why some WebGL impls turned off timestamps completely.
JG: any use cases not supported by time deltas? They seem strictly better to me.
CW: if you’re an app trying to measure lots of things. begin / begin / begin again, seems wasteful. Otherwise the two are equivalent.
JG: what do we want for timestamps? Do we want GPU side timestamps? CPU-side process submission timestamp?
CW: in all cases, GPU timestamp. Even for gpuStartTime/gpuEndTime in Metal.
MM: Metal supports both, but GPU side are the more common ones. An API not supporting the scheduling one - the CPU side one - doesn’t have to be handled with this API.
MM: agree with DM, it is useful to measure time to draw e.g. a single character. Having something that’s finer grained than gpuStart/EndTime on platforms supporting it is valuable. But want iOS to work out of the box with queries in the core API.
CW: what about API adding timestamps to GpuCommandEncoder? And on iOS, command encoders can be split, so there’s a different gpuStart/EndTime that can be recorded? Actually nvm, changes the API so these aren’t available on the GPU.
- Suggestion was, what if we had a timestamp write API, used inside renderpasses would give bad precision on iOS, but at command encoder boundary would give good data. Only case is, inside renderpasses, bad quality on iOS. But doesn’t work.
MM: what about API where GPU / CPU can ask for it explicitly? And on Metal on GPU, it would actually be a copy? And if CPU asks for it, would be a download under the hood?
CW: could work. Or, timestamps only allowed to be copied into mappable buffers. Force timestamps to be CPU-only on all APIs. Maybe not useful for GPU to access timestamps in the same frame.
MM: can’t think of use case off the top of my head of GPU needing the timestamps in the same frame.
JG: not sure, but wouldn’t design the API around what we can think of right now.
CW: we need a way to put the gpuStart/EndTime on the GPU on iOS. Assuming we want the right timestamps to work on all platforms, these are given on the CPU side, not the GPU side. Have to receive them on the CPU and upload to the GPU.
JG: I see. Consider: primary use case is not in production software. Might bring back some telemetry based on it, but #1 use case is local development. Portability constraints might be more relaxed then. Results of these queries are not really portable.
KN: think compute will make this a necessary thing in production. Will have to tune parameters at runtime based on the results.
MM: the gpuStart/EndTime is usually available only a frame or two later. So impossible to get it to the GPU on the same frame.
CW: yes. Means timestamps are CPU-side only things.
KN: app would have to upload to GPU if it wanted to.
CW: 2 questions:
- Single timestamp, or delta?
- How do we deal with timestamps being CPU-side-only objects, compared to the rest?
MM: can we resolve on the first one? Think nobody wants them to not be deltas
- JG / CW have at least mild objections to this
- JG: think we should at least document that user story
- JG: if someone’s building a profiler and want history of when various command buffers were completed, without having to begin/end all the time
- MM: OK, makes sense. And if you did begin/end around every draw call, you’d probably get different results than measuring around every duration. Compelling use case, actually.
CW: shall we document this discussion and have a single “write” timestamp? Single command that records a timestamp, vs. a delta?
JG: might be systems where it’s not implementable.
KR: it was mobile devices and macOS (I think) where absolute timestamps weren’t implemented in OpenGL.
JG: can synthesize disjoint deltas from timestamps. So should be able to offer time deltas everywhere. That could be core in the spec. Then timestamps could be nice-to-have.
CW: how about we support CPU-side time deltas everywhere, somehow?
KN: within a cmd buf, or cmd buf boundaries?
CW: inside cmdbuf. Write timestamp you can use on the GPU immediately is an extension.
MM: what’s the plan for iOS, inside a cmdbuf?
CW: split the cmdbuf recording. Wouldn’t split inside renderpass, but some data’s better than no data.
MM: reason for not putting it on cmdbuf boundaries is?
CW: it’s interesting to have data on a single draw call, or multiple draws related to e.g. a single character. But maybe put GPU timestamps on cmdbufs in the core API, and writeTimestamp on GPU is an extension. And maybe only get it in developer mode.
MM: not sure about dev mode part, but rest of that sounded OK.
RC: what’s diff between CPU timestamp and GPU?
CW: GPU timestamp is something put into a GPU buffer, and is usable by next command. First appears on the GPU. The CPU timestamp is a GPU measurement of GPU time, but is first given to you on the CPU. If you want to use it on the GPU you have to upload it.
MM: this is how the Metal API works.
...more discussion…
CW: agreement to change proposals to do more iOS like stuff, and have extension for timestamp query?
YH: so to summarize:
- add core feature for command buffer level timestamp? This is TimeDelta?
  - Agreement: yes, time deltas
- renderpass timestamp might be a real GPU timestamp?
CW: not sure if it has to be a TimeDelta on the command buffer.
MM: second piece you mentioned YH - that should be optional somehow. Might be our first extension.
YH: OK.
CW: getting begin/endTime should probably be something you ask for on the command buffer, so we avoid doing it for every cmdbuf.
MM: that’s fine.
KR: can we get agreement to provide TimeDelta on cmdbufs? Rather than two timestamps?
CW: sounds OK. We’re still interested in impl’ing WebGPU on OpenGL ES, and GLES has good time deltas.
KN: I agree we should do that.
MM: for the record we don’t have agreement that WebGPU is going to be designed with GLES as a target. Not disagreeing with the conclusion of this topic though.
AE: if on Vulkan the precision is 0 bits, what happens?
CW: we just provide garbage.
KR: Vulkan has writeTimestamp but the precision is up to the driver, all the way up to 0 meaningful bits of precision.
YH: for renderpass timestamp, if we return GPU timestamp may need more work on this. Lot of impl details, fragmentation. Ticks vs. nanoseconds. Nanoseconds might have different time base. Also time might be reset to 0. Lot of details. Don’t think a real GPU timestamp makes sense. Can gather more data.
MM: returning garbage is probably OK, as long as developer knows they should ignore those fields.
CW: can just return Inf, or something.
KN: or time on CPU, but probably not helpful.
CW: Let’s move on and make further comments on the issue / PR.

Managing on-chip memory #435

CW: I commented on the implementability of this (multi-sub-pass render passes) and DM corrected me.
CW: in the issue I gave a number of reasons why I think we shouldn’t have this feature in core WebGPU.
JG: I didn’t have a chance to write up my findings. Generally agree - think it’s intriguing for a post-MVP, but think MVP is OK without it. Cool idea, think there’s something we can do to support framebuffer-local things. Just dealing with framebuffer-local aspect of renderpass dependencies simplifies the API a lot. We need all the information we can get to do pipeline creation at pipeline creation time. DM expressed concerns about dynamic pipeline creation, agree with that. Think some parts of this are implementable. Appreciate the proposal here. It moves us toward a possible solution for a subset of the whole subpass dependency problem space.
MM: OK. I’m not going to push much harder on this then. One parting question. From iPhone’s perspective it’s unfortunate that people don’t use on-chip memory. What do you think is the best way to get developers to use it?
- (a) is that a problem we’re solving?
- (b) is there some other way to do it that would get better perf on tiling GPUs?
JG: I think we want it. Don’t want my feedback to be negative. It’s mostly positive. I think there’s a solution in this space, we haven’t quite found it, but we’re in the right area. Think if we can get something like pixel-local storage, think it’s worth doing on our mobile backends. You’ve shown us a direction we could take that simplifies it enough for this use case. I intend to go off and try to narrow down what it could look like.
CW: also think this is very imp’t to unlock perf on mobile. One idea I’ve been thinking about is proposing a Vulkan-like extension for pixel-local storage. More flexible, etc. than Vulkan renderpasses. I’m still studying the problem space. Will likely make an extension proposal. In the future where this or a similar proposal is implemented by mobile driver vendors, an extension doing pixel local storage you’d say “I want this amount of tile memory” and then work with it would be the easiest / most flexible API, and would be part of the feature level for mobile. This is why I want to fix it at the Vulkan level.
JG: do we know what the story is on D3D?
RC: aside from renderpasses there’s nothing to help these kinds of devices.
DP: that’s my understanding too. In fact, mesh shaders are going to make it worse.
CW: why do mesh shaders make it worse?
DP: you have to replay the same vertex / mesh work over each tile. Usually in tiled renderers it can batch up the vertex work and replay it.
MM: in this proposed Vulkan extension, do you think there’s a possibility to build whatever WebGPU proposes on devices not supporting it? On Metal, we expose the functionality, but the reason it’s not used is a culture problem.
CW: technically it can be shimmed if you have storage textures. But I worry the perf would be bad. You did benchmarks for this actually.
MM: yes, storage textures are really really slow compared to this.
CW: if you assume D3D12 is where desktop APIs are going, and Vk/Metal where mobile APIs are going, they go in completely diff directions. Culture today is desktop first, mobile second. It’s a culture problem, not sure we can fix it by exposing a WebGPU feature that works better on mobile.
JG: one question: benchmark on top is useful but that’s on an iPhone correct?
MM: yes.
JG: then synthesizing ImageBlocks at API level on systems not supporting them: if emulation path is not terrible on platforms not having ImageBlock support, maybe could use that alternate path. What about Win/D3D12 etc.?
CW: is that rasterized order views?
MM: I think so. Good point. Worth investigating. Disappointed we can’t do anything in core, but won’t push if people don’t think should be in core.
JG: still have explanation I want to do before writing it out of core from my side. My point is not that it shouldn’t be in core, just not essential for MVP. Think we should still work on it and that something like it should be in core in some form. Should have better support for tiling architectures in version 1. But as triaging things, not sure should be in MVP.
MM: not sure I agree but won’t argue the case. We can move on.

Data upload (#605 is somewhat new)

CW: small update from last week. Were two main requests for changes in the API:
- 1. separate the paths for reads/writes, make mapRead case more like mapWrite
- 1. allow for sub-range mapReadAsync
CW: in latest proposal:
- 1. mapReadAsync/mapWriteAsync are separate to be clearer
- 1. however, don’t think we need subrange mapReadAsync. Reading data back is done much less often. People can create buffer the right size to read back.
- Also, I didn’t put in Promise<ArrayBuffer> for mapReadAsync. This can be polyfilled in queue readback in 12 lines using the current proposal. Mapping proposal has better chance of getting agreement than something on the queue.
DM: why is there no range on mapReadAsync?
- CW: because generally there’s only a small amt of data read back. Can create correctly sized buffer to read back the data.
MM: in compute-only workloads I think that assumption is not true.
CW: right now, MAP_READ only allowed with COPY_DEST. Have one copy between compute workload and CPU. Can size that buffer the right size for receiving that copy.
MM: think you’ve shown it’s possible to not have a bunch of unnecessary readback, but APIs should be well behaved by default. Seems unreasonable to ask every developer who wants to read back data to make a new buffer the right size and copy into it.
JG: think this is just a glBegin(GL_TRIANGLES) difference. Sharper tools, need extra care. This is what client libraries are for.
DM: we can just add the range to mapReadAsync. An argument there won’t make our approach different.
JG: it requires subrange tracking doesn’t it?
CW: only buffer-local. Only need to remember what it’s for.
JG: have to care whether it’s written by other BindGroups.
CW: if it’s in the MAP_READ state you can’t use it for other operations. First call to mapReadAsync is ownership transfer.
JG: still seems to me mapRead/Write are the same. Thought the point of mapReadAsync was to add a subrange. Not sure what the use case of splitting them is.
CW: marginally more understandable. Easier to understand what’s happening.
JG: do you consider that compelling?
MM: in map (no read/write), user will have to pass in whether for reading/writing somewhere
CW: it’s in the usage. The buffer’s either map read or write, but not both. Latest proposal only has one.
KR: thought mapWriteAsync didn’t take range. Can write subranges.
CW: correct. mapRead would have to transfer the entire buffer.
JG: think I’ll want to write another proposal.
JG: let’s move on and do PR burndown.
CW: please comment on Github issue.
MM: would you like me to propose in the Github issue about adding arguments to mapReadAsync?
CW: yes please. Want all comments in same place. Let’s keep discussion rolling, think we can resolve this offline.

PR burndown

Add definitions about copy commands with textures #623
- CW: think spec editors’ meeting can resolve this
- JG: feels we use this section about whether to merge things. We need to leave more than 5 mins per call for this.
- CW: for PRs that don’t change validation a lot, I think it’s fine to merge it without going through the concall.
- KR: agree - the way this has been done in WebGL is if concerns are voiced on a PR, then we discuss, otherwise merge. Scales better.
- CW: sounds fine. Then all 3 of these PRs can basically be merged because they’ve been approved left and right.
- JG: would have liked to talk about callbacks per promise.
- CW: sorry - I missed that one. #626 . Let’s discuss next week.

Agenda for next meeting

More on queries
Hopefully, have path forward for data uploads
#626 and other PRs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU Web 2020 03 23

TL;DR

Tentative agenda

Attendance

Administrative items

Query API #614 and #619

Managing on-chip memory #435

Data upload (#605 is somewhat new)

PR burndown

Agenda for next meeting

Clone this wiki locally