Minutes 2021 09 27

GPU Web 2021-09-27

Chair: Jeff Gilbert

Scribe: Kai w/help

Location: Google Meet

Tentative agenda

CTS updates
Milestone triage (timebox 15m) #2073
timestamp-query is unimplementable on TBDR architectures #2046 (alternative proposal in Hao's comment)
TF.js issues (if any remaining to be discussed?) (timebox 15m)
Implement "depth_clamp" on D3D12 #2100
- Replace clampDepth with disableDepthClip #2139
(stretch) Require a render pipeline flag to enable sample-rate shading #2133
(stretch) Which corner is the origin of a copyExternalImageToTexture? #2110
Agenda for next meeting

Attendance

WIP, it is the list of all the people invited to the meeting. In bold the people that have been seen in the meeting:

Apple
- Myles C. Maxfield
Google
- Gregg Tavares
- Kai Ninomiya
- Loko Kung
- Shrek Shao
Intel
- Hao Li
- Jiajia Qin
- Jiawei Shao
- Yang Gu
- Yunchao He
- Zhaoming Jiang
Microsoft
- Damyan Pepper
- Rafael Cintron
Mozilla
- Jeff Gilbert
Michael Shannon
Mehmet Oguz Derin

CTS updates

KN: Nothing big, did write some tests for depth clipping and clamping for investigation. Thanks for review!
- MM: Which tests?
- Issue: https://github.com/gpuweb/gpuweb/issues/2100
- Tests: https://github.com/gpuweb/cts/pull/753
JG: At Mozilla, we’re bringing more CTS tests online for wgpu, still working on attaching the CTS harness to Firefox CI for full-stack testing.

Milestone triage (timebox 15m) #2073

https://github.com/gpuweb/gpuweb/issues?q=is%3Aopen+no%3Amilestone+
#234 shader programming model: use little-endian layouts for scalar data
- V1, needs spec
#214 RFC: Make sure that a minimal-garbage hot-path is supported
- Post-V1, needs proposal
#150 Characterize the performance benefit of WebGPUBufferUsage and WebGPUTextureUsage
- MM: Could close based on all APIs having these usages. But not sure whether Metal has buffer usages offhand. Assign to me.
#39 security model, focused on shaders
- Move to WGSL
#35 Out-of-bounds behavior in shaders
- Close (labeled as investigation); move to WGSL
#9 Ability to issue command create resources from workers
- KN: Will dupe into #354

timestamp-query is unimplementable on TBDR architectures #2046 (alternative proposal in Hao's comment)

HL: Summary of issue
MM: Believe where we left off is whether we should have one or two optional feature flags for timestamp queries.
MM: Related discussion about whether the kind of API in the spec right now (writeTimestamp in calls) should split the pass or not.
- Whether we should split compute passes automatically for timestamps.
- MM: Don’t think there is a behavior difference in splitting render passes, only performance. Have demonstrated here that compute pass performance also suffers from splitting.
- KN: Some observable ordering differences?
- MM: They are still within the ordering constraints of an unsplit render pass.
JG:
MM: On TBDR there is a timing difference between one pass with many draw/dispatch, and many passes with one each. Measuring the performance of the latter doesn’t tell you the performance of the former. The timing result you would get would be very large (due to slowdown) and would not be representative of the actual costs. It would be wrong for an application to take action based on these timing numbers because they’re not representative of the real workload.
JG: How does this proposal do better than that?
JG: On some architectures, those timing results are going to be wildly different, on others they’re going to be close.
MM: In my proposal (don’t allow writeTimestamp in middle of a pass), author can do one of two things:
- Split the pass themselves. They’ll be able to measure the separate performance but won’t be able to see how they perform together.
- Use native code analyzers/debuggers/profilers like the ones in Xcode. Will be give a way more detailed result (parallel execution, hardware occupancy/utilization, etc.)
- Suggest developers use more advanced native profilers. And if you want to measure inside the app you should measure the performance of passes which
HL?: Want to profile timing of every dispatch.
JQ: In TF.js, integrating timing into our automated testing framework. Cannot use native tools because we need to automate it daily.
MM: What is the result of these numbers? If they are really high, what do you in response?
JQ: Check performance …. Collect perf data of ops in various models. Helpful to see if there are perf regressions, if certain ops need to be tuned, etc.
MM: Measuring individual operations is not representative of what users will experience. …
JQ: Often one operator is a bottleneck. For example in one model, one op takes a disproportionate amount of time (scatter, 120 ms -> 20 ms -> under 1ms). So it is very important to be able to profile a single operator to discover if the implementation is not efficient.
JS/JQ: Per-dispatch timestamp reveals timing issues.
JS:
MM: The proposal I’m making doesn’t stop you from measuring individual ops, you can create separate compute passes and measure them.
JG: The information you get by (either implicit or explicit pass splitting) should be the same. You’ll lose performance by splitting passes, but for the purposes of regression testing, you would still be able to do that by splitting your own passes, and also be able to run the whole pass again and measure that.
JG: Does that cover what you want or is there something else essential about writeTimestamp in the middle of a pass?
JQ: Change code to do special processing (...)?
MM: I would measure two things:
- Performance of whole pass: measures what users would see
- Separately, execute with every operator in its own pass, to identify regressions in individual operators.
JQ: (...) Need to change our current code logic to have multiple modes: one to split each op into its own pass, one to have whole pass. Think it doesn’t make very much sense to design code around those modes.
JS: Also may use draw calls sometimes instead of dispatch.
HL: Not sure that splitting the compute passes will actually give the correct profiling data.
JG: On some architectures (TBDR), you either won’t get good data or you won’t get data at all in cases where you use the old Disjoint Timer Query style. Will get a result that says “this timestamp is disjoint and can’t be compared to others”. Can’t offer an efficient--. Same performance problem you have with splitting passes is the same thing we would have to do to support timestamps in the middle of passes for core WebGPU. What we’re hoping to do here is to find some path that provides the most value for measuring things while being able to support all GPUs, and we are worried that we can’t really succeed on TBDR if we allow timestamps in the middle of passes.
JG: May be an opportunity for an extension later, but would only be able to do it by splitting passes and you’ll actually capture the overhead of splitting passes.
MM: I understand why this may be frustrating, understand why you would want functionality (measuring dispatches) supported on e.g. Intel to be available. But for V1 we are focusing on core features across architectures.
JQ: Discussions have been assuming we have only one optional feature for this, why not two? [for different architectures] So browser won’t have to split passes silently, just expose what the underlying API supports.
HL: Discussed moving timestamp queries into another document outside of the WebGPU spec. If we do that we could put two timestamp query features there.
JG: Think we have consensus to support end/begin-pass timestamp queries. For more flexibility, add a new feature/extension. Limitation right now is we are really focused on what we actually need to ship V1 of WebGPU so we are focused on the single extension is the most flexible that we can offer everywhere. After we finish up with sprint to V1 optimistically in 2022 we would investigate making things more expansive.
HL: Can easily move the current timestamp design to another document.
MM: Moving words around is easy, but implementation is not.
HL/JQ/JS: Developing feature on desktop.
MM: Should design a feature that works on desktop and mobile.
JQ: Would like to keep both the current and the per-pass timestamp APIs.
JG: Dawn welcome to continue supporting a “prototype” of mid-pass timestamp queries, but standard would remove it. Reserve time later in this group (after V1) to discuss a platform-specific (desktop/intel) extension/feature in the spec. That is where I feel like we are ending up.
MM: May be sufficient. Starting the browser with a nonstandard flag [for automated regression testing] might be OK.
JS: Is this a final decision of the group?
YG: I can understand the decision. After V1 can discuss more about this. We still will want to revisit it: changing the execution mode of the program (TF.js) will mean we are no longer measuring the performance of what we ship. (...)
JG: Proposal here is that you (Intel) will still be able to do this profiling [the way it is now] by passing a flag to Chromium. Is it enough for your usecase to pass a flag to Chromium? Or do you need it generally available?
YG: Think this will be sufficient.
YG: Desktop very common; would like to discuss including more desktop specific features in the API, even if after V1.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minutes 2021 09 27

GPU Web 2021-09-27

Tentative agenda

Attendance

CTS updates

Milestone triage (timebox 15m) #2073

timestamp-query is unimplementable on TBDR architectures #2046 (alternative proposal in Hao's comment)

TF.js issues (if any remaining to be discussed?) (timebox 15m)

Implement "depth_clamp" on D3D12 #2100

(stretch) Require a render pipeline flag to enable sample-rate shading #2133

(stretch) Which corner is the origin of a copyExternalImageToTexture? #2110

Agenda for next meeting

Clone this wiki locally