
Minutes 2021 11 03


GPU Web 2021-11-03

Chair: Jeff Gilbert

Scribe: Myles Maxfield, Kai Ninomiya

Location: Google Meet

Tentative agenda

  • CTS updates
  • Milestone triage (timebox 15m) #2073 (currently triaging V1 issues, start at #537)
  • How to measure the shader compilation time in js #2236 (Jiajia)
  • (socialize last comments) Specify the importExternalTexture of WebCodec VideoFrame #2124 (Kai/Shaobo)
  • Proposal: Add a usage bit to allow creating a view with different format when creating a texture #168 (comment) (Jiawei)
  • texture_2d_depth for loading depth values might be unnecessary? #2094
  • Removing the COPY* texture usages #2196
  • Refactor descriptors for attachment load/store/read-only #2215
  • GPUAdapter.name considered harmful #2191
  • Agenda for next meeting

Attendance

  • Apple
    • Myles C. Maxfield
  • Google
    • Brandon Jones
    • Gregg Tavares
    • Kai Ninomiya
  • Intel
    • Jiajia Qin
    • Jiawei Shao
    • Shaobo Yan
    • Yang Gu
    • Yunchao He
    • Zhaoming Jiang
  • Microsoft
    • Rafael Cintron
  • Mozilla
    • Jeff Gilbert
  • Michael Shannon
  • Mehmet Oguz Derin

CTS updates

  • KN: I have one interesting update. Since we landed the change in the spec to require SecureContext, I made a few changes to make that work. All the HTML files have been renamed to .https.html, so WPT will run them over HTTPS. I’m working on getting this into Chromium’s automated testing. That change has landed.
  • KN: There has been lots of work on the shader tests. Thanks to the contributors
  • MM: Found three or four interfaces not marked as SecureContext. Will submit PR about it.
  • MM: Reason I found these is we’ve started posting WebGPU patches in WebKit
    • KN: Great news!

Milestone triage (timebox 15m) #2073 (currently triaging V1 issues, start at #537)

    1. Metal validation layers require sourceBytesPerImage >= 512
    • KN: It’s been a long time since people have looked at this.
    • MM: We need to determine the answer for V1
    2. Copying of depth24plus formats
    • KN: I don’t remember where this is at.
    • KN: This should be “Needs Specification” in V1.
    3. Investigation: Sampling Depth Textures
    • KN: It’s possible this is resolved.
    • KN: “Needs Specification” in V1.
    4. Binding to zero-sized range of a buffer
    • KN: This needs to be spec’ed.
    5. Error conventions for extensions
    • KN: Yep. Needs specification.
    • I’ll do a lot of spec work in the coming weeks. Hopefully we can wrap up these old things.
    6. List of issues to discuss with TC39, WASM, JS engine teams
    • MS: A lot of these issues belong to KN
    • KN: This is a tracking issue. Leaving in V1.
    7. Add a combined image sampler as per Vulkan
    • KN: Not considering now.
    • MM: Let’s close it. That’s the right signal to send to the world
    • JG: Let’s close it and see who responds!
  • Copies of depth values outside [0, 1]
    • KN: Needs spec. V1.
  • Pipeline statistics clarifications
    • KN: We don’t know if we will actually have pipeline statistics
    • MM: It’s in the spec right now
    • KN: Needs investigation. V1.
    • MM: I don’t think we need to move these out of the spec
    • KN: We weren’t sure if we wanted to ship them. They’re only useful for developers
    • MS: It was decided it’s an optional extension
    • MM: As long as there’s a way to quantize the results, pipeline statistics should be fine to ship as an optional feature
  • Next time: After #794

How to measure the shader compilation time in js #2236 (Jiajia)

  • JQ: I want to ask a question. Is createComputePipeline or createRenderPipeline blocking? Will it block the main thread until the underlying graphics API work is finished?
  • KN: No.
  • JQ: Should it be blocking or non-blocking? Will it change later?
  • KN: It won’t change. We don’t do blocking operations. It would require blocking on the GPU process, which takes a lot of time that JavaScript could use for other things.
  • JQ: That’s good. My assumption is correct. We had a discussion about this issue with Corentin. I think he said he wanted to change this API to make it blocking.
  • JQ: 2 issues: 1. Clarify the spec to say that these are non-blocking operations
  • JQ: 2. We want to collect the shader compilation time. It’s not easy to get the time. I think that Chrome is adding about:tracing events to get us the time, so we can do profiling. We want the spec to have an interface to directly get the compilation time so we can integrate these APIs into tf.js, so we can do automated testing.
  • JQ: Since it’s a non-blocking API, we are thinking for the synchronous version we can do some optimization - use multithreading to parallelize shader/pipeline creation.
  • MM: I think you can use ErrorScopes for this?
  • JQ: Waiting for createShaderModule won’t capture time in create*Pipeline.
  • JG: Feel in some way your only way to be sure everything is created is to … do something with the pipeline. Once you get the result you know it’s ready. The spec allows a lot of room for us to hide how long it takes, especially in the successful case. (We might do things as late as draw time for final setup)
  • KN: There’s some value in specifically exposing timing information. The error scopes can return as soon as they’re done with validation. In the case of pipelines, the driver can always reject the pipeline, so we can’t return early in practice, I think. In principle, error scopes are not a great way of asking for timing information: 1. They’re only waiting for validation to be finished, not for other work to be done, and 2. They can get tied up with the timing of other errors. It’s reasonable to try to have some timing API for timing pipeline creation. Knowing when the pipeline is finished creating won’t tell you how long the pipeline took to be created - if there are multiple things in flight, you could wait before the pipeline creation started. Which would be more useful to you? Would you like to know when it finished, or how long it took to create? (AKA: Do you care about scheduling [overhead]?)
  • JQ: We want to know how long the driver took in the underlying API to create the underlying object, because that includes the shader compilation time. I want to know how long the compileShader-equivalent API call takes.
  • MM: What is the goal/use case? Are you trying to use this information to detect performance regressions e.g. in Intel driver?
  • JQ: We want to analyze which shader takes too long to compile. We can adjust the shader. It’s for tf.js - which also has webgl backend, we optimized some operators, and the warmup time became very long. Later, we found the root cause is the new shader is too long. We adjusted the shader to make it shorter, and the shader compilation time became shorter. If we can have this kind of time, we’ll know which shaders to optimize. We’ll also know, if we make a change, we can detect regressions.
  • Our users complain about the startup time.
  • JQ: Yes. We need to know the time distribution.
  • KN: It sounds like you want to know the actual time spent compiling and creating pipeline, to detect regressions. You can already detect this by reading the first result.
  • KN: We might want to specifically give time information about shader and pipeline compilation. That way we can account for scheduling overhead. If you issue many shader compilations at once, we can tell you how long the compilation itself took, rather than the cross-process-boundary overhead.
  • YH: Regarding compile time, can we do this in the native code? For GLES, we can query the compile status and link status, and for D3D, Vulkan, Metal, can you do that?
  • JG: You do it yourself in D3D
  • KN: In those APIs, I think shader compilation is a blocking operation - use a native thread if you want to parallelize. If that is the case, (which makes sense in native), then, we can just time it on the CPU thread, and return that result.
  • MM: Not really convinced that we need a new API here. I think JG’s suggestion is good enough and is reliable.
  • KN: You could avoid timing overhead by making sure all previous work is complete. But that doesn’t work in a real application.
  • MM: Not actually sure if multithreaded is a problem. If you compile 1 million shaders, and it takes 10s, that’s what you want to know.
  • KN: If you issue 5 compilations and you’re on a 4 core system, one shader will look like it took 2x longer because it waited.
  • MM: If trying to determine user pain, I think you want the overhead included, since that’s the measurement of what the users actually feel.
  • KN: That doesn’t help detect regressions. That doesn’t help detect variations of timing in a shader. You’re at the mercy of the scheduler.
  • JG: This is a macro- vs micro-benchmark problem. When you put it all together, the result of all compilations is different than the sum of each individual compilation. Microbenchmarks give you an approximation of what’s happening, but they’re less reliable. Macrobenchmarks are more faithful, but <missed>
  • JG: If you want a macro benchmark, you can do it already. For a microbenchmark, you can write code that does this. It just doesn’t work on a user’s machine. But you can do it in regression testing.
  • JQ: Our benchmark is more like a real application case. All our shader compilation is NOT at the beginning. It’s at runtime. We dispatch and maybe create another pipeline, dispatch, a lot of dispatches, then later submit. I don’t know, in this workflow, how we can get the exact compilation time.
  • JG: You can measure how long it took you that time. (A rough timing sketch appears at the end of this section.)
  • JQ: Currently we run a whole module, which has dozens of shaders. Should we reduce each to one operator and test each alone?
  • JG: Yes, would be one way. There are two things you want to know, one is when things get slow overall. I think what you’re asking for is to know when a particular pipeline creation took 4 seconds so you know that’s the one you need to fix.
  • JQ: Would like to run a benchmark and collect all this info including pipeline info.
  • MM: Would like to know if a user will ever run one operator. If not, it’s more useful to know that, on the user’s machine, this whole batch of operators took too long. Can send telemetry that this took too long. Developer can [investigate details?]
  • YG: We want to allow tf.js developers to have the data breakdown. We don’t want to involve Intel’s servers. We hope there’s some mechanism so we can bring the API to the tf.js developers, so they can break down the data by themselves. We can provide a profiling model for them, so they can profile. WebGPU has an async API here, and everything can be addressed. tf.js uses a sync framework. We couldn’t switch to an async API very easily.
  • JG: Let’s move on.
  • KN: Thanks for the discussion, this was helpful. We can continue discussing this later. It will require more thought.
  • YG: Thank you.
  • JQ: We want multithreaded compilation
  • JG: multithreaded compilation is already possible
  • JQ: Please add a note to the spec
  • MM: Sounds good
  • KN: There might be some tweaks to error scopes which can make it easier. But, in general, yes, this is an implementation concern, not a specification concern.
  • JG: Is there anything else?
  • <silence>
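
As a rough illustration of the app-side measurement JG and KN describe above (timing the creation yourself rather than relying on a new API), the sketch below times createRenderPipelineAsync with performance.now(). The device, WGSL entry points, and target format are illustrative assumptions, not from the minutes; as noted above, the elapsed time includes scheduling and cross-process overhead, not just driver compilation.

```ts
// Rough sketch, not from the minutes: time how long a pipeline takes to become
// ready. `device` is an assumed GPUDevice; `code` is an assumed WGSL string
// with vs_main/fs_main entry points.
async function timePipelineCreation(device: GPUDevice, code: string): Promise<number> {
  const module = device.createShaderModule({ code });

  const start = performance.now();
  // createRenderPipelineAsync resolves once the pipeline is ready to use, so
  // the measured time includes driver compilation plus any scheduling and
  // cross-process overhead (the caveat discussed above).
  await device.createRenderPipelineAsync({
    layout: 'auto',
    vertex: { module, entryPoint: 'vs_main' },
    fragment: { module, entryPoint: 'fs_main', targets: [{ format: 'bgra8unorm' }] },
  });
  return performance.now() - start;
}
```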

(socialize last comments) Specify the importExternalTexture of WebCodec VideoFrame #2124 (Kai/Shaobo)

  • SY: Thanks Kai for the investigation. Can you describe your proposal?
  • KN: We can’t have an in-depth discussion today - people haven’t had time to review. The questions are the lifetime of GPUExternalTextures, and the lifetime of video frames - both the implicit frames from video elements and VideoFrames from WebCodecs.
  • KN: Question 1: how the lifetime of imported videos is managed. Right now we destroy the GPUExternalTexture quickly. If an application wants to use a frame multiple times (maybe a 24fps video on a 60fps screen), it will have to reuse frames, so applications would have to import multiple times. Implementations will want to add caching to prevent there being too much overhead. Or, we can make it explicit and tie the lifetime to the actual video frame. If it’s explicit, it could stall the decoder if the application holds onto frames for too long. If we make it implicitly tied to the actual video playback, I have concerns about portable timing. But those may be resolvable. There are many options here. It’s difficult to determine which ones will work out best. (A sketch of the per-frame re-import pattern appears at the end of this section.)
  • KN: Question 2: lifetime of video frames. When you import from WebCodecs - video frames already have a behavior where, if you hold on to them for too long, they can starve the decoder pool of decoding resources. So we don’t worry about that too much here. The lifetime semantics are just details, in my opinion. I think we can design that one to fit nicely with however <video> works.
  • KN: Does anyone have specific questions?
  • JG: I feel gun-shy talking about VideoFrame, which is part of WebCodecs. We’re not necessarily in consensus on WebCodecs in general. It’s difficult to discuss when we haven’t agreed on the full direction in general.
  • KN: That is a concern. We would like to have the interop point defined, if possible. I don’t know to what degree the design of WebCodecs is stable (or is-or-will-be agreed upon). We can at least come to an informal consensus on, assuming the presence of WebCodecs, this is how it would work. We can implement it behind that. And not put it in the spec (yet) if we’re not ready to commit.
  • JG: This sounds like an extra spec thing for now. This is part of the proposal to have WebCodecs. It’s hard to promote anything in it into the core API because we don’t have implementation experience. We don’t want to sign up for delivering something that we eventually won’t want to deliver. It’s scary to say “if we had the prerequisites, this is how we would want to do it”.
  • JG: There is an interesting question about lifetimes of video frames with regards to <video>. We can discuss that in the bug, and give people more time to talk about it.
  • KN: With question A answered, question B becomes simpler. We do have an answer to question A right now in the spec. We may want to change it though. I’m currently in favor of what’s in the spec right now.
  • Please take a look and comment
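
As context for the lifetime discussion above, here is a minimal sketch of the per-frame re-import pattern KN describes, where the application imports an external texture again for every draw instead of caching it. The device, video element, and loop structure are illustrative assumptions, not part of the proposal itself.

```ts
// Rough sketch of per-frame re-import. `device` is an assumed GPUDevice and
// `video` an assumed HTMLVideoElement; binding and drawing are elided.
function drawLoop(device: GPUDevice, video: HTMLVideoElement) {
  function frame() {
    // Under the current spec behavior, the external texture is only valid
    // briefly, so it is re-imported for every draw rather than cached.
    const externalTexture: GPUExternalTexture =
      device.importExternalTexture({ source: video });

    // ... bind `externalTexture` in a bind group, record and submit the draw ...

    requestAnimationFrame(frame);
  }
  requestAnimationFrame(frame);
}
```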

Proposal: Add a usage bit to allow creating a view with different format when creating a texture #168 (comment) (Jiawei)

  • JS: I agree with everyone’s opinions. It’s not that critical. We don’t have a good use case, and can discuss later.
  • JS: It’s weird that we have to force the texture format on the texture view to be equal to the texture’s format
  • MM: Agree we should defer
  • KN: The API doesn’t require you to specify a format when you create the view. It’s optional, and defaults to the original format. Right now, it doesn’t do anything since it has to match (or be omitted).
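
A minimal sketch of KN’s point about the optional view format, assuming a valid GPUDevice named `device` (illustrative, not from the minutes):

```ts
const texture = device.createTexture({
  size: [256, 256],
  format: 'rgba8unorm',
  usage: GPUTextureUsage.TEXTURE_BINDING,
});

// Today these two views are equivalent: the view format may be omitted (it
// defaults to the texture's format), and if given it must match, since there
// is no usage bit for format reinterpretation.
const defaultView = texture.createView();
const explicitView = texture.createView({ format: 'rgba8unorm' });
```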

texture_2d_depth for loading depth values might be unnecessary? #2094

  • MM: Might need DM. I think he has convinced me as to what the solution is but would prefer to wait.
  • MM: Can DM come next week?
    • JG: Will check, think he is still on leave next week. Returning week after.
    • MM: Let’s put it on the schedule for 2 weeks from now.

Removing the COPY* texture usages #2196

  • KN: It doesn’t sound like there’s a whole lot of appetite for this. So, if no one wants to argue for it, it’s probably time to close it.
  • MM: It’s a step in the right direction
  • KN: It’s nice, but it’s not a clear win.
  • MM: What’s the downside?
  • KN: The downside is 1) we’re not passing the information to drivers that they could use to optimize. I think someone found a case where copy usages matter.
  • KN: In the summary of the issue, it looks like Corentin found some cases on Intel and AMD where it makes a difference. It’s unclear how big the difference is. So, if we could agree that all implementations needed to add these usages all the time anyway, this would be easier, but I don’t think we agree on that. So I guess it depends on whether we want to try to find out what the costs are in current drivers, and whether that conclusion will hold in the future.
  • KN: Maybe we could make it easier to specify these usages? But still have them?
  • MM: Compromise solution, worth considering. Generally we need a reason/evidence to include something in the API.
  • KN: We have clear evidence that there is some effect in drivers but we don’t know how much.
  • MM: There’s also important evidence that one of the three APIs does not need these.
  • MM: I don’t think finding an “if” statement in a driver that reads a flag is sufficient. We (the CG) should gather real performance numbers.
  • MM: I’m probably not the best person to investigate this, but I am willing to do it.
  • JG: Think for a first pass of the API it’s OK to have these in. We can take them out later if needed (make the copy usages not required for copy operations).
  • MM: I think that’s backwards. By default, we don’t know if we need them, and if it turns out this new API feature is required for improved performance, then the new flags will be guarded behind a new optional feature
  • KN: Enabling a feature on a device shouldn’t break all of your code. If we added this later I think we would need to do it in a way that enabling it is non-breaking. (A non-breaking way could be to have inverse: “NOT_COPY_SRC” usages or an “exclude these usages please” option.)
  • MM: clarify?
  • KN: Enabling a feature to “now validate these new usages” would break all existing copy operations until you update your resource creation to add the flags (the current usage declarations are sketched below).
  • MM: That [the fact that it would be breaking?] is the point of the feature.
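
For context on the proposal above, this sketch shows how the copy usages are declared today; removing them would let copy operations (e.g. copyTextureToBuffer) work without these bits. The device and the specific format/size are illustrative assumptions.

```ts
const texture = device.createTexture({
  size: [64, 64],
  format: 'rgba8unorm',
  usage:
    GPUTextureUsage.COPY_SRC |       // currently required to be a copy source
    GPUTextureUsage.COPY_DST |       // currently required to be a copy destination
    GPUTextureUsage.TEXTURE_BINDING,
});
```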

Refactor descriptors for attachment load/store/read-only #2215

GPUAdapter.name considered harmful #2191

Agenda for next meeting

  • Triage starting at #794