Skip to content

Minutes 2019 04 08

Corentin Wallez edited this page Jan 4, 2022 · 1 revision

GPU Web** **2019-04-08

Chair: Corentin

Scribe: Ken

Location: Google Meet

TL;DR

  • #233 - signed integer representation
    • Consensus to require two’s complement.
  • #234 - data shall be little-endian
    • Having a single endianness simplifies testing lot
    • Consensus to require little endian since no known GPU is big endian.
  • #230 - float bit width
    • Few hardware has 16f hardware, and min16float in HLSL wasn’t a good idea in retrospect.
    • Consensus to have only 32bit floats for the MVP other float types can be in extensions.
  • #229 - integer bit widths
    • Consensus to go with 32bit, emulating is impossible with races.
  • #232 - Permitted memory orders on control barriers
    • Consensus to have acquire/release barriers only, no relaxed ones.
  • (no issue) (spirv-execution-env#16) - Permitted memory orders on atomic accesses.
    • Consensus to only have atomics that also are memory barriers.
  • #264 - Uniformity analysis.
    • RM had a uniformity analysis in the WHLSL spec but it wasn’t modular: uniformity of function call depended on uniformity of arguments.
    • ANGLE had to fight FXC’s uniformity checks that were similar.
    • Consensus: hope that a modular version of RM’s uniformity typing works for developers

Tentative agenda

  • WebGPU programming model
  • Agenda for next meeting

Attendance

  • Apple
    • Dean Jackson
    • Justin Fan
    • Myles C. Maxfield
    • Robin Morisset
  • Google
    • Austin Eng
    • Corentin Wallez
    • Dan Sinclair
    • David Neto
    • James Darpinian
    • Kai Ninomiya
    • Ken Russell
    • Ryan Harrisson
  • Microsoft
    • Chas Boyd
  • Mozilla
    • Jeff Gilbert
  • Mehmet Oguz Derin

Administrative items

  • CW: DJ merged the bikeshed version of the spec, please rebase pull requests on it.
  • CW: Also will send an email to ask for attendance for the F2F please answer then.

WebGPU programming model

The spreadsheet with all programming model issues

Agenda order gpuweb issue Category Description SPIR-V env issue

1

233 Data Signed integer representation: Propose twos complement

2

234 Data Endianness: Propose little-endian

3

230 Data Float bit widths: only 32bits built into Vulkan

4

229 Data Integer bit widths: only 32bits built into Vulkan

5

232 Memory model Permitted memory orders on control barriers. AcquireRelease because MSL control barriers are not Release-only or Acquire-only. Vulkan has no SequentiallyConsistent

6

Memory model Permitted memory orders on atomic accesses. Only relaxed atomics, because MSL only has relaxed atomics 16

7

264 Uniformity Should WebGPU require an error when it can analyze erroneous non-uniformity? (compiler obliged to throw an error in provably non-uniform cases, where a primitive is miused such as barrier or derivative.)
  • DN: after Jan F2F I made a bunch of issues and a spreadsheet. Link is above. Hopefully should be easy to resolve. Necessary for a good spec. In order of easiest to hardest. Will be obvious when we’re not at consensus.
  • #233 - signed integer representation
    • Whole world using two’s complement. Even C++ is doing this.
    • MM: so if we find a GPU that doesn’t do two’s complement then we have to emulate? Any examples of this?
    • KN: SPIR-V uses two’s complement, so any GPU running Vulkan uses two’s complement?
    • RM: everything I know of is two’s complement. No reason to do anything else.
    • DN: will record this in issue and close it. Has to get written down in any shading language spec we produce.
  • #234 - data shall be little-endian
    • The only one I know about is PPC and you can flip it.
    • CW: any GPU / processor architecture big-endian?
    • JG: there are people running PPC / Talos machines in big-endian mode, because they can. We try to keep Firefox running on such machines. Not such as easy as one’s vs. two’s complement. The web is sort of implicitly little-endian.
    • KR: not really. Typed arrays are host-endian.
    • DN: you run into problems with middleware where you have to transform data.
    • MM: why are TypedArrays host-endian?
    • KR: there were more big-endian architectures when they were designed. e.g. Wii, PowerPC, with web browser that ran WebGL.
    • KR: seems to me that the more important question is to match the endianness of the GPU and CPU in our specs.
    • MM: any GPUs out there that are big-endian?
    • JG: not sure. More of an open question, rather than the two’s complement question.
    • CW: can’t we just look at WebGL? Alias a Uint32Array with a Uint8Array?
    • KR: you can do that.
    • CB: do we care about these DSPs for Neural Nets? Don’t know what their integer complement / byte order is.
    • MM: not sure whether people will use WebGPU for this.
    • CB: Movidius. USB stick with DSP in it. Designed for 8-bit / 16-bit integer math. Focusing on low-power scenario. We expose those cores through DirectCompute / HLSL. Will get to int8 question soon.
    • KR: Sorry for pushing so hard about WebGL’s endianness-agnosticism. If this will move the WebGPU spec forward more easily then let’s settle on little-endian and move forward. This is more restrictive and could plausibly be relaxed later.
    • DN: it simplifies testing a whole lot. Removes a dimension of complexity from your entire pipeline.
    • JG: think it’s likely we won’t run into issues with this, but I don’t think it’s something that could be relaxed later. I’m OK with that.
    • CW: consensus is to go with little-endian.
  • #230 - float bit width
    • DN: 16-bit, 64-bit floats are interesting. Vk 1.0 only required 32-bit data types. Others are optional. Not common to have 64-bit floats on mobile for example. 16-bit floats also not common on mobile.
    • KN: 16-bit float registers not common.
    • DN: scalar half-precision storage is not universal. When I chased down the relevant HLSL discussion between Fil and Myles, it seems Fil wasn’t considering the memory-access issue.
    • CW: can we have a half2 data type?
    • DN: that could be emulated at the compiler level.
    • RM: if it’s not something widely available in the market, I think we should keep it out of the MVP. Would be fine pushing out half-float support to e.g. version 1.1.
    • JG: easy to postpone until after MVP. If we’re going to support real compute on the web we’ll have to support 64-bit floats.
    • KR: probably as an extension.
    • CW: in SPIR-V would be easy to spec an extension for this. Also in WHLSL.
    • DN: can see: nothing below/above 32-bits: then half2; then scalar half; etc.
      • Parallel track of support → double
    • MM: how valuable are these types? In my experience in shaders people don’t use doubles that often. But I don’t have that intuition for smaller types. Chas, do you have any intuition for how often lower-precision floats are used in shaders?
    • CB: they’re moving away from them. Now that we have explicit float16 people are using them, and we’re not doing e.g. 14-bit floats.
    • JG: what Chas was saying is that we’re getting rid of the old low-precision floats and replacing it with 16-bit floats.
    • CW: Intel / NVIDIA’s latest generation has an ALU that can do a single 32-bit op, or two packed 16-bit opts. May be useful at some point.
    • MM: ah, not suggesting minimum-width types. Think that would be a mistake.
    • JG: so, 32-bit floats for MVP.
  • #229 - integer bit widths
    • DN: pretty much same. Vk 1.0 had just 32-bit support. 16-bit ints are roughly in the same place as 16-bit floats, but there’s less interest because 16-bit floats have ML applications. HLSL doesn’t support 8-bit ints yet? Or just compilers? Installed base is not there yet.
    • CB: correct.
    • CW: emulating 16-bit integers is easier than emulating 16-bit floats? Do we think the market share of devices with 16-bit ints is big enough to be useful?
    • DN: agree, as long as you aren’t dealing with memory, and races within a word.
    • Discussion about packed data types, short2, etc.
    • CW: ok with Apple? 32-bit ints only?
    • RM: originally we were hoping to emulate smaller ints, but didn’t realize issue with races. So for MVP would be nice to try to correctly emulate them. Since they’re not widely supported, OK with making them (smaller ints) a later extension.
    • CW: sounds good.
  • #232 - Permitted memory orders on control barriers
    • DN: discussion between Robin and me. Basically in agreement. Barriers are never sequentially consistent. Control barriers: normal thing is acquire / release. GLSL has wrinkle with tess control barriers, but not in WebGPU, so drop it. Robin wanted relaxed barriers so they can interact with fences. That’s a generalization we can do, but since memory models for shaders haven’t been solid until now, we aren’t helping anybody greatly by not having that feature, so better to have just acquire / release.
    • CW: could you give us a 2 sentence breakdown?
    • DN: at control barrier, making sure all threads are meeting. At acquire/release, at release, anything threads were doing is taken into account so mem transactions are completed when you read later. Relaxed barrier would only give you that you executed to there, but not what memory has done. Then you have to ask, what is the barrier for anyway? Fence / control barrier / another fence.
    • RM: programmers might use relaxed instead of acquire/release and not understand the difference.
    • MM: what are the types of barriers that programmers are writing? What are the authorings in human-writeable shaders? Going in the direction of human-writeable languages.
    • DN: if you say barrier in a GLSL compute shader, it’s an acquire/release barrier.
    • RM: I referred to the spec of that one when writing the issue.
    • JG: is the relaxed barrier the release-only, acquire-only barriers?
    • RM: no.
    • JG: I see. it’s a different thing.
    • RM: relaxed is the same as none. No memory ordering.
    • CW: ok, we all agree.
    • Decision: acquire/release barriers only. Not relaxed ones.
  • (no issue) (spirv-execution-env#16) - Permitted memory orders on atomic accesses.
    • DN: these are the other kinds of atomic accesses. These can also perform sync and publish previous memory ops. Prior writes will become visible. Not how it works in GLSL or MSL, today. They’re just relaxed. So when doing atomic op on a word, you’re just ensuring that people operating on that word with an atomic will see the results. Not publishing memory. Just exposing existing behavior.
    • MM: it’s the only thing we can do.
    • DN: yes. SPIR-V was more general about this and have to restrict it. WHLSL was silent on this topic.
    • Agreement on this.
  • #264 - Uniformity analysis.
    • DN: when you do a derivative, only valid when everyone in your quad is doing it too. In F2F discussion, should the WebGPU language have a statically analyzable algorithm to know, is this thing provably in uniform control flow? Or are you more relaxed, and maybe can’t prove it, so either check at runtime or the programmer promises that the code will all get to the same place? There will be places where you can’t statically analyze perfectly. At one point in WHLSL spec Robin had put in analysis of uniform vs. non-uniform values / variables, but was removed, because it was unclear whether it was helpful. I looked into this, agree that there’s a lot of complexity where not sure it serves peoples’ interests given that it’s an imprecise spec. Open for debate. Some corners not complete in that analysis, and could be some distinctions between work group and sub-group uniformity. Could have sub-group uniformity, but have to track different uniformity at work group level.
    • DN: analysis was good, but didn’t have the legs to make it.
    • RM: general idea was basic type inference, where every variable could be uniform or non-uniform. Also control flow could be uniform or non-uniform. Infer smallest set of things that were uniform. Think it could have worked, but was getting really unpleasant. Got very long. There were some nice parts about interactions and control flow. But for example, break inside loop made the whole loop non-uniform. Another example: left values (variables) only work if they’re uniform, so hard to make them non-uniform. Analysis wasn’t modular; had to look within functions. This sort of thing along with the sub-group issue could be accommodated in a somewhat straightforward way but would add more complexity. Too complex for the spec, but would be great if we had some other solution. We think these could be a security threat. Derivative in non-uniform control flow, could this read data from another process?
    • CW: fxc did warn against non-uniform control flow with texture loads in branches, etc. Had a lot of rewrites in ANGLE where we split things into uniform and non-uniform control flows. Think fxc had similar constraints. Could encode those constraints in the API.
    • RM: my fear isn’t supporting too many valid programs, but imposing burdens on the spec implementers.
    • MM: part of the reason we’re doing this uniformity analysis is to prevent undefined behavior. Haven’t seen any docs anywhere that e.g. fxc’s analysis is sufficient to eliminate undefined behvaior.
    • CW: it’s not.
    • MM: OK, less motivation to do this.
    • CW: how about this: spec defines uniform control flow or not. Then if you do things that require derivatives in non uniform control flow, you get 0. Could emit 2 functions, one called inside uniform control flow, and the other for non-uniform.
    • RM: even worse; have to take into account uniformity of every branch as well as every incoming argument.
    • CW: way we’ve done it in ANGLE is: if you call a func that needs to be in unif control flow, we might take the non-unif upon any incoming value. More conservative.
    • RM: if programmer can work within that model, maybe should invest in that.
    • MM: couple of options. Don’t do any checking. Put really complex thing in spec. Make pessimal version of the analysis.
    • CW: not sure how we get data on that. One way would be to change ANGLE translator to be more pessimistic, and see if it changes how shaders are translated into WebGL. But ANGLE doesn’t have uniform-level type system for variables.
    • RM: not sure either.
    • MM: ANGLE’s analysis today is more pessimistic than we wrote up. If authors are OK with that then it’s OK.
    • CW: we’re more pessimistic. It’s shader rewriting and workarounds for fxc. Not really spec compliant.
    • MM: so your goal was to make fxc not fail compiles. For WebGPU we’ll be emitting DXIL.
    • CW: Chrome at least can’t rely on the presence of DXIL drivers.
    • MM: thought those were required by D3D12.
    • CB: de facto they are supported.
    • CW: my understanding is that DXIL support was added after D3D12. Some Windows 10 installations don’t support them.
    • JG: probably OK though.
    • CW: no WebGPU on those?
    • JG: probably a host of people who don’t get WebGPU, so if this is a small subset it’s OK.
    • CW: resolution: hope (3) that Myles stated is OK, having more restricted version of Robin’s analysis will work both for security and developers.
    • RM: I can try writing a simplified analysis based on it, which is modular. Should I also sketch for how it would work with subgroup sync?
    • DN: in your current analysis you have T-values, so you’d have 2 dimensions?
    • MM: 3D. Subgroup always a subset of work group uniformity. (?)
    • RM: I can try doing that. Currently add it in the spec. I know for sure that it’s uniform, or I don’t know. (“non-uniform”). Think 3 values would work.
  • CW: should we have more of these programming model meetings? Are there more issues to go through?
    • DN: large spreadsheet. They get increasingly interesting. These were good to go through. Lots of agreement. Would welcome feedback on what next to work on. Another meeting like this is useful once we’ve had offline thoughts.
    • CW: OK, not scheduling the next one yet. Go through the spreadsheet.
    • MM: was it valuable to have this in the whole group?
    • JG: I found it valuable.
    • KN: agree.
    • KR: agree.
    • CW: we could always say that the exact spec for our memory model work is too complex for most of us.
    • JG: we can also have ad-hoc meetings outside of the regular time.

Agenda for next meeting

  • CW: Google to make some proposal for ImageBitmap in WebGPU
  • getSubData / setSubdata
  • discard_store operation. Very important for perf today.
Clone this wiki locally