
add and benchmark typed_hvcat(SA, ::Val, ...) #811

Draft · wants to merge 1 commit into base: master

Conversation

simeonschaub
Contributor

to explore the benefits of JuliaLang/julia#36719.

For constructing a 4x4 SMatrix in a loop, as implemented in perf/hvcat_val.jl, I get the following timings on my machine:

julia> include("perf/hvcat_val.jl")
BenchmarkTools.Trial: 
  memory estimate:  69.00 MiB
  allocs estimate:  524288
  --------------
  minimum time:     324.084 ms (0.40% GC)
  median time:      325.538 ms (0.68% GC)
  mean time:        327.568 ms (0.62% GC)
  maximum time:     351.516 ms (0.41% GC)
  --------------
  samples:          16
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  49.00 MiB
  allocs estimate:  327680
  --------------
  minimum time:     87.512 ms (0.98% GC)
  median time:      88.072 ms (1.22% GC)
  mean time:        89.065 ms (1.31% GC)
  maximum time:     113.838 ms (0.91% GC)
  --------------
  samples:          57
  evals/sample:     1
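
A quick way to check whether the rows tuple is being constant-folded for a given size is to look at the allocations of a single construction. The following is a sketch (the helper name is illustrative, not part of perf/hvcat_val.jl):

```julia
using StaticArrays

# Build a 4x4 SMatrix via the SA hvcat syntax; if the rows tuple
# (4, 4, 4, 4) is constant-propagated, this should not allocate.
build(i) = SA[i i i i; i i i i; i i i i; i i i i]

build(1)  # warm up so compilation isn't counted
allocs = @allocated build(2)
```

If allocs is nonzero here, the hvcat path itself is allocating rather than the surrounding loop machinery.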

@c42f
Member

c42f commented Jul 20, 2020

Interesting, but I'd expect constant propagation to do the same here in most circumstances. Specifically, the rows should always be a constant literal tuple because it comes from interpreting syntax and the compiler can constant propagate it.

In fact I thought I tested this exact thing in the original SA PR, or I wouldn't have merged it! Perhaps it fails at larger matrix sizes?
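
For reference, the rows tuple in question comes straight from lowering: the bracket syntax turns into a typed_hvcat call whose rows tuple is a compile-time literal. This can be checked directly, and works without evaluating SA, since lowering is purely syntactic:

```julia
# Inspect how the hvcat syntax lowers; the rows tuple (2, 2) appears
# as a literal argument to Base.typed_hvcat, so the compiler can in
# principle constant-propagate it:
Meta.@lower SA[1 2; 3 4]
# lowers to (roughly): Base.typed_hvcat(SA, (2, 2), 1, 2, 3, 4)
```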

Anyway, consider the following more concrete version of your test case for 3x3, which gives 0 allocations on julia-1.4, and where the non-SA version performs exactly the same:

julia> function foo(x1,x2,x3,x4,x5,x6,x7,x8,x9)
           r = SA[0 0 0; 0 0 0; 0 0 0]
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9)
               r += SA[i1 i2 i3; i4 i5 i6; i7 i8 i9]
           end
           r
       end
foo (generic function with 2 methods)

julia> function bar(x1,x2,x3,x4,x5,x6,x7,x8,x9)
           r = SMatrix{3,3}((0,0,0, 0,0,0, 0,0,0))
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9)
               r += SMatrix{3,3}((i1,i4,i7, i2,i5,i8, i3,i6,i9))
           end
           r
       end
bar (generic function with 1 method)

julia> @btime foo(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  695.789 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Int64,2,9} with indices SOneTo(3)×SOneTo(3):
 768  768  768
 768  768  768
 768  768  768

julia> @btime bar(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  695.946 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Int64,2,9} with indices SOneTo(3)×SOneTo(3):
 768  768  768
 768  768  768
 768  768  768

Can you reproduce this? How does this reconcile with the numbers you're getting?

@simeonschaub
Contributor Author

Yes, I should probably have described my methodology better. I see the same result as you for 3x3, but for sizes larger than 3x4 I do see these improvements. I benchmarked various sizes in this gist: https://gist.github.com/simeonschaub/fb6eff0d212f8514ecec186a69712a82. (The (4, 2) case is really weird: I checked that both functions produce the same @code_native, and the difference goes away if I change the order in which I call the two functions. My best guess is that this is due to an invalid use of @pure in the current implementation, which would probably also be an argument against relying too much on @pure here.)
I do think 4x4 is still a reasonable size to use this constructor for, so this would still be a worthwhile improvement, but I was actually surprised by how well constant prop worked for the smaller sizes.

@c42f
Member

c42f commented Jul 21, 2020

Right, 4x4 being slower makes some sense.

I think we should figure out what's going on here and whether some minor rearrangement (eg careful use of @inline or @generated in the right place) could encourage the compiler to produce better code.

As pointed out by Jeff in JuliaLang/julia#36719 it would be preferable to avoid making things harder for the compiler and I feel like this is a good case for relying on constant propagation. Let's see :)

@c42f
Member

c42f commented Jul 21, 2020

A suspicious thing here is that both of your perf/hvcat_val.jl examples still allocate, even though they really shouldn't.

I suspect that the thing that's hardest on the compiler here is the use of Iterators.product with so many arguments, rather than SA per se. Yes, they interact in a somewhat bad way, but @code_typed doesn't show any type instability. Avoiding Iterators.product improves the performance markedly without needing to remove SA:

julia> function foo(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
           r = SA[0 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 0]
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
               r += SA[i1 i2 i3 i4; i5 i6 i7 i8; i9 i10 i11 i12; i13 i14 i15 i16]
           end
           r
       end
foo (generic function with 2 methods)

julia> function bar(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
           r = SMatrix{4,4}((0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0))
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
               r += SMatrix{4,4}((i1, i2, i3, i4, i5,i6,i7,i8, i9,i10,i11,i12, i13,i14,i15,i16))
           end
           r
       end
bar (generic function with 1 method)

julia> function baz(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
           r = SA[0 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 0]
           for i1=x1, i2=x2, i3=x3, i4=x4, i5=x5, i6=x6, i7=x7, i8=x8, i9=x9, i10=x10, i11=x11, i12=x12, i13=x13, i14=x14, i15=x15, i16=x16
               r += SA[i1 i2 i3 i4; i5 i6 i7 i8; i9 i10 i11 i12; i13 i14 i15 i16]
           end
           r
       end
baz (generic function with 1 method)

julia> @btime foo(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  491.958 ms (524288 allocations: 69.00 MiB)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304

julia> @btime bar(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  209.902 ms (655360 allocations: 94.00 MiB)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304

julia> @btime baz(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  143.984 μs (0 allocations: 0 bytes)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304

@simeonschaub
Contributor Author

Oh, interesting! Do you agree that the use of @pure is also problematic here? One thing I still want to try is something like

@inline function Base.typed_hvcat(::Type{SA}, alt::T, i...) where {T<:Tuple}
    Base.typed_hvcat(SA, Val{alt}(), i...)
end

I had pretty good success with something similar to this, calling a specialized generated function in CoolTensors.jl. (Forcing specialization on T seems to be important here.)
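
The idea, roughly, is that once the row layout lives in the type domain, a generated function can unroll the construction entirely at compile time. A hypothetical sketch of that technique (the function name and details are illustrative, not this PR's actual implementation):

```julia
using StaticArrays

# Hypothetical sketch: dispatch on Val{rows} so the row layout is known
# at compile time, then emit the column-major element tuple directly.
@generated function typed_hvcat_val(::Type{SA}, ::Val{rows}, xs...) where {rows}
    nr = length(rows)           # number of rows
    nc = rows[1]                # columns per row
    all(==(nc), rows) || return :(throw(ArgumentError("all rows must have equal length")))
    # xs arrives in row-major order; SMatrix wants a column-major tuple,
    # so element (r, c) is xs[(r - 1) * nc + c].
    elems = [:(xs[$((r - 1) * nc + c)]) for c in 1:nc for r in 1:nr]
    :(SMatrix{$nr,$nc}(tuple($(elems...))))
end

typed_hvcat_val(SA, Val((2, 2)), 1, 2, 3, 4)  # == SA[1 2; 3 4]
```

With the layout in the type domain, the unrolled tuple construction is fully determined at compile time, which is essentially what JuliaLang/julia#36719 would let lowering provide automatically.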

@c42f
Member

c42f commented Jul 21, 2020

Do you agree, that the use of @pure is also problematic here?

Maybe, but I'm not sure why. Arguably it might be safer if I didn't use any inside it (or getindex, or != ?!), but it would be some pretty severe type piracy for someone to change the definition of any of those methods for integers and tuples.

forcing T to be specialized on seems to be important here

I didn't think this was meant to affect this case... but perhaps it does! I'd be interested if it makes a difference.
