
add and benchmark typed_hvcat(SA, ::Val, ...) #811

Draft · wants to merge 1 commit into base: master

Conversation

simeonschaub
Contributor

to explore the benefits of JuliaLang/julia#36719.

For constructing a 4x4 SMatrix in a loop, as implemented in perf/hvcat_val.jl, I get the following timings on my machine:

julia> include("perf/hvcat_val.jl")
BenchmarkTools.Trial: 
  memory estimate:  69.00 MiB
  allocs estimate:  524288
  --------------
  minimum time:     324.084 ms (0.40% GC)
  median time:      325.538 ms (0.68% GC)
  mean time:        327.568 ms (0.62% GC)
  maximum time:     351.516 ms (0.41% GC)
  --------------
  samples:          16
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  49.00 MiB
  allocs estimate:  327680
  --------------
  minimum time:     87.512 ms (0.98% GC)
  median time:      88.072 ms (1.22% GC)
  mean time:        89.065 ms (1.31% GC)
  maximum time:     113.838 ms (0.91% GC)
  --------------
  samples:          57
  evals/sample:     1
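
A quick way to check whether the rows tuple is being constant-folded for a given size is to look at the allocations of a single construction. The following is a sketch (the helper name is illustrative, not part of perf/hvcat_val.jl):

```julia
using StaticArrays

# Build a 4x4 SMatrix via the SA hvcat syntax; if the rows tuple
# (4, 4, 4, 4) is constant-propagated, this should not allocate.
build(i) = SA[i i i i; i i i i; i i i i; i i i i]

build(1)  # warm up so compilation isn't counted
allocs = @allocated build(2)
```

If allocs is nonzero here, the hvcat path itself is allocating rather than the surrounding loop machinery.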

@c42f
Member

c42f commented Jul 20, 2020

Interesting, but I'd expect constant propagation to do the same here in most circumstances. Specifically, the rows should always be a constant literal tuple because it comes from interpreting syntax and the compiler can constant propagate it.

In fact I thought I tested this exact thing in the original SA PR, or I wouldn't have merged it! Perhaps it fails at larger matrix sizes?
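
For reference, the rows tuple in question comes straight from lowering: the bracket syntax turns into a typed_hvcat call whose rows tuple is a compile-time literal. This can be checked directly, and works without evaluating SA, since lowering is purely syntactic:

```julia
# Inspect how the hvcat syntax lowers; the rows tuple (2, 2) appears
# as a literal argument to Base.typed_hvcat, so the compiler can in
# principle constant-propagate it:
Meta.@lower SA[1 2; 3 4]
# lowers to (roughly): Base.typed_hvcat(SA, (2, 2), 1, 2, 3, 4)
```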

Anyway, consider the following more concrete version of your test case for 3x3, which gives 0 allocations on julia-1.4, and where the non-SA version performs exactly the same:

julia> function foo(x1,x2,x3,x4,x5,x6,x7,x8,x9)
           r = SA[0 0 0; 0 0 0; 0 0 0]
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9)
               r += SA[i1 i2 i3; i4 i5 i6; i7 i8 i9]
           end
           r
       end
foo (generic function with 2 methods)

julia> function bar(x1,x2,x3,x4,x5,x6,x7,x8,x9)
           r = SMatrix{3,3}((0,0,0, 0,0,0, 0,0,0))
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9)
               r += SMatrix{3,3}((i1,i4,i7, i2,i5,i8, i3,i6,i9))
           end
           r
       end
bar (generic function with 1 method)

julia> @btime foo(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  695.789 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Int64,2,9} with indices SOneTo(3)×SOneTo(3):
 768  768  768
 768  768  768
 768  768  768

julia> @btime bar(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  695.946 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Int64,2,9} with indices SOneTo(3)×SOneTo(3):
 768  768  768
 768  768  768
 768  768  768

Can you reproduce this? How does this reconcile with the numbers you're getting?

@simeonschaub
Contributor Author

Yes, I should probably have described my methodology better. I see the same result as you for 3x3, but for sizes larger than 3x4 I do see these improvements. I benchmarked various sizes in this gist: https://gist.github.com/simeonschaub/fb6eff0d212f8514ecec186a69712a82. (The (4, 2) case is really weird: I checked that both functions produce the same @code_native, and the difference goes away if I change the order in which I call the two functions. My best guess is that this is due to an invalid use of @pure in the current implementation, which would probably also be an argument against relying too much on @pure here.)
I do think 4x4 is still a reasonable size to use this constructor for, so this would still be a worthwhile improvement, but I was actually surprised by how well constant prop worked for the smaller sizes.

@c42f
Member

c42f commented Jul 21, 2020

Right, 4x4 being slower makes some sense.

I think we should figure out what's going on here and whether some minor rearrangement (eg careful use of @inline or @generated in the right place) could encourage the compiler to produce better code.

As pointed out by Jeff in JuliaLang/julia#36719 it would be preferable to avoid making things harder for the compiler and I feel like this is a good case for relying on constant propagation. Let's see :)

@c42f
Member

c42f commented Jul 21, 2020

A suspicious thing here is that both of your perf/hvcat_val.jl examples still allocate, even though they really shouldn't.

I suspect that the thing that's hardest on the compiler here is the use of Iterators.product with so many arguments, rather than SA per se. Yes, they interact in a somewhat bad way, but @code_typed doesn't show any type instability. Avoiding Iterators.product improves the performance markedly without needing to remove SA:

julia> function foo(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
           r = SA[0 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 0]
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
               r += SA[i1 i2 i3 i4; i5 i6 i7 i8; i9 i10 i11 i12; i13 i14 i15 i16]
           end
           r
       end
foo (generic function with 2 methods)

julia> function bar(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
           r = SMatrix{4,4}((0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0))
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
               r += SMatrix{4,4}((i1, i2, i3, i4, i5,i6,i7,i8, i9,i10,i11,i12, i13,i14,i15,i16))
           end
           r
       end
bar (generic function with 1 method)

julia> function baz(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
           r = SA[0 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 0]
           for i1=x1, i2=x2, i3=x3, i4=x4, i5=x5, i6=x6, i7=x7, i8=x8, i9=x9, i10=x10, i11=x11, i12=x12, i13=x13, i14=x14, i15=x15, i16=x16
               r += SA[i1 i2 i3 i4; i5 i6 i7 i8; i9 i10 i11 i12; i13 i14 i15 i16]
           end
           r
       end
baz (generic function with 1 method)

julia> @btime foo(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  491.958 ms (524288 allocations: 69.00 MiB)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304

julia> @btime bar(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  209.902 ms (655360 allocations: 94.00 MiB)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304

julia> @btime baz(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  143.984 μs (0 allocations: 0 bytes)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304

@simeonschaub
Contributor Author

Oh, interesting! Do you agree that the use of @pure is also problematic here? One thing I still want to try is something like

@inline function Base.typed_hvcat(::Type{SA}, alt::T, i...) where {T<:Tuple}
    Base.typed_hvcat(SA, Val{alt}(), i...)
end

I had pretty good success with something similar to this, calling a specialized generated function in CoolTensors.jl. (Forcing specialization on T seems to be important here.)
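
The idea, roughly, is that once the row layout lives in the type domain, a generated function can unroll the construction entirely at compile time. A hypothetical sketch of that technique (the function name and details are illustrative, not this PR's actual implementation):

```julia
using StaticArrays

# Hypothetical sketch: dispatch on Val{rows} so the row layout is known
# at compile time, then emit the column-major element tuple directly.
@generated function typed_hvcat_val(::Type{SA}, ::Val{rows}, xs...) where {rows}
    nr = length(rows)           # number of rows
    nc = rows[1]                # columns per row
    all(==(nc), rows) || return :(throw(ArgumentError("all rows must have equal length")))
    # xs arrives in row-major order; SMatrix wants a column-major tuple,
    # so element (r, c) is xs[(r - 1) * nc + c].
    elems = [:(xs[$((r - 1) * nc + c)]) for c in 1:nc for r in 1:nr]
    :(SMatrix{$nr,$nc}(tuple($(elems...))))
end

typed_hvcat_val(SA, Val((2, 2)), 1, 2, 3, 4)  # == SA[1 2; 3 4]
```

With the layout in the type domain, the unrolled tuple construction is fully determined at compile time, which is essentially what JuliaLang/julia#36719 would let lowering provide automatically.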

@c42f
Member

c42f commented Jul 21, 2020

Do you agree, that the use of @pure is also problematic here?

Maybe, but I'm not sure why. Arguably it might be safer if I didn't use any inside it (or getindex, or != ?!), but it would be some pretty severe type piracy for someone to change the definition of any of those methods for integers and tuples.

forcing T to be specialized on seems to be important here

I didn't think this was meant to affect this case... but perhaps it does! I'd be interested if it makes a difference.
