add and benchmark typed_hvcat(SA, ::Val, ...) #811
base: master
Conversation
to explore the benefits of JuliaLang/julia#36719
Interesting, but I'd expect constant propagation to do the same here in most circumstances. In fact, I thought I tested this exact thing originally. Anyway, consider the following less abstract version of your test case for 3x3, which gives 0 allocations on julia-1.4 and for which the non-SA version seems to perform exactly the same:

```julia
julia> function foo(x1,x2,x3,x4,x5,x6,x7,x8,x9)
           r = SA[0 0 0; 0 0 0; 0 0 0]
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9)
               r += SA[i1 i2 i3; i4 i5 i6; i7 i8 i9]
           end
           r
       end
foo (generic function with 2 methods)

julia> function bar(x1,x2,x3,x4,x5,x6,x7,x8,x9)
           r = SMatrix{3,3}((0,0,0, 0,0,0, 0,0,0))
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9)
               r += SMatrix{3,3}((i1,i4,i7, i2,i5,i8, i3,i6,i9))
           end
           r
       end
bar (generic function with 1 method)

julia> @btime foo(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  695.789 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Int64,2,9} with indices SOneTo(3)×SOneTo(3):
 768  768  768
 768  768  768
 768  768  768

julia> @btime bar(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  695.946 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Int64,2,9} with indices SOneTo(3)×SOneTo(3):
 768  768  768
 768  768  768
 768  768  768
```

Can you reproduce this? How does this reconcile with the numbers you're getting?
Yes, I should probably have described my methodology here better. I am seeing the same result as you for 3x3, but for sizes larger than 3x4 I do see these improvements. I benchmarked various sizes in this gist: https://gist.github.com/simeonschaub/fb6eff0d212f8514ecec186a69712a82
Right, 4x4 being slower makes some sense. I think we should figure out what's going on here and whether some minor rearrangement might fix it. As pointed out by Jeff in JuliaLang/julia#36719, it would be preferable to avoid making things harder for the compiler, and I feel like this is a good case for relying on constant propagation. Let's see :)
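(Editorial aside: "relying on constant propagation" here means that when a small function is called with a literal argument, the compiler can often propagate that value as a compile-time constant, so no explicit `Val` is needed at the call site. A minimal illustrative sketch, not taken from this PR:)

```julia
# Illustrative sketch (not from this PR): the return type of `f` depends on
# the *value* of `n`, which is normally a type-inference problem.
f(n) = ntuple(i -> i^2, n)

# Called with a literal, constant propagation lets the compiler infer the
# concrete NTuple{3,Int} without any Val wrapper.
g() = f(3)

g()  # (1, 4, 9)
```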
A suspicious thing here is that both of your perf/hvcat_val.jl examples still allocate, even though they really shouldn't. I've got a suspicion that the thing that's hardest on the compiler here may be the use of `Iterators.product` (compare `baz`, which uses plain nested loops):

```julia
julia> function foo(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
           r = SA[0 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 0]
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
               r += SA[i1 i2 i3 i4; i5 i6 i7 i8; i9 i10 i11 i12; i13 i14 i15 i16]
           end
           r
       end
foo (generic function with 2 methods)

julia> function bar(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
           r = SMatrix{4,4}((0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0))
           for (i1,i2,i3,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16) in Iterators.product(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
               r += SMatrix{4,4}((i1,i2,i3,i4, i5,i6,i7,i8, i9,i10,i11,i12, i13,i14,i15,i16))
           end
           r
       end
bar (generic function with 1 method)

julia> function baz(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16)
           r = SA[0 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 0]
           for i1=x1, i2=x2, i3=x3, i4=x4, i5=x5, i6=x6, i7=x7, i8=x8, i9=x9, i10=x10, i11=x11, i12=x12, i13=x13, i14=x14, i15=x15, i16=x16
               r += SA[i1 i2 i3 i4; i5 i6 i7 i8; i9 i10 i11 i12; i13 i14 i15 i16]
           end
           r
       end
baz (generic function with 1 method)

julia> @btime foo(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  491.958 ms (524288 allocations: 69.00 MiB)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304

julia> @btime bar(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  209.902 ms (655360 allocations: 94.00 MiB)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304

julia> @btime baz(1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2, 1:2)
  143.984 μs (0 allocations: 0 bytes)
4×4 SArray{Tuple{4,4},Int64,2,16} with indices SOneTo(4)×SOneTo(4):
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304
 98304  98304  98304  98304
```
Oh, interesting! Do you agree that it could help to dispatch through a wrapper like this?

```julia
@inline function Base.typed_hvcat(::Type{SA}, alt::T, i...) where {T<:Tuple}
    Base.typed_hvcat(SA, Val{alt}(), i...)
end
```

I had pretty good success using something similar to this to call a specialized generated function in CoolTensors.jl.
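(Editorial aside: the wrapper above lifts a runtime tuple into the type domain via `Val`, so a downstream method can specialize on it as a compile-time constant. A self-contained sketch of that pattern with hypothetical names, not the actual StaticArrays code:)

```julia
# Hypothetical sketch of the Val-wrapping pattern: the @inline outer method
# takes the row layout as a runtime tuple and immediately re-dispatches with
# it wrapped in Val.
@inline rowcount(rows::T) where {T<:Tuple} = rowcount(Val{rows}())

# Here `rows` is a type parameter, i.e. a compile-time constant after
# specialization, so a @generated inner method could unroll on it.
rowcount(::Val{rows}) where {rows} = length(rows)

rowcount((2, 2, 1))  # specializes on Val{(2, 2, 1)}(); returns 3
```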
Maybe, but I'm not sure why? Arguably it might be safer if I didn't use it.
I didn't think this was meant to affect this case... but perhaps it does! I'd be interested if it makes a difference.
to explore the benefits of JuliaLang/julia#36719. For constructing a 4x4 SMatrix in a loop, as shown here in perf/hvcat_val.jl, I get the following timings on my machine: