-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
QuadSort and BlitSort #95
base: main
Are you sure you want to change the base?
Conversation
Wow, thanks for the huge PR! A PR that triples the length of the codebase, especially with "the disclaimer that I cannot do any maintenance for it, and that I actually have no idea how the algorithm works ;(." is not super appealing. It's also slow, at least on M2, but we can make it faster julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.BlitSort) evals=1
4.790 ms (5 allocs: 160 bytes)
julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.QuadSort) evals=1
2.115 ms (9 allocs: 781.500 KiB)
julia> @b rand(100_000) sort!(_, alg=QuickSort) evals=1
4.473 ms
julia> @b rand(100_000) sort! evals=1
794.255 μs (6 allocs: 789.438 KiB)
julia> versioninfo()
Julia Version 1.11.0-beta2
Commit edb3c92d6a6 (2024-05-29 09:37 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (aarch64-linux-gnu)
CPU: 8 × unknown
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores) But I'm down to at least start a review :). I won't commit to following all the way through. EDITED to update performance figures for 7742143 and because I initially ran the benchmarks with --checkbounds=yes |
src/QuadSort.jl
Outdated
end | ||
end | ||
|
||
function quad_swap_merge!(array, array_index::Int, swap, swap_index::Int, cmp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function quad_swap_merge!(array, array_index::Int, swap, swap_index::Int, cmp) | |
function quad_swap_merge!(array, array_index::Int, swap, swap_index::Int, cmp::T) where T # Force specialization on cpm |
This forces specialization on cmp which, by default, is not specialized on because (I'm guessing) cmp is a function. (see https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing). The parity_merge_two
function calls cmp
so it specializes, hence the call from quad_swap_merge to parity_merge_two uses dynamic dispatch.
julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.QuadSort) evals=1 # Before
4.704 ms (76922 allocs: 2.525 MiB)
shell> git stash pop
On branch blitsort
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: src/QuadSort.jl
no changes added to commit (use "git add" and/or "git commit -a")
Dropped refs/stash@{0} (572b55ea506e7bd86da6622f0ffe56919a6c0115)
julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.QuadSort) evals=1 # Suggestion
3.857 ms (2126 allocs: 830.938 KiB)
I went ahead and forced specialization as I suggested, resulting in this performance improvement and elimination of unexpected allocations caused by dynamic dispatch: julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.QuadSort) evals=1 # Before
4.704 ms (76922 allocs: 2.525 MiB)
julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.QuadSort) evals=1 # After
3.835 ms (9 allocs: 781.500 KiB) # is 9 reasonable? |
Would you share a specific benchmark that this outperforms the default algorithm on? I'm having trouble finding any on my hardware. CI on 1.0 is also failing. |
Oh, I wasn't aware of this avoidance of specialization on functions. First of all, here's my benchmark/test function: function testsort(alg)
for len in 1:2048
data = Vector{Int}(undef, len)
datacopy = similar(data)
datacopy2 = similar(data)
try
timesort = 0.0
timefn = 0.0
for i in 1:1000
Random.rand!(data)
copyto!(datacopy, data)
copyto!(datacopy2, data)
timesort += @elapsed(sort!(data, alg=Base.Sort.QuickSort))
timefn += @elapsed(sort!(datacopy; alg))
if i == 1 # compilation time
timesort = 0.0
timefn = 0.0
end
if !issorted(datacopy)
println("Error:")
println(datacopy2)
return false
end
end
copyto!(data, 1:len)
copyto!(datacopy2, data)
if !issorted(sort!(data; alg))
println("Error: 1:$len")
return false
end
copyto!(data, len:-1:1)
copyto!(datacopy2, data)
if !issorted(sort!(data; alg))
println("Error: $len:-1:1")
return false
end
println("Success: ", len, "; fraction: $(timesort/timefn)")
catch e
println("Error $e:")
println(datacopy2)
rethrow()
end
end
println("Done")
return true
end If I call
And the fraction is QuickSort/BlitSort, so >1 means BlitSort is faster. In order to get a feeling for why the numbers were so bad (first translation was much worse), I ran the profiler on this and had a couple of issues with unexpected GC and dynamic dispatch in Are the allocations reasonable? There should be one allocation for the stack space for sure, which for QuadSort can be of the same size as the data. There is a second possible place where allocation of 32 elements can happen, which is in |
I get similar results as you when running that function: julia> testsort(SortingAlgorithms.BlitSort)
Success: 1; fraction: 0.9235749472202668
Success: 2; fraction: 0.8981071543150461
Success: 3; fraction: 1.0050755059840843
Success: 4; fraction: 1.0173432198064596
Success: 5; fraction: 1.3241362167536646
Success: 6; fraction: 1.3577964971819847
Success: 7; fraction: 1.2400373457211795
Success: 8; fraction: 1.1846993507104564
Success: 9; fraction: 1.1983808259067896
Success: 10; fraction: 1.1529177218600182
Success: 11; fraction: 1.190134309504794
Success: 12; fraction: 1.3672939212836008
Success: 13; fraction: 1.3935159998021849
Success: 14; fraction: 1.4222025940216767
Success: 15; fraction: 1.4175718336989669
Success: 16; fraction: 1.7141359110961103
Success: 17; fraction: 1.5514961901397621
Success: 18; fraction: 1.5403749410847938
Success: 19; fraction: 1.4709996095937898
Success: 20; fraction: 1.4741674595623069
Success: 21; fraction: 1.433004638279142
Success: 22; fraction: 1.6136668211364882
Success: 23; fraction: 1.6504559645593966
Success: 24; fraction: 1.4553823137282706
Success: 25; fraction: 1.4404924221914732
Success: 26; fraction: 1.434619380430659
Success: 27; fraction: 1.4715156592168233
Success: 28; fraction: 1.440334616459845
Success: 29; fraction: 1.7438207438207278
Success: 30; fraction: 1.7900271146088131
Success: 31; fraction: 2.088852368621629
Success: 32; fraction: 0.596349153958278
Success: 33; fraction: 0.4954208074259095
Success: 34; fraction: 0.5252703500710478
Success: 35; fraction: 0.5421151436373185
Success: 40; fraction: 0.1297712291141863
Success: 50; fraction: 0.695014733055055
Success: 60; fraction: 0.7895886547674394
Success: 70; fraction: 0.9010496645524076
Success: 80; fraction: 0.981002307025407
Success: 90; fraction: 0.9951907695867407
Success: 100; fraction: 1.1102504155906174
Success: 110; fraction: 1.1514692985311152
Success: 120; fraction: 1.1801529656451037
Success: 130; fraction: 1.3360731604922897
Success: 131; fraction: 0.8873762152247565
Success: 132; fraction: 1.4329688705185493
Success: 133; fraction: 1.592944409633476
Success: 200; fraction: 1.5365608930619814
Success: 300; fraction: 1.4506622286635045
Success: 400; fraction: 1.4189226111454856
Success: 500; fraction: 1.391537019045629
Success: 600; fraction: 1.3613096186873197
Success: 700; fraction: 1.350556979850107
Success: 1000; fraction: 1.3085079297816213
Success: 2048; fraction: 1.2390204099312676
Done
true However, QuickSort is not Julia's default sorting algorithm. To get the default alg, don't set the alg keyword. Also, that timing fails to account for allocation costs (because the associated free/gc time may not be timed in the same timing block as the allocation) which may cause BlitSort to get a lower reported runtime for very small inputs. When I reproduce this using Chairmarks and the default alg I get this: using Random, Chairmarks
function time_sort(alg)
for len in [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,40,50,60,70,80,90,100,110,120,130,131,132,133,200,300,400,500,600,700,1000,2048]
data = Vector{Int}(undef, len)
datacopy = similar(data)
timesort = @b rand!(data) sort!(copyto!(datacopy, _))
timefn = @b rand!(data) sort!(copyto!(datacopy, _); alg)
println(len, "; fraction: $(timesort.time/timefn.time)")
end
println("Done")
end
julia> time_sort(SortingAlgorithms.BlitSort)
1; fraction: 0.7130380199107237
2; fraction: 0.757731849668227
3; fraction: 0.7504643314968642
4; fraction: 0.744190132990133
5; fraction: 0.767963278031093
6; fraction: 0.7901921243687335
7; fraction: 0.7425666403293549
8; fraction: 0.7544145315619588
9; fraction: 0.7484723964648307
10; fraction: 0.8732520261161361
11; fraction: 0.7556041514041516
12; fraction: 0.5120827794344842
13; fraction: 0.5920690943947947
14; fraction: 0.5852041369615424
15; fraction: 0.6335968129695769
16; fraction: 0.9091960128162979
17; fraction: 0.9383193017143447
18; fraction: 0.9435285898952117
19; fraction: 1.0352756071548779
20; fraction: 1.1237714241617471
21; fraction: 1.3102308810700787
22; fraction: 1.3320647200732052
23; fraction: 1.4775422527857751
24; fraction: 1.1007392691262852
25; fraction: 1.1398564677052336
26; fraction: 1.1830320211528413
27; fraction: 1.1251956019893306
28; fraction: 1.1998741387860647
29; fraction: 1.1351351351351353
30; fraction: 1.1831476640269691
31; fraction: 1.257297307278611
32; fraction: 0.29874599887026926
33; fraction: 0.2510720853323746
34; fraction: 0.24863594318449125
35; fraction: 0.2930601231791964
40; fraction: 0.36425943983431636
50; fraction: 0.30863845474628854
60; fraction: 0.4185269536195455
70; fraction: 0.5346380981223501
80; fraction: 0.7574928909952606
90; fraction: 0.46533213068175006
100; fraction: 0.5420962797826392
110; fraction: 0.5350877192982456
120; fraction: 0.5588690426531101
130; fraction: 0.5397727272727272
131; fraction: 0.5386907472707637
132; fraction: 0.5327724571298785
133; fraction: 0.9128491607453596
200; fraction: 0.9253311781861626
300; fraction: 0.9820513702162775
400; fraction: 0.9683544303797469
500; fraction: 1.2379683410991256
600; fraction: 1.115890953573717
700; fraction: 0.9855251141552512
1000; fraction: 0.7467678253464427
2048; fraction: 0.44970290595116513
Done Which does show that this alg is faster than the default in some cases |
I went through each and every function and checked their disassembly (for
Based on your hint to Chairmarks (which I didn't know), I rewrote my benchmark. The lines that previously contained timesort += @b(copyto!(datacopy, data), sort!(_), evals=1, samples=1).time
timefn += @b(copyto!(datacopy, data), sort!(_; alg), evals=1, samples=1).time My reasoning for this that
As you see, I now also chose Julia's default sorting algorithm. The numbers are slightly worse (in particular for the tiny sizes), but comparable if I compare with QuickSort; however, the default (which should be RadixSort?) turns out to be much better for larger sizes. Well, my reason for going for BlitSort was not its speed (although what is reported elsewhere is still significantly better than what we get here), but that it is an in-place (or say, constant-space) stable algorithm. |
@b rand!(data) sort!(copyto!(datacopy, _))
The variability due to RNG, especially at large sizes, is negligible compared to system noise in every sorting algorithm I've ever bench marked, but if that's a concern to you you can use For large arrays (i.e. it takes more than 30 microseconds to sort them) the benchmarking method doesn't matter much. But for small and tiny arrays using only a single evaluation can result in some strange benchmarking artifacts. The way to establish a ground truth is to embed the two sorting algorithms in a variety of realistic workflows with runtime on the order of seconds and compare the end-to-end runtime of those workflows. |
Base.@assume_effects :nothrow function blit_median_of_cbrt!(array, array_index::Int, swap, swap_index::Int, swap_size::UInt, | ||
nmemb::UInt, cmp::F) where {F} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This effect annotation is invalid in the case of a cmp
function that may throw.
Base.@assume_effects :nothrow function blit_partition!(array, array_index::Int, swap, swap_index::Int, swap_size::UInt, | ||
nmemb::UInt, cmp::F) where {F} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
Bumper.jl? |
A couple of days ago, I (thought I) had need of a stable in-place sorting algorithm and after some searching came up with BlitSort (which, despite its complexity, appears to be not as contrived to implement as GrailSort or WikiSort).
So I took @scandum's code and ported it to Julia. Given that this was a very minor component in what I needed (and I finally ended up not needing it at all), I thought that before the code goes to waste on my drive, I'd contribute it to this package - with the disclaimer that I cannot do any maintenance for it, and that I actually have no idea how the algorithm works ;(.
If you'd still like to include it, let me mention a couple of things and possible points for improvements:
sort!
interface used by Julia(Collections) for this PR. Since the code is a translation of scandum's C, the comparison function is the other way round. Would be nice to change this consistently.ifelse
, and I tried it. However, this proved to be worse than standard branching. I guess that the reason for this is that (a) very generally, conditional moves also introduce dependency chains and whether they are better than branches or not is extremely dependent on the branch prediction and (b) the whole construct is only possible with conditional moves alone if bothptl
andptr
are not held in a register, but point to some memory. Then, we can just conditionally assign the pointer and then work on the pointer alone. As a consequence, I'd have to replace those variables by references wherever they occur. Given that x64 has enough registers to keep them all in the CPU otherwise, this provides some additional penalty.So this led me to remove all the branchless code.
Missing
element. I originally fixed this by replacing everycopyto!
by your_unsafe_copyto!
, which makes the error disappear, but as the issue is triggered at a location wherecopyto!
is used to forward-copy in an overlapping region, this is not safe. Another backwards-working_unsafe_copyto_backwards!
had to be implemented for this. However, I think - I'll look more into this tomorrow - that the error is actually in Julia Base. Whensort!
does the missing-optimization, it shifts all missing elements to the back and then wraps the whole vector into aWithoutMissingVector
. This is accounted for by adjustinghi
in the sorting, because of course the vector still containsmissing
. But thencopyto!
comes and for some reason, as you already reported, copies due to aliasing. But it copies the whole vector, which containsmissing
despite its type - this triggers the assertion. So actually, this should be a bug in Base.Base.Experimental.@optlevel 3
at the beginning of the module?