Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QuadSort and BlitSort #95

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Conversation

projekter
Copy link

A couple of days ago, I (thought I) had need of a stable in-place sorting algorithm and after some searching came up with BlitSort (which, despite its complexity, appears to be not as contrived to implement as GrailSort or WikiSort).
So I took @scandum's code and ported it to Julia. Given that this was a very minor component in what I needed (and I finally ended up not needing it at all), I thought that before the code goes to waste on my drive, I'd contribute it to this package - with the disclaimer that I cannot do any maintenance for it, and that I actually have no idea how the algorithm works ;(.
If you'd still like to include it, let me mention a couple of things and possible points for improvements:

  • I adopted the code to the sort! interface used by Julia(Collections) for this PR. Since the code is a translation of scandum's C, the comparison function is the other way round. Would be nice to change this consistently.
  • The C code naturally works with pointers to the elements, while I ported everything to the Julia interface, i.e., array indexing. Consequently, the Julia code should also be able to handle more irregular arrays (not unit step, not just integer data types).
  • The C code made use of stack-space arrays. I wrote a quick wrapper for this which I think should be safe even for non-bitstype objects and still go to the stack (well, if not, then not). Basically, I use an uninitialized Tuple (which Julia will automatically zero-initialize for managed types) and wrap around a Vector that doesn't own the memory. After the jobs are finished, I manually undefine the elements in the vector if necessary, which should trigger all the necessary GC so that nothing is left over when the Tuple goes out of scope. However, I did not test it for managed types.
  • The C code contained a couple of branchless comparisons (i.e., use conditional moves instead of jumps). The typical code for this looks like
    *ptd++ = cmp(ptl, ptr) <= 0 ? *ptl++ : *ptr++;
    which Clang compiles to two conditional moves and standard arithmetic. It would be possible to mimick something like this with Julia using ifelse, and I tried it. However, this proved to be worse than standard branching. I guess that the reason for this is that (a) very generally, conditional moves also introduce dependency chains and whether they are better than branches or not is extremely dependent on the branch prediction and (b) the whole construct is only possible with conditional moves alone if both ptl and ptr are not held in a register, but point to some memory. Then, we can just conditionally assign the pointer and then work on the pointer alone. As a consequence, I'd have to replace those variables by references wherever they occur. Given that x64 has enough registers to keep them all in the CPU otherwise, this provides some additional penalty.
    So this led me to remove all the branchless code.
  • Your testsuite initially did not pass, triggering an assertion failure in Base due to a Missing element. I originally fixed this by replacing every copyto! by your _unsafe_copyto!, which makes the error disappear, but as the issue is triggered at a location where copyto! is used to forward-copy in an overlapping region, this is not safe. Another backwards-working _unsafe_copyto_backwards! had to be implemented for this. However, I think - I'll look more into this tomorrow - that the error is actually in Julia Base. When sort! does the missing-optimization, it shifts all missing elements to the back and then wraps the whole vector into a WithoutMissingVector. This is accounted for by adjusting hi in the sorting, because of course the vector still contains missing. But then copyto! comes and for some reason, as you already reported, copies due to aliasing. But it copies the whole vector, which contains missing despite its type - this triggers the assertion. So actually, this should be a bug in Base.
  • Julia makes it very easy to use SIMD; I did not check for this possibility at any place. Perhaps something could be gained at some places.
  • I checked correctness not only with your testsuite, but also with a custom script that in increasing lengths from 1 to 2048 checks sorting for 1000 random vectors each, plus the initially sorted and reverse-sorted one, all with inbounds disabled. So despite the huge complexity and my complete lack of understanding, I'm fairly certain that there are no blatant errors.
  • In my benchmarks, the code now beats QuickSort (as soon as BlitSort sets in, QuadSort is slower starting from length 32, though it eventually, at about 400 elements, beats QuickSort) by a factor of 1.2 - 1.5. This is a bit strange, I would expect far larger savings from @scandum's reports. Profview doesn't show anything suspicious, no GC, no dispatch issues, ...
  • The code is so long that I put it in new files. While this is different from the other sorting algorithms, I think refactoring everything into one file per algorithm would not be the worst of ideas...
  • How about Base.Experimental.@optlevel 3 at the beginning of the module?

@LilithHafner
Copy link
Member

LilithHafner commented Jun 1, 2024

Wow, thanks for the huge PR!

A PR that triples the length of the codebase, especially with "the disclaimer that I cannot do any maintenance for it, and that I actually have no idea how the algorithm works ;(." is not super appealing.

It's also slow, at least on M2, but we can make it faster

julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.BlitSort) evals=1
4.790 ms (5 allocs: 160 bytes)

julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.QuadSort) evals=1
2.115 ms (9 allocs: 781.500 KiB)

julia> @b rand(100_000) sort!(_, alg=QuickSort) evals=1
4.473 ms

julia> @b rand(100_000) sort! evals=1
794.255 μs (6 allocs: 789.438 KiB)

julia> versioninfo()
Julia Version 1.11.0-beta2
Commit edb3c92d6a6 (2024-05-29 09:37 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 8 × unknown
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

But I'm down to at least start a review :). I won't commit to following all the way through.

EDITED to update performance figures for 7742143 and because I initially ran the benchmarks with --checkbounds=yes

src/QuadSort.jl Outdated
end
end

function quad_swap_merge!(array, array_index::Int, swap, swap_index::Int, cmp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function quad_swap_merge!(array, array_index::Int, swap, swap_index::Int, cmp)
function quad_swap_merge!(array, array_index::Int, swap, swap_index::Int, cmp::T) where T # Force specialization on cpm

This forces specialization on cmp which, by default, is not specialized on because (I'm guessing) cmp is a function. (see https://docs.julialang.org/en/v1/manual/performance-tips/#Be-aware-of-when-Julia-avoids-specializing). The parity_merge_two function calls cmp so it specializes, hence the call from quad_swap_merge to parity_merge_two uses dynamic dispatch.

julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.QuadSort) evals=1 # Before
4.704 ms (76922 allocs: 2.525 MiB)

shell> git stash pop
On branch blitsort
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   src/QuadSort.jl

no changes added to commit (use "git add" and/or "git commit -a")
Dropped refs/stash@{0} (572b55ea506e7bd86da6622f0ffe56919a6c0115)

julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.QuadSort) evals=1 # Suggestion
3.857 ms (2126 allocs: 830.938 KiB)

@LilithHafner
Copy link
Member

I went ahead and forced specialization as I suggested, resulting in this performance improvement and elimination of unexpected allocations caused by dynamic dispatch:

julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.QuadSort) evals=1 # Before
4.704 ms (76922 allocs: 2.525 MiB)

julia> @b rand(100_000) sort!(_, alg=SortingAlgorithms.QuadSort) evals=1 # After
3.835 ms (9 allocs: 781.500 KiB) # is 9 reasonable?

@LilithHafner
Copy link
Member

Would you share a specific benchmark that this outperforms the default algorithm on? I'm having trouble finding any on my hardware.

CI on 1.0 is also failing.

@projekter
Copy link
Author

Oh, I wasn't aware of this avoidance of specialization on functions. First of all, here's my benchmark/test function:

function testsort(alg)
    for len in 1:2048
        data = Vector{Int}(undef, len)
        datacopy = similar(data)
        datacopy2 = similar(data)
        try
            timesort = 0.0
            timefn = 0.0
            for i in 1:1000
                Random.rand!(data)
                copyto!(datacopy, data)
                copyto!(datacopy2, data)
                timesort += @elapsed(sort!(data, alg=Base.Sort.QuickSort))
                timefn += @elapsed(sort!(datacopy; alg))
                if i == 1 # compilation time
                    timesort = 0.0
                    timefn = 0.0
                end
                if !issorted(datacopy)
                    println("Error:")
                    println(datacopy2)
                    return false
                end
            end
            copyto!(data, 1:len)
            copyto!(datacopy2, data)
            if !issorted(sort!(data; alg))
                println("Error: 1:$len")
                return false
            end
            copyto!(data, len:-1:1)
            copyto!(datacopy2, data)
            if !issorted(sort!(data; alg))
                println("Error: $len:-1:1")
                return false
            end
            println("Success: ", len, "; fraction: $(timesort/timefn)")
        catch e
            println("Error $e:")
            println(datacopy2)
            rethrow()
        end
    end
    println("Done")
    return true
end

If I call testsort(BlitSort), then an excerpt of the reported looks like this:

Success: 1; fraction: 0.9646017699115047
Success: 2; fraction: 0.8703703703703712
Success: 3; fraction: 1.2357414448669204
Success: 4; fraction: 1.3046594982078903
Success: 5; fraction: 1.2127071823204472
Success: 6; fraction: 1.1697722567287792
Success: 7; fraction: 1.0452898550724645
Success: 8; fraction: 1.0180180180180185
Success: 9; fraction: 1.0477453580901863
Success: 10; fraction: 1.1818181818181828
Success: 11; fraction: 0.6772727272727305
Success: 12; fraction: 1.0500000000000012
Success: 13; fraction: 1.1115421920465562
Success: 14; fraction: 1.0971843778383288
Success: 15; fraction: 1.0909822866344616
Success: 16; fraction: 1.4178082191780828
Success: 17; fraction: 1.3226337448559689
Success: 18; fraction: 1.2893129770992384
Success: 19; fraction: 1.5210084033613478
Success: 20; fraction: 1.2559366754617425
Success: 21; fraction: 1.2644526445264457
Success: 22; fraction: 1.6375137513751346
Success: 23; fraction: 1.475487612018971
Success: 24; fraction: 1.396527451900512
Success: 25; fraction: 1.3679999999999954
Success: 26; fraction: 1.3822903363133237
Success: 27; fraction: 1.177380543141974
Success: 28; fraction: 1.4709163346613563
Success: 29; fraction: 1.505905511811029
Success: 30; fraction: 1.62567736556899
Success: 31; fraction: 1.7831431079894662
Success: 32; fraction: 0.2989858012170453
Success: 33; fraction: 0.2883952963534875
Success: 34; fraction: 0.29129672897196324
Success: 35; fraction: 0.28623376623376706
Success: 40; fraction: 0.30584900798763315
Success: 50; fraction: 0.3488628026412304
Success: 60; fraction: 0.37976384364821253
Success: 70; fraction: 0.45789056304520515
Success: 80; fraction: 0.4902067662342383
Success: 90; fraction: 0.5029883906499414
Success: 100; fraction: 0.5923076923076961
Success: 110; fraction: 0.5938189845474624
Success: 120; fraction: 0.6164005519416484
Success: 130; fraction: 0.7322216772927823
Success: 131; fraction: 0.7119909181211813
Success: 132; fraction: 0.7616764440433192
Success: 133; fraction: 1.5564345334496
Success: 200; fraction: 1.5227312868699066
Success: 300; fraction: 1.3125808312939062
Success: 400; fraction: 1.3490556709186126
Success: 500; fraction: 1.3033762914048312
Success: 600; fraction: 1.2465125469274851
Success: 700; fraction: 1.2810594622844182
Success: 1000; fraction: 1.2483761629555905
Success: 2048; fraction: 1.216062710391902

And the fraction is QuickSort/BlitSort, so >1 means BlitSort is faster.

In order to get a feeling for why the numbers were so bad (first translation was much worse), I ran the profiler on this and had a couple of issues with unexpected GC and dynamic dispatch in tiny_sort! (see the comment there). I could not understand why this was the case, as the disassembly gave no indication whatsoever why there should be any of those two things going on and I was not able to isolate it to a reproducible MWE.
However, the disassembly also never gave any indication that the calls to cmp would not be specialized/inlined (well, if I remember correctly, sometimes @code_native does not show the true disassembly, maybe I ran into one of those cases). But I just added the forced specialization to all functions and could not observe any noticeable effect on the timings. Strange, I'm on Windows 1.10.3.
Regarding CI on 1.0 (which I never used, I started with Julia 1.6 or 1.7), it looks like my stack space trick won't work for more complicated types like unions, as the data layout is "unspecified". While this shouldn't be any problem here - even if the layout is unspecified, as the memory is passed over to Julia, so whatever it internally does, it can do here - the check prevents the memory wrapping. Interestingly, it appears that the very same check is still present in the current versions, however, it doesn't appear to be triggered any more. But since 1.4 it is additionally conditioned on possible inline storage (JuliaLang/julia#33886), so maybe this is the version that makes it possible (and shows that I should be more careful in the stack-check; sometimes, the wrapping still won't work, so I'll have to go on the heap instead).

Are the allocations reasonable? There should be one allocation for the stack space for sure, which for QuadSort can be of the same size as the data. There is a second possible place where allocation of 32 elements can happen, which is in quad_swap!, which in QuadSort is also called at most once (BlitSort might call it more than once). Everything else should be completely allocation-free. I don't know if the preprocessing of the sort-machinery maybe has some further allocations, which could explain the discrepancy (for me, 6 allocations are reported with the same amount of memory).

@LilithHafner
Copy link
Member

I get similar results as you when running that function:

julia> testsort(SortingAlgorithms.BlitSort)
Success: 1; fraction: 0.9235749472202668
Success: 2; fraction: 0.8981071543150461
Success: 3; fraction: 1.0050755059840843
Success: 4; fraction: 1.0173432198064596
Success: 5; fraction: 1.3241362167536646
Success: 6; fraction: 1.3577964971819847
Success: 7; fraction: 1.2400373457211795
Success: 8; fraction: 1.1846993507104564
Success: 9; fraction: 1.1983808259067896
Success: 10; fraction: 1.1529177218600182
Success: 11; fraction: 1.190134309504794
Success: 12; fraction: 1.3672939212836008
Success: 13; fraction: 1.3935159998021849
Success: 14; fraction: 1.4222025940216767
Success: 15; fraction: 1.4175718336989669
Success: 16; fraction: 1.7141359110961103
Success: 17; fraction: 1.5514961901397621
Success: 18; fraction: 1.5403749410847938
Success: 19; fraction: 1.4709996095937898
Success: 20; fraction: 1.4741674595623069
Success: 21; fraction: 1.433004638279142
Success: 22; fraction: 1.6136668211364882
Success: 23; fraction: 1.6504559645593966
Success: 24; fraction: 1.4553823137282706
Success: 25; fraction: 1.4404924221914732
Success: 26; fraction: 1.434619380430659
Success: 27; fraction: 1.4715156592168233
Success: 28; fraction: 1.440334616459845
Success: 29; fraction: 1.7438207438207278
Success: 30; fraction: 1.7900271146088131
Success: 31; fraction: 2.088852368621629
Success: 32; fraction: 0.596349153958278
Success: 33; fraction: 0.4954208074259095
Success: 34; fraction: 0.5252703500710478
Success: 35; fraction: 0.5421151436373185
Success: 40; fraction: 0.1297712291141863
Success: 50; fraction: 0.695014733055055
Success: 60; fraction: 0.7895886547674394
Success: 70; fraction: 0.9010496645524076
Success: 80; fraction: 0.981002307025407
Success: 90; fraction: 0.9951907695867407
Success: 100; fraction: 1.1102504155906174
Success: 110; fraction: 1.1514692985311152
Success: 120; fraction: 1.1801529656451037
Success: 130; fraction: 1.3360731604922897
Success: 131; fraction: 0.8873762152247565
Success: 132; fraction: 1.4329688705185493
Success: 133; fraction: 1.592944409633476
Success: 200; fraction: 1.5365608930619814
Success: 300; fraction: 1.4506622286635045
Success: 400; fraction: 1.4189226111454856
Success: 500; fraction: 1.391537019045629
Success: 600; fraction: 1.3613096186873197
Success: 700; fraction: 1.350556979850107
Success: 1000; fraction: 1.3085079297816213
Success: 2048; fraction: 1.2390204099312676
Done
true

However, QuickSort is not Julia's default sorting algorithm. To get the default alg, don't set the alg keyword. Also, that timing fails to account for allocation costs (because the associated free/gc time may not be timed in the same timing block as the allocation) which may cause BlitSort to get a lower reported runtime for very small inputs.

When I reproduce this using Chairmarks and the default alg I get this:

using Random, Chairmarks
function time_sort(alg)
    for len in [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,40,50,60,70,80,90,100,110,120,130,131,132,133,200,300,400,500,600,700,1000,2048]
        data = Vector{Int}(undef, len)
        datacopy = similar(data)
        timesort = @b rand!(data) sort!(copyto!(datacopy, _))
        timefn = @b rand!(data) sort!(copyto!(datacopy, _); alg)
        println(len, "; fraction: $(timesort.time/timefn.time)")
    end
    println("Done")
end
julia> time_sort(SortingAlgorithms.BlitSort)
1; fraction: 0.7130380199107237
2; fraction: 0.757731849668227
3; fraction: 0.7504643314968642
4; fraction: 0.744190132990133
5; fraction: 0.767963278031093
6; fraction: 0.7901921243687335
7; fraction: 0.7425666403293549
8; fraction: 0.7544145315619588
9; fraction: 0.7484723964648307
10; fraction: 0.8732520261161361
11; fraction: 0.7556041514041516
12; fraction: 0.5120827794344842
13; fraction: 0.5920690943947947
14; fraction: 0.5852041369615424
15; fraction: 0.6335968129695769
16; fraction: 0.9091960128162979
17; fraction: 0.9383193017143447
18; fraction: 0.9435285898952117
19; fraction: 1.0352756071548779
20; fraction: 1.1237714241617471
21; fraction: 1.3102308810700787
22; fraction: 1.3320647200732052
23; fraction: 1.4775422527857751
24; fraction: 1.1007392691262852
25; fraction: 1.1398564677052336
26; fraction: 1.1830320211528413
27; fraction: 1.1251956019893306
28; fraction: 1.1998741387860647
29; fraction: 1.1351351351351353
30; fraction: 1.1831476640269691
31; fraction: 1.257297307278611
32; fraction: 0.29874599887026926
33; fraction: 0.2510720853323746
34; fraction: 0.24863594318449125
35; fraction: 0.2930601231791964
40; fraction: 0.36425943983431636
50; fraction: 0.30863845474628854
60; fraction: 0.4185269536195455
70; fraction: 0.5346380981223501
80; fraction: 0.7574928909952606
90; fraction: 0.46533213068175006
100; fraction: 0.5420962797826392
110; fraction: 0.5350877192982456
120; fraction: 0.5588690426531101
130; fraction: 0.5397727272727272
131; fraction: 0.5386907472707637
132; fraction: 0.5327724571298785
133; fraction: 0.9128491607453596
200; fraction: 0.9253311781861626
300; fraction: 0.9820513702162775
400; fraction: 0.9683544303797469
500; fraction: 1.2379683410991256
600; fraction: 1.115890953573717
700; fraction: 0.9855251141552512
1000; fraction: 0.7467678253464427
2048; fraction: 0.44970290595116513
Done

Which does show that this alg is faster than the default in some cases

@projekter
Copy link
Author

projekter commented Jun 2, 2024

I went through each and every function and checked their disassembly (for Vector{Int}). Apart from finding one type instability in blit_analyze! (which could not have any performance impact, as it would have triggered an exception), I can only find the following calls:

  • calls to the own functions
  • calls to memmove (which I guess could be improved, those could be calls to memcpy instead, but I expect that the C runtime checks for overlapping and then uses the simpler implementation, and this extra check should not be that important)
  • two possible exceptions where the compiler cannot see that they won't happen (blit_median_of_cbrt! - the division cannot be by zero; blit_partition! - the variable max is always assigned to some value when accessed - I don't want to pre-assign it to some value because we don't know the type. So maybe zero() is not implemented). I added @assume_effects on the function for this (which will break compatibility more).
  • one exception due to a type instability that is now fixed (and which was not called before)
  • lots of things going on in the "frontend" functions sort! directly, but those are called only once
  • one call to unsafe_wrap which indeed, through jl_ptr_to_array_1d has a single call to jl_gc_alloc - only for the allocation of the array itself, not for its underlying data. Unfortunately, despite the changes in 1.11 to the array structure, the same point still remains - creating an array will call the GC, even if the memory is managed elsewhere. This is a bit annoying, but AFAIK, there's just no good way to instruct Julia to create a stack-based array that elides the GC.

Based on your hint to Chairmarks (which I didn't know), I rewrote my benchmark. The lines that previously contained @elapsed now look like this:

timesort += @b(copyto!(datacopy, data), sort!(_), evals=1, samples=1).time
timefn += @b(copyto!(datacopy, data), sort!(_; alg), evals=1, samples=1).time

My reasoning for this that

  • the function should be evaluated only once, because the second and all subsequent times, it is applied to the ordered collection, which gives a wrong expression
  • when comparing to a different algorithm, we should always start with the same initial data, so we can also take only one sample and have to sum up all the samples manually.

As you see, I now also chose Julia's default sorting algorithm. The numbers are slightly worse (in particular for the tiny sizes), but comparable if I compare with QuickSort; however, the default (which should be RadixSort?) turns out to be much better for larger sizes. Well, my reason for going for BlitSort was not its speed (although what is reported elsewhere is still significantly better than what we get here), but that it is an in-place (or say, constant-space) stable algorithm.

@LilithHafner
Copy link
Member

the function should be evaluated only once, because the second and all subsequent times, it is applied to the ordered collection, which gives a wrong expression

data is never sorted so that's not an issue with my commendation:

@b rand!(data) sort!(copyto!(datacopy, _))
  • when comparing to a different algorithm, we should always start with the same initial data, so we can also take only one sample and have to sum up all the samples manually.

The variability due to RNG, especially at large sizes, is negligible compared to system noise in every sorting algorithm I've ever bench marked, but if that's a concern to you you can use @b sort!(copyto!(datacopy, data)) instead.

For large arrays (i.e. it takes more than 30 microseconds to sort them) the benchmarking method doesn't matter much. But for small and tiny arrays using only a single evaluation can result in some strange benchmarking artifacts. The way to establish a ground truth is to embed the two sorting algorithms in a variety of realistic workflows with runtime on the order of seconds and compare the end-to-end runtime of those workflows.

Comment on lines +248 to +249
Base.@assume_effects :nothrow function blit_median_of_cbrt!(array, array_index::Int, swap, swap_index::Int, swap_size::UInt,
nmemb::UInt, cmp::F) where {F}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This effect annotation is invalid in the case of a cmp function that may throw.

Comment on lines +385 to +386
Base.@assume_effects :nothrow function blit_partition!(array, array_index::Int, swap, swap_index::Int, swap_size::UInt,
nmemb::UInt, cmp::F) where {F}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@LilithHafner
Copy link
Member

This is a bit annoying, but AFAIK, there's just no good way to instruct Julia to create a stack-based array that elides the GC.

Bumper.jl?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants