
On Inefficiencys #2

Open
Q-Bit64 opened this issue Aug 24, 2020 · 9 comments

Comments

@Q-Bit64

Q-Bit64 commented Aug 24, 2020

Disclaimer: I'm only good at writing fast code, usually C/C++ or asm, not "nice" or "good" code in the sense of readability.
But judging by the documented methods and this "ReSharper" (I don't know what that is, but I assume it helps with readability),
you seem to be good at it / care about it ...

In the File "Open.Numeric.Primes/source/MillerRabin.cs"

So using an unproven algorithm in a prime-testing library is (in my opinion) not a good idea.
The Miller test is correct only "assuming the truth of the generalized Riemann hypothesis (GRH)",
so it might be incorrect. I would recommend AKS, because it also runs in polynomial time and // but it is a little bit slower
"does not rely on unproven assumptions."
https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test
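For what it's worth (this is a well-established fact, though not spelled out in the thread): Miller-Rabin itself can be made fully deterministic for 64-bit inputs without GRH, because fixed witness sets have been verified exhaustively — the first 12 primes are known to be correct witnesses for all n < 3,317,044,064,679,887,385,961,981, which covers every ulong. A hypothetical C sketch (the names mulmod/powmod/is_prime_u64 are mine, not the library's; __int128 is a GCC/Clang extension):

```c
#include <stdint.h>

/* Hypothetical sketch: deterministic Miller-Rabin for uint64_t.
 * The witness set {2..37} is proven sufficient for all 64-bit n,
 * so no GRH assumption is needed. */

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((unsigned __int128)a * b % m); /* no 64-bit wraparound */
}

static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    b %= m;
    while (e) {                       /* iterative square-and-multiply */
        if (e & 1) r = mulmod(r, b, m);
        b = mulmod(b, b, m);
        e >>= 1;
    }
    return r;
}

int is_prime_u64(uint64_t n) {
    static const uint64_t witnesses[] = {2,3,5,7,11,13,17,19,23,29,31,37};
    if (n < 2) return 0;
    for (int i = 0; i < 12; i++)              /* small primes / divisibility */
        if (n % witnesses[i] == 0) return n == witnesses[i];
    uint64_t d = n - 1;
    int s = 0;
    while ((d & 1) == 0) { d >>= 1; s++; }    /* n-1 = d * 2^s, d odd */
    for (int i = 0; i < 12; i++) {
        uint64_t x = powmod(witnesses[i], d, n);
        if (x == 1 || x == n - 1) continue;
        int composite = 1;
        for (int r = 1; r < s; r++) {
            x = mulmod(x, x, n);
            if (x == n - 1) { composite = 0; break; }
        }
        if (composite) return 0;
    }
    return 1;
}
```

This keeps Miller-Rabin's speed while giving the same "no unproven assumptions" guarantee AKS would, at least for the ulong range.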

But since you decided to use it, I will point out a few problems:

Line 45 "
ref readonly var b = ref ar[i]; // I don't know why you would use "ref readonly var" instead of "ulong"
var a = value - 2; // I don't know what that is, that's not part of the algorithm
var now = a > b // again, no clue, that's not part of the algorithm
? Pow(in b, in d, in value) // again, no clue, that's not part of the algorithm
: Pow(in a, in d, in value); // again, no clue, but nice try avoiding branch misses using conditional moves xDD
"
That can be completely replaced by "ulong now = Pow(a, d, value);" // not sure what the "in" keyword means ...
But let me take a look at the Pow function ...

Okay, that looks complicated; it looks like a really cryptic recursive D&C power algorithm.
So that's really slow: each recursion gets written on the stack + call overhead + pipeline flushes + unnecessary overhead.
Write one without recursion (I assume it would be at least 10 times faster). I leave it as an exercise to the reader; I think you
can figure it out. Let's see what "Mul" does ...
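The non-recursive rewrite hinted at above is the textbook iterative square-and-multiply loop. A hedged C sketch (the names are mine, not the library's; mulmod leans on the GCC/Clang __int128 extension so the intermediate product cannot wrap around):

```c
#include <stdint.h>

/* Sketch of iterative square-and-multiply modular exponentiation:
 * no recursion, so no stack frames or call overhead per bit. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((unsigned __int128)a * b % m); /* 128-bit intermediate */
}

uint64_t powmod(uint64_t base, uint64_t exp, uint64_t mod) {
    uint64_t result = 1;
    base %= mod;
    while (exp != 0) {
        if (exp & 1)                     /* low bit set: fold base in */
            result = mulmod(result, base, mod);
        base = mulmod(base, base, mod);  /* square for the next bit */
        exp >>= 1;
    }
    return result;
}
```

The loop walks the exponent's bits from least to most significant, doing one squaring per bit, so it does the same work as the recursive divide-and-conquer version without the call overhead.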

Oh okay, I think I need to explain (simplified) branches.
Before a processor can execute an instruction, it needs to be "decoded" (interpreted).
This step can take a long time, usually longer than the execution of the instruction itself. // on all x86 systems
So the processor starts decoding many instructions ahead of its current instruction, so the next instruction
is finished before the processor executes it. These already-decoded instructions are stored in a pipeline.

So every time your execution path splits up (like at an if statement), the processor "guesses" a path and
continues decoding. If the execution path follows the prediction, everything is fine, but if not, the processor "flushes"
the pipeline and starts decoding the other execution path. That takes a lot of time.
To avoid that, there are simple heuristics for the prediction; for example, loop branches are always
predicted taken, because on, say, 10 iterations, 10/11 predictions will be correct, whereas otherwise you would get 1/11. Very bad.

Okay, that's no good multiplication algorithm, but some of the mistakes are common,
so let's cover them and go for alternatives later.

First of all: it's wrong. It will fail for numbers bigger than 2^63, because an intermediate result
greater than 2^64 will wrap around, so THIS comparison won't catch them. Plus "(now > mod)"
should be "(now >= mod)"; it's the same speed but more reliable, because it reduces the risk of overflow, see above.

So remember this branch stuff? How loop conditions are always predicted taken, and we want to avoid mispredictions?
In lines 72 and 74, the "while (now >= mod) now -= mod;" // I corrected the condition
loops ensure that 0 <= now < mod.

So let's go through one loop iteration:
now <<= 1;
// Now we have 0 <= 2*now < 2*mod.
// As we can see, if we subtract mod only if (now >= mod),
// we get back to 0 <= now < mod without a loop.
// But let's see what the loop does:
while (now >= mod) now -= mod; // corrected again
If now >= mod is false, the processor flushes the pipeline, ouch. // again, loop branches are always predicted taken
If now >= mod is true, the processor predicts the next loop iteration to be taken again, which is wrong EVERY time.
// So the processor flushes the pipeline, ouch.
// Either way, the pipeline gets flushed, BIG OUCH.
Next we have another if statement, which again will be mispredicted some of the time, bad. // a naive approximation: 50%

So before we get to the same bad correction loop again, let's cover the case where the addition ("now += b;") happens.
// Let's assume 0 <= b < mod, so b is already reduced, which it should be; otherwise you should reduce it at the start of the function.
// Now we have 0 <= now + b < mod + b < 2*mod.
// If we subtract mod only if (now >= mod),
// we're again back to 0 <= now < mod, without the while loop, which again will always lead to a pipeline flush.

Okay, now to the solution: cmov, or conditional move.
A conditional move acts like this:
if (Condition)
x = a;
else
x = b;
but without splitting the execution path, so it can't get mispredicted.
Good compilers will notice the code above and generate a cmov instruction, but not the C# compiler (at least in my tests).
So we need to give it a little hint:
x = Condition ? a : b;
This almost always generates a cmov instruction, IF and only IF a and b are already computed,
so no "cheating" like
x = 0 < a ? a : -a; // x = |a|, mathematically speaking
because -a needs to be computed. Again, good compilers may recognize this and do it for you, but last time I checked,
C# didn't. If the compiler doesn't generate a cmov, it generates branches, like above, and they will sometimes be
mispredicted, so in speed-intensive code, like inner loops, avoid them whenever you can.
The optimization of this loop is again left as an exercise to the reader xD. // I love to say that xDD
When you are done with the loop in line 69, you can remove the loop in line 68, because it won't save much time but generates
at least one (possibly even more) branch misprediction, which alone will take more time than 10-20 executions of line 69's loop body.
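The exercise above — eliminating both correction loops — amounts to a shift-and-add modular multiply where each loop body becomes a single conditional subtraction written in the ternary form that coaxes a cmov out of the compiler. A hypothetical C sketch, not the library's actual Mul; it assumes mod < 2^63 so neither the shift nor the add can overflow 64 bits:

```c
#include <stdint.h>

/* Hypothetical sketch (not the library's Mul): shift-and-add modular
 * multiplication with branchless reduction. Every intermediate value is
 * < 2*mod, so one conditional subtraction replaces each while loop.
 * Assumes mod < 2^63 so "a <<= 1" and "now += a" cannot wrap 64 bits. */
uint64_t mulmod_branchless(uint64_t a, uint64_t b, uint64_t mod) {
    uint64_t now = 0;
    a %= mod;
    b %= mod;
    while (b != 0) {
        if (b & 1) {
            now += a;                            /* now < 2*mod */
            now = now >= mod ? now - mod : now;  /* cmov-friendly reduction */
        }
        a <<= 1;                                 /* a < 2*mod */
        a = a >= mod ? a - mod : a;
        b >>= 1;
    }
    return now;
}
```

The same two-line pattern (add or shift, then a single `x = x >= mod ? x - mod : x;`) would drop straight into the existing C# Mul.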

Also consider "inlining" your function through
using System.Runtime.CompilerServices;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static ulong Mul(ulong a, ulong b, ulong mod) { ... }
This will eliminate the method call overhead and increase code locality, at the (negligible) cost of a larger executable
if the function is very big; for relatively small functions it even reduces the overall size. For more information:
https://en.wikipedia.org/wiki/Inline_expansion

The previously mentioned alternatives are also left as an exercise for the reader,
but I encourage you not to just google it, because I think you can do it yourself.
Let me know what you come up with.

Okay, back to my own work. If you need performance advice, please give me the specific part of the code;
I cannot think myself into a whole (especially multi-file) project and search for all the slow parts myself. Figuring out what needs to
be fast isn't that hard.

PS: I just noticed the
"/* Based on: https://stackoverflow.com/questions/4236673/sample-code-for-fast-primality-testing-in-c-sharp#4236870 */"
comment. This mess of a multiplication and power algorithm isn't your fault, and the incorrectly implemented, unproven prime
testing algorithm wasn't your idea, but it's a good example of why you shouldn't just take everything from google. xDD

@electricessence
Collaborator

electricessence commented Aug 27, 2020

Hello! Just checking in. I'll read in detail tomorrow and respond more.
Always appreciate feedback. Anywhere I can make improvements... it will be done!

Note: when it comes to primality checks, the "Optimized" version (the default) makes choices to select which algorithm will perform better for the size of the number. Smaller numbers will typically use Polynomial; with larger numbers, Miller-Rabin will be used with ulong math to verify primality; and BigInteger will use Miller-Rabin to exclude non-primes and then verify primality with Polynomial.

https://github.com/Open-NET-Libraries/Open.Numeric.Primes/blob/master/source/Optimized.cs
The default use typically gets routed to this instance:
https://github.com/Open-NET-Libraries/Open.Numeric.Primes/blob/master/source/Primes.cs#L125

@electricessence
Collaborator

electricessence commented Aug 27, 2020

In the File "Open.Numeric.Primes/source/MillerRabin.cs"

Providing the option to use it is fine.
Someone either has to explicitly use it, or it has to be a big enough number for the default to attempt it.

@electricessence
Collaborator

electricessence commented Aug 27, 2020

Line 45 "
ref readonly var b = ref ar[i]; // I don't know why you would use "ref readonly var" instead of "ulong"

I'm attempting to use this feature of C#'s ReadOnlySpan<ulong> to avoid creating copies of the numbers contained.
This may be a moot attempt at performance, as it may or may not help given the choice between copying a 64-bit number and setting a 64-bit pointer. I was concerned that in a 32-bit application it may provide a benefit, hence why you will see the optional in modifier in places where parameters are 64 bits or greater.

@electricessence
Collaborator

var a = value - 2; // I don't know what that is, that's not part of the algorithm
var now = a > b // again, no clue, that's not part of the algorithm
? Pow(in b, in d, in value) // again, no clue, that's not part of the algorithm
: Pow(in a, in d, in value); // again, no clue, but nice try avoiding branch misses using conditional moves xDD

This is done instead of Math.Min. The result is the same, but this way it can leverage the in version of the Pow method.

@electricessence
Collaborator

electricessence commented Aug 27, 2020

in is a newer keyword in C# that provides a couple of benefits:

  1. The number cannot be modified by the called function, so even though it's passed by reference, the caller can assume the number did not change. (There might be some compiler optimizations around this, but I can't say.)
  2. For larger structs and value types, instead of making a 'copy' of the value for the function to consume, the reference is passed instead. This has a dramatic effect on BigInteger use and other structs. Maybe not so much for 64-bit values, potentially detrimental for 32-bit values (and doubles), and useless for reference types.

@electricessence
Collaborator

electricessence commented Aug 27, 2020

Okay, that looks complicated; it looks like a really cryptic recursive D&C power algorithm.
So that's really slow: each recursion gets written on the stack + call overhead + pipeline flushes + unnecessary overhead.
Write one without recursion (I assume it would be at least 10 times faster).

Since I'm really adapting what someone else wrote, I'm not sure how this could be improved.
I'm trying to make sure the code is clean and easy to read with some nice statics that avoid recursion where possible.
As well as a separation of ulong and BigInteger.
Can you advise what I should change here?

@electricessence
Collaborator

So I'm aware of the wrap-around possibility and try to avoid issues with that where possible.
You'll see in other places, I check to see what numbers will cause the wrap-around and then up-level it to a higher number type if needed. But I may have missed it here.

@electricessence
Collaborator

It would be great if you cloned the repo and made a branch with the suggested changes so I can test.
If you want to focus on one aspect at a time for each change, then I can test them incrementally.
If there's something more of a GCE (gross conceptual error) that I may have made, maybe a different implementation entirely is required?

@electricessence
Collaborator

electricessence commented Aug 27, 2020

Here are the current speed test results I have: (thanks for the trial division help/fix)

Singular Trial Division Tests:
00:00:00.0000744 U32 Benchmark
00:00:00.0000773 U32
00:00:09.7749346 U64 Benchmark
00:00:08.6137033 U64

Batch Tests:
00:00:00.0013936 Polynomial.U32
00:00:00.0014005 Optimized
00:00:00.0122794 Polynomial.U64
00:00:00.0428230 MillerRabin.U64
00:00:00.1061931 TrialDivision.U64.Memoized Cached
00:00:01.4419965 TrialDivision.U32.Memoized
00:00:02.1333997 TrialDivision.U64.Memoized
00:00:02.5321315 TrialDivision.U32
00:00:05.4558333 TrialDivision.U64

If you are correct, then at least one of these numbers should improve.
I could definitely add more tests and do more focus on Miller-Rabin.
