
How do those improvements work? #5

Open
rikimaru0345 opened this issue Jul 31, 2018 · 3 comments

rikimaru0345 commented Jul 31, 2018

Hello,
I was looking through the commits and noticed two commits: 1c452a9 and 429a82a

The first one changed some code; what exactly was changed, and more importantly, how did that improve performance? What was the big bottleneck?
I understand the code (at least somewhat), but the diffs are a bit mangled and hard to follow.

The second one surprises me as well: how were those aggressive-inlining attributes identified as harmful to performance? Did you manually comment them out and re-run the benchmarks? (I doubt it a bit; that'd be a somewhat "blind" approach, no?) Or did you inspect the generated asm and identify some issues there?

Great work on the library! I love it! :)

master131 (Owner) commented Jul 31, 2018

I performed some benchmarks and noticed that the stack allocations were impacting performance. When I used standard arrays, which are allocated on the heap, it performed better.
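As a hedged sketch (hypothetical names, not the library's actual code), the change amounts to replacing a per-call `stackalloc` scratch buffer with an ordinary heap array:

```csharp
using System;

static class BufferSketch
{
    // Before: a fresh stack buffer per call. With .initlocals present,
    // the JIT zero-fills the whole 1 KB buffer on every invocation.
    internal static int MaxCountStack(ReadOnlySpan<byte> data)
    {
        Span<int> counts = stackalloc int[256];
        foreach (byte b in data) counts[b]++;
        int max = 0;
        foreach (int c in counts) if (c > max) max = c;
        return max;
    }

    // After: a standard heap array. The runtime hands back pre-zeroed
    // memory, and in this codebase the heap version benchmarked faster.
    internal static int MaxCountHeap(ReadOnlySpan<byte> data)
    {
        int[] counts = new int[256];
        foreach (byte b in data) counts[b]++;
        int max = 0;
        foreach (int c in counts) if (c > max) max = c;
        return max;
    }
}
```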

As for removing the aggressive inlining: when those functions were inlined, the caller would run out of registers, so heaps of CPU cycles were wasted swapping variables between registers and the stack rather than doing meaningful calculations.

So yes, it was a combination of benchmarks and inspecting the JITed assembly code.
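A minimal sketch of the inlining side, with a hypothetical helper (not the library's actual method): the fix is simply deleting the hint, so the JIT keeps the helper as a call instead of merging its locals into an already register-starved caller.

```csharp
using System.Runtime.CompilerServices;

static class InliningSketch
{
    // With AggressiveInlining, this body (and its temporaries) is pasted
    // into every caller; once the caller runs out of registers, the JIT
    // spills values to the stack, which can cost more cycles than the
    // call it avoided. Removing the attribute restores the plain call.
    // [MethodImpl(MethodImplOptions.AggressiveInlining)]  // removed
    internal static uint Mix(uint a, uint b) => (a * 2654435761u) ^ b;
}
```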

EgorBo (Contributor) commented Dec 23, 2018

@master131 stackallocs can be boosted by up to 30% by disabling zeroing; see https://github.com/dotnet/coreclr/issues/1279
.NET Core uses this technique for some libs, including mscorlib, via an additional linker (ILLINK) step, "ClearLocalsInit" (see my tweet https://twitter.com/EgorBo/status/1066662708504903681).

Also the heap allocations can be cached via ArrayPool ;-)
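A short sketch of the `ArrayPool` suggestion (hypothetical wrapper; assumes the `System.Buffers` package is available): rent a buffer from the shared pool instead of allocating one per call, and return it when done, so steady-state loops allocate almost nothing.

```csharp
using System;
using System.Buffers;

static class PoolSketch
{
    // Rent a scratch buffer from the shared pool; Return hands it back
    // for reuse instead of leaving it for the GC to collect.
    internal static int Sum(ReadOnlySpan<byte> data)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(data.Length);
        try
        {
            data.CopyTo(buffer);  // Rent may return a larger array
            int sum = 0;
            for (int i = 0; i < data.Length; i++) sum += buffer[i];
            return sum;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```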

Great work by the way!
On my machine it's only 1.5x slower than the .NET Core implementation (with the 1.0.6 native lib) for level 11 and a 2 MB PDF file.

master131 (Owner) commented Dec 24, 2018

Thanks for bringing that to my attention @EgorBo. As an experiment, I removed .initlocals and reinstated the stackalloc to see if performance was any better than v0.3.2.

On .NET Framework 4.5:
v0.3.2 had a benchmark time of 1.788s.
The initlocals/stackalloc experiment was worse, with a time of 2.172s.

On .NET Core 2.2:
v0.3.2 had a benchmark time of 1.625s.
The initlocals/stackalloc experiment was better, with a time of 1.594s.

I also checked the JITed assembly to verify that the buffer was not being zero-initialized. For simplicity, I may just leave the allocation as-is (as the performance increase on .NET Core is negligible) and not mess around with micro-optimisations.
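For later readers: from .NET 5 / C# 9 onward, the zeroing knob EgorBo describes is exposed directly as `[SkipLocalsInit]`, with no ILLink step needed. A hedged sketch (the attribute is commented out below because applying it requires compiling with `AllowUnsafeBlocks`):

```csharp
using System;

static class ZeroingSketch
{
    // On .NET 5+, [System.Runtime.CompilerServices.SkipLocalsInit]
    // suppresses the CLR's zero-fill of locals -- the same effect the
    // ILLink "ClearLocalsInit" step achieved in 2018. With it enabled,
    // every byte of the buffer must be written before it is read.
    // [SkipLocalsInit]  // needs <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
    internal static int CopyPrefix(ReadOnlySpan<byte> data)
    {
        Span<byte> window = stackalloc byte[64]; // zero-filled unless SkipLocalsInit
        int n = Math.Min(data.Length, window.Length);
        data.Slice(0, n).CopyTo(window);
        return n;
    }
}
```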
