
How do those improvements work? #5

Open
rikimaru0345 opened this issue Jul 31, 2018 · 3 comments

rikimaru0345 commented Jul 31, 2018

Hello,
I was looking through the commits and noticed two commits: 1c452a9 and 429a82a

The first one changed some code; what exactly was changed, and more importantly, how did that improve performance? What was the big bottleneck?
I understand the code (at least somewhat), but the diffs are a bit mangled and hard to follow.

The second one surprises me as well: how were those aggressive-inlining attributes identified as harmful to performance? Did you manually comment them out and re-run the benchmarks? (I doubt it a bit; that'd be a somewhat "blind" approach, no?) Or did you inspect the generated asm and identify some issues there?

Great work on the library! I love it! :)

master131 (Owner) commented Jul 31, 2018

I performed some benchmarks and noticed that the stack allocations were impacting performance. When I used standard arrays, which are allocated on the heap, it performed better.
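As a hedged sketch (hypothetical names, not the library's actual code), the change amounts to replacing a per-call `stackalloc` scratch buffer with an ordinary heap array:

```csharp
using System;

static class BufferSketch
{
    // Before: a fresh stack buffer per call. With .initlocals present,
    // the JIT zero-fills the whole 1 KB buffer on every invocation.
    internal static int MaxCountStack(ReadOnlySpan<byte> data)
    {
        Span<int> counts = stackalloc int[256];
        foreach (byte b in data) counts[b]++;
        int max = 0;
        foreach (int c in counts) if (c > max) max = c;
        return max;
    }

    // After: a standard heap array. The runtime hands back pre-zeroed
    // memory, and in this codebase the heap version benchmarked faster.
    internal static int MaxCountHeap(ReadOnlySpan<byte> data)
    {
        int[] counts = new int[256];
        foreach (byte b in data) counts[b]++;
        int max = 0;
        foreach (int c in counts) if (c > max) max = c;
        return max;
    }
}
```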

As for removing the aggressive inlining: when those functions were inlined, the caller would run out of registers, so heaps of CPU cycles were wasted swapping variables between registers and the stack rather than doing meaningful calculations.

So yes, it was a combination of benchmarks and inspecting the JITed assembly code.
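A minimal sketch of the inlining side, with a hypothetical helper (not the library's actual method): the fix is simply deleting the hint, so the JIT keeps the helper as a call instead of merging its locals into an already register-starved caller.

```csharp
using System.Runtime.CompilerServices;

static class InliningSketch
{
    // With AggressiveInlining, this body (and its temporaries) is pasted
    // into every caller; once the caller runs out of registers, the JIT
    // spills values to the stack, which can cost more cycles than the
    // call it avoided. Removing the attribute restores the plain call.
    // [MethodImpl(MethodImplOptions.AggressiveInlining)]  // removed
    internal static uint Mix(uint a, uint b) => (a * 2654435761u) ^ b;
}
```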

EgorBo (Contributor) commented Dec 23, 2018

@master131 stackallocs can be boosted by up to 30% by disabling zeroing; see https://github.com/dotnet/coreclr/issues/1279
.NET Core uses this technique for some libs, including mscorlib, via an additional linker (ILLINK) step, "ClearLocalsInit" (see my tweet https://twitter.com/EgorBo/status/1066662708504903681).

Also the heap allocations can be cached via ArrayPool ;-)
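A short sketch of the `ArrayPool` suggestion (hypothetical wrapper; assumes the `System.Buffers` package is available): rent a buffer from the shared pool instead of allocating one per call, and return it when done, so steady-state loops allocate almost nothing.

```csharp
using System;
using System.Buffers;

static class PoolSketch
{
    // Rent a scratch buffer from the shared pool; Return hands it back
    // for reuse instead of leaving it for the GC to collect.
    internal static int Sum(ReadOnlySpan<byte> data)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(data.Length);
        try
        {
            data.CopyTo(buffer);  // Rent may return a larger array
            int sum = 0;
            for (int i = 0; i < data.Length; i++) sum += buffer[i];
            return sum;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```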

Great work by the way!
On my machine it's only 1.5x slower than the .NET Core implementation (with the 1.0.6 native lib) for level 11 and a 2 MB PDF file.

master131 (Owner) commented Dec 24, 2018

Thanks for bringing that to my attention @EgorBo. As an experiment, I removed .initlocals and reinstated the stackalloc to see if performance was any better than v0.3.2.

On .NET Framework 4.5:
v0.3.2 had a benchmark time of 1.788s.
The initlocals/stackalloc experiment was worse, with a time of 2.172s.

On .NET Core 2.2:
v0.3.2 had a benchmark time of 1.625s.
The initlocals/stackalloc experiment was better, with a time of 1.594s.

I also checked the JITed assembly to verify that the buffer was not being zero-initialized. For simplicity, I may just leave the allocation as-is (as the performance increase on .NET Core is negligible) and not mess around with micro-optimisations.
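For later readers: from .NET 5 / C# 9 onward, the zeroing knob EgorBo describes is exposed directly as `[SkipLocalsInit]`, with no ILLink step needed. A hedged sketch (the attribute is commented out below because applying it requires compiling with `AllowUnsafeBlocks`):

```csharp
using System;

static class ZeroingSketch
{
    // On .NET 5+, [System.Runtime.CompilerServices.SkipLocalsInit]
    // suppresses the CLR's zero-fill of locals -- the same effect the
    // ILLink "ClearLocalsInit" step achieved in 2018. With it enabled,
    // every byte of the buffer must be written before it is read.
    // [SkipLocalsInit]  // needs <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
    internal static int CopyPrefix(ReadOnlySpan<byte> data)
    {
        Span<byte> window = stackalloc byte[64]; // zero-filled unless SkipLocalsInit
        int n = Math.Min(data.Length, window.Length);
        data.Slice(0, n).CopyTo(window);
        return n;
    }
}
```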
