Performance on ARM (RBPi4) #974

Open
jofemodo opened this issue Aug 20, 2021 · 6 comments
@jofemodo
Contributor

Hi sfizz developers!

This is @jofemodo from the zynthian project: https://zynthian.org/
I'm integrating sfizz into the zynthian stack using an improved version of the sfizz_jack client, and it works quite nicely.
Congratulations on your excellent work!

Regarding performance, I must say I'm not getting the performance I would like. When playing hard, I get XRuns once the number of active voices grows beyond 45-50. Of course, this limit is lower when effects are added to the audio chain. I've been tweaking the settings (preload size, etc.), but I always get XRuns before reaching 64 active voices, so we reduced the maximum number of voices to 40, which is a good compromise that leaves some room for adding effects.

My questions:

  • Do you think there is room for improving performance on the ARM architecture? Any tips? ;-)

  • I'm using a good SD card, but do you think the bottleneck could be in the disk-read subsystem? (It really smells more like a CPU usage issue.)

  • Currently we are compiling for 32 bits. Could we expect a noticeable improvement on performance by migrating to 64 bits?

All the best,

@paulfd
Member

paulfd commented Aug 23, 2021

Hi, good to hear! The JACK client is really more of a proof of concept but if you want to upstream your improvement I'd be very open. Although it could also be a separate project entirely.

There are probably many ways to improve performance on ARM. I had some plans to do so, although real life is catching up a bit at the moment. There may be some low-hanging fruit, but none comes to mind right now. The rest would be intricate work around the interpolation and rendering. For example, I got some improvements by reordering and vectorizing the panning process so that it avoids mixing float and integer operations. ARM appears to be sensitive to this kind of thing, but it's tricky work.

To check whether it's CPU-bound or SD-card-bound: if you have a smallish library or a lot of RAM, you can add hint_ram_based=1 in a <control> block. This should load all samples into RAM and avoid reading from disk. Also, within your JACK client you may use a linear interpolator instead of the default one, using this API or the equivalent C one.
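
To illustrate why a linear interpolator is the cheap option, here is a minimal standalone sketch of two-point interpolation at a fractional read position. It is not sfizz's code and does not show the sfizz API call mentioned above; the names readLinear and resampleLinear are hypothetical and exist only for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of sfizz): reads `source` at a fractional
// position using two-point (linear) interpolation. Positions outside the
// buffer are clamped to the first or last sample.
inline float readLinear(const std::vector<float>& source, double position)
{
    if (source.empty())
        return 0.0f;
    if (position <= 0.0)
        return source.front();
    const std::size_t last = source.size() - 1;
    if (position >= static_cast<double>(last))
        return source.back();

    const std::size_t index = static_cast<std::size_t>(position);                  // integer part
    const float frac = static_cast<float>(position - static_cast<double>(index));  // fractional part
    // Only two samples and one multiply-add are needed per output frame.
    return source[index] + frac * (source[index + 1] - source[index]);
}

// Hypothetical helper: steps through `source` at `ratio` source frames per
// output frame and returns `numFrames` interpolated values.
inline std::vector<float> resampleLinear(const std::vector<float>& source,
                                         double ratio, std::size_t numFrames)
{
    std::vector<float> out(numFrames);
    for (std::size_t i = 0; i < numFrames; ++i)
        out[i] = readLinear(source, static_cast<double>(i) * ratio);
    return out;
}
```

Higher-order interpolators read more neighbouring samples per output frame, so the per-voice cost grows with the interpolation quality; linear interpolation trades some audio quality for that lower cost.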

If you compile for 32 bits with the proper vector instructions, going to 64 bits is probably not going to be a massive change, but @jpcima knows more about ARM than I do.

@jpcima
Collaborator

jpcima commented Aug 23, 2021

Currently we are compiling for 32 bits. Could we expect a noticeable improvement on performance by migrating to 64 bits?

It has the benefit of doubling the number of registers, which has potential for speed improvement, yes.
This needs measurement.
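
As a rough idea of what such a measurement could look like, here is a minimal timing sketch using std::chrono. It is not sfizz's benchmark setup; nanosecondsPerCall and applyGain are hypothetical names, and the gain loop merely stands in for whatever kernel is being compared between a 32-bit and a 64-bit build.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `kernel` `iterations` times and returns the
// average wall-clock time per call in nanoseconds.
template <class Kernel>
double nanosecondsPerCall(Kernel&& kernel, std::size_t iterations)
{
    if (iterations == 0)
        return 0.0;

    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        kernel();
    const auto stop = Clock::now();

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}

// Placeholder kernel for the example: scales a buffer in place.
inline void applyGain(std::vector<float>& buffer, float gain)
{
    for (float& sample : buffer)
        sample *= gain;
}
```

A call such as `nanosecondsPerCall([&] { applyGain(buffer, 0.5f); }, 100000)` gives one number per build to compare. An optimizing compiler may remove work whose result is never used, so the kernel's output should be consumed somewhere, and several runs are needed before drawing conclusions.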

As of now, there are SSE parts of the code that have not been converted to SIMDe, so they lack basic vectorization.

The panning code was once identified as a bottleneck by @paulfd.
Maybe the lerp vectorization trick already present in WindowedSinc can be applied to any tabulated function, including pan; we've not benchmarked that one on ARM yet.
Alternatively, the pan can be computed with the sqrt function, which is an instruction present on both Intel and ARM processors (but costly; once again I didn't run the comparison, and I have no idea how it performs on ARM).
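
The two options can be put side by side in a small sketch. It assumes a square-root constant-power pan law (left = sqrt(1 - pan), right = sqrt(pan), with pan in [0, 1]); whether that matches sfizz's actual pan curve is an assumption, and the names panSqrt, panTabulated, and kPanTableSize are hypothetical.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

struct PanGains {
    float left;
    float right;
};

// Direct computation: one sqrt per channel. pan = 0 is hard left, 1 is hard right.
inline PanGains panSqrt(float pan)
{
    pan = std::fmin(std::fmax(pan, 0.0f), 1.0f);
    return { std::sqrt(1.0f - pan), std::sqrt(pan) };
}

constexpr std::size_t kPanTableSize = 256; // hypothetical resolution

// Table of sqrt(x) sampled at kPanTableSize + 1 points over [0, 1], built once.
inline const std::array<float, kPanTableSize + 1>& panTable()
{
    static const std::array<float, kPanTableSize + 1> table = [] {
        std::array<float, kPanTableSize + 1> t{};
        for (std::size_t i = 0; i <= kPanTableSize; ++i)
            t[i] = std::sqrt(static_cast<float>(i) / static_cast<float>(kPanTableSize));
        return t;
    }();
    return table;
}

// Table lookup with linear interpolation (lerp) between neighbouring entries.
inline float lookupLerp(float x)
{
    const auto& t = panTable();
    const float scaled = x * static_cast<float>(kPanTableSize);
    std::size_t index = static_cast<std::size_t>(scaled);
    if (index >= kPanTableSize)
        index = kPanTableSize - 1; // keep index + 1 inside the table
    const float frac = scaled - static_cast<float>(index);
    return t[index] + frac * (t[index + 1] - t[index]);
}

// Tabulated variant of the same pan law.
inline PanGains panTabulated(float pan)
{
    pan = std::fmin(std::fmax(pan, 0.0f), 1.0f);
    return { lookupLerp(1.0f - pan), lookupLerp(pan) };
}
```

Because sqrt is steep near zero, the interpolated table is least accurate close to the hard-left and hard-right positions; a larger table or a non-uniform grid would reduce that error. Which variant is faster on a given ARM core is exactly the comparison that has not been run.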

@jofemodo are you able to do benchmarking on a variety of ARM machines?

@jofemodo
Contributor Author

@jofemodo are you able to do benchmarking on a variety of ARM machines?

Hi @jpcima,
I only develop and test on the RBPi, so I could do benchmarking on that platform only.

Regards,

@jofemodo
Contributor Author

Hi @paulfd !

The JACK client is really more of a proof of concept but if you want to upstream your improvement I'd be very open. Although it could also be a separate project entirely.

I guessed as much from its simplicity, which is good for my goal of expanding it ;-)
I sent a PR with the small improvements I added:

  • Some new CLI options
  • An internal command line for loading instruments and setting engine options. It could be extended with commands for loading Scala files, etc.

Regards,

@ephemer

ephemer commented Sep 8, 2021

I would be curious to hear of any further improvements possible for ARM machines. We plan to deploy sfizz on a variety of mobile (phone/tablet) devices, almost all of which use ARM, either v7a or v8-a.

From what we can see for our use case, almost all of the sfizz CPU time is spent in the interpolation functions. That may be due to our sample-heavy instruments though – for synthesised instruments this may look quite different.

If there is low-hanging fruit in updating the simde library to vectorise certain operations, this is something I would be interested in looking into. Admittedly, I don't have a huge amount of experience in this area though (I have a lot more experience using slightly higher-level SIMD primitives, e.g. in Swift).

@paulfd paulfd added the improvement Improve on existing functionality label Oct 14, 2021
@paulfd paulfd self-assigned this Oct 14, 2021
@paulfd
Member

paulfd commented Oct 30, 2021

There may very well be low-hanging fruit in the interpolation methods, even on SSE. On ARM you also sometimes pay a cost when switching from float to integer and back to float operations unnecessarily. I think this is less true on ARMv8 though. The interpolation functions themselves should be quite easy to benchmark. The challenge would be if there are slowdowns in the looping code, which is quite complex (basically it's the part of the code that is responsible for finding the sample indices that are going to be interpolated all at once later on).
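
The split described here, where sample indices are found first and interpolated all at once afterwards, can be sketched as two passes. This is a simplified illustration rather than sfizz's looping code: it assumes a single forward loop region, linear interpolation, and `loopEnd < source.size()`, and the names PlayHead, computePlayHead, and interpolateAll are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Result of the first pass: for each output frame, the integer sample index
// and the fractional offset to the next sample.
struct PlayHead {
    std::vector<std::size_t> indices;
    std::vector<float> fractions;
};

// Pass 1: advance the read position, wrap it inside [loopStart, loopEnd),
// and do every float-to-integer conversion here, in one place.
inline PlayHead computePlayHead(double start, double increment, std::size_t numFrames,
                                std::size_t loopStart, std::size_t loopEnd)
{
    PlayHead head;
    head.indices.resize(numFrames);
    head.fractions.resize(numFrames);

    const bool hasLoop = loopEnd > loopStart;
    const double loopLength = hasLoop ? static_cast<double>(loopEnd - loopStart) : 0.0;
    double position = start;

    for (std::size_t i = 0; i < numFrames; ++i) {
        while (hasLoop && position >= static_cast<double>(loopEnd))
            position -= loopLength; // wrap back into the loop region
        const std::size_t index = static_cast<std::size_t>(position);
        head.indices[i] = index;
        head.fractions[i] = static_cast<float>(position - static_cast<double>(index));
        position += increment;
    }
    return head;
}

// Pass 2: pure floating-point arithmetic over the precomputed indices.
inline std::vector<float> interpolateAll(const std::vector<float>& source, const PlayHead& head)
{
    std::vector<float> out(head.indices.size(), 0.0f);
    for (std::size_t i = 0; i < out.size(); ++i) {
        const std::size_t index = head.indices[i];
        if (index + 1 >= source.size()) { // guard against reading past the end
            out[i] = index < source.size() ? source[index] : 0.0f;
            continue;
        }
        out[i] = source[index] + head.fractions[i] * (source[index + 1] - source[index]);
    }
    return out;
}
```

Keeping the conversions in the first pass leaves the second pass free of float-to-integer switches, which is the kind of mixing reported as costly on some ARM cores. Whether the split pays off on a particular device would still need to be measured.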

@paulfd paulfd added this to the 1.2.0 milestone Nov 9, 2021
4 participants