Replies: 16 comments 23 replies
-
Update: It turns out AEC is broken on Raspberry Pi OS Bullseye 32-bit!
-
@fquirin Mycroft is using the XMOS XVF3510 chip in its SJ201 board. This chip includes AEC as well as VAD, noise suppression, and AGC. I honestly think they could sell these boards alone and people would buy them.
-
Didn't know it was broken, but I gave up on the WebRTC AEC anyway: it works really well in terms of echo removal at low SNR, but then fails once the SNR hits a certain level. If using software AEC I prefer the Speex AEC, as it doesn't hard-cancel like WebRTC but merely attenuates, yet it copes with much higher SNR. Quite a few things seem to be broken on Bullseye, but if you compile the libs yourself they will probably work, as there is likely a version mismatch somewhere. 16kHz is pretty much the norm for speech, since increases in sample rate just increase model parameters and load while very little of the speech we need sits above that, so I wouldn't worry too much about it.
-
PS: SpeechBrain contains various tutorials on beamforming and DOA, and so does PyTorch itself: https://pytorch.org/audio/master/tutorials/mvdr_tutorial.html I always had a problem getting torchaudio running on Arm64 so I never tested the load, but I think that might have changed recently and people have got it running, so those examples should work.
-
@fquirin The SJ201 does everything at 48 kHz (AEC, AGC, etc.) since this is what the XMOS XVF3510 processes natively. In fact, you get two output channels that can be configured with different settings for noise suppression, AEC, AGC, etc. I'm seeing if (on the Mark II) we can get better performance by using one channel for wake word detection, and the other for speech to text.
-
Python is just slow for realtime DSP, and even though there are examples for Torch and SpeechBrain, even with a working torchaudio on Arm64 I am expecting the load to be horrendous.

Then there is a C toolkit with a Python 2 wrapper, but it's file based and I am just lost with C and SWIG when it comes to making it streaming: https://github.com/kkumatani/distant_speech_recognition

Beamforming is not everything though: even with a directional mic, the perpendicular focus still always bleeds to some extent from the sides, and how much is generally set by the number of mics you have. Espressif did a limited-run demonstration for their ESP32-S3, the ESP32-S3-Box, which is an Amazon-certified front end of 2 mics -> AEC -> BSS, and it works amazingly well, but it does contain Espressif blobs, and even there they use a 4-channel ADC for the 2 mics with 1 channel fed back from the DAC as a reference channel to sync on, on a microcontroller.

Audio is just pressure waves in air, and it's much better to visualise it as ripples in water, because really it is not directional: it goes in all directions and rebounds off surfaces of different density. So there is either no software and expensive hardware, or software and no current hardware, and that initial audio processing is essential to meeting the expectations users get from commercial offerings; it's a big problem and always has been.

Then there is post-processing and full DNN blind source separation, where maybe a model like Spleeter could be fed voice-based rather than music-based datasets for 2-stem separation, and GitHub contains many DNN BSS variations. There are so many parameters where the big commercial guys have a huge advantage over the bring-your-own-mic-and-hardware party: every mic and hardware setup has a spectral signature that can change significantly depending on what is used.

I had to let my memory come back to me, but ODAS might be worth another try; my tests were on a Pi 3, for which it is a little load heavy, but a Pi 4 might be a much better platform.
-
I had a look at the SpeechBrain DOA and beamforming examples and checked for speed and whether torchaudio works now.
It runs at approx. 12x realtime (the sample is 10 seconds with 4 channels) on a Pi 400 Arm64, so maybe when chunked and streamed the load and performance is workable; likely less with 2 mics, but then also less accurate. https://colab.research.google.com/drive/1UVoYDUiIrwMpBTghQPbA6rC1mc9IBzi6?usp=sharing You might even reduce load by using ODAS for the DOA, since you can connect to it via a socket, and then use PyTorch/SpeechBrain for the beamforming algorithm. Generalized Eigenvalue beamforming is, I think, a type of semi blind source separation where the near and far outputs of a beamformer can be further cleaned, which can improve speech recognition performance.
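For reference, the core of that SpeechBrain pipeline (roughly what the notebook does) is only a few lines: GCC-PHAT TDOA estimation feeding a delay-and-sum beamformer. This is just a sketch with placeholder file names and an assumed 16 kHz sample rate; `DelaySum` can be swapped for `Mvdr` or `Gev` from the same module.

```python
import torchaudio
from speechbrain.processing.features import STFT, ISTFT
from speechbrain.processing.multi_mic import Covariance, GccPhat, DelaySum

fs = 16000  # assumed sample rate of the recording

# Placeholder file: a multi-channel wav, loaded as [channels, time]
xs, _ = torchaudio.load("4ch_recording.wav")
xs = xs.transpose(0, 1).unsqueeze(0)  # SpeechBrain expects [batch, time, channels]

stft = STFT(sample_rate=fs)
cov = Covariance()
gccphat = GccPhat()
delaysum = DelaySum()
istft = ISTFT(sample_rate=fs)

Xs = stft(xs)             # to the time-frequency domain
XXs = cov(Xs)             # spatial covariance matrices
tdoas = gccphat(XXs)      # time differences of arrival between mics
Ys = delaysum(Xs, tdoas)  # steer and sum towards the dominant source
ys = istft(Ys)            # back to a single-channel waveform

torchaudio.save("beamformed.wav", ys.squeeze(0).transpose(0, 1), fs)
```

Feeding short blocks (say 0.5-1 s) through the same objects is how you would turn this into something streamed, which is where the load question above comes in.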
-
It's always a nightmare working out the orientation and mic order of the hardware, but I think this might be right for DOA.
You could probably try that with different chunk lengths so you get the best trade-off of consistency and load, I guess.
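To make the geometry and mic-order point concrete, here is a rough SRP-PHAT DOA sketch in SpeechBrain. The coordinates are only a guess at a 4-mic linear array with the outer mics about 15 cm apart, and the file name is a placeholder; swap in your real spacing and make sure the row order matches the channel order your capture device delivers.

```python
import torch
import torchaudio
from speechbrain.processing.features import STFT
from speechbrain.processing.multi_mic import Covariance, SrpPhat

fs = 16000  # assumed sample rate

# Guessed geometry for a 4-mic linear array (x, y, z in metres),
# outer mics 15 cm apart; row order must match the capture channel order.
mics = torch.tensor([
    [-0.075, 0.0, 0.0],
    [-0.025, 0.0, 0.0],
    [+0.025, 0.0, 0.0],
    [+0.075, 0.0, 0.0],
])

xs, _ = torchaudio.load("4ch_recording.wav")  # placeholder file, [channels, time]
xs = xs.transpose(0, 1).unsqueeze(0)          # -> [batch, time, channels]

stft = STFT(sample_rate=fs)
cov = Covariance()
srpphat = SrpPhat(mics=mics)

Xs = stft(xs)
XXs = cov(Xs)
doas = srpphat(XXs)      # unit direction vectors per frame
print(doas.mean(dim=1))  # crude average direction over the whole clip
```

Running this over chunks of different lengths rather than the whole file is the easy way to see where consistency versus load lands on a Pi.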
-
PS: I had a look at Spleeter and the results are amazing, but once again the CPU load is huge. PS @synesthesiam: the big data guys don't turn off speech enhancement effects to fit their datasets, they use datasets recorded with those speech enhancement effects applied. For Mycroft, if the SJ201 board is going to be 'their' thing, then all datasets should be processed through it, or at least considerable portions of the dataset should come via the SJ201 board, or they should create large general models and use transfer learning, as the spectral signature of hardware can be thought of as a certain type of accent.
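For anyone wanting to reproduce the Spleeter result quickly, the documented Python API boils down to a couple of lines; the input and output paths below are placeholders, and the pretrained '2stems' model is music-trained, so the voice-focused retraining suggested above would reuse the same interface.

```python
from spleeter.separator import Separator

# Pretrained 2-stem model: splits the input into "vocals" and "accompaniment".
# A speech-trained variant (as suggested above) would be loaded the same way.
separator = Separator('spleeter:2stems')

# Placeholder paths: writes separated/mixture/vocals.wav and
# separated/mixture/accompaniment.wav
separator.separate_to_file('mixture.wav', 'separated/')
```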
-
There is also this; I'm not exactly sure what it is, but it looks promising, so I might try it.
-
Hey @StuartIanNaylor, there is a lot of info to digest, thanks for writing it all down. I will need some time to work through all this but I'll definitely check it out ;-) Just two quick comments.

I've been experimenting with RNNoise as well. On Bullseye I failed to make it work, but I tried again a few days ago on Buster ... and it's actually working insanely well on the RPi 4! O_O The only issue I have is that the VAD filter seems to be too aggressive, cutting a few samples at the start. Another thing is that I'm still not sure if the resulting audio is actually better for ASR systems; to the human ear it seems very clear, but I fear there are subtle artifacts that the ASR engines are not trained on :-(. Besides that, there is obviously a problem with any noise that is actual speech, since the filter is meant to let speech through and cannot decide whether it's your voice or background noise.

Another thing I was wondering is whether beamforming makes any sense at all with the given hardware. I've tried to maximize audio quality by hand using the 4 channels of the Respeaker 4-mic linear array, and to be honest, even with the outer mic distance of about 15cm there is not much of an improvement happening. I'm wondering if Amazon etc. even do actual beamforming, or if they just use the channels for source separation or some fancy ML stuff on their massive cloud servers 🤔
-
PS: if you feed https://github.com/sekiguchi92/SoundSourceSeparation a 4-channel wav, then (once more, being Python) the speed is very slow, but the results are extremely good.
This was one of the earlier algorithms Google was using, I think; they have since moved on to VoiceFilter-Lite, since with their newer smart speakers you set up voice profiles, which is very much how VoiceFilter-Lite works. The example wav I used is here for a quick start: https://drive.google.com/file/d/1KfNXlDYsQU_EI9l88GlI-DDcJlQV5i4r/view?usp=sharing The 3-track separation result is here; you will need to mute channels so you don't hear them all at once. Speaker extraction seems to be where it's at, and a GitHub search for VoiceFilter doesn't turn up much more, apart from the fact that it has a small attention model of a profiled voice so it only runs extraction on sections containing that voice. There is another algorithm I have found which is more straight semi blind source separation; I don't know how it sounds, as the example just uses two random numpy arrays and I never got around to applying it to a near/far 2-track wav.
-
@fquirin Doh! Apologies, I forgot about https://github.com/athena-team/athena-signal, which actually works quite well: it has beamforming and other goodies, is C based but has a Python wrapper, and is pretty good with load. There is also https://github.com/breizhn/DTLN, which I always thought was a bit load heavy, but sanebow did a great job here: https://github.com/SaneBow/PiDTLN And I just noticed someone has combined both: https://github.com/avcodecs/DTLNtfliteC (the name sort of says it all, but it misses that it is also athena-signal and C). I am trying to remember why I didn't initially like athena-signal, maybe it's non-streaming, but I will have a look.
-
@fquirin Also, I forgot to mention hardware: HATs are just about the worst, as it's near impossible to isolate the mics from vibration. What you can do, as telephones did for decades, is use uni-directional microphones that have a rear port which cancels sound hitting the front. The capsules themselves are approx. $1. This can really help with your AEC, as the acoustic cancellation of the mic is additional to your digital AEC, and together they make a great AEC combo.
-
In that sense the 4-mic linear and 6-mic arrays from Seeedstudio are pretty cool because they have the HAT and mic board separated :-).
-
@fquirin Just a thought, as I haven't used it, but PipeWire is becoming the 'new' thing for Linux; its echo cancellation is still WebRTC but a new implementation. Have you tried the AEC of PipeWire?
-
I've been experimenting with acoustic echo cancellation (AEC) and beamforming lately using Pulseaudio.

The Pulseaudio `module-echo-cancel` looks very interesting, and it certainly does remove the output audio very aggressively, but what doesn't seem to work at all is the `aec_args` parameter. Also, the recording quality during audio output strongly depends on your signal-to-noise ratio; if it's too low, your recording is scrambled and unrecognizable :-/. There are some demo scripts in the DIY client repository: https://github.com/SEPIA-Framework/sepia-installation-and-setup/tree/dev/sepia-client-installation/rpi/pulseaudio (a minimal sketch of the module-load call follows below).
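For reference, loading the module itself is a one-liner; here is a rough sketch of how it can be done from Python. The source/sink names are placeholders and the `aec_args` values are just the switches I've been testing; note the inner quotes, which pactl needs so that the whole value reaches the module's own argument parser as one string.

```python
import subprocess

# Sketch: load PulseAudio's module-echo-cancel with the webrtc canceller and
# explicit aec_args. Source/sink names are placeholders; the same arguments can
# also go into /etc/pulse/default.pa as a "load-module" line (without pactl).
aec_args = (
    "analog_gain_control=0 "
    "digital_gain_control=1 "
    "noise_suppression=1 "
    "high_pass_filter=1"
)

subprocess.run([
    "pactl", "load-module", "module-echo-cancel",
    "aec_method=webrtc",
    f'aec_args="{aec_args}"',   # inner quotes keep the value as one module argument
    "source_name=ec_source",
    "sink_name=ec_sink",
], check=True)
```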
Some resources:
@StuartIanNaylor I've seen you've done a lot of experiments with this module for Mycroft (A, B) and Rhasspy. Have you tried the module recently with Raspberry Pi OS Bullseye? Could you maybe share any of your experiences with the `aec_args` parameters and their effects? For me 'noise_suppression', 'analog_gain_control', 'digital_gain_control' and 'high_pass_filter' seem to have no effect at all, and I'm not sure if the effect of 'beamforming' is just imagination 😅 🙈

@synesthesiam I think you've mentioned the new Mycroft mic HAT will use AEC. Is this an integrated codec or a Pulseaudio version?
I've tested several microphones including Respeaker 6mic and 4mic linear, IQAudio Zero and Joy-IT Talking Pi HAT. Next on the list is Respeaker/Waveshare 2mic HAT.
I've also played with RNNoise on the Raspberry Pi 4 and I got it working for a minute ... and then it never worked again O_o. Noise reduction was pretty good but the cut-off was super aggressive and it only worked when you were very close to the microphone.
I'd be very happy if anyone can share their ideas, experiment results and thoughts about it 🙂