Just really bad quality in general? #301

Open
Oleg-A-LLIto opened this issue Jul 9, 2023 · 6 comments

Comments

@Oleg-A-LLIto

So, I'm using this to align text with the audio I get from a TTS engine, a pretty good one too (ElevenLabs). To me that sounds like a perfect task: no mic noise, no background sounds, English language, and the volume is really stable. Still, I'm not sure what I'm doing wrong here, but it works extremely poorly, to the point that the result is pretty much unusable. Half the words (by the way, yes, I'm aligning per word) are crushed into a 0-second interval, and the others get overly long intervals scattered around more or less at random. I feel like I would get a much better result by just approximating the mapping with character/vowel count. Given how bad it is, I'm guessing this is not how Aeneas normally behaves, so what could be causing such generally bad performance? I'm not getting any errors, I'm running Windows 11, and I process fairly small chunks of text.
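
To put a number on the "crushed into a 0-second interval" symptom described above, the sync map can be scanned for collapsed fragments. This is a minimal sketch, not part of aeneas itself: it assumes the sync map was written as JSON (`os_task_file_format=json`) with a top-level `fragments` list whose entries carry `begin`, `end`, and `lines`. The function name, the 10 ms threshold, and the sample data are made up for illustration.

```python
import json


def summarize_sync_map(sync_map_json: str, min_duration: float = 0.01) -> dict:
    """Count fragments shorter than min_duration seconds in a JSON sync map.

    Assumes each fragment has string "begin"/"end" times in seconds and a
    "lines" list holding the fragment text (an assumption about the format).
    """
    fragments = json.loads(sync_map_json)["fragments"]
    collapsed = []
    for frag in fragments:
        begin, end = float(frag["begin"]), float(frag["end"])
        if end - begin < min_duration:
            collapsed.append(" ".join(frag.get("lines", [])))
    return {
        "total": len(fragments),
        "collapsed": len(collapsed),
        "collapsed_text": collapsed,
    }


# Made-up sample shaped like a word-level sync map; not real aeneas output.
sample = json.dumps({"fragments": [
    {"id": "f000001", "begin": "0.000", "end": "0.000", "lines": ["So"]},
    {"id": "f000002", "begin": "0.000", "end": "1.840", "lines": ["I'm"]},
    {"id": "f000003", "begin": "1.840", "end": "1.840", "lines": ["using"]},
]})

report = summarize_sync_map(sample)
print(report)
```

A high share of collapsed fragments on clean TTS audio would support the suspicion that something in the setup, rather than the audio, is at fault.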

@jahamed

jahamed commented Jul 26, 2023

Hey man, did you ever find a forced alignment solution that works? I'm also using 11labs and trying to do some forced alignment to fix the subtitles I'm generating.

@Oleg-A-LLIto
Author

@jahamed Yep, pyfoal works like a charm for me. Takes some headache to set up, but it's extremely good, at least with 11labs' output. Glad to save you all that time it took me to get there lmao

@jahamed

jahamed commented Jul 27, 2023

Thanks @Oleg-A-LLIto! Yes, it seems like a pain to set up. I got a decent working example with Gentle Forced Aligner too, which runs more easily on a Mac. Very surprised there isn't an easier and more modern way to get this stuff working (at least in Node).

@jahamed

jahamed commented Jul 28, 2023

> @jahamed Yep, pyfoal works like a charm for me. Takes some headache to set up, but it's extremely good, at least with 11labs' output. Glad to save you all that time it took me to get there lmao

Hey man, I found another good library for this, a lot more modern and easier to use. Alignment is very good. Thought you should know; it's working perfectly for me now.
https://github.com/echogarden-project/echogarden

@smontlouis

For word-level timestamps, you should use whisperX together with aeneas.
Aeneas is very good for forced alignment against a transcript; whisperX is perfect for word timestamps.

Get the aeneas result, transform data for whisperX align model, profit.
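
The "transform data" step above could look roughly like the sketch below. It assumes aeneas was run at sentence or paragraph level with JSON output (a `fragments` list with `begin`, `end`, `lines`), and that the whisperX align step accepts a list of segment dicts with `text`, `start`, and `end`. The helper name and sample data are hypothetical, and the whisperX calls are shown only as comments because they were not run here.

```python
import json


def aeneas_to_whisperx_segments(sync_map_json: str) -> list:
    """Convert aeneas JSON fragments into segment dicts for a word aligner.

    Empty fragments are skipped; times are converted from strings to floats.
    """
    fragments = json.loads(sync_map_json)["fragments"]
    segments = []
    for frag in fragments:
        text = " ".join(frag.get("lines", [])).strip()
        if not text:
            continue
        segments.append({
            "text": text,
            "start": float(frag["begin"]),
            "end": float(frag["end"]),
        })
    return segments


# Made-up sentence-level sync map; not real aeneas output.
sample = json.dumps({"fragments": [
    {"id": "f000001", "begin": "0.000", "end": "2.680", "lines": ["Hello there."]},
    {"id": "f000002", "begin": "2.680", "end": "2.680", "lines": [""]},
    {"id": "f000003", "begin": "2.680", "end": "5.120", "lines": ["How are you?"]},
]})

segments = aeneas_to_whisperx_segments(sample)
print(segments)

# Assumed follow-up with whisperX (requires the package; check its README):
#   import whisperx
#   audio = whisperx.load_audio("speech.wav")
#   model_a, metadata = whisperx.load_align_model(language_code="en", device="cpu")
#   result = whisperx.align(segments, model_a, metadata, audio, "cpu")
```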

@abathur

abathur commented Oct 15, 2023

> By just how bad it is, I'm guessing this is not how Aeneas normally is, so what could be a problem causing generally bad performance? I'm not getting any errors, I'm running win11 and I process fairly small chunks of text.

You might want to use the --debug flag to investigate.

I just noticed some pretty rough results with a build that was falling back to python + subprocess for speech synthesis, but got much better results with one using the compiled cew extension. The ~good version of this looks something like:

[DEBU] 2023-10-14 21:37:58.570971 ExecuteTask: Setting synthesizer...
[DEBU] 2023-10-14 21:37:58.571019 Synthesizer: Selecting TTS engine...
[DEBU] 2023-10-14 21:37:58.571061 Synthesizer: TTS engine: eSpeak
[DEBU] 2023-10-14 21:37:58.571105 ESPEAKTTSWrapper: No tts_path specified in rconf, setting default TTS path
[DEBU] 2023-10-14 21:37:58.571130 ESPEAKTTSWrapper: TTS path is             espeak
[DEBU] 2023-10-14 21:37:58.571145 ESPEAKTTSWrapper: TTS cache?              False
[DEBU] 2023-10-14 21:37:58.571158 ESPEAKTTSWrapper: Has Python      call?   False
[DEBU] 2023-10-14 21:37:58.571170 ESPEAKTTSWrapper: Has C extension call?   True
[DEBU] 2023-10-14 21:37:58.571183 ESPEAKTTSWrapper: Has subprocess  call?   True
[DEBU] 2023-10-14 21:37:58.571205 ESPEAKTTSWrapper: Subprocess arguments: ['espeak', '-v', 'VOICE_CODE_STRING', '-w', 'WAVE_PATH', 'TEXT_STDIN']
[DEBU] 2023-10-14 21:37:58.571227 Synthesizer: Selecting TTS engine... done
[DEBU] 2023-10-14 21:37:58.571239 ExecuteTask: Setting synthesizer... done
[DEBU] 2023-10-14 21:37:58.571366 ExecuteTask: STEP 3 BEGIN (synthesize text)
[DEBU] 2023-10-14 21:37:58.571826 Synthesizer: Synthesizing text...
[DEBU] 2023-10-14 21:37:58.572540 ESPEAKTTSWrapper: Calling TTS engine via C extension or subprocess
[DEBU] 2023-10-14 21:37:58.572600 ESPEAKTTSWrapper: C extension 'cew' enabled
[DEBU] 2023-10-14 21:37:58.691740 ESPEAKTTSWrapper: C extension 'cew' enabled and it can be loaded
[DEBU] 2023-10-14 21:37:58.691839 ESPEAKTTSWrapper: Synthesizing using C extension...

(But I haven't tried the other packages mentioned here...)
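
To check a captured debug log for the "good" path shown above, a small scan for the C-extension line is enough. This is only a sketch: the function name is made up, and it matches the exact wording from the log in this thread. The wording aeneas prints when it falls back to the subprocess path is not shown here, so the function reports "not_confirmed" rather than guessing.

```python
def synthesis_backend(log_text: str) -> str:
    """Return "c_extension" if the log shows synthesis via the cew extension.

    Returns "not_confirmed" otherwise; the fallback message is not matched
    because its wording does not appear in this thread.
    """
    for line in log_text.splitlines():
        if "ESPEAKTTSWrapper" in line and "Synthesizing using C extension" in line:
            return "c_extension"
    return "not_confirmed"


# Lines copied from the debug log quoted in this thread.
good_log = """\
[DEBU] 2023-10-14 21:37:58.572600 ESPEAKTTSWrapper: C extension 'cew' enabled
[DEBU] 2023-10-14 21:37:58.691740 ESPEAKTTSWrapper: C extension 'cew' enabled and it can be loaded
[DEBU] 2023-10-14 21:37:58.691839 ESPEAKTTSWrapper: Synthesizing using C extension...
"""

print(synthesis_backend(good_log))
```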
