Replies: 1 comment
-
Lhotse supports any kind of audio-to-audio modeling task. I don't know of an open-source recipe for voice conversion, but the closest thing you may find is speech-enhancement/audio-to-audio training in NVIDIA NeMo, which supports Lhotse dataloading. Generally speaking, you can follow an ASR data preparation recipe with the following modifications:
```python
from lhotse import CutSet, Recording, RecordingSet
from lhotse.dataset.collation import collate_audio

# Scan a directory for source audio and wrap it in a CutSet.
cuts = CutSet.from_manifests(
    recordings=RecordingSet.from_dir("path/to/dir", "*.flac", num_jobs=4)
)

# Attach the parallel target utterance to each cut as a custom field
# (target_audio_path is the path to the corresponding target-speaker audio).
for cut in cuts:
    cut.target_recording = Recording.from_file(target_audio_path)

cuts.to_file("src_tgt_cuts.jsonl.gz")

# In your dataset class, collate source and target audio separately:
src_audio, src_audio_lens = collate_audio(cuts)
tgt_audio, tgt_audio_lens = collate_audio(cuts, recording_field="target_recording")
```
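To make the collation step concrete: `collate_audio` pads variable-length signals in a batch to a common length and returns the original lengths alongside. A minimal sketch of that padding logic in plain Python (the `pad_batch` helper here is illustrative, not part of Lhotse, which does the same thing with torch tensors):

```python
def pad_batch(signals, pad_value=0.0):
    """Pad a list of variable-length audio signals to equal length.

    Returns (padded_batch, lengths), mirroring the (audio, audio_lens)
    pair returned by Lhotse's collate_audio.
    """
    lens = [len(s) for s in signals]
    max_len = max(lens)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in signals]
    return padded, lens

# Example: two "signals" of different lengths.
batch, lens = pad_batch([[0.1, 0.2, 0.3], [0.4, 0.5]])
# batch -> [[0.1, 0.2, 0.3], [0.4, 0.5, 0.0]], lens -> [3, 2]
```

Keeping the lengths around matters for seq2seq training: the model masks out the padded frames of both the source and target audio using `src_audio_lens` and `tgt_audio_lens`.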
-
Our lab is looking into using Lhotse for voice conversion. While recipes exist for well-known tasks like speech recognition and text-to-speech, voice conversion seems a bit less explored. A quick search through the repository brought up the l2-arctic recipe and the vctk recipe, but how to use them to create parallel-speaker training data for speech-to-speech seq2seq voice conversion seems non-obvious. Is there a recipe for a similar task that someone could point to, so we can get started on our custom datasets and work from an existing setup?