F5 TTS — MLX

Implementation of F5-TTS, with the MLX framework.

F5 TTS is a non-autoregressive, zero-shot text-to-speech system using a flow-matching mel spectrogram generator with a diffusion transformer (DiT).

You can listen to a sample here that was generated in ~11 seconds on an M3 Max MacBook Pro.

F5 is an evolution of E2 TTS and improves performance with ConvNeXt V2 blocks for the learned text alignment. This repository is based on the original PyTorch implementation available here.

Installation

pip install f5-tts-mlx

Basic Usage

python -m f5_tts_mlx.generate --text "The quick brown fox jumped over the lazy dog."

You can also use a pipe to generate speech from the output of another process, for instance from a language model:

mlx_lm.generate --model mlx-community/Llama-3.2-1B-Instruct-4bit --verbose false \
 --temp 0 --max-tokens 512 --prompt "Write a concise paragraph explaining wavelets." \
| python -m f5_tts_mlx.generate

Voice Matching

If you want to use your own reference audio sample, make sure it's a mono, 24 kHz WAV file of around 5-10 seconds:

python -m f5_tts_mlx.generate \
--text "The quick brown fox jumped over the lazy dog." \
--ref-audio /path/to/audio.wav \
--ref-text "This is the caption for the reference audio."

You can convert an audio file to the correct format with ffmpeg like this:

ffmpeg -i /path/to/audio.wav -ac 1 -ar 24000 -sample_fmt s16 -t 10 /path/to/output_audio.wav
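Before passing a file to --ref-audio, it can help to confirm that it meets the requirements above. The following is a minimal sketch using only Python's standard-library wave module; the function name check_reference_wav and its parameters are hypothetical and are not part of f5-tts-mlx. It assumes an uncompressed PCM WAV file, which is what the ffmpeg command above produces.

```python
import wave


def check_reference_wav(path, expected_rate=24000, min_seconds=5.0, max_seconds=10.0):
    """Return a list of problems found; an empty list means the file looks usable.

    Hypothetical helper, not part of f5-tts-mlx. The defaults mirror the
    README's guidance: mono, 24 kHz, roughly 5-10 seconds.
    """
    problems = []
    # wave.open only reads uncompressed PCM WAV files.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    if not (min_seconds <= duration <= max_seconds):
        problems.append(
            f"expected {min_seconds}-{max_seconds} s of audio, found {duration:.1f} s"
        )
    return problems
```

Calling check_reference_wav("/path/to/output_audio.wav") returns an empty list when the file is mono, 24 kHz, and within the duration range; otherwise each entry describes one mismatch.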

See here for more options to customize generation.

Quantized Models

If you're in a bandwidth- or memory-limited environment, you can use the --q option to load a quantized version of the model. 4-bit and 8-bit variants are supported.

python -m f5_tts_mlx.generate --text "The quick brown fox jumped over the lazy dog." --q 4
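When driving the command-line interface from a script, the options shown so far can be assembled programmatically. The sketch below only builds the argument list and uses just the flags that appear in this README (--text, --ref-audio, --ref-text, --q); the helper name build_generate_command is hypothetical and is not part of f5-tts-mlx.

```python
import sys


def build_generate_command(text, ref_audio=None, ref_text=None, quant_bits=None):
    """Build an argument list for `python -m f5_tts_mlx.generate`.

    Hypothetical helper, not part of f5-tts-mlx. It only uses flags
    documented in this README and does not run anything itself.
    """
    cmd = [sys.executable, "-m", "f5_tts_mlx.generate", "--text", text]
    if ref_audio is not None:
        cmd += ["--ref-audio", str(ref_audio)]
    if ref_text is not None:
        cmd += ["--ref-text", ref_text]
    if quant_bits is not None:
        # The README states that 4-bit and 8-bit variants are supported.
        if quant_bits not in (4, 8):
            raise ValueError("quant_bits must be 4 or 8")
        cmd += ["--q", str(quant_bits)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a shell string avoids quoting problems when the text contains quotes or other shell metacharacters.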

From Python

You can also generate speech with a pretrained model from Python:

from f5_tts_mlx.generate import generate

audio = generate(text = "Hello world.", ...)

Pretrained model weights are also available on Hugging Face.

Appreciation

Yushen Chen for the original PyTorch implementation of F5 TTS and pretrained model.

Phil Wang for the E2 TTS implementation that this model is based on.

Citations

@article{chen-etal-2024-f5tts,
      title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching}, 
      author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
      journal={arXiv preprint arXiv:2410.06885},
      year={2024},
}

@inproceedings{Eskimez2024E2TE,
    title   = {E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS},
    author  = {Sefik Emre Eskimez and Xiaofei Wang and Manthan Thakker and Canrun Li and Chung-Hsien Tsai and Zhen Xiao and Hemin Yang and Zirun Zhu and Min Tang and Xu Tan and Yanqing Liu and Sheng Zhao and Naoyuki Kanda},
    year    = {2024},
    url     = {https://api.semanticscholar.org/CorpusID:270738197}
}

License

The code in this repository is released under the MIT license as found in the LICENSE file.
