This repository is under heavy development as I work on polishing this project. Until this notice disappears, please expect the documentation to be incomplete or out of date, understand that you may find this program confusing to build or use, and know that I will likely provide little support until documentation and polishing are complete.
This is a set of components intended to ease the creation of audio productions that involve synthesizing text to speech, particularly where many voices spread across any number of different computers or devices are involved.
This setup involves 3 components:
The user client is the frontend of this application. It is responsible for requesting synthesis and outputting the results.
Speech providers translate text to speech using the voices available on a given system. A provider can run anywhere TTS voices are available, and some providers might be platform-agnostic, for example those that synthesize TTS through cloud services like Google or Amazon.
The coagulator is the component that ties everything together. A coagulator is run, then speech providers connect to it and each send a list of the voices they can synthesize. The coagulator maps voice names to provider clients. When a user client connects to the coagulator, it can therefore send a long speech script to the coagulator, which is parsed, whereupon per-voice requests are sent on to the various speech providers that can synthesize each voice. The coagulator then acts as a relay, sending the audio data from each provider back to the user client that initially asked for it.
Currently the client works with Python 3.12 on 64-bit Windows; it should work on other platforms soon, once a few minor dependency issues are resolved.
Pretty much this entire project is written in Python, aside from a couple of providers such as the one for the NVGT fallback RSynth voice. The recommended first-time setup instructions are as follows:
- After opening a terminal window up to this repository's directory, create a Python virtual environment to make an isolated workspace for all modules this project uses:
python -m venv venv
- Activate the virtual environment you just created. On Windows:
venv\scripts\activate
On macOS or Linux:
source venv/bin/activate
- Install requirements:
pip install -r requirements.txt
- If you wish to run the balcony provider, you'll want to download balcon.zip and place the contained balcon.exe in the provider directory.
From this point you can cd to the coagulator directory and run coagulator.py, cd to the provider directory and run balcony.py or macsay.py, or cd to the user directory and run STAR.py, depending on what you want to do.
If you want to run a complete local stack from one shell window, at least on Windows you can use pythonw instead of python to run the coagulator and the provider in windowed mode, which will not block your terminal.
If you wish to build the binaries, you can run pyinstaller STAR.spec from the repository's root directory after generating the readme.html file by running python user/html_readme.py.
You only need to set up a virtual environment the first time you clone the repository; after that, you only need to activate it each time you open a terminal window up to the root of the repository.
You can skip this section unless you wish to contribute to the project, want to know how it works, or intend to write new providers or clients.
This project uses WebSockets for communication. The coagulator acts as the WebSocket server and is written in Python; both speech providers and user clients connect to it. All payloads except for the audio data itself are in JSON, and I can't imagine a way to make the API any simpler.
For a user client to communicate with the coagulator, it should send a payload in the form:
{"user": revision, "request": ["voice1: line1", "voice2: line2", "someone<r=-5>": "etc"]}
If the request key is omitted, the coagulator will return a list of the full voice names available in the form {"voices": ["voice1", "voice2"]}
Otherwise, the coagulator will start sending back binary payloads in the form:
2 byte little endian request ID length, request ID, audio data
Until all lines passed have been synthesized.
While the coagulator does act as a direct relay, and thus providers could send data in any format they want, what's written here is the standard.
The speech key is important. It contains an ID used to track the order of synthesis, made up of 2 or at most 3 numbers separated by underscores. The first number usually isn't important; it's the integer client ID that you are known as by the coagulator. If it exists, the second number contains any custom ID you have passed by providing an ID key in your request message. The final number contains an integer denoting the fragment number that this speech payload is referencing. When outputting data without providing an ID in your requests, you should number or sequence your output based on this final number.
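To make the round trip concrete, here is a minimal user client sketch in Python. It assumes the websockets package and uses ws://localhost:7774 and a revision of 1 purely as placeholders; substitute the coagulator address and revision appropriate to your setup.
import asyncio
import json
import struct
import websockets

async def main():
    async with websockets.connect("ws://localhost:7774") as ws:
        # Omitting the request key asks the coagulator for the available voices.
        await ws.send(json.dumps({"user": 1}))
        voices = json.loads(await ws.recv())["voices"]
        print("available voices:", voices)
        # Request two lines of speech from the first reported voice.
        script = [voices[0] + ": Hello there.", voices[0] + ": This is line two."]
        await ws.send(json.dumps({"user": 1, "request": script}))
        received = 0
        while received < len(script):
            message = await ws.recv()
            if isinstance(message, (bytes, bytearray)):
                # 2 byte little endian request ID length, request ID, audio data.
                id_length = struct.unpack("<H", message[:2])[0]
                request_id = message[2:2 + id_length].decode()
                audio = message[2 + id_length:]
                # The final underscore separated number is the fragment index.
                fragment = int(request_id.split("_")[-1])
                with open("fragment_%d.wav" % fragment, "wb") as f:
                    f.write(audio)
                received += 1
            else:
                print("status:", message)  # Any JSON status payloads.

asyncio.run(main())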
The recommended way to write a STAR provider is by using the existing framework called provider.py within the provider directory. Several example providers already exist which not only connect the most popular voice engines to STAR, but can also serve as examples for creating your own provider.
If you have a command line program which can receive text as input and produce wav data as output, a provider can be written in just a couple of lines of code. The following is a 2-line provider which wraps the sam.exe synthesizer, for example.
from provider import star_provider
star_provider("sammy", voices = "microsam", synthesis_process = ["sam", "-wav", "{filename}", "{text}"], synthesis_process_rate = ["-speed", "{rate}"], synthesis_process_pitch = ["-pitch", "{pitch}"])
The constructor for the STAR provider class is as follows:
def __init__(self, provider_basename = os.path.splitext(sys.argv[0])[0], handle_argv = True, run_immedietly = True, voices = None, synthesis_process = None, synthesis_process_rate = None, synthesis_process_pitch = None)
Arguments:
- provider_basename: Used to name the provider's configuration file, to show in error messages, and for anything else where a generic name string is useful. It should be a valid file basename.
- handle_argv: By default, the provider facility handles various command line arguments which control things like what config file to load and more. Set this to False if you want to disable this behavior for your provider.
- run_immedietly: If you set this to False, you'll need to manually call provider.run() after the provider has been constructed, which could be good if you wish to further configure it before starting it up.
- voices: An initial list of voices you provide. For more dynamic providers where the list could change, it's best to instead override the provider's get_voices method.
- synthesis_process: If your provider simply connects to a command line application, this argument controls the process to be executed. It should be a list, with one part of the command per item. The first item is usually a process filename, with subsequent items being arguments that are passed to it. The format specifiers {filename}, {text}, {rate} and {pitch} can be used to insert the dynamic bits of the argument content; however, it's best to use the synthesis_process_rate and synthesis_process_pitch provider arguments to handle the speech parameters instead of bundling them all in the synthesis_process argument directly.
- synthesis_process_rate, synthesis_process_pitch: Additions to the synthesis_process argument, these arguments are appended to the final command that is to be executed only if the rate and/or pitch are actually present in the request. This can be important because with most command line applications, not specifying a rate or pitch argument at all means using the default, therefore passing the provider's fallback value of 0 for these parameters when they are not present might not be optimal for calling your synthesis process. Just like above, you use {rate} or {pitch} to insert the value into the argument string, and you should pass the arguments as a Python list.
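As a further illustration (not part of the repository), a provider wrapping the espeak-ng command line synthesizer might look like the following; it assumes espeak-ng is on your PATH and uses its -w, -s and -p flags for the output file, speed and pitch. Note that espeak-ng expects words per minute for -s and a 0 to 99 value for -p, so the raw rate and pitch values from requests may need scaling in practice.
# Hypothetical espeak-ng provider sketch; assumes espeak-ng is on your PATH.
from provider import star_provider
star_provider("espeak", voices = "espeak-default", synthesis_process = ["espeak-ng", "-w", "{filename}", "{text}"], synthesis_process_rate = ["-s", "{rate}"], synthesis_process_pitch = ["-p", "{pitch}"])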
You might be interested in overriding the following methods in a subclass of the star_provider object if you need more advanced functionality than what the base object provides:
- async def synthesize(self, voice, text, rate = None, pitch = None): This method can return a bytes-like object containing synthesized wave data, or an error string if synthesis was not successful. The voice argument is guaranteed to be set to one of the voices you told the provider about. Beware the async nature of this function.
- def get_voices(self): This function can return a single voice name as a string, a list of voice names, or a dictionary with each key being a voice name and its value being a subdictionary of extra metadata about the voice (e.g. for language codes etc. in the future).
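For instance, a more dynamic provider that enumerates its voices at runtime and performs its own synthesis might be sketched like this; my_tts_engine and its functions are purely hypothetical stand-ins for whatever speech API you are actually wrapping.
# Hypothetical subclass sketch; my_tts_engine is a stand-in for your real speech API.
from provider import star_provider
import my_tts_engine

class my_provider(star_provider):
    def get_voices(self):
        # Return whatever voices the engine reports right now.
        return my_tts_engine.list_voice_names()
    async def synthesize(self, voice, text, rate = None, pitch = None):
        try:
            # Should return wav audio as a bytes-like object.
            return my_tts_engine.speak_to_wav(voice, text, rate = rate, pitch = pitch)
        except Exception as e:
            # Returning a string reports the failure as a status message instead.
            return "synthesis failed: " + str(e)

my_provider("my_provider")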
Usually it's best to inherit from the provided Python provider class rather than doing this from scratch, but the instructions are below nevertheless; they can be useful, for example, if you wish to write a provider that runs on an embedded device or in any other environment where modern Python is not available.
When a speech provider first connects to a coagulator, it should send a packet in the form:
{"provider": revision, "voices": ["voice1", "voice2", "etc"]}
before continuously listening for packets in the form:
{"voice": "voicename", "text": "text to speak", "id": ID}
and responding each time with binary packets such as:
2 byte little endian ID length, ID, audio data
where ID is the same as that received in the voice packet's id key.
Voice packets might contain extra parameters like rate and pitch, but your providers should be set up to not require these.
If a request fails to synthesize or if you wish to pipe status messages back to the client that initiated a speech request, you can send packets such as:
{"provider": revision, "id": request_ID, "status": "status message"}
Another special key, "abort", can be sent in this payload with a value of True to indicate that this speech request has failed and thus that no audio data will be forthcoming.
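Putting that together, a rough from-scratch provider written against the websockets package might look like the sketch below; the coagulator address, the revision of 1, the voice names, and the synth_to_wav function are all placeholders to be replaced with your own values and synthesis routine.
import asyncio
import json
import struct
import websockets

def synth_to_wav(voice, text, rate = None, pitch = None):
    # Placeholder; replace with a call into your actual synthesizer.
    raise NotImplementedError

async def main():
    async with websockets.connect("ws://localhost:7774") as ws:
        # Announce ourselves along with the voices we can synthesize.
        await ws.send(json.dumps({"provider": 1, "voices": ["voice1", "voice2"]}))
        async for message in ws:
            request = json.loads(message)
            try:
                audio = synth_to_wav(request["voice"], request["text"], request.get("rate"), request.get("pitch"))
                # 2 byte little endian ID length, ID, audio data.
                encoded_id = str(request["id"]).encode()
                await ws.send(struct.pack("<H", len(encoded_id)) + encoded_id + audio)
            except Exception as e:
                # Report the failure and indicate that no audio will follow for this request.
                await ws.send(json.dumps({"provider": 1, "id": request["id"], "status": str(e), "abort": True}))

asyncio.run(main())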
This is a tool created with the intention of making it possible for small groups of friends to create TTS audio skits and dramas with increased collaboration, or even so that somebody can network all of their local voices together with fewer cables and hassles. By no means is this intended to deprive voice creators of income, hurt them in any way, or disrespect their terms of service. Sharing access to voices that disallow such distribution in their license agreements, particularly beyond small groups of friends, goes against the intended use I had in mind for this project, and I expressly disclaim any responsibility for such misuse of the program. Please use this tool respectfully!