NOTE: This README describes the old version of the SEPIA STT Server. Please consider upgrading to V2 to get all the new features and the best support. Old ASR models can be converted easily ;-).
[BETA - UNDER CONSTRUCTION]
This server supports streaming audio over a WebSocket connection and integrates an open-source ASR decoder such as the Kaldi speech recognition toolkit. It can handle full-duplex messaging during the decoding process to deliver intermediate results. The server's REST interface allows switching the ASR model on the fly.
- Websocket server (Python Tornado) that can receive (and send) audio streams
- Compatible with the SEPIA Framework client
- Integration of Zamia Speech (python-kaldiasr) to use Kaldi ASR in Python
- Roughly based on nexmo-community/audiosocket_framework
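If you want to talk to the WebSocket endpoint from your own code, a minimal client might look like the sketch below. The exact message framing of the server is not documented here, so the audio format (16 kHz, 16-bit mono PCM), the chunking, and the JSON result frames are assumptions for illustration; the URL matches the Docker setup described below and 'test.wav' is a placeholder.

```python
# Minimal streaming-client sketch (assumptions: binary PCM frames in,
# JSON text frames with intermediate/final results out).
# Requires: pip install websocket-client
import json
import wave

import websocket

ws = websocket.create_connection("ws://localhost:9000/stt/socket")
ws.settimeout(3.0)

wav = wave.open("test.wav", "rb")   # assumed: 16 kHz, 16-bit, mono
chunk_frames = 4000                 # ~250 ms per chunk at 16 kHz

data = wav.readframes(chunk_frames)
while data:
    ws.send_binary(data)            # stream raw audio to the server
    data = wav.readframes(chunk_frames)

# Collect result messages until the server stops sending or closes.
try:
    while True:
        print(json.loads(ws.recv()))
except (websocket.WebSocketTimeoutException,
        websocket.WebSocketConnectionClosedException):
    pass
ws.close()
```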
Make sure you have Docker installed then pull the image via the command-line:
docker pull sepia/stt-server:beta2.1
Once the image has finished downloading (~700MB, extracted ~2GB) you can run it using:
docker run --rm --name=sepia_stt -d -p 9000:8080 sepia/stt-server:beta2.1
This will start the STT server (with internal proxy running on port 8080 with path '/stt') and expose it to port 9000 (choose whatever you need here).
To test if the server is working you can call the settings interface with:
curl http://localhost:9000/stt/settings && echo
You should see a JSON response indicating the ASR model and server version.
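For example, the response might look roughly like this (hypothetical values; the actual field names depend on the server version):
{"kaldi_model": "/opt/kaldi/model/kaldi-generic-en-tdnn_sp", "version": "beta2.1"}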
To stop the server use:
docker stop sepia_stt
To change the server settings, add your own ASR models, customize the language model, or capture your recordings for later, you can use the internal 'share' folder like this:
wget -O share-folder.zip https://github.com/SEPIA-Framework/sepia-stt-server/blob/master/share-folder.zip?raw=true
unzip share-folder.zip -d /home/[my user]/sepia-stt-share/
docker run --rm --name=sepia_stt -d -p 9000:8080 -v /home/[my user]/sepia-stt-share:/apps/share sepia/stt-server:beta2.1
where /home/[my user]/sepia-stt-share is just an example; use any folder you like (e.g. on Windows it could be C:/sepia/stt-share).
When set up like this, the server will load its configuration from the app.conf in your shared folder.
For SEPIA app/client settings see below.
Make sure you have at least Python 2.7 with pip installed (e.g.: sudo apt-get install python-pip). Depending on your operating system you may also need header files for Python and OpenSSL. If you are good to go, install a few dependencies via pip:
pip install tornado webrtcvad numpy
Then get the Python Kaldi bindings from Zamia Speech (Debian9 64bit example, see link for details):
echo "deb http://goofy.zamia.org/repo-ai/debian/stretch/amd64/ ./" >/etc/apt/sources.list.d/zamia-ai.list
wget -qO - http://goofy.zamia.org/repo-ai/debian/stretch/amd64/bofh.asc | sudo apt-key add -
apt-get update
apt-get install python-kaldiasr
Download one (or more) of their great ASR models too! I recommend 'kaldi-generic-en-tdnn_sp'.
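Once the bindings and a model are installed, you can run a quick offline decode to verify everything works. The snippet below follows the usage example from Zamia Speech's py-kaldi-asr; the model directory matches the default path used further below and 'test.wav' is a placeholder for any 16 kHz, 16-bit mono WAV file.

```python
# Quick sanity check for the Zamia Speech Kaldi bindings (py-kaldi-asr).
from kaldiasr.nnet3 import KaldiNNet3OnlineModel, KaldiNNet3OnlineDecoder

MODELDIR = '/opt/kaldi/model/kaldi-generic-en-tdnn_sp'
WAVFILE = 'test.wav'  # placeholder: any 16 kHz, 16-bit mono WAV file

model = KaldiNNet3OnlineModel(MODELDIR)
decoder = KaldiNNet3OnlineDecoder(model)

if decoder.decode_wav_file(WAVFILE):
    text, likelihood = decoder.get_decoded_string()
    print(text)
```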
git clone https://github.com/SEPIA-Framework/sepia-stt-server.git
cd sepia-stt-server
python sepia_stt_server.py
You can check if the server is reachable by calling http://localhost:20741/ping
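For example:
curl http://localhost:20741/ping && echo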
The application reads its configuration on start-up from the app.conf file that can be located in several different locations (checked in this order):
- Home folder of the user:
~/share/sepia_stt_server/app.conf
- App folder:
/apps/share/sepia_stt_server/app.conf
- Base folder of the server app:
./app.conf
The most important settings are:
- port: Port of the server, default is 20741. You can use
ngrok http 20741
to tunnel to the SEPIA STT-Server for testing.
- recordings_path: This is where the framework application will store audio files it records, default is "./recordings/"
- kaldi_model_path: This is where the ASR models for Kaldi are stored, default is "/opt/kaldi/model/kaldi-generic-en-tdnn_sp" as used by Zamia Speech
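A minimal app.conf sketch with the settings above could look like this (the section name and exact key layout are assumptions; check the app.conf shipped in the share-folder.zip for the authoritative format):

```ini
[server]
port = 20741
recordings_path = ./recordings/
kaldi_model_path = /opt/kaldi/model/kaldi-generic-en-tdnn_sp
```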
Open your client (e.g. the official public client), go to settings and look for 'ASR server' (page 2). If you are using the Docker image (see above) your entry should look something like this:
ws://127.0.0.1:9000/stt/socket
(when running Docker on the same machine and using the example command to start the image)
wss://secure.example.com/stt/socket
(when using a secure server and proxy)
After you've set the correct server, check the 'ASR engine' selector. If your browser supports the 'MediaDevices' interface you will be able to select 'Custom (WebSocket)' here.
Some browsers might require a secure HTTPS connection. If you don't have your own secure web-server you can use tools like Ngrok for testing, e.g.:
./ngrok http 9000
Choose the right port depending on your app.conf and your Docker run command (in case you are using the Docker image) and then set your 'ASR server' like this:
wss://[MY-NGROK-ADDRESS].ngrok.io/socket
(if you run the server directly) or
wss://[MY-NGROK-ADDRESS].ngrok.io/stt/socket
(if you're using the Docker image).
Finally test the speech recognition in your client via the microphone button :-)
The configuration can be changed while the server is running.
Get the current configuration via HTTP GET (custom server):
curl -X GET http://localhost:20741/settings
Note: Replace localhost with your server address, or localhost:port with the web-server/proxy/Ngrok address. When you are using the Docker image the server sits behind a proxy, so add '/stt' to the path (i.e. use '/stt/settings') like in the client setup.
Set a different Kaldi model via HTTP POST, e.g.:
curl -X POST http://localhost:20741/settings \
-H 'Content-Type: application/json' \
-d '{"token":"test", "kaldi_model":"/home/user/share/kaldi_models/my-own-model"}'
(Note: token=test is a placeholder for a future authentication process)
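The same can be done from Python with the requests library; a minimal sketch (the model path is a placeholder, and the '/stt' prefix only applies behind the Docker proxy):

```python
# Switch the Kaldi model at runtime and read the settings back.
# Requires: pip install requests
import requests

BASE = "http://localhost:20741"  # behind the Docker proxy use e.g. http://localhost:9000/stt

resp = requests.post(BASE + "/settings", json={
    "token": "test",  # placeholder for a future authentication process
    "kaldi_model": "/home/user/share/kaldi_models/my-own-model"  # placeholder path
})
print(resp.text)

# Read the current configuration back to confirm the change:
print(requests.get(BASE + "/settings").text)
```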