dminier/whisper-triton-api

Description

This project is a small experiment that provides an efficient speech-to-text API built on Whisper (large-v3) and NVIDIA Triton.

Requirements

  • NVIDIA GPU
  • CUDA 12.3
  • Docker Compose
  • Python 3.10 or later

Build Whisper model with TensorRT-LLM

  • NVIDIA Release 24.01 (build 80100513)
  • Triton Server Version 2.42.0

The Python backend is already prepared. The steps below build the Whisper TensorRT-LLM model in the directory that "docker compose" expects.

(See https://github.com/k2-fsa/sherpa/tree/master/triton/whisper#export-whisper-model-to-tensorrt-llm for more details.)

mount="./models/whisper/1/whisper_large_v3:/workspace/TensorRT-LLM/examples/whisper/whisper_large_v3"
docker run -it --name "whisper-build" --gpus all --net host -v $mount --shm-size=2g soar97/triton-whisper:24.01.complete

# You are now inside the container, under /workspace
cd TensorRT-LLM/examples/whisper

# @see : https://github.com/k2-fsa/sherpa/tree/master/triton/whisper#export-whisper-model-to-tensorrt-llm

# take large-v3 model as an example
wget --directory-prefix=assets https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt

# Build the large-v3 model using a single GPU with plugins.
python3 build.py --output_dir whisper_large_v3 --use_gpt_attention_plugin --use_gemm_plugin  --use_bert_attention_plugin --enable_context_fmha
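Before running the build, it can be worth checking that the checkpoint downloaded completely. The sketch below assumes that the long hexadecimal segment in the download URL is the SHA-256 digest of large-v3.pt, which is how the openai-whisper package appears to validate its own downloads. The helper names are hypothetical and not part of this repository.

```python
import hashlib
from pathlib import Path

# Assumption: the hex segment of the download URL is the file's SHA-256 digest.
EXPECTED_SHA256 = "e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file exists and its digest matches the expected value."""
    return path.is_file() and sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # Path used by the wget command above (relative to examples/whisper).
    checkpoint = Path("assets/large-v3.pt")
    print("checkpoint OK" if is_intact(checkpoint) else "checkpoint missing or corrupted")
```

If the check fails, downloading the file again is usually cheaper than debugging a build that started from a truncated checkpoint.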

You can upload the models/whisper directory to your own MLOps repository.

Remove the build container once the build is finished:

docker rm "whisper-build"

API

Usage

With Streamlit

Use my cookbook: https://github.com/dminier/experiments-cookbook

Simple test

To run both the API and the Triton server, then send a test request:

docker compose up -d
curl -X 'POST' \
  'http://localhost:7000/rest/speech2text/en' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@tests/dataset/en/en-1.wav;type=audio/wav'
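The same request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint, port, and form field name ("file") shown in the curl command, and that the response body is JSON. The function names are illustrative and not part of this repository.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, payload: bytes, content_type: str):
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(wav_path: str, lang: str = "en", base_url: str = "http://localhost:7000"):
    """POST a WAV file to the speech2text endpoint and return the decoded JSON body."""
    path = Path(wav_path)
    body, header = build_multipart("file", path.name, path.read_bytes(), "audio/wav")
    request = urllib.request.Request(
        f"{base_url}/rest/speech2text/{lang}",
        data=body,
        headers={"Content-Type": header, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the stack started with "docker compose up -d"):
#   print(transcribe("tests/dataset/en/en-1.wav"))
```

Building the multipart body by hand avoids a third-party dependency. With the requests library installed, passing the file through its files= argument would be shorter.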

Response:

" after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels"

Uvicorn output:

2024-03-10 13:07:22.596 | DEBUG    | speech2text.application.api:speech2text:25 - 236.97 ms to transcribe en-1.wav of size 212.044 kB and content-type audio/wav
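Timing lines like the one above can be collected from the logs for a quick latency summary. This is a small sketch that assumes the message format stays exactly as shown in the sample line. The function name and the returned keys are hypothetical.

```python
import re

# Matches the message part of the sample log line shown above.
TIMING_PATTERN = re.compile(
    r"- (?P<ms>\d+(?:\.\d+)?) ms to transcribe (?P<name>.+?) "
    r"of size (?P<kb>\d+(?:\.\d+)?) kB and content-type (?P<ctype>\S+)"
)


def parse_timing(line: str):
    """Extract latency, file name, size, and content type; None if the line differs."""
    match = TIMING_PATTERN.search(line)
    if match is None:
        return None
    return {
        "latency_ms": float(match.group("ms")),
        "filename": match.group("name"),
        "size_kb": float(match.group("kb")),
        "content_type": match.group("ctype"),
    }


if __name__ == "__main__":
    sample = (
        "2024-03-10 13:07:22.596 | DEBUG    | speech2text.application.api:speech2text:25 "
        "- 236.97 ms to transcribe en-1.wav of size 212.044 kB and content-type audio/wav"
    )
    print(parse_timing(sample))
```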

Stress test

Go to the stress-tests directory:

cd stress-tests

Run this script once to download and prepare a French dataset:

./prepare_datasets.sh

Run Locust:

./run_locust.sh

Open the Locust web page and start the test (stress.py contains custom load shapes): http://0.0.0.0:8089
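For readers unfamiliar with load shapes: in Locust, a custom shape is a LoadTestShape subclass whose tick() method returns a (user_count, spawn_rate) tuple, or None to stop the test. The sketch below shows only the stage-selection logic such a tick() could delegate to. The stage values are invented for illustration and are not taken from stress.py.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical stages: (end_time_in_seconds, user_count, spawn_rate).
STAGES = [
    (60, 5, 1.0),    # warm-up
    (120, 20, 5.0),  # peak
    (180, 5, 5.0),   # cool-down
]


def tick_for(
    run_time: float,
    stages: Sequence[Tuple[float, int, float]] = STAGES,
) -> Optional[Tuple[int, float]]:
    """Return (user_count, spawn_rate) for the elapsed time, or None when all stages ended."""
    for end_time, users, spawn_rate in stages:
        if run_time < end_time:
            return users, spawn_rate
    return None


if __name__ == "__main__":
    for elapsed in (0, 59, 60, 150, 180):
        print(elapsed, tick_for(elapsed))
```

Inside a real shape class, tick() would typically call this with self.get_run_time() as the elapsed time.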

On my computer (AMD Ryzen 5 3600, 16 GiB RAM, NVIDIA GeForce RTX 4070):

Locust charts

The run produces a file named rtf_report.txt with real-time factor (RTF) data (see https://openvoice-tech.net/index.php/Real-time-factor).
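The real-time factor is the processing time divided by the duration of the audio, so a value below 1 means transcription runs faster than real time. The sketch below computes it for a single WAV file. It makes no assumption about the format of rtf_report.txt, and the function names are illustrative.

```python
import wave


def wav_duration_seconds(source) -> float:
    """Duration of a WAV file (path or binary file object), from its header."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; lower is better."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Usage, with a latency measured around the API call:
#   duration = wav_duration_seconds("tests/dataset/en/en-1.wav")
#   print(real_time_factor(0.23697, duration))
```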

After stopping Locust, you can run python show_rtf.py

Then open rtf.png:

Locust charts

References

Todo

  • Better API contract
  • Batch endpoint
  • Some audio conversion
  • Accuracy test
  • Use a smaller Triton Server image
