server example: 1st query fast, subsequent queries slower #2657

laurence-henderson · 2024-12-22T06:40:02Z

Reproducibly, the first call to /inference after the server starts performs well, with CPU usage at a constant 100%. Subsequent calls take about twice as long, and CPU usage fluctuates between roughly 20% and 80%.
Could this have something to do with reusing the context?

<start server with ./build/bin/whisper-server -m models/ggml-small.en.bin -t 16>
time curl 127.0.0.1:8080/inference -H "Content-Type: multipart/form-data" -F file=@./samples/output.wav
real 0m26.195s
time curl 127.0.0.1:8080/inference -H "Content-Type: multipart/form-data" -F file=@./samples/output.wav
real 0m48.280s
time curl 127.0.0.1:8080/inference -H "Content-Type: multipart/form-data" -F file=@./samples/output.wav
real 0m48.256s

<restart server with ./build/bin/whisper-server -m models/ggml-small.en.bin -t 16>
time curl 127.0.0.1:8080/inference -H "Content-Type: multipart/form-data" -F file=@./samples/output.wav
real 0m26.566s
time curl 127.0.0.1:8080/inference -H "Content-Type: multipart/form-data" -F file=@./samples/output.wav
real 0m48.180s
time curl 127.0.0.1:8080/inference -H "Content-Type: multipart/form-data" -F file=@./samples/output.wav
real 0m48.206s

On each invocation I see exactly the same server output:

system_info: n_threads = 16 / 16 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |
operator(): processing 'output.wav' (8142848 samples, 508.9 sec), 16 threads, 1 processors, lang = en, task = transcribe, timestamps = 1

In case it is relevant, this is running on an AWS ARM c7g.4xlarge instance with Ubuntu 24.04.
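
For anyone trying to reproduce this, the manual timing runs above could be scripted roughly as follows. This is only a sketch: the helper names `time_inference` and `slowdown_ratio` are made up for illustration and are not part of whisper.cpp, and it assumes the server is already listening on 127.0.0.1:8080 as in the commands above. It relies on curl's `--write-out` variable `time_total` rather than the shell's `time`.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of whisper.cpp.

# Print the total time in seconds that curl reports for one /inference request.
# $1 = endpoint URL, $2 = path to the WAV file to upload.
time_inference() {
    local url="$1" wav="$2"
    # -F sends the file as multipart/form-data; the response body is discarded.
    curl -s -o /dev/null -w '%{time_total}\n' \
        -F "file=@${wav}" "${url}"
}

# Print how many times slower a later request was than the first,
# rounded to two decimals. $1 = first timing, $2 = later timing.
slowdown_ratio() {
    awk -v first="$1" -v later="$2" 'BEGIN { printf "%.2f\n", later / first }'
}

# Example (needs a running server, so it is left commented out):
#   first=$(time_inference 127.0.0.1:8080/inference ./samples/output.wav)
#   second=$(time_inference 127.0.0.1:8080/inference ./samples/output.wav)
#   slowdown_ratio "$first" "$second"
```

With the timings reported above, `slowdown_ratio 26.195 48.280` prints 1.84, which matches the "about twice as long" observation.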

ggerganov · 2024-12-22T13:31:36Z

Hm, not sure. I cannot reproduce locally.
