-
Oh well, answering my own question: OpenVINO was definitely worth pursuing! I used a Python 3.10 venv, as outlined in the documentation, to convert the model. I'll try to modify the convert script to handle my French distilled model, and then I should be all set. Is it still worth building ggml with BLAS and Intel MKL, or is everything happening on the GPU?
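For anyone following along, the conversion flow I used looks roughly like this. This is a sketch based on the OpenVINO section of the whisper.cpp docs; the exact file names and flags may differ in your checkout, and `base.en` here is just a placeholder model:

```shell
# Create a Python 3.10 venv for the OpenVINO conversion tooling
python3.10 -m venv openvino_conv_env
source openvino_conv_env/bin/activate
pip install -r models/requirements-openvino.txt

# Convert the Whisper encoder to OpenVINO IR
# (replace base.en with the model you actually use)
python models/convert-whisper-to-openvino.py --model base.en

# Rebuild whisper.cpp with OpenVINO support enabled
cmake -B build -DWHISPER_OPENVINO=1
cmake --build build -j --config Release
```

For a distilled model like the French one, the idea would be to point the convert script at the local Hugging Face weights instead of a stock model name, which is the modification mentioned above.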
-
Hi all, I'm looking to build a fast STT setup for use in Home Assistant. I'm coming from faster-whisper with a small model running directly on the N100 machine that hosts Home Assistant. Each command was taking 6-7 seconds with really hit-or-miss results.
I recently learned about Vulkan support in whisper.cpp and decided to migrate the STT component to my home server, running a Xeon D-1521 and a discrete GPU. I am now able to run a large-v3 model in about 8-9 seconds with infinitely better accuracy, which I'll trade a couple seconds for any day. Vulkan is really a game changer, as it is about 10x faster compared to the CPU backend. It would be awesome if I could bring that down under the 5 second mark, but I'm struggling as everything I tried so far has had no effect at all.
Here's everything I tried:
- Switching to the bofenghuang/whisper-large-v3-french-distil-dec4 model (which I understand is about the same as using the newer turbo model)

The one thing I haven't tried yet is OpenVINO, which I believe can also run on the Arc GPU. However, I haven't been able to yet, as I'm currently stuck with Python and OpenVINO versions that are seemingly too recent for whisper.cpp.
Should I pursue OpenVINO given my current hardware, or have I hit a hard limit? Anything else that is worth trying (besides downgrading to a smaller model and sacrificing accuracy)?
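For reference, the Vulkan setup described above can be reproduced with something like the following. This assumes a recent whisper.cpp tree where the backend flags live in ggml's CMake (older checkouts used differently named options), and the model path is just an example:

```shell
# Build whisper.cpp with the Vulkan backend enabled
cmake -B build -DGGML_VULKAN=1
cmake --build build -j --config Release

# Quick benchmark against a large-v3 ggml model to compare backends
./build/bin/whisper-bench -m models/ggml-large-v3.bin
```

Running the same benchmark once with and once without `-DGGML_VULKAN=1` is an easy way to quantify the speedup on a given machine.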