No i8mm/SVE in Dimensity 8300 means slower generation. #1146

Open
Hiso89 opened this issue Oct 6, 2024 · 4 comments

Comments

Hiso89 commented Oct 6, 2024

Only Q4_0_4_4 GGUF models run on my Poco X6 Pro phone. CPU-Z says it has Cortex-A510 and Cortex-A715 cores, which both support i8mm and SVE. When I try to run a GGUF that needs those instructions, this happens:

~/koboldcpp $ python koboldcpp.py --model Phi-3.5-mini-instruct_Uncensored-Q4_0_4_8.gguf


Welcome to KoboldCpp - Version 1.75.2
No GPU or CPU backend was selected. Trying to assign one for you automatically...
No GPU Backend found...

No GPU backend found, or could not automatically determine GPU layers. Please set it manually.
Attempting to use CPU library.
Initializing dynamic library: koboldcpp_default.so

Namespace(model='Phi-3.5-mini-instruct_Uncensored-Q4_0_4_8.gguf', model_param='Phi-3.5-mini-instruct_Uncensored-Q4_0_4_8.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=3, usecublas=None, usevulkan=None, useclblast=None, usecpu=False, contextsize=4096, gpulayers=0, tensor_split=None, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=3, lora=None, noshift=False, nommap=False, usemlock=False, noavx2=False, debugmode=0, onready='', benchmark=None, prompt='', promptlimit=100, multiuser=1, remotetunnel=False, highpriority=False, foreground=False, preloadstory='', quiet=False, ssl=None, nocertify=False, mmproj='', password=None, ignoremissing=False, chatcompletionsadapter='', flashattention=False, quantkv=0, forceversion=0, smartcontext=False, unpack='', nomodel=False, showgui=False, skiplauncher=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=0, sdclamped=0, sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None, noblas=False)

Loading model: /data/data/com.termux/files/home/koboldcpp/Phi-3.5-mini-instruct_Uncensored-Q4_0_4_8.gguf

The reported GGUF Arch is: phi3
Arch Category: 0


Identified as GGUF model: (ver 6)
Attempting to Load...

Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 32 key-value pairs and 197 tensors from /data/data/com.termux/files/home/koboldcpp/Phi-3.5-mini-instruct_Uncensored-Q4_0_4_8.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.1685 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32064
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_swa = 262144
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 3072
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 3.82 B
llm_load_print_meta: model size = 2.03 GiB (4.55 BPW)
llm_load_print_meta: general.name = Phi 3.5 Mini instruct_Uncensored
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 32000 '<|endoftext|>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: CPU buffer size = 2074.66 MiB
.....................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx = 4192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1572.00 MiB
llama_new_context_with_model: KV self size = 1572.00 MiB, K (f16): 786.00 MiB, V (f16): 786.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 306.19 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
ggml/src/ggml-aarch64.c:1926: GGML_ASSERT((ggml_cpu_has_sve() || ggml_cpu_has_matmul_int8()) && "__ARM_FEATURE_SVE and __ARM_FEATURE_MATMUL_INT8 not defined, use the Q4_0_4_4 quantization format for optimal " "performance") failed
ggml/src/ggml-aarch64.c:1926: GGML_ASSERT((ggml_cpu_has_sve() || ggml_cpu_has_matmul_int8()) && "__ARM_FEATURE_SVE and __ARM_FEATURE_MATMUL_INT8 not defined, use the Q4_0_4_4 quantization format for optimal " "performance") failed
ggml/src/ggml-aarch64.c:1926: GGML_ASSERT((ggml_cpu_has_sve() || ggml_cpu_has_matmul_int8()) && "__ARM_FEATURE_SVE and __ARM_FEATURE_MATMUL_INT8 not defined, use the Q4_0_4_4 quantization format for optimal " "performance") failed
0: 0x771f1b3a58
1: 0x771f1b39cc ggml_abort
2: 0x771f4c00dc ggml_gemm_q4_0_8x8_q8_0
3: 0x771f1dc290
4: 0x771f1c46b8
5: 0x771f1c4408 ggml_graph_compute
6: 0x771f4bc0a0
7: 0x771f4bad7c ggml_backend_sched_graph_compute_async
8: 0x771f2b6678 llama_decode
9: 0x771f2ed848 _Z18gpttype_load_model17load_model_inputs10FileFormat19FileFormatExtraMeta
10: 0x771f27fd0c load_model
11: 0x7722d6d054
12: 0x7722d68c10
13: 0x772465ec64
14: 0x772465760c
15: 0x77a588050c _PyObject_MakeTpCall
16: 0x77a595fbb0 _PyEval_EvalFrameDefault
17: 0x77a595b064 PyEval_EvalCode
18: 0x77a59aedcc
19: 0x77a59ad114 _PyRun_SimpleFileObject
20: 0x77a59acaf0 _PyRun_AnyFileObject
21: 0x77a59cd780
22: 0x77a59ccec8 Py_RunMain
23: 0x77a59cd138
24: 0x77a59cd1e0 Py_BytesMain
25: 0x77a8f0e79c __libc_init
0: 0x771f1b3a58
0: 0x771f1b3a58
1: 0x771f1b39cc ggml_abort
1: 0x771f1b39cc ggml_abort
2: 0x771f4c00dc ggml_gemm_q4_0_8x8_q8_0
2: 0x771f4c00dc ggml_gemm_q4_0_8x8_q8_0
3: 0x771f1dc290
4: 0x771f1c46b8
3: 0x771f1dc290
5: 0x771f1cf990
4: 0x771f1c46b8
6: 0x77a8f83d60
5: 0x771f1cf990
7: 0x77a8f17bc4
6: 0x77a8f83d60
7: 0x77a8f17bc4
Aborted
~/koboldcpp $

OK, the 4_4 quants are working, though slowly. I think the 4_8 and 8_8 ones could be much faster if koboldcpp/Termux could use these CPU features.
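
For reference, whether the kernel actually exposes SVE and i8mm on this phone can be checked from the aarch64 hwcaps. Below is a minimal sketch of such a probe (a hypothetical helper, not part of koboldcpp), assuming an aarch64 Linux/Android toolchain in Termux; the fallback bit values follow the Linux arm64 hwcap ABI.

/* hwcap_probe.c: hypothetical standalone check, not part of koboldcpp.
   Build in Termux with something like: cc -O2 hwcap_probe.c -o hwcap_probe */
#include <stdio.h>
#include <sys/auxv.h>

/* Fallback definitions in case the toolchain headers do not provide them;
   the bit positions follow the Linux arm64 hwcap ABI. */
#ifndef HWCAP_SVE
#define HWCAP_SVE   (1UL << 22)
#endif
#ifndef HWCAP2_I8MM
#define HWCAP2_I8MM (1UL << 13)
#endif

int main(void) {
    unsigned long hwcap  = getauxval(AT_HWCAP);   /* base feature bits reported by the kernel */
    unsigned long hwcap2 = getauxval(AT_HWCAP2);  /* extended feature bits */
    printf("SVE  : %s\n", (hwcap  & HWCAP_SVE)   ? "yes" : "no");
    printf("I8MM : %s\n", (hwcap2 & HWCAP2_I8MM) ? "yes" : "no");
    return 0;
}

If both print yes, the hardware and kernel side are fine, and the zeros in the System Info line above most likely come from how the binary was built rather than from the CPU.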

Hiso89 commented Oct 6, 2024

Also, it has an NPU which, according to MediaTek, can natively support LLMs up to 10B. Maybe koboldcpp could use that too, if possible. I'm not a developer, so I don't know how much work that would take or whether it's even feasible; I'm just suggesting/requesting it. Or maybe even the 4_4 quants could use it? Yi 9B actually runs at an acceptable speed.

LostRuins (Owner) commented Oct 6, 2024

As a general note, the aarch64 special intrinsics for these quants have not really been tested, so I am not sure what the current compatibility status is. Could you try running a regular quant (e.g. Q4_0) and see if that works correctly? Then compare with Q4_0_4_4.

Hiso89 commented Oct 6, 2024

The regular quants like Q4_K_M run slowly, while Q4_0_4_4 runs faster, so it's fine for me.

Maybe I'm wrong, but I expected the Q4_0_8_8 quants to work and run even faster, because the Cortex cores (A510 and A715) have the required instruction set. That's why I gave this feedback.


gustrd commented Oct 7, 2024

Noticing:

System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

The code fails to identify that the CPU has SVE and MATMUL_INT8.

I tried looking into the source code; it seems the features are determined at compile time, and if the corresponding instructions were not enabled when the library was built, they are marked as disabled.

My Snapdragon 8 Gen 1, which also has i8mm, also reports MATMUL_INT8 = 0.
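
For illustration, the pattern described above boils down to compile-time preprocessor checks. The sketch below is a paraphrase rather than the exact ggml source, but it shows why the reported value depends on the build flags and not on the CPU the binary later runs on.

/* Paraphrased sketch of ggml's feature reporting (see ggml_cpu_has_sve() and
   ggml_cpu_has_matmul_int8() in the real source); the result is fixed at build time. */
int ggml_cpu_has_matmul_int8(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return 1;  /* only defined when the compiler targets an i8mm-capable arch, e.g. -march=...+i8mm */
#else
    return 0;  /* a generic aarch64 build lands here, hence MATMUL_INT8 = 0 in the log */
#endif
}

int ggml_cpu_has_sve(void) {
#if defined(__ARM_FEATURE_SVE)
    return 1;
#else
    return 0;
#endif
}

So even on a CPU that supports i8mm and SVE, a binary compiled for generic armv8-a reports 0 for both and then trips the GGML_ASSERT above when a Q4_0_4_8 or Q4_0_8_8 model is loaded. Rebuilding with a -march value that enables those extensions should flip the flags to 1; the exact build options for koboldcpp on Termux are beyond this sketch.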
