No i8mm/SVE in Dimensity 8300 means slower generation. #1146
Comments
Also, it has an NPU that, according to MediaTek, can natively support LLMs up to 10B parameters; maybe koboldcpp could use that too, if possible. I'm not a developer, so I don't know how much work that would take or whether it's even possible, I'm just suggesting/requesting. Or maybe even Q4_0_4_4 could use it? Yi 9B actually runs at an acceptable speed.
As a general note, the aarch64 special intrinsics for these quants have not really been tested, so I am not sure what the current compatibility status is. Could you try running a regular quant (e.g. Q4_0) and see if that works correctly? Then compare with Q4_0_4_4.
The regular quants like Q4_K_M run slowly, while Q4_0_4_4 runs faster, so it's fine for me. Maybe I'm wrong, but I expected the Q4_0_8_8 quants to work and run even faster, because the Cortex cores (A510 and A715) have the required instruction set; that's why I gave this feedback.
I'm noticing the same thing: the code fails to identify that the CPU has SVE and MATMUL_INT8. Looking into the source code, it seems to try compiling some instructions and, if that fails, marks these features as disabled. My Snapdragon 8 Gen 1, which also has i8mm, likewise reports MATMUL_INT8 = 0.
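For context, the SVE and MATMUL_INT8 values printed in the System Info line are decided when the library is compiled, not by probing the CPU at runtime. The getter functions named in the assert further below (ggml_cpu_has_sve, ggml_cpu_has_matmul_int8) are essentially preprocessor checks in ggml; a simplified sketch (illustrative, not the exact koboldcpp source) looks like this:

/* Simplified sketch of ggml-style CPU feature reporting (illustrative only).
 * The result depends on the __ARM_FEATURE_* macros the compiler defines,
 * which in turn depend on the -march/-mcpu flags used to build
 * koboldcpp_default.so, not on the phone's actual silicon. */
int ggml_cpu_has_sve(void) {
#if defined(__ARM_FEATURE_SVE)
    return 1;   /* only if the binary was built with SVE enabled */
#else
    return 0;   /* reported even on CPUs that do have SVE in hardware */
#endif
}

int ggml_cpu_has_matmul_int8(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return 1;   /* only if the binary was built with i8mm enabled */
#else
    return 0;
#endif
}

So a Snapdragon 8 Gen 1 or a Dimensity 8300 can report MATMUL_INT8 = 0 simply because the prebuilt library was compiled for a baseline ARMv8 target.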
Only Q4_0_4_4 GGUFs run on my Poco X6 Pro phone. CPU-Z says it has Cortex-A510 and A715 cores, which both support i8mm and SVE. When I try to run a GGUF that needs those features, this happens:
~/koboldcpp $ python koboldcpp.py --model Phi-3.5-mini-instruct_Uncensored-Q4_0_4_8.gguf
Welcome to KoboldCpp - Version 1.75.2
No GPU or CPU backend was selected. Trying to assign one for you automatically...
No GPU Backend found...
No GPU backend found, or could not automatically determine GPU layers. Please set it manually.
Attempting to use CPU library.
Initializing dynamic library: koboldcpp_default.so
Namespace(model='Phi-3.5-mini-instruct_Uncensored-Q4_0_4_8.gguf', model_param='Phi-3.5-mini-instruct_Uncensored-Q4_0_4_8.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=3, usecublas=None, usevulkan=None, useclblast=None, usecpu=False, contextsize=4096, gpulayers=0, tensor_split=None, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=3, lora=None, noshift=False, nommap=False, usemlock=False, noavx2=False, debugmode=0, onready='', benchmark=None, prompt='', promptlimit=100, multiuser=1, remotetunnel=False, highpriority=False, foreground=False, preloadstory='', quiet=False, ssl=None, nocertify=False, mmproj='', password=None, ignoremissing=False, chatcompletionsadapter='', flashattention=False, quantkv=0, forceversion=0, smartcontext=False, unpack='', nomodel=False, showgui=False, skiplauncher=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=0, sdclamped=0, sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None, noblas=False)
Loading model: /data/data/com.termux/files/home/koboldcpp/Phi-3.5-mini-instruct_Uncensored-Q4_0_4_8.gguf
The reported GGUF Arch is: phi3
Arch Category: 0
Identified as GGUF model: (ver 6)
Attempting to Load...
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 32 key-value pairs and 197 tensors from /data/data/com.termux/files/home/koboldcpp/Phi-3.5-mini-instruct_Uncensored-Q4_0_4_8.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.1685 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32064
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_swa = 262144
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 3072
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 3.82 B
llm_load_print_meta: model size = 2.03 GiB (4.55 BPW)
llm_load_print_meta: general.name = Phi 3.5 Mini instruct_Uncensored
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '<|endoftext|>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: CPU buffer size = 2074.66 MiB
.....................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx = 4192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1572.00 MiB
llama_new_context_with_model: KV self size = 1572.00 MiB, K (f16): 786.00 MiB, V (f16): 786.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 306.19 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
ggml/src/ggml-aarch64.c:1926: GGML_ASSERT((ggml_cpu_has_sve() || ggml_cpu_has_matmul_int8()) && "__ARM_FEATURE_SVE and __ARM_FEATURE_MATMUL_INT8 not defined, use the Q4_0_4_4 quantization format for optimal " "performance") failed
ggml/src/ggml-aarch64.c:1926: GGML_ASSERT((ggml_cpu_has_sve() || ggml_cpu_has_matmul_int8()) && "__ARM_FEATURE_SVE and __ARM_FEATURE_MATMUL_INT8 not defined, use the Q4_0_4_4 quantization format for optimal " "performance") failed
ggml/src/ggml-aarch64.c:1926: GGML_ASSERT((ggml_cpu_has_sve() || ggml_cpu_has_matmul_int8()) && "__ARM_FEATURE_SVE and __ARM_FEATURE_MATMUL_INT8 not defined, use the Q4_0_4_4 quantization format for optimal " "performance") failed
0: 0x771f1b3a58
1: 0x771f1b39cc ggml_abort
2: 0x771f4c00dc ggml_gemm_q4_0_8x8_q8_0
3: 0x771f1dc290
4: 0x771f1c46b8
5: 0x771f1c4408 ggml_graph_compute
6: 0x771f4bc0a0
7: 0x771f4bad7c ggml_backend_sched_graph_compute_async
8: 0x771f2b6678 llama_decode
9: 0x771f2ed848 _Z18gpttype_load_model17load_model_inputs10FileFormat19FileFormatExtraMeta
10: 0x771f27fd0c load_model
11: 0x7722d6d054
12: 0x7722d68c10
13: 0x772465ec64
14: 0x772465760c
15: 0x77a588050c _PyObject_MakeTpCall
16: 0x77a595fbb0 _PyEval_EvalFrameDefault
17: 0x77a595b064 PyEval_EvalCode
18: 0x77a59aedcc
19: 0x77a59ad114 _PyRun_SimpleFileObject
20: 0x77a59acaf0 _PyRun_AnyFileObject
21: 0x77a59cd780
22: 0x77a59ccec8 Py_RunMain
23: 0x77a59cd138
24: 0x77a59cd1e0 Py_BytesMain
25: 0x77a8f0e79c __libc_init
0: 0x771f1b3a58
0: 0x771f1b3a58
1: 0x771f1b39cc ggml_abort
1: 0x771f1b39cc ggml_abort
2: 0x771f4c00dc ggml_gemm_q4_0_8x8_q8_0
2: 0x771f4c00dc ggml_gemm_q4_0_8x8_q8_0
3: 0x771f1dc290
4: 0x771f1c46b8
3: 0x771f1dc290
5: 0x771f1cf990
4: 0x771f1c46b8
6: 0x77a8f83d60
5: 0x771f1cf990
7: 0x77a8f17bc4
6: 0x77a8f83d60
7: 0x77a8f17bc4
Aborted
~/koboldcpp $
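The abort comes from ggml_gemm_q4_0_8x8_q8_0, which hard-asserts that at least one of SVE or MATMUL_INT8 was compiled in; since the prebuilt koboldcpp_default.so reports both as 0, any Q4_0_4_8 / Q4_0_8_8 model trips the GGML_ASSERT above and the process aborts. To confirm that the hardware itself exposes these features regardless of how the library was built, a small standalone check on Linux/Android can read the kernel's hwcap bits (a hypothetical helper, not part of koboldcpp):

/* check_hwcaps.c - hypothetical standalone probe (not part of koboldcpp).
 * Queries the kernel's hwcap bits to see what the CPU actually supports,
 * independent of the compiler flags the library was built with.
 * aarch64 Linux/Android only; build with: cc check_hwcaps.c -o check_hwcaps */
#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>

int main(void) {
    unsigned long hwcap  = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);

#ifdef HWCAP_SVE
    printf("SVE  supported by CPU/kernel: %s\n", (hwcap & HWCAP_SVE) ? "yes" : "no");
#endif
#ifdef HWCAP2_I8MM
    printf("i8mm supported by CPU/kernel: %s\n", (hwcap2 & HWCAP2_I8MM) ? "yes" : "no");
#endif
    return 0;
}

If this reports yes while koboldcpp prints SVE = 0 and MATMUL_INT8 = 0, the limitation is in the build (the architecture flags used for the shared library), not in the SoC.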
OK, the 4_4 quants are working, if slowly. I think 4_8 and 8_8 could be much faster if koboldcpp/Termux could use these CPU features.