
KoboldCPP crashes after Arch system update when loading GGUF model: ggml_cuda_host_malloc ... invalid argument #1158

Open
YajuShinki opened this issue Oct 12, 2024 · 6 comments


@YajuShinki

Describe the Issue
After updating my computer, KoboldCPP either crashes or refuses to generate any text. Most of the time, when loading a model, the terminal shows the error ggml_cuda_host_malloc: failed to allocate 6558.12 MiB of pinned memory: invalid argument before it tries to load the model into memory.
Occasionally it boots up successfully, but prompt processing is much slower than before the system update, and it aborts before actually generating anything. Eventually it simply crashes, with Killed printed to the console before exiting.
I've tried updating to the latest version of KoboldCPP, and both the cuda1210 and cuda1150 builds produce the same result.
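
For reference, the warning appears to come from the CUDA runtime's pinned (page-locked) host allocator, and loading continues after it, so the message on its own is not necessarily fatal. A minimal standalone sketch (not KoboldCPP's actual code; the size is just copied from the log below) of how the CUDA runtime produces that "invalid argument" text:

```cpp
// Sketch only: shows how cudaGetErrorString() yields "invalid argument" when a
// pinned host allocation fails. The size is taken from the log; nothing else
// here is KoboldCPP code.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t size = (size_t)(6558.12 * 1024.0 * 1024.0);   // ~6558.12 MiB, as in the log
    void * ptr = nullptr;
    cudaError_t err = cudaMallocHost(&ptr, size);         // pinned (page-locked) allocation
    if (err != cudaSuccess) {
        // ggml prints a similar warning and falls back to ordinary, unpinned
        // memory, which is why model loading proceeds after the message.
        fprintf(stderr, "failed to allocate %.2f MiB of pinned memory: %s\n",
                size / 1024.0 / 1024.0, cudaGetErrorString(err));
        return 1;
    }
    cudaFreeHost(ptr);
    return 0;
}
```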

Additional Information:
OS: Arch Linux, kernel version 6.11.3-arch1-1 (previous working version: 6.10)
CPU: AMD Ryzen 5 5600 (12) @ 4.468GHz
GPU: NVIDIA GeForce RTX 3060
Model used: Beyonder 4x7b-v2 q5_k_m
GPU layers: 19
CPU threads: 6
Context size: 8192 with ContextShift on
Crashes whether FlashAttention is off or on

Log:

***
Welcome to KoboldCpp - Version 1.76
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend...

Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(benchmark=None, blasbatchsize=512, blasthreads=6, chatcompletionsadapter=None, config=None, contextsize=8192, debugmode=1, flashattention=False, forceversion=0, foreground=False, gpulayers=19, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model='', model_param='/home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf', multiuser=1, noavx2=False, noblas=False, nocertify=False, nommap=False, nomodel=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdclamped=0, sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdquant=False, sdthreads=5, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=6, unpack='', useclblast=None, usecpu=False, usecublas=['normal', '0', 'mmq'], usemlock=False, usevulkan=None, whispermodel='')
==========
Loading model: /home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf

The reported GGUF Arch is: llama
Arch Category: 4

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Trained max context length (value:2048).
Desired context length (value:8192).
Solar context multiplier (value:1.000).
Chi context train (value:325.950).
Chi chosen context (value:1303.798).
Log Chi context train (value:2.513).
Log Chi chosen context (value:3.115).
RoPE Frequency Base value (value:10000.000).
RoPE base calculated via Gradient AI formula. (value:90835.4).
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 26 key-value pairs and 611 tensors from /home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 4
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 24.15 B
llm_load_print_meta: model size       = 15.49 GiB (5.51 BPW) 
llm_load_print_meta: general.name     = mlabonne_beyonder-4x7b-v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 1 '<s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.58 MiB
ggml_cuda_host_malloc: failed to allocate 6558.12 MiB of pinned memory: invalid argument
llm_load_tensors: offloading 19 repeating layers to GPU
llm_load_tensors: offloaded 19/33 layers to GPU
llm_load_tensors:        CPU buffer size =  6558.12 MiB
llm_load_tensors:      CUDA0 buffer size =  9304.88 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 8288
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   420.88 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   615.12 MiB
llama_new_context_with_model: KV self size  = 1036.00 MiB, K (f16):  518.00 MiB, V (f16):  518.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   593.56 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.19 MiB
llama_new_context_with_model: graph nodes  = 1510
llama_new_context_with_model: graph splits = 160
Killed
@YajuShinki YajuShinki changed the title KoboldCPP crashes after Arch system update when loading model: ggml_cuda_host_malloc ... invalid argument KoboldCPP crashes after Arch system update when loading GGUF model: ggml_cuda_host_malloc ... invalid argument Oct 12, 2024
@LostRuins
Owner

Did you select the number of layers yourself, or was it automatically picked?

@YajuShinki
Author

I chose the number of layers through trial and error. 19 layers was the maximum I could fit on the GPU with 8k context without it running out of VRAM.

@LostRuins
Owner

Try fewer layers.

@YajuShinki
Author

I have tried running it again with 10 layers, and the result is still the same. The only difference is that it now says failed to allocate 10965.24 MiB of pinned memory rather than 6558.12 MiB (which, I just realized, is exactly the size of the CPU buffer), so something seems to be going very wrong when allocating CPU RAM.
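
Side note: the pinned request being exactly the size of the CPU-side tensor buffer matches how upstream llama.cpp pins its host buffers when a CUDA device is present. Assuming KoboldCPP's bundled ggml follows upstream here (an assumption, I haven't checked the bundled source), the allocator looks roughly like the sketch below: failure is non-fatal and just falls back to unpinned memory, and pinning can be skipped entirely by setting the GGML_CUDA_NO_PINNED environment variable before launching, which might at least silence the warning while tracking down the real cause of the Killed.

```cpp
// Rough paraphrase of upstream llama.cpp's ggml_cuda_host_malloc (assumption:
// KoboldCPP's bundled copy behaves the same way). The caller falls back to a
// normal, unpinned CPU buffer whenever this returns nullptr.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

void * host_malloc_sketch(size_t size) {
    if (getenv("GGML_CUDA_NO_PINNED") != nullptr) {
        return nullptr;                 // user disabled pinning, caller uses plain malloc
    }
    void * ptr = nullptr;
    cudaError_t err = cudaMallocHost(&ptr, size);
    if (err != cudaSuccess) {
        cudaGetLastError();             // clear the error so later CUDA calls aren't affected
        fprintf(stderr, "failed to allocate %.2f MiB of pinned memory: %s\n",
                size / 1024.0 / 1024.0, cudaGetErrorString(err));
        return nullptr;                 // non-fatal: caller falls back to unpinned memory
    }
    return ptr;
}
```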

@justme1135

Similar error on EndeavourOS with 6.11.4-arch2-1 kernel (existed in previous version as well).

ggml_cuda_host_malloc: failed to allocate 21588.00 MiB of pinned memory: invalid argument

@LostRuins
Owner

Try the default settings and don't change anything else: just launch KoboldCPP, select your model, select CUDA, and disable MMAP. Does it load and work correctly?
