
KoboldCPP crashes after Arch system update when loading GGUF model: ggml_cuda_host_malloc ... invalid argument #1158

Open
YajuShinki opened this issue Oct 12, 2024 · 6 comments


@YajuShinki

Describe the Issue
After updating my computer, KoboldCPP either crashes or refuses to generate any text. Most of the time, when loading a model, the terminal shows the error ggml_cuda_host_malloc: failed to allocate 6558.12 MiB of pinned memory: invalid argument before it tries to load the model into memory.
Occasionally it boots up successfully, but prompt processing is much slower than before the system update, and it aborts before actually generating anything. Eventually it simply crashes, with Killed printed to the console before exiting.
I've tried updating to the latest version of KoboldCPP, and both the cuda1210 and cuda1150 builds produce the same result.
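
For reference, the warning appears to come from the CUDA runtime's pinned (page-locked) host allocator, and loading continues after it, so the message on its own is not necessarily fatal. A minimal standalone sketch (not KoboldCPP's actual code; the size is just copied from the log below) of how the CUDA runtime produces that "invalid argument" text:

```cpp
// Sketch only: shows how cudaGetErrorString() yields "invalid argument" when a
// pinned host allocation fails. The size is taken from the log; nothing else
// here is KoboldCPP code.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t size = (size_t)(6558.12 * 1024.0 * 1024.0);   // ~6558.12 MiB, as in the log
    void * ptr = nullptr;
    cudaError_t err = cudaMallocHost(&ptr, size);         // pinned (page-locked) allocation
    if (err != cudaSuccess) {
        // ggml prints a similar warning and falls back to ordinary, unpinned
        // memory, which is why model loading proceeds after the message.
        fprintf(stderr, "failed to allocate %.2f MiB of pinned memory: %s\n",
                size / 1024.0 / 1024.0, cudaGetErrorString(err));
        return 1;
    }
    cudaFreeHost(ptr);
    return 0;
}
```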

Additional Information:
OS: Arch Linux, kernel version 6.11.3-arch1-1 (previous working version: 6.10)
CPU: AMD Ryzen 5 5600 (12) @ 4.468GHz
GPU: NVIDIA GeForce RTX 3060
Model used: Beyonder 4x7b-v2 q5_k_m
GPU layers: 19
CPU threads: 6
Context size: 8192 with ContextShift on
Crashes whether FlashAttention is off or on

Log:

***
Welcome to KoboldCpp - Version 1.76
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend...

Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(benchmark=None, blasbatchsize=512, blasthreads=6, chatcompletionsadapter=None, config=None, contextsize=8192, debugmode=1, flashattention=False, forceversion=0, foreground=False, gpulayers=19, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model='', model_param='/home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf', multiuser=1, noavx2=False, noblas=False, nocertify=False, nommap=False, nomodel=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdclamped=0, sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdquant=False, sdthreads=5, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=6, unpack='', useclblast=None, usecpu=False, usecublas=['normal', '0', 'mmq'], usemlock=False, usevulkan=None, whispermodel='')
==========
Loading model: /home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf

The reported GGUF Arch is: llama
Arch Category: 4

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Trained max context length (value:2048).
Desired context length (value:8192).
Solar context multiplier (value:1.000).
Chi context train (value:325.950).
Chi chosen context (value:1303.798).
Log Chi context train (value:2.513).
Log Chi chosen context (value:3.115).
RoPE Frequency Base value (value:10000.000).
RoPE base calculated via Gradient AI formula. (value:90835.4).
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 26 key-value pairs and 611 tensors from /home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 4
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 24.15 B
llm_load_print_meta: model size       = 15.49 GiB (5.51 BPW) 
llm_load_print_meta: general.name     = mlabonne_beyonder-4x7b-v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 1 '<s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.58 MiB
ggml_cuda_host_malloc: failed to allocate 6558.12 MiB of pinned memory: invalid argument
llm_load_tensors: offloading 19 repeating layers to GPU
llm_load_tensors: offloaded 19/33 layers to GPU
llm_load_tensors:        CPU buffer size =  6558.12 MiB
llm_load_tensors:      CUDA0 buffer size =  9304.88 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 8288
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   420.88 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   615.12 MiB
llama_new_context_with_model: KV self size  = 1036.00 MiB, K (f16):  518.00 MiB, V (f16):  518.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   593.56 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.19 MiB
llama_new_context_with_model: graph nodes  = 1510
llama_new_context_with_model: graph splits = 160
Killed
@YajuShinki YajuShinki changed the title KoboldCPP crashes after Arch system update when loading model: ggml_cuda_host_malloc ... invalid argument KoboldCPP crashes after Arch system update when loading GGUF model: ggml_cuda_host_malloc ... invalid argument Oct 12, 2024
@LostRuins
Owner

Did you select the number of layers yourself, or was it automatically picked?

@YajuShinki
Author

I chose the number of layers through trial and error. 19 layers was the maximum I could fit on the GPU with 8k context without it running out of VRAM.

@LostRuins
Owner

Try fewer layers.

@YajuShinki
Author

I have tried running it again with 10 layers, and the result is still the same. The only difference is that it now says failed to allocate 10965.24 MiB of pinned memory rather than 6558.12 MiB (which, I just realized, is exactly the size of the CPU buffer), so something seems to be going very wrong when allocating CPU RAM.
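
Side note: the pinned request being exactly the size of the CPU-side tensor buffer matches how upstream llama.cpp pins its host buffers when a CUDA device is present. Assuming KoboldCPP's bundled ggml follows upstream here (an assumption, I haven't checked the bundled source), the allocator looks roughly like the sketch below: failure is non-fatal and just falls back to unpinned memory, and pinning can be skipped entirely by setting the GGML_CUDA_NO_PINNED environment variable before launching, which might at least silence the warning while tracking down the real cause of the Killed.

```cpp
// Rough paraphrase of upstream llama.cpp's ggml_cuda_host_malloc (assumption:
// KoboldCPP's bundled copy behaves the same way). The caller falls back to a
// normal, unpinned CPU buffer whenever this returns nullptr.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

void * host_malloc_sketch(size_t size) {
    if (getenv("GGML_CUDA_NO_PINNED") != nullptr) {
        return nullptr;                 // user disabled pinning, caller uses plain malloc
    }
    void * ptr = nullptr;
    cudaError_t err = cudaMallocHost(&ptr, size);
    if (err != cudaSuccess) {
        cudaGetLastError();             // clear the error so later CUDA calls aren't affected
        fprintf(stderr, "failed to allocate %.2f MiB of pinned memory: %s\n",
                size / 1024.0 / 1024.0, cudaGetErrorString(err));
        return nullptr;                 // non-fatal: caller falls back to unpinned memory
    }
    return ptr;
}
```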

@justme1135

Similar error on EndeavourOS with 6.11.4-arch2-1 kernel (existed in previous version as well).

ggml_cuda_host_malloc: failed to allocate 21588.00 MiB of pinned memory: invalid argument

@LostRuins
Owner

Try the default settings and don't change anything else: just launch KoboldCPP, select your model, select CUDA, and disable MMAP. Does it load and work correctly?
