Segmentation Fault on GPU #7337
Comments
The bug template says.
Why did you ignore it? |
Thanks @arnfaldur for the revert. Here are the details about the system and the steps to reproduce the bug:
System Info:
CPU(s): 8
Vendor ID: ARM
NUMA:
RAM = 16GB
GPU(s):
llama.cpp version: Not sure, as I followed all the steps in the GitHub README.md (I would appreciate it if someone could guide me on how to obtain it)
cmake version: 3.22.1
Steps to be followed for reproducing the bug:
|
I got a little snarky, I'm sorry about that.
If you run the main executable or the server, it prints the build number like so:
$ ./main
Log start
main: build = 2936 (5ca49cbe)
or
$ ./server
{"tid":"124519658393600","timestamp":1716210643,"level":"INFO","function":"main","line":2943,"msg":"build info","build":2936,"commit":"5ca49cbe"}
I'm afraid I don't know much about the training logic, so I can't help you there. |
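For anyone who wants to pull the build number out of a saved log rather than read it by eye, here is a minimal Python sketch. It only assumes the two line formats quoted above; the helper name `parse_build_info` is made up for illustration and is not part of llama.cpp.

```python
import json
import re

# Matches the plain-text form printed by ./main, e.g. "main: build = 2936 (5ca49cbe)".
BUILD_RE = re.compile(r"build = (\d+) \(([0-9a-f]+)\)")


def parse_build_info(line):
    """Return (build, commit) from a startup log line, or None if absent.

    Hypothetical helper: handles the ./main text line and the ./server
    JSON line shown in this thread, nothing else.
    """
    line = line.strip()
    if line.startswith("{"):
        # ./server logs one JSON object per line.
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return None
        if isinstance(record, dict) and "build" in record and "commit" in record:
            return int(record["build"]), str(record["commit"])
        return None
    match = BUILD_RE.search(line)
    if match:
        return int(match.group(1)), match.group(2)
    return None


main_line = "main: build = 2936 (5ca49cbe)"
server_line = (
    '{"tid":"124519658393600","timestamp":1716210643,"level":"INFO",'
    '"function":"main","line":2943,"msg":"build info","build":2936,'
    '"commit":"5ca49cbe"}'
)

print(parse_build_info(main_line))    # (2936, '5ca49cbe')
print(parse_build_info(server_line))  # (2936, '5ca49cbe')
```

Both forms yield the same pair, so either executable can be used to report the version in a bug report.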
It's ok buddy. So I got this as my build number:
$ ./main
Log start |
That's a fairly new build. It can't hurt to update to the latest and retry; this repo is moving very fast. It's worth a shot, though it might not help. |
OK, so do you mean I should try the build below:
|
Yes. I mean that it's worth trying if it's not a lot of work. It's not very likely to solve the issue, but there's a chance. |
When I try to run the following finetuning command on the GPU:
nohup ../build/bin/finetune --model-base llama-3b-Q5_0.gguf --train-data "shakespeare.txt" --save-every 1 --adam-iter 2 --batch 4 --ctx 4 --lora-out ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/lora.bin --checkpoint-in ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/checkpoint.gguf --checkpoint-out ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/checkpoint-ITERATION.gguf > ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/training_logs.out -ngl 33
I get a segmentation fault, along with an ever-growing nohup.out file:
llama_model_loader: loaded meta data with 24 key-value pairs and 237 tensors from llama-3b-Q5_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = models
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 3200
llama_model_loader: - kv 4: llama.block_count u32 = 26
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 8640
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 100
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 8
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,60820] = ["▁ t", "▁ a", "i n", "h e", "▁...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 53 tensors
llama_model_loader: - type q5_0: 183 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 3200
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 26
llm_load_print_meta: n_rot = 100
llm_load_print_meta: n_embd_head_k = 100
llm_load_print_meta: n_embd_head_v = 100
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3200
llm_load_print_meta: n_embd_v_gqa = 3200
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8640
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q5_0
llm_load_print_meta: model params = 3.43 B
llm_load_print_meta: model size = 2.23 GiB (5.59 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA T4G, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.24 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors: CPU buffer size = 67.14 MiB
llm_load_tensors: CUDA0 buffer size = 2216.65 MiB
...............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 162.50 MiB
llama_new_context_with_model: KV self size = 162.50 MiB, K (f16): 81.25 MiB, V (f16): 81.25 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 68.75 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 7.26 MiB
llama_new_context_with_model: graph nodes = 838
llama_new_context_with_model: graph splits = 2
main: seed: 1715928042
main: model base = 'llama-3b-Q5_0.gguf'
main: init model
print_params: n_vocab : 32000
print_params: n_ctx : 4
print_params: n_embd : 3200
print_params: n_ff : 8640
print_params: n_head : 32
print_params: n_head_kv : 32
print_params: n_layer : 26
print_params: norm_rms_eps : 0.000001
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq : 4
print_lora_params: n_rank_wk : 4
print_lora_params: n_rank_wv : 4
print_lora_params: n_rank_wo : 4
print_lora_params: n_rank_ffn_norm : 1
print_lora_params: n_rank_ffn_gate : 4
print_lora_params: n_rank_ffn_down : 4
print_lora_params: n_rank_ffn_up : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm : 1
print_lora_params: n_rank_output : 4
main: total train_iterations 0
main: seen train_samples 0
main: seen train_tokens 0
main: completed train_epochs 0
main: lora_size = 54844064 bytes (52.3 MB)
main: opt_size = 81694048 bytes (77.9 MB)
main: opt iter 0
main: input_size = 2048096 bytes (2.0 MB)
main: compute_size = 846062208 bytes (806.9 MB)
main: evaluation order = RIGHT_TO_LEFT
main: tokenize training data from shakespeare.txt
main: sample-start:
main: include-sample-start: false
tokenize_file: total number of samples: 26826
main: number of training tokens: 26830
main: number of unique tokens: 3320
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 768376 bytes (0.7 MB)
train_opt_callback: iter= 0 sample=1/26826 sched=0.000000 loss=0.000000 |----------------------------------------------------------------...
[several more full lines of '-' characters omitted]
It gets stuck on the '-' character, keeps printing it without making any progress, and finally ends in a segmentation fault.
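As a sanity check on the memory figures in the log above, the following Python sketch reproduces the reported KV cache size and converts the finetune buffer sizes. The formula (layers × context × embedding width × 2 bytes for f16) is an assumption that happens to match the logged numbers, not a statement of how llama.cpp computes it internally. Note that it uses the `n_ctx = 512` reported by `llama_new_context_with_model`, not the `--ctx 4` passed to finetune.

```python
MIB = 1024 ** 2


def kv_cache_bytes(n_layer, n_ctx, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Estimate K and V cache sizes in bytes (assumed formula, f16 = 2 bytes)."""
    k_bytes = n_layer * n_ctx * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_layer * n_ctx * n_embd_v_gqa * bytes_per_elem
    return k_bytes, v_bytes


# Values copied from the llm_load_print_meta / llama_new_context_with_model lines.
k_bytes, v_bytes = kv_cache_bytes(n_layer=26, n_ctx=512,
                                  n_embd_k_gqa=3200, n_embd_v_gqa=3200)
print(f"K: {k_bytes / MIB:.2f} MiB, V: {v_bytes / MIB:.2f} MiB, "
      f"total: {(k_bytes + v_bytes) / MIB:.2f} MiB")
# K: 81.25 MiB, V: 81.25 MiB, total: 162.50 MiB

# Byte counts copied from the "main:" lines printed by finetune.
reported = {
    "lora_size": 54844064,
    "opt_size": 81694048,
    "input_size": 2048096,
    "compute_size": 846062208,
    "work_size": 768376,
}
for name, size in reported.items():
    print(f"{name}: {size / MIB:.1f} MiB")

total_mib = sum(reported.values()) / MIB
print(f"finetune buffers combined: {total_mib:.1f} MiB")
```

The per-buffer values match the figures the log prints in parentheses, which suggests the "MB" in those lines means MiB (1024 × 1024 bytes).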