Segmentation Fault on GPU #7337

Open
djain-fujitsu opened this issue May 17, 2024 · 7 comments
Labels
training Fine-tuning and training stuff

Comments

@djain-fujitsu

When I try to run the following fine-tuning command on the GPU:
nohup ../build/bin/finetune --model-base llama-3b-Q5_0.gguf --train-data "shakespeare.txt" --save-every 1 --adam-iter 2 --batch 4 --ctx 4 --lora-out ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/lora.bin --checkpoint-in ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/checkpoint.gguf --checkpoint-out ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/checkpoint-ITERATION.gguf > ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/training_logs.out -ngl 33
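The same command, reformatted with line continuations for readability (note that the shell still passes -ngl 33 to finetune even though it appears after the output redirection):

    nohup ../build/bin/finetune \
      --model-base llama-3b-Q5_0.gguf \
      --train-data "shakespeare.txt" \
      --save-every 1 --adam-iter 2 --batch 4 --ctx 4 \
      --lora-out ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/lora.bin \
      --checkpoint-in ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/checkpoint.gguf \
      --checkpoint-out ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/checkpoint-ITERATION.gguf \
      -ngl 33 \
      > ../../training/checkpoints/llama_3b_q5_ctx_4_batch_4_threads_6/training_logs.out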

I get a segmentation fault, and the nohup.out file keeps growing:

llama_model_loader: loaded meta data with 24 key-value pairs and 237 tensors from llama-3b-Q5_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = models
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 3200
llama_model_loader: - kv 4: llama.block_count u32 = 26
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 8640
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 100
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 8
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,60820] = ["▁ t", "▁ a", "i n", "h e", "▁...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 53 tensors
llama_model_loader: - type q5_0: 183 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 3200
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 26
llm_load_print_meta: n_rot = 100
llm_load_print_meta: n_embd_head_k = 100
llm_load_print_meta: n_embd_head_v = 100
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3200
llm_load_print_meta: n_embd_v_gqa = 3200
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8640
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q5_0
llm_load_print_meta: model params = 3.43 B
llm_load_print_meta: model size = 2.23 GiB (5.59 BPW)
llm_load_print_meta: general.name = models
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA T4G, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.24 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors: CPU buffer size = 67.14 MiB
llm_load_tensors: CUDA0 buffer size = 2216.65 MiB
...............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 162.50 MiB
llama_new_context_with_model: KV self size = 162.50 MiB, K (f16): 81.25 MiB, V (f16): 81.25 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 68.75 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 7.26 MiB
llama_new_context_with_model: graph nodes = 838
llama_new_context_with_model: graph splits = 2
main: seed: 1715928042
main: model base = 'llama-3b-Q5_0.gguf'
main: init model
print_params: n_vocab : 32000
print_params: n_ctx : 4
print_params: n_embd : 3200
print_params: n_ff : 8640
print_params: n_head : 32
print_params: n_head_kv : 32
print_params: n_layer : 26
print_params: norm_rms_eps : 0.000001
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq : 4
print_lora_params: n_rank_wk : 4
print_lora_params: n_rank_wv : 4
print_lora_params: n_rank_wo : 4
print_lora_params: n_rank_ffn_norm : 1
print_lora_params: n_rank_ffn_gate : 4
print_lora_params: n_rank_ffn_down : 4
print_lora_params: n_rank_ffn_up : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm : 1
print_lora_params: n_rank_output : 4
main: total train_iterations 0
main: seen train_samples 0
main: seen train_tokens 0
main: completed train_epochs 0
main: lora_size = 54844064 bytes (52.3 MB)
main: opt_size = 81694048 bytes (77.9 MB)
main: opt iter 0
main: input_size = 2048096 bytes (2.0 MB)
main: compute_size = 846062208 bytes (806.9 MB)
main: evaluation order = RIGHT_TO_LEFT
main: tokenize training data from shakespeare.txt
main: sample-start:
main: include-sample-start: false
tokenize_file: total number of samples: 26826
main: number of training tokens: 26830
main: number of unique tokens: 3320
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 768376 bytes (0.7 MB)
train_opt_callback: iter= 0 sample=1/26826 sched=0.000000 loss=0.000000 |-------------------------------- [the '-' progress bar repeats without end from this point]

It gets stuck printing the '-' character, makes no progress, and finally crashes with a segmentation fault.
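A backtrace would make the crash location much easier to pin down. A minimal sketch, assuming gdb is available and a binary built with debug symbols (e.g. adding -DCMAKE_BUILD_TYPE=RelWithDebInfo to the cmake step); the flags are abbreviated here, the full command from the description applies:

    gdb --args ../build/bin/finetune --model-base llama-3b-Q5_0.gguf \
        --train-data "shakespeare.txt" --adam-iter 2 --batch 4 --ctx 4 -ngl 33
    (gdb) run
    # once the segmentation fault is hit:
    (gdb) bt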

@djain-fujitsu changed the title from "Segmentation_Fault on GPU" to "Segmentation Fault on GPU" on May 17, 2024
@arnfaldur

The bug template says:

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

Why did you ignore it?

@slaren added the training (Fine-tuning and training stuff) label on May 19, 2024
@djain-fujitsu
Author

djain-fujitsu commented May 20, 2024

The bug template says:

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

Why did you ignore it?

Thanks @arnfaldur for the reply. Here are the details about the system and the steps to reproduce the bug:

System Info:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian

CPU(s): 8
On-line CPU(s) list: 0-7

Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
L1d: 512 KiB (8 instances)
L1i: 512 KiB (8 instances)
L2: 8 MiB (8 instances)
L3: 32 MiB (1 instance)

NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7

RAM = 16GB

GPU(s):
Model : NVIDIA T4G
Driver Version: 545.23.08
CUDA Version: 12.3
Memory : 16GB

llama.cpp version : Not sure, as I followed all the steps in the GitHub README.md (I would appreciate it if someone could guide me on how to obtain it)
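If the build directory still sits inside the cloned repository, the commit you built from can be recovered with a standard git command:

    cd llama.cpp
    git rev-parse --short HEAD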

cmake version : 3.22.1

Steps to reproduce the bug:

  1. git clone https://github.com/ggerganov/llama.cpp.git

  2. cd ..../llama.cpp/

  3. Building binaries:

    mkdir build
    cd build
    cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
    cmake --build . --config Release
    cd ..

  4. cd ./models

  5. Download shakespeare text file : wget https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/shakespeare.txt

  6. Download GGUF file from the following link : wget -r https://huggingface.co/gultar/OpenHermes-Llama-3b-GGUF/tree/main

  7. nohup ../build/bin/finetune --model-base llama-3b-Q5_0.gguf --train-data "shakespeare.txt" --save-every 1 --adam-iter 2 --batch 4 --ctx 4
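A note on step 6: wget -r against the Hugging Face tree URL fetches the HTML listing pages rather than the model weights. Downloading the file directly through the resolve endpoint is more reliable (the filename here is an assumption, matching the one used in the finetune command):

    wget https://huggingface.co/gultar/OpenHermes-Llama-3b-GGUF/resolve/main/llama-3b-Q5_0.gguf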

@arnfaldur

Why did you ignore it?

I got a little snarky; I'm sorry about that.

If you run the main executable or the server, it prints the build number like so:

$ ./main
Log start
main: build = 2936 (5ca49cbe)

or

$ ./server
{"tid":"124519658393600","timestamp":1716210643,"level":"INFO","function":"main","line":2943,"msg":"build info","build":2936,"commit":"5ca49cbe"}

I'm afraid I don't know much about the training logic so I can't help you there.

@djain-fujitsu
Author

djain-fujitsu commented May 21, 2024

Why did you ignore it?

I got a little snarky; I'm sorry about that.

If you run the main executable or the server, it prints the build number like so:

$ ./main
Log start
main: build = 2936 (5ca49cbe)

or

$ ./server
{"tid":"124519658393600","timestamp":1716210643,"level":"INFO","function":"main","line":2943,"msg":"build info","build":2936,"commit":"5ca49cbe"}

I'm afraid I don't know much about the training logic so I can't help you there.

It's ok buddy. So this is the build number I got:

$ ./main

Log start
main: build = 2782 (60325fa)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for aarch64-linux-gnu

@arnfaldur

That's a fairly recent build, but it can't hurt to update to the latest and retry; this repo is moving very fast. It's worth a shot, though it might not help.

@djain-fujitsu
Author

That's a fairly recent build, but it can't hurt to update to the latest and retry; this repo is moving very fast. It's worth a shot, though it might not help.

Ok, so do you mean I should give the build below a try:

$ ./main
Log start
main: build = 2936 (5ca49cbe)

@arnfaldur

Yes. I mean that it's worth trying if it's not a lot of work. It's not very likely to solve the issue, but there's a chance.
