getting error when building from source #1142
Can you share your quantization step?
Yes, I was running the basic example with Llama 2 Chat 7B, and the quantization was done as part of the model loading:
Hi @RachelShalom, I have tried your Python script and got the int8 bin. But I could run the
thanks.
Ohh, I understand. You are right, I did both: I pip-installed and used the script above that created the bin file.
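For context on what that quantization step produces: the `g32` in the bin filename below suggests group-wise int8 weight quantization with group size 32. Here is a minimal, dependency-free sketch of symmetric per-group int8 quantization, just to illustrate the idea; this is not the project's actual packing or kernel code, and the function names are made up for illustration.

```python
# Illustrative sketch of symmetric per-group int8 weight quantization
# (group size 32, like the "g32" in the bin filename). NOT the project's
# actual code -- only the basic idea of what such a step computes.

def quantize_int8_groupwise(weights, group_size=32):
    """Quantize a flat list of floats to int8 per group.

    Returns (int8_values, per_group_scales)."""
    qvals, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group) or 1.0  # guard against all-zero groups
        scale = amax / 127.0  # symmetric: map [-amax, amax] -> [-127, 127]
        scales.append(scale)
        qvals.extend(max(-127, min(127, round(w / scale))) for w in group)
    return qvals, scales

def dequantize(qvals, scales, group_size=32):
    """Recover approximate floats from int8 values and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qvals)]
```

With a group size of 32 the worst-case reconstruction error per weight is half a quantization step, i.e. `amax / 254` for that group, which is why smaller groups generally give better accuracy at the cost of storing more scales.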
Yes. build
My machine's output is:
main: seed = 12
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model.cpp: loading model from xxx/ne_llama_q_int8_jblas_cbf16_g32.bin
init: n_vocab = 32000
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ne ctx size = 7199.26 MB
load: mem required = 9249.26 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_jblas_kv = 1
model_init_from_file: kv self size = 276.00 MB
system_info: n_threads = 112 / 224 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 10, n_keep = 0
once upon a time, a little girl named Ella was born with a rare genetic
model_print_timings: load time = 4617.68 ms
model_print_timings: sample time = 8.27 ms / 10 runs ( 0.83 ms per token)
model_print_timings: prompt eval time = 359.86 ms / 9 tokens ( 39.98 ms per token)
model_print_timings: eval time = 521.95 ms / 9 runs ( 57.99 ms per token)
model_print_timings: total time = 5152.31 ms
========== eval time log of each prediction ==========
prediction 0, time: 359.86ms
prediction 1, time: 70.54ms
prediction 2, time: 62.73ms
prediction 3, time: 56.78ms
prediction 4, time: 45.01ms
prediction 5, time: 75.16ms
prediction 6, time: 46.52ms
prediction 7, time: 44.98ms
prediction 8, time: 62.22ms
prediction 9, time: 58.00ms
Please ignore the time logs since they may be inaccurate. Also, the generated results may differ on your machine, since the model dispatches to different kernels depending on the available instruction sets.
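For reference, the `sampling:` line in the log above lists `top_k = 40`, `top_p = 0.950000`, `temp = 0.800000`. A minimal, framework-free sketch of how those three knobs typically interact when picking the next token follows; the helper name is hypothetical and this is not the project's sampling implementation.

```python
# Illustrative next-token sampling with temperature, top-k, and top-p
# (nucleus) filtering. Hypothetical helper, not the project's API.
import math
import random

def sample_token(logits, temp=0.8, top_k=40, top_p=0.95, rng=random.random):
    """Pick a token id from raw logits.

    temp < 1 sharpens the distribution; top_k keeps only the k most
    likely ids; top_p then keeps the smallest prefix of those whose
    cumulative probability reaches top_p."""
    # temperature scaling + numerically stable softmax
    scaled = [l / temp for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # top-k: rank ids by probability, keep the best k
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # top-p: smallest prefix of the ranked ids reaching the target mass
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # draw from the renormalized surviving candidates
    r = rng() * sum(probs[i] for i in kept)
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if acc >= r:
            return i
    return kept[-1]
```

This also makes the earlier point concrete: even with the same seed, a different kernel path can produce slightly different logits, and after top-k/top-p filtering that can change which token wins, so generated text can legitimately differ across machines.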
Thank you @zhentaoyu, will try that!
Hi @RachelShalom, did you run it successfully? Can we close this issue?
Hi, I was trying to run a model with the printed output in the following way, and I keep getting `what(): unexpectedly reached end of file`.
Any idea how to solve this?