-
llama.cpp is generally more optimized than transformers, but a difference that large sounds like you may not be running inference on your GPU (or that you have insufficient GPU memory to load the model and are therefore subject to paging, etc.). Did you check that the …
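
For what it's worth, a minimal sketch of that kind of device check, assuming a plain transformers/PyTorch setup (`device_map="auto"` needs the `accelerate` package installed):

```python
import torch
from transformers import AutoModelForCausalLM

print(torch.cuda.is_available())  # should be True on an A100

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",            # place the weights on the GPU if one is visible
    torch_dtype=torch.float16,    # half precision; fp32 needs roughly twice the memory
)
print(model.device)               # expect cuda:0, not cpu
```

Running `!nvidia-smi` in a notebook cell during generation also shows whether the GPU is actually busy and how much of its memory is in use.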
-
I simply ran your notebook on an A100 (with a few tens of GB of RAM still free). How can I check the device configuration? From what I see, the best option for experiments is Kaggle (which integrates with GCP, which offers 300 USD of starting credit); Kaggle has a library of models already loaded, so just picking one from their directory would probably be the easiest way to start. Kaggle sessions also keep running for up to 12 hours, even if the browser tab is closed. My Kaggle notebook for llama.cpp with Guidance is here, but it is not working: https://www.kaggle.com/code/wiiiktor/notebook8e0fdb563d I simply used the "Add models" functionality that Kaggle offers.
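
If the llama.cpp side is the part that fails, one common cause is CPU-only inference. A minimal sketch of GPU-offloaded loading with guidance's LlamaCpp backend, assuming extra keyword arguments are forwarded to llama-cpp-python's `Llama` constructor (the GGUF path is a hypothetical example):

```python
from guidance import models

llama2 = models.LlamaCpp(
    "/kaggle/input/llama-2/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical path; adjust to your mount
    n_gpu_layers=-1,  # llama-cpp-python option: offload all layers to the GPU
    n_ctx=4096,       # context window size
)
```

Note that llama-cpp-python itself has to be built with CUDA support (e.g. installed with `CMAKE_ARGS="-DLLAMA_CUBLAS=on"` on older versions; the exact flag depends on the version); otherwise `n_gpu_layers` is silently ignored and everything runs on the CPU.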
-
After using models.TransformersChat to load:
llama2 = models.TransformersChat("meta-llama/Llama-2-7b-chat-hf")
I ran the RPG character generation code on a Google Colab A100 and generation took 40 seconds. Why not 1.17 s, like with LlamaCpp?
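
One plausible cause is that the transformers model loaded on the CPU or in fp32. A minimal sketch, assuming guidance's TransformersChat forwards extra keyword arguments to transformers' `from_pretrained`:

```python
import torch
from guidance import models

llama2 = models.TransformersChat(
    "meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",           # put the weights on the GPU (needs `accelerate`)
    torch_dtype=torch.float16,   # half precision; fp32 on CPU would explain ~40 s
)
```

Even on the GPU, transformers will typically not match llama.cpp's quantized inference speed, but the gap should shrink to a small factor rather than the roughly 30x seen here.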