-
llama.cpp is generally more optimized than transformers, but a difference that large sounds like you may not be running inference on your GPU (or that you have insufficient GPU memory to load the model and are therefore subject to paging, etc.). Did you check that the …
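
For what it's worth, a minimal sketch of that kind of device check, assuming a plain transformers/PyTorch setup (`device_map="auto"` needs the `accelerate` package installed):

```python
import torch
from transformers import AutoModelForCausalLM

print(torch.cuda.is_available())  # should be True on an A100

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",            # place the weights on the GPU if one is visible
    torch_dtype=torch.float16,    # half precision; fp32 needs roughly twice the memory
)
print(model.device)               # expect cuda:0, not cpu
```

Running `!nvidia-smi` in a notebook cell during generation also shows whether the GPU is actually busy and how much of its memory is in use.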
-
I simply ran your notebook on an A100 (with a few tens of GB of RAM still free). How can I check the device configuration? From what I see, the best option for experiments is Kaggle (which integrates with GCP, which offers 300 USD of starting credit); Kaggle has a library of models already loaded, so just picking one from their directory would probably be the easiest way to start. Kaggle sessions also keep running for up to 12 hours, even if the browser tab is closed. My Kaggle notebook for llama.cpp with Guidance is here, but it is not working: https://www.kaggle.com/code/wiiiktor/notebook8e0fdb563d I simply used the "Add models" functionality that Kaggle offers.
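
If the llama.cpp side is the part that fails, one common cause is CPU-only inference. A minimal sketch of GPU-offloaded loading with guidance's LlamaCpp backend, assuming extra keyword arguments are forwarded to llama-cpp-python's `Llama` constructor (the GGUF path is a hypothetical example):

```python
from guidance import models

llama2 = models.LlamaCpp(
    "/kaggle/input/llama-2/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical path; adjust to your mount
    n_gpu_layers=-1,  # llama-cpp-python option: offload all layers to the GPU
    n_ctx=4096,       # context window size
)
```

Note that llama-cpp-python itself has to be built with CUDA support (e.g. installed with `CMAKE_ARGS="-DLLAMA_CUBLAS=on"` on older versions; the exact flag depends on the version); otherwise `n_gpu_layers` is silently ignored and everything runs on the CPU.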
-
After using models.TransformersChat to load:
llama2 = models.TransformersChat("meta-llama/Llama-2-7b-chat-hf")
I ran the RPG character generation code on a Google Colab A100 and generation took 40 seconds. Why not 1.17 s, like with LlamaCpp?
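
One plausible cause is that the transformers model loaded on the CPU or in fp32. A minimal sketch, assuming guidance's TransformersChat forwards extra keyword arguments to transformers' `from_pretrained`:

```python
import torch
from guidance import models

llama2 = models.TransformersChat(
    "meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",           # put the weights on the GPU (needs `accelerate`)
    torch_dtype=torch.float16,   # half precision; fp32 on CPU would explain ~40 s
)
```

Even on the GPU, transformers will typically not match llama.cpp's quantized inference speed, but the gap should shrink to a small factor rather than the roughly 30x seen here.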