torch.cuda.OutOfMemoryError: CUDA out of memory. #164
-
hi 👋 I'm hitting this error:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity of 23.69 GiB of
which 30.12 MiB is free. Including non-PyTorch memory, this process has 23.54 GiB memory in use. Of the allocated memory
23.29 GiB is allocated by PyTorch, and 1.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory
is large try setting max_split_size_mb …
```

But this text-generation model runs perfectly fine on this machine! My code:

```python
from autollm import AutoQueryEngine
from autollm.utils.document_reading import read_files_as_documents
import os

os.environ["HUGGINGFACE_API_KEY"] = "hf_xxxx"

documents = read_files_as_documents(input_dir="/home/yang/rag-app/docs")
query_engine = AutoQueryEngine.from_defaults(
    documents=documents,
    embed_model='local:FlagAlpha/Llama2-Chinese-7b-Chat',
    llm_model='FlagAlpha/Llama2-Chinese-7b-Chat',
)

def query(prompt):
    response = query_engine.query(prompt)
    return response.response
```

And the example from https://github.com/safevideo/autollm#supports-100-llms doesn't work either.
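For anyone else hitting the same traceback: a quick way to see how much of the card is actually free before autollm loads anything, and to try the allocator hint the error message mentions, is a small check like the sketch below. This is not from the thread; the `128` value is just an arbitrary example.

```python
import os

# Allocator hint suggested by the OOM message; the value is an example, tune or omit it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

# Report free/total memory for every visible GPU before loading any model.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB total")
```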
-
Hi there @BoyYangzai 👋, thank you for bringing this issue to our attention. We appreciate your use of our platform and your effort in detailing the problem.

Regarding the …

As for the …

We hope this helps! If you have any more questions or run into further issues, feel free to reach out.

Best,
-
@SeeknnDestroy But as a beginner I've been able to run this FlagAlpha/Llama2-Chinese-7b-Chat model successfully through llama2-webui on my 3090! Why does autollm say there is not enough memory?

I previously thought the problem was with the way I specified the model, resulting in an incorrect model being loaded:

```python
embed_model='local:FlagAlpha/Llama2-Chinese-7b-Chat',
llm_model='FlagAlpha/Llama2-Chinese-7b-Chat',
```
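One observation of my own, not from the thread: in the snippet above the same 7B checkpoint is passed both as the local embedding model and as the LLM, so two copies of it may end up on the same card. A sketch of the same call with a small dedicated embedding model instead, assuming autollm's `local:` syntax accepts any Hugging Face embedding model (as the README examples suggest) and that `documents` is built as above:

```python
from autollm import AutoQueryEngine

# Hypothetical variant: keep the 7B model as the LLM only and use a small
# sentence-embedding model for retrieval, so only one large model sits on the GPU.
query_engine = AutoQueryEngine.from_defaults(
    documents=documents,
    embed_model="local:BAAI/bge-small-en-v1.5",  # tens of MB instead of many GB
    llm_model="huggingface/FlagAlpha/Llama2-Chinese-7b-Chat",
)
```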
-
Can you please add the `huggingface/` prefix to your `llm_model` (i.e. `llm_model="huggingface/FlagAlpha/Llama2-Chinese-7b-Chat"`) and try again?
-
```python
from autollm import AutoQueryEngine, read_files_as_documents

documents = read_files_as_documents(input_dir="docs/")

llm_model = "huggingface/FlagAlpha/Llama2-Chinese-7b-Chat"
query_engine = AutoQueryEngine.from_defaults(
    documents=documents,
    embed_model='local:FlagAlpha/Llama2-Chinese-7b-Chat',
    llm_model=llm_model,
)
response = query_engine.query("What's your name")
```

Sadly, it's the same. I don't know if this has happened to anyone else.
Is it perhaps possible that the …
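For others debugging the same situation: since a later reply mentions the machine has more than one GPU, one cheap thing to try (my suggestion, not from the thread) is pinning the process to whichever card is currently empty, before anything CUDA-related is imported:

```python
import os

# Hypothetical: expose only GPU 1 to this process; it will show up as cuda:0
# inside PyTorch. Must be set before torch/autollm initialize CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from autollm import AutoQueryEngine, read_files_as_documents
```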
-
Hi @BoyYangzai, thanks for trying out the suggestion and providing detailed feedback. It seems that the root of the issue is related to the GPU memory requirements of the Llama2-7B-Chat model. This model requires approximately 30 GB of GPU memory to run effectively, which exceeds the capacity of a single RTX 3090 GPU in your setup.

Since you have multiple GPUs, parallel execution could theoretically solve this issue. However, setting up models to run in parallel across multiple GPUs can be quite complex and is not always straightforward.

A more practical solution in your case would be to deploy the Llama2-7B-Chat model to Hugging Face and use it via their hosted API. This way, you can …

```python
llm_model = "huggingface/FlagAlpha/Llama2-Chinese-7b-Chat"
llm_api_base = "https://my-endpoint.huggingface.cloud"

query_engine = AutoQueryEngine.from_defaults(
    documents=documents,
    embed_model='local:FlagAlpha/Llama2-Chinese-7b-Chat',
    llm_model=llm_model,
    llm_api_base=llm_api_base,
)
response = query_engine.query("What's your name")
```

This approach requires the model to be available on Hugging Face's platform. If it's not already there, you might need to upload it or request the model provider to do so. Using the … Please change your `llm_api_base` to your own endpoint URL accordingly.
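A side note that is not part of the reply above: before wiring the endpoint into autollm, it can be worth confirming that the deployed endpoint responds at all, for example with `huggingface_hub`. The URL below is the same placeholder used in the snippet above, and the token is whatever you set as `HUGGINGFACE_API_KEY`.

```python
from huggingface_hub import InferenceClient

# Placeholder endpoint URL from the example above; replace with your own.
client = InferenceClient(model="https://my-endpoint.huggingface.cloud", token="hf_xxxx")

# Simple smoke test of the hosted endpoint.
print(client.text_generation("What's your name?", max_new_tokens=32))
```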
-
@SeeknnDestroy But what's been bothering me is this:
Are autollm and launching llama2-webui two different ways of using the model? Why does llama2-webui work fine while autollm doesn't?
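For readers landing here later: one plausible explanation (an assumption on my part, not confirmed in this thread) is that llama2-webui loads the checkpoint with 8-bit or 4-bit quantization, while the autollm setup above loads it at full or half precision, and additionally loads the same 7B model a second time as the embedding model. A 7B model that needs roughly 14 GB in fp16 fits in roughly 4 GB when quantized to 4 bits, e.g.:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "FlagAlpha/Llama2-Chinese-7b-Chat"

# 4-bit quantized load (requires the bitsandbytes package); roughly a quarter of the
# fp16 memory footprint, which is why some UIs can fit a 7B model on a 24 GB card.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```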