torch.cuda.OutOfMemoryError: CUDA out of memory. #164
-
hi 👋 I'm hitting this error:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity of 23.69 GiB of
which 30.12 MiB is free. Including non-PyTorch memory, this process has 23.54 GiB memory in use. Of the allocated memory
23.29 GiB is allocated by PyTorch, and 1.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory
is large try setting max_split_size_mb …
```

But this text-generation model runs perfectly fine on this machine! My code:

```python
from autollm import AutoQueryEngine
from autollm.utils.document_reading import read_files_as_documents
import os

os.environ["HUGGINGFACE_API_KEY"] = "hf_xxxx"

documents = read_files_as_documents(input_dir="/home/yang/rag-app/docs")
query_engine = AutoQueryEngine.from_defaults(
    documents=documents,
    embed_model='local:FlagAlpha/Llama2-Chinese-7b-Chat',
    llm_model='FlagAlpha/Llama2-Chinese-7b-Chat',
)

def query(prompt):
    response = query_engine.query(prompt)
    return response.response
```

And the example from https://github.com/safevideo/autollm#supports-100-llms doesn't work either.
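For anyone else hitting the same traceback: a quick way to see how much of the card is actually free before autollm loads anything, and to try the allocator hint the error message mentions, is a small check like the sketch below. This is not from the thread; the `128` value is just an arbitrary example.

```python
import os

# Allocator hint suggested by the OOM message; the value is an example, tune or omit it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

# Report free/total memory for every visible GPU before loading any model.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB total")
```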
-
Hi there @BoyYangzai 👋, thank you for bringing this issue to our attention. We appreciate your use of our platform and your effort in detailing the problem.

Regarding the …

As for the …

We hope this helps! If you have any more questions or run into further issues, feel free to reach out.

Best,
-
@SeeknnDestroy But as a beginner I've been able to run this FlagAlpha/Llama2-Chinese-7b-Chat model successfully through llama2-webui on my 3090! Why does autollm say there is not enough memory?

I previously thought the problem was with the way I specified the model, resulting in an incorrect model being loaded:

```python
embed_model='local:FlagAlpha/Llama2-Chinese-7b-Chat',
llm_model='FlagAlpha/Llama2-Chinese-7b-Chat',
```
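One observation of my own, not from the thread: in the snippet above the same 7B checkpoint is passed both as the local embedding model and as the LLM, so two copies of it may end up on the same card. A sketch of the same call with a small dedicated embedding model instead, assuming autollm's `local:` syntax accepts any Hugging Face embedding model (as the README examples suggest) and that `documents` is built as above:

```python
from autollm import AutoQueryEngine

# Hypothetical variant: keep the 7B model as the LLM only and use a small
# sentence-embedding model for retrieval, so only one large model sits on the GPU.
query_engine = AutoQueryEngine.from_defaults(
    documents=documents,
    embed_model="local:BAAI/bge-small-en-v1.5",  # tens of MB instead of many GB
    llm_model="huggingface/FlagAlpha/Llama2-Chinese-7b-Chat",
)
```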
-
Can you please add the `huggingface/` prefix to your `llm_model` (i.e. `llm_model="huggingface/FlagAlpha/Llama2-Chinese-7b-Chat"`) and try again?
-
```python
from autollm import AutoQueryEngine, read_files_as_documents

documents = read_files_as_documents(input_dir="docs/")

llm_model = "huggingface/FlagAlpha/Llama2-Chinese-7b-Chat"
query_engine = AutoQueryEngine.from_defaults(
    documents=documents,
    embed_model='local:FlagAlpha/Llama2-Chinese-7b-Chat',
    llm_model=llm_model,
)
response = query_engine.query("What's your name")
```

Sadly, it's the same. I don't know if this has happened to anyone else.
Is it perhaps possible that the …
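For others debugging the same situation: since a later reply mentions the machine has more than one GPU, one cheap thing to try (my suggestion, not from the thread) is pinning the process to whichever card is currently empty, before anything CUDA-related is imported:

```python
import os

# Hypothetical: expose only GPU 1 to this process; it will show up as cuda:0
# inside PyTorch. Must be set before torch/autollm initialize CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from autollm import AutoQueryEngine, read_files_as_documents
```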
-
Hi @BoyYangzai, thanks for trying out the suggestion and providing detailed feedback. It seems that the root of the issue is related to the GPU memory requirements of the Llama2-7B-Chat model. This model requires approximately 30 GB of GPU memory to run effectively, which exceeds the capacity of a single RTX 3090 GPU in your setup.

Since you have multiple GPUs, parallel execution could theoretically solve this issue. However, setting up models to run in parallel across multiple GPUs can be quite complex and is not always straightforward.

A more practical solution in your case would be to deploy the Llama2-7B-Chat model to Hugging Face and use it via their hosted API. This way, you can …

```python
llm_model = "huggingface/FlagAlpha/Llama2-Chinese-7b-Chat"
llm_api_base = "https://my-endpoint.huggingface.cloud"

query_engine = AutoQueryEngine.from_defaults(
    documents=documents,
    embed_model='local:FlagAlpha/Llama2-Chinese-7b-Chat',
    llm_model=llm_model,
    llm_api_base=llm_api_base,
)
response = query_engine.query("What's your name")
```

This approach requires the model to be available on Hugging Face's platform. If it's not already there, you might need to upload it or request the model provider to do so. Using the … Please change your `llm_api_base` to your own endpoint URL accordingly.
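A side note that is not part of the reply above: before wiring the endpoint into autollm, it can be worth confirming that the deployed endpoint responds at all, for example with `huggingface_hub`. The URL below is the same placeholder used in the snippet above, and the token is whatever you set as `HUGGINGFACE_API_KEY`.

```python
from huggingface_hub import InferenceClient

# Placeholder endpoint URL from the example above; replace with your own.
client = InferenceClient(model="https://my-endpoint.huggingface.cloud", token="hf_xxxx")

# Simple smoke test of the hosted endpoint.
print(client.text_generation("What's your name?", max_new_tokens=32))
```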
-
@SeeknnDestroy But what's been bothering me is this:
Are autollm and launching llama2-webui two different ways of using the model? Why does llama2-webui work fine while autollm doesn't?
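For readers landing here later: one plausible explanation (an assumption on my part, not confirmed in this thread) is that llama2-webui loads the checkpoint with 8-bit or 4-bit quantization, while the autollm setup above loads it at full or half precision, and additionally loads the same 7B model a second time as the embedding model. A 7B model that needs roughly 14 GB in fp16 fits in roughly 4 GB when quantized to 4 bits, e.g.:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "FlagAlpha/Llama2-Chinese-7b-Chat"

# 4-bit quantized load (requires the bitsandbytes package); roughly a quarter of the
# fp16 memory footprint, which is why some UIs can fit a 7B model on a 24 GB card.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```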