
OOM issue when finetuning unsloth/llama-3-8b-bnb-4bit on Colab with T4 with 18000 context length #465

Open
rycfung opened this issue May 14, 2024 · 1 comment

Comments

@rycfung

rycfung commented May 14, 2024

I'm using the Unsloth Colab notebook to finetune the unsloth/llama-3-8b-bnb-4bit model on a dataset with a max context length of 18000. Whenever I kick off training, it always runs out of memory. That doesn't seem to be the case with the yahma/alpaca example. Here's the error:

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 102 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
[<ipython-input-7-3d62c575fcfd>](https://localhost:8080/#) in <cell line: 1>()
----> 1 trainer_stats = trainer.train()

13 frames
[/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py](https://localhost:8080/#) in _convert_to_fp32(tensor)
    779 
    780     def _convert_to_fp32(tensor):
--> 781         return tensor.float()
    782 
    783     def _is_fp16_bf16_tensor(tensor):

OutOfMemoryError: CUDA out of memory. Tried to allocate 9.47 GiB. GPU 0 has a total capacity of 14.75 GiB of which 3.78 GiB is free. Process 2116 has 10.95 GiB memory in use. Of the allocated memory 10.79 GiB is allocated by PyTorch, and 23.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Is the longer context length the reason it runs out of memory? What's the recommendation in this case to make this fine-tuning job possible?
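
As a side note, the traceback itself suggests the expandable-segments allocator setting. A minimal sketch of applying it (it has to be set before torch first initializes CUDA; it only reduces fragmentation and cannot make room for an allocation that simply doesn't fit):

```python
import os

# Must be set before torch initializes CUDA. This only reduces allocator
# fragmentation; it cannot free up space for a 9.47 GiB allocation that
# exceeds the remaining memory on a 16 GB T4.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the environment variable is set
```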

@danielhanchen
Contributor

Yes, contexts that long will cause OOMs.
According to our blog (https://unsloth.ai/blog/llama3), the max context length on Tesla T4s (16GB) is roughly 10K.
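
A minimal sketch of a more memory-conservative setup along these lines, assuming the standard Unsloth Colab notebook: cap max_seq_length at roughly 10K as suggested above, keep 4-bit loading, and enable Unsloth's gradient checkpointing. The exact numbers are illustrative, not a guaranteed fit:

```python
from unsloth import FastLanguageModel

max_seq_length = 10000  # ~10K, per the T4 limit above; 18000 will not fit on 16 GB

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=None,          # auto-detect (fp16 on a T4)
    load_in_4bit=True,   # keep the base weights in 4-bit
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # offloaded checkpointing to save VRAM
    random_state=3407,
)
```

In the trainer's TrainingArguments, dropping per_device_train_batch_size from 2 to 1 and raising gradient_accumulation_steps from 4 to 8 keeps the effective batch size at 8 while roughly halving activation memory.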
