I'm trying to quantize a 123B model to 4-bit, but it's failing with an OOM at about 24% when loading the shards. I have three 3090s (24 GB each), but I can't figure out how to convince HQQ to use more than one of them, and it doesn't look like all three would even be enough if I could. Is there some way to lower those requirements during the quantization process, even at the expense of speed? I've had no problem quantizing models of this size to EXL2 on a single 3090, so I don't see why HQQ should have such higher memory demands.

Replies: 7 comments
-
Can you share your code please?
-
It's pretty basic, I think. It's also very possible I'm missing something important. This is being done in WSL2, if it matters. I was completely unable to get hqq to even attempt to quantize under native Windows; it kept insisting it couldn't find a GPU, even though all three show up in nvidia-smi just fine and other applications like tabby have no problem using them.
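For reference, a basic HQQ quantization script of the sort described here looks roughly like the sketch below. It is illustrative only, based on the hqq library's documented API with a placeholder model id; it is not the poster's actual code:

```python
# Minimal HQQ 4-bit quantization sketch (illustrative; placeholder model id).
import torch
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id = "some-org/some-123b-model"  # hypothetical; the real model isn't named

model = HQQModelForCausalLM.from_pretrained(model_id)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config,
                     compute_dtype=torch.float16,
                     device="cuda")
model.save_quantized("quantized-model")
```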
-
[comment body not preserved in this export; judging by the reply below, it apparently suggested setting the device / `device_map` to `auto`]
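For context, multi-GPU loading through the transformers HQQ integration generally looks something like this. The sketch assumes the `HqqConfig` API available in recent transformers and a placeholder model id; the exact code from the missing comment is not preserved:

```python
# Sketch: quantize on load and spread layers across all visible GPUs.
import torch
from transformers import AutoModelForCausalLM, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-123b-model",   # hypothetical model id
    torch_dtype=torch.float16,
    device_map="auto",            # let accelerate place layers on all three GPUs
    quantization_config=quant_config,
)
```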
-
I tried that, but when I set it to `auto` (or anything other than `cuda`) I get [error output not preserved in this export].
Even if I can manage to get it to utilize all three GPUs, will it be enough? Is there a predictable model-size-to-VRAM-needed curve?
-
That's strange, it worked the last time I did it with the transformers integration. Is this 123B model an MoE or just a regular model? Because if it's an MoE you can quantize the experts to 2 or 3-bit to save VRAM.
-
Not an MoE, a Mistral Large finetune. If 5 GB per 7B params is accurate, I don't have enough VRAM even if I could get multi-GPU working. Rats. I'm not super savvy with this stuff - how would I go about switching to the master branch of transformers?
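The back-of-envelope math behind that conclusion, taking the quoted rule of thumb at face value:

```python
# Rough VRAM check under the quoted "~5 GB per 7B params" rule of thumb.
params_b = 123               # model size, billions of parameters
need_gb  = params_b * 5 / 7  # ~87.9 GB under the rule of thumb
have_gb  = 3 * 24            # three 24 GB RTX 3090s = 72 GB total
print(f"need ~{need_gb:.1f} GB, have {have_gb} GB")  # short by ~16 GB
```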
-
You can quantize the MLP layers to 3-bit; it's gonna run a bit slow though (working on making it faster). To install the master branch you just do `pip install git+https://github.com/huggingface/transformers`.
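A per-layer config along these lines is the pattern the hqq README sketches for mixed precision; the layer tags below assume the usual Llama/Mistral module names and are illustrative, not tested on this particular model:

```python
# Sketch: 4-bit attention, 3-bit MLP, via a per-layer quant-config dict.
from hqq.core.quantize import BaseQuantizeConfig

attn_cfg = BaseQuantizeConfig(nbits=4, group_size=64)
mlp_cfg  = BaseQuantizeConfig(nbits=3, group_size=64)

quant_config = {
    "self_attn.q_proj": attn_cfg,
    "self_attn.k_proj": attn_cfg,
    "self_attn.v_proj": attn_cfg,
    "self_attn.o_proj": attn_cfg,
    "mlp.gate_proj": mlp_cfg,
    "mlp.up_proj":   mlp_cfg,
    "mlp.down_proj": mlp_cfg,
}
# Then pass it in place of a single BaseQuantizeConfig:
# model.quantize_model(quant_config=quant_config, ...)
```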