I'm trying to quantize a 123B model to 4-bit, but it's failing with an OOM at about 24% when loading the shards. I have three 3090s (24 GB each), but I can't figure out how to convince HQQ to use more than one of them, and it doesn't look like all three would even be enough if I could. Is there some way to lower those requirements during the quantization process, even at the expense of speed? I've had no problem quantizing models of this size to EXL2 on a single 3090, so I don't see why HQQ should have such higher memory demands.

Replies: 7 comments
-
Can you share your code please?
-
It's pretty basic, I think. It's also very possible I'm missing something important. This is being done in WSL2, if it matters. I was completely unable to get hqq to even attempt to quantize under native Windows; it kept insisting it couldn't find a GPU, even though all three show up in nvidia-smi just fine and other applications like tabby have no problem using them.
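For reference, a basic HQQ quantization script of the sort described here looks roughly like the sketch below. It is illustrative only, based on the hqq library's documented API with a placeholder model id; it is not the poster's actual code:

```python
# Minimal HQQ 4-bit quantization sketch (illustrative; placeholder model id).
import torch
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id = "some-org/some-123b-model"  # hypothetical; the real model isn't named

model = HQQModelForCausalLM.from_pretrained(model_id)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config,
                     compute_dtype=torch.float16,
                     device="cuda")
model.save_quantized("quantized-model")
```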
-
[comment body not preserved in this export; judging by the reply below, it apparently suggested setting the device / `device_map` to `auto`]
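For context, multi-GPU loading through the transformers HQQ integration generally looks something like this. The sketch assumes the `HqqConfig` API available in recent transformers and a placeholder model id; the exact code from the missing comment is not preserved:

```python
# Sketch: quantize on load and spread layers across all visible GPUs.
import torch
from transformers import AutoModelForCausalLM, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-123b-model",   # hypothetical model id
    torch_dtype=torch.float16,
    device_map="auto",            # let accelerate place layers on all three GPUs
    quantization_config=quant_config,
)
```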
-
I tried that, but when I set it to `auto` (or anything other than `cuda`) I get [error output not preserved in this export].
Even if I can manage to get it to utilize all three GPUs, will it be enough? Is there a predictable model-size-to-VRAM-needed curve?
-
That's strange, it worked the last time I did it with the transformers integration. Is this 123B model an MoE or just a regular model? Because if it's an MoE you can quantize the experts to 2 or 3-bit to save VRAM.
-
Not an MoE, a Mistral Large finetune. If 5 GB per 7B params is accurate, I don't have enough VRAM even if I could get multi-GPU working. Rats. I'm not super savvy with this stuff - how would I go about switching to the master branch of transformers?
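The back-of-envelope math behind that conclusion, taking the quoted rule of thumb at face value:

```python
# Rough VRAM check under the quoted "~5 GB per 7B params" rule of thumb.
params_b = 123               # model size, billions of parameters
need_gb  = params_b * 5 / 7  # ~87.9 GB under the rule of thumb
have_gb  = 3 * 24            # three 24 GB RTX 3090s = 72 GB total
print(f"need ~{need_gb:.1f} GB, have {have_gb} GB")  # short by ~16 GB
```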
-
You can quantize the MLP layers to 3-bit; it's gonna run a bit slow though (working on making it faster). To install the master branch you just do `pip install git+https://github.com/huggingface/transformers`.
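A per-layer config along these lines is the pattern the hqq README sketches for mixed precision; the layer tags below assume the usual Llama/Mistral module names and are illustrative, not tested on this particular model:

```python
# Sketch: 4-bit attention, 3-bit MLP, via a per-layer quant-config dict.
from hqq.core.quantize import BaseQuantizeConfig

attn_cfg = BaseQuantizeConfig(nbits=4, group_size=64)
mlp_cfg  = BaseQuantizeConfig(nbits=3, group_size=64)

quant_config = {
    "self_attn.q_proj": attn_cfg,
    "self_attn.k_proj": attn_cfg,
    "self_attn.v_proj": attn_cfg,
    "self_attn.o_proj": attn_cfg,
    "mlp.gate_proj": mlp_cfg,
    "mlp.up_proj":   mlp_cfg,
    "mlp.down_proj": mlp_cfg,
}
# Then pass it in place of a single BaseQuantizeConfig:
# model.quantize_model(quant_config=quant_config, ...)
```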