Support inference with WOQ and LoRA adapter #1434
Hi @Yuan0320, thanks for using ITREX. Regarding your question: if you meant the latter case, you can just load the LoRA adapter and merge it into the model before WOQ, i.e. run WOQ after the LoRA adapter has been merged into the model. This way, the model's structure won't change; only its weights are updated.
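For reference, a minimal sketch of that merge-then-quantize flow is below. It assumes a Hugging Face base checkpoint plus a PEFT LoRA adapter on disk, and uses the ITREX weight-only quantization entry points shown in weightonlyquant.md; the model name, adapter directory, and the `WeightOnlyQuantConfig` arguments are illustrative, and the config class name may differ between ITREX versions.

```python
# Sketch: merge a trained LoRA adapter into the base model, then apply ITREX WOQ.
# Paths/model names below are placeholders, not from the original thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"      # placeholder base checkpoint
adapter_dir = "./my-lora-adapter"         # placeholder LoRA adapter directory
merged_dir = "./llama2-7b-lora-merged"

# 1. Load the FP16/FP32 base model and attach the trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
lora = PeftModel.from_pretrained(base, adapter_dir)

# 2. Fold the adapter weights into the base weights. The architecture is unchanged,
#    only the merged linear weights differ, so WOQ sees an ordinary dense model.
merged = lora.merge_and_unload()
merged.save_pretrained(merged_dir)
AutoTokenizer.from_pretrained(base_id).save_pretrained(merged_dir)

# 3. Apply ITREX weight-only quantization to the merged checkpoint (CPU flow).
#    Class/argument names follow weightonlyquant.md and may vary by ITREX version.
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM as ItrexAutoModelForCausalLM,
    WeightOnlyQuantConfig,
)

woq_config = WeightOnlyQuantConfig(weight_dtype="int4")
woq_model = ItrexAutoModelForCausalLM.from_pretrained(
    merged_dir, quantization_config=woq_config, trust_remote_code=True
)
```

With this flow the adapter is baked into the quantized weights, so inference uses a single WOQ model; the trade-off is that the adapter itself no longer stays in FP32/FP16 at runtime, which is exactly the "former case" discussed below.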
Hi @XinyuYe-Intel, thanks for the quick reply and the insight, it makes sense. I initially meant the former case, as I want to keep the adapter at high precision to minimize the accuracy loss from WOQ. I think it might be challenging to achieve this (…).
No problem at all. And yes, we haven't supported the former case yet.
Hi ITREX team, thanks for the great work!
I've been experimenting with Weight-Only Quantization (WOQ) in ITREX, following the provided examples in weightonlyquant.md#example-for-cpu-device. The results are promising.
Now I'm interested in extending this by incorporating a trained LoRA adapter for inference: I'd like to combine the pretrained base weights (WOQ) with a LoRA adapter kept in FP32/FP16. Is this feasible today, or is it on the roadmap for future updates? Any insights or assistance would be greatly appreciated. Thanks!
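For context, the WOQ CPU flow referenced in this post looks roughly like the sketch below, adapted from weightonlyquant.md. The checkpoint name and config arguments are illustrative, and the quantization config class may be named differently depending on the ITREX version.

```python
# Rough shape of the ITREX WOQ example for CPU inference (illustrative values).
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    WeightOnlyQuantConfig,
)

model_name = "EleutherAI/gpt-j-6b"  # placeholder checkpoint
woq_config = WeightOnlyQuantConfig(weight_dtype="int4", compute_dtype="int8")

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Weights are quantized at load time; activations stay in the compute dtype.
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)

inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(inputs, max_new_tokens=32)[0]))
```

The feature request in this issue is to run this same flow while a separate LoRA adapter remains in FP32/FP16 at inference time, rather than being merged before quantization.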