8bit + Aten + compile #130
Comments
You need to patch the model for inference first. By default, the model is set up for QLoRA training, which is not compatible with torch.compile ...
HQQLinear.set_backend(HQQBackend.ATEN)
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model)
#Inference
from hqq.utils.generation_hf import patch_model_for_compiled_runtime
patch_model_for_compiled_runtime(model, tokenizer, warmup=True)
...
If you are using the compiled runtime, you can also use the
Thank you very much. I noticed that patch_hqq_inference() in prepare_for_inference replaces the forward function with forward_hqq_inferece(), which is different from HQQLinear's own forward_aten. In that case, won't backend=aten have no effect?
Oh yes, it's a bit confusing. In short, no.
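The mechanism behind the question above can be illustrated with a toy example. This is only a sketch: `ToyQuantLinear`, `patch_for_inference` and `matvec` are hypothetical names, not hqq code. It shows nothing more than the Python behaviour involved: a forward bound on an instance shadows whatever forward the class-level backend switch selects, while the patched forward can still reuse shared helpers such as `dequantize()`.

```python
import types


def matvec(w, x):
    """Plain-Python matrix-vector product, used to avoid any dependency."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class ToyQuantLinear:
    """Hypothetical stand-in for a quantized linear layer (not HQQLinear)."""

    def __init__(self, w_q, scale):
        self.w_q = w_q
        self.scale = scale

    def dequantize(self):
        # Shared helper: every forward variant below goes through it.
        return [[q * self.scale for q in row] for row in self.w_q]

    def forward_default(self, x):
        return "default", matvec(self.dequantize(), x)

    def forward_fast(self, x):
        return "fast", matvec(self.dequantize(), x)

    forward = forward_default

    @classmethod
    def set_backend(cls, name):
        # Class-level switch: rebinds forward for all unpatched instances.
        cls.forward = getattr(cls, "forward_" + name)


def patch_for_inference(layer):
    """Bind a new forward on this instance only (assumed patching style)."""

    def forward_inference(self, x):
        return "inference", matvec(self.dequantize(), x)

    layer.forward = types.MethodType(forward_inference, layer)


layer = ToyQuantLinear(w_q=[[1, 2], [3, 4]], scale=0.5)
ToyQuantLinear.set_backend("fast")
before_patch = layer.forward([1.0, 1.0])

patch_for_inference(layer)
after_patch = layer.forward([1.0, 1.0])

# A later backend switch no longer reaches the patched instance.
ToyQuantLinear.set_backend("default")
after_switch = layer.forward([1.0, 1.0])
```

In this toy, the backend switch stops affecting the patched layer's forward, but the numerical result is unchanged because every variant goes through the same `dequantize()` helper. Whether hqq behaves the same way depends on its actual implementation.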
Thank you very much. However, I don't think this is the main issue. Starting with PyTorch 2.4, the binding mechanism for C++/CUDA operators changed (https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html). In my tests, the methods in hqq/kernels are not compatible with torch.compile() and report the following issue:
Oh, that could be it actually, thanks for checking! Just use the
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
# Load the model
model_id = 'mobiuslabsgmbh/Llama-2-7b-chat-hf_1bitgs8_hqq'
# Define the device before using it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Move the model to the selected device
model.to(device)
# Setup Inference Mode
tokenizer.add_bos_token = False
# Optional: torch compile for faster inference
model = torch.compile(model) # You might want to enable this for potential speedup
def chat_processor(chat, max_new_tokens=100, do_sample=True, device='cuda'):
# Now you can call the function:
results = chat_processor("What is the solution to x^2 - 1 = 0", max_new_tokens=100, device=device)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
When I try to run patch_model_for_compiled_runtime on 8bit + aten, the program reports an error. How can I solve this problem?
code
import torch
import torch.fx
import time
device = 'cuda:0'
backend = 'torchao_int4' #"torchao_int4" (4-bit only) or "bitblas" (4-bit + 2-bit)
compute_dtype = torch.float16 if backend=="bitblas" else torch.bfloat16
cache_dir = '.'
model_id = './llama/llama3/Meta-Llama-3-8B'
########################################################################
#Load model
from transformers import AutoModelForCausalLM, AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import *
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir=cache_dir, torch_dtype=compute_dtype, attn_implementation="sdpa")
#Quantize
quant_config = BaseQuantizeConfig(nbits=8, group_size=64, axis=0)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)
HQQLinear.set_backend(HQQBackend.ATEN_FORWARD)
#Inference
from hqq.utils.generation_hf import patch_model_for_compiled_runtime
patch_model_for_compiled_runtime(model, tokenizer, warmup=True)
WARMUP_PROMPTS = [
"Write an essay about large language models.",
"Tell me a funny joke!",
"Hello, my name is Kiven, I like England for five reasons. First,",
"Who is Elon Musk?",
"Write a Python code snippet that adds two numbers together.",
]
for prompt in WARMUP_PROMPTS:
    inputs_warmup = tokenizer(prompt,return_tensors='pt',padding='max_length',max_length=128,truncation=True).to(model.device)
    torch.cuda.synchronize()
    warmup_start = time.time()
    output = model.generate(**inputs_warmup,max_new_tokens=1000,cache_implementation="static", pad_token_id=tokenizer.pad_token_id)
    torch.cuda.synchronize()
    warmup_end = time.time()
    print(warmup_end-warmup_start)
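The timing pattern in the loop above (synchronize, start the clock, run, synchronize, stop the clock) can be factored into a small helper. This is a sketch with hypothetical names (`timed`, `sync`, `sink`), not something hqq provides. With a compiled model, the first iterations typically include compilation time, so it helps to record each run and compare the early ones against the later ones.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sync=None, sink=None):
    """Time a block; `sync` is an optional barrier such as torch.cuda.synchronize."""
    if sync is not None:
        sync()  # make sure earlier asynchronous work is not counted
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()  # wait for the timed work to actually finish
        elapsed = time.perf_counter() - start
        if sink is not None:
            sink.append((label, elapsed))
        else:
            print(f"{label}: {elapsed:.3f}s")


# Usage with a dummy workload; on GPU one would pass sync=torch.cuda.synchronize.
timings = []
for i in range(3):
    with timed(f"run {i}", sink=timings):
        time.sleep(0.01)
```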