LQ-LoRA: Low-rank plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [Paper]
- Clone the repo
git clone https://github.com/HanGuo97/lq-lora.git
cd lq-lora
- Create Docker image (optional)
# Using BuildKit
DOCKER_BUILDKIT=1 docker build \
-t lqlora \
-f Dockerfile \
.
docker run -ti --rm \
--gpus all \
-p 28888:8888 \
--shm-size=2g \
lqlora \
bash -c "cd main/ && jupyter-lab --ip=0.0.0.0 --allow-root"
- Install dependencies
bash scripts/setup.sh
Note: Some of the codebase relies on PyTorch>=2.1.
TODO.
After downloading the files, please update `FILE_NAMES_DICT` in models/allocation_utils accordingly.
from transformers import AutoTokenizer, AutoModelForCausalLM
from models import lora_utils
data = "c4" # applying data-aware quantization
budget = "2.75" # target bits
model_size = "70b" # 7b or 70b
# Loads the base model (to CPU)
model = AutoModelForCausalLM.from_pretrained(
f"meta-llama/Llama-2-{model_size}-hf")
# Adds LoRA components, etc
model = lora_utils.prepare_model_for_lora(
model=model,
num_ranks=64,
lora_alpha=16,
lora_dropout=0.0,
use_gradient_checkpointing=True)
# Applies LQ-LoRA to the model.
lora_utils.transform_lora_layers(
lpq=True,
model=model,
model_name=f"llama-2-{model_size}/lpq-64/{data},budget={budget}",
device="cuda")
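For intuition about the "low-rank plus quantized" decomposition named in the title, the following is a minimal NumPy sketch and not the repository's implementation. It approximates a weight matrix `W` as `Q + L1 @ L2` by alternating between a truncated SVD of `W - Q` and quantizing the remainder. The helper names `quantize_uniform` and `lq_decompose` are hypothetical, and plain round-to-nearest uniform quantization is swapped in for the quantization scheme the library actually uses.

```python
import numpy as np


def quantize_uniform(x, bits=3):
    """Round-to-nearest uniform quantization over the matrix's own value range.

    Assumption: a simple stand-in, not the quantizer used by LQ-LoRA.
    """
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo


def lq_decompose(W, rank=4, bits=3, num_iters=10):
    """Alternate a rank-`rank` SVD fit with quantization of the remainder.

    The alternation is a heuristic and the error need not decrease at every
    step, so this sketch keeps the best iterate seen.
    Returns (error, Q, L1, L2).
    """
    Q = np.zeros_like(W)
    best = None
    for _ in range(num_iters):
        # Low-rank step: best rank-r approximation of what Q does not explain.
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        root = np.sqrt(S[:rank])
        L1 = U[:, :rank] * root
        L2 = root[:, None] * Vt[:rank]
        # Quantization step: quantize what the low-rank part does not explain.
        Q = quantize_uniform(W - L1 @ L2, bits)
        err = np.linalg.norm(W - (Q + L1 @ L2))
        if best is None or err < best[0]:
            best = (err, Q, L1, L2)
    return best


# Toy matrix with low-rank structure plus small noise.
rng = np.random.default_rng(0)
W = rng.standard_normal((32, 4)) @ rng.standard_normal((4, 32))
W = W + 0.01 * rng.standard_normal((32, 32))

err_lq, Q, L1, L2 = lq_decompose(W, rank=4, bits=3)
err_plain = np.linalg.norm(W - quantize_uniform(W, bits=3))
print(f"quantization only: {err_plain:.4f}, low-rank + quantized: {err_lq:.4f}")
```

On this toy matrix the low-rank factors absorb most of the structure, so the quantized part only has to cover a small-range remainder.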
Note that HuggingFace's PEFT library only saves the adapter parameters. Since LQ-LoRA also changes the base model parameters, we need to save the full model weights.
import os
import torch
# `output_dir` is the directory in which to store the checkpoint
state_dict = model.state_dict()
file_name = os.path.join(
output_dir,
"full_model.pth")
torch.save(state_dict, file_name)
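The save step above can be checked on a small scale with a round trip. The sketch below uses a hypothetical `TinyModel` as a stand-in for the LQ-LoRA model, with one "base" layer and one "adapter" layer, and a temporary directory in place of `output_dir`. It only illustrates that saving the full `state_dict` preserves both groups of parameters; it does not exercise the repository's own loading path.

```python
import os
import tempfile

import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical stand-in: a modified base layer plus an adapter layer."""

    def __init__(self):
        super().__init__()
        self.base = nn.Linear(8, 8, bias=False)
        self.adapter = nn.Linear(8, 8, bias=False)

    def forward(self, x):
        return self.base(x) + self.adapter(x)


model = TinyModel()

# Save every parameter, not only the adapter weights.
output_dir = tempfile.mkdtemp()
file_name = os.path.join(output_dir, "full_model.pth")
torch.save(model.state_dict(), file_name)

# Restore into a freshly initialized model and compare tensors.
restored = TinyModel()
restored.load_state_dict(torch.load(file_name, map_location="cpu"))

saved_keys = sorted(model.state_dict().keys())
print(saved_keys)
```

If only the adapter entries were saved, the restored model would keep its randomly initialized base weights, which is the failure mode the note above warns about.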
# No need to apply `transform_lora_layers` because
# these will be loaded from the checkpoint.
model = lora_utils.prepare_model_for_lora(
model=model,
num_ranks=64,
lora_alpha=16,
lora_dropout=0.0,
use_gradient_checkpointing=True,
checkpoint_dir=checkpoint_dir) # -> enter the path to the checkpoint directory
- Upload the artifacts
- We currently use a legacy version of the (de)quantization implementation. We will update the code to use the latest version.
This code reuses components from several libraries including QLoRA and OmniQuant.