HQQ v0.1.6

Use v0.1.6.post1 instead, unless you clone the repo first and then install from source.

Features

  • Quantize on target device.
  • Meta-offloading uses pinned memory for faster/async transfers.
  • Loading saved LoRA weights automatically adds LoRA modules if not already present.
  • pip install now compiles the CUDA kernels automatically.
  • The CUDA backend is automatically detected and used when available.
  • Any HF model can be quantized automatically via AutoHQQHFModel (see the sketch after this list).
  • Faster meta-offloading with CUDA streams (experimental).
  • Int8 matmul (experimental).
  • Shared memory CUDA kernels (experimental).
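
For reference, here is a minimal sketch of the AutoHQQHFModel path combined with on-device quantization. The import path, argument names (quant_config, compute_dtype, device), and config fields are assumptions based on the HQQ API around this release, not a verbatim excerpt; check the repo for the exact interface.

```python
# Hedged sketch: quantize a Hugging Face model directly on the GPU with HQQ.
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

# Model name used only for illustration (same model as in the memory figures below).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             torch_dtype=torch.float16)

# 2-bit quantization with group size 16, matching the 2-bit/gs=16 setting quoted below.
quant_config = BaseQuantizeConfig(nbits=2, group_size=16)

# Quantize on the target device (the new "quantize on target device" feature).
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")
```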

Bugs

  • Fixed PEFT bias dtype.
  • Removed auto backend setting in LoRA.
  • All HQQLinear dtype/device-related overloads now return self, which should resolve several issues (see the sketch after this list).
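
A minimal sketch of what the return-self change enables; the HQQLinear constructor arguments here are assumptions about the API around this release.

```python
# Hedged sketch: wrap a linear layer in HQQLinear, then chain device/dtype calls,
# which works because these overloads now return self.
import torch.nn as nn
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

linear = nn.Linear(4096, 4096)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
hqq_layer = HQQLinear(linear, quant_config)

# Previously, calls like .to()/.half() could return None and break chaining.
hqq_layer = hqq_layer.to("cuda").half()
```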

Other

  • Refactored the backends (backprop backends are now used by default; see the sketch after this list).
  • Added typing.
  • Ran Ruff fixes and reformatting on all Python files.
  • Refactored ATen for reference tensors.
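
For context, backends are selected globally through HQQLinear. The sketch below assumes the HQQBackend enum member names from the code around this release; the actual names may differ.

```python
# Hedged sketch: switching the global dequantization backend (enum names assumed).
from hqq.core.quantize import HQQLinear, HQQBackend

# Backprop-capable PyTorch backend (the new default after the refactor).
HQQLinear.set_backend(HQQBackend.PYTORCH_BACKPROP)

# Or the ATen/CUDA backend, which is auto-detected and used when the kernels are available.
HQQLinear.set_backend(HQQBackend.ATEN)
```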

Issues

  • Using CUDA streams for offloading is faster but uses more memory (~+700MB with Llama2-7B at 2-bit/gs=16). In fact, it is sometimes almost as fast as keeping the data on the GPU, so this is worth looking into (a generic sketch of the stream/pinned-memory pattern follows this list).
  • Shared-memory CUDA kernels are, for some reason, a bit slower than the kernels without shared memory.
  • The block-size setting does not have much influence on speed.
  • Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be done on the ATen/CUDA side.
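
The sketch below shows the generic pinned-memory + CUDA-stream pattern behind this kind of offloading. It is a plain PyTorch illustration with arbitrary tensor shapes, not HQQ's internal code.

```python
# Generic sketch: overlap host->device weight transfers with compute using a
# pinned CPU buffer and a dedicated copy stream.
import torch

copy_stream = torch.cuda.Stream()

# CPU-side buffer kept in pinned (page-locked) memory so the copy can be async.
cpu_weight = torch.randn(4096, 4096, dtype=torch.float16).pin_memory()

with torch.cuda.stream(copy_stream):
    # non_blocking=True lets the transfer overlap with work on the default stream.
    gpu_weight = cpu_weight.to("cuda", non_blocking=True)

# Make the default stream wait for the copy before using the weight.
torch.cuda.current_stream().wait_stream(copy_stream)
out = gpu_weight @ torch.randn(4096, 8, dtype=torch.float16, device="cuda")
```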