HQQ v0.1.6

Use v0.1.6.post1 instead, unless you clone the repo first and then install from source.

Features

  • Quantize on target device.
  • Meta-offloading uses pinned memory for faster/async transfers.
  • Loading saved LoRA weights automatically adds LoRA modules if not already present.
  • pip install now compiles the CUDA kernels automatically.
  • The CUDA backend is automatically detected and used when available.
  • Any HF model can be quantized automatically via AutoHQQHFModel (see the sketch after this list).
  • Faster meta-offloading with CUDA streams (experimental).
  • Int8 matmul (experimental).
  • Shared memory CUDA kernels (experimental).
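
For reference, here is a minimal sketch of the AutoHQQHFModel path combined with on-device quantization. The import path, argument names (quant_config, compute_dtype, device), and config fields are assumptions based on the HQQ API around this release, not a verbatim excerpt; check the repo for the exact interface.

```python
# Hedged sketch: quantize a Hugging Face model directly on the GPU with HQQ.
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

# Model name used only for illustration (same model as in the memory figures below).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             torch_dtype=torch.float16)

# 2-bit quantization with group size 16, matching the 2-bit/gs=16 setting quoted below.
quant_config = BaseQuantizeConfig(nbits=2, group_size=16)

# Quantize on the target device (the new "quantize on target device" feature).
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")
```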

Bugs

  • Fixed PEFT bias dtype.
  • Removed auto backend setting in LoRA.
  • All HQQLinear dtype/device-related overloads now return self, which should resolve several issues (see the sketch after this list).
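
A minimal sketch of what the return-self change enables; the HQQLinear constructor arguments here are assumptions about the API around this release.

```python
# Hedged sketch: wrap a linear layer in HQQLinear, then chain device/dtype calls,
# which works because these overloads now return self.
import torch.nn as nn
from hqq.core.quantize import HQQLinear, BaseQuantizeConfig

linear = nn.Linear(4096, 4096)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
hqq_layer = HQQLinear(linear, quant_config)

# Previously, calls like .to()/.half() could return None and break chaining.
hqq_layer = hqq_layer.to("cuda").half()
```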

Other

  • Refactored the backends (backprop backends are now used by default; see the sketch after this list).
  • Added typing.
  • Ran Ruff fixes and reformatting on all Python files.
  • Refactored ATen for reference tensors.
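
For context, backends are selected globally through HQQLinear. The sketch below assumes the HQQBackend enum member names from the code around this release; the actual names may differ.

```python
# Hedged sketch: switching the global dequantization backend (enum names assumed).
from hqq.core.quantize import HQQLinear, HQQBackend

# Backprop-capable PyTorch backend (the new default after the refactor).
HQQLinear.set_backend(HQQBackend.PYTORCH_BACKPROP)

# Or the ATen/CUDA backend, which is auto-detected and used when the kernels are available.
HQQLinear.set_backend(HQQBackend.ATEN)
```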

Issues

  • Using CUDA streams for offloading is faster but uses more memory (~+700MB with Llama2-7B at 2-bit/gs=16). In fact, it is sometimes almost as fast as keeping the data on the GPU, so this is worth looking into (a generic sketch of the stream/pinned-memory pattern follows this list).
  • Shared-memory CUDA kernels are, for some reason, a bit slower than the kernels without shared memory.
  • The block-size setting does not have much influence on speed.
  • Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be done on the ATen/CUDA side.
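
The sketch below shows the generic pinned-memory + CUDA-stream pattern behind this kind of offloading. It is a plain PyTorch illustration with arbitrary tensor shapes, not HQQ's internal code.

```python
# Generic sketch: overlap host->device weight transfers with compute using a
# pinned CPU buffer and a dedicated copy stream.
import torch

copy_stream = torch.cuda.Stream()

# CPU-side buffer kept in pinned (page-locked) memory so the copy can be async.
cpu_weight = torch.randn(4096, 4096, dtype=torch.float16).pin_memory()

with torch.cuda.stream(copy_stream):
    # non_blocking=True lets the transfer overlap with work on the default stream.
    gpu_weight = cpu_weight.to("cuda", non_blocking=True)

# Make the default stream wait for the copy before using the weight.
torch.cuda.current_stream().wait_stream(copy_stream)
out = gpu_weight @ torch.randn(4096, 8, dtype=torch.float16, device="cuda")
```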