Describe the bug
Using llmcompressor to prune a Llama 3.1 8B model, I used the sequential_update=True flag to prune the model sequentially, but the whole model still loads onto the GPU, allocating 16.5GB of memory. This is exactly the same amount of GPU vRAM as when I turn off the sequential flag and all layers are processed together.
Expected behavior
The expected behavior was to allocate GPU vRAM for only a single layer at a time, since the modifier is supposed to sequentially load and update each layer, offload it, and then load the next layer. This would allocate approximately 2GB of vRAM.
Environment
Include all relevant environment information:
OS: Ubuntu 20.04.6 LTS
Python version: 3.11.8
LLM Compressor version: 0.3.0
ML framework version(s): torch 2.4.0
Other Python package versions: numpy 1.26.4, compressed-tensors 0.8.1, vllm 0.6.3.post1
Other relevant environment information: Tesla V100 32GB, Driver Version: 545.23.08, CUDA Version: 12.3
To Reproduce
Exact steps to reproduce the behavior:
Code to generate the issue:
from transformers import AutoTokenizer
from datasets import Dataset, load_dataset
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.pruning import WandaPruningModifier
import random

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
num_samples = 512
max_seq_len = 1024  # 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepend an instruction-style prompt to each calibration example
preprocess_fn = lambda example: {
    "text": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n{text}".format_map(example)
}

dataset_name = "neuralmagic/LLM_compression_calibration"
dataset = load_dataset(dataset_name, split="train")
ds = dataset.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

recipe = WandaPruningModifier(
    sparsity=0.5,
    sequential_update=True,
)

oneshot(
    model=model_id,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
    output_dir="~/models/Meta-Llama-3.1-70B-Instruct-pruned_0.5",
)
Errors
No Errors
Additional context
Hi @hafezmg48, apologies for the late response here.
The expected behavior was to allocate GPU vRAM for only a single layer at a time, since the modifier is supposed to sequentially load and update each layer, offload it, and then load the next layer. This would allocate approximately 2GB of vRAM.
What you've described here is not actually what the sequential_update option does. sequential_update does NOT perform layer-wise CPU offloading. Instead, this option reduces memory by allocating the Hessians for each layer only as they are needed. Hessians are large buffers required by the GPTQ algorithm.
If you want to perform layer-wise CPU offloading, you can use from_pretrained(device_map="cpu"), which offloads the entire model to CPU and onloads only one layer at a time. For more information on device maps and the CPU offloading enabled by accelerate, see Accelerate/BigModelInference.
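A minimal sketch of that approach, reusing the ds, recipe, max_seq_len, and num_samples variables from the script above and the SparseAutoModelForCausalLM/oneshot interfaces it already imports (exact keyword names should be checked against your llmcompressor version):

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

# Load all weights onto CPU; accelerate then onloads one layer at a time
# to the GPU during calibration rather than keeping the full model resident.
model = SparseAutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="cpu",
    torch_dtype="auto",
)

oneshot(
    model=model,  # pass the preloaded model object rather than the model id string
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
    output_dir="~/models/Meta-Llama-3.1-70B-Instruct-pruned_0.5",
)

How much peak vRAM this actually saves depends on how the calibration forward passes are dispatched, so treat this as a starting point rather than a guaranteed 2GB footprint.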