Describe the bug
Using llmcompressor to prune a Llama 3.1 8B model, I used the sequential_update=True flag to prune the model sequentially, but the whole model still loads onto the GPU, allocating 16.5GB of memory. This is exactly the same amount of GPU vRAM as when I turn off the sequential flag and all layers are processed together.
Expected behavior
The expected behavior was to allocate GPU vRAM for only a single layer at a time, since the modifier is supposed to sequentially load and update each layer, offload it, and then load the next layer. This would allocate approximately 2GB of vRAM.
Environment
Include all relevant environment information:
OS: Ubuntu 20.04.6 LTS
Python version: 3.11.8
LLM Compressor version: 0.3.0
ML framework version(s): torch 2.4.0
Other Python package versions: numpy 1.26.4, compressed-tensors 0.8.1, vllm 0.6.3.post1
Other relevant environment information: Tesla V100 32GB, Driver Version: 545.23.08, CUDA Version: 12.3
To Reproduce
Exact steps to reproduce the behavior:
Code to generate the issue:
from transformers import AutoTokenizer
from datasets import Dataset, load_dataset
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.pruning import WandaPruningModifier
import random

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
num_samples = 512
max_seq_len = 1024  # 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepend an instruction-style prompt to each calibration example
preprocess_fn = lambda example: {
    "text": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n{text}".format_map(example)
}

dataset_name = "neuralmagic/LLM_compression_calibration"
dataset = load_dataset(dataset_name, split="train")
ds = dataset.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

recipe = WandaPruningModifier(
    sparsity=0.5,
    sequential_update=True,
)

oneshot(
    model=model_id,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
    output_dir="~/models/Meta-Llama-3.1-70B-Instruct-pruned_0.5",
)
Errors
No Errors
Additional context
Hi @hafezmg48, apologies for the late response here.
The expected behavior was to allocate GPU vRAM for only a single layer at a time, since the modifier is supposed to sequentially load and update each layer, offload it, and then load the next layer. This would allocate approximately 2GB of vRAM.
What you've described here is not actually what the sequential_update option does. sequential_update does NOT perform layer-wise CPU offloading. Instead, this option reduces memory by allocating the Hessians for each layer only as they are needed. Hessians are large buffers required by the GPTQ algorithm.
If you want to perform layer-wise CPU offloading, you can use from_pretrained(device_map="cpu"), which offloads the entire model to CPU and onloads only one layer at a time. For more information on device maps and the CPU offloading enabled by accelerate, see Accelerate/BigModelInference.
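A minimal sketch of that approach, reusing the ds, recipe, max_seq_len, and num_samples variables from the script above and the SparseAutoModelForCausalLM/oneshot interfaces it already imports (exact keyword names should be checked against your llmcompressor version):

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

# Load all weights onto CPU; accelerate then onloads one layer at a time
# to the GPU during calibration rather than keeping the full model resident.
model = SparseAutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="cpu",
    torch_dtype="auto",
)

oneshot(
    model=model,  # pass the preloaded model object rather than the model id string
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
    output_dir="~/models/Meta-Llama-3.1-70B-Instruct-pruned_0.5",
)

How much peak vRAM this actually saves depends on how the calibration forward passes are dispatched, so treat this as a starting point rather than a guaranteed 2GB footprint.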