
Sequential_update flag not reducing GPU memory usage #995

Closed
hafezmg48 opened this issue Dec 19, 2024 · 1 comment
Labels
bug Something isn't working

@hafezmg48
Describe the bug
I am using llmcompressor to prune a Llama 3.1-8B model and set the sequential_update=True flag so the model is pruned sequentially, but the whole model still loads onto the GPU, allocating 16.5 GB of memory. This is exactly the same amount of GPU vRAM as when I turn off the sequential flag and all layers are processed together.

Expected behavior
The expected behavior was that GPU vRAM would only be allocated for a single layer at a time, since the flag is supposed to load and update each layer sequentially, offloading it before loading the next one. This would allocate roughly 2 GB of vRAM.

Environment
Include all relevant environment information:

  1. OS: Ubuntu 20.04.6 LTS
  2. Python version: 3.11.8
  3. LLM Compressor version: 0.3.0
  4. ML framework version(s): torch 2.4.0
  5. Other Python package versions: numpy 1.26.4, compressed-tensors 0.8.1, vllm 0.6.3.post1
  6. Other relevant environment information: Tesla V100 32GB, Driver Version: 545.23.08, CUDA Version: 12.3

To Reproduce
Exact steps to reproduce the behavior:
Code to generate the issue:

import random

from datasets import Dataset, load_dataset
from transformers import AutoTokenizer

from llmcompressor.modifiers.pruning import WandaPruningModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

model = "meta-llama/Meta-Llama-3.1-8B-Instruct"

num_samples = 512
max_seq_len = 1024 #8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

preprocess_fn = lambda example: {"text": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n{text}".format_map(example)}

dataset_name = "neuralmagic/LLM_compression_calibration"
dataset = load_dataset(dataset_name, split="train")
ds = dataset.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

recipe = WandaPruningModifier(
    sparsity=0.5,
    sequential_update=True,
)


oneshot(
  model=model_id,
  dataset=ds,
  recipe=recipe,
  max_seq_length=max_seq_len,
  num_calibration_samples=num_samples,
  output_dir="~/models/Meta-Llama-3.1-70B-Instruct-pruned_0.5"
)

Errors
No Errors

Additional context
(screenshot attached in the original issue)

hafezmg48 added the bug label on Dec 19, 2024
@kylesayrs
Collaborator

Hi @hafezmg48, apologies for the late response here.

The expected behavior was that GPU vRAM would only be allocated for a single layer at a time, since the flag is supposed to load and update each layer sequentially, offloading it before loading the next one. This would allocate roughly 2 GB of vRAM.

What you've described here is not actually what the sequential_update option does. sequential_update does NOT perform layer-wise CPU offloading. Instead, this option reduces memory by allocating Hessians only for the layer currently being processed, as they are needed. Hessians are large buffers needed by the GPTQ algorithm.
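
To put rough numbers on this (a back-of-the-envelope sketch, not from the original reply): assuming one float32 Hessian of shape in_features x in_features per Linear layer, as in the standard GPTQ formulation, and the published Llama-3.1-8B dimensions, the per-layer and whole-model Hessian footprints look roughly like this:

# Back-of-the-envelope estimate of GPTQ Hessian memory for Llama-3.1-8B.
# Assumes one float32 (in_features x in_features) Hessian per Linear layer.
hidden_size = 4096         # input dim of the q/k/v/o/gate/up projections
intermediate_size = 14336  # input dim of the down projection
num_layers = 32
bytes_per_float = 4

per_layer_bytes = (6 * hidden_size**2 + intermediate_size**2) * bytes_per_float
total_bytes = per_layer_bytes * num_layers

print(f"Hessians for one decoder layer: {per_layer_bytes / 1e9:.1f} GB")  # ~1.2 GB
print(f"Hessians for all 32 layers at once: {total_bytes / 1e9:.1f} GB")  # ~39 GB

With sequential_update=True, only the first number needs to be resident at any one time; with it off, Hessian allocations can approach the second.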

If you want to perform layer-wise CPU offloading, you can use from_pretrained(device_map="cpu"), which keeps the entire model on CPU and onloads only one layer at a time. For more information on device maps and the CPU offloading enabled by accelerate, see Accelerate/BigModelInference.
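
As a minimal sketch (not verbatim from this reply; the model id and dtype argument below are only illustrative, and from_pretrained here simply forwards to the Hugging Face loader), loading the model on CPU before calling oneshot looks like:

from llmcompressor.transformers import SparseAutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Keep all weights in CPU memory; a layer is moved to the GPU only while
# it is being processed, then offloaded again.
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cpu",
    torch_dtype="auto",
)

# Pass the loaded model object (instead of a model-id string) to oneshot:
# oneshot(model=model, dataset=ds, recipe=recipe, ...)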
