Describe the bug
I wanted to use the DeepSpeed activation checkpointing parameters
"activation_checkpointing": {
"partition_activations": true,
"contiguous_memory_optimization": true,
"cpu_checkpointing": true
},
in my accelerate job, but I couldn't see them being used; in fact, whether I set the gradient_checkpointing parameter in my model config to true or false, memory usage stayed the same. That is why I wanted to try activation checkpointing from DeepSpeed to avoid OOM errors. I followed the idea in huggingface/accelerate#2160, which is also mentioned in the DeepSpeed documentation here: https://github.com/huggingface/transformers/blob/92d1d97c05a01160d6e7fcf4198e93bf2cec0dfe/docs/source/en/deepspeed.md#L4
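If I read the DeepSpeed docs correctly, this config block only drives DeepSpeed's own checkpoint API, not torch.utils.checkpoint, which would explain why toggling gradient_checkpointing alone changed nothing. A minimal sketch of activating the flags explicitly, assuming the JSON above is saved at a hypothetical ds_config.json (hedged, not a verified fix):

```python
# Hedged sketch: feed the activation_checkpointing section above to
# DeepSpeed's checkpointing module so partition_activations /
# cpu_checkpointing are actually applied to its checkpoint API.
import deepspeed

deepspeed.checkpointing.configure(
    mpu_=None,                          # no model parallelism in this job
    deepspeed_config="ds_config.json",  # hypothetical path to the config above
)
```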
See point 2 below, quoted from that doc, about replacing torch.utils.checkpoint with the DeepSpeed activation checkpoint.
Activation/gradient checkpointing
Activation and gradient checkpointing trades speed for more GPU memory, which allows you to overcome scenarios where your GPU is out of memory or to increase your batch size for better performance. To enable this feature:
For a Hugging Face model, call model.gradient_checkpointing_enable() or pass --gradient_checkpointing to the Trainer.
For a non-Hugging Face model, use the DeepSpeed Activation Checkpointing API. You could also modify the Transformers modeling code, replacing torch.utils.checkpoint with the DeepSpeed API. This approach is more flexible because you can offload the forward activations to CPU memory instead of recalculating them (a sketch of this swap follows below).
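For concreteness, here is a minimal, self-contained sketch of that swap; the module and shapes are made up for illustration, and this only mirrors the idea behind my change in huggingface/transformers#30915, not the actual diff:

```python
# Minimal sketch of replacing torch.utils.checkpoint with the DeepSpeed
# activation checkpointing API. Requires a CUDA device; the layer and
# tensor shapes are illustrative only.
import torch
import torch.nn as nn

# before: from torch.utils.checkpoint import checkpoint
from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint

layer = nn.Linear(16, 16).cuda()
x = torch.randn(2, 4, 16, device="cuda", requires_grad=True)

# Same call shape as torch.utils.checkpoint: checkpoint(function, *args)
y = checkpoint(layer, x)
y.sum().backward()
```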
Expected behavior
I just needed the accelerate job to run after making the change in huggingface/transformers#30915.
Instead I got this error:
Traceback (most recent call last):
File "/workspace/cookbook-internal/recipes/tune/instruct_lora/finetune.py", line 223, in _app
trainer = train(
File "/workspace/cookbook-internal/recipes/tune/common/trainer.py", line 156, in train
trainer.train(resume_from_checkpoint=False)
File "/workspace/cookbook-internal/transformers/src/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/workspace/cookbook-internal/transformers/src/transformers/trainer.py", line 2216, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/workspace/cookbook-internal/transformers/src/transformers/trainer.py", line 3238, in training_step
loss = self.compute_loss(model, inputs)
File "/workspace/cookbook-internal/transformers/src/transformers/trainer.py", line 3264, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1855, in forward
loss = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1582, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 1083, in forward
return self.base_model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1582, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/peft/tuners/tuners_utils.py", line 161, in forward
return self.model.forward(*args, **kwargs)
File "/workspace/cookbook-internal/transformers/src/transformers/models/llama/modeling_llama.py", line 1164, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1582, in _call_impl
result = forward_call(*args, **kwargs)
File "/workspace/cookbook-internal/transformers/src/transformers/models/llama/modeling_llama.py", line 957, in forward
layer_outputs = self._gradient_checkpointing_func(
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 995, in
checkpoint
CheckpointFunction.apply(function, all_outputs, *args)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 598, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 566, in
forward
outputs = run_function(*inputs_cuda)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1582, in _call_impl
result = forward_call(*args, **kwargs)
File "/workspace/cookbook-internal/transformers/src/transformers/models/llama/modeling_llama.py", line 713, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1582, in _call_impl
result = forward_call(*args, **kwargs)
File "/workspace/cookbook-internal/transformers/src/transformers/models/llama/modeling_llama.py", line 414, in forward
bsz, q_len, _ = hidden_states.size()
ValueError: not enough values to unpack (expected 3, got 2)
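For what it's worth, the unpack that fails expects a 3-D [batch, seq_len, hidden] tensor, so the layer is apparently receiving a 2-D hidden_states once the DeepSpeed checkpoint wrapper is in the path; my unverified guess is that this relates to partition_activations flattening or slicing the inputs. A tiny repro of just the failing unpack (not of the DeepSpeed behavior that produces the 2-D tensor):

```python
# LlamaAttention.forward unpacks three dims from hidden_states.size();
# a 2-D tensor triggers exactly the ValueError in the traceback.
import torch

hidden_states = torch.randn(8, 4096)      # 2-D, as apparently received
try:
    bsz, q_len, _ = hidden_states.size()  # expects [batch, seq_len, hidden]
except ValueError as e:
    print(e)  # not enough values to unpack (expected 3, got 2)
```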
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
I am using accelerate from Hugging Face. It worked when I used torch.utils.checkpoint, but after the change in huggingface/transformers#30915 to replace it with checkpoint from deepspeed.runtime.activation_checkpointing.checkpointing, it does not work.
Docker context
Are you using a specific docker image that you can share?
It will be difficult, given that I work for a company.
Additional context
Add any other context about the problem here.