
Releases: huggingface/accelerate

v0.25.0: safetensors by default, new trackers, and plenty of bug fixes

01 Dec 15:24

Safetensors default

As of this release, safetensors is the default format used when saving checkpoints (whenever applicable)! To read more about safetensors and why it's safer to use than pickle/torch.save, check it out here
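
For example, saving a checkpoint through the Accelerator now writes .safetensors files rather than pickled .bin files. A minimal sketch (the model and checkpoint directory name are illustrative):

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = accelerator.prepare(torch.nn.Linear(8, 2))

# Weights are now serialized with safetensors by default
accelerator.save_state("checkpoint_dir")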

New Experiment Trackers

This release has two new experiment trackers, ClearML and DVCLive!

To use them, just pass clear_ml or dvclive to log_with in the Accelerator init. h/t to @eugen-ajechiloae-clearml and @dberenbaum
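
A minimal sketch of wiring up one of the new trackers (shown with dvclive; the project name and metric are illustrative):

from accelerate import Accelerator

accelerator = Accelerator(log_with="dvclive")
accelerator.init_trackers(project_name="my_experiment")

# Metrics are forwarded to the selected tracker
accelerator.log({"train_loss": 0.42}, step=1)
accelerator.end_training()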

DeepSpeed

  • Accelerate's DeepSpeed integration now supports NPU devices, h/t to @statelesshz
  • DeepSpeed can now be launched via accelerate on single-GPU setups (a minimal launch command follows below)
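
For the single-GPU case, a minimal launch sketch (the script name is illustrative):

accelerate launch --use_deepspeed --num_processes=1 train.py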

FSDP

FSDP underwent a huge refactor so that the interface when using FSDP is exactly the same as in every other scenario when using accelerate. No more needing to call accelerator.prepare() twice!
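
A minimal sketch of the unified interface, with FSDP itself configured as usual through accelerate config:

import torch
from accelerate import Accelerator

accelerator = Accelerator()  # FSDP configured via `accelerate config` / env vars
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# One prepare() call, exactly as with DDP or DeepSpeed
model, optimizer = accelerator.prepare(model, optimizer)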

Other useful enhancements

  • We now detect and try to disable P2P communications on consumer GPUs from the 3090 series onward. Without this, users were seeing timeout issues and the like, as NVIDIA has dropped P2P support on these cards. When using accelerate launch we disable it automatically, and if we detect that it is still enabled on a distributed setup using 3090s or newer, we raise an error.

  • When calling .gather(), if tensors are on different devices we now explicitly raise an error (for now this check only applies to CUDA)

Bug fixes

  • Fixed a bug that caused dataloaders to not shuffle despite shuffle=True when using multiple GPUs and the new SeedableRandomSampler.

General Changelog

New Contributors

Full Changelog: v0.24.1...v0.25.0

v0.24.1: Patch Release for Samplers

30 Oct 14:12
  • Fixes #2091 by changing how the check for custom samplers is done

v0.24.0: Improved Reproducibility, Bug fixes, and other Small Improvements

24 Oct 17:37

Improved Reproducibility

One critical issue with Accelerate was that training runs differed when using an iterable dataset, no matter what seeds were set. v0.24.0 introduces the dataloader.set_epoch() function on all Accelerate DataLoaders: if the underlying dataset (or sampler) has the ability to set the epoch for reproducibility, it will do so. This is similar to the implementation already existing in transformers. To use:

dataloader = accelerator.prepare(dataloader)
# Say we want to resume at epoch/iteration 2
dataloader.set_epoch(2)
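
In a full training loop, the natural place for this call is at the top of each epoch. A sketch, assuming dataloader was prepared as above:

num_epochs = 3
for epoch in range(num_epochs):
    # Re-seed the underlying sampler/dataset so shuffling is reproducible
    dataloader.set_epoch(epoch)
    for batch in dataloader:
        pass  # forward/backward as usual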

For more information see this PR; we will update the docs in a subsequent release with more details on this API.

Documentation

  • The quick tour docs have gotten a complete makeover thanks to @MKhalusova. Take a look here
  • We also now have documentation on how to perform multinode training, see the launch docs

Internal structure

  • Shared file systems are now supported under save and save_state via the ProjectConfiguration dataclass. See #1953 for more info.
  • FSDP can now be used for bfloat16 mixed precision via torch.autocast
  • all_gather_into_tensor is now used as the main gather operation, reducing memory in the cases of big tensors
  • Specifying drop_last=True will now properly have the desired effect when performing Accelerator().gather_for_metrics() (see the sketch after this list)
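
To illustrate the last bullet, a sketch of a typical evaluation loop (accelerator, model and eval_dataloader are assumed already prepared, with the dataloader built with drop_last=True):

import torch

all_preds = []
for batch in eval_dataloader:
    with torch.no_grad():
        logits = model(batch["input_ids"])
    # gather_for_metrics strips the duplicates added to fill uneven batches,
    # so metrics see each sample exactly once, now also respecting drop_last
    all_preds.append(accelerator.gather_for_metrics(logits.argmax(dim=-1)))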

What's Changed

New Contributors

Full Changelog: v0.23.0...v0.24.0

v0.23.0: Model Memory Estimation tool, Breakpoint API, Multi-Node Notebook Launcher Support, and more!

14 Sep 19:23

Model Memory Estimator

A new model estimation tool to help calculate how much memory is needed for inference has been added. It does not download the pretrained weights, and utilizes init_empty_weights to stay memory-efficient during the calculation.

Usage directions:

accelerate estimate-memory {model_name} --library {library_name} --dtypes fp16 int8

Or:

from accelerate.commands.estimate import estimate_command_parser, gather_data

# Build the same argument parser the CLI uses, then collect the size estimates
parser = estimate_command_parser()
args = parser.parse_args(["bert-base-cased", "--dtypes", "float32"])
output = gather_data(args)

🤗 Hub is a first-class citizen

We've made the huggingface_hub library a first-class citizen of the framework! While this is mainly for the model estimation tool, it opens the door to further integrations should they be wanted

Accelerator Enhancements:

  • gather_for_metrics will now also de-dupe for non-tensor objects. See #1937
  • mixed_precision="bf16" support on NPU devices. See #1949
  • New breakpoint API to help when you need to break out of a loop based on a condition met on a single process. See #1940 (a sketch follows this list)
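
A sketch of the breakpoint API, assuming the set_trigger()/check_trigger() pair introduced in that PR (torch, loss and accelerator are assumed in scope; the NaN condition is illustrative):

# Any process can flag a condition locally...
if torch.isnan(loss):
    accelerator.set_trigger()

# ...and all processes then agree on whether anyone flagged it
if accelerator.check_trigger():
    break  # every rank exits the loop together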

Notebook Launcher Enhancements:

  • The notebook launcher now supports launching across multiple nodes! See #1913
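
A sketch of a multi-node launch from a notebook, assuming the num_nodes/node_rank parameters added in that PR (addresses and counts are illustrative; run the same cell on every node with its own node_rank):

from accelerate import notebook_launcher

def training_loop():
    ...  # build the Accelerator, model, etc. inside the launched function

notebook_launcher(
    training_loop,
    num_processes=8,          # assumption: total processes across nodes
    num_nodes=2,
    node_rank=0,              # 0 on the master node, 1 on the other node
    master_addr="10.0.0.1",   # illustrative
)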

FSDP Enhancements:

  • Activation checkpointing is now natively supported in the framework. See #1891
  • torch.compile support was fixed. See #1919

DeepSpeed Enhancements:

  • XPU/ccl support (#1827)
  • Easier gradient accumulation support: simply set gradient_accumulation_steps to "auto" in your DeepSpeed config, and Accelerate will use the value passed to Accelerator instead (#1901) (see the config sketch after this list)
  • Support for custom schedulers and deepspeed optimizers (#1909)
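
For the gradient accumulation bullet, a config sketch, assuming DeepSpeedPlugin accepts an inline config dict via hf_ds_config:

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# "auto" in the DeepSpeed config defers to the value given to Accelerator
ds_plugin = DeepSpeedPlugin(hf_ds_config={
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": "auto",
})
accelerator = Accelerator(deepspeed_plugin=ds_plugin, gradient_accumulation_steps=4)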

What's Changed

New Contributors

Full Changelog: v0.22.0...v0.23.0

v0.22.0: Distributed operation framework, Gradient Accumulation enhancements, FSDP enhancements, and more!

23 Aug 06:26

Experimental distributed operations checking framework

A new framework has been introduced which can help catch timeout errors caused by distributed operations failing before they occur. As this adds a tiny bit of overhead, it is opt-in. Simply run your code with ACCELERATE_DEBUG_MODE="1" to enable it. Read more in the docs; introduced via #1756
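
Enabling it is a one-line change to the launch command (the script name is illustrative):

ACCELERATE_DEBUG_MODE="1" accelerate launch train.py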

Accelerator.load_state can now load the most recent checkpoint automatically

If a ProjectConfiguration has been made, using accelerator.load_state() (without any arguments passed) can now automatically find and load the latest checkpoint used, introduced via #1741
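
A sketch of resuming from the newest checkpoint, assuming automatic checkpoint naming is enabled on the ProjectConfiguration:

from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

config = ProjectConfiguration(project_dir="runs", automatic_checkpoint_naming=True)
accelerator = Accelerator(project_config=config)

accelerator.save_state()  # writes runs/checkpoints/checkpoint_0, _1, ...
accelerator.load_state()  # no argument: finds and loads the latest checkpoint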

Multiple enhancements to gradient accumulation

In this release multiple new enhancements to distributed gradient accumulation have been added.

  • accelerator.accumulate() now supports passing in multiple models, introduced via #1708 (see the sketch after this list)
  • A util has been introduced to perform multiple forwards, then multiple backwards, and finally sync the gradients only on the last .backward() via #1726
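
A sketch of the multi-model form of accumulate() (model1, model2, optimizer, loss_fn and batch are assumed in scope and already prepared):

# Gradients for both models sync only when the accumulation window closes
with accelerator.accumulate(model1, model2):
    loss = loss_fn(model2(model1(batch)))
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()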

FSDP Changes

  • FSDP support has been added for NPU and XPU devices via #1803 and #1806
  • A new method for supporting RAM-efficient loading of models with FSDP has been added via #1777

DataLoader Changes

  • Custom slice functions are now supported in the DataLoaderDispatcher added via #1846

What's New?

Significant community contributions

The following contributors have made significant changes to the library over the last release:

Full Changelog: v0.21.0...v0.22.0

v0.21.0: Model quantization and NPUs

13 Jul 16:51

Model quantization with bitsandbytes

You can now quantize any model (not just Transformer models) using Accelerate. This is mainly useful for models with a lot of linear layers. See the documentation for more information!
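
A sketch following the quantization utilities described in that documentation, assuming BnbQuantizationConfig and load_and_quantize_model as the entry points (the model and weights path are illustrative):

import torch
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model

with init_empty_weights():
    empty_model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))  # any nn.Module

bnb_config = BnbQuantizationConfig(load_in_8bit=True)
model = load_and_quantize_model(
    empty_model,
    weights_location="path/to/weights",  # illustrative
    bnb_quantization_config=bnb_config,
    device_map="auto",
)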

Support for Ascend NPUs

Accelerate now supports Ascend NPUs.

What's new?

Accelerate now requires Python 3.8+ and PyTorch 1.10+:

Significant community contributions

The following contributors have made significant changes to the library over the last release:

v0.20.3: Patch release

08 Jun 16:19
  • Reset dataloader end_of_dataloader at each iter in #1562 by @sgugger

v0.20.2: Patch release

08 Jun 13:24
  • fix the typo when setting the "_accelerator_prepared" attribute in #1560 by @Yura52
  • [core] Fix possibility to pass NoneType objects in prepare in #1561 by @younesbelkada

v0.20.1: Patch release

07 Jun 19:34
  • Avoid double wrapping of all accelerate.prepare objects by @muellerzr in #1555
  • Fix load_state_dict when there is one device and disk by @sgugger in #1557

v0.20.0: MPS and fp4 support on Big Model Inference, 4-bit QLoRA, Intel GPU, Distributed Inference, and much more!

07 Jun 19:33

Big model inference

Support has been added to run device_map="auto" on the MPS device. Big model inference also works with models loaded in 4-bit in Transformers.

4-bit QLoRA Support

Distributed Inference Utilities

This version introduces a new Accelerator.split_between_processes utility to help with performing distributed inference with non-tensorized or non-dataloader workflows. Read more here
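
A sketch of the new utility with a plain list of inputs (each process receives its own slice):

from accelerate import Accelerator

accelerator = Accelerator()
with accelerator.split_between_processes(["cat", "dog", "bird"]) as inputs:
    # On 2 processes: rank 0 gets ["cat", "dog"], rank 1 gets ["bird"]
    print(f"rank {accelerator.process_index}: {inputs}")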

Introduce XPU support for Intel GPU

Add support for the new PyTorch XLA TPU runtime

  • Accelerate now supports the latest TPU runtimes #1393, #1385

A new optimizer method: LocalSGD

  • This is a new wrapper around SGD which enables efficient multi-GPU training in the case when no fast interconnect is possible by @searchivarius in #1378
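
A sketch of the wrapper, assuming the accelerate.local_sgd.LocalSGD context manager from that PR (accelerator, model, optimizer and dataloader are assumed already prepared):

from accelerate.local_sgd import LocalSGD

# Gradients sync across workers only every `local_sgd_steps` optimizer steps
with LocalSGD(accelerator=accelerator, model=model,
              local_sgd_steps=8, enabled=True) as local_sgd:
    for batch in dataloader:
        loss = model(batch).loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        local_sgd.step()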

Papers with 🤗 Accelerate

  • We now have an entire section of the docs dedicated to official paper implementations and citations using the framework #1399, see it live here

Breaking changes

logging_dir has been fully deprecated, please use project_dir or a ProjectConfiguration
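
In code, the migration is a one-line swap (directory names are illustrative):

from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

# Before: Accelerator(logging_dir="logs")  -- now removed
accelerator = Accelerator(
    project_config=ProjectConfiguration(project_dir="runs", logging_dir="runs/logs")
)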

What's new?

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @will-cromar
    • Support TPU v4 with new PyTorch/XLA TPU runtime (#1393)
    • Support TPU v2 and v3 on new PyTorch/XLA TPU runtime (#1385)
  • @searchivarius
    • Adding support for local SGD. (#1378)
  • @abhilash1910
    • Intel GPU support initialization (#1118)
    • Fix bug on ipex for diffusers (#1426)
    • Refactor and simplify xpu device in state (#1456)
    • NVME path support for deepspeed (#1484)
  • @sywangyi
    • fix error for CPU DDP using trainer api. (#1455)
    • fix crash when ipex is installed and torch has no xpu (#1502)
    • should set correct dtype to ipex optimize and use amp logic in native… (#1511)
    • remove ipexplugin, let ACCELERATE_USE_IPEX/ACCELERATE_USE_XPU control the ipex and xpu (#1503)
  • @Ethan-yt
    • Fix gradient state bugs in multiple dataloader (#1483)