Releases: huggingface/accelerate
v0.25.0: safetensors by default, new trackers, and plenty of bug fixes
Safetensors default
As of this release, `safetensors` will be the default format saved when applicable! To read more about safetensors and why it's best to use it for safety (and not pickle/`torch.save`), check it out here.
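As a quick illustration, here is a minimal sketch of what the new default means when checkpointing (the `safe_serialization` keyword is assumed from #2120/#2138; the model and path are illustrative):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = accelerator.prepare(torch.nn.Linear(8, 2))

# Checkpoints are now written as .safetensors when applicable;
# safe_serialization=False (assumed keyword) falls back to torch.save (pickle).
accelerator.save_state("checkpoints/step_0")
```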
New Experiment Trackers
This release has two new experiment trackers, ClearML and DVCLive!
To use them, just pass `clear_ml` or `dvclive` to `log_with` in the `Accelerator` init. h/t to @eugen-ajechiloae-clearml and @dberenbaum
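For example, a minimal sketch with the DVCLive tracker (the ClearML tracker is enabled the same way, using the string named above; the project name and metric are illustrative):

```python
from accelerate import Accelerator

accelerator = Accelerator(log_with="dvclive")
accelerator.init_trackers(project_name="my-experiment")
accelerator.log({"train_loss": 0.42}, step=1)
accelerator.end_training()
```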
DeepSpeed
- Accelerate's DeepSpeed integration now supports NPU devices, h/t to @statelesshz
- DeepSpeed can now be launched via accelerate on single GPU setups
FSDP
FSDP had a huge refactoring so that the interface when using FSDP is the exact same as every other scenario when using `accelerate`. No more needing to call `accelerator.prepare()` twice!
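A minimal sketch of the unified interface (assuming FSDP was configured via `accelerate config`; the toy model is illustrative):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # FSDP settings come from the accelerate config
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# A single prepare() call now covers FSDP too; no second call for the optimizer.
model, optimizer = accelerator.prepare(model, optimizer)
```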
Other useful enhancements
- We now raise and try to disable P2P communications on consumer GPUs for the 3090 series and beyond. Without this, users were seeing timeout issues and the like as NVIDIA dropped P2P support. If using `accelerate launch` we will automatically disable it, and if we sense that it is still enabled on distributed setups using 3090s or newer, we will raise an error.
- When doing `.gather()`, if tensors are on different devices we now explicitly raise an error (for now only valid on CUDA); see the sketch below.
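A minimal sketch of a `gather()` call that satisfies the new check (every process passes a tensor on its own accelerator device):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
# Tensors must live on the expected device; mismatches now raise instead of hanging.
local = torch.tensor([accelerator.process_index], device=accelerator.device)
gathered = accelerator.gather(local)
```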
Bug fixes
- Fixed a bug that caused dataloaders to not shuffle despite `shuffle=True` when using multiple GPUs and the new `SeedableRandomSampler`.
General Changelog
- Add logs offloading by @SunMarc in #2075
- Add ClearML tracker by @eugen-ajechiloae-clearml in #2034
- CRITICAL: fix failing ci by @muellerzr in #2088
- Fix flag typo by @kuza55 in #2090
- Fix batch sampler by @muellerzr in #2097
- fixed ip address typo by @Fluder-Paradyne in #2099
- Fix memory leak in fp8 causing OOM (and potentially 3x vRAM usage) by @muellerzr in #2089
- fix warning when offload by @SunMarc in #2105
- Always use SeedableRandomSampler by @muellerzr in #2110
- Fix issue with tests by @muellerzr in #2111
- Make SeedableRandomSampler the default always by @muellerzr in #2117
- Use "and" instead of comma in Bibtex citation by @qgallouedec in #2119
- Add explicit error if empty batch received by @YuryYakhno in #2115
- Allow for ACCELERATE_SEED env var by @muellerzr in #2126
- add DeepSpeed support for NPU by @statelesshz in #2054
- Sync states for npu fsdp by @jq460494839 in #2113
- Fix import error when torch>=2.0.1 and torch.distributed is disabled by @natsukium in #2121
- Make safetensors the default by @muellerzr in #2120
- Raise error when saving with param on meta device by @SunMarc in #2132
- Leave native `save` as `False` by @muellerzr in #2138
- fix retie_parameters by @SunMarc in #2137
- Deal with shared memory scenarios by @muellerzr in #2136
- specify config file path on README by @kwonmha in #2140
- Fix safetensors contiguous by @SunMarc in #2145
- Fix more tests by @muellerzr in #2146
- [docs] fixed a couple of broken links by @MKhalusova in #2147
- [docs] troubleshooting guide by @MKhalusova in #2133
- [Docs] fix doc typos by @kashif in #2150
- Add note about GradientState being in-sync with the dataloader by default by @muellerzr in #2134
- Deprecated runner stuff by @muellerzr in #2152
- Add examples to tests by @muellerzr in #2131
- Disable pypi for merge workflows + fix trainer tests by @muellerzr in #2153
- Adds dvclive tracker by @dberenbaum in #2139
- check port availability only in main deepspeed/torchrun launcher by @Jingru in #2078
- Do not attempt to pad nested tensors by @frankier in #2041
- Add warning for problematic libraries by @muellerzr in #2151
- Add ZeRO++ to DeepSpeed usage docs by @SumanthRH in #2166
- Fix Megatron-LM Arguments Bug by @yuanenming in #2168
- Fix non persistant buffer dispatch by @SunMarc in #1941
- Updated torchrun instructions by @TJ-Solergibert in #2096
- New CI Runners by @muellerzr in #2087
- Revert "New CI Runners" by @muellerzr in #2172
- [Working again] New CI by @muellerzr in #2173
- fsdp refactoring by @pacman100 in #2177
- Pin DVC by @muellerzr in #2196
- Apply DVC warning to Accelerate by @muellerzr in #2197
- Explicitly disable P2P using `launch`, and pick up in `state` if a user will face issues by @muellerzr in #2195
- Better error when device mismatches when calling gather() on CUDA by @muellerzr in #2180
- unpins dvc by @dberenbaum in #2200
- Assemble state dictionary for offloaded models by @blbadger in #2156
- Allow deepspeed without distributed launcher by @pacman100 in #2204
New Contributors
- @eugen-ajechiloae-clearml made their first contribution in #2034
- @kuza55 made their first contribution in #2090
- @Fluder-Paradyne made their first contribution in #2099
- @YuryYakhno made their first contribution in #2115
- @jq460494839 made their first contribution in #2113
- @kwonmha made their first contribution in #2140
- @dberenbaum made their first contribution in #2139
- @Jingru made their first contribution in #2078
- @frankier made their first contribution in #2041
- @yuanenming made their first contribution in #2168
- @TJ-Solergibert made their first contribution in #2096
- @blbadger made their first contribution in #2156
Full Changelog: v0.24.1...v0.25.0
v0.24.1: Patch Release for Samplers
- Fixes #2091 by changing how checking for custom samplers is done
v0.24.0: Improved Reproducibility, Bug fixes, and other Small Improvements
Improved Reproducibility
One critical issue with Accelerate was that training runs with an iterable dataset could not be reproduced, no matter what seeds were set. v0.24.0 introduces the `dataloader.set_epoch()` function on all Accelerate `DataLoaders`: if the underlying dataset (or sampler) has the ability to set the epoch for reproducibility, it will do so. This is similar to the implementation already existing in transformers. To use:

```python
dataloader = accelerator.prepare(dataloader)
# Say we want to resume at epoch/iteration 2
dataloader.set_epoch(2)
```
For more information see this PR; we will update the docs in a subsequent release with more details on this API.
Documentation
- The quick tour docs have gotten a complete makeover thanks to @MKhalusova. Take a look here
- We also now have documentation on how to perform multinode training, see the launch docs
Internal structure
- Shared file systems are now supported under `save` and `save_state` via the `ProjectConfiguration` dataclass. See #1953 for more info.
- FSDP can now be used for `bfloat16` mixed precision via `torch.autocast`
- `all_gather_into_tensor` is now used as the main gather operation, reducing memory in the cases of big tensors
- Specifying `drop_last=True` will now properly have the desired effect when performing `Accelerator().gather_for_metrics()`
What's Changed
- Update big_modeling.md by @kli-casia in #1976
- Fix model copy after `dispatch_model` by @austinapatel in #1971
- FIX: Automatic checkpoint path inference issue by @BenjaminBossan in #1989
- Fix skip first batch for deepspeed example by @SumanthRH in #2001
- [docs] Quick tour refactor by @MKhalusova in #2008
- Add basic documentation for multi node training by @SumanthRH in #1988
- update torch_dynamo backends by @SunMarc in #1992
- Sync states for xpu fsdp by @abhilash1910 in #2005
- update fsdp docs by @pacman100 in #2026
- Enable shared file system with `save` and `save_state` via ProjectConfiguration by @muellerzr in #1953
- Fix save on each node by @muellerzr in #2036
- Allow FSDP to use with `torch.autocast` for bfloat16 mixed precision by @brcps12 in #2033
- Fix DeepSpeed version to <0.11 by @BenjaminBossan in #2043
- Unpin deepspeed by @muellerzr in #2044
- Reduce memory by using `all_gather_into_tensor` by @muellerzr in #1968
- Safely end training even if trackers weren't initialized by @Ben-Epstein in #1994
- Fix integration CI by @muellerzr in #2047
- Make fsdp ram efficient loading optional by @pacman100 in #2037
- Let drop_last modify `gather_for_metrics` by @muellerzr in #2048
- fix docstring by @zhangsibo1129 in #2053
- Fix stalebot by @muellerzr in #2052
- Add space to docs by @muellerzr in #2055
- Fix the error when the "train_batch_size" is absent in DeepSpeed config by @LZHgrla in #2060
- remove unused constants by @statelesshz in #2045
- fix: remove useless token by @rtrompier in #2069
- DOC: Fix broken link to designing a device map by @BenjaminBossan in #2073
- Let iterable dataset shard have a length if implemented by @muellerzr in #2066
- Allow for samplers to be seedable and reproducable by @muellerzr in #2057
- Fix docstring typo by @qgallouedec in #2072
- Warn when kernel version is too low on Linux by @BenjaminBossan in #2077
New Contributors
- @kli-casia made their first contribution in #1976
- @MKhalusova made their first contribution in #2008
- @brcps12 made their first contribution in #2033
- @Ben-Epstein made their first contribution in #1994
- @zhangsibo1129 made their first contribution in #2053
- @LZHgrla made their first contribution in #2060
- @rtrompier made their first contribution in #2069
- @qgallouedec made their first contribution in #2072
Full Changelog: v0.23.0...v0.24.0
v0.23.0: Model Memory Estimation tool, Breakpoint API, Multi-Node Notebook Launcher Support, and more!
Model Memory Estimator
A new model estimation tool to help calculate how much memory is needed for inference has been added. This does not download the pretrained weights, and utilizes `init_empty_weights` to stay memory efficient during the calculation.
Usage directions:
```bash
accelerate estimate-memory {model_name} --library {library_name} --dtypes fp16 int8
```
Or:
```python
from accelerate.commands.estimate import estimate_command_parser, estimate_command, gather_data

parser = estimate_command_parser()
args = parser.parse_args(["bert-base-cased", "--dtypes", "float32"])
output = gather_data(args)
```
🤗 Hub is a first-class citizen
We've made the `huggingface_hub` library a first-class citizen of the framework! While this is mainly for the model estimation tool, this opens the door for further integrations should they be wanted.
Accelerator Enhancements:
- `gather_for_metrics` will now also de-dupe for non-tensor objects. See #1937
- `mixed_precision="bf16"` support on NPU devices. See #1949
- New `breakpoint` API to help when trying to break from a condition on a single process. See #1940 and the sketch below.
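A minimal sketch of the breakpoint API, assuming the `set_trigger()`/`check_trigger()` pair introduced in #1940 (the stopping condition is illustrative):

```python
from accelerate import Accelerator

accelerator = Accelerator()
for step in range(100):
    loss = ...  # training step elided
    # One process hits a stopping condition...
    if accelerator.process_index == 0 and step == 10:
        accelerator.set_trigger()
    # ...and every process sees it and breaks together.
    if accelerator.check_trigger():
        break
```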
Notebook Launcher Enhancements:
- The notebook launcher now supports launching across multiple nodes! See #1913
FSDP Enhancements:
- Activation checkpointing is now natively supported in the framework. See #1891
- `torch.compile` support was fixed. See #1919
DeepSpeed Enhancements:
- XPU/ccl support (#1827)
- Easier gradient accumulation support: simply set `gradient_accumulation_steps` to `"auto"` in your DeepSpeed config, and Accelerate will use the value passed to `Accelerator` instead (#1901); see the sketch below
- Support for custom schedulers and DeepSpeed optimizers (#1909)
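A minimal sketch of the `"auto"` hand-off (the surrounding config values are illustrative only):

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",  # filled in from the Accelerator below
    "zero_optimization": {"stage": 2},
}
accelerator = Accelerator(
    deepspeed_plugin=DeepSpeedPlugin(hf_ds_config=ds_config),
    gradient_accumulation_steps=4,  # this value replaces "auto" in the config
)
```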
What's Changed
- Update release instructions by @sgugger in #1877
- fix detach_hook by @SunMarc in #1880
- Enable power users to bypass device_map="auto" training block by @muellerzr in #1881
- Introduce model memory estimator by @muellerzr in #1876
- Update with new url for explore by @muellerzr in #1884
- Enable a token to be used by @muellerzr in #1886
- Add doc on model memory usage by @muellerzr in #1887
- Add hub as core dep by @muellerzr in #1885
- update import of deepspeed integration from transformers by @pacman100 in #1894
- Final nits on model util by @muellerzr in #1896
- Fix nb launcher test by @muellerzr in #1899
- Add FSDP activation checkpointing feature by @arde171 in #1891
- Solve at least one failing test by @muellerzr in #1898
- Deepspeed integration for XPU/ccl by @abhilash1910 in #1827
- Add PR template by @muellerzr in #1906
- deepspeed grad_acc_steps fixes by @pacman100 in #1901
- Skip pypi transformers until release by @muellerzr in #1911
- Fix docker images by @muellerzr in #1910
- Use hosted CI runners for building docker images by @muellerzr in #1915
- fix: add debug argument to sagemaker configuration by @maximegmd in #1904
- improve help info when run `accelerate config` on npu by @statelesshz in #1895
- support logging with mlflow in case of mlflow-skinny installed by @ghtaro in #1874
- More CI fun - run all test parts always by @muellerzr in #1916
- Expose auto in dataclass by @muellerzr in #1914
- Add support for deepspeed optimizer and custom scheduler by @pacman100 in #1909
- reduce gradient first for XLA when unscaling the gradients in mixed precision training with AMP. by @statelesshz in #1926
- Check for invalid keys by @muellerzr in #1935
- clean num devices by @SunMarc in #1936
- Bring back pypi to runners by @muellerzr in #1939
- Support multi-node notebook launching by @ggaaooppeenngg in #1913
- fix the fsdp docs by @pacman100 in #1947
- Fix docs by @ggaaooppeenngg in #1951
- Protect tensorflow dependency by @SunMarc in #1959
- fix safetensor saving by @SunMarc in #1954
- FIX: patch_environment restores pre-existing environment variables when finished by @BenjaminBossan in #1960
- Better guards for slow imports by @muellerzr in #1963
- [`Tests`] Finish all todos by @younesbelkada in #1957
- Rm strtobool by @muellerzr in #1964
- Implementing gather_for_metrics with dedup for non tensor objects by @Lorenzobattistela in #1937
- add bf16 mixed precision support for NPU by @statelesshz in #1949
- Introduce breakpoint API by @muellerzr in #1940
- fix torch compile with FSDP by @pacman100 in #1919
- Add `force_hooks` to `dispatch_model` by @austinapatel in #1969
- update FSDP and DeepSpeed docs by @pacman100 in #1973
- Flex fix patch for accelerate by @abhilash1910 in #1972
- Remove checkpoints only on main process by @Kepnu4 in #1974
New Contributors
- @arde171 made their first contribution in #1891
- @maximegmd made their first contribution in #1904
- @ghtaro made their first contribution in #1874
- @ggaaooppeenngg made their first contribution in #1913
- @Lorenzobattistela made their first contribution in #1937
- @austinapatel made their first contribution in #1969
- @Kepnu4 made their first contribution in #1974
Full Changelog: v0.22.0...v0.23.0
v0.22.0: Distributed operation framework, Gradient Accumulation enhancements, FSDP enhancements, and more!
Experimental distributed operations checking framework
A new framework has been introduced which can help catch timeout errors caused by distributed operations failing before they occur. As this adds a tiny bit of overhead, it is an opt-in scenario. Simply run your code with `ACCELERATE_DEBUG_MODE="1"` to enable it. Read more in the docs, introduced via #1756.
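A minimal sketch of opting in from inside a script (the env var can equally be set in the shell before launching):

```python
import os

# Opt in before the Accelerator is constructed; adds a small overhead.
os.environ["ACCELERATE_DEBUG_MODE"] = "1"

from accelerate import Accelerator

accelerator = Accelerator()
```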
`Accelerator.load_state` can now load the most recent checkpoint automatically
If a `ProjectConfiguration` has been made, using `accelerator.load_state()` (without any arguments passed) can now automatically find and load the latest checkpoint used, introduced via #1741.
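A minimal sketch (directory name illustrative; `automatic_checkpoint_naming` is assumed to be enabled so checkpoints are numbered):

```python
from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

project_config = ProjectConfiguration(project_dir="runs/demo", automatic_checkpoint_naming=True)
accelerator = Accelerator(project_config=project_config)

accelerator.save_state()  # checkpoints land under runs/demo/checkpoints/
accelerator.load_state()  # with no argument, the latest checkpoint is found and loaded
```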
Multiple enhancements to gradient accumulation
In this release multiple new enhancements to distributed gradient accumulation have been added.
- `accelerator.accumulate()` now supports passing in multiple models, introduced via #1708 (see the sketch after this list)
- A util has been introduced to perform multiple forwards, then multiple backwards, and finally sync the gradients only on the last `.backward()`, via #1726
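A minimal sketch of accumulating over two models at once, per #1708 (the toy models and shapes are illustrative):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
generator = torch.nn.Linear(8, 8)
discriminator = torch.nn.Linear(8, 1)
generator, discriminator = accelerator.prepare(generator, discriminator)

x = torch.randn(2, 8, device=accelerator.device)
# Gradients for both models sync only on the accumulation boundary.
with accelerator.accumulate(generator, discriminator):
    loss = discriminator(generator(x)).mean()
    accelerator.backward(loss)
```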
FSDP Changes
- FSDP support has been added for NPU and XPU devices via #1803 and #1806
- A new method for supporting RAM-efficient loading of models with FSDP has been added via #1777
DataLoader Changes
- Custom slice functions are now supported in the `DataLoaderDispatcher`, added via #1846
What's New?
- fix failing test on 8GPU by @statelesshz in #1724
- Better control over DDP's `no_sync` by @NouamaneTazi in #1726
- Get rid of calling `get_scale()` by patching the step method of optimizer. by @yuxinyuan in #1720
- fix the bug in npu by @statelesshz in #1728
- Adding a shape check for `set_module_tensor_to_device`. by @Narsil in #1731
- Fix errors when optimizer is not a Pytorch optimizer. by @yuxinyuan in #1733
- Make balanced memory able to work with non contiguous GPUs ids by @thomwolf in #1734
- Fixed typo in `__repr__` of AlignDevicesHook by @KacperWyrwal in #1735
- Update docs by @muellerzr in #1736
- Fixed the bug that split dict incorrectly by @yuangpeng in #1742
- Let load_state automatically grab the latest save by @muellerzr in #1741
- fix `KwargsHandler.to_kwargs` not working with `os.environ` initialization in `__post_init__` by @CyCle1024 in #1738
- fix typo by @cauyxy in #1747
- Check for misconfiguration of single node & single GPU by @muellerzr in #1746
- Remove unused constant by @muellerzr in #1749
- Rework new constant for operations by @muellerzr in #1748
- Expose `autocast` kwargs and simplify `autocast` wrapper by @muellerzr in #1740
- Fix FSDP related issues by @pacman100 in #1745
- FSDP enhancements and fixes by @pacman100 in #1753
- Fix check failure in `Accelerator.save_state` using multi-gpu by @CyCle1024 in #1760
- Fix error when `max_memory` argument is in unexpected order by @ranchlai in #1759
- Fix offload on disk when executing on CPU by @sgugger in #1762
- Change `is_aim_available()` function to not match aim >= 4.0.0 by @alberttorosyan in #1769
- Introduce an experimental distributed operations framework by @muellerzr in #1756
- Support wrapping multiple models in Accelerator.accumulate() by @yuxinyuan in #1708
- Contigous on gather by @muellerzr in #1771
- [FSDP] Fix `load_fsdp_optimizer` by @awgu in #1755
- simplify and correct the deepspeed example by @pacman100 in #1775
- Set ipex default in state by @muellerzr in #1776
- Fix import error when torch>=2.0.1 and `torch.distributed` is disabled by @natsukium in #1800
- reserve 10% GPU in `get_balanced_memory` to avoid OOM by @ranchlai in #1798
- add support of float memory size in `convert_file_size_to_int` by @ranchlai in #1799
- Allow users to resume from previous wandb runs with `allow_val_change` by @SumanthRH in #1796
- Add FSDP for XPU by @abhilash1910 in #1803
- Add FSDP for NPU by @statelesshz in #1806
- Fix pytest import by @muellerzr in #1808
- More specific logging in `gather_for_metrics` by @dleve123 in #1784
- Detect device map auto and raise a helpful error when trying to not use model parallelism by @muellerzr in #1810
- Typo fix by @muellerzr in #1812
- Expand device-map warning by @muellerzr in #1819
- Update bibtex to reflect team growth by @muellerzr in #1820
- Improve docs on grad accumulation by @vwxyzjn in #1817
- add warning when using to and cuda by @SunMarc in #1790
- Fix bnb import by @muellerzr in #1813
- Update docs and docstrings to match `load_and_quantize_model` arg by @JonathanRayner in #1822
- Expose a bit of args/docstring fixup by @muellerzr in #1824
- Better test by @muellerzr in #1825
- Minor idiomatic change for fp8 check. by @float-trip in #1829
- Use device as context manager for `init_on_device` by @shingjan in #1826
- Ipex bug fix for device properties in modelling by @abhilash1910 in #1834
- FIX: Bug with `unwrap_model` and `keep_fp32_wrapper=False` by @BenjaminBossan in #1838
- Fix `verify_device_map` by @Rexhaif in #1842
- Change CUDA check by @muellerzr in #1833
- Fix the noneffective parameter: `gpu_ids` (Rel. Issue #1848) by @devymex in #1850
- support for ram efficient loading of model with FSDP by @pacman100 in #1777
- Loading logic safetensors by @SunMarc in #1853
- fix dispatch for quantized model by @SunMarc in #1855
- Update `fsdp_with_peak_mem_tracking.py` by @pacman100 in #1856
- Add env variable for `init_on_device` by @shingjan in #1852
- remove casting to FP32 when saving state dict by @pacman100 in #1868
- support custom slice function in `DataLoaderDispatcher` by @thevasudevgupta in #1846
- Include a note to the forums in the bug report by @muellerzr in #1871
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @yuxinyuan
- @NouamaneTazi
  - Better control over DDP's `no_sync` (#1726)
- @abhilash1910
- @statelesshz
- @thevasudevgupta
  - support custom slice function in `DataLoaderDispatcher` (#1846)
Full Changelog: v0.21.0...v0.22.0
v0.21.0: Model quantization and NPUs
Model quantization with bitsandbytes
You can now quantize any model (not just Transformer models) using Accelerate. This is mainly for models having a lot of linear layers. See the documentation for more information!
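A minimal sketch using the `load_and_quantize_model` utility referenced later in this changelog (the toy model and weights path are illustrative):

```python
import torch
from accelerate import init_empty_weights
from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model

# Build the model skeleton without allocating real weights.
with init_empty_weights():
    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.Linear(1024, 2))

bnb_config = BnbQuantizationConfig(load_in_8bit=True)
# weights_location points at a saved checkpoint for the model (hypothetical path).
model = load_and_quantize_model(
    model, bnb_quantization_config=bnb_config, weights_location="weights/"
)
```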
Support for Ascend NPUs
Accelerate now supports Ascend NPUs.
- Add Ascend NPU accelerator support by @statelesshz in #1676
What's new?
Accelerate now requires Python 3.8+ and PyTorch 1.10+:
- 🚨🚨🚨 Spring cleaning: Python 3.8 🚨🚨🚨 by @muellerzr in #1661
- 🚨🚨🚨 Spring cleaning: PyTorch 1.10 🚨🚨🚨 by @muellerzr in #1662
- Update launch.mdx by @LiamSwayne in #1553
- Avoid double wrapping of all accelerate.prepare objects by @muellerzr in #1555
- Update README.md by @LiamSwayne in #1556
- Fix load_state_dict when there is one device and disk by @sgugger in #1557
- Fix tests not being ran on multi-GPU nightly by @muellerzr in #1558
- fix the typo when setting the "_accelerator_prepared" attribute by @Yura52 in #1560
- [`core`] Fix possibility to pass `NoneType` objects in `prepare` by @younesbelkada in #1561
- Reset dataloader end_of_dataloader at each iter by @sgugger in #1562
- Update big_modeling.mdx by @LiamSwayne in #1564
- [`bnb`] Fix failing int8 tests by @younesbelkada in #1567
- Update gradient sync docs to reflect importance of `optimizer.step()` by @dleve123 in #1565
- Update mixed precision integrations in README by @sgugger in #1569
- Raise error instead of warn by @muellerzr in #1568
- Introduce listify, fix tensorboard silently failing by @muellerzr in #1570
- Check for bak and expand docs on directory structure by @muellerzr in #1571
- Perminant solution by @muellerzr in #1577
- fix the bug in xpu by @mingxiaoh in #1508
- Make sure that we only set is_accelerator_prepared on items accelerate actually prepares by @muellerzr in #1578
- Expand `prepare()` doc by @muellerzr in #1580
- Get Torch version using importlib instead of pkg_resources by @catwell in #1585
- improve oob performance when use mpirun to start DDP finetune without `accelerate launch` by @sywangyi in #1575
- Update training_tpu.mdx by @LiamSwayne in #1582
- Return false if CUDA available by @muellerzr in #1581
- Fix test by @muellerzr in #1586
- Update checkpoint.mdx by @LiamSwayne in #1587
- FSDP updates by @pacman100 in #1576
- Integration tests by @muellerzr in #1593
- Add triggers for CI workflow by @muellerzr in #1597
- Remove asking xpu plugin for non xpu devices by @abhilash1910 in #1594
- reset end_of_dataloader for dataloader_dispatcher by @megavaz in #1609
- fix for arc gpus by @abhilash1910 in #1615
- Ignore low_zero option when only device is available by @sgugger in #1617
- Fix failing multinode tests by @muellerzr in #1616
- Fix tb issue by @muellerzr in #1623
- Fix workflow by @muellerzr in #1625
- Fix transformers sync bug with accumulate by @muellerzr in #1624
- fix: Megatron is not installed. please build it from source. by @yuanwu2017 in #1636
- deepspeed z2/z1 state_dict bloating fix by @pacman100 in #1638
- Swap disable rich by @muellerzr in #1640
- fix autocasting bug by @pacman100 in #1637
- fix modeling low zero by @abhilash1910 in #1634
- Add skorch to runners by @muellerzr in #1646
- Change dispatch_model when we have only one device by @SunMarc in #1648
- Check for port usage before launch by @muellerzr in #1656
- [`BigModeling`] Add missing check for quantized models by @younesbelkada in #1652
- Bump integration by @muellerzr in #1658
- TIL by @muellerzr in #1657
- docker cpu py version by @muellerzr in #1659
- [`BigModeling`] Final fix for dispatch int8 and fp4 models by @younesbelkada in #1660
- remove safetensor dep on shard_checkpoint by @SunMarc in #1664
- change the import place to avoid import error by @pacman100 in #1653
- Update broken Runhouse link in examples/README.md by @dongreenberg in #1668
- Add docs for saving Transformers models by @deppen8 in #1671
- Fix workflow CI by @muellerzr in #1690
- update readme in examples by @statelesshz in #1678
- Fix nightly tests by @muellerzr in #1696
- Fixup docs by @muellerzr in #1697
- Improve quality errors by @muellerzr in #1698
- Move mixed precision wrapping ahead of DDP/FSDP wrapping by @ChenWu98 in #1682
- Deepcopy on Accelerator to return self by @muellerzr in #1694
- Skip tests when bnb isn't available by @muellerzr in #1706
- Fix launcher validation by @abhilash1910 in #1705
- Fixes for issue #1683: failed to run accelerate config in colab by @Erickrus in #1692
- Fix the bug where DataLoaderDispatcher gets stuck in an infinite wait when the dataset is an IterDataPipe during multi-process training. by @yuxinyuan in #1709
- Keep old behavior by @muellerzr in #1716
- Optimize `get_scale` to reduce async calls by @muellerzr in #1718
- Remove duplicate code by @muellerzr in #1717
- New tactic by @muellerzr in #1719
- add Comfy-UI by @pacman100 in #1723
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @LiamSwayne
- @mingxiaoh
  - fix the bug in xpu (#1508)
- @statelesshz
- @ChenWu98
  - Move mixed precision wrapping ahead of DDP/FSDP wrapping (#1682)
v0.20.3: Patch release
v0.20.2: Patch release
- fix the typo when setting the "_accelerator_prepared" attribute in #1560 by @Yura52
- [`core`] Fix possibility to pass `NoneType` objects in `prepare` in #1561 by @younesbelkada
v0.20.1: Patch release
- Avoid double wrapping of all accelerate.prepare objects by @muellerzr in #1555
- Fix load_state_dict when there is one device and disk by @sgugger in #1557
v0.20.0: MPS and fp4 support on Big Model Inference, 4-bit QLoRA, Intel GPU, Distributed Inference, and much more!
Big model inference
Support has been added to run `device_map="auto"` on the MPS device. Big model inference also works with models loaded in 4 bits in Transformers.
- Add mps support to big inference modeling by @SunMarc in #1545
- Adds fp4 support for model dispatching by @younesbelkada in #1505
4-bit QLoRA Support
- 4-bit QLoRA via bitsandbytes (4-bit base model + LoRA) by @TimDettmers in #1458
Distributed Inference Utilities
This version introduces a new `Accelerator.split_between_processes` utility to help with performing distributed inference with non-tensorized or non-dataloader workflows. Read more here.
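A minimal sketch of distributed inference over a plain Python list (the prompts are illustrative):

```python
from accelerate import Accelerator

accelerator = Accelerator()
# Each process receives its own slice of the input list.
with accelerator.split_between_processes(["a cat", "a dog", "a bird"]) as prompts:
    print(f"process {accelerator.process_index}: {prompts}")
```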
Introduce XPU support for Intel GPU
- Intel GPU support initialization by @abhilash1910 in #1118
Add support for the new PyTorch XLA TPU runtime
A new optimizer method: LocalSGD
- This is a new wrapper around SGD which enables efficient multi-GPU training in the case when no fast interconnect is possible by @searchivarius in #1378
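A minimal sketch of the wrapper, assuming the `LocalSGD` context manager and `local_sgd_steps` interval from #1378 (the toy model and loop are illustrative):

```python
import torch
from accelerate import Accelerator
from accelerate.local_sgd import LocalSGD

accelerator = Accelerator()
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

with LocalSGD(accelerator=accelerator, model=model, local_sgd_steps=8, enabled=True) as local_sgd:
    for _ in range(32):
        loss = model(torch.randn(4, 8, device=accelerator.device)).sum()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        local_sgd.step()  # syncs across workers only every `local_sgd_steps` steps
```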
Papers with 🤗 Accelerate
- We now have an entire section of the docs dedicated to official paper implementations and citations using the framework #1399, see it live here
Breaking changes
`logging_dir` has been fully deprecated; please use `project_dir` or a `ProjectConfiguration` instead.
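A minimal sketch of the migration (directory names illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

# Before (deprecated): Accelerator(logging_dir="logs")
accelerator = Accelerator(
    project_config=ProjectConfiguration(project_dir="runs", logging_dir="runs/logs")
)
```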
What's new?
- use existing mlflow experiment if exists by @Rusteam in #1403
- changes required for DS integration by @pacman100 in #1406
- fix deepspeed failing tests by @pacman100 in #1411
- Make mlflow logging dir optional by @mattplo-decath in #1413
- Fix bug on ipex for diffusers by @abhilash1910 in #1426
- Improve Slack Updater by @muellerzr in #1433
- Let quality yell at the user if it's a version difference by @muellerzr in #1438
- Ensure that it gets installed by @muellerzr in #1439
- [`core`] Introducing `CustomDtype` enum for custom dtypes by @younesbelkada in #1434
- Fix XPU by @muellerzr in #1440
- Make sure torch compiled model can also be unwrapped by @patrickvonplaten in #1437
- fixed: ZeroDivisionError: division by zero by @sreio in #1436
- fix potential OOM when resuming with multi-GPU training by @exhyy in #1444
- Fixes in infer_auto_device_map by @sgugger in #1441
- Raise error when logging improperly by @muellerzr in #1446
- Fix ci by @muellerzr in #1447
- Distributed prompting/inference utility by @muellerzr in #1410
- Add to by @muellerzr in #1448
- split_between_processes by @stevhliu in #1449
- [docs] Replace `state.rank` -> `process_index` by @pcuenca in #1450
- Auto multigpu logic by @muellerzr in #1452
- Update with cli instructions by @muellerzr in #1453
- Adds `in_order` argument that defaults to False, to log in order. by @JulesGM in #1262
- fix error for CPU DDP using trainer api. by @sywangyi in #1455
- Refactor and simplify xpu device in state by @abhilash1910 in #1456
- Document how to use commands with python module instead of argparse by @muellerzr in #1457
- 4-bit QLoRA via bitsandbytes (4-bit base model + LoRA) by @TimDettmers in #1458
- Fix skip first batch being perminant by @muellerzr in #1466
- update conversion of layers to retain original data type. by @avisinghal6 in #1467
- Check for xpu specifically by @muellerzr in #1472
- update `register_empty_buffer` to match torch args by @NouamaneTazi in #1465
- Update gradient accumulation docs, and remove redundant example by @iantbutler01 in #1461
- Improve sagemaker by @muellerzr in #1470
- Split tensors as part of `split_between_processes` by @muellerzr in #1477
- Move to device by @muellerzr in #1478
- Fix gradient state bugs in multiple dataloader by @Ethan-yt in #1483
- Add rdzv-backend by @muellerzr in #1490
- Only use IPEX if available by @muellerzr in #1495
- Update README.md by @lyhue1991 in #1493
- Let gather_for_metrics always run by @muellerzr in #1496
- Use empty like when we only need to create buffers by @thomasw21 in #1497
- Allow key skipping in big model inference by @sgugger in #1491
- fix crash when ipex is installed and torch has no xpu by @sywangyi in #1502
- [`bnb`] Add fp4 support for dispatch by @younesbelkada in #1505
- Fix 4bit model on multiple devices by @SunMarc in #1506
- adjust overriding of model's forward function by @prathikr in #1492
- Add assertion when call prepare with deepspeed config. by @tensimiku in #1468
- NVME path support for deepspeed by @abhilash1910 in #1484
- should set correct dtype to ipex optimize and use amp logic in native… by @sywangyi in #1511
- Swap env vars for XPU and IPEX + CLI by @muellerzr in #1513
- Fix a bug when parameters tied belong to the same module by @sgugger in #1514
- Fixup deepspeed/cli tests by @muellerzr in #1526
- Refactor mp into its own wrapper by @muellerzr in #1527
- Check tied parameters by @SunMarc in #1529
- Raise ValueError on iterable dataset if we've hit the end and attempting to go beyond it by @muellerzr in #1531
- Officially support naive PP for quantized models + PEFT by @younesbelkada in #1523
- remove ipexplugin, let ACCELERATE_USE_IPEX/ACCELERATE_USE_XPU control the ipex and xpu by @sywangyi in #1503
- Prevent using extra VRAM for static device_map by @LSerranoPEReN in #1536
- Update deepspeed.mdx by @LiamSwayne in #1541
- Update performance.mdx by @LiamSwayne in #1543
- Update deferring_execution.mdx by @LiamSwayne in #1544
- Apply deprecations by @muellerzr in #1537
- Add mps support to big inference modeling by @SunMarc in #1545
- [documentation] grammar fixes in gradient_synchronization.mdx by @LiamSwayne in #1547
- Eval mode by @muellerzr in #1540
- Update migration.mdx by @LiamSwayne in #1549
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @will-cromar
- @searchivarius
  - Adding support for local SGD. (#1378)
- @abhilash1910
- @sywangyi
- @Ethan-yt
  - Fix gradient state bugs in multiple dataloader (#1483)