v0.20.0: MPS and fp4 support on Big Model Inference, 4-bit QLoRA, Intel GPU, Distributed Inference, and much more!
Big model inference
Support has been added to run `device_map="auto"` on the MPS device. Big model inference also works with models loaded in 4-bit in Transformers; a short sketch follows the list below.
- Add mps support to big inference modeling by @SunMarc in #1545
- Adds fp4 support for model dispatching by @younesbelkada in #1505
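As a minimal sketch of the idea (the checkpoint name is illustrative, not from the release notes):

```python
# device_map="auto" lets Accelerate place weights automatically,
# now including the MPS device on Apple silicon.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # illustrative checkpoint, swap in your own
    device_map="auto",
)

# On CUDA machines, the same dispatching also works for 4-bit models
# (bitsandbytes requires a CUDA GPU):
# model = AutoModelForCausalLM.from_pretrained(
#     "facebook/opt-350m", device_map="auto", load_in_4bit=True
# )
```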
4-bit QLoRA Support
- 4-bit QLoRA via bitsandbytes (4-bit base model + LoRA) by @TimDettmers in #1458
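A hedged sketch of what this enables, assuming transformers, peft, and bitsandbytes are installed (model name and hyperparameters are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit NormalFloat (NF4, from the QLoRA paper).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",            # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach trainable LoRA adapters on top of the 4-bit base model.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))
model.print_trainable_parameters()  # only the adapters require gradients
```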
Distributed Inference Utilities
This version introduces a new `Accelerator.split_between_processes`
utility to help with performing distributed inference with non-tensorized or non-dataloader workflows. Read more here
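A minimal sketch (the prompts are placeholders; launch across several processes with `accelerate launch`):

```python
from accelerate import Accelerator

accelerator = Accelerator()

# Each process receives its own slice of the list, so plain Python objects
# (not just dataloaders or tensors) can be sharded for inference.
with accelerator.split_between_processes(["a dog", "a cat", "a bird"]) as prompts:
    # With 2 processes: process 0 gets ["a dog", "a cat"], process 1 gets ["a bird"].
    for prompt in prompts:
        print(f"process {accelerator.process_index}: {prompt}")
```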
Introduce XPU support for Intel GPU
- Intel GPU support initialization by @abhilash1910 in #1118
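As a rough sketch, assuming a machine with an Intel GPU and intel_extension_for_pytorch (IPEX) set up:

```python
import torch
from accelerate import Accelerator

# The Accelerator now resolves its device to XPU on Intel GPU hardware.
accelerator = Accelerator()
print(accelerator.device)  # e.g. xpu:0 on an Intel GPU

x = torch.randn(2, 2, device=accelerator.device)
```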
Add support for the new PyTorch XLA TPU runtime
A new optimizer method: LocalSGD
- This is a new wrapper around SGD which enables efficient multi-GPU training in the case when no fast interconnect is possible by @searchivarius in #1378
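A self-contained sketch of the wrapper (the model, data, and `local_sgd_steps` value are toy choices for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.local_sgd import LocalSGD

# Toy model and data, just to make the sketch runnable.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 2)), batch_size=8)

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Parameters are synchronized across workers only every `local_sgd_steps`
# steps, cutting communication cost when the interconnect is slow.
with LocalSGD(accelerator=accelerator, model=model, local_sgd_steps=8, enabled=True) as local_sgd:
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        local_sgd.step()  # counts steps and triggers the periodic sync
```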
Papers with 🤗 Accelerate
- We now have an entire section of the docs dedicated to official paper implementations and citations using the framework #1399, see it live here
Breaking changes
`logging_dir` has been fully deprecated; please use `project_dir` or a `ProjectConfiguration` instead.
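A migration sketch (`my_logs` is a placeholder path):

```python
from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration

# Before (no longer supported):
# accelerator = Accelerator(logging_dir="my_logs")

# After: pass project_dir directly...
accelerator = Accelerator(project_dir="my_logs")

# ...or wrap it in a ProjectConfiguration for finer control.
accelerator = Accelerator(project_config=ProjectConfiguration(project_dir="my_logs"))
```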
What's new?
- use existing mlflow experiment if exists by @Rusteam in #1403
- changes required for DS integration by @pacman100 in #1406
- fix deepspeed failing tests by @pacman100 in #1411
- Make mlflow logging dir optional by @mattplo-decath in #1413
- Fix bug on ipex for diffusers by @abhilash1910 in #1426
- Improve Slack Updater by @muellerzr in #1433
- Let quality yell at the user if it's a version difference by @muellerzr in #1438
- Ensure that it gets installed by @muellerzr in #1439
- [core] Introducing `CustomDtype` enum for custom dtypes by @younesbelkada in #1434
- Fix XPU by @muellerzr in #1440
- Make sure torch compiled model can also be unwrapped by @patrickvonplaten in #1437
- fixed: ZeroDivisionError: division by zero by @sreio in #1436
- fix potential OOM when resuming with multi-GPU training by @exhyy in #1444
- Fixes in infer_auto_device_map by @sgugger in #1441
- Raise error when logging improperly by @muellerzr in #1446
- Fix ci by @muellerzr in #1447
- Distributed prompting/inference utility by @muellerzr in #1410
- Add to by @muellerzr in #1448
- split_between_processes by @stevhliu in #1449
- [docs] Replace `state.rank` -> `process_index` by @pcuenca in #1450
- Auto multigpu logic by @muellerzr in #1452
- Update with cli instructions by @muellerzr in #1453
- Adds `in_order` argument that defaults to False, to log in order. by @JulesGM in #1262
- fix error for CPU DDP using trainer api. by @sywangyi in #1455
- Refactor and simplify xpu device in state by @abhilash1910 in #1456
- Document how to use commands with python module instead of argparse by @muellerzr in #1457
- 4-bit QLoRA via bitsandbytes (4-bit base model + LoRA) by @TimDettmers in #1458
- Fix skip first batch being permanent by @muellerzr in #1466
- update conversion of layers to retain original data type. by @avisinghal6 in #1467
- Check for xpu specifically by @muellerzr in #1472
- update `register_empty_buffer` to match torch args by @NouamaneTazi in #1465
- Update gradient accumulation docs, and remove redundant example by @iantbutler01 in #1461
- Improve sagemaker by @muellerzr in #1470
- Split tensors as part of `split_between_processes` by @muellerzr in #1477
- Move to device by @muellerzr in #1478
- Fix gradient state bugs in multiple dataloader by @Ethan-yt in #1483
- Add rdzv-backend by @muellerzr in #1490
- Only use IPEX if available by @muellerzr in #1495
- Update README.md by @lyhue1991 in #1493
- Let gather_for_metrics always run by @muellerzr in #1496
- Use empty like when we only need to create buffers by @thomasw21 in #1497
- Allow key skipping in big model inference by @sgugger in #1491
- fix crash when ipex is installed and torch has no xpu by @sywangyi in #1502
- [bnb] Add fp4 support for dispatch by @younesbelkada in #1505
- Fix 4bit model on multiple devices by @SunMarc in #1506
- adjust overriding of model's forward function by @prathikr in #1492
- Add assertion when call prepare with deepspeed config. by @tensimiku in #1468
- NVME path support for deepspeed by @abhilash1910 in #1484
- should set correct dtype to ipex optimize and use amp logic in native… by @sywangyi in #1511
- Swap env vars for XPU and IPEX + CLI by @muellerzr in #1513
- Fix a bug when parameters tied belong to the same module by @sgugger in #1514
- Fixup deepspeed/cli tests by @muellerzr in #1526
- Refactor mp into its own wrapper by @muellerzr in #1527
- Check tied parameters by @SunMarc in #1529
- Raise ValueError on iterable dataset if we've hit the end and attempting to go beyond it by @muellerzr in #1531
- Officially support naive PP for quantized models + PEFT by @younesbelkada in #1523
- remove ipexplugin, let ACCELERATE_USE_IPEX/ACCELERATE_USE_XPU control the ipex and xpu by @sywangyi in #1503
- Prevent using extra VRAM for static device_map by @LSerranoPEReN in #1536
- Update deepspeed.mdx by @LiamSwayne in #1541
- Update performance.mdx by @LiamSwayne in #1543
- Update deferring_execution.mdx by @LiamSwayne in #1544
- Apply deprecations by @muellerzr in #1537
- Add mps support to big inference modeling by @SunMarc in #1545
- [documentation] grammar fixes in gradient_synchronization.mdx by @LiamSwayne in #1547
- Eval mode by @muellerzr in #1540
- Update migration.mdx by @LiamSwayne in #1549
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @will-cromar
- @searchivarius
- Adding support for local SGD. (#1378)
- @abhilash1910
- @sywangyi
- @Ethan-yt
- Fix gradient state bugs in multiple dataloader (#1483)