Releases: Lightning-AI/pytorch-lightning
Minor patch release: App jobs
App
Fixed
Fabric
Changed
- Enable precision autocast for `LightningModule` step methods in Fabric (#17439)
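For context, a minimal sketch (not from the release notes) of the pattern this change affects: a `LightningModule` step method called through a Fabric-wrapped model now runs under the configured precision's autocast context, just like a plain `forward()` call. The toy module, shapes, and hyperparameters below are illustrative.
import torch
import lightning as L

class LitModel(L.LightningModule):
    # A toy LightningModule; only training_step matters for this example
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

fabric = L.Fabric(accelerator="cuda", devices=1, precision="16-mixed")
fabric.launch()

model = LitModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = fabric.setup(model, optimizer)

batch = fabric.to_device(torch.randn(4, 32))
loss = model.training_step(batch, batch_idx=0)  # now runs under autocast
fabric.backward(loss)
optimizer.step()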
Fixed
- Fixed an issue with `LightningModule.*_step` methods bypassing the DDP/FSDP wrapper (#17424)
- Fixed device handling in `Fabric.setup()` when the model has no parameters (#17441)
PyTorch
Fixed
- Fixed `Model.load_from_checkpoint("checkpoint.ckpt", map_location=map_location)` always returning the model on CPU (#17308)
- Fixed syncing of module states during non-fit stages (#17370)
- Fixed an issue that caused `num_nodes` not to be set correctly for `FSDPStrategy` (#17438)
Contributors
@awaelchli, @Borda, @carmocca, @ethanwharris, @ryan597, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release
App
Changed
Fabric
Changed
- Let `TorchCollective` work on the `torch.distributed` WORLD process group by default (#16995)
Fixed
- Fixed `_cuda_clearCublasWorkspaces` on teardown (#16907)
- Improved the error message for installing tensorboard or tensorboardx (#17053)
PyTorch
Changed
- Changes to the `NeptuneLogger` (#16761):
  - It now supports neptune-client 0.16.16 and neptune >=1.0, and we have replaced the `log()` method with `append()` and `extend()`.
  - It now accepts a namespace `Handler` as an alternative to `Run` for the `run` argument. This means that you can call it like `NeptuneLogger(run=run["some/namespace"])` to log everything to the `some/namespace/` location of the run (see the sketch after this list).
- Allow `sys.argv` and args in `LightningCLI` (#16808)
- Moved HPU broadcast override to the HPU strategy file (#17011)
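A minimal sketch of the new namespace-handler usage described above, assuming neptune >=1.0; the project name is a placeholder:
import neptune
from lightning.pytorch.loggers import NeptuneLogger

run = neptune.init_run(project="my-workspace/my-project")  # placeholder project

# Passing a namespace Handler instead of the Run itself routes everything
# the logger writes under "some/namespace/" within the run
logger = NeptuneLogger(run=run["some/namespace"])

# Metrics logged through the logger land at e.g. "some/namespace/train/loss"
logger.log_metrics({"train/loss": 0.42})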
Deprecated
- Removed registration of `ShardedTensor` state dict hooks in `LightningModule.__init__` with `torch>=2.1` (#16892)
- Removed the `lightning.pytorch.core.saving.ModelIO` class interface (#16974)
Fixed
- Fixed `num_nodes` not being set for `DDPFullyShardedNativeStrategy` (#17160)
- Fixed parsing the precision config for inference in `DeepSpeedStrategy` (#16973)
- Fixed the availability check for `rich` that prevented Lightning from being imported in Google Colab (#17156)
- Fixed `_cuda_clearCublasWorkspaces` on teardown (#16907)
- The `psutil` package is now required for CPU monitoring (#17010)
- Improved the error message for installing tensorboard or tensorboardx (#17053)
Contributors
@awaelchli, @belerico, @carmocca, @colehawkins, @dmitsf, @Erotemic, @ethanwharris, @kshitij12345, @Borda
If we forgot someone due to not matching commit email with GitHub account, let us know :]
2.0.1 appendix
App
Fixed
- Fixed frontend hosts when running with multi-process in the cloud (#17324)
Fabric
No changes.
PyTorch
Fixed
- Made the `is_picklable` function more robust (#17270)
Contributors
@eng-yue @ethanwharris @Borda @awaelchli @carmocca
If we forgot someone due to not matching commit email with GitHub account, let us know :]
2.0.1 patch release
App
No changes.
Fabric
Changed
- Generalized `Optimizer` validation to accommodate both FSDP 1.x and 2.x (#16733)
PyTorch
Changed
- Pickling the `LightningModule` no longer pickles the `Trainer` (#17133)
- Generalized `Optimizer` validation to accommodate both FSDP 1.x and 2.x (#16733)
- Disabled `torch.inference_mode` with `torch.compile` in PyTorch 2.0 (#17215)
Fixed
- Fixed issue where pickling the module instance would fail with a DataLoader error (#17130)
- Fixed WandbLogger not showing "best" aliases for model checkpoints when `ModelCheckpoint(save_top_k>0)` is used (#17121)
- Fixed the availability check for `rich` that prevented Lightning from being imported in Google Colab (#17156)
- Fixed parsing the precision config for inference in `DeepSpeedStrategy` (#16973)
- Fixed an issue where `torch.compile` would fail when logging to WandB (#17216)
Contributors
@Borda @williamFalcon @lightningforever @adamjstewart @carmocca @tshu-w @saryazdi @parambharat @awaelchli @colehawkins @woqidaideshi @md-121 @yhl48 @gkroiz @idc9 @speediedan
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Lightning 2.0: Fast, Flexible, Stable
Lightning AI is excited to announce the release of Lightning 2.0 ⚡
Over the last couple of years, PyTorch Lightning has become the preferred deep learning framework for researchers and ML developers around the world, with close to 50 million downloads and 18k OSS projects, from top universities to leading labs.
With the help of over 800 contributors, we have added many features and functionalities to make it the most complete research toolkit possible, but some of these changes also introduced issues:
- API changes to the trainer
- Trainer code became harder to follow
- Many integrations made Lightning appear bloated
- The trainer became harder to customize, taking away control over details users need to tweak
To make the research experience better, we are introducing 2.0:
- No API changes - We commit to backward compatibility in the 2.0 series
- Simplified abstraction layers, removed legacy functionality, integrations out of the main repo. This improves the project's readability and debugging experience.
- Introducing Fabric. Scale any PyTorch model with just a few lines of code. Read on!
Highlights
PyTorch 2.0 and torch.compile
Lightning 2.0 is best friends with PyTorch 2.0. You can `torch.compile` your LightningModules now!
import torch
import lightning as L
model = LitModel()  # your LightningModule subclass
# This will compile forward and {training,validation,test,predict}_step
compiled_model = torch.compile(model)
trainer = L.Trainer()
trainer.fit(compiled_model)
PyTorch reports that on average, "models runs 43% faster in training on an NVIDIA A100 GPU. At Float32 precision, it runs 21% faster on average and at AMP Precision it runs 51% faster on average" (source). If you want to learn more about `torch.compile` and how such speedups can be achieved, read the official PyTorch 2.0 blog post.
Automatic accelerator selection (#16847)
The `Trainer` now chooses `accelerator="auto", strategy="auto", devices="auto"` as defaults. This automatically detects the best hardware on your system (TPUs, GPUs, Apple Silicon, etc.) and chooses as many devices as are available.
import lightning as L
# Selects accelerator, devices and strategy automatically!
trainer = L.Trainer()
# Same as:
trainer = L.Trainer(accelerator="auto", strategy="auto", devices="auto")
For example, on an 8-GPU server, this will implicitly select `Trainer(accelerator="cuda", strategy="ddp", devices=8)`.
Support for arbitrary iterables (#16726)
Previously, the Trainer only supported DataLoader-like iterables. With this release, users can work with any iterable that implements the Python iterable protocol. This includes custom data structures, such as user-defined classes and generators, as well as built-in Python objects.
To use this new feature, return any iterable (or collection of iterables) from the dataloader hooks.
def train_dataloader(self):
    # return a DataLoader, as before
    return DataLoader(...)

    # or any other Python iterable, e.g. a plain list
    return list(range(1000))

    # pass loaders as a dict. This will create batches like this:
    # {'a': batch_from_loader_a, 'b': batch_from_loader_b}
    return {"a": DataLoader(...), "b": DataLoader(...)}

    # pass loaders as list. This will create batches like this:
    # [batch_from_dl_1, batch_from_dl_2]
    return [DataLoader(...), DataLoader(...)]

    # arbitrary nesting
    # {'a': [batch_from_dl_1, batch_from_dl_2], 'b': [batch_from_dl_3, batch_from_dl_4]}
    return {"a": [dl1, dl2], "b": [dl3, dl4]}
Read our data section for more information.
Redesigned multi-dataloader support (#16743, #16784, #16939)
Lightning automatically collates the batches from multiple iterables based on a "mode". This is done with our newly revamped `CombinedLoader` class.
from lightning.pytorch.utilities import CombinedLoader
iterables = {"a": DataLoader(), "b": DataLoader()}
# Lightning uses this under the hood, but this way you can change the "mode"
combined_loader = CombinedLoader(iterables, mode="min_size")
model = ...
trainer = Trainer()
trainer.fit(model, combined_loader)
The following modes are supported:
- `min_size`: stops after the shortest iterable (the one with the fewest items) is done.
- `max_size_cycle`: stops after the longest iterable (the one with the most items) is done, while cycling through the rest of the iterables.
- `max_size`: stops after the longest iterable (the one with the most items) is done, while returning None for the exhausted iterables.
- `sequential`: completely consumes each iterable sequentially, and returns a triplet `(data, idx, iterable_idx)`.
If you need a different "mode", feel free to open a feature request! Adding new modes is now much simpler. These improvements also allowed us to simplify the trainer's loops by abstracting this logic inside the `CombinedLoader`.
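To make the modes concrete, here is a minimal sketch of iterating a `CombinedLoader` directly, outside the Trainer (the toy loader lengths are illustrative, and direct iteration is assumed to work as in the docs):
from torch.utils.data import DataLoader
from lightning.pytorch.utilities import CombinedLoader

# "a" yields 2 batches of 4 items, "b" yields 3 batches of 4 items
iterables = {
    "a": DataLoader(range(8), batch_size=4),
    "b": DataLoader(range(12), batch_size=4),
}

# min_size: stops with the shortest iterable, so 2 combined batches
combined = CombinedLoader(iterables, mode="min_size")
for batch in combined:
    print(batch)  # {"a": tensor(...), "b": tensor(...)}

# sequential: consumes "a" fully, then "b", yielding
# (data, idx, iterable_idx) triplets, 5 steps in total
combined = CombinedLoader(iterables, mode="sequential")
for data, idx, iterable_idx in combined:
    print(iterable_idx, idx, data)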
Barebones Trainer mode (#16854)
A new Trainer argument `Trainer(barebones=...)` was added (default is False) to disable all features that may impact the raw speed of the training loop. This allows users to quickly and fairly compare the runtime of a Lightning script with a raw PyTorch script.
This is how you enable it:
import lightning as L
# Default: False
trainer = L.Trainer(barebones=True)
A message informs about the changed settings:
You are running in `Trainer(barebones=True)` mode. All features that may impact raw speed have been disabled to facilitate analyzing the Trainer overhead. Specifically, the following features are deactivated:
- Checkpointing: `Trainer(enable_checkpointing=True)`
- Progress bar: `Trainer(enable_progress_bar=True)`
- Model summary: `Trainer(enable_model_summary=True)`
- Logging: `Trainer(logger=True)`, `Trainer(log_every_n_steps>0)`, `LightningModule.log(...)`, `LightningModule.log_dict(...)`
- Sanity checking: `Trainer(num_sanity_val_steps>0)`
- Development run: `Trainer(fast_dev_run=True)`
- Anomaly detection: `Trainer(detect_anomaly=True)`
- Profiling: `Trainer(profiler=...)`
Tip: This feature is also very useful for unit testing!
Better progress bar (#16695)
Based on feedback from users, we decided to separate the training progress bar from the validation bar. This greatly improves the time estimates (since validation is usually faster) and resolves confusion around the total batches being processed in an epoch.
This is how the bar looked in versions before 2.0:
Epoch 3: 21%|██ | 28/128 [00:36<01:32, 23.12it/s, loss=0.163]
Validation DataLoader 0: 38%|███ | 12/32 [00:12<00:20, 1.01s/it]
Note how the total batch count (128) is the sum of the training batches (32) and the three validation runs (3 x 32). And this is how the progress bar looks now:
Epoch 3: 50%|█████ | 16/32 [00:36<01:32, 23.12it/s]
Validation DataLoader 0: 38%|███ | 12/32 [00:12<00:20, 1.01s/it]
Note how the batch counts are now separate. The training progress bar pauses until validation is completed.
Lightning Fabric
Lightning 2.0 is the official release for Lightning Fabric 🎉
Fabric is the fast and lightweight way to scale PyTorch models without boilerplate code.
- Easily switch from running on CPU to GPU (Apple Silicon, CUDA, ...), TPU, multi-GPU or even multi-node training
- State-of-the-art distributed training strategies (DDP, FSDP, DeepSpeed) and mixed precision out of the box
- Handles all the boilerplate device logic for you
- Brings useful tools to help you build a trainer (callbacks, logging, checkpoints, ...)
- Designed with multi-billion parameter models in mind
📖 Go to Fabric documentation 📖
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
+ from lightning.fabric import Fabric
class PyTorchModel(nn.Module):
...
class PyTorchDataset(Dataset):
...
+ fabric = Fabric(accelerator="cuda", devices=8, strategy="ddp")
+ fabric.launch()
- device = "cuda" if torch.cuda.is_available() else "cpu"
model = PyTorchModel(...)
optimizer = torch.optim.SGD(model.parameters())
+ model, optimizer = fabric.setup(model, optimizer)
dataloader = DataLoader(PyTorchDataset(...), ...)
+ dataloader = fabric.setup_dataloaders(dataloader)
model.train()
for epoch in range(num_epochs):
for ba...
Weekly patch release
App
Removed
- Removed implicit UI testing with `testing.run_app_in_cloud` in favor of headless login and app selection (#16741)
Fabric
Added
- Added `Fabric(strategy="auto")` support (#16916)
Fixed
- Fixed edge cases in parsing device ids using NVML (#16795)
- Fixed DDP spawn hang on TPU Pods (#16844)
- Fixed an error when passing `find_usable_cuda_devices(num_devices=-1)` (#16866)
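For reference, a short sketch of how `find_usable_cuda_devices` is typically used; the device counts here are illustrative:
from lightning.fabric import Fabric
from lightning.fabric.accelerators import find_usable_cuda_devices

# Ask for two GPUs that are free (not occupied by other processes)
devices = find_usable_cuda_devices(2)

# num_devices=-1 (the case fixed here) returns all usable GPUs
all_devices = find_usable_cuda_devices(-1)

fabric = Fabric(accelerator="cuda", devices=devices)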
PyTorch
Added
- Added `Fabric(strategy="auto")` support. It will choose DDP over DDP-spawn, contrary to `strategy=None` (default) (#16916)
Fixed
- Fixed DDP spawn hang on TPU Pods (#16844)
- Fixed edge cases in parsing device ids using NVML (#16795)
- Fixed backwards compatibility for `lightning.pytorch.utilities.parsing.get_init_args` (#16851)
Contributors
@ethanwharris, @carmocca, @awaelchli, @justusschock, @dtuit, @Liyang90
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Lightning 2.0 Release Candidate
Full Changelog: 1.9.0...2.0.0rc0
Weekly patch release
App
Fixed
- Fixed `lightning open` command and improved redirects (#16794)
Fabric
Fixed
- Fixed an issue causing a wrong environment plugin to be selected when `accelerator=tpu` and `devices > 1` (#16806)
- Fixed parsing of defaults for `--accelerator` and `--precision` in Fabric CLI when `accelerator` and `precision` are set to non-default values in the code (#16818)
PyTorch
Fixed
- Fixed an issue causing a wrong environment plugin to be selected when `accelerator=tpu` and `devices > 1` (#16806)
Contributors
@ethanwharris, @carmocca, @awaelchli, @Borda, @tchaton, @yurijmikhalevich
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Weekly patch release
App
Added
- Added Storage Commands (#16740)
  - `rm`: Delete files from your Cloud Platform Filesystem
- Added `lightning connect data` to register data connections to private S3 buckets (#16738)
Fabric
Fixed
- Fixed an attribute error and improved input validation for invalid strategy types being passed to Fabric (#16693)
PyTorch
Changed
- Disabled strict loading in multiprocessing launcher ("ddp_spawn", etc.) when loading weights back into the main process (#16365)
Fixed
- Fixed an attribute error and improved input validation for invalid strategy types being passed to Trainer (#16693)
- Fixed early stopping triggering extra validation runs after reaching `min_epochs` or `min_steps` (#16719)
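For context, a minimal sketch (using standard Lightning APIs; the monitored metric and thresholds are illustrative) of the configuration this fix concerns: with both early stopping and `min_epochs` set, training now stops cleanly once the stopping condition is met after the minimum is reached, without extra validation runs.
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import EarlyStopping

trainer = Trainer(
    callbacks=[EarlyStopping(monitor="val_loss", patience=3)],
    min_epochs=10,  # early stopping cannot end training before this
)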
Contributors
@akihironitta, @awaelchli, @Borda, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Weekly patch release
App
Added
- Added `lightning open` command (#16482)
- Added experimental support for interruptible GPU in the cloud (#16399)
- Added FileSystem abstraction to simplify manipulating files (#16581)
- Added Storage Commands (#16606)
  - `ls`: List files from your Cloud Platform Filesystem
  - `cd`: Change the current directory within your Cloud Platform filesystem (terminal session based)
  - `pwd`: Return the current folder in your Cloud Platform Filesystem
  - `cp`: Copy files between your Cloud Platform Filesystem and local filesystem
- Prevented `cd` into non-existent folders (#16645)
- Enabled `cp` (upload) at project level (#16631)
- Enabled `ls` and `cp` (download) at project level (#16622)
- Added `lightning connect data` to register data connections to S3 buckets (#16670)
- Added support for running with multiprocessing in the cloud (#16624)
- Initial plugin server (#16523)
- Connect and Disconnect node (#16700)
Changed
- Changed the default `LightningClient(retry=False)` to `retry=True` (#16382)
- Added support for an async predict method in PythonServer and removed the torch context (#16453)
- Renamed `lightning.app.components.LiteMultiNode` to `lightning.app.components.FabricMultiNode` (#16505)
- Changed the command `lightning connect` to `lightning connect app` for consistency (#16670)
- Refactored cloud dispatch and updated to the new API (#16456)
- Updated app URLs to the latest format (#16568)
Fixed
- Fixed a deadlock causing apps not to exit properly when running locally (#16623)
- Fixed the Drive `root_folder` not being parsed properly (#16454)
- Fixed malformed path when downloading files using `lightning cp` (#16626)
- Fixed app name in URL (#16575)
Fabric
Fixed
- Fixed error handling for `accelerator="mps"` and `ddp` strategy pairing (#16455)
- Fixed strict availability check for `torch_xla` requirement (#16476)
- Fixed an issue where PL would wrap DataLoaders with XLA's MpDeviceLoader more than once (#16571)
- Fixed the batch_sampler reference for DataLoaders wrapped with XLA's MpDeviceLoader (#16571)
- Fixed an import error when `torch.distributed` is not available (#16658)
PyTorch
Fixed
- Fixed an unintended limitation for calling `save_hyperparameters` on mixin classes that don't subclass `LightningModule`/`LightningDataModule` (#16369)
- Fixed an issue with `MLFlowLogger` logging the wrong keys with `.log_hyperparams()` (#16418)
- Fixed logging more than 100 parameters with `MLFlowLogger`, where long values were truncated (#16451)
- Fixed strict availability check for `torch_xla` requirement (#16476)
- Fixed an issue where PL would wrap DataLoaders with XLA's MpDeviceLoader more than once (#16571)
- Fixed the batch_sampler reference for DataLoaders wrapped with XLA's MpDeviceLoader (#16571)
- Fixed an import error when `torch.distributed` is not available (#16658)
Contributors
@akihironitta, @awaelchli, @Borda, @BrianPulfer, @ethanwharris, @hhsecond, @justusschock, @Liyang90, @RuRo, @senarvi, @shenoynikhil, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]