Releases: Lightning-AI/pytorch-lightning
Patch release v2.2.3
PyTorch
Fixed
- Fixed WandbLogger.log_hyperparameters() raising an error if hyperparameters are not JSON serializable (#19769)
Fabric
No Changes.
Full Changelog: 2.2.2...2.2.3
Patch release v2.2.2
PyTorch
Fixed
- Fixed an issue causing a TypeError when using torch.compile as a decorator (#19627)
- Fixed a KeyError when saving an FSDP sharded checkpoint and setting save_weights_only=True (#19524)
Fabric
Fixed
- Fixed an issue causing a TypeError when using torch.compile as a decorator (#19627)
- Fixed an issue where some model methods couldn't be monkeypatched after being wrapped by Fabric (#19705)
- Fixed an issue causing weights to be reset in Fabric.setup() when using FSDP (#19755)
Full Changelog: 2.2.1...2.2.2
Contributors
@ankitgola005 @awaelchli @Borda @carmocca @dmitsf @dvoytan-spark @fnhirwa
Patch release v2.2.1
PyTorch
Fixed
- Fixed an issue with CSVLogger trying to append to file from a previous run when the version is set manually (#19446)
- Fixed the divisibility check for Trainer.accumulate_grad_batches and Trainer.log_every_n_steps in ThroughputMonitor (#19470)
- Fixed support for Remote Stop and Remote Abort with NeptuneLogger (#19130)
- Fixed infinite recursion error in precision plugin graveyard (#19542)
Fabric
Fixed
- Fixed an issue with CSVLogger trying to append to file from a previous run when the version is set manually (#19446)
Full Changelog: 2.2.0.post0...2.2.1
Contributors
@Raalsky @awaelchli @carmocca @Borda
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor release correction
Full Changelog: 2.2.0...2.2.0.post0
Lightning v2.2
Lightning AI is excited to announce the release of Lightning 2.2 ⚡
Did you know? The Lightning philosophy extends beyond a boilerplate-free deep learning framework: We've been hard at work bringing you Lightning Studio. Code together, prototype, train, deploy, host AI web apps. All from your browser, with zero setup.
While our previous release was packed with many big new features, this time around we're rolling out mainly improvements based on feedback from the community. And of course, as the name implies, this release fully supports the latest PyTorch 2.2 🎉
Highlights
Monitoring Throughput
Lightning now has built-in utilities to measure throughput metrics such as batches/sec, samples/sec and Model FLOP Utilization (MFU) (#18848).
Trainer:
For the Trainer, this comes in the form of a ThroughputMonitor callback. In order to track samples/sec, you need to provide a function that tells the monitor how to extract the batch dimension from your input. Furthermore, if you want to track MFU, you can provide a sample forward pass and the ThroughputMonitor will automatically estimate the utilization based on the hardware you are running on:
import torch
import lightning as L
from lightning.pytorch.callbacks import ThroughputMonitor
from lightning.fabric.utilities.throughput import measure_flops


class MyModel(L.LightningModule):
    def setup(self, stage):
        with torch.device("meta"):
            model = MyModel()

            def sample_forward():
                batch = torch.randn(..., device="meta")
                return model(batch)

            self.flops_per_batch = measure_flops(model, sample_forward, loss_fn=torch.Tensor.sum)


throughput = ThroughputMonitor(
    batch_size_fn=lambda batch: batch.size(0),
    # optional, if your samples have a length (like number of tokens)
    length_fn=lambda batch: batch.size(1),
)
trainer = L.Trainer(log_every_n_steps=10, callbacks=throughput, logger=...)
model = MyModel()
trainer.fit(model)
The results get automatically sent to the logger if one is configured on the Trainer.
Fabric:
For Fabric, the ThroughputMonitor is a simple utility object on which you call .update() and compute_and_log() during the training loop:
from time import time

import torch
import lightning as L
from lightning.fabric.utilities import ThroughputMonitor

fabric = L.Fabric(logger=...)
throughput = ThroughputMonitor(fabric)

t0 = time()
for batch_idx, batch in enumerate(train_dataloader):
    do_work()
    torch.cuda.synchronize()  # required or else time() won't be correct
    throughput.update(
        time=(time() - t0),
        batches=batch_idx,
        samples=(batch_idx * batch_size),
    )
    if batch_idx % 10 == 0:
        throughput.compute_and_log(step=batch_idx)
Check out our TinyLlama LLM pretraining script for a full example using Fabric's ThroughputMonitor.
The throughput utilities can report:
- batches per second (per process and across processes)
- samples per second (per process and across processes)
- items per second (e.g. tokens) (per process and across processes)
- flops per second (per process and across processes)
- model flops utilization (MFU) (per process)
- total time, total samples, total batches, and total items (per process)
Improved Handling of Evaluation Mode
When you train a model and have validation enabled, the Trainer automatically calls .eval() when transitioning to the validation loop and .train() when validation ends. Until now, this had the unfortunate side effect that any submodules in your LightningModule that were in evaluation mode got reset to train mode. In Lightning 2.2, the Trainer now captures the mode of every submodule before switching to validation and restores those modes when validation ends (#18951). This improvement helps users avoid silent correctness bugs and removes boilerplate code for managing frozen layers.
import lightning as L


class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.trainable_module = ...
        # This will now stay in eval mode
        self.frozen_module = ...
        self.frozen_module.eval()

    def training_step(self, batch):
        # Previously, modules were all in train mode
        # Now: modules are in the mode they were set up with
        assert self.trainable_module.training
        assert not self.frozen_module.training
        ...

    def validation_step(self, batch):
        # All modules are in eval mode
        ...


model = LitModel()
trainer = L.Trainer()
trainer.fit(model)
If you have overridden any of the LightningModule.on_{validation,test,predict}_model_{eval,train} hooks, they will still get called and execute your custom logic, but they are no longer required if you added them to preserve the eval mode of frozen modules.
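If you do override one of these hooks, a minimal sketch of what that can look like (the body shown here is illustrative, not the framework's default implementation):
import lightning as L


class FrozenBackboneModel(L.LightningModule):
    def on_validation_model_eval(self):
        # Custom override: still called in 2.2, but no longer needed just to
        # keep frozen submodules in eval mode.
        self.eval()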
Important
In some libraries, for example HuggingFace, models are created in evaluation mode by default (e.g. HFModel.from_pretrained(...)). Starting from 2.2, you will have to call .train() on these models if you intend to train them.
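For example, a minimal sketch assuming a transformers model (the checkpoint name is only illustrative):
from transformers import AutoModel

backbone = AutoModel.from_pretrained("bert-base-uncased")  # loaded in eval mode by default
backbone.train()  # required from Lightning 2.2 onward if you intend to fine-tune it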
Converting FSDP Checkpoints
In the previous release, we introduced distributed checkpointing with FSDP to speed up saving and loading checkpoints for big models. These checkpoints use a special format, saved as a folder with the shards from each GPU in separate files. While such checkpoints can easily be loaded back with the Lightning Trainer or Fabric, they aren't easy to load or process externally. In Lightning 2.2, we introduced a CLI utility that lets you consolidate the checkpoint folder into a single file that can be loaded with raw PyTorch via torch.load(), for example (#19213).
Given you saved a distributed checkpoint, you can then convert it like so:
# For Trainer checkpoints:
python -m lightning.pytorch.utilities.consolidate_checkpoint path/to/my/checkpoint
# For Fabric checkpoints:
python -m lightning.fabric.utilities.consolidate_checkpoint path/to/my/checkpoint
Read more about distributed checkpointing in our documentation: Trainer, Fabric.
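After consolidation, the resulting file can be loaded like any regular checkpoint. A minimal sketch, assuming model is your instantiated module; the output path shown is hypothetical and depends on the arguments you pass to the utility:
import torch

# Hypothetical output path; the actual file name depends on how you ran the utility
checkpoint = torch.load("path/to/my/checkpoint.consolidated", map_location="cpu")

# Trainer checkpoints keep the model weights under "state_dict";
# Fabric checkpoints use the keys you passed to fabric.save()
model.load_state_dict(checkpoint["state_dict"])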
Improvements to Compiling DDP/FSDP in Fabric
PyTorch 2.0+ introduced torch.compile, a powerful tool to speed up your models without changing the code.
We have now added a comprehensive guide on how to use torch.compile correctly, with tips and tricks to help you troubleshoot common issues. On top of that, Fabric.setup() will now reapply torch.compile on top of DDP/FSDP if you enable these strategies (#19280).
import torch
import lightning as L

# Select a distributed strategy (DDP, FSDP, ...)
fabric = L.Fabric(strategy="ddp", devices=8)

# Compile your model before `.setup()`
model = torch.compile(model)

# Now automatically handles compiling also over DDP/FSDP
model = fabric.setup(model)

# You can opt out if it is causing trouble
model = fabric.setup(model, _reapply_compile=False)
You might see fewer graph breaks, but there won't be any significant speed-ups with this. We introduced it mainly to make Fabric ready for future improvements in PyTorch for optimizing distributed operations.
Saving and Loading DataLoader State
If you use a dataloader/iterable that implements the .state_dict() and .load_state_dict() interface, the Trainer will now automatically save and load its state in the checkpoint (#19361).
import lightning as L
class MyDataLoa...
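The example above is truncated in this announcement; below is a minimal sketch of an iterable implementing the described interface (the class and the tracked state are illustrative):
from torch.utils.data import DataLoader


class StatefulLoader:
    """Illustrative iterable exposing the state_dict interface the Trainer looks for."""

    def __init__(self, dataset, batch_size=32):
        self.loader = DataLoader(dataset, batch_size=batch_size)
        self.batches_seen = 0

    def __iter__(self):
        for batch in self.loader:
            self.batches_seen += 1
            yield batch

    def state_dict(self):
        # Whatever is returned here is saved in the Trainer checkpoint
        return {"batches_seen": self.batches_seen}

    def load_state_dict(self, state):
        # Called with the saved dict when resuming from a checkpoint
        self.batches_seen = state["batches_seen"]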
Lightning 2.2 Release Candidate
This is a preview release for Lightning 2.2.0.
Minor patch release v2.1.4
Fabric
Fixed
- Fixed an issue preventing Fabric from running on CPU when the system's CUDA driver is outdated or broken (#19234)
- Fixed typo in kwarg in SpikeDetection (#19282)
PyTorch
Fixed
- Fixed Trainer not expanding the default_root_dir if it has the ~ (home) prefix (#19179)
- Fixed warning for Dataloader if num_workers=1 and CPU count is 1 (#19224)
- Fixed WandbLogger.watch() method annotation to accept None for the log parameter (#19237)
- Fixed an issue preventing the Trainer from running on CPU when the system's CUDA driver is outdated or broken (#19234)
- Fixed an issue with the ModelCheckpoint callback not saving relative symlinks with ModelCheckpoint(save_last="link") (#19303)
- Fixed an issue where _restricted_classmethod_impl would incorrectly raise a TypeError on inspection rather than on call (#19332)
- Fixed exporting __version__ in __init__ (#19221)
Full Changelog: 2.1.3...2.1.4
Contributors
@andyland @asingh9530 @awaelchli @Borda @daturkel @dipta007 @lauritsf @mjbommar @shenmishajing @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release v2.1.3
App
Changed
- Lightning App: Use the batch get endpoint (#19180)
- Drop starsessions from App's requirements (#18470)
- Optimize loading time for chunks to be there (#19109)
Data
Added
- Add fault tolerance StreamingDataset (#19052)
- Add numpy support for the StreamingDataset (#19050)
- Add fault tolerance for the StreamingDataset (#19049)
- Add direct s3 support to the StreamingDataset (#19044)
- Add disk usage check before downloading files (#19041)
Changed
- Cleanup chunks right away if the dataset doesn't fit within the cache in StreamingDataset (#19168)
- Improve StreamingDataset deletion strategy (#19118)
- Improve StreamingDataset speed (#19114)
- Remove time in the Data Processor progress bar (#19108)
- Optimize loading time for chunks to be there (#19109)
- Resolve path for StreamingDataset (#19094)
- Make input dir in DataProcessor required (#18910)
- Remove the LightningDataset, which relied on the unmaintained torchdata (#19019)
Fixed
Fabric
Fixed
- Avoid moving the model to device if move_to_device=False is passed (#19152)
- Fixed broadcast at initialization in MPIEnvironment (#19074)
PyTorch
Changed
- LightningCLI no longer allows setting a normal class instance as default. A lazy_instance can be used instead (#18822)
Fixed
- Fixed checks for local file protocol due to fsspec changes in 2023.10.0 (#19023)
- Fixed automatic detection of 'last.ckpt' files to respect the extension when filtering (#17072)
- Fixed an issue where setting CHECKPOINT_JOIN_CHAR or CHECKPOINT_EQUALS_CHAR would only work on the ModelCheckpoint class but not on an instance (#19054)
- Fixed ModelCheckpoint not expanding the dirpath if it has the ~ (home) prefix (#19058)
- Fixed handling checkpoint dirpath suffix in NeptuneLogger (#18863)
- Fixed an edge case where ModelCheckpoint would alternate between versioned and unversioned filenames (#19064)
- Fixed broadcast at initialization in MPIEnvironment (#19074)
- Fixed the tensor conversion in self.log to respect the default dtype (#19046)
Full Changelog: 2.1.2...2.1.3
Contributors
@AleksanderWWW, @awaelchli, @Borda, @carmocca, @dependabot[bot], @mauvilsa, @MF-FOOM, @tchaton, @yassersouri
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release v2.1.2
App
Changed
- Forced plugin server to use localhost (#18976)
- Enabled bundling additional files into app source (#18980)
- Limited rate of requests to http queue (#18981)
Fabric
Fixed
- Fixed precision default from environment (#18928)
PyTorch
Fixed
- Fixed an issue causing permission errors on Windows when attempting to create a symlink for the "last" checkpoint (#18942)
- Fixed an issue where Metric instances from torchmetrics wouldn't get moved to the device when using FSDP (#18954)
- Fixed an issue preventing the user from calling Trainer.save_checkpoint() on an FSDP model when Trainer.test/validate/predict() ran after Trainer.fit() (#18992)
Contributors
@awaelchli, @carmocca, @ethanwharris, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Full Changelog: 2.1.1...2.1.2
Minor patch release v2.1.1
App
Added
- Add flow fail() (#18883)
Fixed
- Fix failing lightning cli entry point (#18821)
Fabric
Changed
- Calling a method other than forward that invokes submodules is now an error when the model is wrapped (e.g., with DDP) (#18819)
Fixed
- Fixed false-positive warnings about method calls on the Fabric-wrapped module (#18819)
- Refined the FSDP saving logic and error messaging when the path exists (#18884)
- Fixed layer conversion under Fabric.init_module() context manager when using the BitsandbytesPrecision plugin (#18914)
PyTorch
Fixed
- Fixed an issue when replacing an existing last.ckpt file with a symlink (#18793)
- Fixed an issue where the BatchSizeFinder steps_per_trial parameter ended up defining how many validation batches to run during the entire training (#18394)
- Fixed an issue saving the last.ckpt file when using ModelCheckpoint on a remote filesystem and no logger is used (#18867)
- Refined the FSDP saving logic and error messaging when the path exists (#18884)
- Fixed an issue parsing the version from folders that don't include a version number in TensorBoardLogger and CSVLogger (#18897)
Contributors
@awaelchli, @Borda, @BoringDonut, @carmocca, @hiaoxui, @ioangatop, @nohalon, @rasbt, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Full Changelog: 2.1.0...2.1.1