Releases: Lightning-AI/pytorch-lightning

Minor patch release: App jobs

24 Apr 13:56
682d7ef

App

Fixed

  • Resolved Lightning App with remote storage (#17426)
  • Fixed AppState, streamlit example (#17452)

Fabric

Changed

  • Enable precision autocast for LightningModule step methods in Fabric (#17439)

Fixed

  • Fixed an issue with LightningModule.*_step methods bypassing the DDP/FSDP wrapper (#17424)
  • Fixed device handling in Fabric.setup() when the model has no parameters (#17441)

PyTorch

Fixed

  • Fixed an issue where Model.load_from_checkpoint("checkpoint.ckpt", map_location=map_location) would always return the model on CPU (#17308)
  • Fixed syncing of module states during non-fit stages (#17370)
  • Fixed an issue that caused num_nodes not to be set correctly for FSDPStrategy (#17438)

Contributors

@awaelchli, @Borda, @carmocca, @ethanwharris, @ryan597, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor patch release

12 Apr 15:31
a020506

App

Changed

  • Added healthz endpoint to plugin server (#16882)
  • System customization syncing for jobs run (#16932)

Fabric

Changed

  • Let TorchCollective work on the torch.distributed WORLD process group by default (#16995)

Fixed

  • Fixed for all _cuda_clearCublasWorkspaces on teardown (#16907)
  • Improved the error message for installing tensorboard or tensorboardx (#17053)

PyTorch

Changed

  • Changes to the NeptuneLogger (#16761):
    • It now supports neptune-client 0.16.16 and neptune >=1.0, and we have replaced the log() method with append() and extend().
    • It now accepts a namespace Handler as an alternative to Run for the run argument. This means that you can call it like NeptuneLogger(run=run["some/namespace"]) to log everything to the some/namespace/ location of the run.
  • Allow sys.argv and args in LightningCLI (#16808)
  • Moved HPU broadcast override to the HPU strategy file (#17011)

Deprecated

  • Removed registration of ShardedTensor state dict hooks in LightningModule.__init__ with torch>=2.1 (#16892)
  • Removed the lightning.pytorch.core.saving.ModelIO class interface (#16974)

Fixed

  • Fixed num_nodes not being set for DDPFullyShardedNativeStrategy (#17160)
  • Fixed parsing the precision config for inference in DeepSpeedStrategy (#16973)
  • Fixed the availability check for rich that prevented Lightning from being imported in Google Colab (#17156)
  • Fixed for all _cuda_clearCublasWorkspaces on teardown (#16907)
  • The psutil package is now required for CPU monitoring (#17010)
  • Improved the error message for installing tensorboard or tensorboardx (#17053)

Contributors

@awaelchli, @belerico, @carmocca, @colehawkins, @dmitsf, @Erotemic, @ethanwharris, @kshitij12345, @Borda

If we forgot someone due to not matching commit email with GitHub account, let us know :]

2.0.1 appendix

11 Apr 18:43
38933be

App

Fixed

  • Fix frontend hosts when running with multi-process in the cloud (#17324)

Fabric

No changes.


PyTorch

Fixed

  • Make the is_picklable function more robust (#17270)
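
As a rough illustration of what a picklability check involves, here is a minimal sketch. The name is_picklable_sketch and the choice of exception types are assumptions made for this example; they are not taken from the Lightning source or from #17270.

```python
import pickle
import threading


def is_picklable_sketch(obj: object) -> bool:
    """Return True if `obj` survives `pickle.dumps`, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PickleError, AttributeError, RuntimeError, TypeError):
        # Unpicklable objects raise different exception types depending on
        # the cause, so a robust check catches more than PickleError alone.
        return False


print(is_picklable_sketch({"lr": 0.1}))       # plain dict of numbers
print(is_picklable_sketch(lambda x: x))       # lambdas cannot be pickled
print(is_picklable_sketch(threading.Lock()))  # locks cannot be pickled
```

Catching several exception types matters because a failed pickle does not always surface as a PickleError.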

Contributors

@eng-yue @ethanwharris @Borda @awaelchli @carmocca

If we forgot someone due to not matching commit email with GitHub account, let us know :]

2.0.1 patch release

30 Mar 14:45

App

No changes


Fabric

Changed

  • Generalized Optimizer validation to accommodate both FSDP 1.x and 2.x (#16733)

PyTorch

Changed

  • Pickling the LightningModule no longer pickles the Trainer (#17133)
  • Generalized Optimizer validation to accommodate both FSDP 1.x and 2.x (#16733)
  • Disable torch.inference_mode with torch.compile in PyTorch 2.0 (#17215)
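
The first item above (pickling a module without its Trainer) describes a general Python pattern: dropping a back-reference in __getstate__. The sketch below shows that pattern with hypothetical stand-in classes (TrainerSketch, ModuleSketch). It is not Lightning's actual implementation.

```python
import pickle


class TrainerSketch:
    """Hypothetical stand-in for a trainer that holds unpicklable state."""

    def __init__(self):
        self.open_handle = lambda: None  # lambdas cannot be pickled


class ModuleSketch:
    """Hypothetical stand-in for a module that keeps a back-reference."""

    def __init__(self, hidden_dim: int):
        self.hidden_dim = hidden_dim
        self.trainer = None

    def __getstate__(self):
        # Copy the instance dict and drop the back-reference before pickling.
        state = self.__dict__.copy()
        state["trainer"] = None
        return state


module = ModuleSketch(hidden_dim=64)
module.trainer = TrainerSketch()

# Without __getstate__, this would fail on the trainer's lambda attribute.
restored = pickle.loads(pickle.dumps(module))
print(restored.hidden_dim, restored.trainer)
```

The original object keeps its reference; only the pickled copy comes back without one.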

Fixed

  • Fixed issue where pickling the module instance would fail with a DataLoader error (#17130)
  • Fixed WandbLogger not showing "best" aliases for model checkpoints when ModelCheckpoint(save_top_k>0) is used (#17121)
  • Fixed the availability check for rich that prevented Lightning from being imported in Google Colab (#17156)
  • Fixed parsing the precision config for inference in DeepSpeedStrategy (#16973)
  • Fixed issue where torch.compile would fail when logging to WandB (#17216)

Contributors

@Borda @williamFalcon @lightningforever @adamjstewart @carmocca @tshu-w @saryazdi @parambharat @awaelchli @colehawkins @woqidaideshi @md-121 @yhl48 @gkroiz @idc9 @speediedan

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Lightning 2.0: Fast, Flexible, Stable

15 Mar 12:58
01834c8

Lightning AI is excited to announce the release of Lightning 2.0 ⚡

Over the last couple of years PyTorch Lightning has become the preferred deep learning framework for researchers and ML developers around the world, with close to 50 million downloads and 18k OSS projects, from top universities to leading labs.

With the help of over 800 contributors, we have added many features and functionalities to make it the most complete research toolkit possible, but some of these changes also introduced issues:

  • API changes to the trainer
  • Trainer code became harder to follow
  • Many integrations made Lightning appear bloated
  • The trainer became harder to customize, taking control away from the parts users most need to tweak

To make the research experience better, we are introducing 2.0:

  • No API changes - We commit to backward compatibility in the 2.0 series
  • Simplified abstraction layers, removed legacy functionality, integrations out of the main repo. This improves the project's readability and debugging experience.
  • Introducing Fabric. Scale any PyTorch model with just a few lines of code. Read on!

Highlights

PyTorch 2.0 and torch.compile

Lightning 2.0 is best friends with PyTorch 2.0. You can torch.compile your LightningModules now!

import torch
import lightning as L

model = LitModel()
# This will compile forward and {training,validation,test,predict}_step 
compiled_model = torch.compile(model)

trainer = L.Trainer()
trainer.fit(compiled_model)

PyTorch reports that on average, "models runs 43% faster in training on an NVIDIA A100 GPU. At Float32 precision, it runs 21% faster on average and at AMP Precision it runs 51% faster on average" (source). If you want to learn more about torch.compile and how such speedups can be achieved, read the official PyTorch 2.0 blog post.

Automatic accelerator selection (#16847)

The Trainer now chooses accelerator="auto", strategy="auto", devices="auto" as defaults. This automatically detects the best hardware on your system (TPUs, GPUs, Apple Silicon, etc.) and chooses as many devices as are available.

import lightning as L

# Selects accelerator, devices and strategy automatically!
trainer = L.Trainer()

# Same as:
trainer = L.Trainer(accelerator="auto", strategy="auto", devices="auto")

For example, on an 8-GPU server, this will implicitly select Trainer(accelerator="cuda", strategy="ddp", devices=8).
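
To make the idea of "auto" resolution concrete, here is a toy sketch that maps hypothetical device counts to a configuration. The function resolve_auto, the priority order, and the "single_device" label are all assumptions for illustration. The real detection logic lives inside Lightning and may differ.

```python
from typing import Dict, Tuple

# Assumed priority order for this sketch only.
PRIORITY = ("tpu", "cuda", "mps")


def resolve_auto(device_counts: Dict[str, int]) -> Tuple[str, str, int]:
    """Pick (accelerator, strategy, devices) from hypothetical device counts."""
    for accelerator in PRIORITY:
        count = device_counts.get(accelerator, 0)
        if count > 0:
            # More than one device implies a distributed strategy here.
            strategy = "ddp" if count > 1 else "single_device"
            return accelerator, strategy, count
    # Nothing else detected: fall back to a single CPU process.
    return "cpu", "single_device", 1


print(resolve_auto({"cuda": 8}))  # ('cuda', 'ddp', 8)
```

With eight CUDA devices reported, the sketch reproduces the 8-GPU example above.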

Support for arbitrary iterables (#16726)

Previously, the Trainer supported DataLoader-like iterables. However, with this release, users can now work with any iterable that implements the Python iterable definition. This includes custom data structures, such as user-defined classes and generators, as well as built-in Python objects.

To use this new feature, return any iterable (or collection of iterables) from the dataloader hooks.

def train_dataloader(self):
    # Return exactly one of the options below. The alternatives are commented
    # out so the hook has a single reachable return statement.

    # any iterable, such as a DataLoader or a plain list
    return DataLoader(...)
    # return list(range(1000))

    # pass loaders as a dict. This will create batches like this:
    # {'a': batch_from_loader_a, 'b': batch_from_loader_b}
    # return {"a": DataLoader(...), "b": DataLoader(...)}

    # pass loaders as list. This will create batches like this:
    # [batch_from_dl_1, batch_from_dl_2]
    # return [DataLoader(...), DataLoader(...)]

    # arbitrary nesting
    # {'a': [batch_from_dl_1, batch_from_dl_2], 'b': [batch_from_dl_3, batch_from_dl_4]}
    # return {"a": [dl1, dl2], "b": [dl3, dl4]}

Read our data section for more information.
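
For readers unsure what "implements the Python iterable definition" means in practice, here is a minimal sketch of a user-defined iterable. The class CountingBatches is hypothetical and uses only the standard library. Any object with an __iter__ method like this satisfies the protocol.

```python
from typing import Iterator, List


class CountingBatches:
    """Hypothetical iterable that yields fixed-size batches of integers."""

    def __init__(self, num_items: int, batch_size: int):
        self.num_items = num_items
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[List[int]]:
        # A fresh generator per call, so the object can be iterated every epoch.
        for start in range(0, self.num_items, self.batch_size):
            stop = min(start + self.batch_size, self.num_items)
            yield list(range(start, stop))


batches = CountingBatches(num_items=10, batch_size=4)
print(list(batches))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

In principle, an object like this could be returned from a hook such as train_dataloader. How the Trainer handles the resulting batches is not covered by this sketch.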

Redesigned multi-dataloader support (#16743, #16784, #16939)

Lightning automatically collates the batches from multiple iterables based on a "mode". This is done with our newly revamped CombinedLoader class.

from torch.utils.data import DataLoader
from lightning.pytorch import Trainer
from lightning.pytorch.utilities import CombinedLoader

# Replace each `...` with your own dataset and DataLoader arguments
iterables = {"a": DataLoader(...), "b": DataLoader(...)}
# Lightning uses this under the hood, but this way you can change the "mode"
combined_loader = CombinedLoader(iterables, mode="min_size")

model = ...
trainer = Trainer()
trainer.fit(model, combined_loader)

The following modes are supported:

  • min_size: stops after the shortest iterable (the one with the lowest number of items) is done.
  • max_size_cycle: stops after the longest iterable (the one with most items) is done, while cycling through the rest of the iterables.
  • max_size: stops after the longest iterable (the one with most items) is done, while returning None for the exhausted iterables.
  • sequential: completely consumes each iterable sequentially, and returns a triplet (data, idx, iterable_idx)

If you need a different "mode", feel free to open a feature request! Adding new modes is now much simpler. These improvements also allowed us to simplify the trainer's loops by abstracting this logic inside the CombinedLoader.
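
The four modes can be approximated in plain Python to show what each one yields. This is a sketch of the semantics described above using itertools on two small lists. It is not how CombinedLoader is implemented, and it produces tuples rather than the dict or list batches Lightning would return.

```python
from itertools import cycle, zip_longest

a = [1, 2, 3]
b = ["x", "y"]

# min_size: stop when the shortest iterable is exhausted.
min_size = list(zip(a, b))

# max_size: run to the longest iterable, filling exhausted ones with None.
max_size = list(zip_longest(a, b))

# max_size_cycle: run to the longest iterable, restarting the shorter ones.
longest = max(len(a), len(b))
max_size_cycle = [
    batch for _, batch in zip(range(longest), zip(cycle(a), cycle(b)))
]

# sequential: consume each iterable fully, one after the other, and
# report (data, idx, iterable_idx) for every item.
sequential = [
    (data, idx, iterable_idx)
    for iterable_idx, iterable in enumerate((a, b))
    for idx, data in enumerate(iterable)
]

for name, result in [
    ("min_size", min_size),
    ("max_size", max_size),
    ("max_size_cycle", max_size_cycle),
    ("sequential", sequential),
]:
    print(name, result)
```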

Barebones Trainer mode (#16854)

A new Trainer argument Trainer(barebones=...) was added (default is False) to disable all features that may impact the raw speed of the training loop. This allows users to quickly and fairly compare the runtime of a Lightning script with a raw PyTorch script.

This is how you enable it:

import lightning as L

# Default: False
trainer = L.Trainer(barebones=True)

A message informs you of the changed settings:

You are running in `Trainer(barebones=True)` mode. All features that may impact raw speed have been disabled to facilitate analyzing the Trainer overhead. Specifically, the following features are deactivated:
 - Checkpointing: `Trainer(enable_checkpointing=True)`
 - Progress bar: `Trainer(enable_progress_bar=True)`
 - Model summary: `Trainer(enable_model_summary=True)`
 - Logging: `Trainer(logger=True)`, `Trainer(log_every_n_steps>0)`, `LightningModule.log(...)`, `LightningModule.log_dict(...)`
 - Sanity checking: `Trainer(num_sanity_val_steps>0)`
 - Development run: `Trainer(fast_dev_run=True)`
 - Anomaly detection: `Trainer(detect_anomaly=True)`
 - Profiling: `Trainer(profiler=...)`

Tip: This feature is also very useful for unit testing!

Better progress bar (#16695)

Based on feedback from users, we decided to separate the training progress bar from the validation bar. This greatly improves the time estimates (since validation is usually faster) and resolves confusion around the total batches being processed in an epoch.

This is how the bar looked in versions before 2.0:

Epoch 3:  21%|██        | 28/128 [00:36<01:32, 23.12it/s, loss=0.163]
Validation DataLoader 0:  38%|███      | 12/32 [00:12<00:20,  1.01s/it]

Note how the total number of batches (128) is the sum of the training batches (32) and the three validation runs (3 x 32). And this is how the progress bar looks now:

Epoch 3:  50%|█████     | 16/32 [00:36<01:32, 23.12it/s]
Validation DataLoader 0:  38%|███      | 12/32 [00:12<00:20,  1.01s/it]

Note how the batch counts are now separate. The training progress bar pauses until validation is completed.
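
The batch counts in the two progress bars follow from simple arithmetic, sketched below. The variable names are chosen for this example, and the numbers are the ones quoted in the text.

```python
train_batches = 32
val_batches = 32
val_runs_per_epoch = 3

# Before 2.0: one bar counted training and validation batches together.
combined_total = train_batches + val_runs_per_epoch * val_batches

# From 2.0: the training bar counts only training batches.
separate_total = train_batches

print(combined_total, separate_total)  # 128 32
```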

Lightning Fabric

Lightning 2.0 is the official release for Lightning Fabric 🎉

Fabric spans across a large spectrum - from raw PyTorch all the way to high-level PyTorch Lightning

Fabric is the fast and lightweight way to scale PyTorch models without boilerplate code.

  • Easily switch from running on CPU to GPU (Apple Silicon, CUDA, ...), TPU, multi-GPU or even multi-node training
  • State-of-the-art distributed training strategies (DDP, FSDP, DeepSpeed) and mixed precision out of the box
  • Handles all the boilerplate device logic for you
  • Brings useful tools to help you build a trainer (callbacks, logging, checkpoints, ...)
  • Designed with multi-billion parameter models in mind

📖 Go to Fabric documentation 📖

  import torch
  import torch.nn as nn
  from torch.utils.data import DataLoader, Dataset

+ from lightning.fabric import Fabric

  class PyTorchModel(nn.Module):
      ...

  class PyTorchDataset(Dataset):
      ...

+ fabric = Fabric(accelerator="cuda", devices=8, strategy="ddp")
+ fabric.launch()

- device = "cuda" if torch.cuda.is_available() else "cpu"
  model = PyTorchModel(...)
  optimizer = torch.optim.SGD(model.parameters())
+ model, optimizer = fabric.setup(model, optimizer)
  dataloader = DataLoader(PyTorchDataset(...), ...)
+ dataloader = fabric.setup_dataloaders(dataloader)
  model.train()

  for epoch in range(num_epochs):
      for ba...

Weekly patch release

01 Mar 13:54
3bee819

App

Removed

  • Removed implicit UI testing with testing.run_app_in_cloud in favor of headless login and app selection (#16741)

Fabric

Added

  • Added Fabric(strategy="auto") support (#16916)

Fixed

  • Fixed edge cases in parsing device ids using NVML (#16795)
  • Fixed DDP spawn hang on TPU Pods (#16844)
  • Fixed an error when passing find_usable_cuda_devices(num_devices=-1) (#16866)

PyTorch

Added

  • Added Fabric(strategy="auto") support. It will choose DDP over DDP-spawn, contrary to strategy=None (default) (#16916)

Fixed

  • Fixed DDP spawn hang on TPU Pods (#16844)
  • Fixed edge cases in parsing device ids using NVML (#16795)
  • Fixed backwards compatibility for lightning.pytorch.utilities.parsing.get_init_args (#16851)

Contributors

@ethanwharris, @carmocca, @awaelchli, @justusschock, @dtuit, @Liyang90

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Lightning 2.0 Release Candidate

23 Feb 18:56
0130273
Pre-release

Full Changelog: 1.9.0...2.0.0rc0

Weekly patch release

21 Feb 20:39

App

Fixed

  • Fixed lightning open command and improved redirects (#16794)

Fabric

Fixed

  • Fixed an issue causing a wrong environment plugin to be selected when accelerator=tpu and devices > 1 (#16806)
  • Fixed parsing of defaults for --accelerator and --precision in Fabric CLI when accelerator and precision are set to non-default values in the code (#16818)

PyTorch

Fixed

  • Fixed an issue causing a wrong environment plugin to be selected when accelerator=tpu and devices > 1 (#16806)

Contributors

@ethanwharris, @carmocca, @awaelchli, @Borda, @tchaton, @yurijmikhalevich

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Weekly patch release

15 Feb 15:23
c5b836a

App

Added

  • Added Storage Commands (#16740)
    • rm: Delete files from your Cloud Platform Filesystem
  • Added lightning connect data to register data connection to private s3 buckets (#16738)

Fabric

Fixed

  • Fixed an attribute error and improved input validation for invalid strategy types being passed to Fabric (#16693)

PyTorch

Changed

  • Disabled strict loading in multiprocessing launcher ("ddp_spawn", etc.) when loading weights back into the main process (#16365)

Fixed

  • Fixed an attribute error and improved input validation for invalid strategy types being passed to Trainer (#16693)
  • Fixed early stopping triggering extra validation runs after reaching min_epochs or min_steps (#16719)

Contributors

@akihironitta, @awaelchli, @Borda, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Weekly patch release

10 Feb 16:57
c24b4bb

App

Added

  • Added lightning open command (#16482)
  • Added experimental support for interruptible GPU in the cloud (#16399)
  • Added FileSystem abstraction to simply manipulate files (#16581)
  • Added Storage Commands (#16606)
    • ls: List files from your Cloud Platform Filesystem
    • cd: Change the current directory within your Cloud Platform filesystem (terminal session based)
    • pwd: Return the current folder in your Cloud Platform Filesystem
    • cp: Copy files between your Cloud Platform Filesystem and local filesystem
  • Prevented cd into non-existent folders (#16645)
  • Enabled cp (upload) at project level (#16631)
  • Enabled ls and cp (download) at project level (#16622)
  • Added lightning connect data to register data connection to s3 buckets (#16670)
  • Added support for running with multiprocessing in the cloud (#16624)
  • Initial plugin server (#16523)
  • Connect and Disconnect node (#16700)

Changed

  • Changed the default LightningClient(retry=False) to retry=True (#16382)
  • Add support for async predict method in PythonServer and remove torch context (#16453)
  • Renamed lightning.app.components.LiteMultiNode to lightning.app.components.FabricMultiNode (#16505)
  • Changed the command lightning connect to lightning connect app for consistency (#16670)
  • Refactor cloud dispatch and update to new API (#16456)
  • Updated app URLs to the latest format (#16568)

Fixed

  • Fixed a deadlock causing apps not to exit properly when running locally (#16623)
  • Fixed the Drive root_folder not parsed properly (#16454)
  • Fixed malformed path when downloading files using lightning cp (#16626)
  • Fixed app name in URL (#16575)

Fabric

Fixed

  • Fixed error handling for accelerator="mps" and ddp strategy pairing (#16455)
  • Fixed strict availability check for torch_xla requirement (#16476)
  • Fixed an issue where PL would wrap DataLoaders with XLA's MpDeviceLoader more than once (#16571)
  • Fixed the batch_sampler reference for DataLoaders wrapped with XLA's MpDeviceLoader (#16571)
  • Fixed an import error when torch.distributed is not available (#16658)

PyTorch

Fixed

  • Fixed an unintended limitation for calling save_hyperparameters on mixin classes that don't subclass LightningModule/LightningDataModule (#16369)
  • Fixed an issue with MLFlowLogger logging the wrong keys with .log_hyperparams() (#16418)
  • Fixed logging more than 100 parameters with MLFlowLogger; long values are now truncated (#16451)
  • Fixed strict availability check for torch_xla requirement (#16476)
  • Fixed an issue where PL would wrap DataLoaders with XLA's MpDeviceLoader more than once (#16571)
  • Fixed the batch_sampler reference for DataLoaders wrapped with XLA's MpDeviceLoader (#16571)
  • Fixed an import error when torch.distributed is not available (#16658)

Contributors

@akihironitta, @awaelchli, @Borda, @BrianPulfer, @ethanwharris, @hhsecond, @justusschock, @Liyang90, @RuRo, @senarvi, @shenoynikhil, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]