Releases: Lightning-AI/pytorch-lightning
Minor patch release: App jobs
App
Fixed
Fabric
Changed
- Enable precision autocast for `LightningModule` step methods in Fabric (#17439)
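For context, a minimal sketch (not from the release notes) of the pattern this change affects: a `LightningModule` step method called through a Fabric-wrapped model now runs under the configured precision's autocast context, just like a plain `forward()` call. The toy module, shapes, and hyperparameters below are illustrative.
import torch
import lightning as L

class LitModel(L.LightningModule):
    # A toy LightningModule; only training_step matters for this example
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

fabric = L.Fabric(accelerator="cuda", devices=1, precision="16-mixed")
fabric.launch()

model = LitModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = fabric.setup(model, optimizer)

batch = fabric.to_device(torch.randn(4, 32))
loss = model.training_step(batch, batch_idx=0)  # now runs under autocast
fabric.backward(loss)
optimizer.step()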
Fixed
- Fixed an issue with `LightningModule.*_step` methods bypassing the DDP/FSDP wrapper (#17424)
- Fixed device handling in `Fabric.setup()` when the model has no parameters (#17441)
PyTorch
Fixed
- Fixed `Model.load_from_checkpoint("checkpoint.ckpt", map_location=map_location)` always returning the model on CPU (#17308)
- Fixed syncing of module states during non-fit stages (#17370)
- Fixed an issue that caused `num_nodes` not to be set correctly for `FSDPStrategy` (#17438)
Contributors
@awaelchli, @Borda, @carmocca, @ethanwharris, @ryan597, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Minor patch release
App
Changed
Fabric
Changed
- Let `TorchCollective` work on the `torch.distributed` WORLD process group by default (#16995)
Fixed
- Fixed `_cuda_clearCublasWorkspaces` on teardown (#16907)
- Improved the error message for installing tensorboard or tensorboardx (#17053)
PyTorch
Changed
- Changes to the `NeptuneLogger` (#16761):
  - It now supports neptune-client 0.16.16 and neptune >=1.0, and we have replaced the `log()` method with `append()` and `extend()`.
  - It now accepts a namespace `Handler` as an alternative to `Run` for the `run` argument. This means that you can call it like `NeptuneLogger(run=run["some/namespace"])` to log everything to the `some/namespace/` location of the run (see the sketch after this list).
- Allow `sys.argv` and args in `LightningCLI` (#16808)
- Moved HPU broadcast override to the HPU strategy file (#17011)
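A minimal sketch of the new namespace-handler usage described above, assuming neptune >=1.0; the project name is a placeholder:
import neptune
from lightning.pytorch.loggers import NeptuneLogger

run = neptune.init_run(project="my-workspace/my-project")  # placeholder project

# Passing a namespace Handler instead of the Run itself routes everything
# the logger writes under "some/namespace/" within the run
logger = NeptuneLogger(run=run["some/namespace"])

# Metrics logged through the logger land at e.g. "some/namespace/train/loss"
logger.log_metrics({"train/loss": 0.42})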
Deprecated
- Removed registration of `ShardedTensor` state dict hooks in `LightningModule.__init__` with `torch>=2.1` (#16892)
- Removed the `lightning.pytorch.core.saving.ModelIO` class interface (#16974)
Fixed
- Fixed `num_nodes` not being set for `DDPFullyShardedNativeStrategy` (#17160)
- Fixed parsing the precision config for inference in `DeepSpeedStrategy` (#16973)
- Fixed the availability check for `rich` that prevented Lightning from being imported in Google Colab (#17156)
- Fixed `_cuda_clearCublasWorkspaces` on teardown (#16907)
- The `psutil` package is now required for CPU monitoring (#17010)
- Improved the error message for installing tensorboard or tensorboardx (#17053)
Contributors
@awaelchli, @belerico, @carmocca, @colehawkins, @dmitsf, @Erotemic, @ethanwharris, @kshitij12345, @Borda
If we forgot someone due to not matching commit email with GitHub account, let us know :]
2.0.1 appendix
App
Fixed
- Fixed frontend hosts when running with multi-process in the cloud (#17324)
Fabric
No changes.
PyTorch
Fixed
- Made the `is_picklable` function more robust (#17270)
Contributors
@eng-yue @ethanwharris @Borda @awaelchli @carmocca
If we forgot someone due to not matching commit email with GitHub account, let us know :]
2.0.1 patch release
App
No changes.
Fabric
Changed
- Generalized `Optimizer` validation to accommodate both FSDP 1.x and 2.x (#16733)
PyTorch
Changed
- Pickling the `LightningModule` no longer pickles the `Trainer` (#17133)
- Generalized `Optimizer` validation to accommodate both FSDP 1.x and 2.x (#16733)
- Disabled `torch.inference_mode` with `torch.compile` in PyTorch 2.0 (#17215)
Fixed
- Fixed issue where pickling the module instance would fail with a DataLoader error (#17130)
- Fixed WandbLogger not showing "best" aliases for model checkpoints when `ModelCheckpoint(save_top_k>0)` is used (#17121)
- Fixed the availability check for `rich` that prevented Lightning from being imported in Google Colab (#17156)
- Fixed parsing the precision config for inference in `DeepSpeedStrategy` (#16973)
- Fixed an issue where `torch.compile` would fail when logging to WandB (#17216)
Contributors
@Borda @williamFalcon @lightningforever @adamjstewart @carmocca @tshu-w @saryazdi @parambharat @awaelchli @colehawkins @woqidaideshi @md-121 @yhl48 @gkroiz @idc9 @speediedan
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Lightning 2.0: Fast, Flexible, Stable
Lightning AI is excited to announce the release of Lightning 2.0 ⚡
Over the last couple of years, PyTorch Lightning has become the preferred deep learning framework for researchers and ML developers around the world, with close to 50 million downloads and 18k OSS projects, from top universities to leading labs.
With the help of over 800 contributors, we have added many features and functionalities to make it the most complete research toolkit possible, but some of these changes also introduced issues:
- API changes to the trainer
- Trainer code became harder to follow
- Many integrations made Lightning appear bloated
- The trainer became harder to customize, taking away control over details users need to tweak
To make the research experience better, we are introducing 2.0:
- No API changes - We commit to backward compatibility in the 2.0 series
- Simplified abstraction layers, removed legacy functionality, integrations out of the main repo. This improves the project's readability and debugging experience.
- Introducing Fabric. Scale any PyTorch model with just a few lines of code. Read on!
Highlights
PyTorch 2.0 and torch.compile
Lightning 2.0 is best friends with PyTorch 2.0. You can `torch.compile` your LightningModules now!
import torch
import lightning as L
model = LitModel()  # your LightningModule subclass
# This will compile forward and {training,validation,test,predict}_step
compiled_model = torch.compile(model)
trainer = L.Trainer()
trainer.fit(compiled_model)
PyTorch reports that on average, "models runs 43% faster in training on an NVIDIA A100 GPU. At Float32 precision, it runs 21% faster on average and at AMP Precision it runs 51% faster on average" (source). If you want to learn more about `torch.compile` and how such speedups can be achieved, read the official PyTorch 2.0 blog post.
Automatic accelerator selection (#16847)
The `Trainer` now chooses `accelerator="auto", strategy="auto", devices="auto"` as defaults. This automatically detects the best hardware on your system (TPUs, GPUs, Apple Silicon, etc.) and chooses as many devices as are available.
import lightning as L
# Selects accelerator, devices and strategy automatically!
trainer = L.Trainer()
# Same as:
trainer = L.Trainer(accelerator="auto", strategy="auto", devices="auto")
For example, on an 8-GPU server, this will implicitly select `Trainer(accelerator="cuda", strategy="ddp", devices=8)`.
Support for arbitrary iterables (#16726)
Previously, the Trainer only supported DataLoader-like iterables. With this release, users can work with any iterable that implements the Python iterable protocol. This includes custom data structures, such as user-defined classes and generators, as well as built-in Python objects.
To use this new feature, return any iterable (or collection of iterables) from the dataloader hooks.
def train_dataloader(self):
    # return a DataLoader, as before
    return DataLoader(...)

    # or any other Python iterable, e.g. a plain list
    return list(range(1000))

    # pass loaders as a dict. This will create batches like this:
    # {'a': batch_from_loader_a, 'b': batch_from_loader_b}
    return {"a": DataLoader(...), "b": DataLoader(...)}

    # pass loaders as list. This will create batches like this:
    # [batch_from_dl_1, batch_from_dl_2]
    return [DataLoader(...), DataLoader(...)]

    # arbitrary nesting
    # {'a': [batch_from_dl_1, batch_from_dl_2], 'b': [batch_from_dl_3, batch_from_dl_4]}
    return {"a": [dl1, dl2], "b": [dl3, dl4]}
Read our data section for more information.
Redesigned multi-dataloader support (#16743, #16784, #16939)
Lightning automatically collates the batches from multiple iterables based on a "mode". This is done with our newly revamped `CombinedLoader` class.
from lightning.pytorch.utilities import CombinedLoader
iterables = {"a": DataLoader(), "b": DataLoader()}
# Lightning uses this under the hood, but this way you can change the "mode"
combined_loader = CombinedLoader(iterables, mode="min_size")
model = ...
trainer = Trainer()
trainer.fit(model, combined_loader)
The following modes are supported:
- `min_size`: stops after the shortest iterable (the one with the fewest items) is done.
- `max_size_cycle`: stops after the longest iterable (the one with the most items) is done, while cycling through the rest of the iterables.
- `max_size`: stops after the longest iterable (the one with the most items) is done, while returning None for the exhausted iterables.
- `sequential`: completely consumes each iterable sequentially, and returns a triplet `(data, idx, iterable_idx)`.
If you need a different "mode", feel free to open a feature request! Adding new modes is now much simpler. These improvements also allowed us to simplify the trainer's loops by abstracting this logic inside the `CombinedLoader`.
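To make the modes concrete, here is a minimal sketch of iterating a `CombinedLoader` directly, outside the Trainer (the toy loader lengths are illustrative, and direct iteration is assumed to work as in the docs):
from torch.utils.data import DataLoader
from lightning.pytorch.utilities import CombinedLoader

# "a" yields 2 batches of 4 items, "b" yields 3 batches of 4 items
iterables = {
    "a": DataLoader(range(8), batch_size=4),
    "b": DataLoader(range(12), batch_size=4),
}

# min_size: stops with the shortest iterable, so 2 combined batches
combined = CombinedLoader(iterables, mode="min_size")
for batch in combined:
    print(batch)  # {"a": tensor(...), "b": tensor(...)}

# sequential: consumes "a" fully, then "b", yielding
# (data, idx, iterable_idx) triplets, 5 steps in total
combined = CombinedLoader(iterables, mode="sequential")
for data, idx, iterable_idx in combined:
    print(iterable_idx, idx, data)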
Barebones Trainer mode (#16854)
A new Trainer argument `Trainer(barebones=...)` was added (default is False) to disable all features that may impact the raw speed of the training loop. This allows users to quickly and fairly compare the runtime of a Lightning script with a raw PyTorch script.
This is how you enable it:
import lightning as L
# Default: False
trainer = L.Trainer(barebones=True)
A message informs about the changed settings:
You are running in `Trainer(barebones=True)` mode. All features that may impact raw speed have been disabled to facilitate analyzing the Trainer overhead. Specifically, the following features are deactivated:
- Checkpointing: `Trainer(enable_checkpointing=True)`
- Progress bar: `Trainer(enable_progress_bar=True)`
- Model summary: `Trainer(enable_model_summary=True)`
- Logging: `Trainer(logger=True)`, `Trainer(log_every_n_steps>0)`, `LightningModule.log(...)`, `LightningModule.log_dict(...)`
- Sanity checking: `Trainer(num_sanity_val_steps>0)`
- Development run: `Trainer(fast_dev_run=True)`
- Anomaly detection: `Trainer(detect_anomaly=True)`
- Profiling: `Trainer(profiler=...)`
Tip: This feature is also very useful for unit testing!
Better progress bar (#16695)
Based on feedback from users, we decided to separate the training progress bar from the validation bar. This greatly improves the time estimates (since validation is usually faster) and resolves confusion around the total batches being processed in an epoch.
This is how the bar looked in versions before 2.0:
Epoch 3: 21%|██ | 28/128 [00:36<01:32, 23.12it/s, loss=0.163]
Validation DataLoader 0: 38%|███ | 12/32 [00:12<00:20, 1.01s/it]
Note how the total batch count (128) is the sum of the training batches (32) and the three validation runs (3 x 32). And this is how the progress bar looks now:
Epoch 3: 50%|█████ | 16/32 [00:36<01:32, 23.12it/s]
Validation DataLoader 0: 38%|███ | 12/32 [00:12<00:20, 1.01s/it]
Note how the batch counts are now separate. The training progress bar pauses until validation is completed.
Lightning Fabric
Lightning 2.0 is the official release for Lightning Fabric 🎉
Fabric is the fast and lightweight way to scale PyTorch models without boilerplate code.
- Easily switch from running on CPU to GPU (Apple Silicon, CUDA, ...), TPU, multi-GPU or even multi-node training
- State-of-the-art distributed training strategies (DDP, FSDP, DeepSpeed) and mixed precision out of the box
- Handles all the boilerplate device logic for you
- Brings useful tools to help you build a trainer (callbacks, logging, checkpoints, ...)
- Designed with multi-billion parameter models in mind
📖 Go to Fabric documentation 📖
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
+ from lightning.fabric import Fabric
class PyTorchModel(nn.Module):
...
class PyTorchDataset(Dataset):
...
+ fabric = Fabric(accelerator="cuda", devices=8, strategy="ddp")
+ fabric.launch()
- device = "cuda" if torch.cuda.is_available() else "cpu"
model = PyTorchModel(...)
optimizer = torch.optim.SGD(model.parameters())
+ model, optimizer = fabric.setup(model, optimizer)
dataloader = DataLoader(PyTorchDataset(...), ...)
+ dataloader = fabric.setup_dataloaders(dataloader)
model.train()
for epoch in range(num_epochs):
for ba...
Weekly patch release
App
Removed
- Removed implicit UI testing with `testing.run_app_in_cloud` in favor of headless login and app selection (#16741)
Fabric
Added
- Added `Fabric(strategy="auto")` support (#16916)
Fixed
- Fixed edge cases in parsing device ids using NVML (#16795)
- Fixed DDP spawn hang on TPU Pods (#16844)
- Fixed an error when passing `find_usable_cuda_devices(num_devices=-1)` (#16866)
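For reference, a short sketch of how `find_usable_cuda_devices` is typically used; the device counts here are illustrative:
from lightning.fabric import Fabric
from lightning.fabric.accelerators import find_usable_cuda_devices

# Ask for two GPUs that are free (not occupied by other processes)
devices = find_usable_cuda_devices(2)

# num_devices=-1 (the case fixed here) returns all usable GPUs
all_devices = find_usable_cuda_devices(-1)

fabric = Fabric(accelerator="cuda", devices=devices)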
PyTorch
Added
- Added `Fabric(strategy="auto")` support. It will choose DDP over DDP-spawn, contrary to `strategy=None` (default) (#16916)
Fixed
- Fixed DDP spawn hang on TPU Pods (#16844)
- Fixed edge cases in parsing device ids using NVML (#16795)
- Fixed backwards compatibility for `lightning.pytorch.utilities.parsing.get_init_args` (#16851)
Contributors
@ethanwharris, @carmocca, @awaelchli, @justusschock, @dtuit, @Liyang90
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Lightning 2.0 Release Candidate
Full Changelog: 1.9.0...2.0.0rc0
Weekly patch release
App
Fixed
- Fixed `lightning open` command and improved redirects (#16794)
Fabric
Fixed
- Fixed an issue causing a wrong environment plugin to be selected when `accelerator=tpu` and `devices > 1` (#16806)
- Fixed parsing of defaults for `--accelerator` and `--precision` in Fabric CLI when `accelerator` and `precision` are set to non-default values in the code (#16818)
PyTorch
Fixed
- Fixed an issue causing a wrong environment plugin to be selected when `accelerator=tpu` and `devices > 1` (#16806)
Contributors
@ethanwharris, @carmocca, @awaelchli, @Borda, @tchaton, @yurijmikhalevich
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Weekly patch release
App
Added
- Added Storage Commands (#16740)
  - `rm`: Delete files from your Cloud Platform Filesystem
- Added `lightning connect data` to register data connections to private S3 buckets (#16738)
Fabric
Fixed
- Fixed an attribute error and improved input validation for invalid strategy types being passed to Fabric (#16693)
PyTorch
Changed
- Disabled strict loading in multiprocessing launcher ("ddp_spawn", etc.) when loading weights back into the main process (#16365)
Fixed
- Fixed an attribute error and improved input validation for invalid strategy types being passed to Trainer (#16693)
- Fixed early stopping triggering extra validation runs after reaching `min_epochs` or `min_steps` (#16719)
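For context, a minimal sketch (using standard Lightning APIs; the monitored metric and thresholds are illustrative) of the configuration this fix concerns: with both early stopping and `min_epochs` set, training now stops cleanly once the stopping condition is met after the minimum is reached, without extra validation runs.
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import EarlyStopping

trainer = Trainer(
    callbacks=[EarlyStopping(monitor="val_loss", patience=3)],
    min_epochs=10,  # early stopping cannot end training before this
)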
Contributors
@akihironitta, @awaelchli, @Borda, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]
Weekly patch release
App
Added
- Added `lightning open` command (#16482)
- Added experimental support for interruptible GPU in the cloud (#16399)
- Added FileSystem abstraction to simplify manipulating files (#16581)
- Added Storage Commands (#16606)
  - `ls`: List files from your Cloud Platform Filesystem
  - `cd`: Change the current directory within your Cloud Platform filesystem (terminal session based)
  - `pwd`: Return the current folder in your Cloud Platform Filesystem
  - `cp`: Copy files between your Cloud Platform Filesystem and local filesystem
- Prevented `cd` into non-existent folders (#16645)
- Enabled `cp` (upload) at project level (#16631)
- Enabled `ls` and `cp` (download) at project level (#16622)
- Added `lightning connect data` to register data connections to S3 buckets (#16670)
- Added support for running with multiprocessing in the cloud (#16624)
- Initial plugin server (#16523)
- Connect and Disconnect node (#16700)
Changed
- Changed the default `LightningClient(retry=False)` to `retry=True` (#16382)
- Added support for an async predict method in PythonServer and removed the torch context (#16453)
- Renamed `lightning.app.components.LiteMultiNode` to `lightning.app.components.FabricMultiNode` (#16505)
- Changed the command `lightning connect` to `lightning connect app` for consistency (#16670)
- Refactored cloud dispatch and updated to the new API (#16456)
- Updated app URLs to the latest format (#16568)
Fixed
- Fixed a deadlock causing apps not to exit properly when running locally (#16623)
- Fixed the Drive `root_folder` not being parsed properly (#16454)
- Fixed malformed path when downloading files using `lightning cp` (#16626)
- Fixed app name in URL (#16575)
Fabric
Fixed
- Fixed error handling for `accelerator="mps"` and `ddp` strategy pairing (#16455)
- Fixed strict availability check for `torch_xla` requirement (#16476)
- Fixed an issue where PL would wrap DataLoaders with XLA's MpDeviceLoader more than once (#16571)
- Fixed the batch_sampler reference for DataLoaders wrapped with XLA's MpDeviceLoader (#16571)
- Fixed an import error when `torch.distributed` is not available (#16658)
PyTorch
Fixed
- Fixed an unintended limitation for calling `save_hyperparameters` on mixin classes that don't subclass `LightningModule`/`LightningDataModule` (#16369)
- Fixed an issue with `MLFlowLogger` logging the wrong keys with `.log_hyperparams()` (#16418)
- Fixed logging more than 100 parameters with `MLFlowLogger`, where long values were truncated (#16451)
- Fixed strict availability check for `torch_xla` requirement (#16476)
- Fixed an issue where PL would wrap DataLoaders with XLA's MpDeviceLoader more than once (#16571)
- Fixed the batch_sampler reference for DataLoaders wrapped with XLA's MpDeviceLoader (#16571)
- Fixed an import error when `torch.distributed` is not available (#16658)
Contributors
@akihironitta, @awaelchli, @Borda, @BrianPulfer, @ethanwharris, @hhsecond, @justusschock, @Liyang90, @RuRo, @senarvi, @shenoynikhil, @tchaton
If we forgot someone due to not matching commit email with GitHub account, let us know :]