VAE not converging #11598
-
I had a similar problem with a Conv1D deep autoencoder in PyTorch Lightning, which I use on timeseries data. It is a vanilla AE, not a VAE. With my plain PyTorch training function it works very well: it reduces the loss consistently and learns a good reconstruction of the original timeseries. However, I had serious trouble trying to train the same network (actually, the same class) with PyTorch Lightning.

During my first attempts, the loss exploded (became larger than a float can represent) after a few epochs, and training always crashed at the same epoch and batch. The problem occurred when my dataloaders were defined in my autoencoder pl.LightningModule via train_dataloader() and val_dataloader(), BUT it did NOT occur when I removed those methods and passed the dataloaders as arguments to Trainer.fit().

There were some indications that PL could have problems with threadpools, and training failed with a C++ error. After the update, the C++ error did not appear any more, the loss did not explode any more, and the training did not crash, but it stagnated instead: the loss hovered somewhere around 9e3, which is a huge value; it usually gets down to something like 5 for a batch of 100 timeseries. Strangely enough, at that point training was broken both when the dataloaders were defined in the module and when I passed them to Trainer.fit().
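To make the two setups concrete, here is a minimal sketch of both ways of wiring up the dataloaders; the tiny model, data shapes, and hyperparameters are placeholder assumptions, not the actual autoencoder from this comment:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class TinyAE(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, 4, 3, padding=1), nn.ReLU(),
                                 nn.ConvTranspose1d(4, 1, 3, padding=1))

    def training_step(self, batch, batch_idx):
        (x,) = batch
        loss = nn.functional.mse_loss(self.net(x), x)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    # Variant 1: dataloader defined inside the LightningModule
    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(1000, 1, 56)), batch_size=100)

# Variant 2: dataloader passed directly to Trainer.fit()
loader = DataLoader(TensorDataset(torch.randn(1000, 1, 56)), batch_size=100)
trainer = pl.Trainer(max_epochs=1)
trainer.fit(TinyAE(), loader)
```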
After that, I implemented manual optimization in my PL module like this:

```python
def training_step(self, batch, batch_idx):
    opt = self.optimizers(use_pl_optimizer=True)
    opt.zero_grad()
    loss = self._get_reconstruction_loss(batch)
    self.log('train_loss', loss)
    self.manual_backward(loss)
    opt.step()
    return {'loss': loss}
```

With manual optimization, the training works fine, with performance similar to my plain PyTorch training function. However, the LR Finder doesn't work any more, and probably other things are broken too.
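A note on wiring: for the snippet above to run in manual mode, the module also needs automatic optimization switched off and an optimizer configured. A minimal sketch of those surrounding pieces, assuming a recent PL version (1.3 or newer) and a placeholder layer:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl

class LitAutoencoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Tell Lightning not to run backward()/optimizer.step() itself;
        # otherwise the manual_backward()/opt.step() calls in training_step
        # clash with the automatic optimization loop.
        self.automatic_optimization = False
        self.net = nn.Linear(56, 56)  # placeholder for the real encoder/decoder

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```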
So maybe a solution to your problem could be to use manual optimization, as in my case?

One question about your NN: why do you use both transposed convolutions AND upsampling in the decoder? I think they serve the same purpose; to my understanding, a transposed convolution is essentially learnable upsampling. In my AE I use only ConvTranspose1d() and it works fine.
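As a concrete illustration of that point, a single strided ConvTranspose1d can do the upsampling and the filtering in one learnable layer; the sketch below just compares output shapes (channel counts and lengths are arbitrary, not taken from either model):

```python
import torch
import torch.nn as nn

x = torch.randn(100, 8, 28)  # (batch, channels, length)

# Learnable upsampling: a strided ConvTranspose1d doubles the length on its own
up_learned = nn.ConvTranspose1d(8, 4, kernel_size=4, stride=2, padding=1)
print(up_learned(x).shape)  # torch.Size([100, 4, 56])

# Fixed upsampling followed by a convolution reaches the same output shape
up_fixed = nn.Sequential(
    nn.Upsample(scale_factor=2),
    nn.Conv1d(8, 4, kernel_size=3, padding=1),
)
print(up_fixed(x).shape)  # torch.Size([100, 4, 56])
```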
Regarding this whole situation, I also have some open questions about PyTorch Lightning itself. I am trying really hard to make PL work for my research projects and would be very grateful for any ideas on how to fix and improve my autoencoder implementation.
-
Hello everyone, I have built a PyTorch Lightning VAE from a working TensorFlow VAE with the same structure, but somehow the model doesn't learn. The reconstruction error stays at around 0.023 from the first epoch on, and the KL divergence drops to around 2e-8. The same VAE works in TensorFlow, and it also worked in a previous plain PyTorch implementation (which I unfortunately didn't save).
This is the Encoder, Decoder and VAE Module code:
The inputs are batches of 256 samples with shape (1, 56), the LR is 0.01, and I train for 10 epochs. The same settings and network structure work in TensorFlow. I have been staring at this and trying different things for days now. Does anyone see an error, or a reason why this network doesn't converge? It is really strange that the same VAE works in TensorFlow and in my previous PyTorch implementation, but not any more...
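For comparison, here is a minimal sketch of what a VAE training step for (1, 56) inputs can look like in PyTorch Lightning; the layer sizes, latent_dim, and loss weighting are illustrative assumptions, not the code from this post. Two things that are easy to get wrong and worth double-checking: the KL term must actually be added to the reconstruction loss, and the standard deviation in the reparameterization should be exp(0.5 * logvar), not logvar itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class VAESketch(pl.LightningModule):
    def __init__(self, input_dim=56, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(input_dim, 32), nn.ReLU())
        self.fc_mu = nn.Linear(32, latent_dim)
        self.fc_logvar = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, input_dim))

    def training_step(self, batch, batch_idx):
        x = batch                      # expected shape: (batch, 1, 56)
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, sigma = exp(0.5 * logvar)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.decoder(z).view_as(x)
        recon = F.mse_loss(x_hat, x)
        # KL divergence between N(mu, sigma^2) and N(0, 1), averaged over the batch
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
        loss = recon + kl
        self.log_dict({'recon_loss': recon, 'kl': kl, 'train_loss': loss})
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.01)
```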
Please let me know if you need any further information about the network.