VAE not converging #11598
-
I had a similar problem with a Conv1D deep autoencoder in PyTorch Lightning, which I use on timeseries data. It is a vanilla AE, not a VAE. With my plain PyTorch training function it works very well: it reduces the loss consistently and learns a good reconstruction of the original timeseries. However, I had serious trouble trying to train the same network (actually, the same class) with PyTorch Lightning.

During my first attempts, the loss exploded (became larger than a float can represent) after a few epochs, and training always crashed at the same epoch and batch. The problem occurred when my dataloaders were defined in my autoencoder pl.LightningModule via train_dataloader() and val_dataloader(), BUT it did NOT occur when I removed those methods and passed the dataloaders as arguments to Trainer.fit().

There were some indications that PL could have problems with threadpools, and training failed with a C++ error. After the update, the C++ error did not appear any more, the loss did not explode any more, and the training did not crash, but it stagnated instead: the loss hovered somewhere around 9e3, which is a huge value; it usually gets down to something like 5 for a batch of 100 timeseries. Strangely enough, at that point training was broken both when the dataloaders were defined in the module and when I passed them to Trainer.fit().
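To make the two setups concrete, here is a minimal sketch of both ways of wiring up the dataloaders; the tiny model, data shapes, and hyperparameters are placeholder assumptions, not the actual autoencoder from this comment:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class TinyAE(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, 4, 3, padding=1), nn.ReLU(),
                                 nn.ConvTranspose1d(4, 1, 3, padding=1))

    def training_step(self, batch, batch_idx):
        (x,) = batch
        loss = nn.functional.mse_loss(self.net(x), x)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    # Variant 1: dataloader defined inside the LightningModule
    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(1000, 1, 56)), batch_size=100)

# Variant 2: dataloader passed directly to Trainer.fit()
loader = DataLoader(TensorDataset(torch.randn(1000, 1, 56)), batch_size=100)
trainer = pl.Trainer(max_epochs=1)
trainer.fit(TinyAE(), loader)
```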
After that, I implemented manual optimization in my PL module like this:

```python
def training_step(self, batch, batch_idx):
    opt = self.optimizers(use_pl_optimizer=True)
    opt.zero_grad()
    loss = self._get_reconstruction_loss(batch)
    self.log('train_loss', loss)
    self.manual_backward(loss)
    opt.step()
    return {'loss': loss}
```

With manual optimization, the training works fine, with performance similar to my plain PyTorch training function. However, the LR Finder doesn't work any more, and probably other things are broken too.
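A note on wiring: for the snippet above to run in manual mode, the module also needs automatic optimization switched off and an optimizer configured. A minimal sketch of those surrounding pieces, assuming a recent PL version (1.3 or newer) and a placeholder layer:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl

class LitAutoencoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Tell Lightning not to run backward()/optimizer.step() itself;
        # otherwise the manual_backward()/opt.step() calls in training_step
        # clash with the automatic optimization loop.
        self.automatic_optimization = False
        self.net = nn.Linear(56, 56)  # placeholder for the real encoder/decoder

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```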
So maybe a solution to your problem could be to use manual optimization, as in my case?

One question about your NN: why do you use both transposed convolutions AND upsampling in the decoder? I think they serve the same purpose; to my understanding, a transposed convolution is essentially learnable upsampling. In my AE I use only ConvTranspose1d() and it works fine.
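As a concrete illustration of that point, a single strided ConvTranspose1d can do the upsampling and the filtering in one learnable layer; the sketch below just compares output shapes (channel counts and lengths are arbitrary, not taken from either model):

```python
import torch
import torch.nn as nn

x = torch.randn(100, 8, 28)  # (batch, channels, length)

# Learnable upsampling: a strided ConvTranspose1d doubles the length on its own
up_learned = nn.ConvTranspose1d(8, 4, kernel_size=4, stride=2, padding=1)
print(up_learned(x).shape)  # torch.Size([100, 4, 56])

# Fixed upsampling followed by a convolution reaches the same output shape
up_fixed = nn.Sequential(
    nn.Upsample(scale_factor=2),
    nn.Conv1d(8, 4, kernel_size=3, padding=1),
)
print(up_fixed(x).shape)  # torch.Size([100, 4, 56])
```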
Regarding this whole situation, I also have some open questions about PyTorch Lightning itself. I am trying really hard to make PL work for my research projects and would be very grateful for any ideas on how to fix and improve my autoencoder implementation.
-
Hello everyone, I have built a PyTorch Lightning VAE from a working TensorFlow VAE with the same structure, but somehow the model doesn't learn. The reconstruction error stays at around 0.023 from the first epoch on, and the KL divergence drops to around 2e-8. The same VAE works in TensorFlow, and it also worked in a previous plain PyTorch implementation (which I unfortunately didn't save).
This is the Encoder, Decoder and VAE Module code:
The inputs are batches of 256 samples with shape (1, 56), the LR is 0.01, and I train for 10 epochs. The same settings and network structure work in TensorFlow. I have been staring at this and trying different things for days now. Does anyone see an error, or a reason why this network doesn't converge? It is really strange that the same VAE works in TensorFlow and in my previous PyTorch implementation, but not any more...
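For comparison, here is a minimal sketch of what a VAE training step for (1, 56) inputs can look like in PyTorch Lightning; the layer sizes, latent_dim, and loss weighting are illustrative assumptions, not the code from this post. Two things that are easy to get wrong and worth double-checking: the KL term must actually be added to the reconstruction loss, and the standard deviation in the reparameterization should be exp(0.5 * logvar), not logvar itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class VAESketch(pl.LightningModule):
    def __init__(self, input_dim=56, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(input_dim, 32), nn.ReLU())
        self.fc_mu = nn.Linear(32, latent_dim)
        self.fc_logvar = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, input_dim))

    def training_step(self, batch, batch_idx):
        x = batch                      # expected shape: (batch, 1, 56)
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, sigma = exp(0.5 * logvar)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.decoder(z).view_as(x)
        recon = F.mse_loss(x_hat, x)
        # KL divergence between N(mu, sigma^2) and N(0, 1), averaged over the batch
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
        loss = recon + kl
        self.log_dict({'recon_loss': recon, 'kl': kl, 'train_loss': loss})
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.01)
```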
Please let me know if you need any further information about the network.