Adds video GAN framework #29
base: master
Conversation
The attention mechanisms are explained beautifully in https://arxiv.org/abs/1904.11492
BigGAN and SAGAN use the expensive Non-Local block for attention. This changeset implements the Non-Local block's cheaper global context simplification for 3d inputs (a minimal sketch follows below).
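For reference, a minimal sketch of what a 3d global context block can look like, following the GCNet formulation linked above; the class name, reduction ratio, and normalization layout are assumptions, not necessarily what this changeset uses.

```python
# Sketch of a GCNet-style global context block for 3d (video) feature maps.
# Names and the reduction ratio are assumptions for illustration only.
import torch
import torch.nn as nn

class GlobalContext3d(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.attn = nn.Conv3d(channels, 1, kernel_size=1)   # context modelling
        self.transform = nn.Sequential(                      # bottleneck transform
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.LayerNorm([channels // reduction, 1, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1))

    def forward(self, x):
        n, c, t, h, w = x.size()
        # softmax over all T*H*W positions gives a single global attention map
        weights = self.attn(x).view(n, 1, t * h * w).softmax(dim=-1)           # (N, 1, THW)
        context = torch.bmm(x.view(n, c, t * h * w), weights.transpose(1, 2))  # (N, C, 1)
        context = context.view(n, c, 1, 1, 1)
        # broadcast-add the transformed global context onto every position
        return x + self.transform(context)
```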
The learned upsampling via conv + pixel shuffle + ICNR is explained in https://arxiv.org/abs/1707.02937
This changeset implements the same idea for 3d: we can call it "voxel shuffle" 🤗 The idea is to treat the block as a sub-voxel convolution and initialize the conv kernels with ICNR appropriately. The figure above shows the situation in 2d; here we work with voxels instead of pixels, but the core idea is the same. BigGAN by default uses simple nearest neighbor upsampling followed by conv layers (because it's cheap when scaling their models up to huge sizes, I suppose?); one experiment will be to swap these learned upsampling layers in and see if they are any good in 3d. A rough sketch of the idea follows below.
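A minimal sketch of the voxel shuffle idea: PyTorch only ships a 2d nn.PixelShuffle, so the 3d rearrangement is written out by hand here, and the kernel size and ICNR details are illustrative assumptions rather than the exact code in this changeset.

```python
# "Voxel shuffle": a 3d analogue of conv + PixelShuffle with ICNR init
# (https://arxiv.org/abs/1707.02937). Illustrative sketch, not the PR's code.
import torch
import torch.nn as nn
import torch.nn.init as init

def voxel_shuffle(x, r):
    """Rearrange (N, C*r^3, T, H, W) into (N, C, T*r, H*r, W*r)."""
    n, c, t, h, w = x.size()
    c_out = c // (r ** 3)
    x = x.view(n, c_out, r, r, r, t, h, w)
    x = x.permute(0, 1, 5, 2, 6, 3, 7, 4)            # N, C, T, r, H, r, W, r
    return x.reshape(n, c_out, t * r, h * r, w * r)

def icnr3d_(weight, r):
    """ICNR: make all r^3 sub-voxel kernels identical, so the first forward
    pass behaves like nearest-neighbour upsampling (no checkerboard)."""
    out_channels = weight.size(0) // (r ** 3)
    sub = torch.empty(out_channels, *weight.shape[1:])
    init.kaiming_normal_(sub)
    weight.data.copy_(sub.repeat_interleave(r ** 3, dim=0))

class VoxelShuffleUpsample(nn.Module):
    def __init__(self, in_channels, out_channels, r=2):
        super().__init__()
        self.r = r
        self.conv = nn.Conv3d(in_channels, out_channels * r ** 3,
                              kernel_size=3, padding=1)
        icnr3d_(self.conv.weight, r)

    def forward(self, x):
        return voxel_shuffle(self.conv(x), self.r)
```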
The overall BigGAN / SAGAN'ish architecture is explained in https://arxiv.org/abs/1809.11096
For now we do not condition our GANs: embeddings, concatenations, and batchnorm modifications are out, and so is BigGAN's trick of splitting the noise vector across layers. As noted above, one experiment is to replace the simple nearest neighbor upsampling with our voxel shuffle block and see how that changes things. Note when implementing the discriminator blocks: the residual branch starts with a ReLU; in PyTorch this must not be an in-place operation, otherwise it will change the original input tensor, too (see the sketch below)!
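A minimal sketch of such a discriminator residual block; the channel handling, pooling, and shortcut projection are assumptions, the point is the leading non-inplace ReLU on the residual branch.

```python
# Sketch of a BigGAN-style 3d discriminator residual block (illustrative only).
import torch.nn as nn

class DBlock3d(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.ReLU(inplace=False),   # inplace=True would overwrite the tensor the shortcut also reads
            nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),    # safe: operates on a fresh conv output
            nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.AvgPool3d(kernel_size=2))
        self.shortcut = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=1),
            nn.AvgPool3d(kernel_size=2))

    def forward(self, x):
        return self.residual(x) + self.shortcut(x)
```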
This describes the overall architecture for 128x128. Instead of BigGAN's self-attention, which is quite expensive, we can use our 3d global context blocks, which are cheaper, and then add more of them to our generator and discriminator. We should start simple, e.g. with (TxHxW) 8x32x32 clips. One problem will be the difference in up/down-sampling rates: we need to decouple the time domain from the spatial domain. One experiment is to see where and how to up/down-sample time (see the sketch below). Here be 🐉 This will be fun! 🤗
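A minimal sketch of what decoupling the temporal from the spatial sampling rate can look like, using tuple scale factors and kernel sizes; the concrete factors and shapes are assumptions, not a proposal for the final architecture.

```python
# Decoupling time (T) from space (H, W) when up/down-sampling video tensors.
import torch
import torch.nn.functional as F

clip = torch.randn(1, 64, 8, 32, 32)                        # N, C, T, H, W

spatial_up = F.interpolate(clip, scale_factor=(1, 2, 2),    # keep T, double H and W
                           mode='nearest')                  # -> (1, 64, 8, 64, 64)
spatial_down = F.avg_pool3d(clip, kernel_size=(1, 2, 2))    # keep T, halve H and W
temporal_up = F.interpolate(clip, scale_factor=(2, 1, 1),   # double T only
                            mode='nearest')                 # -> (1, 64, 16, 32, 32)
```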
First somewhat reasonable results are in after a day of training (see below); observations:
Playground for video GANs. Right now this adds
References
- GCNet: https://arxiv.org/abs/1904.11492
- Checkerboard artifact free sub-pixel convolution (ICNR): https://arxiv.org/abs/1707.02937
- BigGAN: https://arxiv.org/abs/1809.11096
TensorBoard logs can be viewed via
Work in progress 🤗