
Adds video GAN framework #29

Open · wants to merge 1 commit into master
Conversation

daniel-j-h (Member)

Playground for video GANs. Right now this adds

  • video attention mechanisms: self-attention, simple attention, global context block
  • learned upsampling: conv+pixelshuffle+icnr
  • BigGAN's residual up/down building blocks
  • tensorboard integration: tracking losses and writing out generated video examples

References

TensorBoard logs can be viewed via

docker run -it --rm --network=host -v /tmp:/data tensorflow/tensorflow:2.0.0-py3 \
  tensorboard --bind_all --logdir /data/tbevents

Work in progress 🤗

@daniel-j-h (Member, Author)

The attention mechanisms are explained beautifully in https://arxiv.org/abs/1904.11492

[Figure: ctx, from https://arxiv.org/abs/1904.11492]

BigGAN and SAGAN use the expensive Non-Local block for attention. This changeset implements the Non-Local block (SelfAttention), the paper's simplified Non-Local block (SimpleSelfAttention), and the paper's proposed Global Context Block (GlobalContext), not for 2d but for 3d video models.
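For reference, a minimal sketch of what a 3d Global Context block can look like in PyTorch, following the paper's context-modelling + transform design; the class name, reduction ratio, and layer details are illustrative assumptions, not necessarily the exact code in this changeset.

import torch
import torch.nn as nn


class GlobalContext3d(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # context modelling: a 1x1x1 conv produces per-voxel attention logits
        self.context = nn.Conv3d(channels, 1, kernel_size=1)
        # transform: bottleneck 1x1x1 convs with LayerNorm, as in the paper's GC block
        self.transform = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.LayerNorm([channels // reduction, 1, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):
        n, c, t, h, w = x.shape
        # softmax over all T*H*W positions, then weighted pooling into one context vector
        attn = self.context(x).view(n, 1, t * h * w).softmax(dim=-1)
        ctx = torch.bmm(x.view(n, c, t * h * w), attn.transpose(1, 2)).view(n, c, 1, 1, 1)
        # transform the context and broadcast-add it back onto every voxel
        return x + self.transform(ctx)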

@daniel-j-h (Member, Author)

The learned upsampling via conv+pixelshuffle+icnr is explained in https://arxiv.org/abs/1707.02937

[Figure: shuf, from https://arxiv.org/abs/1707.02937]

This changeset implements the same idea for 3d: we can call it "voxel shuffle" 🤗 The idea is to treat the block as a sub-voxel convolution and initialize the conv volumes with ICNR appropriately. The figure above shows the situation in 2d; here we work with voxels instead of pixels, but the core idea is the same.
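A minimal sketch of the voxel shuffle idea in PyTorch: since torch only ships a 2d PixelShuffle, the shuffle is done with a reshape/permute, and ICNR makes the block start out as nearest neighbor upsampling. Names and defaults are illustrative assumptions, not necessarily the exact code in this changeset.

import torch
import torch.nn as nn


def voxel_shuffle(x, r):
    # (N, C*r^3, T, H, W) -> (N, C, T*r, H*r, W*r), the 3d analogue of pixel shuffle
    n, c, t, h, w = x.shape
    c_out = c // (r ** 3)
    x = x.view(n, c_out, r, r, r, t, h, w)
    x = x.permute(0, 1, 5, 2, 6, 3, 7, 4)
    return x.reshape(n, c_out, t * r, h * r, w * r)


def icnr_(weight, r):
    # initialize the conv so the freshly shuffled output equals nearest neighbor upsampling
    c_out = weight.shape[0] // (r ** 3)
    sub = torch.empty([c_out] + list(weight.shape[1:]))
    nn.init.kaiming_normal_(sub)
    weight.data.copy_(sub.repeat_interleave(r ** 3, dim=0))


class VoxelShuffleUp(nn.Module):
    def __init__(self, in_planes, out_planes, r=2):
        super().__init__()
        self.r = r
        self.conv = nn.Conv3d(in_planes, out_planes * r ** 3, kernel_size=3, padding=1)
        icnr_(self.conv.weight, r)

    def forward(self, x):
        return voxel_shuffle(self.conv(x), self.r)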

BigGAN by default uses simple nearest neighbor upsampling followed by conv layers (because it's cheap when scaling their models up to huge sizes I suppose?); one experiment will be to swap these learned upsampling layers in and see if they are any good in 3d.

@daniel-j-h (Member, Author)

The overall BigGAN / SAGAN'ish architecture is explained in https://arxiv.org/abs/1809.11096

[Figure: biggan1, from https://arxiv.org/abs/1809.11096]

For now we do not condition our GANs: embeddings, concatenations, and batchnorm modifications are out. The idea to split the noise z into chunks and feed them in at various layers (not just at the bottom of the generator) is interesting and something we can experiment with.
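A rough sketch of that hierarchical-z idea, purely for reference and not part of this changeset; the latent size and chunk count are arbitrary.

import torch

z = torch.randn(4, 120)                    # a batch of latents
z0, *zs = torch.chunk(z, chunks=6, dim=1)  # one 20-dim chunk per generator stage

# z0 would go through the initial linear layer; each remaining chunk would be
# concatenated onto the features entering its residual up-block (or drive
# conditional batchnorm once we add conditioning).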

As noted above: one experiment is to replace the simple nearest neighbor upsampling with our voxel shuffle block and see how that changes things.

Note: when implementing the discriminator blocks, the residual branch starts with a ReLU; in PyTorch this must not be an inplace operation, otherwise it would change the original input tensor, too!
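A minimal sketch of such a residual down-block illustrating the inplace pitfall; the exact layer layout is an assumption, not this changeset's code.

import torch.nn as nn


class ResBlockDown3d(nn.Module):
    def __init__(self, in_planes, out_planes):
        super().__init__()
        self.residual = nn.Sequential(
            nn.ReLU(inplace=False),  # inplace=True would mutate x before the skip branch reads it
            nn.Conv3d(in_planes, out_planes, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_planes, out_planes, kernel_size=3, padding=1),
            nn.AvgPool3d(kernel_size=2),
        )
        self.skip = nn.Sequential(
            nn.Conv3d(in_planes, out_planes, kernel_size=1),
            nn.AvgPool3d(kernel_size=2),
        )

    def forward(self, x):
        return self.residual(x) + self.skip(x)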

[Figure: biggan2, from https://arxiv.org/abs/1809.11096]

This describes the overall architecture for 128x128. Instead of BigGAN's self-attention, which is quite expensive, we can use our 3d global context blocks, which are cheaper, and then add more of them to our generator and discriminator.

We should start simple, e.g. with (TxHxW) 8x32x32 clips. One problem will be the difference in up/down-sampling rates: we need to decouple the time domain from the spatial domain. One experiment is to see where and how to up/down-sample time.
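A minimal sketch of decoupling time from space when up/down-sampling, assuming nearest neighbor upsampling and average pooling; the scale factors are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(2, 64, 8, 32, 32)  # (N, C, T, H, W): a batch of 8x32x32 clips

up_space = F.interpolate(x, scale_factor=(1, 2, 2), mode="nearest")  # -> 8x64x64, time untouched
down_space = nn.AvgPool3d(kernel_size=(1, 2, 2))(x)                  # -> 8x16x16, time untouched
down_both = nn.AvgPool3d(kernel_size=2)(x)                           # -> 4x16x16, time halved too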

Here will be 🐉 This will be fun! 🤗

@daniel-j-h (Member, Author)

First somewhat reasonable results are in after a day of training (see below); observations:

  1. Losses are spiky but otherwise training seems to be stable! Maybe it's because of the hinge losses (see the sketch after this list); need to experiment here.

  2. My rig is currently bottlenecked by CPUs decoding the h264 videos and pre-processing frames; with a batch size of 432 clips (one clip has 32 frames right now) we need to produce 13824 frames per batch. We might want to look into NVIDIA's GPU-based h264 decoding and/or re-think the video dataloader we have right now. We might also start with 8-frame clips and adapt the up/down-sampling in the time domain.
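For reference, a minimal sketch of the hinge losses mentioned in observation 1, as used in SAGAN/BigGAN; the function names are illustrative.

import torch.nn.functional as F


def d_hinge_loss(d_real, d_fake):
    # discriminator: push real logits above +1 and fake logits below -1
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()


def g_hinge_loss(d_fake):
    # generator: raise the discriminator's logits on generated clips
    return -d_fake.mean()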

[Figure: loss0 — training loss curves]

[Figure: fake-3200 — generated video samples]
