Help needed to enable TPU Training #630
@ZDisket so the bug comes from tf.data?
I think the bug comes from https://github.com/ZDisket/TensorflowTTS/blob/tpu/examples/melgan/train_melgan.py#L242-L285
@ZDisket maybe we are missing an else branch for https://github.com/ZDisket/TensorflowTTS/blob/tpu/examples/melgan/train_melgan.py#L260-L261?
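For illustration only, here is a hypothetical sketch of the kind of else branch that might be meant, assuming the referenced lines conditionally pad audio that is shorter than batch_max_steps. The names and shapes below are made up and are not the repo's actual code.

```python
import tensorflow as tf

def fix_length(audio, batch_max_steps=8192):
    """Return audio with a static length of batch_max_steps."""
    length = tf.shape(audio)[0]

    def pad():
        # Shorter clips are zero-padded up to batch_max_steps.
        return tf.pad(audio, [[0, batch_max_steps - length]])

    def crop():
        # The "else" branch: longer (or equal) clips are cropped so that
        # every example ends up with the same static shape, which XLA/TPU needs.
        return audio[:batch_max_steps]

    out = tf.cond(length < batch_max_steps, pad, crop)
    out.set_shape([batch_max_steps])  # make the static shape explicit
    return out
```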
Late reply because I posted the issue and then went to sleep. I iterated over the dataset with this function and ran it just before self.run() in the GanBasedTrainer. Nothing happened, so it looks like it's something in the training itself. According to the one issue where this problem is mentioned, the person calls it "a horrible bug deep down in the XLA compiler", but it should have been fixed after TF 2.2.
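For reference, a minimal sketch of that kind of sanity check, which drains the tf.data pipeline once before handing control to the trainer so that pipeline errors are separated from training-step (XLA compilation) errors. `train_dataset` and `trainer` are placeholders for whatever objects the training script builds, not the repo's exact names.

```python
def check_dataset(dataset, max_batches=None):
    """Iterate the tf.data pipeline eagerly to confirm it runs end to end."""
    count = 0
    for _ in dataset:
        count += 1
        if max_batches is not None and count >= max_batches:
            break
    print(f"iterated {count} batches without error")

check_dataset(train_dataset)  # runs cleanly -> the bug is likely in the train step
trainer.run()                 # e.g. GanBasedTrainer.run()
```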
Also, I am getting a separate error when training Tacotron2. This one might be easier to solve.
Can you try?
@dathudeptrai I previously tried removing the collater and making it pad to the longest audio length, and it still failed.
@ZDisket I will pin the issue. I have no idea about TPUs since I have never used them :D
It seems that the people over at TensorFlowASR already have TPU support and ran into similar problems in the past; it might be worth looking into: TensorSpeech/TensorFlowASR#100
I got Tacotron2 to start training on a Colab TPU by decorating BasedTrainer.run with @tf.function and removing experimental_relax_shapes from tacotron2.inference, at 1.4 s/it on a TPU v2-8 (about 4x faster than a Tesla V100).

Right now it can't save intermediate results, because .numpy() is not supported in graph mode, and it can't save checkpoints except when training is forcibly interrupted. TensorBoard also creates log files but never writes anything to them, leaving 40-byte empty files. On top of that, on a small dataset with about 2.6k training elements it mysteriously stops at 275 steps, as if something sent a Ctrl+C, at least in Colab (I think the Colab instance is just running out of memory).

From what I read, the TensorFlowASR folks fixed a lot of issues by abandoning custom loops and using Keras' built-in .fit() instead.
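This is not TensorflowTTS code, but a generic sketch of the ".fit() under TPUStrategy instead of a custom loop" idea mentioned above. The toy model, dataset, and bucket paths are placeholders standing in for the real generator/discriminator and data.

```python
import tensorflow as tf

# Connect to the Colab/Cloud TPU and build a distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Keras handles distribution, checkpointing and TensorBoard logging itself,
# which sidesteps the .numpy()-in-graph-mode problems of a hand-written loop.
model.fit(
    train_dataset,  # a tf.data.Dataset with static (padded) shapes
    epochs=10,
    steps_per_epoch=1000,
    callbacks=[
        tf.keras.callbacks.ModelCheckpoint("gs://your-bucket/ckpt-{epoch}"),
        tf.keras.callbacks.TensorBoard("gs://your-bucket/logs"),
    ],
)
```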
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
I've been trying to get TensorflowTTS to train on Cloud TPUs because they're really fast and easy to access through the TRC, starting with MB-MelGAN + the HiFi-GAN discriminator. I've already implemented all the changes required for this, including dataloader overhauls to use TFRecords and Google Cloud. When I try to train, however, I get this cryptic error, both in TF 2.5.0 and nightly (I didn't use TF 2.3.1 because it wrongly allocates something to the CPU, causing another error).
The [64,<=8192] in the error are [batch_size, batch_max_steps].
Here's the full training log:
train_log.txt
I can't figure out what causes this issue, no matter what I try. Any ideas? Being able to train on TPUs would be really beneficial and within reach. I can provide specific instructions to replicate the issue, but it requires a Google Cloud project with storage even when using a Colab TPU (TensorFlow 2.x refuses to save and load data from the local filesystem when using a TPU). The same code, including the TFRecord dataloader, trains fine on GPU.
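For context, here is a rough sketch of a TFRecord-on-GCS input pipeline with fully static shapes, which is what XLA on TPU generally expects (a bounded-dynamic dimension like [64,<=8192] comes from variable-length examples). The feature schema, bucket path, and sizes below are assumptions, not the repo's actual format.

```python
import tensorflow as tf

BATCH_SIZE = 64
BATCH_MAX_STEPS = 8192
HOP_SIZE = 256

feature_desc = {
    "audio": tf.io.FixedLenSequenceFeature([], tf.float32, allow_missing=True),
    "mel": tf.io.FixedLenSequenceFeature([80], tf.float32, allow_missing=True),
}

def parse(example_proto):
    ex = tf.io.parse_single_example(example_proto, feature_desc)
    # Crop/pad every example to a fixed length so XLA sees static shapes.
    audio = ex["audio"][:BATCH_MAX_STEPS]
    audio = tf.pad(audio, [[0, BATCH_MAX_STEPS - tf.shape(audio)[0]]])
    audio.set_shape([BATCH_MAX_STEPS])
    mel = ex["mel"][:BATCH_MAX_STEPS // HOP_SIZE]
    mel = tf.pad(mel, [[0, BATCH_MAX_STEPS // HOP_SIZE - tf.shape(mel)[0]], [0, 0]])
    mel.set_shape([BATCH_MAX_STEPS // HOP_SIZE, 80])
    return mel, audio

files = tf.io.gfile.glob("gs://your-bucket/tfrecords/train-*.tfrecord")
dataset = (
    tf.data.TFRecordDataset(files)
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1024)
    .batch(BATCH_SIZE, drop_remainder=True)  # fixed batch dim, no partial batches
    .prefetch(tf.data.AUTOTUNE)
)
```

With drop_remainder=True and per-example padding to a fixed length, every batch has the static shape [64, 8192] instead of a bounded-dynamic one.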