Training Colab Error: #677

medicenjona1 · 2024-12-12T19:14:32Z

While training the model, I encountered the following error:

/content/piper/src/python
2024-12-12 19:11:04.493620: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-12 19:11:04.528286: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-12 19:11:04.538373: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-12 19:11:04.564028: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-12-12 19:11:06.393035: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
DEBUG:piper_train:Namespace(dataset_dir='/content/drive/MyDrive/colab/piper/jonadubbing', checkpoint_epochs=5, quality='medium', resume_from_single_speaker_checkpoint=None, logger=True, enable_checkpointing=True, default_root_dir=None, gradient_clip_val=None, gradient_clip_algorithm=None, num_nodes=1, num_processes=None, devices='1', gpus=None, auto_select_gpus=False, tpu_cores=None, ipus=None, enable_progress_bar=True, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=None, max_epochs=9999, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, val_check_interval=None, log_every_n_steps=1000, accelerator='gpu', strategy=None, sync_batchnorm=False, precision=32, enable_model_summary=True, weights_save_path=None, num_sanity_val_steps=2, resume_from_checkpoint='/content/epoch=9999-step=1753600.ckpt', profiler=None, benchmark=None, deterministic=None, reload_dataloaders_every_n_epochs=0, auto_lr_find=False, replace_sampler_ddp=True, detect_anomaly=False, auto_scale_batch_size=False, plugins=None, amp_backend='native', amp_level=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', batch_size=12, validation_split=0.0, num_test_examples=0, max_phoneme_ids=None, hidden_channels=192, inter_channels=192, filter_channels=768, n_layers=6, n_heads=2, seed=1234, num_ckpt=0, save_last=True)
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:52: LightningDeprecationWarning: Setting Trainer(resume_from_checkpoint=) is deprecated in v1.5 and will be removed in v1.7. Please pass Trainer.fit(ckpt_path=) directly instead.
rank_zero_deprecation(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
DEBUG:piper_train:Checkpoints will be saved every 5 epoch(s)
DEBUG:piper_train:0 Checkpoints will be saved
DEBUG:vits.dataset:Loading dataset: /content/drive/MyDrive/colab/piper/jonadubbing/dataset.jsonl
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:731: LightningDeprecationWarning: trainer.resume_from_checkpoint is deprecated in v1.5 and will be removed in v2.0. Specify the fit checkpoint path with trainer.fit(ckpt_path=) instead.
ckpt_path = ckpt_path or self.resume_from_checkpoint
Restoring states from the checkpoint path at /content/epoch=9999-step=1753600.ckpt
DEBUG:fsspec.local:open file: /content/epoch=9999-step=1753600.ckpt
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:1659: UserWarning: Be aware that when using ckpt_path, callbacks used to create the checkpoint need to be provided during Trainer instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 10, 'train_time_interval': None, 'save_on_train_epoch_end': True}"].
rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
DEBUG:fsspec.local:open file: /content/drive/MyDrive/colab/piper/jonadubbing/lightning_logs/version_11/hparams.yaml
Restored all states from the checkpoint file at /content/epoch=9999-step=1753600.ckpt
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/data.py:153: UserWarning: Total length of DataLoader across ranks is zero. Please make sure this was your intention.
rank_zero_warn(
Trainer.fit stopped: max_epochs=9999 reached.

rmcpantoja · 2024-12-13T12:21:41Z

Hi,
That means you will need to set max epochs to a value greater than 10k, because the checkpoint you are resuming from was already fine-tuned for 10k epochs.
Anyway, we are preparing a "massive" upgrade to the trainer to support the newest dependencies.
Cheers.
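
To illustrate the point, here is a minimal sketch of how one might pick a suitable value for the `max_epochs` setting shown in the log. The helper names `epochs_completed` and `suggest_max_epochs` are hypothetical and not part of piper or PyTorch Lightning, and the sketch assumes the epoch number in the checkpoint file name is zero-based, so `epoch=9999` means 10000 epochs are already done.

```python
import re


def epochs_completed(ckpt_name: str) -> int:
    """Return how many epochs a checkpoint has finished, based on its file name.

    Assumption: the name follows the 'epoch=N-step=M.ckpt' pattern seen in the
    log, and N is a zero-based epoch index.
    """
    match = re.search(r"epoch=(\d+)", ckpt_name)
    if match is None:
        raise ValueError(f"no 'epoch=N' field found in {ckpt_name!r}")
    return int(match.group(1)) + 1


def suggest_max_epochs(ckpt_name: str, extra_epochs: int) -> int:
    """Return a max_epochs value that leaves room for extra_epochs more epochs."""
    if extra_epochs < 1:
        raise ValueError("extra_epochs must be at least 1")
    return epochs_completed(ckpt_name) + extra_epochs


if __name__ == "__main__":
    ckpt = "/content/epoch=9999-step=1753600.ckpt"
    # 10000 epochs already done, plus 2000 more to train.
    print(suggest_max_epochs(ckpt, extra_epochs=2000))  # prints 12000
```

Under that assumption, any value of 10000 or lower leaves nothing to train, which would match the "Trainer.fit stopped: max_epochs=9999 reached." line in the log.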

medicenjona1 · 2024-12-13T17:36:18Z

Hi, if I set it to 10k, it gives the same result. I haven’t tried more than 10k. I might just wait for the update and create a better dataset in the meantime. If you’re open to suggestions, it would help to have an option to export directly within the same notebook, because using multiple notebooks can sometimes be confusing.
