Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Training Colab Error: #677

Open
medicenjona1 opened this issue Dec 12, 2024 · 2 comments
Open

Training Colab Error: #677

medicenjona1 opened this issue Dec 12, 2024 · 2 comments

Comments

@medicenjona1
Copy link

While training the model, I encountered the following error:

/content/piper/src/python
2024-12-12 19:11:04.493620: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-12 19:11:04.528286: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-12 19:11:04.538373: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-12 19:11:04.564028: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-12-12 19:11:06.393035: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
DEBUG:piper_train:Namespace(dataset_dir='/content/drive/MyDrive/colab/piper/jonadubbing', checkpoint_epochs=5, quality='medium', resume_from_single_speaker_checkpoint=None, logger=True, enable_checkpointing=True, default_root_dir=None, gradient_clip_val=None, gradient_clip_algorithm=None, num_nodes=1, num_processes=None, devices='1', gpus=None, auto_select_gpus=False, tpu_cores=None, ipus=None, enable_progress_bar=True, overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=None, max_epochs=9999, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, val_check_interval=None, log_every_n_steps=1000, accelerator='gpu', strategy=None, sync_batchnorm=False, precision=32, enable_model_summary=True, weights_save_path=None, num_sanity_val_steps=2, resume_from_checkpoint='/content/epoch=9999-step=1753600.ckpt', profiler=None, benchmark=None, deterministic=None, reload_dataloaders_every_n_epochs=0, auto_lr_find=False, replace_sampler_ddp=True, detect_anomaly=False, auto_scale_batch_size=False, plugins=None, amp_backend='native', amp_level=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', batch_size=12, validation_split=0.0, num_test_examples=0, max_phoneme_ids=None, hidden_channels=192, inter_channels=192, filter_channels=768, n_layers=6, n_heads=2, seed=1234, num_ckpt=0, save_last=True)
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:52: LightningDeprecationWarning: Setting Trainer(resume_from_checkpoint=) is deprecated in v1.5 and will be removed in v1.7. Please pass Trainer.fit(ckpt_path=) directly instead.
rank_zero_deprecation(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
DEBUG:piper_train:Checkpoints will be saved every 5 epoch(s)
DEBUG:piper_train:0 Checkpoints will be saved
DEBUG:vits.dataset:Loading dataset: /content/drive/MyDrive/colab/piper/jonadubbing/dataset.jsonl
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:731: LightningDeprecationWarning: trainer.resume_from_checkpoint is deprecated in v1.5 and will be removed in v2.0. Specify the fit checkpoint path with trainer.fit(ckpt_path=) instead.
ckpt_path = ckpt_path or self.resume_from_checkpoint
Restoring states from the checkpoint path at /content/epoch=9999-step=1753600.ckpt
DEBUG:fsspec.local:open file: /content/epoch=9999-step=1753600.ckpt
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:1659: UserWarning: Be aware that when using ckpt_path, callbacks used to create the checkpoint need to be provided during Trainer instantiation. Please add the following callbacks: ["ModelCheckpoint{'monitor': None, 'mode': 'min', 'every_n_train_steps': 0, 'every_n_epochs': 10, 'train_time_interval': None, 'save_on_train_epoch_end': True}"].
rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
DEBUG:fsspec.local:open file: /content/drive/MyDrive/colab/piper/jonadubbing/lightning_logs/version_11/hparams.yaml
Restored all states from the checkpoint file at /content/epoch=9999-step=1753600.ckpt
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/data.py:153: UserWarning: Total length of DataLoader across ranks is zero. Please make sure this was your intention.
rank_zero_warn(
Trainer.fit stopped: max_epochs=9999 reached.

@rmcpantoja
Copy link
Contributor

Hi,
That means you will need to set pax epochs greater than 10k because the model was finetuned with 10k epochs.
Anyway, we are preparing things to make a "massive" upgrade to the trainer to support newest dependencies.
Cheers.

@medicenjona1
Copy link
Author

Hi, That means you will need to set pax epochs greater than 10k because the model was finetuned with 10k epochs. Anyway, we are preparing things to make a "massive" upgrade to the trainer to support newest dependencies. Cheers.

Hi, if I set it to 10k, it gives the same result. I haven’t tried more than 10k. I might just wait for the update and create a better dataset in the meantime. If you’re open to suggestions, I suggest adding an option to export directly within the same notebook because it can sometimes be confusing to use multiple notebooks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants