Title: Persistent TPU Initialization Error on Kaggle with Transformers Library
Issue Description
While running a script using the transformers library in the Kaggle environment with TPU enabled, we consistently encounter a TPU initialization error. Despite verifying all settings and configurations, the error persists. The script runs without issues on GPU, but fails on TPU.
Environment Details
Transformers Version: [Specify the version, e.g., 4.35.0]
PyTorch Version: [Specify the version, e.g., 2.0.1]
TPU Hardware: Kaggle TPU VM v3-8
Python Version: [e.g., 3.10]
Platform: Kaggle
Error Traceback
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/accel0): Operation not permitted: Couldn't open device: /dev/accel0; Unable to create Node RegisterInterface for node 0...
Steps to Reproduce
Enable TPU in the Kaggle environment.
Run the following command:
%cd aidiff
%cd improved-diffusion
!ls
!python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 200000 --seed 102 --noise_schedule sqrt --in_channel 16 --modality e2e-tgt --submit no --padding_mode block --app "--predict_xstart True --training_mode e2e --vocab_size 821 --e2e_train ../datasets/e2e_data " --notes xstart_e2e
!python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 400000 --seed 101 --noise_schedule sqrt --in_channel 128 --modality roc --submit no --padding_mode pad --app "--predict_xstart True --training_mode e2e --vocab_size 11043 --roc_train ../datasets/ROCstory " --notes xstart_e2e --bsz 64
Expected Behavior
The script should run on the TPU without encountering initialization errors.
Observed Behavior
TPU is correctly detected in the Kaggle environment.
TPU initialization fails with the provided error message.
Switching to GPU resolves the issue, but significantly increases runtime.
Debugging Efforts
Verified TPU configurations and compatibility with Kaggle.
Ensured the required libraries and dependencies are installed and up-to-date.
Performed manual debugging and consulted external resources (e.g., GPT-4).
Confirmed the training step works perfectly on GPU but consistently fails on TPU (a minimal check is sketched below).
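A minimal smoke test (a sketch, independent of the training script) that should surface the same failure whenever the device cannot be initialized:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # acquiring the XLA device triggers TPU initialization
x = torch.ones(2, 2, device=device)  # trivial op to confirm the device is usable
print(x.device)  # expected: xla:0 (or similar)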
Current Workaround
Using GPU instead of TPU bypasses the issue but is not an ideal solution due to increased runtime and resource costs.
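The fallback is the standard device-selection pattern (a minimal sketch; the notebook's actual logic may differ):

import torch

try:
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()  # raises RuntimeError when TPU initialization fails
except Exception:
    # Fall back to GPU (or CPU) when the TPU cannot be initialized.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(f"training on: {device}")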
Request for Assistance
We believe this is a compatibility issue between the transformers library and Kaggle's TPU environment, or a bug in TPU initialization itself. Any insights, fixes, or guidance to resolve this would be greatly appreciated.
Who can help?
No response
Information
The official example scripts
My own modified scripts
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
Kaggle Notebook Link:
The issue occurs when running the script in the Kaggle environment. Here’s the notebook: Kaggle Notebook
GitHub Repository:
The repository being used for this implementation is the Diffusion-LM repository: https://github.com/XiangLi1999/Diffusion-LM
Steps to reproduce the behavior:
Clone the GitHub repository, navigate to the working directory, and install the dependencies:
!git clone https://github.com/XiangLi1999/Diffusion-LM.git
%cd Diffusion-LM
!pip install -r requirements.txt
Enable TPU in the Kaggle environment by selecting TPU VM v3-8 in notebook settings.
Run the training script using the following command:
python train_run.py --experiment e2e-tgt-tree --app "--init_emb diffusion_models/diff_roc_pad_rand128_transformer_lr0.0001_0.0_2000_sqrt_Lsimple_h128_s2_d0.1_sd101_xstart_e2e --n_embd 16 --learned_emb yes" --pretrained_model bert-base-uncased --epoch 6 --bsz 10
Expected Behavior: The script should run seamlessly in the Kaggle environment with TPU enabled.
Observed Behavior: The TPU initialization fails with the following error:
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/accel0): Operation not permitted: Couldn't open device: /dev/accel0; Unable to create Node RegisterInterface for node 0...
Debugging Efforts:
Verified Kaggle TPU configurations and ensured the TPU VM v3-8 is enabled.
Ensured all dependencies were installed using the requirements.txt file.
Debugged manually and with external resources (e.g., GPT-4) to check compatibility.
Confirmed the script works perfectly on GPU, but the TPU-specific initialization consistently fails.
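To rule out a torch/torch_xla version mismatch (a common cause of TPU runtime failures, since the two are released in lockstep), the installed pairing can be printed, e.g.:

import torch
import torch_xla

print("torch:", torch.__version__)        # should match the torch_xla release
print("torch_xla:", torch_xla.__version__)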
The minimal Trainer snippet from the notebook (model and train_dataset are defined in earlier notebook cells):

from transformers import Trainer, TrainingArguments
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # acquiring the XLA device triggers TPU initialization

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=10,
    num_train_epochs=6,
    do_train=True,
    evaluation_strategy="steps",
    logging_dir="./logs",
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
The run_train.py commands listed under Steps to Reproduce above likewise result in a TPU runtime configuration error during training.
This issue persists even after verifying the TPU setup and ensuring the environment is configured correctly. Please advise on any potential fixes or further diagnostics.
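One further diagnostic worth running, since the traceback points at open(/dev/accel0) failing with "Operation not permitted", is to check the TPU device nodes directly (a sketch; lsof may not be present in every Kaggle image):

import os
import subprocess

# Verify the TPU device nodes exist and are accessible to the current user.
for i in range(4):
    path = f"/dev/accel{i}"
    print(path, "exists:", os.path.exists(path),
          "accessible:", os.access(path, os.R_OK | os.W_OK))

# A stale process from an earlier cell can hold the device open and block
# re-initialization; lsof lists any current holders (if installed).
try:
    out = subprocess.run(["lsof", "/dev/accel0"], capture_output=True, text=True).stdout
    print(out or "no process currently holds /dev/accel0")
except FileNotFoundError:
    print("lsof is not installed in this image")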
Expected Behavior
The script should execute seamlessly in Kaggle's TPU environment with the following outcomes:
TPU Initialization:
The TPU hardware should initialize successfully without errors.
Training Process:
The training process should begin and complete the specified number of epochs using TPU acceleration.
Performance:
The script should leverage the TPU's computational power, resulting in faster execution than on GPU.
Output:
The model checkpoints, logs, and results should be saved to the specified directories as per the script's configuration.