
TPU Initialization Error with Transformers in Kaggle TPU VM v3-8 #35774

Open · 4 tasks

kashifliaqat606 opened this issue Jan 19, 2025 · 3 comments
kashifliaqat606 commented Jan 19, 2025

System Info

Persistent TPU Initialization Error on Kaggle with the Transformers Library

Issue Description
While running a script that uses the transformers library in the Kaggle environment with TPU enabled, we consistently encounter a TPU initialization error. The error persists despite verifying all settings and configurations; the same script runs without issues on GPU but fails on TPU.

Environment Details
  • Transformers Version: [not specified; e.g., 4.35.0]
  • PyTorch Version: [not specified; e.g., 2.0.1]
  • TPU Hardware: Kaggle TPU VM v3-8
  • Python Version: [e.g., 3.10]
  • Platform: Kaggle
Error Traceback
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/accel0): Operation not permitted: Couldn't open device: /dev/accel0; Unable to create Node RegisterInterface for node 0...
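
Not part of the original report: a minimal diagnostic sketch, assuming a Kaggle TPU VM with torch_xla installed. "Operation not permitted" on /dev/accel0 often means the device node is missing or another process (e.g. an earlier TensorFlow/JAX session) already holds the TPU, so the sketch checks both before touching torch_xla:

import os
import subprocess

# Is the accelerator device node present at all?
print("/dev/accel0 exists:", os.path.exists("/dev/accel0"))

# Recent torch_xla releases select the runtime via PJRT_DEVICE;
# on a TPU VM this should read "TPU".
print("PJRT_DEVICE =", os.environ.get("PJRT_DEVICE"))

# List any process currently holding the device
# (fuser ships with the psmisc package and may need installing).
subprocess.run(["fuser", "-v", "/dev/accel0"], check=False)

# This is the call that raises the RuntimeError above when init fails.
import torch_xla.core.xla_model as xm
print(xm.xla_device())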
Steps to Reproduce
  1. Enable TPU in the Kaggle environment.
  2. Run the following commands:
%cd aidiff
%cd improved-diffusion
!ls
!python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 200000 --seed 102 --noise_schedule sqrt --in_channel 16 --modality e2e-tgt --submit no --padding_mode block --app "--predict_xstart True --training_mode e2e --vocab_size 821 --e2e_train ../datasets/e2e_data " --notes xstart_e2e
!python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 400000 --seed 101 --noise_schedule sqrt --in_channel 128 --modality roc --submit no --padding_mode pad --app "--predict_xstart True --training_mode e2e --vocab_size 11043 --roc_train ../datasets/ROCstory " --notes xstart_e2e --bsz 64
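
Before launching these scripts, one pre-flight check worth running in the same notebook (a sketch, under the assumption of a recent torch_xla release where the PJRT runtime is selected by environment variable) is to set the runtime explicitly and confirm a device can be acquired:

import os

# Select the PJRT TPU runtime explicitly, before any torch_xla import.
os.environ["PJRT_DEVICE"] = "TPU"

import torch_xla.core.xla_model as xm

# Prints something like "xla:0" if initialization works;
# raises the RuntimeError from the report otherwise.
print(xm.xla_device())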

Expected Behavior
The script should run on the TPU without encountering initialization errors.

Observed Behavior
  • TPU is correctly detected in the Kaggle environment.
  • TPU initialization fails with the error message above.
  • Switching to GPU resolves the issue but significantly increases runtime.

Debugging Efforts
  • Verified TPU configurations and compatibility with Kaggle.
  • Ensured the required libraries and dependencies are installed and up to date.
  • Performed manual debugging and consulted external resources (e.g., GPT-4).
  • Confirmed the training step works on GPU but consistently fails on TPU.

Current Workaround
Using GPU instead of TPU bypasses the issue, but this is not ideal due to the increased runtime and resource costs.

Request for Assistance
We believe this is either a compatibility issue between the transformers library and Kaggle's TPU environment or a bug in TPU initialization. Any insights, fixes, or guidance to resolve this would be greatly appreciated.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction
Kaggle Notebook Link:
The issue occurs when running the script in the Kaggle environment. Here’s the notebook:
Kaggle Notebook

GitHub Repository:
The repository being used for this implementation is:
Diffusion-LM GitHub Repository
Steps to reproduce the behavior:

  1. Clone the GitHub repository and navigate to the working directory:
    !git clone https://github.com/XiangLi1999/Diffusion-LM.git
    %cd Diffusion-LM
  2. Install the dependencies (a torch/torch_xla version check worth running after this step is sketched below):
    !pip install -r requirements.txt
  3. Enable TPU in the Kaggle environment by selecting TPU VM v3-8 in the notebook settings.
  4. Run the training script:
    python train_run.py --experiment e2e-tgt-tree --app "--init_emb diffusion_models/diff_roc_pad_rand128_transformer_lr0.0001_0.0_2000_sqrt_Lsimple_h128_s2_d0.1_sd101_xstart_e2e --n_embd 16 --learned_emb yes" --pretrained_model bert-base-uncased --epoch 6 --bsz 10
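
The version check referenced in step 2 (a sketch; the exact pins in the repository's requirements.txt are not shown here, but an install that downgrades torch to a build without a matching torch_xla would also break TPU initialization):

import torch
import torch_xla

# torch and torch_xla must come from matching release pairs
# (e.g. torch 2.x with torch_xla 2.x) for TPU init to work.
print("torch:", torch.__version__)
print("torch_xla:", torch_xla.__version__)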
Expected Behavior: The script should run seamlessly in the Kaggle environment with TPU enabled.

Observed Behavior: The TPU initialization fails with the following error:
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/accel0): Operation not permitted: Couldn't open device: /dev/accel0; Unable to create Node RegisterInterface for node 0...
Debugging Efforts:

  • Verified Kaggle TPU configurations and ensured the TPU VM v3-8 is enabled.
  • Ensured all dependencies were installed using the requirements.txt file.
  • Debugged manually and with external resources (e.g., GPT-4) to check compatibility.
  • Confirmed the script works perfectly on GPU, but the TPU-specific initialization consistently fails.

The minimal Trainer setup used to test TPU training is:
from transformers import Trainer, TrainingArguments
import torch_xla.core.xla_model as xm

# Acquiring the XLA device is where initialization fails on Kaggle;
# the Trainer handles device placement itself once this call succeeds.
device = xm.xla_device()

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=10,
    num_train_epochs=6,
    do_train=True,
    evaluation_strategy="steps",
    logging_dir="./logs",
)

# `model` and `train_dataset` are defined earlier in the notebook.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
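
One detail worth noting (an addition to the report, not the reporter's code): on TPU the Trainer is normally launched through torch_xla's multiprocessing helper, or via the transformers examples' xla_spawn.py wrapper, rather than called directly in the notebook process. A minimal sketch, assuming the snippet above is wrapped in a hypothetical run_training() helper:

import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each TPU process builds its own model/dataset and trains;
    # run_training() is a hypothetical wrapper around the Trainer code above.
    run_training()

if __name__ == "__main__":
    # Leave nprocs unset: the PJRT runtime picks the core count itself
    # (and rejects explicit values above 1); the legacy XRT runtime
    # used nprocs=8 on a v3-8.
    xmp.spawn(_mp_fn, args=())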

Expected behavior
The script should execute seamlessly on Kaggle's TPU environment with the following outcomes:

TPU Initialization:
The TPU hardware should initialize successfully without errors.

Training Process:
The training process should begin and complete the specified number of epochs using TPU acceleration.

Performance:
The script should leverage the TPU's computational power, resulting in faster execution compared to GPU.

Output:
The model checkpoints, logs, and results should be saved to the specified directories as per the script's configuration.

kashifliaqat606 (Author) commented:
Commands Causing TPU Runtime Error

The following commands, when executed, result in a TPU runtime configuration error during training:
%cd aidiff
%cd improved-diffusion
!ls
!python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 200000 --seed 102 --noise_schedule sqrt --in_channel 16 --modality e2e-tgt --submit no --padding_mode block --app "--predict_xstart True --training_mode e2e --vocab_size 821 --e2e_train ../datasets/e2e_data " --notes xstart_e2e
!python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 400000 --seed 101 --noise_schedule sqrt --in_channel 128 --modality roc --submit no --padding_mode pad --app "--predict_xstart True --training_mode e2e --vocab_size 11043 --roc_train ../datasets/ROCstory " --notes xstart_e2e --bsz 64

This issue persists even after verifying the TPU setup and ensuring the environment is configured correctly. Please advise on any potential fixes or further diagnostics.


mkfdj commented Jan 19, 2025

Feel free to ask me any further questions.


kashifliaqat606 commented Jan 19, 2025 via email
