
TPU Initialization Error with Transformers in Kaggle TPU VM v3-8 #35774

Open · 4 tasks

kashifliaqat606 opened this issue Jan 19, 2025 · 3 comments
kashifliaqat606 commented Jan 19, 2025

System Info

Persistent TPU Initialization Error on Kaggle with the Transformers Library

Issue Description
While running a script that uses the transformers library in the Kaggle environment with TPU enabled, we consistently encounter a TPU initialization error. The error persists despite verifying all settings and configurations; the same script runs without issues on GPU but fails on TPU.

Environment Details
  • Transformers Version: [not specified; e.g., 4.35.0]
  • PyTorch Version: [not specified; e.g., 2.0.1]
  • TPU Hardware: Kaggle TPU VM v3-8
  • Python Version: [e.g., 3.10]
  • Platform: Kaggle
Error Traceback
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/accel0): Operation not permitted: Couldn't open device: /dev/accel0; Unable to create Node RegisterInterface for node 0...
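
Not part of the original report: a minimal diagnostic sketch, assuming a Kaggle TPU VM with torch_xla installed. "Operation not permitted" on /dev/accel0 often means the device node is missing or another process (e.g. an earlier TensorFlow/JAX session) already holds the TPU, so the sketch checks both before touching torch_xla:

import os
import subprocess

# Is the accelerator device node present at all?
print("/dev/accel0 exists:", os.path.exists("/dev/accel0"))

# Recent torch_xla releases select the runtime via PJRT_DEVICE;
# on a TPU VM this should read "TPU".
print("PJRT_DEVICE =", os.environ.get("PJRT_DEVICE"))

# List any process currently holding the device
# (fuser ships with the psmisc package and may need installing).
subprocess.run(["fuser", "-v", "/dev/accel0"], check=False)

# This is the call that raises the RuntimeError above when init fails.
import torch_xla.core.xla_model as xm
print(xm.xla_device())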
Steps to Reproduce
  1. Enable TPU in the Kaggle environment.
  2. Run the following commands:
%cd aidiff
%cd improved-diffusion
!ls
!python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 200000 --seed 102 --noise_schedule sqrt --in_channel 16 --modality e2e-tgt --submit no --padding_mode block --app "--predict_xstart True --training_mode e2e --vocab_size 821 --e2e_train ../datasets/e2e_data " --notes xstart_e2e
!python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 400000 --seed 101 --noise_schedule sqrt --in_channel 128 --modality roc --submit no --padding_mode pad --app "--predict_xstart True --training_mode e2e --vocab_size 11043 --roc_train ../datasets/ROCstory " --notes xstart_e2e --bsz 64
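
Before launching these scripts, one pre-flight check worth running in the same notebook (a sketch, under the assumption of a recent torch_xla release where the PJRT runtime is selected by environment variable) is to set the runtime explicitly and confirm a device can be acquired:

import os

# Select the PJRT TPU runtime explicitly, before any torch_xla import.
os.environ["PJRT_DEVICE"] = "TPU"

import torch_xla.core.xla_model as xm

# Prints something like "xla:0" if initialization works;
# raises the RuntimeError from the report otherwise.
print(xm.xla_device())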

Expected Behavior
The script should run on the TPU without encountering initialization errors.

Observed Behavior
  • TPU is correctly detected in the Kaggle environment.
  • TPU initialization fails with the error message above.
  • Switching to GPU resolves the issue but significantly increases runtime.

Debugging Efforts
  • Verified TPU configurations and compatibility with Kaggle.
  • Ensured the required libraries and dependencies are installed and up to date.
  • Performed manual debugging and consulted external resources (e.g., GPT-4).
  • Confirmed the training step works on GPU but consistently fails on TPU.

Current Workaround
Using GPU instead of TPU bypasses the issue, but this is not ideal due to the increased runtime and resource costs.

Request for Assistance
We believe this is either a compatibility issue between the transformers library and Kaggle's TPU environment or a bug in TPU initialization. Any insights, fixes, or guidance to resolve this would be greatly appreciated.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction
Kaggle Notebook Link:
The issue occurs when running the script in the Kaggle environment. Here’s the notebook:
Kaggle Notebook

GitHub Repository:
The repository being used for this implementation is:
Diffusion-LM GitHub Repository
Steps to reproduce the behavior:

  1. Clone the GitHub repository and navigate to the working directory:
    !git clone https://github.com/XiangLi1999/Diffusion-LM.git
    %cd Diffusion-LM
  2. Install the dependencies (a torch/torch_xla version check worth running after this step is sketched below):
    !pip install -r requirements.txt
  3. Enable TPU in the Kaggle environment by selecting TPU VM v3-8 in the notebook settings.
  4. Run the training script:
    python train_run.py --experiment e2e-tgt-tree --app "--init_emb diffusion_models/diff_roc_pad_rand128_transformer_lr0.0001_0.0_2000_sqrt_Lsimple_h128_s2_d0.1_sd101_xstart_e2e --n_embd 16 --learned_emb yes" --pretrained_model bert-base-uncased --epoch 6 --bsz 10
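
The version check referenced in step 2 (a sketch; the exact pins in the repository's requirements.txt are not shown here, but an install that downgrades torch to a build without a matching torch_xla would also break TPU initialization):

import torch
import torch_xla

# torch and torch_xla must come from matching release pairs
# (e.g. torch 2.x with torch_xla 2.x) for TPU init to work.
print("torch:", torch.__version__)
print("torch_xla:", torch_xla.__version__)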
Expected Behavior: The script should run seamlessly in the Kaggle environment with TPU enabled.

Observed Behavior: The TPU initialization fails with the following error:
RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/accel0): Operation not permitted: Couldn't open device: /dev/accel0; Unable to create Node RegisterInterface for node 0...
Debugging Efforts:

  • Verified Kaggle TPU configurations and ensured the TPU VM v3-8 is enabled.
  • Ensured all dependencies were installed using the requirements.txt file.
  • Debugged manually and with external resources (e.g., GPT-4) to check compatibility.
  • Confirmed the script works perfectly on GPU, but the TPU-specific initialization consistently fails.

The minimal Trainer setup used to test TPU training is:
from transformers import Trainer, TrainingArguments
import torch_xla.core.xla_model as xm

# Acquiring the XLA device is where initialization fails on Kaggle;
# the Trainer handles device placement itself once this call succeeds.
device = xm.xla_device()

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=10,
    num_train_epochs=6,
    do_train=True,
    evaluation_strategy="steps",
    logging_dir="./logs",
)

# `model` and `train_dataset` are defined earlier in the notebook.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
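
One detail worth noting (an addition to the report, not the reporter's code): on TPU the Trainer is normally launched through torch_xla's multiprocessing helper, or via the transformers examples' xla_spawn.py wrapper, rather than called directly in the notebook process. A minimal sketch, assuming the snippet above is wrapped in a hypothetical run_training() helper:

import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each TPU process builds its own model/dataset and trains;
    # run_training() is a hypothetical wrapper around the Trainer code above.
    run_training()

if __name__ == "__main__":
    # Leave nprocs unset: the PJRT runtime picks the core count itself
    # (and rejects explicit values above 1); the legacy XRT runtime
    # used nprocs=8 on a v3-8.
    xmp.spawn(_mp_fn, args=())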

Expected behavior
The script should execute seamlessly on Kaggle's TPU environment with the following outcomes:

TPU Initialization:
The TPU hardware should initialize successfully without errors.

Training Process:
The training process should begin and complete the specified number of epochs using TPU acceleration.

Performance:
The script should leverage the TPU's computational power, resulting in faster execution compared to GPU.

Output:
The model checkpoints, logs, and results should be saved to the specified directories as per the script's configuration.

kashifliaqat606 (Author) commented:
Commands Causing TPU Runtime Error

The following commands, when executed, result in a TPU runtime configuration error during training:
%cd aidiff
%cd improved-diffusion
!ls
!python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 200000 --seed 102 --noise_schedule sqrt --in_channel 16 --modality e2e-tgt --submit no --padding_mode block --app "--predict_xstart True --training_mode e2e --vocab_size 821 --e2e_train ../datasets/e2e_data " --notes xstart_e2e
!python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 400000 --seed 101 --noise_schedule sqrt --in_channel 128 --modality roc --submit no --padding_mode pad --app "--predict_xstart True --training_mode e2e --vocab_size 11043 --roc_train ../datasets/ROCstory " --notes xstart_e2e --bsz 64

This issue persists even after verifying the TPU setup and ensuring the environment is configured correctly. Please advise on any potential fixes or further diagnostics.


mkfdj commented Jan 19, 2025

Feel free to ask me any further questions.


kashifliaqat606 commented Jan 19, 2025 via email
