
module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1 #73

Open
RuojiWang opened this issue Sep 14, 2022 · 1 comment

@RuojiWang

I'm hitting the following error:

ssh://[email protected]:10001/root/anaconda3/bin/python3.8 -u /workspace/project/huafeng/big_transfer-master/train.py --name huaweishengteng --model BiT-M-R50x1 --logdir ./logs --dataset cifar10 --datadir ./cifar
2022-09-14 11:42:29,707 [INFO] bit_common: Namespace(base_lr=0.003, batch=512, batch_split=1, bit_pretrained_dir='.', datadir='./cifar', dataset='cifar10', eval_every=None, examples_per_class=None, examples_per_class_seed=0, logdir='./logs', model='BiT-M-R50x1', name='huaweishengteng', save=True, workers=8)
2022-09-14 11:42:29,707 [INFO] bit_common: Going to train on cuda:1
Files already downloaded and verified
Files already downloaded and verified
2022-09-14 11:42:31,373 [INFO] bit_common: Using a training set with 50000 images.
2022-09-14 11:42:31,373 [INFO] bit_common: Using a validation set with 10000 images.
2022-09-14 11:42:31,373 [INFO] bit_common: Loading model from BiT-M-R50x1.npz
2022-09-14 11:42:31,893 [INFO] bit_common: Moving model onto all GPUs
2022-09-14 11:42:31,908 [INFO] bit_common: Model will be saved in './logs/huaweishengteng/bit.pth.tar'
2022-09-14 11:42:31,908 [INFO] bit_common: Fine-tuning from BiT
2022-09-14 11:42:34,812 [INFO] bit_common: Starting training!
Traceback (most recent call last):
File "/workspace/project/huafeng/big_transfer-master/train.py", line 296, in
main(parser.parse_args())
File "/workspace/project/huafeng/big_transfer-master/train.py", line 237, in main
logits = model(x)
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 154, in forward
raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

Process finished with exit code 1

cuda:0 is already in use by my lab, so I can only use one of the other GPUs, e.g. cuda:7. How can I solve this problem?
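One standard workaround when cuda:0 is occupied is to hide the busy devices with CUDA_VISIBLE_DEVICES before CUDA initializes, so the single free GPU is addressed as cuda:0 inside the process. A minimal sketch, assuming physical GPU 7 is the free one:

import os

# Must be set before torch initializes CUDA; physical GPU 7 then shows up as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "7"

import torch
print(torch.cuda.device_count())  # 1
device = torch.device("cuda:0")   # this is physical GPU 7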

@RuojiWang (Author) commented Sep 14, 2022

Well, I found the following information:

https://stackoverflow.com/questions/59249563/runtimeerror-module-must-have-its-parameters-and-buffers-on-device-cuda1-devi

Just modify the code as follows:

#device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
#logger.info(f"Going to train on {device}")

# Choose the GPUs explicitly. DataParallel keeps the primary copy of the
# model on device_ids[0], so `device` must match the first entry (cuda:1).
device_ids = [1, 4, 5]
device = torch.device("cuda:1")
logger.info(f"Going to train on {device}")

train_set, valid_set, train_loader, valid_loader = mktrainval(args, logger)

logger.info(f"Loading model from {args.model}.npz")
model = models.KNOWN_MODELS[args.model](head_size=len(valid_set.classes), zero_head=True)
model.load_from(np.load(f"{args.model}.npz"))

logger.info("Moving model onto all GPUs")
# Pass device_ids explicitly so replica 0 is cuda:1 instead of the default cuda:0.
model = torch.nn.DataParallel(model, device_ids=device_ids)

That solved my problem.
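For anyone else hitting this: the key constraint is that torch.nn.DataParallel expects the wrapped module's parameters and buffers to already live on device_ids[0] when forward() runs. A minimal standalone sketch of that requirement (toy model, not code from this repo):

import torch
import torch.nn as nn

# Hypothetical toy module; any nn.Module behaves the same way.
model = nn.Linear(10, 2)

device_ids = [1, 4, 5]  # cuda:1 acts as the primary replica
model = nn.DataParallel(model, device_ids=device_ids)

# The wrapped module must sit on device_ids[0]; leaving it on cuda:0 (or CPU)
# reproduces the "module must have its parameters and buffers ..." error.
model = model.to(torch.device(f"cuda:{device_ids[0]}"))

x = torch.randn(8, 10, device=f"cuda:{device_ids[0]}")
logits = model(x)  # DataParallel scatters the batch across GPUs 1, 4, 5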
