
module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1 #73

Open
RuojiWang opened this issue Sep 14, 2022 · 1 comment

@RuojiWang

I'm hitting the following error:

ssh://[email protected]:10001/root/anaconda3/bin/python3.8 -u /workspace/project/huafeng/big_transfer-master/train.py --name huaweishengteng --model BiT-M-R50x1 --logdir ./logs --dataset cifar10 --datadir ./cifar
2022-09-14 11:42:29,707 [INFO] bit_common: Namespace(base_lr=0.003, batch=512, batch_split=1, bit_pretrained_dir='.', datadir='./cifar', dataset='cifar10', eval_every=None, examples_per_class=None, examples_per_class_seed=0, logdir='./logs', model='BiT-M-R50x1', name='huaweishengteng', save=True, workers=8)
2022-09-14 11:42:29,707 [INFO] bit_common: Going to train on cuda:1
Files already downloaded and verified
Files already downloaded and verified
2022-09-14 11:42:31,373 [INFO] bit_common: Using a training set with 50000 images.
2022-09-14 11:42:31,373 [INFO] bit_common: Using a validation set with 10000 images.
2022-09-14 11:42:31,373 [INFO] bit_common: Loading model from BiT-M-R50x1.npz
2022-09-14 11:42:31,893 [INFO] bit_common: Moving model onto all GPUs
2022-09-14 11:42:31,908 [INFO] bit_common: Model will be saved in './logs/huaweishengteng/bit.pth.tar'
2022-09-14 11:42:31,908 [INFO] bit_common: Fine-tuning from BiT
2022-09-14 11:42:34,812 [INFO] bit_common: Starting training!
Traceback (most recent call last):
File "/workspace/project/huafeng/big_transfer-master/train.py", line 296, in
main(parser.parse_args())
File "/workspace/project/huafeng/big_transfer-master/train.py", line 237, in main
logits = model(x)
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 154, in forward
raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

Process finished with exit code 1

cuda:0 is already in use by my lab, so I can only use one of the other GPUs, e.g. cuda:7. How can I solve this problem?
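One standard workaround when cuda:0 is occupied is to hide the busy devices with CUDA_VISIBLE_DEVICES before CUDA initializes, so the single free GPU is addressed as cuda:0 inside the process. A minimal sketch, assuming physical GPU 7 is the free one:

import os

# Must be set before torch initializes CUDA; physical GPU 7 then shows up as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "7"

import torch
print(torch.cuda.device_count())  # 1
device = torch.device("cuda:0")   # this is physical GPU 7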

@RuojiWang (Author) commented Sep 14, 2022

Well, I found the following information:

https://stackoverflow.com/questions/59249563/runtimeerror-module-must-have-its-parameters-and-buffers-on-device-cuda1-devi

Just modify the code as follows:

#device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
#logger.info(f"Going to train on {device}")

# Choose the GPUs explicitly. DataParallel keeps the primary copy of the
# model on device_ids[0], so `device` must match the first entry (cuda:1).
device_ids = [1, 4, 5]
device = torch.device("cuda:1")
logger.info(f"Going to train on {device}")

train_set, valid_set, train_loader, valid_loader = mktrainval(args, logger)

logger.info(f"Loading model from {args.model}.npz")
model = models.KNOWN_MODELS[args.model](head_size=len(valid_set.classes), zero_head=True)
model.load_from(np.load(f"{args.model}.npz"))

logger.info("Moving model onto all GPUs")
# Pass device_ids explicitly so replica 0 is cuda:1 instead of the default cuda:0.
model = torch.nn.DataParallel(model, device_ids=device_ids)

That solved my problem.
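For anyone else hitting this: the key constraint is that torch.nn.DataParallel expects the wrapped module's parameters and buffers to already live on device_ids[0] when forward() runs. A minimal standalone sketch of that requirement (toy model, not code from this repo):

import torch
import torch.nn as nn

# Hypothetical toy module; any nn.Module behaves the same way.
model = nn.Linear(10, 2)

device_ids = [1, 4, 5]  # cuda:1 acts as the primary replica
model = nn.DataParallel(model, device_ids=device_ids)

# The wrapped module must sit on device_ids[0]; leaving it on cuda:0 (or CPU)
# reproduces the "module must have its parameters and buffers ..." error.
model = model.to(torch.device(f"cuda:{device_ids[0]}"))

x = torch.randn(8, 10, device=f"cuda:{device_ids[0]}")
logits = model(x)  # DataParallel scatters the batch across GPUs 1, 4, 5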
