ssh://[email protected]:10001/root/anaconda3/bin/python3.8 -u /workspace/project/huafeng/big_transfer-master/train.py --name huaweishengteng --model BiT-M-R50x1 --logdir ./logs --dataset cifar10 --datadir ./cifar
2022-09-14 11:42:29,707 [INFO] bit_common: Namespace(base_lr=0.003, batch=512, batch_split=1, bit_pretrained_dir='.', datadir='./cifar', dataset='cifar10', eval_every=None, examples_per_class=None, examples_per_class_seed=0, logdir='./logs', model='BiT-M-R50x1', name='huaweishengteng', save=True, workers=8)
2022-09-14 11:42:29,707 [INFO] bit_common: Going to train on cuda:1
Files already downloaded and verified
Files already downloaded and verified
2022-09-14 11:42:31,373 [INFO] bit_common: Using a training set with 50000 images.
2022-09-14 11:42:31,373 [INFO] bit_common: Using a validation set with 10000 images.
2022-09-14 11:42:31,373 [INFO] bit_common: Loading model from BiT-M-R50x1.npz
2022-09-14 11:42:31,893 [INFO] bit_common: Moving model onto all GPUs
2022-09-14 11:42:31,908 [INFO] bit_common: Model will be saved in './logs/huaweishengteng/bit.pth.tar'
2022-09-14 11:42:31,908 [INFO] bit_common: Fine-tuning from BiT
2022-09-14 11:42:34,812 [INFO] bit_common: Starting training!
Traceback (most recent call last):
  File "/workspace/project/huafeng/big_transfer-master/train.py", line 296, in <module>
    main(parser.parse_args())
  File "/workspace/project/huafeng/big_transfer-master/train.py", line 237, in main
    logits = model(x)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 154, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
Process finished with exit code 1
cuda:0 is in use by my lab, so I can only use the other GPUs, for example cuda:7. How can I solve this problem?
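One way to keep a process off cuda:0 without touching the training code is to hide that GPU through the CUDA_VISIBLE_DEVICES environment variable. The sketch below is only an illustration: restrict_visible_gpus is a hypothetical helper name, not part of big_transfer or PyTorch, and it assumes the physical IDs you pass match the order in which CUDA enumerates the devices on your machine. The variable has to be set before CUDA is initialized, so the safest place is before torch is imported.

```python
import os


def restrict_visible_gpus(physical_ids):
    """Expose only `physical_ids` to CUDA; return the logical -> physical mapping.

    Hypothetical helper for illustration. Call it before the first CUDA call
    (safest: before `import torch`), otherwise it has no effect.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    # The visible GPUs are renumbered from zero inside this process.
    return {f"cuda:{logical}": physical for logical, physical in enumerate(physical_ids)}


mapping = restrict_visible_gpus([7])
print(mapping)  # {'cuda:0': 7}
```

With this in place, code that asks for cuda:0 ends up on physical GPU 7. The same effect is available from the shell by prefixing the training command with CUDA_VISIBLE_DEVICES=7.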
# device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
# logger.info(f"Going to train on {device}")
# cuda:0 is taken, so list the GPUs to use explicitly.
device_ids = [1, 4, 5]
device = torch.device("cuda:1")  # same GPU as device_ids[0]
logger.info(f"Going to train on {device}")
logger.info(f"Loading model from {args.model}.npz")
model = models.KNOWN_MODELS[args.model](head_size=len(valid_set.classes), zero_head=True)
model.load_from(np.load(f"{args.model}.npz"))
logger.info("Moving model onto all GPUs")
model = torch.nn.DataParallel(model, device_ids=device_ids)
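The RuntimeError in the log comes from a check in DataParallel.forward: the wrapped module's parameters and buffers must sit on device_ids[0], and when device_ids is not passed it defaults to all visible GPUs, which makes device_ids[0] equal to 0. A minimal sketch of a setup that satisfies the check is shown below. It is an assumption-laden illustration, not the big_transfer code: wrap_on_gpus is a hypothetical helper, the nn.Linear layer stands in for the BiT model, and the CPU fallback exists only so the snippet runs on a machine without GPUs.

```python
import torch
import torch.nn as nn


def wrap_on_gpus(model, device_ids):
    """Move `model` to the first usable GPU in `device_ids` and wrap it in DataParallel.

    Hypothetical helper. IDs that do not exist on this machine are dropped;
    if none remain, the model stays on the CPU and is returned unwrapped.
    """
    usable = [i for i in device_ids if i < torch.cuda.device_count()]
    if not usable:
        cpu = torch.device("cpu")
        return model.to(cpu), cpu
    primary = torch.device(f"cuda:{usable[0]}")
    # DataParallel checks that parameters and buffers are on device_ids[0].
    model = model.to(primary)
    model = nn.DataParallel(model, device_ids=usable, output_device=usable[0])
    return model, primary


toy = nn.Linear(8, 2)  # stand-in for the real model
model, device = wrap_on_gpus(toy, device_ids=[1, 4, 5])

# Inputs go to the primary device; DataParallel scatters them from there.
x = torch.randn(4, 8).to(device)
logits = model(x)
```

The key point is that the device the model is moved to and the first entry of device_ids are derived from the same value, so they cannot drift apart when the GPU list changes.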
With the modified train.py shown above (device_ids = [1, 4, 5]), I get the same error as in the log at the top of this post.