How to write training loop for object detection using `rastervision` and `pytorch-lightning` #1769

aerotractjack · 2023-04-05T20:29:01Z

aerotractjack
Apr 5, 2023

I am attempting to use rastervision and pytorch-lightning to train an object detection model (fasterrcnn_resnet50_fpn_v2, to be exact), but I am running into multiple issues within my training and validation loops. During early training stages, my model predicts a variable number of bounding boxes per image. This leads to errors in my loss function, because the prediction and ground-truth tensors have different shapes.

Here is my model definition

class ObjectDetection(pl.LightningModule):

    def __init__(self, backbone, lr=1e-4):
        super().__init__()
        self.backbone = TorchVisionODAdapter(backbone)
        self.lr = lr

    def forward(self, img):
        return self.backbone(img)

    def training_step(self, batch, batch_idx):
        print("sanity training")
        image, target = batch
        loss_dict = self.backbone(image, target)
        losses = sum(loss for loss in loss_dict.values())
        batch_size = len(batch[0])
        self.log_dict(loss_dict, batch_size=batch_size)
        self.log("train_loss", losses, batch_size=batch_size)
        return losses
    
    def validation_step(self, batch, batch_idx):
        print("sanity validation")
        image, target = batch
        # error occurs here
        loss_dict = self.backbone(image, target)
        losses = sum(loss for loss in loss_dict.values())
        batch_size = len(batch[0])
        self.log_dict(loss_dict, batch_size=batch_size)
        self.log("val_loss", losses, batch_size=batch_size)
        return losses

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(
            self.backbone.parameters(), lr=self.lr)
        return optimizer

And here is my training code

    def train(self):
        kw = self.kw.get("train_kw", {})
        lr = float(kw.get("lr", 1e-4))
        epochs = kw.get("epochs", 1)
        output_dir = self.output_uri
        make_dir(output_dir)
        fast_dev_run = False
        backbone = fasterrcnn_resnet50_fpn_v2(
            num_classes=len(self.cc), pretrained=True)
        model = ObjectDetection(backbone, lr=lr)
        tb_logger = TensorBoardLogger(save_dir=output_dir + "/tensorboard", flush_secs=10)
        trainer = pl.Trainer(
            accelerator='auto',
            min_epochs=1,
            max_epochs=epochs+1,
            default_root_dir=output_dir + "/trainer",
            logger=[tb_logger],
            fast_dev_run=fast_dev_run,
            log_every_n_steps=1,
        )
        train_dl, val_dl = self.build_train_val_loader()
        trainer.fit(model, train_dl, val_dl)
        trainer.save_checkpoint(output_dir + "/trainer/final-model.ckpt")

And here is the error message I get when I run the validation_step()

sanity validation
Traceback (most recent call last):
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 148, in <module>
    run(sys.argv[1])
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 144, in run
    obj.train()
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 127, in train
    trainer.fit(model, train_dl, val_dl)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 520, in fit
    call._call_and_handle_interrupt(
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 559, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 935, in _run
    results = self._run_stage()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 976, in _run_stage
    self._run_sanity_check()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1005, in _run_sanity_check
    val_loop.run()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 177, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 375, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 288, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 378, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 44, in validation_step
    loss_dict = self.backbone(image, target)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/rastervision/pytorch_learner/object_detection_utils.py", line 341, in forward
    loss_dict['total_loss'] = sum(list(loss_dict.values()))
AttributeError: 'list' object has no attribute 'values'

But when I run the training_step(), I get a different error:

sanity training
Traceback (most recent call last):
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 148, in <module>
    run(sys.argv[1])
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 144, in run
    obj.train()
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 127, in train
    trainer.fit(model, train_dl, val_dl)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 520, in fit
    call._call_and_handle_interrupt(
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 559, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 935, in _run
    results = self._run_stage()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 978, in _run_stage
    self.fit_loop.run()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 218, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 185, in run
    self._optimizer_step(kwargs.get("batch_idx", 0), closure)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 261, in _optimizer_step
    call._call_lightning_module_hook(
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 142, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1265, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 158, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 224, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 114, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torch/optim/adam.py", line 118, in step
    loss = closure()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 101, in _wrap_closure
    closure_result = closure()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 126, in closure
    step_output = self._step_fn()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 308, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 288, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 366, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 34, in training_step
    loss_dict = self.backbone(image, target)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/rastervision/pytorch_learner/object_detection_utils.py", line 340, in forward
    loss_dict = self.model(input, _targets)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torchvision/models/detection/generalized_rcnn.py", line 105, in forward
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torchvision/models/detection/roi_heads.py", line 772, in forward
    loss_classifier, loss_box_reg = fastrcnn_loss(class_logits, box_regression, labels, regression_targets)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torchvision/models/detection/roi_heads.py", line 31, in fastrcnn_loss
    classification_loss = F.cross_entropy(class_logits, labels)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
IndexError: Target 2 is out of bounds.
Epoch 0:   3%|▎         | 1/34 [00:14<07:52, 14.32s/it, v_num=93]

I believe this error occurs because my prediction has many more bounding boxes than my ground truth, but that seems like a normal issue that would occur in training, so I'm not sure how to fix it. Any help is appreciated

aerotractjack · 2023-04-05T21:15:02Z

aerotractjack
Apr 5, 2023
Author

With the following changes I'm able to run my model, but it throws an error whenever the number of ground-truth boxes differs from the number of predicted boxes (which happens every time):

    def boxlist_to_tensor(self, bl):
        bl = [self.backbone.boxlist_to_model_input_dict(b) for b in bl]
        boxes = [b["boxes"] for b in bl]
        labels = [b["labels"] for b in bl]
        boxes = torch.vstack(boxes)
        labels = torch.concat(labels).float()
        return boxes, labels
    
    def training_step(self, batch, batch_idx):
        print("Sanity training")
        x, y = batch
        y_hat = self.backbone.forward(x)
        box_hat, label_hat = self.boxlist_to_tensor(y_hat)
        box, label = self.boxlist_to_tensor(y)
        box_loss = generalized_box_iou_loss(box, box_hat)
        label_loss = F.mse_loss(label, label_hat)
        return box_loss + label_loss
    
    def validation_step(self, batch, batch_idx):
        print("Sanity validation")
        x, y = batch
        y_hat = self.backbone.forward(x)
        box_hat, label_hat = self.boxlist_to_tensor(y_hat)
        box, label = self.boxlist_to_tensor(y)
        box_loss = generalized_box_iou_loss(box, box_hat)
        label_loss = F.mse_loss(label, label_hat)
        return box_loss + label_loss

I get errors such as

Traceback (most recent call last):
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 159, in <module>
    run(sys.argv[1])
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 155, in run
    obj.train()
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 138, in train
    trainer.fit(model, train_dl, val_dl)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 520, in fit
    call._call_and_handle_interrupt(
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 559, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 935, in _run
    results = self._run_stage()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 976, in _run_stage
    self._run_sanity_check()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1005, in _run_sanity_check
    val_loop.run()
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py", line 177, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 115, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 375, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_kwargs.values())
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 288, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 378, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/home/aerotract/software/rvml-lightning-pipeline/objdet/rvlightning.py", line 58, in validation_step
    box_loss = generalized_box_iou_loss(box, box_hat)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torchvision/ops/giou_loss.py", line 48, in generalized_box_iou_loss
    intsctk, unionk = _loss_inter_union(boxes1, boxes2)
  File "/home/aerotract/.miniconda3/envs/rv/lib/python3.9/site-packages/torchvision/ops/_utils.py", line 103, in _loss_inter_union
    intsctk[mask] = (xkis2[mask] - xkis1[mask]) * (ykis2[mask] - ykis1[mask])
IndexError: The shape of the mask [264] at index 0 does not match the shape of the indexed tensor [1] at index 0

2 replies

aerotractjack Apr 5, 2023
Author

Should I just zero-pad the two inputs to the loss function so they are the same shape? For example, if boxes_true is of shape (1,4) and boxes_pred is of shape (58,4) can I add 57 rows of [0,0,0,0] to boxes_true?
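
On the padding question: detection pipelines usually avoid padding and instead match each predicted box to a ground-truth box by overlap (IoU), so the two sets can have different sizes. The sketch below is a simplified pure-Python illustration of that idea, not what torchvision or Raster Vision literally run; the function names, the 0.5 threshold, and the example boxes are all assumptions made for the example.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def match_predictions(gt_boxes, pred_boxes, iou_thresh=0.5):
    """For each predicted box, return the index of the best-overlapping
    ground-truth box, or -1 if no overlap reaches iou_thresh."""
    matches = []
    for pred in pred_boxes:
        best_idx, best_iou = -1, iou_thresh
        for i, gt in enumerate(gt_boxes):
            overlap = iou(gt, pred)
            if overlap >= best_iou:
                best_idx, best_iou = i, overlap
        matches.append(best_idx)
    return matches


# One ground-truth box, three predictions: no padding required.
gt = [(10, 10, 50, 50)]
preds = [(12, 12, 48, 52), (100, 100, 140, 140), (0, 0, 20, 20)]
matches = match_predictions(gt, preds)
```

Unmatched predictions (index -1) would typically be treated as background rather than compared against a padded all-zero box.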

aerotractjack Apr 6, 2023
Author

I am getting all sorts of errors with this. My data has a variable number of boxes per image, my model predicts a variable number of boxes per image, and PyTorch doesn't seem to like that. Is there something really obvious I'm missing? It seems like this would be a normal thing to happen in early training stages.

AdeelH · 2023-04-06T10:13:14Z

AdeelH
Apr 6, 2023
Maintainer

> I believe this error occurs because my prediction has many more bounding boxes than my ground truth, but that seems like a normal issue that would occur in training, so I'm not sure how to fix it. Any help is appreciated

No, the "index" in the error refers to the class ID, not the number of boxes. Try the following:

- Remove the background/null class from your ClassConfig.
- When creating the FasterRCNN, use num_classes=(len(class_config) + 1).
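
To illustrate why the class count matters, here is a minimal pure-Python sketch. It relies on torchvision's convention that label 0 is reserved for background, so foreground labels run from 1 to len(class_names). The class names and the `cross_entropy` helper are made up for illustration; this is not Raster Vision's or torchvision's actual code.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for a single sample; `target` indexes into `logits`."""
    if not 0 <= target < len(logits):
        raise IndexError(f"Target {target} is out of bounds.")
    log_sum_exp = math.log(sum(math.exp(z) for z in logits))
    return log_sum_exp - logits[target]


# Hypothetical foreground classes, with no null/background entry.
class_names = ["tree", "stump"]

# torchvision detection models count the background class (index 0) too.
num_classes = len(class_names) + 1

# Foreground labels start at 1 because 0 means background.
labels = {name: i + 1 for i, name in enumerate(class_names)}

# One logit per class including background: label 2 is valid.
loss_ok = cross_entropy([0.0] * num_classes, labels["stump"])

# One logit too few (the +1 was forgotten): label 2 has nowhere to point.
try:
    cross_entropy([0.0] * len(class_names), labels["stump"])
    error_message = None
except IndexError as exc:
    error_message = str(exc)
```

With only len(class_names) outputs, label 2 has no matching logit, which is the same kind of failure as the `IndexError: Target 2 is out of bounds.` in the traceback.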

> And here is the error message I get when I run the validation_step()

Torchvision OD models behave differently during validation. Instead of returning losses, they return predicted boxes.

This might help: https://github.com/azavea/raster-vision/blob/master/rastervision_pytorch_learner/rastervision/pytorch_learner/object_detection_learner.py#L64-L91

4 replies

aerotractjack Apr 6, 2023
Author

Removing the "null" class from my ClassConfig gives me an error that

  The null_class, "null", must be in list of class names. (type=value_error.config)

aerotractjack Apr 6, 2023
Author

Adding the null class back and incorporating the train/val steps from the learner class solved my issues. I was getting tripped up in the communication between RV and lightning. Thank you

AdeelH Apr 7, 2023
Maintainer

> Removing the "null" class from my ClassConfig gives me an error that
>   The null_class, "null", must be in list of class names. (type=value_error.config)

You also have to remove the null_class= param when you do that.

It's great that you got it to work though!

aerotractjack Apr 7, 2023
Author

> You also have to remove the null_class= param when you do that.

That makes sense!

aerotractjack · 2023-04-07T16:09:15Z

aerotractjack
Apr 7, 2023
Author

You're a legend, Adeel! Appreciate the help so much. Sorry for the super long tracebacks, haha. If you're ever in Oregon, come stop by Aerotract HQ!

1 reply

AdeelH Apr 8, 2023
Maintainer

Thank you! Happy to help! And tracebacks are encouraged! Much harder to debug without them.
