RasterVision + Lightning experiment results much worse than rastervision pipeline results #1770
Unanswered
aerotractjack asked this question in Q&A
Hi, my colleague and I are using RasterVision to detect objects in aerial imagery. To get started, we built a pipeline using the rastervision pipeline setup, and it achieved great results with little work from us. This is super cool! In an effort to gain more visibility into the pipeline and more ability to customize it, we want to recreate the experiment using rastervision to handle our datasets and predictions, and pytorch/lightning for our model.
Before I get into my question: I know this may not be a RasterVision issue at all. It could just be a model training/optimization or dataset question, in which case this is not the place to ask, but I'm curious whether I'm missing something obvious on the rastervision side. Following the lightning tutorial on the website, and all the great help you've provided me in the discussions over the past week, I finally got a training+prediction pipeline working the other day. But strangely, even though my loss converges to 0 fairly smoothly, my mAP and mAP50 metrics level off around 50-60%, and my actual prediction boxes are all over the place.

Colleague's working pipeline repository
My not-so-working RV+lightning repo
Both of us are using a pretrained version of ResNet50 and training on the same data. The only major difference is that he is using the pipeline and I am not. [Here is a link to his pipeline configuration](https://github.com/aerotractjack/rvml/blob/seth/rvpipeline.py). One important snippet is how he defines his model:
And here is how I create my model:
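Roughly, both boil down to something like the following sketch (generic torchvision code to show the shape of it, not either of our exact snippets; the class count is a hypothetical placeholder):

```python
# Rough sketch: a COCO-pretrained Faster R-CNN with a ResNet-50 FPN
# backbone, with the box predictor head swapped out for our own class
# count. num_classes = 2 is a hypothetical placeholder
# (1 object class + background).
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 2

# Load the COCO_V1 pretrained weights.
model = fasterrcnn_resnet50_fpn(
    weights=FasterRCNN_ResNet50_FPN_Weights.COCO_V1)

# Replace the classification/regression head to match our classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```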
My LightningModel is very similar to the ObjectDetectionLearner; you can see the code here: LightningModel
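In outline, it does something like this (a stripped-down sketch assuming a torchvision detection model and batches in its (images, targets) format, not the full class; the optimizer, learning rate, and metric wiring are illustrative):

```python
# Stripped-down sketch of the LightningModel; hyperparameters and
# metric wiring here are illustrative, not the real code.
import pytorch_lightning as pl
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision


class LightningModel(pl.LightningModule):
    def __init__(self, model, lr=1e-4):
        super().__init__()
        self.model = model
        self.lr = lr
        self.map_metric = MeanAveragePrecision()

    def training_step(self, batch, batch_idx):
        images, targets = batch
        # In train mode, torchvision detection models take
        # (images, targets) and return a dict of losses.
        loss_dict = self.model(images, targets)
        loss = sum(loss_dict.values())
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        images, targets = batch
        # In eval mode, the model returns per-image predictions
        # (boxes, labels, scores), which feed the mAP metric.
        preds = self.model(images)
        self.map_metric.update(preds, targets)

    def on_validation_epoch_end(self):
        metrics = self.map_metric.compute()
        self.log('map', metrics['map'])
        self.log('map_50', metrics['map_50'])
        self.map_metric.reset()

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=self.lr)
```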
At first, I wasn't setting the weights for my ResNet, so I wasn't actually using a pretrained model. Once I realized this, I set up my model to use the default COCOv1 weights, which helped my mAP improve, but I still hit a wall at around 50%, as you can see in tensorboard. I've included some images of my data, predictions, and tensorboard output. This is all being done on a small sample of my data, in an attempt to narrow down the problem.
Here is a link to a sample training data chip
Here is a link to some sample predictions (RED) overlaid on the ground truth (GREEN)
Here is a snapshot of my tensorboard training loss value
Here is a snapshot of my tensorboard validation mAP and mAP50 values
As you can see in the images, my loss curve approaches 0, which is promising, and my mAP/mAP50 values increase for the first few epochs and then level off. But my predictions look essentially random: even with small loss values and an OK mAP, they are not correct at all.
So far I've tried different ResNet variants, different learning-rate/optimizer combinations, and training for much longer on more data. All the results end up similar. My colleague's pipeline performs very well: during training, his loss curve looks similar to mine (though smoother), but his mAP and mAP50 values quickly approach 95% within a few (5-10) epochs.
Do you have any ideas about where I could be going wrong? Am I leaving something out when constructing my model? Looking at the rastervision source code, I don't see any major differences between how I build my model and how the pipeline does it. Any help is super appreciated. Thank you!
Replies: 1 comment · 1 reply

Hi, I have not had the chance to look at this closely yet, but here are some initial thoughts: