Quickly Generating Geodataset from many scenes #2191

JamesZDonline · 2024-07-05T15:40:31Z

Hello! I want to start out by saying that we've been using RasterVision for a couple of years now and love it! It just keeps getting better.

We're creating ClassificationSlidingWindowGeoDatasets and depending on the size and complexity of the AOI/transformations it takes a dozen seconds to a few minutes to create a dataset for a given scene.

The issue we're running into now is that we're trying to scale up. We have 670 scenes distributed across ~100k sqkm we want to use to create our dataset. This means it would take a few hours just to build the dataset. This is a minor inconvenience if we have to do it once, but if we have to do it every time we want to run an experiment it becomes untenable. We've thought of two possible solutions:

Build the dataset once and save it
Use multiprocessing to speed up the creation of the dataset

Unfortunately, both of these ideas have been defeated by the fact that Rasterio data can't be pickled (a requirement both for saving a PyTorch dataset and for Python's multiprocessing libraries).

Does anyone have any suggestions about how we might speed this up? And/or save the dataset output so that it can be reloaded instead of starting over each time?

Thanks in advance for any ideas!
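
On the pickling limitation described above, one general pattern is to keep open rasters out of the inter-process traffic entirely: each worker receives only plain values (a path and some numbers), opens whatever it needs internally, and returns plain tuples. The sketch below is an illustration under assumptions, not something Raster Vision prescribes. The names `windows_for_scene` and `windows_for_all` and the job tuple layout are hypothetical, and a simple grid stands in for the real per-scene work.

```python
from concurrent.futures import ProcessPoolExecutor


def windows_for_scene(job):
    """Worker: plain values in, plain tuples out.

    In real use the raster would be opened and the windows computed here,
    inside the worker. A simple non-overlapping grid stands in for that work.
    """
    uri, height, width, size = job
    windows = [(y, x, y + size, x + size)
               for y in range(0, height - size + 1, size)
               for x in range(0, width - size + 1, size)]
    return uri, windows


def windows_for_all(jobs, max_workers=4):
    """Fan jobs out over processes; only picklable data crosses the boundary."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(windows_for_scene, jobs))
```

When run as a script, `windows_for_all` should be called under an `if __name__ == "__main__":` guard so that it also works on platforms that spawn worker processes.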

AdeelH · 2024-07-05T16:55:37Z

Maintainer

Hi, I've noticed datasets taking a few seconds to build for remote files (which is slow enough as it is) but never dozens of seconds. Currently, RasterioSource reads a small 1x1xbands chip to determine num_channels and dtype after applying transformers. I don't think this is ideal and this might be what is causing the bottleneck in your case.

Questions:

Are these files remote?
How many bands are there?
How much faster is it if you specify no AOI?
How much faster is it if you just call rasterio.open() on the file instead of creating a RV GeoDataset?

Another solution would be to pre-chip the dataset and then use ClassificationImageDataset.

> we've been using RasterVision for a couple of years now and love it! It just keeps getting better.

Thank you very much for the kind words!

2 replies

JamesZDonline Jul 5, 2024
Author

Hello,
Okay, good to know. It seemed like seconds/minutes was a longer time than I expected/remembered from previous runs, but it's been a few months since I've run a dataset, so I wasn't trusting my memory.

No, the files are fortunately local.
8 bands (WorldView 2 and WorldView3). They're huge images. (238525x39869 pixels for example)
You nailed it... a dozen images with AOI took 4 min 48 seconds. They took 1.8 seconds without AOI.
rasterio.open() is almost instantaneous (0.3 seconds)

This is surprising to me! I had assumed that the AOI would speed things up since it could theoretically ignore most of the image. I guess it's creating windows for the whole image and then doing some sort of spatial join to figure out which ones are in the AOI before continuing? And the spatial join step is what's slowing it down? I'm having trouble tracking down exactly how that works in the documentation.

If I'm correct in that guess, I'm thinking a potential workaround might be to:

extract the AOI extent,
crop the raster to the AOI on import using the extent parameter for RasterioSource and then
pass the cropped image to the geodataset

Does that seem plausible to you? Do you have a better idea about how to get around this? I do unfortunately need the AOI.

AdeelH Jul 5, 2024
Maintainer

> I guess it's creating windows for the whole image and then doing some sort of spatial join to figure out which ones are in the AOI before continuing? And the spatial join step is what's slowing it down?

That's right. It's not even a spatial join, but rather it loops over all the windows, converts them to shapely polygons, and then checks if they lie within the AOI. The relevant code is here:

raster-vision/rastervision_pytorch_learner/rastervision/pytorch_learner/dataset/dataset.py

Lines 265 to 277 in dcbdd71

    def init_windows(self) -> None:
        """Pre-compute windows."""
        windows = self.scene.extent.get_windows(
            self.size,
            stride=self.stride,
            padding=self.padding,
            pad_direction=self.pad_direction)
        if len(self.scene.aoi_polygons_bbox_coords) > 0:
            windows = Box.filter_by_aoi(
                windows,
                self.scene.aoi_polygons_bbox_coords,
                within=self.within_aoi)
        self.windows = windows

and here:

raster-vision/rastervision_core/rastervision/core/box.py

Lines 456 to 475 in dcbdd71

    def filter_by_aoi(windows: List['Box'],
                      aoi_polygons: List[Polygon],
                      within: bool = True) -> List['Box']:
        """Filters windows by a list of AOI polygons

        Args:
            within: if True, windows are only kept if they lie fully within an
                AOI polygon. Otherwise, windows are kept if they intersect an
                AOI polygon.
        """
        # merge overlapping polygons, if any
        aoi_polygons: Polygon | MultiPolygon = unary_union(aoi_polygons)
        if within:
            keep_window = aoi_polygons.contains
        else:
            keep_window = aoi_polygons.intersects
        out = [w for w in windows if keep_window(w.to_shapely())]
        return out
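
The loop above runs one exact polygon test per window, even for windows nowhere near the AOI. One possible way to cut that cost is to reject windows with a cheap bounding-box comparison first. This is a minimal sketch and an assumption, not Raster Vision's implementation. Windows are plain `(ymin, xmin, ymax, xmax)` tuples, `prefilter_windows` and `bounds_overlap` are hypothetical names, and `exact_test` stands in for the `keep_window(w.to_shapely())` call.

```python
from typing import Callable, Iterable, List, Tuple

# (ymin, xmin, ymax, xmax) in pixel coordinates
Bounds = Tuple[int, int, int, int]


def bounds_overlap(a: Bounds, b: Bounds) -> bool:
    """True if two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def prefilter_windows(windows: Iterable[Bounds], aoi_bounds: Bounds,
                      exact_test: Callable[[Bounds], bool]) -> List[Bounds]:
    """Run the expensive exact_test only on windows near the AOI's bounding box."""
    return [
        w for w in windows
        if bounds_overlap(w, aoi_bounds) and exact_test(w)
    ]
```

Because the comparison is inclusive, the pre-check never rejects a window that a contains or intersects test would have kept; it only skips windows that cannot match. Shapely's prepared geometries (`shapely.prepared.prep`) are another option worth trying when the same polygon is tested many times.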

I'm not sure what a simple way to speed it up would be, but a quick-and-dirty workaround would be to just store the filtered windows and assign them directly next time. That is:

ds = SlidingWindowGeoDataset(...)
filtered_windows = ds.windows

# next time:
ds = SlidingWindowGeoDataset(size=50_000, stride=50_000) # large nums to avoid generating windows
ds.windows = filtered_windows
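
To carry the filtered windows across sessions, they can be written to disk as plain numbers, which avoids pickling any Rasterio objects. The following is a sketch under assumptions: the `Box` class here is a hypothetical stand-in that only mirrors the `Box(ymin=..., xmin=..., ymax=..., xmax=...)` field order, and `save_windows` and `load_windows` are illustrative helper names, not Raster Vision functions.

```python
import json
from pathlib import Path
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical stand-in for a window with pixel coordinates."""
    ymin: int
    xmin: int
    ymax: int
    xmax: int


def save_windows(windows, path):
    """Write windows as a JSON list of [ymin, xmin, ymax, xmax] rows."""
    rows = [[w.ymin, w.xmin, w.ymax, w.xmax] for w in windows]
    Path(path).write_text(json.dumps(rows))


def load_windows(path, box_cls=Box):
    """Rebuild window objects from the rows written by save_windows."""
    rows = json.loads(Path(path).read_text())
    return [box_cls(*row) for row in rows]
```

In real use, `box_cls` would be the library's own window class, and the loaded list would be assigned to `ds.windows` as in the workaround above.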

JamesZDonline · 2024-07-10T07:49:00Z

Author

This is very helpful! Thank you!

My idea for speeding things up was that I should be able to crop the image using the extent of the AOI and the bbox parameter for RasterioSource, and then only generate windows for that small area (1 km x 1 km) instead of for each massive image (15 km x 100 km).

This seems to work really well for unlabeled data. Using the from_uris constructor before, it took over 2 hours to create the dataset using the 6,300 AOIs across our survey area. Cropping the images first let us do it in less than 10 minutes!
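
A rough count shows why cropping helps so much: the number of candidate windows scales with the area they are generated over. The arithmetic below is a sketch. The 256-pixel chip size and the 2000 x 2000 pixel AOI crop are assumptions chosen for illustration, `count_windows` is a hypothetical helper, and it ignores padding, so the library's real counts may differ.

```python
def count_windows(height, width, size, stride):
    """Count size x size windows that fit fully in the extent (no padding)."""
    if height < size or width < size:
        return 0
    rows = (height - size) // stride + 1
    cols = (width - size) // stride + 1
    return rows * cols


# Full image dimensions quoted earlier in the thread: 238525 x 39869 pixels
full = count_windows(238525, 39869, size=256, stride=256)
# Assumed AOI crop of 2000 x 2000 pixels
cropped = count_windows(2000, 2000, size=256, stride=256)
print(f"full image: {full} windows, AOI crop: {cropped} windows")
```

Every window counted for the full image would otherwise be converted to a polygon and tested against the AOI, so shrinking the extent removes most of that work before it starts.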

The problem comes in when I try the same approach using GeoJSON labels for chip classification. I'm pretty sure there's something simple I'm missing, but I keep getting a KeyError when I try to query the label source. Would you be willing to look at this and help me figure out what I'm missing? I've been fighting it for a couple of days with no luck.

from rastervision.core.data import (
    ChipClassificationLabelSource, ChipClassificationLabelSourceConfig,
    ClassConfig, ClassInferenceTransformer, GeoJSONVectorSource,
    RasterioCRSTransformer, RasterioSource)

image_path = "data/Analysis_Imagery/Region01/Full/506412069050_Ortho_Bundle_Mosaic_8bit.TIF"
aoi_path = "data/AOIs/Region01/Training/imageid_506412069050.geojson"
label_path = "data/Labels/Region01/imageid_506412069050.geojson"

# Create the class config
class_config = ClassConfig(
    names=['nothing', 'object'],
    colors=['lightgray', 'lightblue'],
    null_class='nothing')

# Create a crs_transformer from the image
crs_transformer = RasterioCRSTransformer.from_uri(image_path)

# Read the AOI once to get the extent to clip to
aoiSource = GeoJSONVectorSource(aoi_path, crs_transformer)
myextent = aoiSource.extent

# Clip the image to the extent of the AOI, so that chip windows will only be
# created within the bounds of the AOI extent
rasterSource = RasterioSource(
    image_path,  # path to the image
    allow_streaming=True,  # so we don't have to load the whole image
    bbox=myextent)

# Create the AOI
aoiSource = GeoJSONVectorSource(
    aoi_path, rasterSource.crs_transformer, bbox=rasterSource.bbox)

# If there are labels, import them as a GeoJSONVectorSource, clipping them to
# the AOI extent using bbox
labelSource = None
if label_path is not None:
    # Import labels as a GeoJSONVectorSource
    labelVectorSource = GeoJSONVectorSource(
        label_path,  # path to the label GeoJSON
        rasterSource.crs_transformer,  # convert labels from geographic to pixel coordinates
        bbox=rasterSource.bbox,  # clip them to the AOI extent
        vector_transformers=[
            ClassInferenceTransformer(
                default_class_id=class_config.get_class_id('object'))  # use class config
        ])

    # Configure labels for chip classification
    labelSourceConfig = ChipClassificationLabelSourceConfig(
        ioa_thresh=0.5,  # 50% of the feature must be in the chip for the chip to be positive. NOTE: this threshold could be changed
        infer_cells=True,  # figure out what the cells are; we're not providing them explicitly
        background_class_id=class_config.null_class_id,
        use_intersection_over_cell=False)  # if True, ioa_thresh would require 50% of the *chip* to contain features to be positive; that's not what we want

    # Convert the label vectors to a label source (a format suitable for machine learning)
    labelSource = ChipClassificationLabelSource(
        labelSourceConfig,  # use the above config
        labelVectorSource,  # use the above label vectors
        bbox=rasterSource.bbox,  # clip to the AOI extent
        lazy=True)  # don't create labels until they are requested, to avoid creating unnecessary ones

chip = rasterSource[:100,:100,[4,2,1]]

label = labelSource[:100,:100]
print(label)

This results in the following error

{
	"name": "KeyError",
	"message": "Box(ymin=0, xmin=0, ymax=100, xmax=100)",
	"stack": "---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[81], line 3
      1 chip = rasterSource[:100,:100,[4,2,1]]
----> 3 label = labelSource[:100,:100]
      4 print(label)

File /opt/src/rastervision_core/rastervision/core/data/label_source/chip_classification_label_source.py:232, in ChipClassificationLabelSource.__getitem__(self, key)
    230     return self.labels[window].class_id
    231 else:
--> 232     return super().__getitem__(key)

File /opt/src/rastervision_core/rastervision/core/data/label_source/label_source.py:61, in LabelSource.__getitem__(self, key)
     59     raise NotImplementedError()
     60 window, _ = parse_array_slices_2d(key, extent=self.extent)
---> 61 return self[window]

File /opt/src/rastervision_core/rastervision/core/data/label_source/chip_classification_label_source.py:230, in ChipClassificationLabelSource.__getitem__(self, key)
    228     if window not in self.labels:
    229         self.labels += self.infer_cells(cells=[window])
--> 230     return self.labels[window].class_id
    231 else:
    232     return super().__getitem__(key)

File /opt/src/rastervision_core/rastervision/core/data/label/chip_classification_labels.py:56, in ChipClassificationLabels.__getitem__(self, cell)
     55 def __getitem__(self, cell: Box) -> ClassificationLabel:
---> 56     return self.cell_to_label[cell]

KeyError: Box(ymin=0, xmin=0, ymax=100, xmax=100)"
}

1 reply

AdeelH Jul 10, 2024
Maintainer

This is a bug. Sorry. Thanks for catching and reporting it. I have a fix here: #2193.

JamesZDonline · 2024-07-10T17:54:00Z

Author

This is awesome! Thank you! I was afraid that I was losing my mind...

I've tried it with the new update and everything works now. Thank you so much for your help!
