GEOMESA-3259 FSDS - Add support for GeoParquet #3064

Open
wants to merge 5 commits into
base: main

@adeet1 adeet1 commented Mar 20, 2024

  • Create a bounding box for each geometry, and add it to the GeoParquet metadata (which requires the metadata map to be changed to a mutable data structure)
  • Read and write all geometry attributes as binary (a primitive Parquet type) instead of as a pair of x/y doubles (a group Parquet type), using the same converter and attribute writer for all geometry types, while also maintaining backwards compatibility
  • Add support for parsing WKB bytes in the Parquet geometry transformer functions
  • Use a spatial index instead of a GeoTools filter for bounding box queries
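For reviewers who want a concrete picture of the first three bullets, here is a rough, standalone sketch in Python (not the Scala code in this PR). It assumes 2D points only, and the names `encode_wkb_point`, `decode_wkb_point` and `BoundsTracker` are hypothetical stand-ins, not classes from GeoMesa. It shows a geometry stored as a single WKB byte string and a bounding box that grows as each geometry is written.

```python
import struct


def encode_wkb_point(x, y):
    """Encode a 2D point as little-endian WKB (byte order 1, geometry type 1)."""
    return struct.pack("<BIdd", 1, 1, x, y)


def decode_wkb_point(wkb):
    """Decode a 2D WKB point, honouring the byte-order flag in the first byte."""
    order = "<" if wkb[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(order + "I", wkb, 1)
    if geom_type != 1:
        raise ValueError(f"only 2D points are handled here, got type {geom_type}")
    return struct.unpack_from(order + "dd", wkb, 5)


class BoundsTracker:
    """Hypothetical observer: expands [xmin, ymin, xmax, ymax] per written point."""

    def __init__(self):
        # An empty list (rather than None) stands for "nothing written yet".
        self.bounds = []

    def observe(self, wkb):
        x, y = decode_wkb_point(wkb)
        if not self.bounds:
            self.bounds = [x, y, x, y]
        else:
            b = self.bounds
            self.bounds = [min(b[0], x), min(b[1], y), max(b[2], x), max(b[3], y)]
```

Feeding every written geometry through `observe` leaves `bounds` holding the box that would go into the file metadata. Other geometry types would need their own envelope logic, which this sketch leaves out.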

adeet1 commented Mar 20, 2024

To-do items:

  • Make FilterConverter.spatial backwards-compatible
  • Add support for 3D geometries and bounding boxes
  • Add a unit test assertion that ensures the file metadata validates against the GeoParquet metadata schema
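As a rough illustration of the last two to-do items, the Python sketch below builds the JSON document that GeoParquet stores under the `geo` key of the file metadata, then runs a minimal structural check on it. The check is an assumption-laden stand-in for real JSON Schema validation: it only looks at the fields I understand to be required by the GeoParquet 1.0.0 schema and at the bbox length (4 numbers for 2D, 6 for 3D). The function names `geo_metadata` and `check_geo_metadata` are hypothetical.

```python
import json


def geo_metadata(bounds_by_column, primary_column):
    """Build the JSON string for the "geo" file metadata key.

    Returns None when there are no geometry columns, so the caller can
    omit the metadata entirely.
    """
    if not bounds_by_column:
        return None
    columns = {}
    for name, bbox in bounds_by_column.items():
        # An empty geometry_types list means the types are not known.
        column = {"encoding": "WKB", "geometry_types": []}
        if bbox:
            column["bbox"] = list(bbox)
        columns[name] = column
    doc = {"version": "1.0.0", "primary_column": primary_column, "columns": columns}
    return json.dumps(doc)


def check_geo_metadata(text):
    """Minimal structural check; not a substitute for schema validation."""
    problems = []
    doc = json.loads(text)
    for key in ("version", "primary_column", "columns"):
        if key not in doc:
            problems.append(f"missing top-level key: {key}")
    for name, column in doc.get("columns", {}).items():
        for key in ("encoding", "geometry_types"):
            if key not in column:
                problems.append(f"{name}: missing {key}")
        bbox = column.get("bbox")
        if bbox is not None and len(bbox) not in (4, 6):
            problems.append(f"{name}: bbox must have 4 or 6 numbers")
    return problems
```

A unit test could assert that `check_geo_metadata` returns an empty list for the metadata read back from a written file. The real assertion should still validate against the published schema file.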

@adeet1 adeet1 left a comment

  • When we compact GeoParquet files in a filesystem partition, we need to ensure that the bounding boxes in the files' metadata are merged correctly, i.e. assert that the union of the bounding boxes of the files before compaction equals the union of the bounding boxes of the newly compacted files.
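The invariant described in this comment can be sketched in a few lines of Python. This is only an illustration with a hypothetical helper name (`union_bounds`), assuming 2D boxes in `[xmin, ymin, xmax, ymax]` order and treating an empty list as "no geometries in this file".

```python
def union_bounds(boxes):
    """Union of 2D bounding boxes given as [xmin, ymin, xmax, ymax].

    Empty boxes (files without geometries) are skipped; the result is an
    empty list when every input is empty.
    """
    non_empty = [b for b in boxes if b]
    if not non_empty:
        return []
    return [
        min(b[0] for b in non_empty),
        min(b[1] for b in non_empty),
        max(b[2] for b in non_empty),
        max(b[3] for b in non_empty),
    ]


def compaction_preserves_bounds(before, after):
    """True when the files before and after compaction cover the same extent."""
    return union_bounds(before) == union_bounds(after)
```

In a test, `before` would hold the per-file boxes read from a partition prior to compaction and `after` the boxes read from the compacted files; the assertion is that `compaction_preserves_bounds(before, after)` holds for each partition.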

@elahrvivaz elahrvivaz marked this pull request as draft April 11, 2024 13:43
commit 0ea8bff
Author: adeet1 <[email protected]>
Date:   Fri Mar 29 20:29:40 2024 +0000

    Optimize imports

commit 9ebd85a
Author: adeet1 <[email protected]>
Date:   Fri Mar 29 20:12:03 2024 +0000

    Initialize bounds as an empty array instead of null

    * This fixes a failing unit test "suppress or allow empty output files" in ExportCommandTest.scala

commit 4cff76a
Author: adeet1 <[email protected]>
Date:   Fri Mar 29 15:18:09 2024 +0000

    Split Parquet and Orc file compaction tests in order to differentiate the comparisons

commit 16d88fd
Author: adeet1 <[email protected]>
Date:   Wed Mar 27 20:48:07 2024 +0000

    Assert in each partition that GeoParquet metadata bounding boxes across files are correctly merged upon compaction

    * Write features with different geometries and coordinates, so we can test the merging of unique bounding boxes.

commit 4197e4d
Author: adeet1 <[email protected]>
Date:   Thu Mar 28 21:27:17 2024 +0000

    Change thunk to lazy vals

commit 4eaf9fc
Author: adeet1 <[email protected]>
Date:   Thu Mar 28 20:22:10 2024 +0000

    Implement methods instead of lazy vals

commit c82c0d2
Author: adeet1 <[email protected]>
Date:   Thu Mar 28 20:13:56 2024 +0000

    Move test scope

commit 09588e8
Author: adeet1 <[email protected]>
Date:   Thu Mar 28 20:01:00 2024 +0000

    Don't create a GeoParquet metadata string if the SFT has no geometries

commit 137dcb5
Author: adeet1 <[email protected]>
Date:   Thu Mar 28 19:36:31 2024 +0000

    Re-implement GeoParquet metadata logic to work for SFTs with multiple geometries

commit 360c2c7
Author: adeet1 <[email protected]>
Date:   Thu Mar 28 16:58:26 2024 +0000

    Change back to GroupReadSupport

    * This simply checks if the Parquet file is valid - it won't deserialize/manifest everything and thus saves us some processing

commit 3bce59e
Author: adeet1 <[email protected]>
Date:   Thu Mar 28 14:39:34 2024 +0000

    Use the released GeoParquet metadata schema, not the dev one

commit 878abb5
Author: adeet1 <[email protected]>
Date:   Thu Mar 28 14:30:35 2024 +0000

    Optimize imports

commit d49fc3a
Author: adeet1 <[email protected]>
Date:   Wed Mar 27 14:47:54 2024 +0000

    Assert that the bounding box in the GeoParquet metadata is correct

commit 2ae9574
Author: adeet1 <[email protected]>
Date:   Tue Mar 26 23:14:46 2024 +0000

    Instantiate the observer directly in SimpleFeatureWriteSupport instead of passing it down from SimpleFeatureParquetWriter

commit 9770a3a
Author: adeet1 <[email protected]>
Date:   Fri Mar 22 14:09:05 2024 +0000

    Tweak targetSize

commit 604e614
Author: adeet1 <[email protected]>
Date:   Wed Mar 20 19:55:59 2024 +0000

    Assert that the file metadata adheres to the GeoParquet metadata json schema

commit 2257d6c
Author: adeet1 <[email protected]>
Date:   Thu Mar 21 22:03:29 2024 +0000

    Deprecate the ParquetFunctionFactory class, but provide backwards compatibility

commit 03e699f
Author: adeet1 <[email protected]>
Date:   Thu Mar 21 20:04:43 2024 +0000

    Create a new metadata map instance when adding bounding box

commit 8630eed
Author: adeet1 <[email protected]>
Date:   Thu Mar 21 18:07:30 2024 +0000

    Change BoundsObserver argument back to FileSystemObserver

commit 921274b
Author: adeet1 <[email protected]>
Date:   Thu Mar 21 17:53:38 2024 +0000

    If the sft has no geometry field, then omit the GeoParquet metadata entirely

commit c1dda99
Author: adeet1 <[email protected]>
Date:   Thu Mar 21 17:51:26 2024 +0000

    Omit orientation, edges and epoch

commit dabdc43
Author: adeet1 <[email protected]>
Date:   Thu Mar 21 17:39:47 2024 +0000

    Make variables private to avoid exposing mutable state outside the scope of the class

commit 5eecf48
Author: adeet1 <[email protected]>
Date:   Thu Mar 21 17:32:01 2024 +0000

    Delete redundant checks in geometry read and write support

commit 0ed5c65
Author: adeet1 <[email protected]>
Date:   Thu Mar 21 14:55:29 2024 +0000

    Delete duplicate dependency

commit 3dc798d
Author: adeet1 <[email protected]>
Date:   Wed Mar 20 19:09:44 2024 +0000

    Support backwards compatibility for FilterConverter

commit 7dea125
Author: adeet1 <[email protected]>
Date:   Wed Mar 20 15:32:31 2024 +0000

    Delete .parquet.crc file after running tests

commit 652bf3a
Author: Adeet Patel <[email protected]>
Date:   Mon Feb 12 12:16:35 2024 -0500

    GEOMESA-3259 FSDS - Add support for GeoParquet

    * Create a BoundsObserver trait, and tweak various classes and methods to use that trait
    * Add an observer to the SimpleFeatureParquetWriter and write records to it, in order to create a bounding box of all the geometries. Add this bounding box to the GeoParquet metadata (which requires the metadata map to be changed to a mutable data structure).
    * Read/write all geometry attributes in binary (a primitive Parquet type) instead of as a pair of x/y doubles (a group Parquet type), using the same converter and attribute writer for all geometry types, while also maintaining backwards compatibility
    * Add support for parsing WKB bytes in the Parquet geometry transformer functions
    * Exclude bounding box from the GeoTools filter and use a spatial index instead

    Co-authored-by: Emilio Lahr-Vivaz <[email protected]>
@adeet1 adeet1 marked this pull request as ready for review June 6, 2024 16:18
@adeet1 adeet1 requested a review from elahrvivaz June 7, 2024 16:10