
Publish as Parquet / Geoparquet #7188

Open
tjwebb opened this issue Apr 30, 2024 · 17 comments

Comments

tjwebb commented Apr 30, 2024

Overture Maps and other prominent projects have decided to publish their spatial datasets in the Parquet (and/or GeoParquet) data formats. This makes it easier to load the data into cloud databases such as Athena and BigQuery, and reduces the manual effort of dealing with zipped GeoJSON files.

Has the team considered publishing the data in parquet format alongside the zip files?

iandees (Member) commented Apr 30, 2024

Hi Travis! Good suggestion. We can investigate generating and posting geoparquet files.

I know that Athena can load data from requester pays S3 buckets, but does BigQuery support import from such buckets?

tjwebb (Author) commented Apr 30, 2024

I believe so; BigQuery Data Transfer Service requires AWS credentials to pull the data, so the account associated with those credentials should be charged when requester pays is enabled for your S3 bucket.

I will try to verify this.

tjwebb (Author) commented May 2, 2024

Well how about this: if you publish the data somewhere in a requester pays bucket, I will test it with BigQuery and let you know if it works :)

iandees (Member) commented May 2, 2024

What if I put it in a Cloudflare R2 bucket? Then it's about the same amount of miserable for everyone but at least it's free transit 🙂

tjwebb (Author) commented May 2, 2024

Whichever is easiest for you. I think the format is the main thing, more so than the location.

tjwebb (Author) commented May 13, 2024

Any ETA on this, @iandees?

iandees (Member) commented May 15, 2024

No ETA yet, no. I got a bit stymied by finding a library to write Parquet files efficiently. Do you have any pointers, preferably in Node?

cc @jwass

tjwebb (Author) commented May 15, 2024

AFAICT, https://www.npmjs.com/package/@dsnp/parquetjs is the go-to Parquet module these days.

jwass (Contributor) commented May 16, 2024

@iandees @tjwebb - I don't have much Node experience here, so I couldn't help there, but I started on a simple Python pyarrow JSON -> Parquet converter that can read files from the batch output zip files and save them out as Parquet.

Have you thought about what the schema and partitioning might look like for an OA Parquet distribution? For example:

country=us/region=ma/city_of_dedham.parquet
...

or

country=us/region=ma/city=dedham/city_of_dedham.parquet <-- filename isn't important
...

I think there should also be some dataset_name field to indicate the source rather than having to get it from the underlying filename.

Also - since data files can only be in leaf directories you'd need special treatment on the countrywide and statewide files.

Something like:

country=no/region=countrywide/data.parquet

This couldn't live in country=no/countrywide.parquet if there are also deeper region= or other partitions.

I'm not sure what the best answer is but happy to brainstorm more if you're going down that route.

iandees (Member) commented May 23, 2024

Figuring out a partitioning scheme seems relatively difficult here. Part of the reason we produce data as a single zipped blob of GeoJSON (and CSV before that) was that our data consumers almost always just wanted a big huge list of all addresses to throw at their geocoder. I don't think I've ever heard from a data consumer that wants to filter the output beyond excluding some data by license type (back when we included share-alike data).

Part of figuring out partitioning means we need to take a guess at how data consumers might query this data. Since you're suggesting partitions by source file name, are you thinking you'd want to only include data from certain sources? Would it be better to partition by license information in some way instead?

Another possible partitioning scheme I could see is by date of last successful retrieval.

I agree that there should be some extra metadata on every address row (data source reference and capture date at least) and that there should be a second dataset of source information (including contact, license information, etc.).

tjwebb (Author) commented May 23, 2024

I'm a bit lost. Why is it a requirement that the data is partitioned?

iandees (Member) commented May 23, 2024

I suppose it's not a requirement, but if we're publishing parquet it's because we expect it to be queried in some way. We would want to partition in order to make those queries more performant.

tjwebb (Author) commented May 23, 2024

It's the database's job to solve this issue. Many databases, including Athena and BigQuery, have a way to cluster/index their external tables loaded from parquet just as they would native tables. Or, they can just import the parquet and turn it into a native table for full control.

Partitioning it the wrong way can just as easily make it less performant and less usable. I don't know why we'd try to make these assumptions. There's no such assumption in the zipfile format.

the reason we produce data as a single zipped blob of GeoJSON (and CSV before that) was that our data consumers almost always just wanted a big huge list of all addresses

Exactly.

jwass (Contributor) commented Jun 9, 2024

Maybe a dumb question but I was comparing to the global-collection.zip file from the "Download options" on the main page which is split into the individual source outputs. Is there a single zipped blob of GeoJSON available somewhere else?

iandees (Member) commented Jun 9, 2024

Is there a single zipped blob of GeoJSON available somewhere else?

Nope, that's the output. When I said "single zipped blob of GeoJSON" I meant the blob containing individual sub-blobs of GeoJSON 😄

iandees (Member) commented Jun 9, 2024

As an update here: I wrote a bit of Python code that converts all the GeoJSON into individual Parquet files and then concatenates them all into one, but I can't confirm the resulting Parquet is valid because the QGIS download for Mac doesn't seem to support Parquet by default.

jwass (Contributor) commented Jun 9, 2024

Yeah, you need a build of QGIS that ships GDAL > 3.5.
https://docs.overturemaps.org/examples/QGIS/ has some more info.

3 participants