
Publish as Parquet / Geoparquet #7188

Open
tjwebb opened this issue Apr 30, 2024 · 17 comments

Comments

tjwebb commented Apr 30, 2024

Overture Maps and other prominent projects have decided to publish their spatial datasets in the Parquet (and/or GeoParquet) data formats. This makes it easier to load the data into cloud databases such as Athena and BigQuery, and reduces the manual effort of dealing with zipped GeoJSON files.

Has the team considered publishing the data in parquet format alongside the zip files?

iandees (Member) commented Apr 30, 2024

Hi Travis! Good suggestion. We can investigate generating and posting geoparquet files.

I know that Athena can load data from requester pays S3 buckets, but does BigQuery support import from such buckets?

tjwebb (Author) commented Apr 30, 2024

I believe so; BigQuery Data Transfer Service requires AWS credentials to pull the data, so the account associated with those credentials should be charged when requester pays is enabled for your S3 bucket.

I will try to verify this.

tjwebb (Author) commented May 2, 2024

Well how about this: if you publish the data somewhere in a requester pays bucket, I will test it with BigQuery and let you know if it works :)

iandees (Member) commented May 2, 2024

What if I put it in a Cloudflare R2 bucket? Then it's about the same amount of miserable for everyone but at least it's free transit 🙂

tjwebb (Author) commented May 2, 2024

Whichever is easiest for you. I think the format is the main thing, more so than the location.

tjwebb (Author) commented May 13, 2024

Any ETA on this, @iandees?

iandees (Member) commented May 15, 2024

No ETA yet, no. I got a bit stymied by finding a library to write Parquet files efficiently. Do you have any pointers, preferably in Node?

cc @jwass

tjwebb (Author) commented May 15, 2024

AFAICT, https://www.npmjs.com/package/@dsnp/parquetjs is the go-to Parquet module these days.

jwass (Contributor) commented May 16, 2024

@iandees @tjwebb - I don't have much Node experience here, so I couldn't help there, but I started on a simple Python pyarrow JSON -> Parquet converter that can read files from the batch output zip files and save them out as Parquet.

Have you thought about what the schema and partitioning might look like for an OA Parquet distribution? For example:

country=us/region=ma/city_of_dedham.parquet
...

or

country=us/region=ma/city=dedham/city_of_dedham.parquet <-- filename isn't important
...

I think there should also be some dataset_name field to indicate the source rather than having to get it from the underlying filename.

Also - since data files can only be in leaf directories you'd need special treatment on the countrywide and statewide files.

Something like:

country=no/region=countrywide/data.parquet

This couldn't live in country=no/countrywide.parquet if there are also deeper region= or other partitions.

I'm not sure what the best answer is but happy to brainstorm more if you're going down that route.

iandees (Member) commented May 23, 2024

Figuring out a partitioning scheme seems relatively difficult here. Part of the reason we produce data as a single zipped blob of GeoJSON (and CSV before that) was that our data consumers almost always just wanted a big huge list of all addresses to throw at their geocoder. I don't think I've ever heard from a data consumer that wants to filter the output beyond excluding some data by license type (back when we included share-alike data).

Part of figuring out partitioning means we need to take a guess at how data consumers might query this data. Since you're suggesting partitions by source file name, are you thinking you'd want to only include data from certain sources? Would it be better to partition by license information in some way instead?

Another possible partitioning scheme I could see is by date of last successful retrieval.

I agree that there should be some extra metadata on every address row (data source reference and capture date at least) and that there should be a second dataset of source information (including contact, license information, etc.).

tjwebb (Author) commented May 23, 2024

I'm a bit lost. Why is it a requirement that the data is partitioned?

iandees (Member) commented May 23, 2024

I suppose it's not a requirement, but if we're publishing parquet it's because we expect it to be queried in some way. We would want to partition in order to make those queries more performant.

tjwebb (Author) commented May 23, 2024

It's the database's job to solve this issue. Many databases, including Athena and BigQuery, have a way to cluster/index their external tables loaded from parquet just as they would native tables. Or, they can just import the parquet and turn it into a native table for full control.

Partitioning it the wrong way can just as easily make it less performant and less usable. I don't know why we'd try to make these assumptions. There's no such assumption in the zipfile format.

the reason we produce data as a single zipped blob of GeoJSON (and CSV before that) was that our data consumers almost always just wanted a big huge list of all addresses

Exactly.

jwass (Contributor) commented Jun 9, 2024

Maybe a dumb question but I was comparing to the global-collection.zip file from the "Download options" on the main page which is split into the individual source outputs. Is there a single zipped blob of GeoJSON available somewhere else?

iandees (Member) commented Jun 9, 2024

Is there a single zipped blob of GeoJSON available somewhere else?

Nope, that's the output. When I said "single zipped blob of GeoJSON" I meant the blob containing individual sub-blobs of GeoJSON 😄

iandees (Member) commented Jun 9, 2024

As an update here: I wrote a bit of Python code that converts all the GeoJSON into individual Parquet files and then concatenates them all into one, but I can't confirm the resulting Parquet is valid because the QGIS download for Mac doesn't seem to support Parquet by default.

jwass (Contributor) commented Jun 9, 2024

Yeah, you need a build of QGIS that ships GDAL > 3.5.
https://docs.overturemaps.org/examples/QGIS/ has some more info.

3 participants