Publish as Parquet / Geoparquet #7188
Hi Travis! Good suggestion. We can investigate generating and posting GeoParquet files. I know that Athena can load data from requester-pays S3 buckets, but does BigQuery support import from such buckets?
I believe so; BigQuery Data Transfer Service requires AWS credentials to pull the data, so the account associated with those AWS credentials should be charged when requester-pays is enabled for your S3 bucket. I will try to verify this.
Well, how about this: if you publish the data somewhere in a requester-pays bucket, I will test it with BigQuery and let you know if it works :)
What if I put it in a Cloudflare R2 bucket? Then it's about the same amount of miserable for everyone, but at least it's free transit 🙂
Whichever is easiest for you. I think the format is the main thing, more so than the location.
Any ETA on this @iandees?
No ETA yet, no. I got a bit stymied by finding a library to write Parquet files efficiently. Do you have any pointers, preferably in Node? cc @jwass
AFAICT, https://www.npmjs.com/package/@dsnp/parquetjs looks like the go-to Parquet module these days.
@iandees @tjwebb - I don't have much Node experience here so couldn't help, but I started on a simple Python pyarrow JSON -> Parquet converter that can read files from the batch output zip files and save them out as Parquet. Have you thought about what the schema and partitioning might look like for an OA Parquet distribution? For example, Hive-style partitioning by the source file path.
I think there should also be some extra metadata columns. Also, since data files can only be in leaf directories, you'd need special treatment for the countrywide and statewide files; those couldn't live in the same layout as the more deeply partitioned sources. I'm not sure what the best answer is, but happy to brainstorm more if you're going down that route.
Figuring out a partitioning scheme seems relatively difficult here. Part of the reason we produce data as a single zipped blob of GeoJSON (and CSV before that) was that our data consumers almost always just wanted a big huge list of all addresses to throw at their geocoder. I don't think I've ever heard from a data consumer that wants to filter the output beyond excluding some data by license type (back when we included share-alike data).
Part of figuring out partitioning means we need to take a guess at how data consumers might query this data. Since you're suggesting partitions by source file name, are you thinking you'd want to only include data from certain sources? Would it be better to partition by license information in some way instead? Another possible partitioning scheme I could see is by date of last successful retrieval.
I agree that there should be some extra metadata on every address row (data source reference and capture date at least) and that there should be a second dataset of source information (including contact, license information, etc.).
I'm a bit lost. Why is it a requirement that the data is partitioned?
I suppose it's not a requirement, but if we're publishing Parquet it's because we expect it to be queried in some way. We would want to partition in order to make those queries more performant.
It's the database's job to solve this issue. Many databases, including Athena and BigQuery, have a way to cluster/index their external tables loaded from parquet just as they would native tables. Or, they can just import the parquet and turn it into a native table for full control. Partitioning it the wrong way can just as easily make it less performant and less usable. I don't know why we'd try to make these assumptions. There's no such assumption in the zipfile format.
Exactly.
Maybe a dumb question, but I was comparing to the global-collection.zip file from the "Download options" on the main page, which is split into the individual source outputs. Is there a single zipped blob of GeoJSON available somewhere else?
Nope, that's the output. When I said "single zipped blob of GeoJSON" I meant the blob containing individual sub-blobs of GeoJSON 😄
As an update here: I wrote a bit of Python code that converts all the GeoJSON into individual Parquet files and then concatenates them all into one, but I can't confirm the resulting Parquet is valid because the QGIS download for Mac doesn't seem to support Parquet by default.
Yeah, you need a version of QGIS that ships with GDAL > 3.5.
Overture Maps and other prominent projects have decided to publish their spatial datasets in the Parquet (and/or GeoParquet) data formats. This can make it easier to load the data into cloud databases such as Athena and BigQuery, and decreases the manual effort of dealing with zipped GeoJSON files.
Has the team considered publishing the data in parquet format alongside the zip files?