-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Manifest driven downloads #117
base: main
Are you sure you want to change the base?
Conversation
site/src/DownloadCatalog.js
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll want to expand this file, as it used to just be a simple readout of all the building parquets- now it actually has some refinement logic in it for the bboxes and visible theme refinements we apply.
Two pieces of information would be super useful in the manifests:
It likely won't make a huge difference to this repo given the bounding box optimizations (and the CF distro of course), but a 50% cut in metadata requests is still nice (that and I reckon a bit of offline behaviour/speculative read-ahead is possible with that). Also assuming these manifests make their way out to general use, broad bounding boxes + non-spatial filters will benefit greatly. |
Agreed @H-Plus-Time - the manifests are supposed to be general purpose info that helps everybody, so this is great info for me to consider. While I have this site in mind as a use case, I want to make sure useful info goes in for others. |
@@ -0,0 +1,4038 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to publish a manifest like this (or at least just a list of file names) alongside each overture release in s3 or azure so that other tools like this could process the list of files without having to know or care what blob storage API you need to use to list files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I realize this is similar to @H-Plus-Time's comment. It seems like this could either go in this repo to benefit this one site, overture releases to benefit all tools that work with overture data, or even into the geoparquet spec to benefit other datasets as well. Seems like it helps bridge the gap from parquet's initial big data roots where downloading an extra 250mb is no big deal to more consumer-facing use cases like this site.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msbarry yes, that's the eventual goal of this sort of work. I saw that you'd also commented on OvertureMaps/data#25 from last year and understand your sentiments.
I've put the beginnings of a manifest file creation tool in my own personal github repo at https://github.com/Bonkles/manifest-generator to try and alleviate this kind of concern. Definitely interested in any other ideas around info the manifest should contain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I think the bare minimum (not overture specific) would be a list of: relative path, file size, and geoparquet metadata (or a subset like bbox, geometry types if the entire metadata is too big). Anything overture-specific beyond that could be nice to limit processing but not strictly necessary.
0703d46
to
999499a
Compare
…he download button.
… that aren't visible.
…filespecs relative to that basepath.
…sing concurrent promise pipelining.
…r the 09-18 release.
…rds to speak of.
6f71cf5
to
7aa57ab
Compare
This PR attempts to improve our download times by reducing file count in two ways:
The first item is accomplished from the manifests generated by @Bonkles/manifest-generator to speed up and improve our download experience.
These two bits of logic allow us to make refinements to the total catalog of files we might need to consider- greatly decreasing the number of files to consult.
So, when the user clicks the download button, we can massively pare down the # of HTTP requests/data loading required to assemble a valid catalog. If the user wants buildings, we'd previously have to send 4 HTTP requests per file, a total of about 800+ requests, before we could start downloading data.
Still to-do:
This PR is in draft form as I'm waiting for the new geoparquet reader code revamp to land, then I can really test.