feat(scrapers): Add GitHub metadata scraper for issues and pull requests #93
This PR adds support for scraping all the existing GitHub metadata backups from bitcoin-data, which are updated every hour, for the following repositories: bitcoin/bips, bitcoin/bitcoin, bitcoin-core/secp256k1, and bitcoin-core/gui. Fixes #92.
Additionally, this PR introduces a separate 'elastic' command group that handles index initialization with custom mappings, cleanup operations, and mapping inspection. Index initialization with a custom mapping is needed to model the relationships between documents in the current GitHub metadata schema using nesting.
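To illustrate what a custom mapping with nesting could look like, here is a minimal sketch of an index body that uses Elasticsearch's `nested` field type. The function name and every field name (`reviews`, `threads`, `comments`, `author`, and so on) are assumptions made for illustration; the actual schema is the one defined in scraper/models/github_metadata.py and may differ.

```python
# Hypothetical field names throughout; the real schema lives in
# scraper/models/github_metadata.py and may differ from this sketch.

def build_index_body() -> dict:
    """Return an index body whose mapping nests comments in threads in reviews."""
    comment_props = {
        "author": {"type": "keyword"},
        "body": {"type": "text"},
        "created_at": {"type": "date"},
    }
    thread_props = {
        "path": {"type": "keyword"},
        # "nested" keeps each comment as its own hidden sub-document,
        # so its fields are not flattened together with other comments.
        "comments": {"type": "nested", "properties": comment_props},
    }
    review_props = {
        "author": {"type": "keyword"},
        "state": {"type": "keyword"},
        "threads": {"type": "nested", "properties": thread_props},
    }
    return {
        "mappings": {
            "properties": {
                "repository": {"type": "keyword"},
                "number": {"type": "integer"},
                "title": {"type": "text"},
                "reviews": {"type": "nested", "properties": review_props},
            }
        }
    }
```

A body of this shape is what an index-initialization command would send to the Elasticsearch create-index API. Without the `nested` type, arrays of objects are flattened, and the link between a given comment's author and its body is lost.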
An experimental addition to the Bitcoin Search API is currently in progress in order to better showcase and validate this scraper.
The Schema
The schema is designed to store three main types of content:
and can be found at
scraper/models/github_metadata.py
This schema allows for efficient querying of:
The nested structure maintains the relationships between reviews, threads, and comments while allowing for efficient querying and aggregation at each level.
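As a rough sketch of querying at the innermost level, the function below builds an Elasticsearch `nested` query that matches comments by author and text. It assumes a hypothetical layout in which `reviews`, `reviews.threads`, and `reviews.threads.comments` are all mapped as `nested`; these paths, the field names, and the function name are illustrative and not taken from the PR.

```python
# Assumed paths: reviews -> reviews.threads -> reviews.threads.comments,
# each mapped as "nested". Field names are hypothetical.

def comments_by_author_query(author: str, text: str) -> dict:
    """Build a query for documents with a comment by `author` matching `text`."""
    # Both conditions must hold on the SAME comment, which is what the
    # nested query guarantees (unlike a query over flattened object arrays).
    same_comment = {
        "bool": {
            "must": [
                {"term": {"reviews.threads.comments.author": author}},
                {"match": {"reviews.threads.comments.body": text}},
            ]
        }
    }
    # One nested clause per nesting level, from outermost to innermost.
    return {
        "query": {
            "nested": {
                "path": "reviews",
                "query": {
                    "nested": {
                        "path": "reviews.threads",
                        "query": {
                            "nested": {
                                "path": "reviews.threads.comments",
                                "query": same_comment,
                            }
                        },
                    }
                },
            }
        }
    }
```

The returned dictionary is a plain Query DSL body, so it can be serialized to JSON and sent to a search endpoint as-is.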
Example document