scrapers: GitHub metadata from bitcoin/bitcoin repository #92

Closed
kouloumos opened this issue Dec 2, 2024 · 0 comments · Fixed by #93

We need to enable regular updates for GitHub metadata (Issues/PRs and their comments) for the Bitcoin Core repository in the coredev index. Once this is implemented, we can extend support to other repositories, including:

  • bitcoin/bips
  • bitcoin-core/gui
  • bitcoin-core/secp256k1
  • repositories for other Bitcoin open-source software as needed.

Current Status

The coredev index powers the CoreDev bot and focuses on Bitcoin Core-related sources. Current sources include:

However, the index was created from a one-time scrape and has not been updated. While the onboarding guide is static and doesn’t require updates, other sources need regular indexing to remain relevant.

Proposed Approach

To achieve regular updates, we can use 0xB10C's github-metadata-backup. There are two options:

  1. Run our own instance of the tool.
  2. Use the existing backup from bitcoin-data, which updates every hour.

Our scraper already supports GitHub repositories, so Option 2 should be straightforward.
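
To make "regular updates" concrete, here is a minimal sketch of how a scheduled run could decide which issues/PRs need re-indexing. It assumes each backup record carries `number` and `updated_at` fields in the style of the GitHub REST API; the actual layout of the github-metadata-backup files is not confirmed here, and `select_stale` and `last_indexed` are hypothetical names, not part of our scraper.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # GitHub-style timestamps end in "Z"; normalize so older Python versions parse them.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_stale(records, last_indexed):
    """Return records that are new, or updated since they were last indexed.

    records:      iterable of dicts with "number" and "updated_at" (assumed shape)
    last_indexed: dict mapping issue/PR number -> "updated_at" value stored in the index
    """
    stale = []
    for record in records:
        seen = last_indexed.get(record["number"])
        if seen is None or parse_ts(record["updated_at"]) > parse_ts(seen):
            stale.append(record)
    return stale


# Placeholder data for illustration only.
records = [
    {"number": 1, "updated_at": "2024-12-01T10:00:00Z"},
    {"number": 2, "updated_at": "2024-12-02T09:30:00Z"},
    {"number": 3, "updated_at": "2024-12-02T11:00:00Z"},
]
last_indexed = {1: "2024-12-01T10:00:00Z", 2: "2024-12-01T08:00:00Z"}

stale_numbers = [r["number"] for r in select_stale(records, last_indexed)]
```

With the placeholder data, only records 2 (updated) and 3 (never indexed) are selected, so an hourly job would touch just the threads that changed.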

Considerations

  1. Start with smaller datasets:

    • Begin with a subset of the bitcoin/bitcoin repository or smaller repositories like bitcoin/bips or bitcoin-core/secp256k1 to simplify testing and validation.
  2. Data storage format:

    • Ensure replies to issues are linked to the main issue/PR using the issue field (as in the PR Review Club scraper).
    • Alternatively, the thread_url field could be used, but the issue field is preferred for consistency.
  3. Follow the terminology outlined in Improving Terminology Consistency Across Data Infrastructure #89:

    • Source: https://github.com/bitcoin/bitcoin/
    • Resource: Issue/PR
    • Item: Individual comments within an issue/PR
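
The storage format and terminology above could be sketched roughly as follows. This is an illustration under assumptions: the input dicts follow the GitHub REST API field names (`number`, `title`, `body`, `html_url`, `user.login`, `created_at`), while the output field names other than `issue` (`id`, `source`, `type`, `url`, `author`) and the function `build_documents` are hypothetical and may not match the scraper's actual schema.

```python
SOURCE = "https://github.com/bitcoin/bitcoin/"


def build_documents(issue, comments, source=SOURCE):
    """Turn one issue/PR (resource) and its comments (items) into index documents.

    Every document carries the same `issue` number, so replies stay linked
    to the main issue/PR.
    """
    number = issue["number"]
    resource = {
        "id": f"issue-{number}",
        "source": source,
        "type": "issue",
        "issue": number,
        "title": issue["title"],
        "body": issue.get("body") or "",
        "url": issue["html_url"],
        "author": issue["user"]["login"],
        "created_at": issue["created_at"],
    }
    items = [
        {
            "id": f"issue-{number}-comment-{comment['id']}",
            "source": source,
            "type": "comment",
            "issue": number,  # link back to the parent issue/PR
            "title": issue["title"],
            "body": comment.get("body") or "",
            "url": comment["html_url"],
            "author": comment["user"]["login"],
            "created_at": comment["created_at"],
        }
        for comment in comments
    ]
    return [resource] + items


# Placeholder data for illustration only; not real issue content.
example_issue = {
    "number": 42,
    "title": "Example issue title",
    "body": "Example issue body",
    "html_url": "https://example.org/issues/42",
    "user": {"login": "example-author"},
    "created_at": "2024-12-02T09:00:00Z",
}
example_comments = [
    {"id": 1001, "body": "First reply", "html_url": "https://example.org/issues/42#c1001",
     "user": {"login": "example-reviewer"}, "created_at": "2024-12-02T10:00:00Z"},
    {"id": 1002, "body": None, "html_url": "https://example.org/issues/42#c1002",
     "user": {"login": "example-author"}, "created_at": "2024-12-02T11:00:00Z"},
]

docs = build_documents(example_issue, example_comments)
```

Querying the index for a given `issue` value would then return the main post together with all of its comments, which is the same grouping the PR Review Club scraper relies on.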