feat(scrapers): Add GitHub metadata scraper for issues and pull requests #93

Merged 3 commits on Dec 23, 2024

kouloumos
Contributor

This PR adds support for scraping all the existing GitHub metadata backups from bitcoin-data, which are updated every hour, for the following repositories: bitcoin/bips, bitcoin/bitcoin, bitcoin-core/secp256k1, and bitcoin-core/gui. Fixes #92.

Additionally, this PR introduces a separate 'elastic' command group that handles index initialization with custom mappings, cleanup operations, and mapping inspection. Index initialization with a custom mapping is needed to model the relationships between documents in the current GitHub metadata schema using nested fields.
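To illustrate the shape of such a command group, here is a minimal sketch using only the standard library's `argparse`. The subcommand names (`init-index`, `cleanup`, `show-mapping`), the option names, and the `scraper` program name are hypothetical; the description does not show the real command names or the CLI framework the project uses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI with an 'elastic' group (all command names are hypothetical)."""
    parser = argparse.ArgumentParser(prog="scraper")
    groups = parser.add_subparsers(dest="group", required=True)

    # The 'elastic' group collects the Elasticsearch-related operations.
    elastic = groups.add_parser("elastic", help="Elasticsearch index operations")
    commands = elastic.add_subparsers(dest="command", required=True)

    # Index initialization, optionally with a custom mapping file.
    init = commands.add_parser("init-index", help="Create an index")
    init.add_argument("--index", required=True)
    init.add_argument("--mapping-file", default=None)

    # Cleanup of an index.
    cleanup = commands.add_parser("cleanup", help="Clean up an index")
    cleanup.add_argument("--index", required=True)

    # Mapping inspection.
    show = commands.add_parser("show-mapping", help="Print the index mapping")
    show.add_argument("--index", required=True)

    return parser


# Parse an example invocation instead of sys.argv so the sketch runs standalone.
args = build_parser().parse_args(
    ["elastic", "init-index", "--index", "github", "--mapping-file", "mapping.json"]
)
print(args.group, args.command, args.index, args.mapping_file)
```

Keeping the index operations in their own group separates one-off administrative tasks from the scraping commands themselves.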

An experimental addition to the Bitcoin Search API is currently in progress in order to better showcase and validate this scraper.

The Schema

The schema is designed to store three main types of content:

  1. Issue/PR metadata
  2. Reviews (for PRs)
  3. Comments (both general comments and review-related discussions)

The schema is defined in scraper/models/github_metadata.py.
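As a rough picture of how these three content types relate, the sketch below models them with standard-library dataclasses. The PR itself uses Pydantic models; the class names here are hypothetical, the fields are a subset of those in the example document in this description, and the types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    """A general comment or a comment inside a review thread."""
    id: int
    author: str
    created_at: str
    updated_at: str
    body: str
    # Set only for comments that belong to a review (assumed optional).
    pull_request_review_id: Optional[int] = None


@dataclass
class Review:
    """A review submitted on a pull request."""
    id: int
    author: str
    commit_id: str
    submitted_at: str
    body: str


@dataclass
class ReviewThread:
    """A discussion anchored to a file and line of the diff."""
    pull_request_review_id: int
    path: str
    line: Optional[int] = None
    comments: List[Comment] = field(default_factory=list)


@dataclass
class GitHubMetadataDocument:
    """Issue/PR metadata with nested reviews, threads, and comments."""
    id: str
    title: str
    url: str
    type: str
    number: str
    state: str
    authors: List[str] = field(default_factory=list)
    labels: List[str] = field(default_factory=list)
    reviews: List[Review] = field(default_factory=list)
    review_threads: List[ReviewThread] = field(default_factory=list)
    comments: List[Comment] = field(default_factory=list)


doc = GitHubMetadataDocument(
    id="github-metadata-bitcoin-core-12345",
    title="Add new validation rules for taproot transactions",
    url="https://github.com/bitcoin/bitcoin/pull/12345",
    type="pull",
    number="12345",
    state="merged",
    authors=["alice-developer"],
    review_threads=[
        ReviewThread(
            pull_request_review_id=987654321,
            path="src/consensus/validation.cpp",
            line=158,
            comments=[
                Comment(
                    id=11111111,
                    author="bob-reviewer",
                    created_at="2024-03-18T11:25:00Z",
                    updated_at="2024-03-18T11:25:00Z",
                    body="Consider adding a comment here.",
                    pull_request_review_id=987654321,
                )
            ],
        )
    ],
)
print(doc.review_threads[0].comments[0].author)
```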

This schema allows for efficient querying of:

  • Issues and PRs by type, state, labels, or authors
  • All comments on a specific file or line
  • Complete review threads with their full discussion history
  • Reviews by specific users or on specific commits
  • General comments
  • ...
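Two of the queries listed above might look like the following sketch, which only builds Elasticsearch query bodies as Python dictionaries. It assumes that `review_threads` is mapped as a nested field and that `type`, `state`, `labels`, and `review_threads.path` are keyword fields; the function names are hypothetical and the description does not show the actual mapping.

```python
import json


def comments_on_file(path: str) -> dict:
    """Query body for review threads attached to a given file path."""
    return {
        "query": {
            "nested": {
                "path": "review_threads",
                "query": {"term": {"review_threads.path": path}},
                # inner_hits returns the matching threads, not just the parent document.
                "inner_hits": {},
            }
        }
    }


def pulls_by_state_and_label(state: str, label: str) -> dict:
    """Query body for pull requests filtered by state and label."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"type": "pull"}},
                    {"term": {"state": state}},
                    {"term": {"labels": label}},
                ]
            }
        }
    }


file_query = comments_on_file("src/consensus/validation.cpp")
pulls_query = pulls_by_state_and_label("merged", "taproot")
print(json.dumps(file_query, indent=2))
```

These bodies would be passed to a search request against the index; no client call is shown because the index name and client setup are not part of this description.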

The nested structure maintains the relationships between reviews, threads, and comments while allowing for efficient querying and aggregation at each level.
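A mapping with this kind of nesting could be sketched as below. This is an assumed, abbreviated mapping written for illustration, not the mapping file added by this PR; the field types are guesses based on the example document, and `nested_paths` is a hypothetical helper that lists which fields are declared as nested.

```python
from typing import List

# Properties shared by general comments and review-thread comments (assumed types).
COMMENT_PROPERTIES = {
    "id": {"type": "long"},
    "author": {"type": "keyword"},
    "created_at": {"type": "date"},
    "updated_at": {"type": "date"},
    "body": {"type": "text"},
}

MAPPING = {
    "mappings": {
        "properties": {
            "type": {"type": "keyword"},
            "state": {"type": "keyword"},
            "labels": {"type": "keyword"},
            "reviews": {
                "type": "nested",
                "properties": {
                    "id": {"type": "long"},
                    "author": {"type": "keyword"},
                    "commit_id": {"type": "keyword"},
                    "submitted_at": {"type": "date"},
                    "body": {"type": "text"},
                },
            },
            "review_threads": {
                "type": "nested",
                "properties": {
                    "pull_request_review_id": {"type": "long"},
                    "path": {"type": "keyword"},
                    "line": {"type": "integer"},
                    # Comments are nested inside each thread.
                    "comments": {"type": "nested", "properties": COMMENT_PROPERTIES},
                },
            },
            "comments": {"type": "nested", "properties": COMMENT_PROPERTIES},
        }
    }
}


def nested_paths(properties: dict, prefix: str = "") -> List[str]:
    """Return the dotted paths of all fields declared with type 'nested'."""
    paths = []
    for name, spec in properties.items():
        full_name = prefix + name
        if spec.get("type") == "nested":
            paths.append(full_name)
        # Recurse into sub-properties to find deeper nesting levels.
        paths.extend(nested_paths(spec.get("properties", {}), full_name + "."))
    return paths


paths = nested_paths(MAPPING["mappings"]["properties"])
print(paths)
```

Declaring a field as `nested` makes Elasticsearch index each array element as its own hidden document, so a condition on a thread's `path` and a condition on its `line` match only when both hold for the same thread.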

Example document

{
  "id": "github-metadata-bitcoin-core-12345",
  "title": "Add new validation rules for taproot transactions",
  "body": "This PR implements additional validation rules for taproot transactions as specified in BIP 341. It adds checks for:\n\n- Script path spending conditions\n- Leaf version requirements\n- Signature verification rules\n\nThe changes have been tested extensively with the test vectors from BIP 341.",
  "body_type": "markdown",
  "domain": "https://github.com/bitcoin/bitcoin",
  "indexed_at": "2024-12-10T12:00:00.000Z",
  "created_at": "2024-03-15T14:30:00Z",
  "url": "https://github.com/bitcoin/bitcoin/pull/12345",
  "type": "pull",
  "authors": ["alice-developer"],
  "number": "12345",
  "updated_at": "2024-03-20T09:15:00Z",
  "closed_at": "2024-03-25T16:45:00Z",
  "merged_at": "2024-03-25T16:45:00Z",
  "state": "merged",
  "labels": ["consensus", "needs testing", "taproot"],
  "head_sha": "abc123def456789ghijklmnop0123456789qrstuv",
  "draft": false,
  "reviews": [
    {
      "id": 987654321,
      "author": "bob-reviewer",
      "commit_id": "abc123def456789ghijklmnop0123456789qrstuv",
      "submitted_at": "2024-03-18T11:20:00Z",
      "body": "The implementation looks correct and follows the BIP specification. Tested with the provided test vectors."
    }
  ],
  "review_threads": [
    {
      "pull_request_review_id": 987654321,
      "path": "src/consensus/validation.cpp",
      "diff_hunk": "@@ -150,6 +150,15 @@ bool CheckTaprootSignature(...)",
      "commit_id": "abc123def456789ghijklmnop0123456789qrstuv",
      "original_commit_id": "abc123def456789ghijklmnop0123456789qrstuv",
      "position": 8,
      "original_position": 8,
      "line": 158,
      "original_line": 158,
      "comments": [
        {
          "id": 11111111,
          "author": "bob-reviewer",
          "created_at": "2024-03-18T11:25:00Z",
          "updated_at": "2024-03-18T11:25:00Z",
          "body": "Consider adding a comment explaining the merkle path verification logic here.",
          "pull_request_review_id": 987654321
        }
      ]
    }
  ],
  "comments": [
    {
      "id": 22222222,
      "author": "charlie-commenter",
      "created_at": "2024-03-16T10:00:00Z",
      "updated_at": "2024-03-16T10:00:00Z",
      "body": "Have you considered adding test cases for edge cases with empty script paths?"
    }
  ]
}
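In the example document, review threads point back to their review through `pull_request_review_id`. A minimal sketch of following that link in plain Python, using a trimmed copy of the example document (the helper name `threads_by_review` is hypothetical):

```python
from collections import defaultdict


def threads_by_review(doc: dict) -> dict:
    """Group review threads under the review they belong to."""
    reviews = {review["id"]: review for review in doc.get("reviews", [])}
    grouped = defaultdict(list)
    for thread in doc.get("review_threads", []):
        grouped[thread["pull_request_review_id"]].append(thread)
    # A thread whose review is missing from the document gets review=None.
    return {
        review_id: {"review": reviews.get(review_id), "threads": threads}
        for review_id, threads in grouped.items()
    }


# Trimmed copy of the example document, keeping only the fields used here.
doc = {
    "number": "12345",
    "reviews": [{"id": 987654321, "author": "bob-reviewer"}],
    "review_threads": [
        {
            "pull_request_review_id": 987654321,
            "path": "src/consensus/validation.cpp",
            "line": 158,
            "comments": [{"id": 11111111, "author": "bob-reviewer"}],
        }
    ],
}

result = threads_by_review(doc)
for review_id, entry in result.items():
    for thread in entry["threads"]:
        print(review_id, entry["review"]["author"], thread["path"], thread["line"])
```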

Move Elasticsearch operations into a separate 'elastic' command group that handles index
initialization with custom mappings, cleanup operations, and mapping inspection.

The new structure supports initializing indices with custom mapping files.

Implement scraper for processing GitHub issue and PR metadata from JSON
backups. Add Pydantic models for structured metadata including reviews,
comments and review threads. Include Elasticsearch mapping for the
metadata schema and register new data sources in sources.yaml.

Use the `number` field instead of `issue` to refer to the
number of the related pull request. This follows the same
schema used for GitHub metadata.
@kouloumos kouloumos merged commit 22f5b89 into bitcoinsearch:master Dec 23, 2024
@kouloumos kouloumos deleted the scrape-github-metadata branch December 23, 2024 15:30
Development

Successfully merging this pull request may close these issues.

scrapers: GitHub metadata from bitcoin/bitcoin repository