Semantic Chunking Processor #95

Open
kouloumos opened this issue Dec 11, 2024 · 0 comments

We need to implement a single processor to handle semantic chunking for documents. The processor will:

  • Break documents into chunks based on semantic structure.
  • Store these chunks using a nested schema to model the relationship between the resource and its chunks.

This approach replaces the earlier plan of having two separate processors (semantic chunking and markdown-based chunking).

Requirements

  1. Customizable Chunking Strategy

    • The processor should allow experimentation with different chunking strategies.
    • Ensure the strategy is modular and can be easily modified as needed.
  2. Input Format

    • The input for chunking will be the body field of the document.
    • This field is consistently formatted in markdown across all documents.
  3. Schema for Storing Chunks

    Example Document with Nested Schema

    {
      "id": "delving-bitcoin-1257-11-3754",
      "title": "Understanding Bitcoin",
      "body": "Full markdown content here...",
      "chunks": [
        {
          "id": "delving-bitcoin-1257-11-3754-chunk1",
          "title": "Introduction to Bitcoin",
          "body": "Bitcoin is a decentralized digital currency..."
        },
        {
          "id": "delving-bitcoin-1257-11-3754-chunk2",
          "title": "How Bitcoin Works",
          "body": "Bitcoin operates on a blockchain..."
        },
        {
          "id": "delving-bitcoin-1257-11-3754-chunk3",
          "body": "Additional details on mining..."
        }
      ],
      "chunking_strategy": "v1.0"
    }
  4. Version Control

    • Add a chunking strategy version field (chunking_strategy in the example document) to the document schema to track which chunking method produced the chunks.
  5. Configurable Text Limit

    • Not all documents need to be chunked.
    • Introduce a configurable text length threshold to decide whether a document is chunked.
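
To make the requirements above concrete, here is a minimal sketch of how a single processor could combine a swappable strategy, the nested chunk schema, the version field, and the length threshold. Every name in it (ChunkingProcessor, STRATEGIES, split_on_blank_lines, min_length) is hypothetical and not part of any existing code. The paragraph splitter is only a placeholder for a real semantic strategy, and measuring the threshold in characters is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: a strategy maps a markdown body to a list of chunk texts.
ChunkFn = Callable[[str], List[str]]


def split_on_blank_lines(body: str) -> List[str]:
    """Placeholder strategy: one chunk per paragraph (not actually semantic)."""
    return [part.strip() for part in body.split("\n\n") if part.strip()]


# Registry keyed by strategy version, so strategies can be swapped or added.
STRATEGIES: Dict[str, ChunkFn] = {"v1.0": split_on_blank_lines}


@dataclass
class ChunkingProcessor:
    strategy_version: str = "v1.0"
    min_length: int = 2000  # assumed unit: characters of the body field

    def process(self, document: dict) -> dict:
        body = document.get("body", "")
        # Documents below the threshold are returned unchanged (assumption).
        if len(body) < self.min_length:
            return document
        chunk_fn = STRATEGIES[self.strategy_version]
        result = dict(document)  # shallow copy; the input is not mutated
        result["chunks"] = [
            {"id": f"{document['id']}-chunk{i}", "body": text}
            for i, text in enumerate(chunk_fn(body), start=1)
        ]
        result["chunking_strategy"] = self.strategy_version
        return result
```

In this sketch, trying a new strategy means registering another function under a new version key, and the stored chunking_strategy value then records which one produced a document's chunks.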

Additional Notes

  • The previous plan involving two processors has been abandoned as it added unnecessary complexity for the scraper.

Open Questions

  1. Should we assign titles to chunks? This could be especially valuable for deriving chapter titles in transcript documents.
  2. If semantic chunking leverages embeddings, should we store these embeddings as part of the chunk data for potential future applications?
  3. Can existing markdown headings enhance the semantic chunking process? For example, should they be used as a guide or a starting point for chunk definitions?
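
As one way to explore the third question, the sketch below splits a markdown body at its ATX headings (lines starting with one to six # characters) and uses each heading as a candidate chunk title. It is only a possible starting point, not a decided approach: the function name split_by_headings is hypothetical, setext-style headings are ignored, and a heading with no text under it is dropped along with its title.

```python
import re
from typing import Dict, List

# Matches ATX headings such as "# Title" or "### Title".
HEADING = re.compile(r"^#{1,6}\s+(.+?)\s*$")


def split_by_headings(body: str) -> List[Dict[str, str]]:
    """Split markdown into sections, using headings as candidate titles."""
    sections: List[Dict[str, str]] = []
    title = None
    lines: List[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the section collected so far, skipping empty ones.
        text = "\n".join(lines).strip()
        if text:
            section = {"body": text}
            if title:
                section["title"] = title
            sections.append(section)

    for line in body.splitlines():
        # Track fenced code blocks so "# comment" lines inside are not headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title = match.group(1)
            lines = []
        else:
            lines.append(line)
    flush()
    return sections
```

Text that appears before the first heading becomes a section without a title, which matches the untitled third chunk in the example document. A semantic strategy could then merge or further split these sections.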