Semantic Chunking Processor #95

kouloumos · 2024-12-11T14:09:59Z

We need to implement a single processor to handle semantic chunking for documents. The processor will:

- Break documents into chunks based on semantic structure.
- Store these chunks using a nested schema to model the relationship between the resource and its chunks.

This approach replaces the earlier plan of having two separate processors (semantic chunking and markdown-based chunking).

Requirements

Customizable Chunking Strategy
- The processor should allow experimentation with different chunking strategies.
- Ensure the strategy is modular and can be easily modified as needed.

Input Format
- The input for chunking will be the body field of the document.
- This field is consistently formatted in markdown across all documents.
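
One way to keep strategies swappable is a small registry of callables that each take the markdown `body` and return chunks. The sketch below is only an illustration under assumptions: `Chunk`, `STRATEGIES`, `register_strategy`, `chunk_document`, and the `paragraphs-v0` placeholder strategy are hypothetical names, not part of the existing codebase, and the blank-line splitter merely stands in for a real semantic strategy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Chunk:
    body: str
    title: Optional[str] = None  # optional, e.g. a chapter title for transcripts


# A strategy is any callable that turns markdown text into a list of chunks.
ChunkingStrategy = Callable[[str], List[Chunk]]

# Hypothetical registry keyed by strategy name/version.
STRATEGIES: Dict[str, ChunkingStrategy] = {}


def register_strategy(version: str):
    """Decorator that adds a strategy to the registry under `version`."""
    def decorator(func: ChunkingStrategy) -> ChunkingStrategy:
        STRATEGIES[version] = func
        return func
    return decorator


@register_strategy("paragraphs-v0")
def split_on_blank_lines(markdown: str) -> List[Chunk]:
    """Placeholder strategy: one chunk per blank-line-separated block."""
    blocks = [block.strip() for block in markdown.split("\n\n")]
    return [Chunk(body=block) for block in blocks if block]


def chunk_document(document: dict, version: str) -> List[Chunk]:
    """Chunk the document's markdown `body` with the selected strategy."""
    return STRATEGIES[version](document["body"])
```

Trying a new strategy then only means registering another function and passing a different `version`; the processor itself does not need to change.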

Schema for Storing Chunks

- Use a nested schema to represent the relationship between a resource and its chunks. Refer to feat(scrapers): Add GitHub metadata scraper for issues and pull requests #93 for a similar implementation of nested schemas.
- Explore assigning a title to each chunk where possible. This is particularly useful for transcripts, where the title can serve as a chapter title.

Example Document with Nested Schema

{
  "id": "delving-bitcoin-1257-11-3754",
  "title": "Understanding Bitcoin",
  "body": "Full markdown content here...",
  "chunks": [
    {
      "id": "delving-bitcoin-1257-11-3754-chunk1",
      "title": "Introduction to Bitcoin",
      "body": "Bitcoin is a decentralized digital currency..."
    },
    {
      "id": "delving-bitcoin-1257-11-3754-chunk2",
      "title": "How Bitcoin Works",
      "body": "Bitcoin operates on a blockchain..."
    },
    {
      "id": "delving-bitcoin-1257-11-3754-chunk3",
      "body": "Additional details on mining..."
    }
  ],
  "chunking_strategy": "v1.0"
}
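
As a rough sketch of how a processor might assemble this shape, the helper below takes a document and a list of `(title, body)` pairs and produces the nested structure. The function name `build_chunked_document` and the `(title, body)` input format are assumptions for illustration; the `-chunk<n>` id suffix and the optional `title` follow the example above.

```python
from typing import List, Optional, Tuple


def build_chunked_document(
    document: dict,
    chunks: List[Tuple[Optional[str], str]],
    strategy_version: str,
) -> dict:
    """Return a copy of `document` with nested chunks and the strategy version."""
    nested = []
    for index, (title, body) in enumerate(chunks, start=1):
        # Chunk ids are derived from the parent id, as in the example document.
        chunk = {"id": f"{document['id']}-chunk{index}"}
        if title:
            # Title is optional; untitled chunks simply omit the key.
            chunk["title"] = title
        chunk["body"] = body
        nested.append(chunk)
    return {**document, "chunks": nested, "chunking_strategy": strategy_version}


doc = {
    "id": "delving-bitcoin-1257-11-3754",
    "title": "Understanding Bitcoin",
    "body": "Full markdown content here...",
}
result = build_chunked_document(
    doc,
    [
        ("Introduction to Bitcoin", "Bitcoin is a decentralized digital currency..."),
        (None, "Additional details on mining..."),
    ],
    "v1.0",
)
```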

Version Control
- Add a chunking strategy version field to the document schema to track the chunking method used.

Configurable Text Limit
- Not all documents need to be chunked.
- Introduce a configurable text length threshold to decide whether a document is chunked.
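
The threshold and version field could interact roughly as follows: documents shorter than the limit pass through untouched, and chunked documents are stamped with the strategy version. This is a sketch under assumptions: `DEFAULT_MIN_CHUNK_LENGTH`, its value of 2000 characters, `should_chunk`, and `apply_chunking` are hypothetical, and measuring length in characters (rather than tokens or words) is one possible choice that the issue leaves open.

```python
from typing import Callable, List

# Assumed default; the actual threshold is meant to be configurable.
DEFAULT_MIN_CHUNK_LENGTH = 2000


def should_chunk(document: dict, min_length: int = DEFAULT_MIN_CHUNK_LENGTH) -> bool:
    """Decide whether the document body is long enough to be chunked."""
    return len(document.get("body") or "") >= min_length


def apply_chunking(
    document: dict,
    chunk_fn: Callable[[str], List[str]],
    strategy_version: str,
    min_length: int = DEFAULT_MIN_CHUNK_LENGTH,
) -> dict:
    """Chunk the document if it exceeds the threshold, else return it unchanged."""
    if not should_chunk(document, min_length):
        return document
    bodies = chunk_fn(document["body"])
    chunks = [
        {"id": f"{document['id']}-chunk{i}", "body": body}
        for i, body in enumerate(bodies, start=1)
    ]
    # Record which strategy produced the chunks so they can be regenerated later.
    return {**document, "chunks": chunks, "chunking_strategy": strategy_version}
```

Storing the version alongside the chunks makes it possible to find and re-process documents that were chunked with an older strategy.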

Additional Notes

The previous plan involving two processors has been abandoned as it added unnecessary complexity for the scraper.

Open Questions

- Should we assign titles to chunks? This could be especially valuable for deriving chapter titles in transcript documents.
- If semantic chunking leverages embeddings, should we store these embeddings as part of the chunk data for potential future applications?
- Can existing markdown headings enhance the semantic chunking process? For example, should they be used as a guide or a starting point for chunk definitions?
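
To make the headings question concrete, here is a minimal sketch of using ATX-style markdown headings (`#` through `######`) as a starting point: each heading opens a new candidate chunk and its text becomes the chunk title. `split_on_headings` is a hypothetical helper, not a proposed final strategy; it ignores Setext headings, drops headings that have no body text, and a semantic pass could still merge or further split the resulting sections.

```python
import re
from typing import List, Optional, Tuple

# Matches ATX headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^#{1,6}\s+(.+?)\s*$")


def split_on_headings(markdown: str) -> List[Tuple[Optional[str], str]]:
    """Split markdown into (title, body) sections at each heading."""
    sections: List[Tuple[Optional[str], str]] = []
    title: Optional[str] = None
    buffer: List[str] = []
    in_fence = False
    for line in markdown.splitlines():
        # Track fenced code blocks so "# comment" lines are not read as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            body = "\n".join(buffer).strip()
            if body:
                sections.append((title, body))
            title, buffer = match.group(1), []
        else:
            buffer.append(line)
    body = "\n".join(buffer).strip()
    if body:
        sections.append((title, body))
    return sections
```

Text before the first heading is returned with a title of `None`, which maps naturally onto a chunk without a `title` field.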
