Explore assigning a title to each chunk where possible. This is particularly useful for transcripts, where the title can serve as a chapter title.
Example Document with Nested Schema
{
  "id": "delving-bitcoin-1257-11-3754",
  "title": "Understanding Bitcoin",
  "body": "Full markdown content here...",
  "chunks": [
    {
      "id": "delving-bitcoin-1257-11-3754-chunk1",
      "title": "Introduction to Bitcoin",
      "body": "Bitcoin is a decentralized digital currency..."
    },
    {
      "id": "delving-bitcoin-1257-11-3754-chunk2",
      "title": "How Bitcoin Works",
      "body": "Bitcoin operates on a blockchain..."
    },
    {
      "id": "delving-bitcoin-1257-11-3754-chunk3",
      "body": "Additional details on mining..."
    }
  ],
  "chunking_strategy": "v1.0"
}
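As a rough illustration of how a processor might assemble this nested shape, here is a minimal Python sketch. The function name build_document and its parameters are hypothetical and not part of the issue. The "-chunkN" id suffix and the optional title are inferred from the example above rather than from a stated rule.

```python
from typing import List, Optional, Tuple


def build_document(
    doc_id: str,
    title: str,
    body: str,
    chunks: List[Tuple[Optional[str], str]],
    strategy_version: str = "v1.0",
) -> dict:
    """Nest (title, body) chunk pairs inside a document dict.

    Hypothetical helper: mirrors the example schema, nothing more.
    """
    nested = []
    for index, (chunk_title, chunk_body) in enumerate(chunks, start=1):
        # Assumed id convention, taken from the example: "<doc id>-chunk<N>".
        chunk = {"id": f"{doc_id}-chunk{index}"}
        if chunk_title is not None:
            # Titles are optional; the third chunk in the example has none.
            chunk["title"] = chunk_title
        chunk["body"] = chunk_body
        nested.append(chunk)
    return {
        "id": doc_id,
        "title": title,
        "body": body,
        "chunks": nested,
        "chunking_strategy": strategy_version,
    }
```

Omitting the title key for untitled chunks, rather than storing null, simply follows the example; either choice would work as long as consumers agree on it.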
Version Control
Add a chunking strategy version field to the document schema to track the chunking method used.
Configurable Text Limit
Not all documents need to be chunked.
Introduce a configurable text length threshold to decide whether a document is chunked.
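One possible shape for this check, sketched in Python. The names DEFAULT_MIN_CHARS, should_chunk, and process are hypothetical, and measuring length in characters is an assumption: the issue does not say whether the limit counts characters, words, or tokens, nor what the default should be.

```python
from typing import Callable, List

# Hypothetical default; the issue specifies neither a value nor a unit.
DEFAULT_MIN_CHARS = 2000


def should_chunk(body: str, min_chars: int = DEFAULT_MIN_CHARS) -> bool:
    """Return True only when the body is longer than the threshold."""
    return len(body) > min_chars


def process(
    document: dict,
    chunker: Callable[[str], List[dict]],
    min_chars: int = DEFAULT_MIN_CHARS,
) -> dict:
    """Attach chunks to long documents; return short ones unchanged."""
    result = dict(document)  # shallow copy so the input is not mutated
    if should_chunk(document.get("body", ""), min_chars):
        result["chunks"] = chunker(document["body"])
    return result
```

In this sketch a short document carries no chunks key at all. Storing an empty list instead would be an equally reasonable choice.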
Additional Notes
The previous plan involving two processors has been abandoned as it added unnecessary complexity for the scraper.
Open Questions
Should we assign titles to chunks? This could be especially valuable for deriving chapter titles in transcript documents.
If semantic chunking leverages embeddings, should we store these embeddings as part of the chunk data for potential future applications?
Can existing markdown headings enhance the semantic chunking process? For example, should they be used as a guide or a starting point for chunk definitions?
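To make the last question concrete, here is a minimal Python sketch of using markdown headings as starting boundaries. This is not the semantic chunker itself: split_on_headings is a hypothetical helper, it only recognizes ATX-style headings (lines starting with one to six # characters and a space), and it does not skip heading-like lines inside fenced code blocks.

```python
import re
from typing import List, Optional, Tuple

# ATX heading: one to six '#', at least one space, then the heading text.
HEADING = re.compile(r"^#{1,6}\s+(.+?)\s*$")


def split_on_headings(markdown: str) -> List[Tuple[Optional[str], str]]:
    """Split markdown into (title, body) sections at each heading.

    Text before the first heading becomes a section with title None.
    """
    sections: List[Tuple[Optional[str], str]] = []
    title: Optional[str] = None
    lines: List[str] = []

    def flush() -> None:
        # Skip the empty untitled section that precedes a leading heading.
        if title is not None or any(line.strip() for line in lines):
            sections.append((title, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title = match.group(1)
            lines = []
        else:
            lines.append(line)
    flush()
    return sections
```

The resulting sections could serve as candidate chunks, with each heading doubling as the chunk title, before a semantic pass merges or splits them further.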
We need to implement a single processor to handle semantic chunking for documents. The processor will:
This approach replaces the earlier plan of having two separate processors (semantic chunking and markdown-based chunking).
Requirements
Customizable Chunking Strategy
Input Format
body field of the document.
Schema for Storing Chunks