Chunx

Chunx is an Elixir library for splitting text into meaningful chunks using various strategies. It's particularly useful for processing large texts for LLMs, semantic search, and other NLP tasks.

Credit

This library is based on chonkie-ai/chonkie.

Features

  • Multiple chunking strategies:
    • Token-based chunking
    • Word-based chunking
    • Sentence-based chunking
    • Semantic chunking with embeddings
  • Configurable options for each strategy
  • Support for overlapping chunks
  • Token count tracking
  • Embedding support

Installation

Add chunx to your list of dependencies in mix.exs:

def deps do
  [
    {:chunx, github: "preciz/chunx"}
  ]
end
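
Then fetch the dependency:

mix deps.get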

Usage

Token-based Chunking

{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
{:ok, chunks} = Chunx.Chunker.Token.chunk("Your text here", tokenizer, chunk_size: 512)
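
On success, each chunker returns {:ok, chunks}. As a quick sketch of working with the result (the :text and :token_count field names below are assumptions for illustration, not a documented contract; see the chunker module docs for the actual struct):

# Sketch: chunk field names are assumed for illustration
Enum.each(chunks, fn chunk ->
  IO.puts("#{chunk.token_count} tokens: #{chunk.text}")
end)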

Word-based Chunking

{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
{:ok, chunks} = Chunx.Chunker.Word.chunk("Your text here", tokenizer, chunk_size: 512)

Sentence-based Chunking

{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
{:ok, chunks} = Chunx.Chunker.Sentence.chunk("Your text here", tokenizer)

Semantic Chunking

{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")

# The embedding function must return a list of Nx.Tensor.t()
embedding_fn = fn texts ->
  # Your embedding function here
end

{:ok, chunks} = Chunx.Chunker.Semantic.chunk("Your text here", tokenizer, embedding_fn)
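
In practice the embedding function usually wraps a model. The sketch below builds one with Bumblebee and a sentence-transformers model; Bumblebee is not a dependency of Chunx, and the model choice here is only an example:

# Sketch only: requires {:bumblebee, "~> 0.5"} and an Nx backend such as EXLA
{:ok, model_info} = Bumblebee.load_model({:hf, "sentence-transformers/all-MiniLM-L6-v2"})
{:ok, embedding_tokenizer} = Bumblebee.load_tokenizer({:hf, "sentence-transformers/all-MiniLM-L6-v2"})

serving = Bumblebee.Text.text_embedding(model_info, embedding_tokenizer)

embedding_fn = fn texts ->
  # Running the serving on a list of texts yields one %{embedding: tensor} per text
  serving
  |> Nx.Serving.run(texts)
  |> Enum.map(& &1.embedding)
end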

Configuration

Each chunking strategy accepts options to customize its behavior:

  • chunk_size: Maximum number of tokens per chunk
  • chunk_overlap: Number of tokens or percentage to overlap between chunks
  • min_sentences: Minimum number of sentences per chunk (for sentence-based)
  • threshold: Similarity threshold for semantic chunking
  • And more...

See the documentation for each chunker module for detailed configuration options.
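
For example, a sentence-based chunker can combine several of these options (the values below are arbitrary, and option support varies by chunker):

{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")

{:ok, chunks} =
  Chunx.Chunker.Sentence.chunk("Your text here", tokenizer,
    chunk_size: 256,
    chunk_overlap: 32,
    min_sentences: 2
  )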

Testing

# Run the test suite
mix test

License

MIT License
