Show HN: Chonkie – A Fast, Lightweight Text Chunking Library for RAG
(github.com) 195 points by bhavnicksm 4 days ago | 37 comments
I built Chonkie because I was tired of rewriting chunking code for RAG applications. Existing libraries were either too bloated (80MB+) or too basic, with no middle ground.
Core features:
- 21MB default install vs 80-171MB alternatives
- 33x faster token chunking than popular alternatives
- Supports multiple chunking strategies: token, word, sentence, and semantic
- Works with all major tokenizers (transformers, tokenizers, tiktoken)
- Zero external dependencies for basic functionality
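For readers new to the token strategy listed above, here is a minimal sketch of fixed-size token chunking with overlap. It is an illustration only, not Chonkie's actual code or API: the function name `chunk_tokens` and its parameters are hypothetical, and a plain whitespace split stands in for a real tokenizer such as tiktoken.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of at most `chunk_size` tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper,
    not part of any library.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end, to avoid a trailing
        # chunk that only repeats the overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: whitespace splitting as a stand-in for a real tokenizer.
example_tokens = "a b c d e f g".split()
example_chunks = chunk_tokens(example_tokens, chunk_size=3, overlap=1)
```

With `chunk_size=3` and `overlap=1`, each window starts two tokens after the previous one, so neighbouring chunks share one token of context.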
Technical optimizations:
- Uses tiktoken with multi-threading for faster tokenization
- Implements aggressive caching and precomputation
- Running mean pooling for efficient semantic chunking
- Modular dependency system (install only what you need)
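One way to read "running mean pooling" is sketched below: keep an incrementally updated mean of the sentence embeddings in the current chunk, and start a new chunk when the next sentence is not similar enough to that mean. This is an assumption about the general technique, not a description of how Chonkie implements it. The names `semantic_chunks` and `cosine`, the threshold value, and the toy 2-D embeddings are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_chunks(embeddings: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Group sentence indices into chunks using a running mean embedding.

    The mean is updated in O(d) per sentence instead of being recomputed
    over the whole chunk. Hypothetical helper, not a library function.
    """
    chunks: List[List[int]] = []
    current: List[int] = []
    mean: List[float] = []
    for i, vec in enumerate(embeddings):
        # Close the current chunk if this sentence drifts from its mean.
        if current and cosine(mean, vec) < threshold:
            chunks.append(current)
            current, mean = [], []
        if not current:
            mean = list(vec)
        else:
            n = len(current)
            # Incremental mean update: m_new = m + (v - m) / (n + 1)
            mean = [m + (v - m) / (n + 1) for m, v in zip(mean, vec)]
        current.append(i)
    if current:
        chunks.append(current)
    return chunks


# Toy embeddings (assumed values): two sentences per "topic".
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_groups = semantic_chunks(toy_embeddings, threshold=0.5)
```

The incremental update is the point of the sketch: each new sentence costs one pass over the embedding dimensions, no matter how long the chunk has grown.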
Benchmarks and code: https://github.com/bhavnicksm/chonkie
Looking for feedback on the architecture and performance optimizations. What other chunking strategies would be useful for RAG applications?
mattmein 3 days ago
Also check out https://github.com/D-Star-AI/dsRAG/ for a somewhat more involved chunking strategy.