RAG chunker that respects document structure.
pip install chunkweaverExtras:
pip install chunkweaver[langchain] # LangChain TextSplitter integration
pip install chunkweaver[llamaindex] # LlamaIndex NodeParser integration
pip install chunkweaver[cli] # CLI (included in base install)from chunkweaver import Chunker
# Minimal — just better defaults
chunker = Chunker(target_size=1024)
chunks = chunker.chunk(text)
# With structure-aware boundaries
from chunkweaver.presets import LEGAL_EU
chunker = Chunker(
target_size=1024,
overlap=2,
overlap_unit="sentence",
boundaries=LEGAL_EU,
)
chunks = chunker.chunk(text)
# With metadata
chunks = chunker.chunk_with_metadata(text)
for c in chunks:
print(c.text) # full chunk content
print(c.start, c.end) # character offsets in original
print(c.boundary_type) # "section" | "paragraph" | "sentence" | "word" | "keep_together"
print(c.overlap_text) # the overlap prefix (if any)
print(c.content_text) # text without overlap (for dedup) PDF / DOCX / HTML Your vector DB
│ ▲
▼ │
┌───────────┐ ┌──────────────┐ │
│ Extractor │───▶│ chunkweaver │────┘
│ │ │ │
│ unstructured │ boundaries │ Embed + upsert
│ marker-pdf │ detectors │ into Pinecone,
│ docling │ presets │ Qdrant, Weaviate,
│ pdfminer │ overlap │ ChromaDB, etc.
└───────────┘ └──────────────┘
Your extractor turns files into text. Your vector DB stores embeddings. chunkweaver sits in the middle — splitting that text at structural boundaries so each chunk is a coherent unit of meaning, not an arbitrary slice of characters.
Standard chunkers, including LangChain's RecursiveCharacterTextSplitter,
don't know that "Article 17" starts a new legal section, or that a table's
header row belongs with its data. The result: chunks that straddle topic
boundaries, producing blurry embeddings and incomplete retrievals.
The fix is cheap: surface markers (headings, article numbers, table rules) are reliable proxies for where topics change. Detecting them costs O(n) character comparisons — orders of magnitude less than computing semantic boundaries from embeddings.
Our LLM-as-judge benchmark on 11 documents across four domains and 58 queries:
| Baseline | CW wins | Baseline wins | p-value |
|---|---|---|---|
| Naive 600-char | 15 | 4 | 0.019 |
| LangChain RCTS | 11 | 4 | 0.119 |
See benchmarks/llm_judge/ for full results, methodology, and reproduction steps.
- Zero dependencies — stdlib only, no LangChain/LlamaIndex tax
- Regex boundaries — you tell the chunker where sections start (
^Article \d+,^##,^Item 1.) - Hierarchical levels — CHAPTER > Section > Article > clause; split only as deep as needed
- Heuristic detectors —
HeadingDetector,TableDetectordiscover structure from text patterns - Annotation ingestion — accept pre-computed structure from any extractor
- Semantic overlap — sentences, not characters
- Full metadata — offsets, boundary types, hierarchy levels, overlap tracking
- Integrations — LangChain and LlamaIndex drop-ins
Built-in boundary patterns for common document types:
from chunkweaver.presets import (
LEGAL_EU, LEGAL_US, RFC, MARKDOWN,
CHAT, CLINICAL, FINANCIAL, FINANCIAL_TABLE,
SEC_10K, FDA_LABEL, PLAIN,
)| Preset | Domain | Detects |
|---|---|---|
LEGAL_EU |
EU legislation | Article N, CHAPTER, SECTION, (1) recitals |
LEGAL_US |
US law / contracts | § N, Section N, WHEREAS, 1.1 clauses |
RFC |
IETF RFCs | 1. Intro, 3.1 Overview, Appendix A |
MARKDOWN |
Markdown | # headings, --- rules |
CHAT |
Chat logs | [14:30], ISO timestamps, speaker: turns |
CLINICAL |
Medical notes | HPI:, ASSESSMENT:, PLAN:, etc. |
FINANCIAL |
SEC filings | Item 1., PART I, NOTE 1, Schedule A |
FINANCIAL_TABLE |
Data tables | TABLE N, markdown/ASCII separators |
SEC_10K |
SEC annual reports | PART I–IV, Item N., ALL-CAPS sub-headings |
FDA_LABEL |
Drug labels | 1 INDICATIONS, ## 2.1 Adult Dosage |
PLAIN |
Any | No boundaries — pure paragraph/sentence fallback |
Combine presets freely:
boundaries = LEGAL_EU + [r"^TABLE\s+", r"^Annex\s+"]
boundaries = FINANCIAL + FINANCIAL_TABLEMost presets have _LEVELED variants with hierarchical splits:
LEGAL_EU_LEVELED, LEGAL_US_LEVELED, RFC_LEVELED, MARKDOWN_LEVELED,
FINANCIAL_LEVELED, SEC_10K_LEVELED, FDA_LABEL_LEVELED.
See docs/cookbook.md for details.
chunkweaver document.txt --size 1024 --overlap 2
chunkweaver legal_doc.txt --preset legal-eu --format json
chunkweaver file.txt --detect-boundaries --boundaries "^Article\s+\d+"
cat document.txt | chunkweaver --size 1024 --preset rfc
chunkweaver --recommend my_document.txt
chunkweaver --inspect my_document.txtThe --recommend flag analyzes a document and suggests the right config:
=== chunkweaver recommend ===
Document: 12,340 chars, 380 lines, 45 paragraphs
--- Preset matching ---
legal-eu 8 hits <-- best
--- Detectors ---
HeadingDetector: YES (12 headings found)
TableDetector: YES (2 tables found)
--- Python snippet ---
from chunkweaver import Chunker
from chunkweaver.presets import LEGAL_EU
from chunkweaver.detector_heading import HeadingDetector
from chunkweaver.detector_table import TableDetector
chunker = Chunker(
target_size=1024,
overlap=2,
boundaries=LEGAL_EU,
detectors=[HeadingDetector(), TableDetector()],
)
Drop-in replacement for RecursiveCharacterTextSplitter:
from chunkweaver.integrations.langchain import ChunkWeaverSplitter
splitter = ChunkWeaverSplitter(
target_size=1024,
overlap=2,
boundaries=[r"^#{1,3}\s"],
)
docs = splitter.create_documents([text])Requires: pip install chunkweaver[langchain]
Drop-in NodeParser for ingestion pipelines:
from chunkweaver.integrations.llamaindex import ChunkWeaverNodeParser
from chunkweaver.presets import LEGAL_EU
parser = ChunkWeaverNodeParser(
target_size=1024,
boundaries=LEGAL_EU,
overlap=2,
)
nodes = parser.get_nodes_from_documents(documents)Requires: pip install chunkweaver[llamaindex]
For documents without clean section markers — SEC filings, scanned contracts, extracted PDFs — heuristic detectors discover structure from text patterns.
from chunkweaver import Chunker
from chunkweaver.detector_heading import HeadingDetector
from chunkweaver.detector_table import TableDetector
chunker = Chunker(
target_size=1024,
detectors=[HeadingDetector(), TableDetector()],
)HeadingDetector scores lines on casing, length, whitespace context, and known prefixes. Works well on Title Case and ALL CAPS headings.
TableDetector identifies numeric data runs and marks them as keep-together regions. On SEC 10-K filings, keeps 80% of financial tables intact vs. 21% without it.
Custom detectors: subclass BoundaryDetector and return SplitPoint /
KeepTogetherRegion. See examples/ml-detectors/
for scikit-learn examples.
- Cookbook — domain recipes (clinical, FDA, SEC, financial, legal, chat, CJK), hierarchical boundaries, annotation ingestion, vector DB integration, tuning tips
- API Reference — full parameter tables,
Chunkattributes, algorithm details, architecture - FAQ — "What if I need to..." for common questions
- Benchmark — LLM-as-judge methodology, reproduction steps, raw results
Part of a RAG tools suite for retrieval quality:
| Tool | Role | What it measures |
|---|---|---|
| chunkweaver | Ingestion | Structure-aware chunking — controls what text enters the prompt |
| ragtune | Evaluation | Retrieval metrics (Recall@K, MRR, bootstrap CI) — measures how well your pipeline retrieves |
| ragprobe | Pre-deployment | Domain difficulty analysis — predicts how hard retrieval will be before you build |
chunkweaver --recommend my_doc.txt # what config to use
chunkweaver my_doc.txt --preset legal-eu \
--export-dir ./chunks/ # chunk → one .txt per chunk
ragtune ingest ./chunks/ --pre-chunked # embed + store
ragtune simulate --queries golden.json # measure retrieval
ragprobe analyze --corpus ./docs # how hard is this domain--export-dir writes ragtune-compatible files directly — no glue script needed.
Use --format json with --recommend or --inspect for CI-friendly output.
- Sentence detection defaults to a simple regex (
[.!?]\s+(?=[A-Z"(])). Abbreviations like "Dr. Smith" may cause false splits. For non-English or informal text, passsentence_pattern— built-in alternatives:SENTENCE_END_CJK,SENTENCE_END_PERMISSIVE. See the cookbook for Cyrillic, Spanish, and other script-specific guidance. - Boundaries are line-level regex matches — they won't detect inline structural markers.
- No tokenizer awareness —
target_sizeis in characters, not tokens. For token budgets, estimatetokens ≈ chars / 4.
Oleksii Alexapolsky (𝕏) — building retrieval quality tools: chunkweaver, ragtune, ragprobe.
MIT
