This project explores and tests multiple methods of text chunking using various techniques and libraries. Text chunking is a vital step in Natural Language Processing (NLP) for tasks like retrieval-augmented generation (RAG), semantic search, and document processing. The goal is to evaluate how different chunking strategies impact the effectiveness of downstream tasks.
- Character-based Chunking: Splitting text into fixed-size chunks manually or using `CharacterTextSplitter`.
- Recursive Character Chunking: Handling large documents with recursive chunking logic for optimal text segmentation.
- Document-Specific Chunking: Chunking tailored for formats like Markdown, Python, and JavaScript code.
- Semantic Chunking: Using embeddings for semantically meaningful text segmentation.
- Agentic Chunking: Employing AI-driven approaches for proposition-based chunking.
- Ensure you have Ollama installed:
  - Go to Ollama's website and install the latest version for your OS.
- Clone the repository:

  ```bash
  git clone https://github.com/Yash8745/Chunking_RAG.git
  ```

- Create and activate a virtual environment:

  ```bash
  conda create -n chunking python=3.11 -y
  conda activate chunking
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Pull the required models using Ollama:

  ```bash
  ollama pull nomic-embed-text
  ollama pull mistral
  ```

- Create a `.env` file for environment-specific configurations.
- Ensure you have the necessary data files, such as `sample_data.txt`, in the project directory.
- Run the main script:

  ```bash
  python main.py
  ```

- View the output for each chunking method in the terminal.
```
chunking_rag/
│
├── README.md
├── agentic_chunker.py
├── app.py
├── requirements.txt
└── sample_data.txt
```
Chunking methods refer to various strategies for breaking down large pieces of text into manageable, meaningful segments. These methods are essential for applications in natural language processing, such as summarization, semantic search, and text generation.
This method involves dividing text into chunks based on character count. It’s straightforward and commonly used in text preprocessing.
- Manual Splitting: Text is split into chunks of a fixed size (e.g., 1000 characters) without overlap.

  Example:
  - Original text: "Hi my name is Yash and I am trying to become an machine learning Engineer..."
  - Chunk 1: "Hi my name is Yash and I am trying to be"
  - Chunk 2: "come an machine learning Engineer..."

- Automatic Splitting: Uses tools like `CharacterTextSplitter` to split text with configurable parameters (see the sketch after this list).

  Parameters:
  - `chunk_size`: Maximum size of each chunk.
  - `chunk_overlap`: Number of overlapping characters between consecutive chunks.

  Example (using `chunk_size=50`, `chunk_overlap=10`):
  - Chunk 1: "Hi my name is Yash and I am trying to become an machin"
  - Chunk 2: "an machine learning Engineer..."
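A minimal sketch of the automatic approach, assuming LangChain's `CharacterTextSplitter`; the separator and parameter values here are illustrative and not necessarily what `main.py` uses:

```python
from langchain.text_splitter import CharacterTextSplitter

# Split on spaces so chunk_size/chunk_overlap take effect on short text;
# for real documents the default separator ("\n\n") is usually fine.
splitter = CharacterTextSplitter(separator=" ", chunk_size=50, chunk_overlap=10)

text = "Hi my name is Yash and I am trying to become an machine learning Engineer..."
chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk!r}")
```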
Recursive splitting is used when the text is too large for fixed-size chunks. It divides text progressively into smaller chunks, ensuring better segmentation.
- Configurable Parameters:
  - `chunk_size`: Size of the final chunks.
  - `chunk_overlap`: Overlap between chunks to maintain context.

Example:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text("Large text input goes here...")
```

This method customizes chunking for different document types to preserve structure and logic.
- Markdown Splitting: Segments Markdown files while maintaining headings and sections.

  Example:

  ```markdown
  ### Section 1
  Content of section 1...
  ### Section 2
  Content of section 2...
  ```

  Result:
  - Chunk 1: "### Section 1\nContent of section 1..."
  - Chunk 2: "### Section 2\nContent of section 2..."

- Python Code Splitting: Splits Python code while respecting logical blocks like functions and classes (a LangChain-based sketch follows this list).

  Example:

  ```python
  def function1():
      # Function 1 implementation

  def function2():
      # Function 2 implementation
  ```

  Result:
  - Chunk 1: `def function1():\n    # Function 1 implementation`
  - Chunk 2: `def function2():\n    # Function 2 implementation`

- JavaScript Code Splitting: Similar to Python, this method respects constructs like functions, objects, and modules.

  Example:

  ```javascript
  function func1() {
    // Implementation
  }

  function func2() {
    // Implementation
  }
  ```
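For code and Markdown, LangChain exposes language-aware splitters. A minimal sketch, assuming `RecursiveCharacterTextSplitter.from_language`; the chunk sizes are illustrative and this may differ from how the project does it:

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Language-aware splitting: separators follow Python syntax (class/def
# boundaries) instead of raw character positions.
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=60, chunk_overlap=0
)

code = """
def function1():
    # Function 1 implementation
    pass

def function2():
    # Function 2 implementation
    pass
"""

for chunk in python_splitter.split_text(code):
    print(repr(chunk))

# The same pattern applies to other formats, e.g. Language.JS or Language.MARKDOWN.
```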
Semantic chunking uses embeddings to split text based on meaning rather than structure. This ensures that semantically similar content is grouped together.
- Uses `OllamaEmbeddings` or similar embedding models.
- Compute embeddings for small sections of text.
- Merge sections with similar embeddings until a semantic threshold is reached.
Example:
Text: "Artificial intelligence is transforming industries. Machine learning is a subset of AI. Deep learning specializes in neural networks."
Chunks:
1. "Artificial intelligence is transforming industries."
2. "Machine learning is a subset of AI. Deep learning specializes in neural networks."
This advanced method uses AI to extract propositions or logical segments from text. It’s ideal for complex documents requiring logical segmentation.
- Uses `AgenticChunker` or similar frameworks.
Input Text:
"Climate change is a global issue. Reducing emissions is crucial. Governments should enforce stricter laws."
Chunks:
- "Climate change is a global issue."
- "Reducing emissions is crucial."
- "Governments should enforce stricter laws."
Use Case: Extracting logical arguments for debate or analysis.
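The repository's `agentic_chunker.py` implements this; as a rough, hypothetical illustration of the underlying idea (not the project's actual code), propositions can be extracted by prompting a local LLM such as the `mistral` model pulled earlier:

```python
from langchain_community.chat_models import ChatOllama

# Ask a local LLM to rewrite the text as standalone propositions, one per
# line, then treat each proposition as its own chunk.
llm = ChatOllama(model="mistral", temperature=0)

text = (
    "Climate change is a global issue. Reducing emissions is crucial. "
    "Governments should enforce stricter laws."
)

prompt = (
    "Decompose the following text into simple, self-contained propositions, "
    "one per line, without adding new information:\n\n" + text
)

response = llm.invoke(prompt)
propositions = [
    line.strip("-• ").strip() for line in response.content.splitlines() if line.strip()
]

for proposition in propositions:
    print(proposition)
```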
The project leverages the following Python libraries:
- LangChain: For text splitting and chunking.
- Rich: For enhanced console output.
- Chroma: For vector database operations.
- LangChain Community Tools: For embeddings and specialized chunking methods.
Each chunking method prints the segmented documents in the following format:

```python
[Document(page_content='Chunk text here...', metadata={'Source': 'local'})]
```

The semantic and agentic methods output chunks that align with the text's meaning or logical propositions.
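If you need to build that output shape yourself, a minimal sketch (the chunk text and metadata values are placeholders):

```python
from langchain.schema import Document

# Wrap raw text chunks in Document objects with simple provenance metadata.
chunks = ["Chunk text here..."]
docs = [Document(page_content=chunk, metadata={"Source": "local"}) for chunk in chunks]
print(docs)
```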
- Add more document-specific chunkers for formats like HTML, JSON, and XML.
- Evaluate the impact of chunking methods on RAG performance.
- Integrate chunking with downstream NLP pipelines.
Contributions are welcome! Please submit a pull request or open an issue to discuss any improvements or suggestions.