A simple information-delivery agent built on a RAG system.
The chunking strategy follows "Levels of Text Splitting" by Greg Kamradt.
Requirements:
- python3
- jupyter notebook

Install dependencies:
pip install -r requirements.txt
🚀 High Performance:
- Parallel processing of multiple files
- Efficient token estimation (4 chars/token rule)
- Embedding caching to avoid recomputation
- Batch processing for embeddings
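The "4 chars/token" rule above trades accuracy for speed: instead of running a tokenizer, the chunker approximates token counts from character length. A minimal sketch (the function name is illustrative, not taken from the codebase):

```python
def estimate_tokens(text: str) -> int:
    """Cheap token estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)
```

This is close enough for sizing chunks against a token budget, and it avoids loading a tokenizer during the chunking pass.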
🧠 Semantic Intelligence:
- Respects paragraph and sentence boundaries
- Adaptive chunking based on content structure
- Keyword extraction for each chunk
- Cosine similarity ranking for queries
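Cosine similarity ranking, as listed above, scores each chunk embedding against the query embedding and returns the best matches. A minimal sketch, assuming embeddings are stored as NumPy row vectors (function and parameter names are illustrative):

```python
import numpy as np

def rank_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray, top_k: int = 3) -> list:
    """Return indices of the top_k chunks most similar to the query."""
    # Normalize so the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q
    # Highest similarity first
    return np.argsort(sims)[::-1][:top_k].tolist()
```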
🔧 Your Model Integration:
- Designed to work with CompendiumLabs/bge-base-en-v1.5-gguf
- Ready for integration with Llama-3.2-1B-Instruct-GGUF
- Proper embedding dimension handling
⚡ Fast Processing:
- Compiled regex patterns for speed
- ThreadPoolExecutor for parallel file processing
- In-memory caching with disk persistence
- Minimal overhead chunking algorithm
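Two of the speed points above, compiled regex patterns and ThreadPoolExecutor parallelism, can be sketched together (pattern and function names are illustrative, not from the codebase):

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Compile the boundary pattern once; it is reused for every file
PARAGRAPH_RE = re.compile(r"\n\s*\n")

def split_paragraphs(text: str) -> list:
    """Split a document into non-empty paragraphs on blank lines."""
    return [p.strip() for p in PARAGRAPH_RE.split(text) if p.strip()]

def process_files(texts: list, max_workers: int = 2) -> list:
    """Split several documents in parallel, one worker pool for all files."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(split_paragraphs, texts))
```

Threads work well here because the per-file work is mostly I/O and regex, and `max_workers` mirrors the chunker's own worker setting.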
- Adaptive Chunking: Respects semantic boundaries while maintaining size constraints
- Rich Metadata: Each chunk includes keywords, token counts, source info, and embeddings
- Caching System: Persistent embedding cache for faster subsequent runs
- Statistics: Built-in analytics for your chunk collection
- Batch Processing: Optimized for large document collections
- Install required packages
- Add your data to ./datasets or use default data
cp <YOUR_PDF> ./datasets
./process_training.sh
- Configure your RAG
jupyter notebook
...
# open in browser
# scroll all the way down to last box
# edit the following
def start_rag_chunker():
    chunker = FastRAGChunker(
        embedding_model_path="sentence-transformers/all-MiniLM-L6-v2",  # change to your custom model
        chunk_size=512,
        chunk_overlap=50,
        max_workers=2,
        cache_embeddings=True
    )
# When finished, press CTRL+ENTER and follow the prompt.
# Ask questions and receive answers, or unanswered context.
# Enter quit or exit to end the conversation.