This repository contains a production-ready Semantic Search and Retrieval-Augmented Generation (RAG) system built using Python, FastAPI, Pinecone, and Llama 3 (served via Groq).
The system combines sparse keyword-based retrieval (BM25) with dense vector retrieval (Pinecone) to deliver accurate, low-latency, and scalable question answering.
The project is intentionally structured as a modular Python codebase, avoiding notebook-centric designs.
-
Hybrid Retrieval
- Sparse retrieval using BM25 for exact keyword matching
- Dense semantic retrieval using sentence embeddings stored in Pinecone
- Weighted score fusion for improved recall and robustness
-
Production-Grade Architecture
- Modular Python files with clear separation of concerns
- Config-driven design with environment-based secrets
- Designed for extensibility (reranking, evaluation, APIs)
-
Low-Latency LLM Inference
- Llama 3 (
8B / 70B) served via Groq - Fast response times suitable for real-time applications
- Llama 3 (
-
API-First Design
- FastAPI-based REST service
- Interactive Swagger documentation
- CLI and API interfaces supported
semantic-search-api/
│
├── src/
│ ├── config.py # Central configuration & environment loading
│ ├── api.py # FastAPI application
│ ├── data.py # Document loading and chunking
│ ├── retrieval.py # BM25, dense, and hybrid retrievers
│ ├── generation.py # Prompting and Llama 3 (Groq) integration
│ ├── pipeline.py # End-to-end RAG orchestration
│ └── main.py # CLI entry point
│
├── data/
│ └── documents/ # Input text documents
│
├── tools/
│ └── build_pinecone_index.py # One-time vector index builder
│
├── requirements.txt
├── .env.example
├── .gitignore
└── README.md
- Documents are loaded and chunked into passages
- BM25 retrieves keyword-relevant passages in-memory
- Pinecone retrieves semantically relevant passages using embeddings
- Retrieval scores are fused via a hybrid strategy
- Top-ranked context is injected into a prompt
- Llama 3 on Groq generates a grounded answer
This separation of retrieval, ranking, and generation makes the system easier to tune, debug, and scale.
git clone <your-repository-url>
cd semantic-search-apipip install -r requirements.txtCreate a .env file in the project root:
GROQ_API_KEY=your_groq_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here.env to version control.
Before running the system, build the vector index:
python tools/build_pinecone_index.pyThis step:
- Embeds document chunks
- Uploads vectors to Pinecone
- Stores text as metadata for retrieval
uvicorn src.api:app --reload-
Swagger UI: http://127.0.0.1:8000/docs
-
Endpoint:
POST /query
Example request:
{
"query": "What is hybrid search?",
"top_k": 5
}python src/main.pyExample:
Query: Explain hybrid retrieval
Answer: ...
The system is designed to be easily extended with:
- Cross-encoder or LLM-based rerankers
- Streaming responses
- Authentication and rate limiting
- Vector store alternatives (FAISS, Chroma)
- Monitoring and feedback loops
This project demonstrates:
- Practical understanding of information retrieval systems
- Trade-offs between sparse and dense search
- Real-world RAG system design beyond notebooks
- Clean, production-style Python engineering
- API-first ML system deployment