Infini is a local, privacy-first small language model (SLM) designed to serve as a personalized coding companion and knowledge assistant.
It runs entirely on your machine using Ollama, LangChain, and FastAPI, powered by your own knowledge base stored in Markdown or JSON.
- Local & Private — all inference runs through your local Ollama server.
- Retrieval-Augmented Generation (RAG) — connects your knowledge (`/knowledge/md`) with a Chroma vector store.
- Streaming Responses — tokens flow live from the model for real-time interactivity.
- API-Ready — serves a REST + SSE API for integration with a front-end UI.
- Extendable — easily plug in more models, add new knowledge, or integrate a web front-end.
```
┌────────────────────────┐
│  React / Frontend UI   │ ← (optional, connects via SSE)
└──────────┬─────────────┘
           │
   [ FastAPI Server ]
           │
           ▼
  LangChain Retriever
           │
           ▼
Ollama Local Model (phi3)
           │
           ▼
knowledge/md + chroma_db
```
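The diagram above can be read as a single request pipeline: retrieve context, build a prompt, generate. A dependency-free Python sketch of that flow, with naive keyword matching standing in for the Chroma vector store (function names here are illustrative, not the actual implementation):

```python
def retrieve(query: str, store: dict[str, str], k: int = 2) -> list[str]:
    """Naive keyword retrieval standing in for the Chroma vector store:
    rank stored chunks by how many query words they contain."""
    scored = sorted(
        store.items(),
        key=lambda kv: sum(w in kv[1].lower() for w in query.lower().split()),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def build_prompt(query: str, context: list[str]) -> str:
    """Combine retrieved chunks and the user query into a single prompt."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"
```

In the real app, retrieval is handled by a LangChain retriever over `chroma_db`, and the prompt is sent to the local `phi3` model through Ollama.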
```bash
git clone https://github.com/YOUR_USERNAME/infini-local.git
cd infini-local
```

Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate      # Mac/Linux
# or
.\.venv\Scripts\activate       # Windows
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Make sure Ollama is installed and running locally:

```bash
ollama serve
```

Pull your base model (for example Phi-3):

```bash
ollama pull phi3
```

Start the API server:

```bash
python infini_api.py
```

You should see:

```
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
```
| Endpoint | Method | Description |
|---|---|---|
| `/ask` | POST | Returns a full JSON response (non-streaming). |
| `/ask/stream` | POST | Streams tokens live via Server-Sent Events (SSE). |
| `/cancel` | POST | Cancels an active request. |
| `/stats` | GET | Returns total query count and average response time. |
| `/` | GET | Health check (“Infini API is running 🚀”). |
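`/ask/stream` uses Server-Sent Events, so a client has to split the stream into `data:` payloads. A minimal sketch of that parsing, assuming each `data:` line carries one token chunk (the exact payload format Infini emits is not shown here):

```python
def parse_sse(raw: str) -> list[str]:
    """Collect the data payloads from a raw Server-Sent Events stream.

    SSE frames are separated by blank lines; each `data:` line carries
    one payload (assumed here to be a model token or token chunk).
    """
    payloads = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            payloads.append(line[len("data:"):].lstrip())
    return payloads
```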
```bash
curl -X POST http://127.0.0.1:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "Which projects used MLflow?"}'
```

```bash
curl -N -X POST http://127.0.0.1:8000/ask/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "Summarize the Retail360 project."}'
```

Place all your knowledge files in:

```
knowledge/md/
```
You can generate them from JSON, Markdown notes, or project data.
When the API starts, it automatically builds or updates the chroma_db index for retrieval.
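How that index is built is internal to the app, but as an illustration only, a paragraph-aligned chunking pass over `knowledge/md/` could look like the following sketch (function names are hypothetical; the real pipeline presumably uses LangChain's text splitters and embeds into Chroma):

```python
from pathlib import Path

def chunk_markdown(text: str, max_chars: int = 500) -> list[str]:
    """Split a markdown document into paragraph-aligned chunks of
    roughly max_chars characters, never breaking inside a paragraph."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def load_knowledge(root: str = "knowledge/md") -> list[tuple[str, str]]:
    """Return (source_path, chunk) pairs for every markdown file under root."""
    pairs = []
    for path in sorted(Path(root).glob("**/*.md")):
        for chunk in chunk_markdown(path.read_text(encoding="utf-8")):
            pairs.append((str(path), chunk))
    return pairs
```

Keeping chunks paragraph-aligned preserves coherent context for retrieval, at the cost of slightly uneven chunk sizes.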
- Personal coding assistant trained on your projects
- Company-specific data knowledge base (private RAG)
- Databricks / MLflow / DQX pipeline query tool
- Local alternative to cloud LLMs
A React chat interface can connect directly to `/ask/stream`
for real-time conversation, with a cancel button and a response timer.
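The response timer pairs naturally with the `/stats` endpoint. This README does not show how those numbers are computed server-side; a minimal aggregation sketch (the class name is hypothetical) might be:

```python
class StatsTracker:
    """Tracks query count and average response time,
    as a /stats-style endpoint might report them."""

    def __init__(self) -> None:
        self.count = 0
        self.total_seconds = 0.0

    def record(self, seconds: float) -> None:
        """Record one completed request's elapsed time."""
        self.count += 1
        self.total_seconds += seconds

    def snapshot(self) -> dict:
        """Return the aggregate stats as a JSON-serializable dict."""
        avg = self.total_seconds / self.count if self.count else 0.0
        return {"total_queries": self.count, "avg_response_time": round(avg, 3)}
```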
To preview it later:
```bash
cd frontend
npm install
npm run dev
```

This project is for personal and educational use.
Feel free to modify, extend, and self-host your own Infini instance.
- Thomas Wolfe — concept, architecture, and implementation
- Ollama — local model hosting
- LangChain + Chroma — RAG engine
- FastAPI — backend API server