Chat with your own files. 100% local, nothing leaves your machine.
Beacon indexes the folders you choose, finds the passages that actually answer your question, and answers using a local LLM through Ollama. Context is optimized by TokenGate so you get sharp answers instead of token dumps.
Your laptop is full of answers: resumes, contracts, notes, PDFs, code docs. Finding "which file said that?" normally means opening a dozen windows. Cloud tools solve this by uploading your files to someone else's servers.
Beacon keeps everything on your device.
- Private by design. Files, embeddings, the vector DB, and chat history never leave your machine. The only network calls are to your own local Ollama and a one-time model download from Hugging Face.
- Grounded answers. Every reply cites the exact files it used, with inline previews you can click to open.
- Smart context, not token dumps. TokenGate reranks, deduplicates, compresses, and budgets the retrieved chunks so the LLM sees only the most relevant content.
- Full transparency. The built-in TokenGate Insights view shows, per question, what was retrieved, what was kept vs. dropped, tokens saved, and the exact prompt sent to the LLM.
- Agentic mode. For models that support tool calling (e.g. `gemma4:e4b`), Beacon lets the LLM search files on demand instead of doing a single bulk retrieval.
- A/B comparison built-in. Toggle TokenGate on/off per chat to compare against a best-practice baseline RAG (rerank, top-N selection, LangChain stuffing).
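To make the deduplicate-and-budget idea concrete, here is a minimal sketch. It is not TokenGate's API: `Chunk`, `count_tokens`, and `select_context` are hypothetical names, whitespace splitting stands in for a real tokenizer, and the scores are assumed to come from a reranker. Compression is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A retrieved passage with its relevance score (hypothetical shape)."""
    source: str
    text: str
    score: float


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting as a crude stand-in for a real tokenizer.
    return len(text.split())


def select_context(chunks: list[Chunk], max_tokens: int) -> list[Chunk]:
    """Keep the highest-scoring unique chunks that fit within max_tokens."""
    kept: list[Chunk] = []
    seen: set[str] = set()
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        key = " ".join(chunk.text.lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # duplicate of a chunk that was already kept
        cost = count_tokens(chunk.text)
        if used + cost > max_tokens:
            continue  # does not fit the remaining budget; try smaller chunks
        seen.add(key)
        kept.append(chunk)
        used += cost
    return kept
```

A greedy pass like this favors high-scoring chunks but still lets a small low-scoring chunk fill leftover budget; a real optimizer may make that trade-off differently.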
    Your folders
        |  scan (incremental, ignores node_modules/site-packages/...)
        v
    Extract text       TXT / MD / PDF / DOCX / images (OCR optional)
        |  chunk (token-aware, with overlap)
        v
    BGE-M3 embed  -->  LanceDB (local vector store)
                         ^
    your question -------+  retrieve top-50 within an optional folder scope
                         |
                         v
    Relevance gate     cross-encoder check; chitchat is answered directly without file context
        |
        v
    TokenGate.optimize()  OR  Best-practice baseline (rerank, top-N, LangChain)
        |
        v
    Ollama (local LLM)  -->  streamed answer, cited files, full audit
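The "chunk (token-aware, with overlap)" step in the diagram can be illustrated with a sliding window over a token list. This is a sketch under stated assumptions, not Beacon's chunker: `chunk_tokens` is a hypothetical helper, and the tokens here are plain words rather than output from the embedder's tokenizer.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of at most `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

For example, `chunk_tokens("a b c d e f g".split(), size=4, overlap=1)` yields two windows that share the token `d`, so a sentence straddling a boundary still appears whole in at least one chunk.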
| Layer | Technology |
|---|---|
| Backend API | Python 3.12, FastAPI, uv |
| Retrieval | LanceDB (vectors) + BGE-M3 embeddings |
| Reranking | BGE-Reranker-v2-m3 (cross-encoder) |
| Context optimization | tokengate library |
| Baseline RAG | LangChain (LCEL, ChatOllama) |
| Local LLM | Ollama |
| Metadata | SQLite |
| Frontend | Vite, React, TypeScript, Framer Motion |
| Desktop (planned) | Tauri |
| Requirement | Notes |
|---|---|
| Python 3.12 | Backend pinned to >=3.12,<3.13 for ML wheel compatibility. Install from python.org or via pyenv. |
| uv | Python env and dependency manager. Install with `pip install uv` or see the uv docs. |
| Node.js >= 20 | For the Vite frontend. Download from nodejs.org. |
| Ollama | Local LLM runtime. Install from ollama.com, then pull a model. |
| NVIDIA GPU + CUDA 12.8 | Optional. Speeds up embedding and reranking; CPU fallback is automatic. |
First run downloads models. BGE-M3 (embedder, ~600 MB) and BGE-Reranker-v2-m3 (reranker, ~1.1 GB) are fetched from Hugging Face the first time you index or chat. They are cached locally afterward.

    git clone https://github.com/Mario-Vishal/beacon.git
    cd beacon

    ollama pull gemma4:e4b      # recommended default (supports tool-calling + thinking, 128K ctx)
    # or: ollama pull llama3.2  # lighter alternative
    ollama serve                # or just have the Ollama desktop app running

    cd backend
    uv sync                                      # create .venv and install all Python deps
    uv run uvicorn beacon.main:app --port 8000   # API on http://localhost:8000

Verify it's up by opening http://localhost:8000/health. It returns the current GPU/CPU mode and Ollama status.
Open a new terminal:

    cd frontend
    npm install
    npm run dev   # UI on http://localhost:5173

Windows / PowerShell: if `npm` is blocked by execution policy, use `npm.cmd run dev` or run `Set-ExecutionPolicy -Scope CurrentUser RemoteSigned` once.
- Open http://localhost:5173
- Click Add folder and pick a folder to index (e.g. `~/Documents`)
- Wait for indexing to finish. Progress appears in the file explorer.
- Ask a question: "Which of my resumes mentions AWS experience?"
After pulling new code: restart the backend (`Ctrl+C`, then `uv run uvicorn ...`) so the running process picks up the changes.
Open the Settings overlay (gear icon) to adjust:
| Setting | Options | Notes |
|---|---|---|
| Ollama model | any model name | default gemma4:e4b |
| GPU mode | auto / cpu_only / force_gpu | auto runs the embedder on CPU and the LLM on GPU, which suits GPUs with around 12 GB of VRAM |
| Retrieval mode | auto / gate / always / agentic | gate = adaptive (chitchat answered directly); agentic = LLM searches files with tools |
| Strategy | speed / balanced / quality / max_compression | TokenGate optimization preset |
| Max prompt tokens | integer | overrides the dynamic budget if set |
| File types | list of extensions | which files get indexed |
Settings persist in a local SQLite database under your OS app-data directory. Override the location with BEACON_DATA_DIR.
Every chat session has a TokenGate toggle. Turn it off to run the best-practice baseline RAG instead: cross-encoder rerank, top-20 selection, then LangChain stuffing. The Insights view shows both paths so you can directly compare token usage and answer quality.
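The baseline path described above (rerank, keep the top 20, stuff everything into one prompt) can be sketched in plain Python. This is an illustration only: it does not use LangChain, the prompt wording is invented, and `build_baseline_prompt` is a hypothetical name. Scores are assumed to come from the cross-encoder.

```python
def build_baseline_prompt(
    question: str,
    scored_chunks: list[tuple[str, float]],
    top_n: int = 20,
) -> str:
    """Sort chunks by score, keep the top_n, and concatenate them into a single prompt."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)[:top_n]
    context = "\n\n".join(text for text, _score in ranked)
    # Assumption: an illustrative prompt template, not the one Beacon sends.
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Because every selected chunk is included verbatim, prompt size grows with `top_n` regardless of how much of each chunk is relevant, which is the behavior the toggle lets you compare against.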
When the model supports tool calling (gemma4:e4b does), retrieval_mode=agentic lets the LLM invoke search_files, list_directory, and read_file tools within the indexed boundary. TokenGate optimizes each tool result before it enters the context window.
The dock's TokenGate Insights tab shows, per message:
- Tokens in, tokens out, and % saved
- Per-stage funnel with blocks in/out at each pipeline stage
- Mode badge (TokenGate vs. LangChain Baseline)
- Per-question LLM spend in prompt and output tokens
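The "% saved" figure is presumably the relative reduction between the two token counts. A minimal sketch, assuming "tokens in" means the retrieved context before optimization and "tokens out" means what remains afterward (`percent_saved` is a hypothetical helper):

```python
def percent_saved(tokens_in: int, tokens_out: int) -> float:
    """Percentage of tokens removed between input and output, rounded to one decimal place."""
    if tokens_in <= 0:
        return 0.0  # nothing was retrieved, so nothing could be saved
    return round(100.0 * (tokens_in - tokens_out) / tokens_in, 1)
```

Under that reading, 12,000 tokens in and 3,000 tokens out corresponds to 75% saved.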
Chat history is saved. Click the History (clock) icon to resume a previous session. Citations, toggles, and the full audit are all restored.
Every answer shows the files it used as clickable chips. Click a chip to preview the file (image inline, PDF/text extracted) and jump to it in the file explorer.
Beacon is local-first:
- Indexed file contents, embeddings, LanceDB/SQLite databases, and content-bearing logs are in `.gitignore` and never committed or transmitted.
- The only network calls are to Hugging Face for the first-run model download and to your own local Ollama.
- No telemetry, no analytics, no cloud sync.
    # backend tests
    cd backend && uv run pytest     # 140 tests
    uv run ruff check src tests     # lint
    uv run mypy src                 # type-check (strict)

    # frontend build
    cd frontend && npm run build    # type-check + production build