Intelligent Local AI Orchestration System
HyperOrch bridges the gap between running simple local models and orchestrating complex AI workflows. It acts as a highly intelligent task router, resource monitor, and API server for Ollama and llama.cpp, bringing cloud-level orchestration to consumer hardware.
Features • Installation • Task Router • Architecture • Usage
- 🧠 Smart Task Router: Define profiles (e.g., Coding, Chat, Embedding) and HyperOrch automatically routes prompts to the optimal model and hardware (GPU vs. CPU).
- 🚀 Hybrid Execution: Runs interactive models on your GPU for speed, while offloading background tasks (embeddings, summaries) to your CPU to avoid VRAM bottlenecks.
- 📊 Live Hardware Dashboard: Real-time websocket streaming of CPU, RAM, and GPU utilization. Built-in workarounds for AMD GPU VRAM reporting on Windows.
- 🛡️ Memory Safety Guards: Calculates required VRAM before launching models, preventing OOM (Out of Memory) crashes.
- 💾 Context Persistence: Save and load exact system states (`context.json`) for seamless handoffs between human developers and AI coding assistants.
- 🌐 Dual Interface: Fully featured Web GUI (`localhost:8001`) and a robust command-line interface (`hyperorch`).
- Python 3.10+
- Ollama installed and running in the background.
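If you are unsure whether Ollama is actually running, a quick request against its default REST port will tell you. This is a minimal sketch that assumes Ollama's standard `localhost:11434` endpoint:

```python
# Quick sanity check that the Ollama daemon is reachable.
# Assumes Ollama's default port (11434); adjust if yours differs.
import json
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=3) as resp:
        models = json.load(resp).get("models", [])
    print(f"Ollama is up with {len(models)} model(s) available.")
except OSError as exc:
    print(f"Ollama does not appear to be running: {exc}")
```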
# 1. Clone the repository
git clone https://github.com/desagencydes-rgb/HyperOrch.git
cd HyperOrch
# 2. Install dependencies
pip install -r requirements.txt
pip install -e .
# 3. Download a model (if you don't have one)
ollama pull llama3:latest

Terminal 1: Start Ollama
ollama serve

Terminal 2: Start HyperOrch
hyperorch serve --port 8001

Terminal 3 (or Browser):
Navigate to http://localhost:8001 to access the Cyber-Dashboard.
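If you prefer to script the check instead of opening a browser, a one-liner against the same URL works (a minimal sketch; it assumes nothing beyond the dashboard address above):

```python
# Confirm the HyperOrch server is answering on the dashboard port.
import urllib.request

with urllib.request.urlopen("http://localhost:8001", timeout=5) as resp:
    print("HyperOrch dashboard responded with HTTP", resp.status)
```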
HyperOrch introduces Task Profiles. Instead of manually managing which model is loaded in VRAM, you simply declare your intent.
| Task Intent | Target Model | Hardware Assigned | Priority | Why? |
|---|---|---|---|---|
| Coding | `qwen2.5-coder:7b` | 🟢 GPU | P1 | Needs lowest latency and a large context window. |
| Chat | `llama3:latest` | 🟢 GPU | P2 | Interactive conversational latency matters. |
| Summarize | `dolphin-mistral` | 🟡 AUTO | P5 | Uses the GPU if free; falls back to CPU if busy. |
| Embed | `nomic-embed-text` | 🔵 CPU | P8 | Tiny model (274 MB); do not waste precious GPU VRAM. |
Profiles are fully configurable in the Python router.
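The 🟡 AUTO tier is what makes hybrid execution work: interactive models stay on the GPU while background work falls back to the CPU when VRAM is busy. Below is a rough sketch of that decision; the 80% threshold and the function itself are illustrative assumptions, not HyperOrch internals:

```python
# Illustrative sketch of the AUTO fallback decision: prefer the GPU while it
# has headroom, otherwise run on the CPU. Threshold and names are assumptions.
def pick_hardware(preference: str, gpu_vram_used_pct: float) -> str:
    if preference in ("GPU", "CPU"):
        return preference  # Explicit pins (e.g. Coding, Embed) are honored as-is.
    if preference == "AUTO":
        return "GPU" if gpu_vram_used_pct < 80.0 else "CPU"
    raise ValueError(f"Unknown hardware preference: {preference}")

print(pick_hardware("AUTO", gpu_vram_used_pct=92.0))  # -> "CPU" (GPU is busy)
```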
HyperOrch utilizes a FastAPI backend connected to a Vanilla JS/WebSocket frontend.
graph TD
UI[Web Dashboard] <-->|WebSockets| API(FastAPI Server)
CLI[HyperOrch CLI] --> API
API --> ROUTER[Task Router]
API --> MON[Resource Monitor]
API --> MGR[Model Manager]
ROUTER -->|Generate/Embed/Chat| OLL[Ollama Client]
MGR -->|Launch/Stop via subprocess| OLLAMA[Ollama Process]
MGR -->|Launch via subprocess| LCPP[llama.cpp Process]
MON -->|psutil/pyadl/WMI| HW[Hardware GPU/CPU]
subgraph Execution Layer
OLLAMA
LCPP
end
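The live dashboard follows the UI ↔ WebSockets ↔ FastAPI path in the diagram. Here is a stripped-down sketch of that pattern; the `/ws/metrics` route and payload fields are illustrative, not HyperOrch's actual endpoints:

```python
# Stripped-down sketch of the WebSocket metrics path shown in the diagram.
# The /ws/metrics route and payload fields are illustrative assumptions.
import asyncio

import psutil
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/metrics")
async def stream_metrics(ws: WebSocket) -> None:
    await ws.accept()
    while True:
        await ws.send_json({
            "cpu_percent": psutil.cpu_percent(interval=None),
            "ram_percent": psutil.virtual_memory().percent,
        })
        await asyncio.sleep(1)  # Push a fresh sample roughly once per second.
```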
- `server.py`: FastAPI application serving REST endpoints and WebSockets for real-time logs and metrics.
- `task_router.py`: Profile-based intent router. Prevents VRAM thrashing by queuing tasks appropriately.
- `ollama_client.py`: Async HTTP client for streaming interactions with the local Ollama API.
- `resource_monitor.py`: Cross-vendor hardware tracking. Features custom fallbacks for AMD Radeon GPUs via WMI and pyadl.
- `orchestrator.py`: Evaluates hardware capability vs. model requirements to calculate safe -ngl (GPU layer) offloading parameters.
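The safety calculation in `orchestrator.py` comes down to comparing a model's footprint against free VRAM before deciding how many layers to offload. A rough sketch of that idea follows; the overhead margin and every name below are assumptions, not the actual implementation:

```python
# Rough sketch of a VRAM-safe -ngl (GPU layer) estimate. The 10% KV-cache/
# overhead margin and all names here are assumptions, not HyperOrch's code.
def safe_gpu_layers(model_size_mb: float, total_layers: int, free_vram_mb: float,
                    overhead: float = 1.10) -> int:
    per_layer_mb = (model_size_mb * overhead) / total_layers
    layers_that_fit = int(free_vram_mb // per_layer_mb)
    return max(0, min(total_layers, layers_that_fit))

# Example: a ~4.7 GB 7B model with 32 layers and 3 GB of free VRAM.
print(safe_gpu_layers(4700, 32, 3000))  # -> offload roughly 18 of 32 layers
```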
HyperOrch provides a powerful command-line interface for headless environments.
# Scan system for existing Ollama/GGUF models
hyperorch detect
# View current CPU/RAM/GPU usage and active model
hyperorch status
# Launch a specific model directly (bypassing the router)
hyperorch launch llama3:latest
# Stop the currently running model
hyperorch stop
# Save current CLI/System state for AI continuity
hyperorch context save my_state.json

HyperOrch has built-in support for AMD GPUs via Vulkan.
- For Ollama: Ollama (v0.17.1+) automatically utilizes Vulkan for AMD. HyperOrch will detect your Radeon GPU and monitor its VRAM capacity accordingly.
- For llama.cpp: Ensure you download the Vulkan build from the llama.cpp releases page.
- Launch via CLI:

hyperorch launch model.gguf --backend llamacpp --gpu-layers -1
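Under the hood, the Model Manager starts llama.cpp as a subprocess (see the architecture diagram). Here is a minimal sketch of such a launch, assuming a Vulkan build of llama.cpp's `llama-server` binary is on your PATH; the exact command HyperOrch assembles may differ:

```python
# Minimal sketch: launching a llama.cpp server as a subprocess, as the Model
# Manager does. Assumes a Vulkan build of `llama-server` is on PATH; the exact
# arguments HyperOrch assembles may differ.
import subprocess

proc = subprocess.Popen([
    "llama-server",
    "-m", "model.gguf",   # Path to your GGUF model file.
    "-ngl", "99",         # Offload as many layers as fit on the GPU.
    "--port", "8080",     # Local HTTP port for the llama.cpp server.
])
print(f"llama.cpp server started with PID {proc.pid}")
```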
To change how tasks are routed, edit the DEFAULT_PROFILES array in hyperorch/task_router.py. You can adjust hardware preferences (Hardware.GPU, Hardware.CPU, Hardware.AUTO) and assign custom internal fallbacks.
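As a rough illustration, adding a new profile might look like the snippet below; the Profile constructor and its field names are assumptions, so check `hyperorch/task_router.py` for the real structure:

```python
# Illustrative only: what adding a profile to DEFAULT_PROFILES might look like.
# The Profile constructor and its field names are assumptions; the real
# definition lives in hyperorch/task_router.py.
from hyperorch.task_router import DEFAULT_PROFILES, Hardware, Profile

DEFAULT_PROFILES.append(
    Profile(
        intent="translate",          # New task intent to route.
        model="llama3:latest",       # Target Ollama model tag.
        hardware=Hardware.AUTO,      # GPU if free, CPU otherwise.
        priority=6,                  # Lower number = higher priority.
        fallback="dolphin-mistral",  # Internal fallback model.
    )
)
```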
Released under the MIT License.