The Ampere LLM Orchestrator is a high-performance, high-density platform designed for orchestrating and monitoring multiple Large Language Model (LLM) instances on AmpereOne® M CPUs.
By leveraging the high core-count architecture of Ampere processors, this application achieves extreme throughput and density, running multiple independent llama.cpp servers concurrently—each isolated and pinned to specific 32-core segments for deterministic performance.
- High-Density Orchestration: Deploy and manage 4 independent `llama.cpp` nodes simultaneously, each with its own dedicated 32-core resources.
- Domain-Specific AI Experts:
- Legal & Compliance: Regulatory and corporate law specialist.
- Cybersecurity: Threat intelligence and network security architect.
- Fintech & Finance: Algorithmic trading and financial regulation expert.
- Supply Chain & Ops: Global logistics and operational resilience specialist.
- Modern Startup UI: Sleek dark-mode dashboard with real-time performance visualization.
- Real-Time Performance Metrics:
- Cluster Throughput (TPS): Aggregate tokens per second across all active workers.
- Peak Throughput: Tracking session-high performance levels.
- CPU Load Monitoring: Real-time utilization stats per container, normalized to the assigned 32-core `cpuset`.
- Deterministic Compute: Uses `cpuset` pinning to ensure no resource contention between AI domains.
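In Docker Compose, this style of pinning is typically expressed with the `cpuset` key on each service. A minimal sketch of two of the four nodes (service names, image tag, and port mappings are illustrative, not taken from this repo):

```yaml
services:
  legal-expert:
    image: ghcr.io/ggerganov/llama.cpp:server  # illustrative image tag
    cpuset: "0-31"            # pin this node to its own 32-core segment
    environment:
      LLAMA_ARG_THREADS: 32   # thread count matched to the cpuset size
  cybersec-expert:
    image: ghcr.io/ggerganov/llama.cpp:server
    cpuset: "32-63"           # next 32-core segment; no overlap between domains
```

Keeping the thread count equal to the cpuset size avoids oversubscription within a segment while leaving the other segments untouched.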
The system consists of a modern full-stack architecture optimized for high-density inference:
- Frontend: React (Vite) + Tailwind CSS + Lucide Icons featuring a glassmorphism "AI Startup" aesthetic.
- Backend: Node.js (Express) serving as an intelligent API proxy and performance monitor.
- Inference Nodes: 4x `llama.cpp` containers running in Docker, each serving specialized GGUF models.
- Docker and Docker Compose
- Ampere-based Cloud Instance (OCI A1, AWS M6g/C6g/R6g, GCP T2A, etc.)
- Node.js 20+
- Clone the repository.
- Configure your environment: Copy `.env.example` to `.env` and adjust the model parameters (`cp .env.example .env`).
- Launch the Orchestrator: `docker-compose up -d`
- Access the Dashboard: Navigate to `http://localhost:3000` to start the cluster orchestration.
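Once the stack is up, the inference nodes can also be probed directly via the standard `llama.cpp` server HTTP API. The host ports below are an assumption (check `docker-compose.yml` for the actual mappings):

```shell
# Assumes the first node is published on port 8081 (illustrative).
# /health and /v1/chat/completions are standard llama.cpp server endpoints.
curl -s http://localhost:8081/health

curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'
```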
The orchestration is highly configurable via environment variables:
| Variable | Description |
|---|---|
| `HF_REPO_X` | Hugging Face repository for Domain Expert X (1-4). |
| `HF_FILE_X` | Specific GGUF model file for Domain Expert X (1-4). |
| `LLAMA_ARG_THREADS` | Number of CPU threads (default 32) per instance. |
| `cpuset` | Core pinning (e.g., `0-31`, `32-63`) defined in `docker-compose.yml`. |
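Given the variables above, a `.env` might look like the following sketch. The repository and file names are illustrative placeholders, not defaults shipped with this project:

```shell
# Domain Expert 1: Legal & Compliance (illustrative model choice)
HF_REPO_1=TheBloke/Mistral-7B-Instruct-v0.2-GGUF
HF_FILE_1=mistral-7b-instruct-v0.2.Q4_K_M.gguf
# ...repeat HF_REPO_2..4 / HF_FILE_2..4 for the other three experts

# Threads per llama.cpp instance; should match the cpuset width
LLAMA_ARG_THREADS=32
```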
AmpereOne® M CPUs provide a deterministic, high-core-count environment ideal for LLM density. By eliminating cross-instance thread contention through strict core pinning and leveraging large core counts, this orchestrator demonstrates how a single Ampere server can replace expensive GPU-based infrastructure for high-throughput, low-latency inference workloads.
Built for High-Efficiency AI on Ampere Computing.