Intelligent LLM Request Scheduler with Workload Prediction
Layered Scheduling for Black-Box LLM APIs: Allocation, Ordering, and Overload Control
SageSched is a prediction-driven scheduling system for black-box LLM APIs. It predicts request workload (output tokens, cost, risk) before execution and uses these predictions to schedule requests with QoS guarantees — prioritizing short interactive requests while managing heavy workloads through dual-queue, dual-budget mechanisms.
- Workload Prediction — Hybrid ML + semantic retrieval predicts output tokens, cost, and risk before execution
- QoS-Aware Scheduling — Short-first dual-queue with token budget isolation (interactive vs. heavy)
- Adaptive Congestion Control — Three-mode (free/moderate/congested) automatic scaling
- Multi-Account Load Balancing — AIMD congestion control across multiple LLM API accounts
- Gittins Index Scheduling — Theoretically optimal strategy for minimizing average completion time (TTLT)
- Zero-GPU Required — Runs on CPU with FAISS + MiniLM embeddings
When calling LLM APIs (OpenAI, Azure, Doubao, Gemini, etc.), you face:
| Problem | SageSched Solution |
|---|---|
| Short requests blocked by long ones | Short-first dual-queue with budget isolation |
| Can't predict cost before execution | ML + semantic retrieval predicts output tokens |
| Rate limits across multiple accounts | AIMD load balancer with congestion detection |
| No visibility into request behavior | Real-time metrics, events, and risk scoring |
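The AIMD row above can be illustrated with a small sketch. This is not the proxy's actual code (the proxy is written in TypeScript); it is a minimal Python illustration of additive-increase/multiplicative-decrease window sizing per account. The class name `AimdWindow`, the function `pick_account`, and every constant (initial window, latency threshold, decrease factor) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class AimdWindow:
    """Concurrency window for one upstream account. All constants are illustrative."""
    window: float = 4.0
    min_window: float = 1.0
    max_window: float = 64.0
    additive_step: float = 1.0
    decrease_factor: float = 0.5
    latency_threshold_ms: float = 5000.0

    def on_response(self, latency_ms: float, rate_limited: bool = False) -> float:
        """Update the window after one upstream response and return the new size."""
        if rate_limited or latency_ms > self.latency_threshold_ms:
            # Multiplicative decrease: back off quickly on a rate limit or slow response.
            self.window = max(self.min_window, self.window * self.decrease_factor)
        else:
            # Additive increase: about +additive_step per full window of successes.
            self.window = min(self.max_window, self.window + self.additive_step / self.window)
        return self.window


def pick_account(windows: dict, inflight: dict):
    """Return the account with the most free window slots, or None if all are full."""
    best, best_free = None, 0.0
    for name, aimd in windows.items():
        free = aimd.window - inflight.get(name, 0)
        if free > best_free:
            best, best_free = name, free
    return best


# Usage with two hypothetical accounts: "b" has more headroom, so it is picked.
windows = {"a": AimdWindow(window=2.0), "b": AimdWindow(window=8.0)}
print(pick_account(windows, {"a": 2, "b": 3}))  # prints: b
windows["b"].on_response(latency_ms=9000.0)     # a slow response halves b's window
```

The asymmetry is the point of AIMD: the window grows slowly while responses are healthy and shrinks sharply on the first sign of congestion, so a rate-limited account sheds load faster than it regains it.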
┌─────────────────────────────────────────────────────────┐
│                       Client App                        │
│               baseURL: localhost:3000/v1                │
└──────────────────────┬──────────────────────────────────┘
                       │
          ┌────────────▼────────────┐
          │    LLM Load Balancer    │  ← Multi-account proxy
          │     (Node.js:3000)      │    AIMD congestion ctrl
          └────────────┬────────────┘
                       │
  ┌────────────────────▼────────────────────┐
  │         Scheduler (Python:8010)         │
  │  ┌─────────────┐  ┌──────────────────┐  │
  │  │ Dual        │  │ Dual Token       │  │
  │  │ Queues      │  │ Budget           │  │
  │  │ interactive │  │ interactive: 3K  │  │
  │  │ heavy       │  │ heavy: 7K        │  │
  │  └──────┬──────┘  └──────────────────┘  │
  │         │     ┌──────────────────────┐  │
  │         │     │ Adaptive Congestion  │  │
  │         └────►│ free/moderate/congest│  │
  │               └──────────────────────┘  │
  └────────────────────┬────────────────────┘
                       │ POST /predict
          ┌────────────▼────────────┐
          │ Predictor (Python:8000) │
          │  ┌────────┐ ┌────────┐  │
          │  │ ML GBM │ │ FAISS  │  │
          │  │Quantile│ │Semantic│  │
          │  │ Model  │ │Retrieve│  │
          │  └────────┘ └────────┘  │
          │  ┌───────────────────┐  │
          │  │ SQLite History DB │  │
          │  └───────────────────┘  │
          └─────────────────────────┘
| Component | Port | Tech Stack | Role |
|---|---|---|---|
| Predictor | 8000 | FastAPI + FAISS + scikit-learn | Predict output tokens, cost, risk from prompt |
| Scheduler | 8010 | FastAPI + httpx | QoS scheduling with dual-queue/budget |
| LLM Proxy | 3000 | Node.js + TypeScript | Multi-account load balancing with AIMD |
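To make the scheduler's dual-queue, dual-budget idea concrete, here is a minimal sketch of one scheduling round. It assumes that requests predicted as "short" go to the interactive queue and everything else to the heavy queue, and that each queue admits requests only while its in-flight predicted tokens stay within its budget (3000 and 7000, the defaults listed in this README). The names `Request`, `DualQueueScheduler`, `tick`, and `complete` are hypothetical; the real engine in scheduler_mvp/app/scheduler/ may differ.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    predicted_tokens: int  # e.g. the predictor's p90 output-token estimate
    size_bucket: str       # "short" is routed to the interactive queue (assumed rule)


def lane_for(req: Request) -> str:
    return "interactive" if req.size_bucket == "short" else "heavy"


@dataclass
class DualQueueScheduler:
    # Budgets mirror the README defaults: 3000 interactive, 7000 heavy.
    budgets: dict = field(default_factory=lambda: {"interactive": 3000, "heavy": 7000})
    inflight: dict = field(default_factory=lambda: {"interactive": 0, "heavy": 0})
    queues: dict = field(default_factory=lambda: {"interactive": deque(), "heavy": deque()})

    def submit(self, req: Request) -> str:
        lane = lane_for(req)
        self.queues[lane].append(req)
        return lane

    def tick(self) -> list:
        """One scheduling round: serve the interactive queue first, then heavy."""
        dispatched = []
        for lane in ("interactive", "heavy"):
            queue = self.queues[lane]
            # Dispatch while the head request fits in this lane's remaining budget.
            while queue and self.inflight[lane] + queue[0].predicted_tokens <= self.budgets[lane]:
                req = queue.popleft()
                self.inflight[lane] += req.predicted_tokens
                dispatched.append(req.request_id)
        return dispatched

    def complete(self, req: Request) -> None:
        """Release the request's tokens back to its lane's budget."""
        self.inflight[lane_for(req)] -= req.predicted_tokens


# Usage: the short request is dispatched first even though it arrived last.
sched = DualQueueScheduler()
sched.submit(Request("long-1", 6000, "long"))
sched.submit(Request("long-2", 5000, "long"))
sched.submit(Request("short-1", 200, "short"))
print(sched.tick())  # prints: ['short-1', 'long-1']
```

Because the two budgets are separate, heavy traffic can exhaust its own 7000-token allowance without consuming any interactive capacity. One limitation of this sketch: a request whose prediction exceeds its lane's whole budget would block that lane, so a real engine needs a cap or an escape hatch.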
git clone https://github.com/Sakura66/sagesched.git
cd sagesched
# Configure your LLM API keys
cp .env.example .env
# Edit .env with your API keys
# For the proxy, create config from example
mkdir -p config
cp predictor/proxy/llm-proxy.config.example.yaml config/llm-proxy.config.yaml
# Edit config/llm-proxy.config.yaml with your deployments
# Start all services
docker compose up --build -d
# Verify
curl http://localhost:8000/health # Predictor
curl http://localhost:8010/health # Scheduler
curl http://localhost:3000/health # Proxy
# Terminal 1: Predictor
cd predictor
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --port 8000 --reload
# Terminal 2: Scheduler
cd scheduler_mvp
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --port 8010 --reload
# Terminal 3: LLM Proxy (optional)
cd predictor/proxy
npm install && npx tsc
node dist/index.js
./scripts/start.sh all # Start everything
./scripts/start.sh predictor # Start predictor only
./scripts/start.sh docker # Docker mode
./scripts/start.sh stop # Stop Docker services
curl -X POST http://localhost:8010/submit \
-H "Content-Type: application/json" \
-d '{
"prompt": "Explain quantum computing in simple terms",
"max_tokens": 512,
"model": "gpt-5.4-mini",
"task_type": "general"
}'
# → {"request_id": "abc123"}
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{
"prompt": "Explain quantum computing in simple terms",
"max_tokens": 512,
"model": "gpt-5.4-mini",
"task_type": "general"
}'
# → {
# "size_bucket": "short",
# "expected_output_tokens_p50": 128,
# "expected_output_tokens_p90": 256,
# "predicted_cost_p50": 1024.0,
# "risk_score": 0.23,
# "confidence_score": 0.85,
# "fallback_level": "tenant_neighbors"
# }
curl -X POST http://localhost:8000/ingest \
-H "Content-Type: application/json" \
-d '{
"prompt": "Explain quantum computing",
"output_tokens": 142,
"prompt_tokens": 12,
"max_tokens": 512,
"model": "gpt-5.4-mini",
"task_type": "general",
"success": true
}'
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:3000/v1",
api_key="anything", # proxy handles real keys
)
response = client.chat.completions.create(
model="gpt-5.4-mini", # matches config model_name
messages=[{"role": "user", "content": "Hello!"}],
)
curl http://localhost:8010/metrics
# → queue lengths, budget usage, adaptive mode, dispatch stats
| Method | Endpoint | Description |
|---|---|---|
| POST | /predict | Predict workload for a prompt |
| POST | /ingest | Record completed request for future predictions |
| GET | /models | List known model profiles |
| GET | /health | Health check |
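For callers that prefer Python over curl, the following sketch wraps the predictor's POST /predict endpoint using only the standard library. The request fields and the sample response are copied from the curl example in this README; the helper names (`build_predict_payload`, `predict`, `choose_lane`) and the `risk_cutoff` value are assumptions made for illustration, and the routing rule in `choose_lane` is hypothetical rather than the scheduler's actual logic.

```python
import json
import urllib.request

PREDICTOR_URL = "http://127.0.0.1:8000"  # the README's default PREDICTOR_BASE_URL


def build_predict_payload(prompt, max_tokens=512, model="gpt-5.4-mini", task_type="general"):
    """Build the JSON body shown in the README's /predict example."""
    return {"prompt": prompt, "max_tokens": max_tokens, "model": model, "task_type": task_type}


def predict(prompt, base_url=PREDICTOR_URL, timeout=5.0, **kwargs):
    """POST the prompt to /predict and return the decoded JSON response."""
    body = json.dumps(build_predict_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def choose_lane(prediction, risk_cutoff=0.5):
    """Hypothetical rule: only short, low-risk predictions count as interactive."""
    is_short = prediction.get("size_bucket") == "short"
    # A missing risk score is treated as high risk (conservative default).
    low_risk = prediction.get("risk_score", 1.0) < risk_cutoff
    return "interactive" if is_short and low_risk else "heavy"


# Sample response copied from the README; no network call is made here.
sample_prediction = {
    "size_bucket": "short",
    "expected_output_tokens_p50": 128,
    "expected_output_tokens_p90": 256,
    "predicted_cost_p50": 1024.0,
    "risk_score": 0.23,
    "confidence_score": 0.85,
    "fallback_level": "tenant_neighbors",
}
print(choose_lane(sample_prediction))  # prints: interactive
```

With the predictor running, `predict("Explain quantum computing in simple terms")` should return a dictionary shaped like `sample_prediction`.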
| Method | Endpoint | Description |
|---|---|---|
| POST | /submit | Submit request for scheduled execution |
| POST | /tick | Trigger one scheduling round |
| GET | /status/{id} | Check request status |
| GET | /metrics | Queue, budget, and adaptive metrics |
| GET | /events | Recent scheduling events |
| POST | /reset | Reset all state |
| POST | /admin/config | Hot-reload scheduling config |
| Method | Endpoint | Description |
|---|---|---|
| * | /v1/* | OpenAI-compatible proxy endpoint |
| GET | /health | Health check |
| GET | /_stats | Load balancer statistics (token-protected) |
| Variable | Default | Description |
|---|---|---|
| EMBEDDING_MODEL_NAME | paraphrase-multilingual-MiniLM-L12-v2 | Sentence transformer model |
| TOP_K_NEIGHBORS | 5 | Nearest neighbors for retrieval |
| SIMILARITY_THRESHOLD | 0.8 | Minimum cosine similarity |
| HISTORY_WINDOW_SIZE | 10000 | FIFO history window |
| PORT | 8000 | Service port |
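The interaction between TOP_K_NEIGHBORS and SIMILARITY_THRESHOLD can be sketched as follows. This is a plain-Python stand-in: the real service uses FAISS and MiniLM embeddings, whereas this example uses short hand-written vectors and a brute-force cosine search. Taking the median of neighbor output tokens and falling back to a baseline when no neighbor passes the threshold are assumptions for illustration, not the predictor's documented behavior.

```python
import math
from statistics import median

TOP_K_NEIGHBORS = 5         # README default
SIMILARITY_THRESHOLD = 0.8  # README default


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_neighbors(query_vec, history, k=TOP_K_NEIGHBORS, threshold=SIMILARITY_THRESHOLD):
    """history is a list of (embedding, output_tokens) pairs from past requests."""
    scored = [(cosine(query_vec, emb), tokens) for emb, tokens in history]
    # Drop weak matches first, then keep the k most similar of what remains.
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


def estimate_tokens(query_vec, history, fallback):
    """Median output tokens of the retained neighbors, or the fallback estimate."""
    neighbors = retrieve_neighbors(query_vec, history)
    if not neighbors:
        return fallback  # e.g. the ML baseline when retrieval finds nothing similar
    return median(tokens for _, tokens in neighbors)


# Usage: two close neighbors are kept; the orthogonal entry is filtered out.
history = [([1.0, 0.0], 100), ([0.9, 0.1], 140), ([0.0, 1.0], 900)]
print(estimate_tokens([1.0, 0.0], history, fallback=256))  # prints: 120.0
```

Raising the threshold trades coverage for precision: fewer prompts get a retrieval-based estimate, but the ones that do are matched against closer history.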
| Variable | Default | Description |
|---|---|---|
| PREDICTOR_BASE_URL | http://127.0.0.1:8000 | Predictor service URL |
| INTERACTIVE_INFLIGHT_TOKENS | 3000 | Interactive queue budget |
| HEAVY_INFLIGHT_TOKENS | 7000 | Heavy queue budget |
| PORT | 8010 | Service port |
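As an illustration of how these scheduler variables and defaults fit together, the sketch below loads them into a typed settings object. The variable names and default values come from the table above; the `SchedulerSettings` class and `load_settings` function are hypothetical and are not taken from the scheduler's source.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulerSettings:
    predictor_base_url: str
    interactive_inflight_tokens: int
    heavy_inflight_tokens: int
    port: int


def load_settings(env=None):
    """Read scheduler settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return SchedulerSettings(
        # Strip a trailing slash so f"{url}/predict" never yields a double slash.
        predictor_base_url=env.get("PREDICTOR_BASE_URL", "http://127.0.0.1:8000").rstrip("/"),
        interactive_inflight_tokens=int(env.get("INTERACTIVE_INFLIGHT_TOKENS", "3000")),
        heavy_inflight_tokens=int(env.get("HEAVY_INFLIGHT_TOKENS", "7000")),
        port=int(env.get("PORT", "8010")),
    )


# Usage: override only the heavy budget; everything else keeps its default.
settings = load_settings({"HEAVY_INFLIGHT_TOKENS": "9000"})
print(settings.heavy_inflight_tokens, settings.interactive_inflight_tokens)  # prints: 9000 3000
```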
See llm-proxy.config.example.yaml for full configuration with multi-account setup.
# Predictor: strict replay evaluation (no data leakage)
cd predictor
python scripts/run_evaluation.py --sqlite data/history.db --strict-replay
# Predictor: scheduling strategy simulation
python scripts/run_simulation.py --db data/history.db --concurrency 4
# Scheduler: load test (requires predictor + scheduler running)
cd ../scheduler_mvp
python scripts/load_test.py --mode scheduler --tick-ms 10 --out result.json
# Full pipeline benchmark on real ShareGPT trace (no API key needed)
python scripts/run_sharegpt_benchmark.py --total-requests 200 --seed-count 300
End-to-end QoS comparison on a real ShareGPT trace (200 requests, congested
provider simulated by MockProvider with congestion_alpha enabled). The full
report and methodology are in docs/benchmarks/sharegpt_v1.json.
Workload distribution (real, from ShareGPT): short=10.1%, medium=47.8%, long=42.1%
| Metric | Naive | SageSched | Δ (relative) |
|---|---|---|---|
| Short avg latency | 4 891 ms | 2 275 ms | −53.5% |
| Short P95 latency ★ | 11 403 ms | 8 684 ms | −23.8% |
| Short P50 latency | 2 697 ms | 743 ms | −72.5% |
| Deadline satisfaction rate | 50.5% | 64.0% | +26.7% |
| Completion rate | 100.0% | 99.5% | −0.5% |
| Makespan | 50 635 ms | 54 651 ms | +7.9% |
| Heavy P95 latency | 25 875 ms | 39 058 ms | +51.0% |
| Global P95 latency | 24 238 ms | 37 218 ms | +53.6% |
★ Short P95 is the primary QoS metric: it measures whether interactive requests stay responsive when the upstream provider is saturated by heavy workloads.
- Short-request protection works. Under provider congestion, SageSched cuts short-request P95 by 23.8% and P50 by 72.5% vs. naive direct dispatch — exactly the goal of the dual-queue, dual-budget design.
- Deadline satisfaction rises from 50.5% to 64.0% (+13.5 percentage points, a 26.7% relative gain), so more requests finish inside their SLO window on identical upstream capacity.
- The trade-off is principled, not free. Heavy requests pay a slowdown (+51% P95, with global P95 up 53.6%) because the scheduler intentionally throttles them to protect interactive traffic. This is the expected behavior of LAS/Gittins-style policies, which favor requests likely to finish soon, and it suits deployments where interactive latency matters more than heavy-request latency.
- Makespan increases modestly (+7.9%) and the completion rate stays at 99.5%: throughput does not collapse, and the cost is mostly reordering.
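The Δ values in the table are relative changes against the naive baseline, which is easy to confuse with percentage-point differences for the two rate metrics. The short script below recomputes them from the rounded table values; the helper name `relative_change` is ours. Recomputing from rounded figures reproduces the Δ column to within 0.1 point (heavy P95 comes out near +50.9% rather than +51.0%, presumably because the report used unrounded measurements).

```python
def relative_change(baseline, value):
    """Percent change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


# (naive, sagesched) pairs copied from the benchmark table.
rows = {
    "short_avg_ms": (4891, 2275),
    "short_p95_ms": (11403, 8684),
    "short_p50_ms": (2697, 743),
    "deadline_satisfaction_pct": (50.5, 64.0),
    "completion_pct": (100.0, 99.5),
    "makespan_ms": (50635, 54651),
    "heavy_p95_ms": (25875, 39058),
    "global_p95_ms": (24238, 37218),
}

for name, (naive, sage) in rows.items():
    print(f"{name}: {relative_change(naive, sage):+.1f}%")

# For rate metrics, the percentage-point difference is a separate quantity.
naive_rate, sage_rate = rows["deadline_satisfaction_pct"]
print(f"deadline satisfaction: {sage_rate - naive_rate:+.1f} percentage points")
```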
# 1. Start services (no API keys required for benchmark)
cd predictor && uvicorn app.main:app --port 8000 &
cd scheduler_mvp && python scripts/run_local.py &
# 2. Run the benchmark (downloads ShareGPT ~200MB on first run, then cached)
cd scheduler_mvp
python scripts/run_sharegpt_benchmark.py \
--total-requests 200 \
--seed-count 300 \
--out results.json
The benchmark seeds the predictor with 300 historical entries, then submits
200 fresh requests through both naive and scheduler modes against the
same congestion-sensitive MockProvider, so the comparison is apples-to-apples.
sagesched/
├── predictor/ # Workload prediction service
│ ├── app/
│ │ ├── main.py # FastAPI entrypoint
│ │ ├── api/routes.py # /predict, /ingest, /models
│ │ ├── services/ # Core prediction logic
│ │ │ ├── predictor.py # Hybrid ML + retrieval predictor
│ │ │ ├── faiss_store.py # FAISS vector store
│ │ │ ├── embedding_service.py
│ │ │ ├── output_model.py # Quantile GBM
│ │ │ ├── cost_model.py
│ │ │ ├── confidence.py
│ │ │ └── retrieval_strategy.py # Tenant/global dual-pool
│ │ ├── eval/ # Strict replay evaluation
│ │ └── scheduler/ # Gittins Index strategy
│ ├── proxy/ # Node.js LLM load balancer
│ │ └── src/
│ │ ├── proxy.ts # AIMD multi-account routing
│ │ ├── config.ts # YAML config loader
│ │ └── congestion.ts # Per-account congestion control
│ ├── scripts/
│ │ ├── seed_mock_data.py # Generate demo data
│ │ ├── rebuild_index.py # Rebuild FAISS index
│ │ ├── run_evaluation.py # Offline evaluation
│ │ └── run_simulation.py # Strategy simulation
│ ├── Dockerfile
│ └── requirements.txt
├── scheduler_mvp/ # Request scheduling service
│ ├── app/
│ │ ├── main.py # FastAPI entrypoint
│ │ ├── scheduler/
│ │ │ ├── engine.py # Core scheduling engine
│ │ │ ├── budget.py # Dual token budget
│ │ │ ├── queues.py # Interactive + heavy queues
│ │ │ └── congestion_controller.py
│ │ ├── dispatcher/
│ │ │ ├── worker.py # Async dispatch worker
│ │ │ └── provider_client.py # Mock / real LLM provider
│ │ └── predictor_client/ # HTTP client → Predictor
│ ├── scripts/
│ │ ├── run_local.py # Start scheduler locally
│ │ └── load_test.py # Load testing tool
│ ├── Dockerfile
│ └── requirements.txt
├── docker-compose.yml # One-command deployment
├── scripts/start.sh # Local startup script
└── .env.example # Environment template
This project is the reference implementation for the following paper:
Layered Scheduling for Black-Box LLM APIs: Allocation, Ordering, and Overload Control
LLM inference is increasingly consumed through black-box APIs where the caller controls only when and whether to submit each request. Under high concurrency, provider congestion inflates short-request tail latency, causes expensive requests to be delayed or silently dropped, and degrades useful throughput. We observe that this scheduling problem decomposes into three separable concerns — allocation, ordering, and overload control — and instantiate this architecture with adaptive deficit round-robin, slowdown-aware feasible-set ordering, and conservative admission control.
- Short P95 latency within 5–8% of quota-based baseline
- Completion rate improved from 0.70–0.97 to 0.92–1.00
- Robust across 4 workload-congestion scenarios (5 seeds each)
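For readers unfamiliar with deficit round-robin, the allocation layer named in the abstract builds on the following idea. The sketch below is textbook DRR over token costs, not the paper's adaptive variant: each class earns a fixed quantum of credit per round and may dispatch requests while the head request's cost fits within its accumulated deficit. The function name, class names, quanta, and costs are all illustrative assumptions.

```python
from collections import deque


def deficit_round_robin(queues, quanta, rounds):
    """Return the dispatch order produced by plain deficit round-robin.

    queues: dict mapping class name -> deque of (request_id, token_cost)
    quanta: dict mapping class name -> credit added to that class each round
    """
    deficits = {name: 0 for name in queues}
    order = []
    for _ in range(rounds):
        for name, queue in queues.items():
            if not queue:
                deficits[name] = 0  # idle classes do not bank credit
                continue
            deficits[name] += quanta[name]
            # Serve head-of-line requests while the accumulated credit covers them.
            while queue and queue[0][1] <= deficits[name]:
                request_id, cost = queue.popleft()
                deficits[name] -= cost
                order.append(request_id)
            if not queue:
                deficits[name] = 0
    return order


# Usage: cheap interactive requests drain quickly; heavy ones wait for credit.
queues = {
    "interactive": deque([("s1", 100), ("s2", 100), ("s3", 100)]),
    "heavy": deque([("h1", 500), ("h2", 500)]),
}
order = deficit_round_robin(queues, {"interactive": 200, "heavy": 250}, rounds=4)
print(order)  # prints: ['s1', 's2', 's3', 'h1', 'h2']
```

An adaptive variant would presumably adjust the quanta at runtime, for example in response to observed congestion; how the paper does this is not specified in this README.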
| Technique | Module | Description |
|---|---|---|
| Gittins Index Scheduling | predictor/app/scheduler/gittins.py | Theoretically optimal policy for minimizing mean TTLT |
| Hybrid ML + Retrieval Prediction | predictor/app/services/predictor.py | Quantile GBM baseline + FAISS semantic retrieval correction |
| Dual-Pool Retrieval | predictor/app/services/retrieval_strategy.py | Tenant/global pool with waterfall fallback |
| Dual-Queue Dual-Budget | scheduler_mvp/app/scheduler/ | Interactive vs. heavy isolation with tier caps |
| Adaptive Congestion Control | scheduler_mvp/app/scheduler/congestion_controller.py | Three-mode (free/moderate/congested) automatic scaling |
| AIMD Load Balancing | predictor/proxy/src/congestion.ts | Latency-driven window sizing for multi-account routing |
| Strict Replay Evaluation | predictor/app/eval/evaluator.py | Leak-free offline evaluation using temporal ordering |
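To give a feel for the Gittins index, here is a small sketch that computes it from an empirical size distribution. For a request that has already received `attained` units of service, the index is the best achievable ratio of "probability of finishing within the next Δ units" to "expected service spent in those Δ units"; requests with a higher index are served first. Using a list of sampled output-token counts (for example, from similar past requests) as the size distribution is our assumption, and the function below is not the code in gittins.py.

```python
def gittins_index(size_samples, attained=0.0):
    """Gittins index of a job given samples of its total size and attained service.

    index = max over delta of
        P(S - a <= delta | S > a) / E[min(S - a, delta) | S > a]
    For an empirical distribution the maximum is reached at one of the
    remaining-size values, so only those candidates are checked.
    """
    remaining = [s - attained for s in size_samples if s > attained]
    if not remaining:
        return 0.0  # every sample says the job should already have finished
    n = len(remaining)
    best = 0.0
    for delta in set(remaining):
        p_finish = sum(1 for r in remaining if r <= delta) / n
        expected_work = sum(min(r, delta) for r in remaining) / n
        best = max(best, p_finish / expected_work)
    return best


# Usage: job A is usually short but occasionally long; job B is always medium.
job_a = [100, 100, 100, 1000]
job_b = [400, 400, 400, 400]
print(gittins_index(job_a) > gittins_index(job_b))                # prints: True
# After 100 tokens without finishing, A is known to be the long case.
print(gittins_index(job_a, attained=100) < gittins_index(job_b))  # prints: True
```

The example shows why the policy helps short requests without knowing sizes exactly: a job that is probably short gets priority, and it loses that priority once it has run long enough to rule out the short outcome.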
If you use SageSched in your research, please cite:
@article{sagesched2025,
title = {Layered Scheduling for Black-Box LLM APIs: Allocation, Ordering, and Overload Control},
author = {SageSched Contributors},
journal = {arXiv preprint arXiv:2603.07917},
year = {2025},
url = {https://arxiv.org/abs/2603.07917}
}
See CONTRIBUTING.md for guidelines.