SageSched

Intelligent LLM Request Scheduler with Workload Prediction

Layered Scheduling for Black-Box LLM APIs: Allocation, Ordering, and Overload Control


Overview

SageSched is a prediction-driven scheduling system for black-box LLM APIs. It predicts request workload (output tokens, cost, risk) before execution and uses these predictions to schedule requests with QoS guarantees — prioritizing short interactive requests while managing heavy workloads through dual-queue, dual-budget mechanisms.

Key Features

Workload Prediction — Hybrid ML + semantic retrieval predicts output tokens, cost, and risk before execution
QoS-Aware Scheduling — Short-first dual-queue with token budget isolation (interactive vs. heavy)
Adaptive Congestion Control — Three-mode (free/moderate/congested) automatic scaling
Multi-Account Load Balancing — AIMD congestion control across multiple LLM API accounts
Gittins Index Scheduling — Theoretically optimal strategy for minimizing average completion time (TTLT)
Zero-GPU Required — Runs on CPU with FAISS + MiniLM embeddings

Why SageSched?

When calling LLM APIs (OpenAI, Azure, Doubao, Gemini, etc.), you face:

Problem	SageSched Solution
Short requests blocked by long ones	Short-first dual-queue with budget isolation
Can't predict cost before execution	ML + semantic retrieval predicts output tokens
Rate limits across multiple accounts	AIMD load balancer with congestion detection
No visibility into request behavior	Real-time metrics, events, and risk scoring


Architecture

┌─────────────────────────────────────────────────────────┐
│                      Client App                         │
│              baseURL: localhost:3000/v1                 │
└──────────────────────┬──────────────────────────────────┘
                       │
          ┌────────────▼────────────┐
          │    LLM Load Balancer    │  ← Multi-account proxy
          │      (Node.js:3000)     │     AIMD congestion ctrl
          └────────────┬────────────┘
                       │
    ┌──────────────────▼──────────────────┐
    │       Scheduler (Python:8010)       │
    │ ┌─────────────┐ ┌─────────────────┐ │
    │ │ Dual        │ │ Dual Token      │ │
    │ │ Queues      │ │ Budget          │ │
    │ │ interactive │ │ interactive: 3K │ │
    │ │ heavy       │ │ heavy: 7K       │ │
    │ └─────┬───────┘ └─────────────────┘ │
    │       │  ┌──────────────────────┐   │
    │       │  │ Adaptive Congestion  │   │
    │       └─►│ free/moderate/congest│   │
    │          └──────────────────────┘   │
    └──────────────────┬──────────────────┘
                       │ POST /predict
          ┌────────────▼────────────┐
          │ Predictor (Python:8000) │
          │  ┌────────┐ ┌────────┐  │
          │  │ ML GBM │ │ FAISS  │  │
          │  │Quantile│ │Semantic│  │
          │  │ Model  │ │Retrieve│  │
          │  └────────┘ └────────┘  │
          │  ┌───────────────────┐  │
          │  │ SQLite History DB │  │
          │  └───────────────────┘  │
          └─────────────────────────┘
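The diagram names three congestion modes (free/moderate/congested) without saying how a mode is chosen. The following is a minimal sketch of one way such a controller could work, assuming the signal is the ratio of observed to baseline P95 latency. The thresholds, the scale percentages, and the function names are all hypothetical and are not taken from the scheduler's code.

```python
# Hypothetical thresholds and budget scales; the real values are defined by
# the scheduler's own configuration, not by this sketch.
MODERATE_RATIO = 1.5
CONGESTED_RATIO = 3.0
BUDGET_SCALE_PCT = {"free": 100, "moderate": 70, "congested": 40}


def pick_mode(observed_p95_ms: float, baseline_p95_ms: float) -> str:
    """Map a latency inflation ratio onto one of the three modes."""
    if baseline_p95_ms <= 0:
        raise ValueError("baseline_p95_ms must be positive")
    ratio = observed_p95_ms / baseline_p95_ms
    if ratio >= CONGESTED_RATIO:
        return "congested"
    if ratio >= MODERATE_RATIO:
        return "moderate"
    return "free"


def scaled_heavy_budget(base_tokens: int, mode: str) -> int:
    """Shrink the heavy in-flight token budget as congestion rises."""
    return base_tokens * BUDGET_SCALE_PCT[mode] // 100
```

With the default heavy budget of 7,000 tokens, this sketch would allow 4,900 in-flight heavy tokens in moderate mode and 2,800 in congested mode, leaving the interactive budget untouched.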

Components

Component	Port	Tech Stack	Role
Predictor	8000	FastAPI + FAISS + scikit-learn	Predict output tokens, cost, risk from prompt
Scheduler	8010	FastAPI + httpx	QoS scheduling with dual-queue/budget
LLM Proxy	3000	Node.js + TypeScript	Multi-account load balancing with AIMD

Quick Start

Option 1: Docker Compose (Recommended)

git clone https://github.com/Sakura66/sagesched.git
cd sagesched

# Configure your LLM API keys
cp .env.example .env
# Edit .env with your API keys

# For the proxy, create config from example
mkdir -p config
cp predictor/proxy/llm-proxy.config.example.yaml config/llm-proxy.config.yaml
# Edit config/llm-proxy.config.yaml with your deployments

# Start all services
docker compose up --build -d

# Verify
curl http://localhost:8000/health   # Predictor
curl http://localhost:8010/health   # Scheduler
curl http://localhost:3000/health   # Proxy

Option 2: Local Development

# Terminal 1: Predictor
cd predictor
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --port 8000 --reload

# Terminal 2: Scheduler
cd scheduler_mvp
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --port 8010 --reload

# Terminal 3: LLM Proxy (optional)
cd predictor/proxy
npm install && npx tsc
node dist/index.js

Option 3: One-Click Script

./scripts/start.sh all        # Start everything
./scripts/start.sh predictor  # Start predictor only
./scripts/start.sh docker     # Docker mode
./scripts/start.sh stop       # Stop Docker services

Usage

1. Submit a Request (via Scheduler)

curl -X POST http://localhost:8010/submit \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms",
    "max_tokens": 512,
    "model": "gpt-5.4-mini",
    "task_type": "general"
  }'
# → {"request_id": "abc123"}
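The same flow can be driven from Python. The sketch below submits a prompt and then polls `/status/{id}`, using only the standard library. It assumes the status payload carries a `state` field whose terminal values are `completed` and `failed`; that schema is a guess, so adjust it to the actual response. It also assumes the scheduler advances on its own; if it does not, a `POST /tick` may be needed between polls.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

SCHEDULER_URL = "http://localhost:8010"  # default scheduler port


def http_json(method: str, url: str, payload: Optional[dict] = None) -> dict:
    """Send a JSON request with the standard library and decode the reply."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def submit_and_wait(
    prompt: str,
    fetch: Callable[..., dict] = http_json,
    poll_s: float = 0.5,
    max_polls: int = 120,
) -> dict:
    """Submit a prompt, then poll its status until it reaches a terminal state."""
    body = {"prompt": prompt, "max_tokens": 512,
            "model": "gpt-5.4-mini", "task_type": "general"}
    request_id = fetch("POST", f"{SCHEDULER_URL}/submit", body)["request_id"]
    for _ in range(max_polls):
        status = fetch("GET", f"{SCHEDULER_URL}/status/{request_id}")
        # ASSUMPTION: the payload has a "state" field with these values.
        if status.get("state") in {"completed", "failed"}:
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"request {request_id} did not finish")
```

The `fetch` parameter exists so the polling logic can be exercised with a stub instead of a live scheduler.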

2. Get Prediction Only

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms",
    "max_tokens": 512,
    "model": "gpt-5.4-mini",
    "task_type": "general"
  }'
# → {
#     "size_bucket": "short",
#     "expected_output_tokens_p50": 128,
#     "expected_output_tokens_p90": 256,
#     "predicted_cost_p50": 1024.0,
#     "risk_score": 0.23,
#     "confidence_score": 0.85,
#     "fallback_level": "tenant_neighbors"
#   }

3. Ingest Historical Data

curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing",
    "output_tokens": 142,
    "prompt_tokens": 12,
    "max_tokens": 512,
    "model": "gpt-5.4-mini",
    "task_type": "general",
    "success": true
  }'

4. Use the Proxy as OpenAI Drop-in

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="anything",  # proxy handles real keys
)

response = client.chat.completions.create(
    model="gpt-5.4-mini",  # matches config model_name
    messages=[{"role": "user", "content": "Hello!"}],
)

5. Monitor Metrics

curl http://localhost:8010/metrics
# → queue lengths, budget usage, adaptive mode, dispatch stats

API Reference

Predictor (`:8000`)

Method	Endpoint	Description
`POST`	`/predict`	Predict workload for a prompt
`POST`	`/ingest`	Record completed request for future predictions
`GET`	`/models`	List known model profiles
`GET`	`/health`	Health check

Scheduler (`:8010`)

Method	Endpoint	Description
`POST`	`/submit`	Submit request for scheduled execution
`POST`	`/tick`	Trigger one scheduling round
`GET`	`/status/{id}`	Check request status
`GET`	`/metrics`	Queue, budget, and adaptive metrics
`GET`	`/events`	Recent scheduling events
`POST`	`/reset`	Reset all state
`POST`	`/admin/config`	Hot-reload scheduling config

Proxy (`:3000`)

Method	Endpoint	Description
`*`	`/v1/*`	OpenAI-compatible proxy endpoint
`GET`	`/health`	Health check
`GET`	`/_stats`	Load balancer statistics (token-protected)

Configuration

Predictor (`predictor/.env`)

Variable	Default	Description
`EMBEDDING_MODEL_NAME`	`paraphrase-multilingual-MiniLM-L12-v2`	Sentence transformer model
`TOP_K_NEIGHBORS`	`5`	Nearest neighbors for retrieval
`SIMILARITY_THRESHOLD`	`0.8`	Minimum cosine similarity
`HISTORY_WINDOW_SIZE`	`10000`	FIFO history window
`PORT`	`8000`	Service port
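To illustrate how `TOP_K_NEIGHBORS` and `SIMILARITY_THRESHOLD` interact, here is a small brute-force sketch of threshold-filtered nearest-neighbor retrieval. It is not the predictor's implementation (which uses FAISS and MiniLM embeddings); the function names, the use of the median, and the toy two-dimensional vectors are assumptions made for illustration.

```python
import math
from statistics import median


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_estimate(query, history, top_k=5, threshold=0.8):
    """Median output tokens of the top_k neighbors at or above the threshold.

    `history` is a list of (embedding, output_tokens) pairs. Returns None when
    no neighbor qualifies, so the caller can fall back to a model-only estimate.
    """
    scored = [(cosine(query, vec), tokens) for vec, tokens in history]
    kept = sorted((s for s in scored if s[0] >= threshold), reverse=True)[:top_k]
    return median(t for _, t in kept) if kept else None


history = [([1.0, 0.0], 100), ([0.9, 0.1], 140), ([0.0, 1.0], 900)]
print(retrieve_estimate([1.0, 0.0], history))  # two neighbors pass the threshold
```

Raising the threshold trades coverage for precision: fewer prompts find qualifying neighbors, but those that do are corrected by closer matches.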

Scheduler (`scheduler_mvp/.env`)

Variable	Default	Description
`PREDICTOR_BASE_URL`	`http://127.0.0.1:8000`	Predictor service URL
`INTERACTIVE_INFLIGHT_TOKENS`	`3000`	Interactive queue budget
`HEAVY_INFLIGHT_TOKENS`	`7000`	Heavy queue budget
`PORT`	`8010`	Service port
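The two in-flight token budgets can be pictured as independent admission gates, one per queue. The sketch below is a simplified model of that idea, assuming requests predicted as `short` go to the interactive queue and everything else goes to the heavy queue. The class and method names are hypothetical, and the real engine in `scheduler_mvp/app/scheduler/` may differ substantially.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    limit: int
    inflight: int = 0

    def try_reserve(self, tokens: int) -> bool:
        """Reserve tokens if the budget has room; report whether it did."""
        if self.inflight + tokens > self.limit:
            return False
        self.inflight += tokens
        return True

    def release(self, tokens: int) -> None:
        self.inflight = max(0, self.inflight - tokens)


@dataclass
class DualQueueScheduler:
    interactive: deque = field(default_factory=deque)
    heavy: deque = field(default_factory=deque)
    budgets: dict = field(default_factory=lambda: {
        "interactive": TokenBudget(3000),  # INTERACTIVE_INFLIGHT_TOKENS default
        "heavy": TokenBudget(7000),        # HEAVY_INFLIGHT_TOKENS default
    })

    def submit(self, request_id: str, predicted_tokens: int, bucket: str) -> None:
        # ASSUMPTION: only "short" predictions count as interactive.
        queue = self.interactive if bucket == "short" else self.heavy
        queue.append((request_id, predicted_tokens))

    def tick(self) -> list:
        """Dispatch queue heads while their budget has room, interactive first."""
        dispatched = []
        for name, queue in (("interactive", self.interactive), ("heavy", self.heavy)):
            while queue and self.budgets[name].try_reserve(queue[0][1]):
                dispatched.append(queue.popleft()[0])
        return dispatched
```

Because each queue draws on its own budget, a burst of heavy requests can exhaust the 7,000-token heavy budget without delaying interactive dispatch. Note that this sketch would stall on a single request larger than its budget; a real engine needs a rule for that case.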

Proxy (`predictor/proxy/llm-proxy.config.yaml`)

See llm-proxy.config.example.yaml for full configuration with multi-account setup.
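For readers unfamiliar with AIMD (additive increase, multiplicative decrease), the sketch below shows the core idea applied to a per-account concurrency window. It is written in Python for consistency with the other examples, whereas the proxy itself is TypeScript. The class name, default values, and the choice of congestion signal (for example an HTTP 429 or a latency spike) are assumptions, not the proxy's actual settings.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AimdWindow:
    """Concurrency window for one upstream account."""
    window: float = 4.0
    min_window: float = 1.0
    max_window: float = 64.0
    increase: float = 1.0   # additive step, spread over one window of successes
    decrease: float = 0.5   # multiplicative factor applied on congestion

    def on_success(self) -> None:
        self.window = min(self.max_window, self.window + self.increase / self.window)

    def on_congestion(self) -> None:
        self.window = max(self.min_window, self.window * self.decrease)


def pick_account(windows: Dict[str, AimdWindow], inflight: Dict[str, int]) -> Optional[str]:
    """Choose the account with the most spare window, or None if all are full."""
    spare = {name: w.window - inflight.get(name, 0) for name, w in windows.items()}
    if not spare:
        return None
    best = max(spare, key=spare.get)
    return best if spare[best] >= 1 else None
```

Slow growth and fast backoff let an account recover from rate limiting without oscillating, while routing by spare window shifts traffic toward accounts that are currently healthy.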

Evaluation & Testing

# Predictor: strict replay evaluation (no data leakage)
cd predictor
python scripts/run_evaluation.py --sqlite data/history.db --strict-replay

# Predictor: scheduling strategy simulation
python scripts/run_simulation.py --db data/history.db --concurrency 4

# Scheduler: load test (requires predictor + scheduler running)
cd ../scheduler_mvp
python scripts/load_test.py --mode scheduler --tick-ms 10 --out result.json

# Full pipeline benchmark on real ShareGPT trace (no API key needed)
python scripts/run_sharegpt_benchmark.py --total-requests 200 --seed-count 300
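"Strict replay" means each request is scored using only history that precedes it in time, so no future outcome leaks into a prediction. A minimal sketch of that loop follows, with a running median standing in for the real predictor. The record fields `ts` and `output_tokens` and the `warmup` parameter are illustrative assumptions, not the evaluator's actual schema.

```python
from statistics import median


def strict_replay(records, warmup: int = 1) -> float:
    """Mean absolute error when each request sees only earlier history."""
    ordered = sorted(records, key=lambda r: r["ts"])
    history, errors = [], []
    for rec in ordered:
        if len(history) >= warmup:
            predicted = median(history)            # stand-in predictor
            errors.append(abs(predicted - rec["output_tokens"]))
        history.append(rec["output_tokens"])       # ingest only after scoring
    return sum(errors) / len(errors) if errors else float("nan")


records = [
    {"ts": 3, "output_tokens": 300},
    {"ts": 1, "output_tokens": 100},
    {"ts": 2, "output_tokens": 200},
]
print(strict_replay(records))
```

The ordering of the two steps inside the loop is the whole point: scoring before ingesting guarantees that a record never contributes to its own prediction.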

Benchmark Results

End-to-end QoS comparison on a real ShareGPT trace (200 requests, congested provider simulated by MockProvider with congestion_alpha enabled). The full report and methodology are in docs/benchmarks/sharegpt_v1.json.

Workload distribution (real, from ShareGPT): short=10.1%, medium=47.8%, long=42.1%

Headline numbers

Metric	Naive	SageSched	Δ (relative)
Short avg latency	4 891 ms	2 275 ms	−53.5%
Short P95 latency ★	11 403 ms	8 684 ms	−23.8%
Short P50 latency	2 697 ms	743 ms	−72.5%
Deadline satisfaction rate	50.5%	64.0%	+26.7% (+13.5 pp)
Completion rate	100.0%	99.5%	−0.5%
Makespan	50 635 ms	54 651 ms	+7.9%
Heavy P95 latency	25 875 ms	39 058 ms	+51.0%
Global P95 latency	24 238 ms	37 218 ms	+53.6%

★ Short P95 is the primary QoS metric: it measures whether interactive requests stay responsive when the upstream provider is saturated by heavy workloads.

Interpretation

Short-request protection works. Under provider congestion, SageSched cuts short-request P95 by 23.8% and P50 by 72.5% relative to naive direct dispatch, which is the goal of the dual-queue, dual-budget design.
Deadline satisfaction rises from 50.5% to 64.0% (+13.5 percentage points, +26.7% relative), so more requests finish inside their SLO window on identical upstream capacity.
The trade-off is deliberate, not free. Heavy requests slow down (+51% P95, with global P95 up 53.6%) because the scheduler throttles them to protect interactive traffic. This is the expected behavior of short-first policies in the LAS (least attained service) and Gittins family, and it suits deployments where interactive latency matters more than heavy-request latency.
Makespan grows modestly (+7.9%) and the completion rate stays at 99.5%, so the reordering does not collapse throughput.
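The Δ column reports change relative to the naive baseline, which differs from a percentage-point difference when the metric is itself a rate. The short sketch below recomputes a few rows from the table's published values; the helper name is ours.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Percent change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


rows = {
    "short_p95_ms": (11403, 8684),
    "short_p50_ms": (2697, 743),
    "deadline_rate_pct": (50.5, 64.0),
}
for name, (naive, sage) in rows.items():
    print(f"{name}: {relative_change(naive, sage):+.1f}%")

# For rates, the percentage-point gap is the plain difference.
print(f"deadline_rate_pct gap: {64.0 - 50.5:+.1f} pp")
```

For the deadline satisfaction rate this yields +26.7% relative and +13.5 percentage points, two views of the same improvement.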

Reproduce locally

# 1. Start services (no API keys required for benchmark)
# Subshells keep the working directory at the repository root
(cd predictor && uvicorn app.main:app --port 8000) &
(cd scheduler_mvp && python scripts/run_local.py) &

# 2. Run the benchmark (downloads ShareGPT ~200MB on first run, then cached)
cd scheduler_mvp
python scripts/run_sharegpt_benchmark.py \
    --total-requests 200 \
    --seed-count 300 \
    --out results.json

The benchmark seeds the predictor with 300 historical entries, then submits 200 fresh requests through both naive and scheduler modes against the same congestion-sensitive MockProvider, so the comparison is apples-to-apples.

Project Structure

sagesched/
├── predictor/                  # Workload prediction service
│   ├── app/
│   │   ├── main.py             # FastAPI entrypoint
│   │   ├── api/routes.py       # /predict, /ingest, /models
│   │   ├── services/           # Core prediction logic
│   │   │   ├── predictor.py    # Hybrid ML + retrieval predictor
│   │   │   ├── faiss_store.py  # FAISS vector store
│   │   │   ├── embedding_service.py
│   │   │   ├── output_model.py # Quantile GBM
│   │   │   ├── cost_model.py
│   │   │   ├── confidence.py
│   │   │   └── retrieval_strategy.py  # Tenant/global dual-pool
│   │   ├── eval/               # Strict replay evaluation
│   │   └── scheduler/          # Gittins Index strategy
│   ├── proxy/                  # Node.js LLM load balancer
│   │   └── src/
│   │       ├── proxy.ts        # AIMD multi-account routing
│   │       ├── config.ts       # YAML config loader
│   │       └── congestion.ts   # Per-account congestion control
│   ├── scripts/
│   │   ├── seed_mock_data.py   # Generate demo data
│   │   ├── rebuild_index.py    # Rebuild FAISS index
│   │   ├── run_evaluation.py   # Offline evaluation
│   │   └── run_simulation.py   # Strategy simulation
│   ├── Dockerfile
│   └── requirements.txt
├── scheduler_mvp/              # Request scheduling service
│   ├── app/
│   │   ├── main.py             # FastAPI entrypoint
│   │   ├── scheduler/
│   │   │   ├── engine.py       # Core scheduling engine
│   │   │   ├── budget.py       # Dual token budget
│   │   │   ├── queues.py       # Interactive + heavy queues
│   │   │   └── congestion_controller.py
│   │   ├── dispatcher/
│   │   │   ├── worker.py       # Async dispatch worker
│   │   │   └── provider_client.py  # Mock / real LLM provider
│   │   └── predictor_client/   # HTTP client → Predictor
│   ├── scripts/
│   │   ├── run_local.py        # Start scheduler locally
│   │   └── load_test.py        # Load testing tool
│   ├── Dockerfile
│   └── requirements.txt
├── docker-compose.yml          # One-command deployment
├── scripts/start.sh            # Local startup script
└── .env.example                # Environment template

Research

This project is the reference implementation for the following paper:

Layered Scheduling for Black-Box LLM APIs: Allocation, Ordering, and Overload Control

arXiv:2603.07917

LLM inference is increasingly consumed through black-box APIs where the caller controls only when and whether to submit each request. Under high concurrency, provider congestion inflates short-request tail latency, causes expensive requests to be delayed or silently dropped, and degrades useful throughput. We observe that this scheduling problem decomposes into three separable concerns — allocation, ordering, and overload control — and instantiate this architecture with adaptive deficit round-robin, slowdown-aware feasible-set ordering, and conservative admission control.
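As background for the allocation layer, the sketch below shows classic deficit round-robin (DRR) between request classes, with cost measured in predicted tokens. It is the textbook algorithm, not the adaptive variant described in the paper, and the function name, quanta, and request costs are illustrative assumptions.

```python
from collections import deque


def deficit_round_robin(queues, quanta, rounds):
    """Serve (request_id, cost) items; each class earns quanta[name] per round."""
    deficit = {name: 0 for name in queues}
    order = []
    for _ in range(rounds):
        for name, queue in queues.items():
            if not queue:
                deficit[name] = 0          # idle classes do not bank credit
                continue
            deficit[name] += quanta[name]
            while queue and queue[0][1] <= deficit[name]:
                request_id, cost = queue.popleft()
                deficit[name] -= cost
                order.append(request_id)
            if not queue:
                deficit[name] = 0
    return order


queues = {
    "interactive": deque([("s1", 100), ("s2", 100), ("s3", 100)]),
    "heavy": deque([("h1", 500), ("h2", 500)]),
}
order = deficit_round_robin(queues, {"interactive": 100, "heavy": 500}, rounds=3)
print(order)
```

Each class receives service in proportion to its quantum regardless of request size, which is what keeps one class from starving the other.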

Key Results

Short P95 latency within 5–8% of quota-based baseline
Completion rate improved from 0.70–0.97 to 0.92–1.00
Robust across 4 workload-congestion scenarios (5 seeds each)

Implemented Techniques

Technique	Module	Description
Gittins Index Scheduling	`predictor/app/scheduler/gittins.py`	Index policy that minimizes mean time to last token (TTLT) under standard M/G/1 queueing assumptions
Hybrid ML + Retrieval Prediction	`predictor/app/services/predictor.py`	Quantile GBM baseline + FAISS semantic retrieval correction
Dual-Pool Retrieval	`predictor/app/services/retrieval_strategy.py`	Tenant/global pool with waterfall fallback
Dual-Queue Dual-Budget	`scheduler_mvp/app/scheduler/`	Interactive vs. heavy isolation with tier caps
Adaptive Congestion Control	`scheduler_mvp/app/scheduler/congestion_controller.py`	Three-mode (free/moderate/congested) automatic scaling
AIMD Load Balancing	`predictor/proxy/src/congestion.ts`	Latency-driven window sizing for multi-account routing
Strict Replay Evaluation	`predictor/app/eval/evaluator.py`	Leak-free offline evaluation using temporal ordering
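To make the Gittins index concrete, the sketch below computes it for a job whose size follows a discrete distribution, using the standard definition: the best achievable ratio of completion probability to expected work over all service quanta. It is an independent illustration and not the code in `predictor/app/scheduler/gittins.py`; the function name and the example distribution are assumptions.

```python
def gittins_index(sizes, probs, attained: float = 0.0) -> float:
    """Gittins index of a job with a discrete size distribution.

    index(a) = max over d of  P(S <= a + d | S > a) / E[min(S - a, d) | S > a]
    A larger index means the job should be served sooner.
    """
    remaining = [(s - attained, p) for s, p in zip(sizes, probs) if s > attained]
    if not remaining:
        raise ValueError("job has already exceeded every possible size")
    total = sum(p for _, p in remaining)
    best = 0.0
    for d, _ in remaining:                  # candidate quanta are the support points
        done = sum(p for r, p in remaining if r <= d) / total
        work = sum(min(r, d) * p for r, p in remaining) / total
        best = max(best, done / work)
    return best


# A request that is 90% likely to need 100 tokens and 10% likely to need 1000.
fresh = gittins_index([100, 1000], [0.9, 0.1])
stale = gittins_index([100, 1000], [0.9, 0.1], attained=100)
print(fresh > stale)
```

The index falls once the job outlives its likely short size, which is why the policy favors requests predicted to be short. With a black-box API a dispatched request cannot be preempted, so in practice the index would be evaluated at `attained=0` from the predicted size distribution.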

Citation

If you use SageSched in your research, please cite:

@article{sagesched2025,
  title   = {Layered Scheduling for Black-Box LLM APIs: Allocation, Ordering, and Overload Control},
  author  = {SageSched Contributors},
  journal = {arXiv preprint arXiv:2603.07917},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.07917}
}

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache License 2.0
