SageSched

Intelligent LLM Request Scheduler with Workload Prediction

Layered Scheduling for Black-Box LLM APIs: Allocation, Ordering, and Overload Control

arXiv · License · Python 3.11+ · Node 18+ · FastAPI · Docker

Paper · English · 中文 · Quick Start · Architecture · API Reference


Overview

SageSched is a prediction-driven scheduling system for black-box LLM APIs. It predicts request workload (output tokens, cost, risk) before execution and uses these predictions to schedule requests with QoS guarantees — prioritizing short interactive requests while managing heavy workloads through dual-queue, dual-budget mechanisms.

Key Features

  • Workload Prediction — Hybrid ML + semantic retrieval predicts output tokens, cost, and risk before execution
  • QoS-Aware Scheduling — Short-first dual-queue with token budget isolation (interactive vs. heavy)
  • Adaptive Congestion Control — Three-mode (free/moderate/congested) automatic scaling
  • Multi-Account Load Balancing — AIMD congestion control across multiple LLM API accounts
  • Gittins Index Scheduling — Theoretically optimal strategy for minimizing average completion time (TTLT)
  • Zero-GPU Required — Runs on CPU with FAISS + MiniLM embeddings
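
The Gittins index ranks a job by the best achievable ratio of "probability of finishing within the next quantum of service" to "expected service spent in that quantum". The following is a minimal sketch for a discrete predicted-size distribution, assuming output-token counts as job sizes. The function name and input format are illustrative and are not taken from predictor/app/scheduler.

```python
from typing import Sequence, Tuple


def gittins_index(dist: Sequence[Tuple[float, float]], attained: float = 0.0) -> float:
    """Gittins index of a job whose total size follows `dist`.

    dist:     (size, probability) pairs; probabilities should sum to 1.
    attained: service the job has already received (e.g. tokens generated).
    A higher index means the job should be served earlier.
    """
    # Condition on the job not having finished yet: keep only sizes > attained.
    remaining = [(s - attained, p) for s, p in dist if s > attained and p > 0]
    total = sum(p for _, p in remaining)
    if total == 0:
        raise ValueError("job has already exceeded every size in the distribution")
    remaining = [(r, p / total) for r, p in remaining]

    best = 0.0
    # For a discrete distribution the supremum over service quanta is
    # attained at one of the support points, so only those are tried.
    for quantum, _ in remaining:
        p_done = sum(p for r, p in remaining if r <= quantum)
        expected_work = sum(p * min(r, quantum) for r, p in remaining)
        best = max(best, p_done / expected_work)
    return best


# Hypothetical predictions: one job is usually short but occasionally very long,
# the other is known to need 400 tokens.
bimodal = [(50, 0.8), (2000, 0.2)]
known = [(400, 1.0)]
print(gittins_index(bimodal), gittins_index(known))
print(gittins_index(bimodal, attained=50))
```

With these made-up numbers the bimodal job starts with the higher index (0.016 against 0.0025) and is served first. If it has not finished after 50 tokens, its index drops to roughly 0.0005 and the 400-token job overtakes it.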

Why SageSched?

When calling LLM APIs (OpenAI, Azure, Doubao, Gemini, etc.), you face:

| Problem | SageSched Solution |
| --- | --- |
| Short requests blocked by long ones | Short-first dual-queue with budget isolation |
| Can't predict cost before execution | ML + semantic retrieval predicts output tokens |
| Rate limits across multiple accounts | AIMD load balancer with congestion detection |
| No visibility into request behavior | Real-time metrics, events, and risk scoring |


Architecture

┌─────────────────────────────────────────────────────────┐
│                      Client App                         │
│              baseURL: localhost:3000/v1                 │
└──────────────────────┬──────────────────────────────────┘
                       │
          ┌────────────▼────────────┐
          │    LLM Load Balancer    │  ← Multi-account proxy
          │      (Node.js:3000)     │     AIMD congestion ctrl
          └────────────┬────────────┘
                       │
  ┌────────────────────▼────────────────────┐
  │         Scheduler (Python:8010)         │
  │  ┌─────────────┐  ┌──────────────────┐  │
  │  │ Dual        │  │ Dual Token       │  │
  │  │ Queues      │  │ Budget           │  │
  │  │ interactive │  │ interactive: 3K  │  │
  │  │ heavy       │  │ heavy: 7K        │  │
  │  └────┬────────┘  └──────────────────┘  │
  │       │  ┌─────────────────────────┐    │
  │       │  │ Adaptive Congestion     │    │
  │       └─►│ free/moderate/congested │    │
  │          └─────────────────────────┘    │
  └────────────────────┬────────────────────┘
                       │ POST /predict
          ┌────────────▼────────────┐
          │ Predictor (Python:8000) │
          │  ┌────────┐ ┌────────┐  │
          │  │ ML GBM │ │ FAISS  │  │
          │  │Quantile│ │Semantic│  │
          │  │ Model  │ │Retrieve│  │
          │  └────────┘ └────────┘  │
          │  ┌───────────────────┐  │
          │  │ SQLite History DB │  │
          │  └───────────────────┘  │
          └─────────────────────────┘

Components

| Component | Port | Tech Stack | Role |
| --- | --- | --- | --- |
| Predictor | 8000 | FastAPI + FAISS + scikit-learn | Predict output tokens, cost, risk from prompt |
| Scheduler | 8010 | FastAPI + httpx | QoS scheduling with dual-queue/budget |
| LLM Proxy | 3000 | Node.js + TypeScript | Multi-account load balancing with AIMD |

Quick Start

Option 1: Docker Compose (Recommended)

git clone https://github.com/Sakura66/sagesched.git
cd sagesched

# Configure your LLM API keys
cp .env.example .env
# Edit .env with your API keys

# For the proxy, create config from example
mkdir -p config
cp predictor/proxy/llm-proxy.config.example.yaml config/llm-proxy.config.yaml
# Edit config/llm-proxy.config.yaml with your deployments

# Start all services
docker compose up --build -d

# Verify
curl http://localhost:8000/health   # Predictor
curl http://localhost:8010/health   # Scheduler
curl http://localhost:3000/health   # Proxy

Option 2: Local Development

# Terminal 1: Predictor
cd predictor
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --port 8000 --reload

# Terminal 2: Scheduler
cd scheduler_mvp
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --port 8010 --reload

# Terminal 3: LLM Proxy (optional)
cd predictor/proxy
npm install && npx tsc
node dist/index.js

Option 3: One-Click Script

./scripts/start.sh all        # Start everything
./scripts/start.sh predictor  # Start predictor only
./scripts/start.sh docker     # Docker mode
./scripts/start.sh stop       # Stop Docker services

Usage

1. Submit a Request (via Scheduler)

curl -X POST http://localhost:8010/submit \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms",
    "max_tokens": 512,
    "model": "gpt-5.4-mini",
    "task_type": "general"
  }'
# → {"request_id": "abc123"}

2. Get Prediction Only

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing in simple terms",
    "max_tokens": 512,
    "model": "gpt-5.4-mini",
    "task_type": "general"
  }'
# → {
#     "size_bucket": "short",
#     "expected_output_tokens_p50": 128,
#     "expected_output_tokens_p90": 256,
#     "predicted_cost_p50": 1024.0,
#     "risk_score": 0.23,
#     "confidence_score": 0.85,
#     "fallback_level": "tenant_neighbors"
#   }
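
The same call can be made from Python with only the standard library. The endpoint, port, and field names below come from the curl example; the function names are hypothetical, and the response is assumed to be a JSON object shaped like the sample output.

```python
import json
import urllib.request

PREDICTOR_URL = "http://localhost:8000"  # default Predictor port from this README


def build_predict_request(prompt: str, model: str, max_tokens: int = 512,
                          task_type: str = "general",
                          base_url: str = PREDICTOR_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /predict request."""
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "model": model,
        "task_type": task_type,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(prompt: str, model: str, **kwargs) -> dict:
    """Send the request and decode the JSON prediction (needs a running Predictor)."""
    req = build_predict_request(prompt, model, **kwargs)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

With the Predictor running, `predict("Explain quantum computing in simple terms", "gpt-5.4-mini")` should return a dictionary containing fields such as `expected_output_tokens_p90` and `risk_score`.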

3. Ingest Historical Data

curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing",
    "output_tokens": 142,
    "prompt_tokens": 12,
    "max_tokens": 512,
    "model": "gpt-5.4-mini",
    "task_type": "general",
    "success": true
  }'

4. Use the Proxy as OpenAI Drop-in

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="anything",  # proxy handles real keys
)

response = client.chat.completions.create(
    model="gpt-5.4-mini",  # matches config model_name
    messages=[{"role": "user", "content": "Hello!"}],
)

5. Monitor Metrics

curl http://localhost:8010/metrics
# → queue lengths, budget usage, adaptive mode, dispatch stats

API Reference

Predictor (:8000)

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /predict | Predict workload for a prompt |
| POST | /ingest | Record completed request for future predictions |
| GET | /models | List known model profiles |
| GET | /health | Health check |

Scheduler (:8010)

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /submit | Submit request for scheduled execution |
| POST | /tick | Trigger one scheduling round |
| GET | /status/{id} | Check request status |
| GET | /metrics | Queue, budget, and adaptive metrics |
| GET | /events | Recent scheduling events |
| POST | /reset | Reset all state |
| POST | /admin/config | Hot-reload scheduling config |

Proxy (:3000)

| Method | Endpoint | Description |
| --- | --- | --- |
| * | /v1/* | OpenAI-compatible proxy endpoint |
| GET | /health | Health check |
| GET | /_stats | Load balancer statistics (token-protected) |

Configuration

Predictor (predictor/.env)

| Variable | Default | Description |
| --- | --- | --- |
| EMBEDDING_MODEL_NAME | paraphrase-multilingual-MiniLM-L12-v2 | Sentence transformer model |
| TOP_K_NEIGHBORS | 5 | Nearest neighbors for retrieval |
| SIMILARITY_THRESHOLD | 0.8 | Minimum cosine similarity |
| HISTORY_WINDOW_SIZE | 10000 | FIFO history window |
| PORT | 8000 | Service port |
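
To show how the two retrieval settings could interact, here is a small sketch that keeps at most TOP_K_NEIGHBORS historical requests whose cosine similarity to the query reaches SIMILARITY_THRESHOLD. It uses a brute-force search in plain Python in place of FAISS. The median aggregation and the `None` result (meaning "fall back to another estimator") are assumptions for illustration, not the predictor's actual logic.

```python
import math
from statistics import median
from typing import Optional, Sequence

TOP_K_NEIGHBORS = 5          # defaults from the table above
SIMILARITY_THRESHOLD = 0.8


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_estimate(query, history, k=TOP_K_NEIGHBORS,
                      threshold=SIMILARITY_THRESHOLD) -> Optional[float]:
    """Estimate output tokens from similar past requests.

    history: list of (embedding, output_tokens) pairs.
    Returns None when no neighbor is similar enough.
    """
    scored = [(cosine(query, emb), tokens) for emb, tokens in history]
    # Keep only sufficiently similar entries, most similar first, at most k.
    neighbors = sorted((s for s in scored if s[0] >= threshold), reverse=True)[:k]
    if not neighbors:
        return None
    return median(tokens for _, tokens in neighbors)


# Toy 2-D embeddings; real ones would come from the sentence-transformer model.
history = [([1.0, 0.0], 100), ([0.9, 0.1], 120), ([0.0, 1.0], 900)]
print(retrieve_estimate([1.0, 0.0], history))
```

A higher threshold trades coverage for precision: fewer queries find neighbors, but the ones that do are matched against closer prompts.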

Scheduler (scheduler_mvp/.env)

| Variable | Default | Description |
| --- | --- | --- |
| PREDICTOR_BASE_URL | http://127.0.0.1:8000 | Predictor service URL |
| INTERACTIVE_INFLIGHT_TOKENS | 3000 | Interactive queue budget |
| HEAVY_INFLIGHT_TOKENS | 7000 | Heavy queue budget |
| PORT | 8010 | Service port |
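
The two budgets cap how many predicted tokens each traffic class may have in flight at once. A minimal sketch of that admission rule follows, assuming each request reserves its predicted output tokens on dispatch and releases them on completion, and that the `short` size bucket maps to the interactive queue. The class names and these mapping rules are hypothetical and are not the code in scheduler_mvp/app/scheduler/.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    """Cap on predicted tokens in flight for one traffic class."""
    capacity: int
    in_flight: int = 0

    def try_reserve(self, tokens: int) -> bool:
        # An idle budget always admits one request, so a single request larger
        # than the whole budget cannot block its queue forever.
        if self.in_flight > 0 and self.in_flight + tokens > self.capacity:
            return False
        self.in_flight += tokens
        return True

    def release(self, tokens: int) -> None:
        self.in_flight = max(0, self.in_flight - tokens)


@dataclass
class DualQueueScheduler:
    interactive: TokenBudget = field(default_factory=lambda: TokenBudget(3000))
    heavy: TokenBudget = field(default_factory=lambda: TokenBudget(7000))
    queues: dict = field(
        default_factory=lambda: {"interactive": deque(), "heavy": deque()})

    def submit(self, request_id: str, predicted_tokens: int, size_bucket: str) -> None:
        lane = "interactive" if size_bucket == "short" else "heavy"
        self.queues[lane].append((request_id, predicted_tokens))

    def tick(self) -> list[str]:
        """Dispatch from the interactive queue first, then the heavy queue."""
        dispatched = []
        for lane, budget in (("interactive", self.interactive), ("heavy", self.heavy)):
            queue = self.queues[lane]
            while queue and budget.try_reserve(queue[0][1]):
                dispatched.append(queue.popleft()[0])
        return dispatched
```

Because each class draws on its own budget, a burst of heavy requests can fill the 7000-token heavy budget without consuming any of the 3000 tokens reserved for interactive traffic.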

Proxy (predictor/proxy/llm-proxy.config.yaml)

See predictor/proxy/llm-proxy.config.example.yaml for the full configuration, including multi-account setup.


Evaluation & Testing

# Predictor: strict replay evaluation (no data leakage)
cd predictor
python scripts/run_evaluation.py --sqlite data/history.db --strict-replay

# Predictor: scheduling strategy simulation
python scripts/run_simulation.py --db data/history.db --concurrency 4

# Scheduler: load test (requires predictor + scheduler running)
cd ../scheduler_mvp
python scripts/load_test.py --mode scheduler --tick-ms 10 --out result.json

# Full pipeline benchmark on real ShareGPT trace (no API key needed)
python scripts/run_sharegpt_benchmark.py --total-requests 200 --seed-count 300

Benchmark Results

End-to-end QoS comparison on a real ShareGPT trace (200 requests, congested provider simulated by MockProvider with congestion_alpha enabled). The full report and methodology are in docs/benchmarks/sharegpt_v1.json.

Workload distribution (real, from ShareGPT): short=10.1%, medium=47.8%, long=42.1%

Headline numbers

| Metric | Naive | SageSched | Δ (relative) |
| --- | --- | --- | --- |
| Short avg latency | 4 891 ms | 2 275 ms | −53.5% |
| ★ Short P95 latency | 11 403 ms | 8 684 ms | −23.8% |
| Short P50 latency | 2 697 ms | 743 ms | −72.5% |
| Deadline satisfaction rate | 50.5% | 64.0% | +26.7% |
| Completion rate | 100.0% | 99.5% | −0.5% |
| Makespan | 50 635 ms | 54 651 ms | +7.9% |
| Heavy P95 latency | 25 875 ms | 39 058 ms | +51.0% |
| Global P95 latency | 24 238 ms | 37 218 ms | +53.6% |

★ Short P95 is the primary QoS metric: it measures whether interactive requests stay responsive when the upstream provider is saturated by heavy workloads.

Interpretation

  • Short-request protection works. Under provider congestion, SageSched cuts short-request P95 by 23.8% and P50 by 72.5% vs. naive direct dispatch, which is the goal of the dual-queue, dual-budget design.
  • Deadline satisfaction rises from 50.5% to 64.0% (+13.5 percentage points, +26.7% relative), so more requests finish inside their SLO window on identical upstream capacity.
  • The trade-off is deliberate, not free. Heavy requests slow down (+51% P95) because the scheduler throttles them to protect interactive traffic, and global P95 rises with them (+53.6%). This is the usual cost of LAS/Gittins-style policies that favor short jobs, and it suits deployments where interactive latency matters most.
  • Makespan grows modestly (+7.9%) and the completion rate dips from 100.0% to 99.5%: throughput does not collapse, and most of the change is reordering.
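
The Δ values are relative changes against the naive baseline, which matters most for the rate metrics: a 13.5-point gain in deadline satisfaction appears as +26.7%. The short script below recomputes several rows from the reported numbers; the helper name is illustrative.

```python
def relative_change_pct(naive: float, sagesched: float) -> float:
    """Relative change of SageSched versus the naive baseline, in percent."""
    return (sagesched - naive) / naive * 100.0


# (naive, SageSched) pairs copied from the headline table.
rows = {
    "Short avg latency (ms)": (4891, 2275),
    "Short P95 latency (ms)": (11403, 8684),
    "Short P50 latency (ms)": (2697, 743),
    "Deadline satisfaction (%)": (50.5, 64.0),
    "Makespan (ms)": (50635, 54651),
    "Heavy P95 latency (ms)": (25875, 39058),
    "Global P95 latency (ms)": (24238, 37218),
}

for name, (naive, sage) in rows.items():
    print(f"{name:28s} {relative_change_pct(naive, sage):+.1f}%")
```

All rows reproduce the table except Heavy P95, which recomputes to +50.9% from the rounded millisecond values against the reported +51.0%. The gap is presumably a rounding difference in the underlying measurements.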

Reproduce locally

# 1. Start services (no API keys required for benchmark)
# Each service starts in its own subshell, so the working directory stays at the repo root
(cd predictor && uvicorn app.main:app --port 8000) &
(cd scheduler_mvp && python scripts/run_local.py) &

# 2. Run the benchmark (downloads ShareGPT ~200MB on first run, then cached)
cd scheduler_mvp
python scripts/run_sharegpt_benchmark.py \
    --total-requests 200 \
    --seed-count 300 \
    --out results.json

The benchmark seeds the predictor with 300 historical entries, then submits 200 fresh requests through both naive and scheduler modes against the same congestion-sensitive MockProvider, so the comparison is apples-to-apples.


Project Structure

sagesched/
├── predictor/                  # Workload prediction service
│   ├── app/
│   │   ├── main.py             # FastAPI entrypoint
│   │   ├── api/routes.py       # /predict, /ingest, /models
│   │   ├── services/           # Core prediction logic
│   │   │   ├── predictor.py    # Hybrid ML + retrieval predictor
│   │   │   ├── faiss_store.py  # FAISS vector store
│   │   │   ├── embedding_service.py
│   │   │   ├── output_model.py # Quantile GBM
│   │   │   ├── cost_model.py
│   │   │   ├── confidence.py
│   │   │   └── retrieval_strategy.py  # Tenant/global dual-pool
│   │   ├── eval/               # Strict replay evaluation
│   │   └── scheduler/          # Gittins Index strategy
│   ├── proxy/                  # Node.js LLM load balancer
│   │   └── src/
│   │       ├── proxy.ts        # AIMD multi-account routing
│   │       ├── config.ts       # YAML config loader
│   │       └── congestion.ts   # Per-account congestion control
│   ├── scripts/
│   │   ├── seed_mock_data.py   # Generate demo data
│   │   ├── rebuild_index.py    # Rebuild FAISS index
│   │   ├── run_evaluation.py   # Offline evaluation
│   │   └── run_simulation.py   # Strategy simulation
│   ├── Dockerfile
│   └── requirements.txt
├── scheduler_mvp/              # Request scheduling service
│   ├── app/
│   │   ├── main.py             # FastAPI entrypoint
│   │   ├── scheduler/
│   │   │   ├── engine.py       # Core scheduling engine
│   │   │   ├── budget.py       # Dual token budget
│   │   │   ├── queues.py       # Interactive + heavy queues
│   │   │   └── congestion_controller.py
│   │   ├── dispatcher/
│   │   │   ├── worker.py       # Async dispatch worker
│   │   │   └── provider_client.py  # Mock / real LLM provider
│   │   └── predictor_client/   # HTTP client → Predictor
│   ├── scripts/
│   │   ├── run_local.py        # Start scheduler locally
│   │   └── load_test.py        # Load testing tool
│   ├── Dockerfile
│   └── requirements.txt
├── docker-compose.yml          # One-command deployment
├── scripts/start.sh            # Local startup script
└── .env.example                # Environment template

Research

This project is the reference implementation for the following paper:

Layered Scheduling for Black-Box LLM APIs: Allocation, Ordering, and Overload Control

arXiv:2603.07917

LLM inference is increasingly consumed through black-box APIs where the caller controls only when and whether to submit each request. Under high concurrency, provider congestion inflates short-request tail latency, causes expensive requests to be delayed or silently dropped, and degrades useful throughput. We observe that this scheduling problem decomposes into three separable concerns — allocation, ordering, and overload control — and instantiate this architecture with adaptive deficit round-robin, slowdown-aware feasible-set ordering, and conservative admission control.
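
Deficit round-robin, the basis of the allocation layer named in the abstract, gives each class a quantum of credit per round and lets it dispatch requests while its accumulated credit covers their cost. The sketch below is the classic, non-adaptive algorithm with token cost as the unit. The paper's adaptive variant and its parameters are not reproduced here, and all names and numbers are illustrative.

```python
from collections import deque


def deficit_round_robin(queues: dict[str, deque], quanta: dict[str, int],
                        rounds: int) -> list[str]:
    """Classic DRR. queues maps class name -> deque of (request_id, cost)."""
    deficit = {name: 0 for name in queues}
    order = []
    for _ in range(rounds):
        for name, queue in queues.items():
            if not queue:
                deficit[name] = 0          # idle classes do not accumulate credit
                continue
            deficit[name] += quanta[name]
            # Serve requests while the head of the queue fits in the credit.
            while queue and queue[0][1] <= deficit[name]:
                request_id, cost = queue.popleft()
                deficit[name] -= cost
                order.append(request_id)
            if not queue:
                deficit[name] = 0          # drop leftover credit once drained
    return order


queues = {
    "interactive": deque([("i1", 100), ("i2", 100), ("i3", 100)]),
    "heavy": deque([("h1", 500), ("h2", 500)]),
}
quanta = {"interactive": 200, "heavy": 300}
order = deficit_round_robin(queues, quanta, rounds=4)
print(order)
```

A heavy request costing more than one quantum waits until enough credit has accumulated over several rounds, so large requests are delayed but never starved.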

Key Results

  • Short P95 latency within 5–8% of quota-based baseline
  • Completion rate improved from 0.70–0.97 to 0.92–1.00
  • Robust across 4 workload-congestion scenarios (5 seeds each)

Implemented Techniques

| Technique | Module | Description |
| --- | --- | --- |
| Gittins Index Scheduling | predictor/app/scheduler/gittins.py | Theoretically optimal policy for minimizing mean TTLT |
| Hybrid ML + Retrieval Prediction | predictor/app/services/predictor.py | Quantile GBM baseline + FAISS semantic retrieval correction |
| Dual-Pool Retrieval | predictor/app/services/retrieval_strategy.py | Tenant/global pool with waterfall fallback |
| Dual-Queue Dual-Budget | scheduler_mvp/app/scheduler/ | Interactive vs. heavy isolation with tier caps |
| Adaptive Congestion Control | scheduler_mvp/app/scheduler/congestion_controller.py | Three-mode (free/moderate/congested) automatic scaling |
| AIMD Load Balancing | predictor/proxy/src/congestion.ts | Latency-driven window sizing for multi-account routing |
| Strict Replay Evaluation | predictor/app/eval/evaluator.py | Leak-free offline evaluation using temporal ordering |
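
Strict replay evaluation avoids leakage by respecting time: each request is predicted using only records that came before it, and its true outcome joins the history afterwards. The sketch below assumes records carry `timestamp` and `output_tokens` fields and scores a toy predictor by mean absolute error. These names and the metric are illustrative and do not describe the interface of evaluator.py.

```python
from typing import Callable, Sequence


def strict_replay_mae(records: Sequence[dict],
                      predict: Callable[[list, dict], float]) -> float:
    """Replay records in time order; each prediction sees only the past."""
    history: list = []
    errors = []
    for record in sorted(records, key=lambda r: r["timestamp"]):
        if history:                                  # skip the cold-start request
            estimate = predict(history, record)
            errors.append(abs(estimate - record["output_tokens"]))
        history.append(record)                       # reveal the outcome only afterwards
    if not errors:
        raise ValueError("need at least two records to evaluate")
    return sum(errors) / len(errors)


def running_mean(history: list, _record: dict) -> float:
    """Toy predictor: mean output length of everything seen so far."""
    return sum(r["output_tokens"] for r in history) / len(history)


records = [
    {"timestamp": 3, "output_tokens": 300},
    {"timestamp": 1, "output_tokens": 100},
    {"timestamp": 2, "output_tokens": 200},
]
print(strict_replay_mae(records, running_mean))
```

A random train/test split would let the predictor see later requests when scoring earlier ones, which tends to overstate accuracy for retrieval-based predictors that lean on near-duplicate prompts.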

Citation

If you use SageSched in your research, please cite:

@article{sagesched2025,
  title   = {Layered Scheduling for Black-Box LLM APIs: Allocation, Ordering, and Overload Control},
  author  = {SageSched Contributors},
  journal = {arXiv preprint arXiv:2603.07917},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.07917}
}

Contributing

See CONTRIBUTING.md for guidelines.

License

Apache License 2.0
