Benchmarking tool for xPyD-proxy — measure latency, throughput, and PD-disaggregated inference performance.

    pip install xpyd-bench

    # Optional dependencies (quoted so shells such as zsh do not expand the brackets)
    pip install "xpyd-bench[tokenizer]"   # tiktoken for accurate token counting
    pip install "xpyd-bench[http2]"       # HTTP/2 support
    pip install "xpyd-bench[dev]"         # Development & testing

Install from source:

    git clone https://github.com/xPyD-hub/xPyD-bench.git
    cd xPyD-bench
    pip install -e ".[dev]"

Main entry point for running benchmarks.

    xpyd-bench [OPTIONS]

| Parameter | Default | Description |
|---|---|---|
| `--base-url` | `http://127.0.0.1:8000` | Target server URL |
| `--model` | (auto-detect) | Model name; auto-fetched from server if omitted |
| `--endpoint` | `/v1/completions` | API endpoint path |
| `--num-prompts` | `1000` | Total number of requests |
| `--request-rate` | `inf` | Requests per second; `inf` sends all concurrently |
| `--max-concurrency` | (unlimited) | Maximum concurrent requests |
| `--input-len` | `256` | Input prompt token length |
| `--output-len` | `128` | Maximum output token count |
| `--stream` / `--no-stream` | (auto) | Enable/disable streaming responses |
| `--duration` | (none) | Fixed run duration (seconds); auto-stops when elapsed |
| `--dataset-name` | `random` | Dataset type: `random` / `synthetic` |
| `--dataset-path` | (none) | Custom dataset file path (`.jsonl`/`.json`/`.csv`) |
| `--seed` | `0` | Random seed |
| `--burstiness` | `1.0` | Burstiness factor (`1.0` = Poisson distribution) |
| `--repeat` | `1` | Number of repeat runs |
| `--repeat-delay` | `0` | Delay between repeat runs (seconds) |
| `--output` / `-o` | (stdout) | Output file path for results |
| `--backend` | `openai` | Backend type |
| `--backend-plugin` | (none) | Custom backend plugin module path |
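
The interaction between `--request-rate` and `--burstiness` is easier to see in code. The sketch below is not taken from xpyd-bench; it assumes the gamma-distributed inter-arrival convention used by other LLM serving benchmarks, where a burstiness of 1.0 reduces to exponential gaps (a Poisson process) and smaller values produce burstier traffic. The function name `arrival_offsets` is hypothetical.

```python
import random


def arrival_offsets(num_requests, request_rate, burstiness=1.0, seed=0):
    """Return the send time (seconds from start) of each request.

    Assumption: gaps between requests follow a gamma distribution with
    shape=burstiness and scale=1/(request_rate*burstiness), so the mean
    gap is always 1/request_rate.
    """
    if request_rate == float("inf"):
        # An infinite rate means every request is sent immediately.
        return [0.0] * num_requests

    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)
    offsets, now = [], 0.0
    for _ in range(num_requests):
        offsets.append(now)
        now += rng.gammavariate(burstiness, scale)
    return offsets


offsets = arrival_offsets(num_requests=5000, request_rate=50.0)
observed_rate = (len(offsets) - 1) / offsets[-1]
print(f"observed rate: {observed_rate:.1f} req/s")
```

Because the mean gap is fixed at `1 / request_rate`, changing the burstiness in this model alters how clustered the requests are without changing the average load.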

| Parameter | Description |
|---|---|
| `--temperature` | Sampling temperature |
| `--top-p` | Nucleus sampling |
| `--frequency-penalty` | Frequency penalty |
| `--presence-penalty` | Presence penalty |
| `--stop` | Stop sequence |
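
As an illustration of how these sampling flags could end up in a request, here is a minimal sketch that builds an OpenAI-style `/v1/completions` body. The mapping from CLI flags to the same-named JSON fields (`top_p`, `frequency_penalty`, and so on) is an assumption based on the default `openai` backend, and `build_completion_payload` is a hypothetical helper, not part of xpyd-bench.

```python
import json


def build_completion_payload(model, prompt, max_tokens, stream=True,
                             temperature=None, top_p=None,
                             frequency_penalty=None, presence_penalty=None,
                             stop=None):
    """Build a completions request body, omitting sampling fields left unset."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": stream,
    }
    optional = {
        "temperature": temperature,
        "top_p": top_p,
        "frequency_penalty": frequency_penalty,
        "presence_penalty": presence_penalty,
        "stop": stop,
    }
    # Only send parameters the user actually set, so server defaults apply otherwise.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


payload = build_completion_payload(
    model="Qwen/Qwen2.5-7B",
    prompt="Hello",
    max_tokens=128,
    temperature=0.7,
    top_p=0.9,
)
print(json.dumps(payload, indent=2))
```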

    xpyd-bench compare          # Compare multiple benchmark results
    xpyd-bench profile          # Performance profiling mode
    xpyd-bench replay           # Replay recorded requests
    xpyd-bench lora-compare     # Compare LoRA adapters on same endpoint (M89)
    xpyd-bench config-dump      # Export current configuration
    xpyd-bench config-validate  # Validate configuration file
    xpyd-dummy                  # Start dummy server for testing

Run a benchmark against a locally running vLLM / xPyD instance:

    xpyd-bench \
      --base-url http://localhost:8000 \
      --model Qwen/Qwen2.5-7B \
      --num-prompts 500 \
      --max-concurrency 32 \
      --input-len 512 \
      --output-len 256 \
      --stream \
      -o results.json

Route through xpyd-proxy to separate prefill / decode nodes:

    # 1) Start prefill node (xpyd-sim or real vLLM)
    xpyd-sim --role prefill --port 8100

    # 2) Start decode node
    xpyd-sim --role decode --port 8200

    # 3) Start proxy
    xpyd-proxy --prefill http://localhost:8100 --decode http://localhost:8200 --port 8080

    # 4) Run benchmark (against proxy)
    xpyd-bench \
      --base-url http://localhost:8080 \
      --num-prompts 1000 \
      --request-rate 50 \
      --stream \
      -o pd_results.json

Use the built-in multi-model comparison mode:

    # Compare two models
    xpyd-bench \
      --base-url http://localhost:8000 \
      --model Qwen/Qwen2.5-7B \
      --num-prompts 200 \
      -o model_a.json

    xpyd-bench \
      --base-url http://localhost:8000 \
      --model Qwen/Qwen2.5-14B \
      --num-prompts 200 \
      -o model_b.json

    # Compare results
    xpyd-bench compare model_a.json model_b.json

Run for a fixed duration without limiting request count:

    xpyd-bench \
      --base-url http://localhost:8000 \
      --duration 300 \
      --request-rate 20 \
      --stream

Start a mock server without a real model:

    # Terminal 1: Start dummy server
    xpyd-dummy --port 8000

    # Terminal 2: Run benchmark
    xpyd-bench --base-url http://localhost:8000 --num-prompts 100

| Metric | Description |
|---|---|
| TTFT (Time To First Token) | Time from sending a request to receiving the first token. Reflects prefill stage latency. |
| TPOT (Time Per Output Token) | Average time to generate each output token. Reflects decode stage speed. |
| TPS (Tokens Per Second) | Tokens generated per second (per-request / aggregate). |
| Throughput | Total throughput: requests/sec (req/s) and tokens/sec (tok/s). |
| Error Rate | Percentage of failed requests. |
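
To make the per-request metrics concrete, the following sketch derives them from raw timestamps. It assumes a common convention in which TPOT excludes the first token (the time between the first and last token, divided by the remaining token count); the exact formula used by xpyd-bench is not stated here, and `request_metrics` is a hypothetical helper.

```python
def request_metrics(send_time, token_times):
    """Compute per-request metrics from timestamps given in seconds.

    send_time   -- when the request was sent
    token_times -- arrival time of each streamed output token, in order
    """
    if not token_times:
        raise ValueError("request produced no tokens")

    n = len(token_times)
    ttft = token_times[0] - send_time        # time to first token
    latency = token_times[-1] - send_time    # end-to-end request latency
    # Assumed convention: average gap between tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    tps = n / latency if latency > 0 else 0.0

    return {
        "ttft_ms": ttft * 1000.0,
        "tpot_ms": tpot * 1000.0,
        "latency_ms": latency * 1000.0,
        "tokens_per_second": tps,
    }


# Four tokens: the first after 150 ms, then one every 20 ms.
metrics = request_metrics(0.0, [0.150, 0.170, 0.190, 0.210])
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under this convention, end-to-end latency is roughly TTFT plus TPOT times the number of tokens after the first, which is a handy sanity check on reported numbers.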

| Statistic | Description |
|---|---|
| P50 | Median — 50% of requests are below this value |
| P90 | 90% of requests are below this value |
| P99 | 99% of requests are below this value; reflects tail latency |
| Mean | Arithmetic mean |
| Std | Standard deviation; reflects latency stability |
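
The statistics above can be reproduced from a list of per-request samples with the Python standard library. This is a sketch only: the percentile interpolation method (`inclusive` here) and the use of the population standard deviation are assumptions, and different tools may report slightly different values for the same data.

```python
import statistics


def summarize(samples_ms):
    """Return mean, std, and P50/P90/P99 for a list of latency samples (ms)."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")

    ordered = sorted(samples_ms)
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "std": statistics.pstdev(ordered),
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
    }


summary = summarize([float(v) for v in range(1, 101)])
print({name: round(value, 2) for name, value in summary.items()})
```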
- TTFT < 200ms: Good prefill performance (7B model, 512 token input)
- TPOT < 30ms: Normal decode speed
- P99/P50 < 3x: Healthy latency distribution with no severe tail latency
- Error Rate = 0%: Stable service
- Throughput: Should scale near-linearly with concurrency, plateauing at saturation
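
These rules of thumb can be checked mechanically against a results file. The sketch below assumes the `results` layout of the sample output JSON in this README, and it assumes the TTFT and TPOT thresholds apply to the P50 value, which the list above does not specify. `check_results` is a hypothetical helper, not an xpyd-bench command.

```python
def check_results(results, ttft_limit_ms=200.0, tpot_limit_ms=30.0, max_tail_ratio=3.0):
    """Return a list of human-readable findings; an empty list means all checks passed."""
    findings = []

    total = results["total_requests"]
    error_rate = results["failed_requests"] / total if total else 0.0
    if error_rate > 0:
        findings.append(f"error rate {error_rate:.2%} is above 0%")

    # Assumption: thresholds are compared against the median (P50).
    if results["ttft_ms"]["p50"] >= ttft_limit_ms:
        findings.append("TTFT P50 exceeds the limit")
    if results["tpot_ms"]["p50"] >= tpot_limit_ms:
        findings.append("TPOT P50 exceeds the limit")

    # Tail check: P99 should stay within max_tail_ratio times P50.
    for key in ("ttft_ms", "tpot_ms", "latency_ms"):
        stats = results[key]
        if stats["p50"] > 0 and stats["p99"] / stats["p50"] >= max_tail_ratio:
            findings.append(f"{key} P99/P50 ratio indicates heavy tail latency")

    return findings


# Values copied from the sample output in this README.
sample_results = {
    "total_requests": 1000,
    "successful_requests": 998,
    "failed_requests": 2,
    "ttft_ms": {"mean": 152, "p50": 140, "p90": 210, "p99": 380},
    "tpot_ms": {"mean": 22, "p50": 20, "p90": 28, "p99": 45},
    "latency_ms": {"mean": 2950, "p50": 2800, "p90": 3500, "p99": 4200},
}
findings = check_results(sample_results)
print(findings)
```

To run it against a real file, load the JSON with `json.load` and pass its `results` object to the function.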

Output JSON contains:

    {
      "config": { ... },
      "results": {
        "total_requests": 1000,
        "successful_requests": 998,
        "failed_requests": 2,
        "total_duration_s": 45.2,
        "requests_per_second": 22.1,
        "tokens_per_second": 2834,
        "ttft_ms": { "mean": 152, "p50": 140, "p90": 210, "p99": 380 },
        "tpot_ms": { "mean": 22, "p50": 20, "p90": 28, "p99": 45 },
        "latency_ms": { "mean": 2950, "p50": 2800, "p90": 3500, "p99": 4200 }
      }
    }

Request flow in a PD-disaggregated setup:

      Client (xpyd-bench)
              │
              ▼
    xpyd-proxy (routing layer)
           ┌──┴──┐
           ▼     ▼
        Prefill Decode
       (xpyd-sim / vLLM)

See scripts/run_benchmark.sh for an all-in-one launch script.

    # Manual steps
    pip install xpyd-sim xpyd-proxy xpyd-bench

    # Start sim nodes
    xpyd-sim --role prefill --port 8100 &
    xpyd-sim --role decode --port 8200 &

    # Start proxy
    xpyd-proxy \
      --prefill http://localhost:8100 \
      --decode http://localhost:8200 \
      --port 8080 &

    # Wait for services to be ready
    sleep 3

    # Run benchmark
    xpyd-bench \
      --base-url http://localhost:8080 \
      --num-prompts 500 \
      --request-rate 30 \
      --max-concurrency 64 \
      --input-len 256 \
      --output-len 128 \
      --stream \
      -o benchmark_results.json

    echo "Results saved to benchmark_results.json"

- Checkpoint & Resume (`--checkpoint`): Resume long-running benchmarks after interruption
- Benchmark Fingerprint (`--fingerprint`): Uniquely identify benchmark configurations for easy comparison
- Configuration Inheritance (`--extends`): Configuration file inheritance
- Rolling Window Metrics: Real-time rolling window statistics
- Baseline Registry: Register baseline results for automatic regression comparison
- Speculative Decoding Metrics: Metrics related to speculative decoding
- Prefix Caching Impact: Analyze prefix caching effectiveness
- Adaptive Timeout: Automatically adjust timeout based on observed latency
- Multimodal Vision Benchmark: Support for vision model testing
- SLA Validation: Define SLA rules and automatically check compliance
- Distributed Benchmark: Multi-node coordinated distributed load testing