A CLI tool for testing OpenAI API-compatible inference endpoints. Discovers actual parameter limits, validates configurability, and analyzes rate limiting behavior.
- Parameter Testing: Tests which OpenAI API parameters are accepted and actually work
- Max Tokens Discovery: Aggressively finds streaming and non-streaming limits
- Rate Limit Analysis: Detects RPM/TPM limits with automatic retry handling
- Retry Logic: Up to 3 retries with increasing backoff (5s → 10s → 15s) for rate limits
- Multiple Configuration Sources: Environment variables, .env files, saved endpoints
# Copy to your PATH
cp inference-tester ~/bin/
chmod +x ~/bin/inference-tester
# Or use directly from apps directory
~/apps/inference-tester/inference-tester --help
mkdir -p ~/.config/inference-tester
cat > ~/.config/inference-tester/.env << 'EOF'
FIREWORKS_API_KEY=fw_XXX
INFERENCE_BASE_URL=https://api.fireworks.ai/inference/v1
INFERENCE_MODEL=accounts/fireworks/models/kimi-k2p5-turbo
EOF
inference-tester -s fireworks-kimi \
-u https://api.fireworks.ai/inference/v1 \
-k fw_XXX \
-m accounts/fireworks/models/kimi-k2p5-turbo
# Run all tests
inference-tester -e fireworks-kimi -a
# Run specific test
inference-tester -e fireworks-kimi --test-max-tokens-limits
# Save results to JSON
inference-tester -e fireworks-kimi -a -j
Configuration sources:
- Command line arguments (--api-key, --base-url, --model)
- Environment variables (FIREWORKS_API_KEY, INFERENCE_BASE_URL, INFERENCE_MODEL)
- .env file (searched in: ./.env, ~/.inference-tester.env, ~/.config/inference-tester/.env)
- Saved endpoint (--use-endpoint)
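The lookup across these sources can be sketched as a first-match resolution. This is a minimal illustration, not the tool's actual code: the function names `first_existing_env_file` and `resolve_setting` are hypothetical, and it assumes the list is ordered from highest to lowest priority, which the text does not state explicitly.

```python
from pathlib import Path
from typing import Optional, Sequence

# .env search order taken from the list above.
ENV_FILE_CANDIDATES = (
    "./.env",
    "~/.inference-tester.env",
    "~/.config/inference-tester/.env",
)


def first_existing_env_file(candidates: Sequence[str] = ENV_FILE_CANDIDATES) -> Optional[Path]:
    """Return the first candidate path that exists as a file, or None."""
    for raw in candidates:
        path = Path(raw).expanduser()
        if path.is_file():
            return path
    return None


def resolve_setting(*sources: Optional[str]) -> Optional[str]:
    """Return the first non-empty value; pass sources highest priority first (assumed order)."""
    for value in sources:
        if value:
            return value
    return None


# Hypothetical values standing in for each configuration source.
cli_model = None                           # --model not given
env_model = "model-from-environment"       # INFERENCE_MODEL
dotenv_model = "model-from-dotenv"         # value read from a .env file
saved_model = "model-from-saved-endpoint"  # --use-endpoint

chosen = resolve_setting(cli_model, env_model, dotenv_model, saved_model)
print(chosen)
```

Under the assumed ordering, a value given on the command line would always win, and a saved endpoint would only be consulted when nothing else is set.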
Verifies endpoint responds and model follows instructions.
Checks:
- API connectivity (200 OK)
- Response time
- Rate limit headers (RPM/TPM)
- Instruction following capability
Output:
✓ Status: 200 OK (967ms)
✓ Instruction Following: PASS
RPM Limit: 60 | TPM Limit: 12000
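As a rough sketch of how the RPM/TPM line could be derived from response headers: the header names below (`x-ratelimit-limit-requests`, `x-ratelimit-limit-tokens`) follow OpenAI's convention and are an assumption here, since other providers may use different names or omit them. The helper names are hypothetical.

```python
from typing import Mapping, Optional


def parse_limit(headers: Mapping[str, str], name: str) -> Optional[int]:
    """Case-insensitive lookup of an integer header; None if missing or malformed."""
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get(name.lower())
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        return None


def format_limits(headers: Mapping[str, str]) -> str:
    """Build a summary line in the style of the output shown above."""
    # Assumed header names (OpenAI convention); adjust per provider.
    rpm = parse_limit(headers, "x-ratelimit-limit-requests")
    tpm = parse_limit(headers, "x-ratelimit-limit-tokens")

    def show(value: Optional[int]) -> str:
        return "unknown" if value is None else str(value)

    return f"RPM Limit: {show(rpm)} | TPM Limit: {show(tpm)}"


example_headers = {"X-RateLimit-Limit-Requests": "60", "x-ratelimit-limit-tokens": "12000"}
print(format_limits(example_headers))  # RPM Limit: 60 | TPM Limit: 12000
```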
Tests which OpenAI API parameters are accepted AND if they actually affect output.
Parameters tested:
- temperature (0.0 vs 1.5)
- max_tokens (50 vs 200)
- top_p (0.1 vs 1.0)
- top_k (1 vs 50)
- presence_penalty (0.0 vs 2.0)
- frequency_penalty (0.0, 0.5, 1.0, 1.5, 2.0 - incremental)
Output:
✓ temperature: ACCEPTED, EFFECTIVE (outputs differ)
✓ max_tokens: ACCEPTED, WEAK EFFECT (length differs by 8 chars)
⚠ frequency_penalty: 3/5 values accepted (max 1.0, rejects at 1.5)
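The incremental frequency_penalty line above can be reproduced with a small sweep summary. This is a sketch under assumptions: `summarize_sweep` and the `accepts` callable are hypothetical stand-ins for real API calls, every value is tried even after a rejection, and the wording for the all-accepted and none-accepted cases is invented for illustration.

```python
from typing import Callable, Sequence


def summarize_sweep(name: str, values: Sequence[float], accepts: Callable[[float], bool]) -> str:
    """Try every value and report how many the endpoint accepted."""
    results = [(value, accepts(value)) for value in values]
    accepted = [value for value, ok in results if ok]
    rejected = [value for value, ok in results if not ok]
    total = len(values)
    if not rejected:
        return f"✓ {name}: {total}/{total} values accepted"  # hypothetical wording
    if not accepted:
        return f"✗ {name}: 0/{total} values accepted"  # hypothetical wording
    return (
        f"⚠ {name}: {len(accepted)}/{total} values accepted "
        f"(max {max(accepted)}, rejects at {min(rejected)})"
    )


# Simulated provider that rejects frequency_penalty above 1.0.
line = summarize_sweep("frequency_penalty", [0.0, 0.5, 1.0, 1.5, 2.0], lambda v: v <= 1.0)
print(line)  # ⚠ frequency_penalty: 3/5 values accepted (max 1.0, rejects at 1.5)
```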
Aggressively discovers actual max_tokens limits.
Phases:
- Non-streaming limit (1024 → 16384)
- Streaming baseline (4K-8K)
- Aggressive extension (8K → 200K with doubling)
- Fine-tuning (8K increments)
With --confirm-limit:
Retries failures 3x to distinguish rate limits from hard limits.
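The doubling and fine-tuning phases can be sketched as follows. This is an illustration of the general technique, not the tool's implementation: `find_streaming_limit` and the `probe` callable (which would send one request with the given max_tokens and report whether it was accepted) are hypothetical, and the defaults are taken from the phase list above.

```python
from typing import Callable


def find_streaming_limit(
    probe: Callable[[int], bool],
    start: int = 8192,
    ceiling: int = 200_000,
    step: int = 8192,
) -> int:
    """Return the largest max_tokens value that `probe` accepted, or 0 if none."""
    if not probe(start):
        return 0
    best = start
    candidate = start * 2
    # Aggressive extension: double until a rejection or the ceiling is passed.
    while candidate <= ceiling and probe(candidate):
        best = candidate
        candidate *= 2
    # Fine-tuning: walk in fixed increments from the last success
    # toward the first failure (or the ceiling, whichever is lower).
    upper = min(candidate, ceiling)
    candidate = best + step
    while candidate < upper and probe(candidate):
        best = candidate
        candidate += step
    return best


# Simulated endpoint whose real limit is 24,576 tokens.
print(find_streaming_limit(lambda n: n <= 24_576))  # 24576
```

Doubling keeps the number of probes small when the limit is far above the baseline, and the fixed-step pass then narrows the result to within one increment of the real limit.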
Output:
Non-Streaming Maximum: 4,096 tokens
Streaming Maximum: 24,576 tokens (or higher)
✓ STREAMING ADVANTAGE: 20,480 more tokens (6.0x multiplier)
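The advantage figures in the sample output follow directly from the two maxima. A small worked check, with `streaming_advantage` as a hypothetical helper name:

```python
def streaming_advantage(non_streaming_max: int, streaming_max: int) -> str:
    """Express the streaming gain as an absolute difference and a ratio."""
    extra = streaming_max - non_streaming_max
    multiplier = streaming_max / non_streaming_max
    return f"{extra:,} more tokens ({multiplier:.1f}x multiplier)"


print(streaming_advantage(4096, 24576))  # 20,480 more tokens (6.0x multiplier)
```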
Verifies model generates requested token count at discovered limits.
Tests:
- Non-streaming at its max (e.g., 4K)
- Streaming at large scale (e.g., 8K+)
Checks:
- finish_reason: length (hit limit) vs stop (stopped early)
- Throughput: tokens/second
- Streaming chunk count
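A sketch of how these checks might be evaluated for one response. The `finish_reason` values `length` and `stop` are standard in the OpenAI API; the function name, its parameters, and the result keys are hypothetical.

```python
def summarize_output(
    finish_reason: str,
    completion_tokens: int,
    requested_tokens: int,
    elapsed_seconds: float,
) -> dict:
    """Interpret finish_reason and compute throughput for one response."""
    throughput = completion_tokens / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return {
        # "length" means generation ran until the max_tokens limit.
        "hit_limit": finish_reason == "length",
        # "stop" with fewer tokens than requested means the model ended early.
        "stopped_early": finish_reason == "stop" and completion_tokens < requested_tokens,
        "tokens_per_second": round(throughput, 1),
    }


# Illustrative numbers only, not measured results.
report = summarize_output("length", completion_tokens=4096, requested_tokens=4096, elapsed_seconds=51.2)
print(report)
```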
| Option | Description | Default |
|---|---|---|
| -u, --base-url | API base URL | From env/.env |
| -k, --api-key | API key | From env/.env |
| -m, --model | Model ID | From env/.env |
| -p, --provider | Provider preset | - |
| -t, --timeout | Request timeout (seconds) | 60 |
| --env-file | Path to .env file | Auto-search |

| Option | Description |
|---|---|
| -s, --save-endpoint | Save as named endpoint |
| -e, --use-endpoint | Use saved endpoint |
| -l, --list-endpoints | List saved endpoints |

| Option | Description |
|---|---|
| -a, --test-all | Run all tests |
| --test-connectivity | Test connectivity only |
| --test-params | Test parameter configurability |
| --test-max-tokens-limits | Test max tokens limits |
| --test-actual-output | Test actual output length |

| Option | Description | Default |
|---|---|---|
| --max-test | Maximum tokens to test | 32768 |
| --output-tokens | Tokens for output test | 4096 |
| --confirm-limit | Confirm hard limits with retries | Off |

| Option | Description |
|---|---|
| -j, --json-output | Save results to JSON (optional filename) |
| -r, --results-dir | Directory for JSON results (default: CWD) |
| -P, --parameter-help | Show parameter reference |

inference-tester -s my-fireworks \
-u https://api.fireworks.ai/inference/v1 \
-k fw_XXX \
-m accounts/fireworks/models/kimi-k2p5-turbo
inference-tester -e my-fireworks -a -j
export FIREWORKS_API_KEY=fw_XXX
export INFERENCE_MODEL=accounts/fireworks/models/kimi-k2p5-turbo
export INFERENCE_BASE_URL=https://api.fireworks.ai/inference/v1
inference-tester --test-all
inference-tester -e fireworks-kimi \
--test-max-tokens-limits \
--max-test 65536 \
--confirm-limit \
-j
inference-tester -e fireworks-kimi -a -j -r ~/test-results
All tests use automatic retry logic:
Rate Limits (429):
- Always retried (up to 3x)
- Backoff: 5s → 10s → 15s
Hard Rejections:
- Only retried with --confirm-limit
- Backoff: 3s → 6s → 9s
Results:
- ✓ OK: Success on first try
- ✓ ACCEPTED (retry): Success after rate limit
- ⚠ UNCONFIRMED: Failed all retries (may be temporary)
- ✗ HARD LIMIT CONFIRMED: True provider limit
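The retry rules above can be sketched as a small loop. This is one possible reading, not the tool's code: `run_with_retries` and the `send` callable (returning an HTTP status code) are hypothetical, "up to 3x" is taken to mean three retries after the first attempt, and a hard rejection without --confirm-limit is assumed to be reported as UNCONFIRMED.

```python
import time
from typing import Callable

RATE_LIMIT_DELAYS = (5, 10, 15)  # seconds between retries after a 429
HARD_REJECT_DELAYS = (3, 6, 9)   # seconds, used only with --confirm-limit


def run_with_retries(
    send: Callable[[], int],
    confirm_limit: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `send` until it succeeds or the retry budget for its failure type runs out."""
    rate_retries = 0
    hard_retries = 0
    retried = False
    while True:
        status = send()
        if status == 200:
            return "ACCEPTED (retry)" if retried else "OK"
        if status == 429:
            # Rate limits are always retried.
            if rate_retries == len(RATE_LIMIT_DELAYS):
                return "UNCONFIRMED"
            sleep(RATE_LIMIT_DELAYS[rate_retries])
            rate_retries += 1
        else:
            # Any other failure is treated as a hard rejection.
            if not confirm_limit:
                return "UNCONFIRMED"  # assumed label when retries are disabled
            if hard_retries == len(HARD_REJECT_DELAYS):
                return "HARD LIMIT CONFIRMED"
            sleep(HARD_REJECT_DELAYS[hard_retries])
            hard_retries += 1
        retried = True


# Demo with scripted statuses and a recording sleep, so nothing actually waits.
statuses = iter([429, 429, 200])
waits = []
print(run_with_retries(lambda: next(statuses), sleep=waits.append), waits)
```

Passing `sleep` as a parameter keeps the backoff schedule testable without real delays.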
Human-readable test results with checkmarks and status indicators.
{
"endpoint": "https://api.fireworks.ai/inference/v1",
"model": "accounts/fireworks/models/kimi-k2p5-turbo",
"tested_at": "2026-04-03T11:31:03",
"summary": {
"total_iterations": 24,
"total_tokens_used": 23632,
"total_elapsed_seconds": 92.1,
"rate_limits": {
"rpm_limit": "60",
"tpm_limit": "12000"
}
},
"results": [...]
}
Saved endpoints with API keys (masked in list view).
Environment variables for default connection settings.
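The JSON results format shown earlier can be post-processed with a few lines of code. The sketch below parses a copy of that sample (with the elided `results` array left empty as a placeholder) and derives two simple ratios; the `summarize` helper and its output keys are hypothetical.

```python
import json

# Copy of the sample document above; "results" is elided there, so it is empty here.
SAMPLE = """
{
  "endpoint": "https://api.fireworks.ai/inference/v1",
  "model": "accounts/fireworks/models/kimi-k2p5-turbo",
  "tested_at": "2026-04-03T11:31:03",
  "summary": {
    "total_iterations": 24,
    "total_tokens_used": 23632,
    "total_elapsed_seconds": 92.1,
    "rate_limits": {"rpm_limit": "60", "tpm_limit": "12000"}
  },
  "results": []
}
"""


def summarize(report: dict) -> dict:
    """Derive per-iteration and per-second token rates from the summary block."""
    summary = report["summary"]
    limits = summary.get("rate_limits", {})
    elapsed = summary["total_elapsed_seconds"]
    tokens = summary["total_tokens_used"]
    return {
        "tokens_per_iteration": tokens / summary["total_iterations"],
        "tokens_per_second": tokens / elapsed if elapsed else 0.0,
        # The sample stores limits as strings, so convert before doing arithmetic.
        "rpm_limit": int(limits["rpm_limit"]) if "rpm_limit" in limits else None,
        "tpm_limit": int(limits["tpm_limit"]) if "tpm_limit" in limits else None,
    }


stats = summarize(json.loads(SAMPLE))
print(stats)
```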
Tested providers:
- Fireworks AI (full support)
- OpenAI (compatible)
- Anthropic (compatible)
- Other OpenAI API-compatible endpoints
Create a .env file or use -e with a saved endpoint.
The tool auto-retries 3x. Wait 60s between full test runs if consistently hitting limits.
Parameter test makes 10+ rapid calls. Rate limits may cause early termination. Run with delays between attempts.
MIT License - Feel free to modify and distribute.