inference-tester

A CLI tool for testing OpenAI API-compatible inference endpoints. Discovers actual parameter limits, validates configurability, and analyzes rate limiting behavior.

Features

Parameter Testing: Tests which OpenAI API parameters are accepted and whether they actually affect the output
Max Tokens Discovery: Aggressively finds streaming and non-streaming limits
Rate Limit Analysis: Detects RPM/TPM limits with automatic retry handling
Retry Logic: Up to 3 retries with increasing backoff (5s → 10s → 15s) for rate limits
Multiple Configuration Sources: Environment variables, .env files, saved endpoints

Installation

# Copy to your PATH
cp inference-tester ~/bin/
chmod +x ~/bin/inference-tester

# Or use directly from apps directory
~/apps/inference-tester/inference-tester --help

Quick Start

1. Create a .env file

mkdir -p ~/.config/inference-tester
cat > ~/.config/inference-tester/.env << 'EOF'
FIREWORKS_API_KEY=fw_XXX
INFERENCE_BASE_URL=https://api.fireworks.ai/inference/v1
INFERENCE_MODEL=accounts/fireworks/models/kimi-k2p5-turbo
EOF

2. Save an endpoint (optional but recommended)

inference-tester -s fireworks-kimi \
  -u https://api.fireworks.ai/inference/v1 \
  -k fw_XXX \
  -m accounts/fireworks/models/kimi-k2p5-turbo

3. Run tests

# Run all tests
inference-tester -e fireworks-kimi -a

# Run specific test
inference-tester -e fireworks-kimi --test-max-tokens-limits

# Save results to JSON
inference-tester -e fireworks-kimi -a -j

Configuration Sources (Precedence Order)

1. Command line arguments (--api-key, --base-url, --model)
2. Environment variables (FIREWORKS_API_KEY, INFERENCE_BASE_URL, INFERENCE_MODEL)
3. .env file (searched in: ./.env, ~/.inference-tester.env, ~/.config/inference-tester/.env)
4. Saved endpoint (--use-endpoint)
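
The precedence order can be pictured as a first-match lookup across the four sources. The following is a minimal sketch, not the tool's actual code: it assumes the list runs from highest to lowest precedence, and the names `resolve_setting`, `resolve_all`, and the setting keys are hypothetical.

```python
from typing import Mapping, Optional

# Hypothetical setting names; the tool's internal keys are not documented here.
SETTINGS = ("api_key", "base_url", "model")


def resolve_setting(name: str, *sources: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty value for `name`, scanning sources in order."""
    for source in sources:
        value = source.get(name)
        if value:
            return value
    return None


def resolve_all(
    cli: Mapping[str, str],
    env: Mapping[str, str],
    dotenv: Mapping[str, str],
    saved: Mapping[str, str],
) -> dict:
    # Order mirrors the list above: CLI, environment, .env file, saved endpoint.
    return {name: resolve_setting(name, cli, env, dotenv, saved) for name in SETTINGS}


# Illustrative values only.
config = resolve_all(
    cli={"model": "model-from-cli"},
    env={"base_url": "https://env.example/v1", "model": "model-from-env"},
    dotenv={"api_key": "key-from-dotenv"},
    saved={"api_key": "key-from-saved", "base_url": "https://saved.example/v1"},
)
print(config)
```

Each setting is resolved independently, so a model given on the command line can be combined with a key from the .env file.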

Available Tests

Test 1: Basic Connectivity (`--test-connectivity`)

Verifies that the endpoint responds and that the model follows instructions.

Checks:

API connectivity (200 OK)
Response time
Rate limit headers (RPM/TPM)
Instruction following capability

Output:

✓ Status: 200 OK (967ms)
✓ Instruction Following: PASS
RPM Limit: 60 | TPM Limit: 12000
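
Rate limit values like these are typically read from response headers. The sketch below assumes the provider sends the `x-ratelimit-limit-requests` and `x-ratelimit-limit-tokens` headers that OpenAI documents. Whether this tool reads exactly these names is an assumption, and `parse_rate_limits` is a hypothetical helper.

```python
from typing import Mapping, Optional


def parse_rate_limits(headers: Mapping[str, str]) -> dict:
    """Extract RPM/TPM limits from response headers, or None when absent."""
    # HTTP header names are case-insensitive, so normalize before looking up.
    lowered = {key.lower(): value for key, value in headers.items()}

    def as_int(key: str) -> Optional[int]:
        raw = lowered.get(key)
        if raw is None:
            return None
        try:
            return int(raw)
        except ValueError:
            return None  # Non-numeric header value: treat as unknown.

    return {
        "rpm_limit": as_int("x-ratelimit-limit-requests"),
        "tpm_limit": as_int("x-ratelimit-limit-tokens"),
    }


limits = parse_rate_limits(
    {"X-RateLimit-Limit-Requests": "60", "x-ratelimit-limit-tokens": "12000"}
)
print(f"RPM Limit: {limits['rpm_limit']} | TPM Limit: {limits['tpm_limit']}")
```

Providers that omit these headers would simply yield `None` for both values.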

Test 2: Parameter Configurability (`--test-params`)

Tests which OpenAI API parameters are accepted and whether they actually affect output.

Parameters tested:

temperature (0.0 vs 1.5)
max_tokens (50 vs 200)
top_p (0.1 vs 1.0)
top_k (1 vs 50; a provider extension, not part of the official OpenAI API)
presence_penalty (0.0 vs 2.0)
frequency_penalty (0.0, 0.5, 1.0, 1.5, 2.0 - incremental)

Output:

✓ temperature: ACCEPTED, EFFECTIVE (outputs differ)
✓ max_tokens: ACCEPTED, WEAK EFFECT (length differs by 8 chars)
⚠ frequency_penalty: 3/5 values accepted (max 1.0, rejects at 1.5)
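
The incremental check for frequency_penalty can be sketched as a sweep over the candidate values that records which ones the endpoint accepts. This is an illustration under assumptions: `sweep_values` and the `accepts` callback are hypothetical, and a real `accepts` would send a request and report whether the endpoint returned success.

```python
from typing import Callable, Sequence


def sweep_values(values: Sequence[float], accepts: Callable[[float], bool]) -> dict:
    """Try every value and summarize which ones were accepted."""
    accepted = [v for v in values if accepts(v)]
    rejected = [v for v in values if v not in accepted]
    return {
        "accepted": len(accepted),
        "total": len(values),
        "max_accepted": max(accepted) if accepted else None,
        "first_rejected": min(rejected) if rejected else None,
    }


# Stand-in for a provider that rejects penalties above 1.0.
result = sweep_values([0.0, 0.5, 1.0, 1.5, 2.0], lambda value: value <= 1.0)
print(
    f"{result['accepted']}/{result['total']} values accepted "
    f"(max {result['max_accepted']}, rejects at {result['first_rejected']})"
)
```

With the stand-in provider, the summary matches the shape of the frequency_penalty line in the sample output.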

Test 3: Max Tokens Limits (`--test-max-tokens-limits`)

Aggressively discovers actual max_tokens limits.

Phases:

Non-streaming limit (1024 → 16384)
Streaming baseline (4K-8K)
Aggressive extension (8K → 200K with doubling)
Fine-grained refinement (8K increments)

With --confirm-limit: Retries failures 3x to distinguish rate limits from hard limits.

Output:

Non-Streaming Maximum: 4,096 tokens
Streaming Maximum: 24,576 tokens (or higher)
✓ STREAMING ADVANTAGE: 20,480 more tokens (6.0x multiplier)
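
The doubling and refinement phases amount to a two-stage search. Here is a minimal sketch of that idea, assuming a hypothetical `accepts(max_tokens)` callback that returns True when the endpoint accepts the request. The defaults mirror the numbers in the phase list, but the tool's real search may differ.

```python
from typing import Callable, Optional


def find_max_tokens(
    accepts: Callable[[int], bool],
    start: int = 8192,
    ceiling: int = 200_000,
    step: int = 8192,
) -> Optional[int]:
    """Return the largest accepted max_tokens value found, or None."""
    best = None
    value = start
    # Phase 1: double until a request is rejected or the ceiling is passed.
    while value <= ceiling and accepts(value):
        best = value
        value *= 2
    if best is None:
        return None
    # Phase 2: step up in fixed increments between the last success
    # and the first failure (or the ceiling, if nothing was rejected).
    candidate = best + step
    while candidate < value and candidate <= ceiling and accepts(candidate):
        best = candidate
        candidate += step
    return best


# Stand-in provider with limits matching the sample output above.
streaming_max = find_max_tokens(lambda n: n <= 24_576)
non_streaming_max = 4096
advantage = streaming_max - non_streaming_max
multiplier = streaming_max / non_streaming_max
print(f"Streaming Maximum: {streaming_max:,} tokens")
print(f"Advantage: {advantage:,} more tokens ({multiplier:.1f}x multiplier)")
```

Doubling keeps the number of probes small, and the fixed-step pass narrows the result to within one increment of the true limit.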

Test 4: Actual Output Length (`--test-actual-output`)

Verifies that the model actually generates the requested token count at the discovered limits.

Tests:

Non-streaming at its max (e.g., 4K)
Streaming at large scale (e.g., 8K+)

Checks:

finish_reason: length (hit limit) vs stop (stopped early)
Throughput: tokens/second
Streaming chunk count
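
These checks reduce to a few comparisons on the response metadata. The sketch below is illustrative only: `summarize_output` and its parameter names are hypothetical, and the token count is assumed to come from the response's usage data.

```python
def summarize_output(
    finish_reason: str,
    completion_tokens: int,
    requested_tokens: int,
    elapsed_seconds: float,
) -> dict:
    """Interpret finish_reason and compute throughput for one response."""
    throughput = completion_tokens / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return {
        # "length" means generation ran until the max_tokens limit.
        "hit_limit": finish_reason == "length",
        # "stop" with fewer tokens than requested means the model ended early.
        "stopped_early": finish_reason == "stop" and completion_tokens < requested_tokens,
        "tokens_per_second": throughput,
    }


full = summarize_output("length", 4096, 4096, 32.0)
short = summarize_output("stop", 1200, 4096, 10.0)
print(full)
print(short)
```

A run that reports `stopped_early` does not disprove the limit. It only shows that the prompt did not elicit enough output to reach it.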

Command Line Options

Connection Settings

| Option | Description | Default |
| --- | --- | --- |
| `-u, --base-url` | API base URL | From env/.env |
| `-k, --api-key` | API key | From env/.env |
| `-m, --model` | Model ID | From env/.env |
| `-p, --provider` | Provider preset | - |
| `-t, --timeout` | Request timeout (seconds) | 60 |
| `--env-file` | Path to .env file | Auto-search |

Saved Endpoints

| Option | Description |
| --- | --- |
| `-s, --save-endpoint` | Save as named endpoint |
| `-e, --use-endpoint` | Use saved endpoint |
| `-l, --list-endpoints` | List saved endpoints |

Tests

| Option | Description |
| --- | --- |
| `-a, --test-all` | Run all tests |
| `--test-connectivity` | Test connectivity only |
| `--test-params` | Test parameter configurability |
| `--test-max-tokens-limits` | Test max tokens limits |
| `--test-actual-output` | Test actual output length |

Test Configuration

| Option | Description | Default |
| --- | --- | --- |
| `--max-test` | Maximum tokens to test | 32768 |
| `--output-tokens` | Tokens for output test | 4096 |
| `--confirm-limit` | Confirm hard limits with retries | Off |

Output Options

| Option | Description |
| --- | --- |
| `-j, --json-output` | Save results to JSON (optional filename) |
| `-r, --results-dir` | Directory for JSON results (default: CWD) |
| `-P, --parameter-help` | Show parameter reference |

Examples

Save endpoint and run all tests

inference-tester -s my-fireworks \
  -u https://api.fireworks.ai/inference/v1 \
  -k fw_XXX \
  -m accounts/fireworks/models/kimi-k2p5-turbo

inference-tester -e my-fireworks -a -j

Test with environment variables

export FIREWORKS_API_KEY=fw_XXX
export INFERENCE_MODEL=accounts/fireworks/models/kimi-k2p5-turbo
export INFERENCE_BASE_URL=https://api.fireworks.ai/inference/v1

inference-tester --test-all

Test specific max tokens with confirmation

inference-tester -e fireworks-kimi \
  --test-max-tokens-limits \
  --max-test 65536 \
  --confirm-limit \
  -j

Save results to specific directory

inference-tester -e fireworks-kimi -a -j -r ~/test-results

Retry Strategy

All tests use automatic retry logic:

Rate Limits (429):

Always retried (up to 3x)
Backoff: 5s → 10s → 15s

Hard Rejections:

Only retried with --confirm-limit
Backoff: 3s → 6s → 9s

Results:

✓ OK: Success on first try
✓ ACCEPTED (retry): Success after rate limit
⚠ UNCONFIRMED: Failed all retries (may be temporary)
✗ HARD LIMIT CONFIRMED: True provider limit
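
The retry rules above can be sketched as a small state machine. This is an approximation under stated assumptions, not the tool's code: `call_with_retry` and the `send` callback (which returns an HTTP status code) are hypothetical, any non-429 failure is treated as a hard rejection, and an unretried hard rejection is reported as UNCONFIRMED.

```python
import time
from typing import Callable

RATE_LIMIT_DELAYS = (5, 10, 15)   # seconds, used after a 429
HARD_REJECT_DELAYS = (3, 6, 9)    # seconds, used only with confirm_limit


def call_with_retry(
    send: Callable[[], int],
    confirm_limit: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Send a request, retrying up to 3 times, and classify the outcome."""
    status = send()
    if status == 200:
        return "OK"
    for attempt in range(3):
        rate_limited = status == 429
        if not rate_limited and not confirm_limit:
            # Assumption: hard rejections are left unconfirmed without the flag.
            return "UNCONFIRMED"
        delays = RATE_LIMIT_DELAYS if rate_limited else HARD_REJECT_DELAYS
        sleep(delays[attempt])
        status = send()
        if status == 200:
            return "ACCEPTED (retry)"
    if status != 429 and confirm_limit:
        return "HARD LIMIT CONFIRMED"
    return "UNCONFIRMED"


def scripted(statuses):
    """Build a fake `send` that replays a fixed list of status codes."""
    remaining = iter(statuses)
    return lambda: next(remaining)


# Demo with a no-op sleep so nothing actually waits.
print(call_with_retry(scripted([429, 429, 200]), sleep=lambda seconds: None))
```

Injecting `sleep` keeps the sketch testable without real delays.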

Output Files

Terminal Output (Default)

Human-readable test results with checkmarks and status indicators.

JSON Output (With `-j`)

{
  "endpoint": "https://api.fireworks.ai/inference/v1",
  "model": "accounts/fireworks/models/kimi-k2p5-turbo",
  "tested_at": "2026-04-03T11:31:03",
  "summary": {
    "total_iterations": 24,
    "total_tokens_used": 23632,
    "total_elapsed_seconds": 92.1,
    "rate_limits": {
      "rpm_limit": "60",
      "tpm_limit": "12000"
    }
  },
  "results": [...]
}
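
A saved report can be post-processed with a few lines of Python. The sketch below reads only the `summary` fields shown above. `summarize_run` is a hypothetical helper, and the `results` array is left empty because its structure is not shown here.

```python
import json


def summarize_run(report: dict) -> dict:
    """Derive simple per-run statistics from a report's summary block."""
    summary = report["summary"]
    limits = summary.get("rate_limits", {})
    iterations = summary["total_iterations"]
    elapsed = summary["total_elapsed_seconds"]
    return {
        "avg_tokens_per_iteration": summary["total_tokens_used"] / iterations,
        "requests_per_minute": iterations / elapsed * 60,
        # The sample stores limits as strings, so convert before comparing.
        "rpm_limit": int(limits["rpm_limit"]) if "rpm_limit" in limits else None,
        "tpm_limit": int(limits["tpm_limit"]) if "tpm_limit" in limits else None,
    }


raw = """
{
  "endpoint": "https://api.fireworks.ai/inference/v1",
  "model": "accounts/fireworks/models/kimi-k2p5-turbo",
  "tested_at": "2026-04-03T11:31:03",
  "summary": {
    "total_iterations": 24,
    "total_tokens_used": 23632,
    "total_elapsed_seconds": 92.1,
    "rate_limits": {"rpm_limit": "60", "tpm_limit": "12000"}
  },
  "results": []
}
"""
stats = summarize_run(json.loads(raw))
print(stats)
```

To process a file written with -j, replace `json.loads(raw)` with `json.load` on the opened file.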

Configuration Files

~/.config/inference-tester/config.json

Stores saved endpoints, including their API keys. Keys are masked in the list view.

~/.config/inference-tester/.env

Environment variables for default connection settings.

Providers

Tested providers:

Fireworks AI (full support)
OpenAI (compatible)
Anthropic (compatible)
Other OpenAI API-compatible endpoints

Troubleshooting

"Error: Missing connection parameters"

Pass --api-key, --base-url, and --model on the command line, create a .env file, or use -e with a saved endpoint.

Rate limit errors

The tool retries automatically up to 3 times. If you consistently hit limits, wait 60s between full test runs.

Inconsistent parameter test results

The parameter test makes 10+ calls in rapid succession, so rate limits may cause it to terminate early. If that happens, run it again with delays between attempts.

License

MIT License - Feel free to modify and distribute.
