Heavy-A/inference-tester

inference-tester

A CLI tool for testing OpenAI API-compatible inference endpoints. Discovers actual parameter limits, validates configurability, and analyzes rate limiting behavior.

Features

  • Parameter Testing: Tests which OpenAI API parameters are accepted and actually work
  • Max Tokens Discovery: Aggressively finds streaming and non-streaming limits
  • Rate Limit Analysis: Detects RPM/TPM limits with automatic retry handling
  • Retry Logic: Up to 3 retries with increasing backoff (5s → 10s → 15s) for rate limits
  • Multiple Configuration Sources: Environment variables, .env files, saved endpoints

Installation

# Copy to your PATH
cp inference-tester ~/bin/
chmod +x ~/bin/inference-tester

# Or run it directly from where the repository was cloned (here ~/apps/inference-tester)
~/apps/inference-tester/inference-tester --help

Quick Start

1. Create a .env file

mkdir -p ~/.config/inference-tester
cat > ~/.config/inference-tester/.env << 'EOF'
FIREWORKS_API_KEY=fw_XXX
INFERENCE_BASE_URL=https://api.fireworks.ai/inference/v1
INFERENCE_MODEL=accounts/fireworks/models/kimi-k2p5-turbo
EOF

2. Save an endpoint (optional but recommended)

inference-tester -s fireworks-kimi \
  -u https://api.fireworks.ai/inference/v1 \
  -k fw_XXX \
  -m accounts/fireworks/models/kimi-k2p5-turbo

3. Run tests

# Run all tests
inference-tester -e fireworks-kimi -a

# Run specific test
inference-tester -e fireworks-kimi --test-max-tokens-limits

# Save results to JSON
inference-tester -e fireworks-kimi -a -j

Configuration Sources (Precedence Order, Highest First)

  1. Command line arguments (--api-key, --base-url, --model)
  2. Environment variables (FIREWORKS_API_KEY, INFERENCE_BASE_URL, INFERENCE_MODEL)
  3. .env file (searched in: ./.env, ~/.inference-tester.env, ~/.config/inference-tester/.env)
  4. Saved endpoint (--use-endpoint)
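The precedence order above can be sketched in Python. This is a minimal illustration, not the tool's actual implementation: the helper names (parse_env_file, find_env_file, resolve) are hypothetical, the parser only handles simple KEY=VALUE lines, and the sketch assumes the first .env file found in the search list is the one used.

```python
# Hypothetical sketch of the documented precedence order (not the tool's code).
from __future__ import annotations

from pathlib import Path

# .env search locations, in the order the README lists them.
ENV_FILE_CANDIDATES = [
    Path("./.env"),
    Path.home() / ".inference-tester.env",
    Path.home() / ".config" / "inference-tester" / ".env",
]

# Setting name -> environment variable name (from the README).
ENV_VARS = {
    "api_key": "FIREWORKS_API_KEY",
    "base_url": "INFERENCE_BASE_URL",
    "model": "INFERENCE_MODEL",
}


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_env_file(candidates=ENV_FILE_CANDIDATES) -> Path | None:
    """Return the first existing .env file (assumes the first match wins)."""
    for path in candidates:
        if path.is_file():
            return path
    return None


def resolve(cli: dict, environ: dict, dotenv: dict, saved: dict) -> dict:
    """Pick each setting from the highest-precedence source that defines it."""
    resolved = {}
    for name, env_var in ENV_VARS.items():
        # Order mirrors the README: CLI, environment, .env file, saved endpoint.
        for value in (cli.get(name), environ.get(env_var),
                      dotenv.get(env_var), saved.get(name)):
            if value:
                resolved[name] = value
                break
    return resolved
```

resolve() takes each setting from the first source that defines it, so a --model flag overrides INFERENCE_MODEL while the API key can still come from a .env file.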

Available Tests

Test 1: Basic Connectivity (--test-connectivity)

Verifies that the endpoint responds and that the model follows instructions.

Checks:

  • API connectivity (200 OK)
  • Response time
  • Rate limit headers (RPM/TPM)
  • Instruction following capability

Output:

✓ Status: 200 OK (967ms)
✓ Instruction Following: PASS
RPM Limit: 60 | TPM Limit: 12000
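Reading the RPM/TPM limits amounts to looking up rate-limit response headers. The sketch below assumes OpenAI-style header names (x-ratelimit-limit-requests and x-ratelimit-limit-tokens); the headers this tool actually reads, and the ones a given provider sends, may differ. The function names are hypothetical.

```python
# Hypothetical sketch: extracting RPM/TPM limits from response headers.
from __future__ import annotations

RPM_HEADER = "x-ratelimit-limit-requests"  # assumed (OpenAI-style) header name
TPM_HEADER = "x-ratelimit-limit-tokens"    # assumed (OpenAI-style) header name


def extract_rate_limits(headers: dict[str, str]) -> dict[str, str | None]:
    """Return RPM/TPM limits as strings, or None when a header is absent."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    return {
        "rpm_limit": lowered.get(RPM_HEADER),
        "tpm_limit": lowered.get(TPM_HEADER),
    }


def format_rate_limits(limits: dict[str, str | None]) -> str:
    """Render the limits in the same shape as the terminal output above."""
    rpm = limits["rpm_limit"] or "n/a"
    tpm = limits["tpm_limit"] or "n/a"
    return f"RPM Limit: {rpm} | TPM Limit: {tpm}"
```

Keeping the values as strings matches the JSON output, which stores "60" and "12000" as strings.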

Test 2: Parameter Configurability (--test-params)

Tests which OpenAI API parameters are accepted AND whether they actually affect the output.

Parameters tested:

  • temperature (0.0 vs 1.5)
  • max_tokens (50 vs 200)
  • top_p (0.1 vs 1.0)
  • top_k (1 vs 50; not part of the OpenAI spec, but accepted by some providers)
  • presence_penalty (0.0 vs 2.0)
  • frequency_penalty (incremental: 0.0, 0.5, 1.0, 1.5, 2.0)

Output:

✓ temperature: ACCEPTED, EFFECTIVE (outputs differ)
✓ max_tokens: ACCEPTED, WEAK EFFECT (length differs by 8 chars)
⚠ frequency_penalty: 3/5 values accepted (max 1.0, rejects at 1.5)
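The incremental frequency_penalty check can be pictured as a loop that stops at the first rejected value. In this hypothetical sketch, is_accepted stands in for sending one request with the parameter set and checking for a successful response; the function name and the stop-at-first-rejection behavior are assumptions.

```python
# Hypothetical sketch of an incremental parameter probe.
from __future__ import annotations

from typing import Callable, Sequence


def probe_incremental(
    name: str,
    values: Sequence[float],
    is_accepted: Callable[[float], bool],
) -> str:
    """Try values in ascending order and stop at the first rejection."""
    accepted = []
    rejected_at = None
    for value in values:
        if is_accepted(value):
            accepted.append(value)
        else:
            rejected_at = value
            break
    if rejected_at is None:
        return f"✓ {name}: {len(accepted)}/{len(values)} values accepted"
    highest = accepted[-1] if accepted else "n/a"
    return (
        f"⚠ {name}: {len(accepted)}/{len(values)} values accepted "
        f"(max {highest}, rejects at {rejected_at})"
    )
```

With a stand-in that accepts values up to 1.0, probe_incremental("frequency_penalty", [0.0, 0.5, 1.0, 1.5, 2.0], lambda v: v <= 1.0) produces the warning line shown above.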

Test 3: Max Tokens Limits (--test-max-tokens-limits)

Aggressively probes for the actual max_tokens limits, separately for non-streaming and streaming requests.

Phases:

  1. Non-streaming limit (1024 → 16384)
  2. Streaming baseline (4K-8K)
  3. Aggressive extension (8K → 200K with doubling)
  4. Fine-tuning (8K increments)

With --confirm-limit, each failed request is retried up to 3 times to distinguish temporary rejections (such as rate limiting) from true hard limits.

Output:

Non-Streaming Maximum: 4,096 tokens
Streaming Maximum: 24,576 tokens (or higher)
✓ STREAMING ADVANTAGE: 20,480 more tokens (6.0x multiplier)
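Phases 3 and 4 describe a doubling search followed by fixed-step refinement. The sketch below shows that idea under assumptions: accepts(n) stands in for one request with max_tokens=n (returning True on success), and the 8K start, 200K cap and 8K step are taken from the phase descriptions. It is an illustration, not the tool's code; find_max_tokens and streaming_advantage are hypothetical names.

```python
# Hypothetical sketch of a doubling-then-refine max_tokens search.
from __future__ import annotations

from typing import Callable


def find_max_tokens(
    accepts: Callable[[int], bool],
    start: int = 8192,
    cap: int = 200_000,
    step: int = 8192,
) -> int | None:
    """Return the largest max_tokens value found to be accepted, or None."""
    if not accepts(start):
        return None
    # Extension: keep doubling until a request fails or we would pass the cap.
    last_ok = start
    first_fail = None
    candidate = start * 2
    while candidate <= cap:
        if accepts(candidate):
            last_ok = candidate
            candidate *= 2
        else:
            first_fail = candidate
            break
    if first_fail is None:
        return last_ok  # nothing failed below the cap ("or higher")
    # Fine-tuning: walk up from the last success in fixed steps.
    candidate = last_ok + step
    while candidate < first_fail and accepts(candidate):
        last_ok = candidate
        candidate += step
    return last_ok


def streaming_advantage(non_streaming: int, streaming: int) -> str:
    """Format the extra tokens and ratio gained by streaming."""
    extra = streaming - non_streaming
    return f"{extra:,} more tokens ({streaming / non_streaming:.1f}x multiplier)"
```

With a simulated limit of 24,576, the search tries 8,192, 16,384 and 32,768 (which fails), then refines to 24,576. streaming_advantage reproduces the arithmetic in the output above: 24,576 - 4,096 = 20,480 extra tokens and 24,576 / 4,096 = 6.0.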

Test 4: Actual Output Length (--test-actual-output)

Verifies that the model actually generates the requested number of tokens at the discovered limits.

Tests:

  • Non-streaming at its max (e.g., 4K)
  • Streaming at large scale (e.g., 8K+)

Checks:

  • finish_reason: length (hit limit) vs stop (stopped early)
  • Throughput: tokens/second
  • Streaming chunk count

Command Line Options

Connection Settings

Option              Description                  Default
-u, --base-url      API base URL                 From env/.env
-k, --api-key       API key                      From env/.env
-m, --model         Model ID                     From env/.env
-p, --provider      Provider preset              -
-t, --timeout       Request timeout (seconds)    60
--env-file          Path to .env file            Auto-search

Saved Endpoints

Option                  Description
-s, --save-endpoint     Save as named endpoint
-e, --use-endpoint      Use saved endpoint
-l, --list-endpoints    List saved endpoints

Tests

Option                      Description
-a, --test-all              Run all tests
--test-connectivity         Test connectivity only
--test-params               Test parameter configurability
--test-max-tokens-limits    Test max tokens limits
--test-actual-output        Test actual output length

Test Configuration

Option             Description                         Default
--max-test         Maximum tokens to test              32768
--output-tokens    Tokens for output test              4096
--confirm-limit    Confirm hard limits with retries    Off

Output Options

Option                  Description
-j, --json-output       Save results to JSON (optional filename)
-r, --results-dir       Directory for JSON results (default: CWD)
-P, --parameter-help    Show parameter reference

Examples

Save endpoint and run all tests

inference-tester -s my-fireworks \
  -u https://api.fireworks.ai/inference/v1 \
  -k fw_XXX \
  -m accounts/fireworks/models/kimi-k2p5-turbo

inference-tester -e my-fireworks -a -j

Test with environment variables

export FIREWORKS_API_KEY=fw_XXX
export INFERENCE_MODEL=accounts/fireworks/models/kimi-k2p5-turbo
export INFERENCE_BASE_URL=https://api.fireworks.ai/inference/v1

inference-tester --test-all

Test specific max tokens with confirmation

inference-tester -e fireworks-kimi \
  --test-max-tokens-limits \
  --max-test 65536 \
  --confirm-limit \
  -j

Save results to specific directory

inference-tester -e fireworks-kimi -a -j -r ~/test-results

Retry Strategy

All tests use automatic retry logic:

Rate Limits (429):

  • Always retried (up to 3x)
  • Backoff: 5s → 10s → 15s

Hard Rejections:

  • Only retried with --confirm-limit
  • Backoff: 3s → 6s → 9s

Results:

  • ✓ OK: Success on first try
  • ✓ ACCEPTED (retry): Success after rate limit
  • ⚠ UNCONFIRMED: Failed all retries (may be temporary)
  • ✗ HARD LIMIT CONFIRMED: True provider limit
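The retry policy can be sketched as follows. Assumptions: send() is a hypothetical callable that performs one request and returns its HTTP status, and the mapping from final outcome to the labels above is inferred from their descriptions rather than taken from the tool's source. REJECTED is a hypothetical label for a hard rejection seen without --confirm-limit. The sleep function is injectable so the backoff schedule can be tested without waiting.

```python
# Hypothetical sketch of the documented retry policy (not the tool's code).
from __future__ import annotations

import time
from typing import Callable

RATE_LIMIT_BACKOFF = [5, 10, 15]   # seconds, for HTTP 429 (always retried)
HARD_REJECT_BACKOFF = [3, 6, 9]    # seconds, only with --confirm-limit


def call_with_retries(
    send: Callable[[], int],
    confirm_limit: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run send() (returns an HTTP status) and classify the final outcome."""
    status = send()
    if status == 200:
        return "OK"
    rate_limited = status == 429
    if not rate_limited and not confirm_limit:
        return "REJECTED"  # hypothetical label: hard rejections are not retried by default
    schedule = RATE_LIMIT_BACKOFF if rate_limited else HARD_REJECT_BACKOFF
    for delay in schedule:
        sleep(delay)
        status = send()
        if status == 200:
            return "ACCEPTED (retry)"
    # Persistent 429s may be temporary; persistent hard rejections are treated as real limits.
    return "UNCONFIRMED" if rate_limited else "HARD LIMIT CONFIRMED"
```

The schedule is picked from the first failure only; a real client might switch schedules if the status code changes between attempts.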

Output Files

Terminal Output (Default)

Human-readable test results with checkmarks and status indicators.

JSON Output (With -j)

{
  "endpoint": "https://api.fireworks.ai/inference/v1",
  "model": "accounts/fireworks/models/kimi-k2p5-turbo",
  "tested_at": "2026-04-03T11:31:03",
  "summary": {
    "total_iterations": 24,
    "total_tokens_used": 23632,
    "total_elapsed_seconds": 92.1,
    "rate_limits": {
      "rpm_limit": "60",
      "tpm_limit": "12000"
    }
  },
  "results": [...]
}
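The JSON file is convenient for post-processing. This hypothetical helper reads only the summary fields shown above (the results array is abbreviated in the example, so its structure is not assumed) and derives an overall tokens-per-second figure. Note that the rate limits are stored as strings.

```python
# Hypothetical sketch: summarizing a saved JSON results file.
from __future__ import annotations

import json


def summarize_results(text: str) -> str:
    """Build a one-line summary from the "summary" section of a results file."""
    data = json.loads(text)
    summary = data["summary"]
    tokens = summary["total_tokens_used"]
    seconds = summary["total_elapsed_seconds"]
    limits = summary.get("rate_limits", {})
    # Rate limits are strings in the example output, so convert them to ints.
    rpm = int(limits["rpm_limit"]) if limits.get("rpm_limit") else None
    tpm = int(limits["tpm_limit"]) if limits.get("tpm_limit") else None
    rate = tokens / seconds if seconds else 0.0
    return (
        f"{data['model']}: {summary['total_iterations']} iterations, "
        f"{tokens:,} tokens in {seconds:.1f}s ({rate:.1f} tokens/s), "
        f"RPM={rpm}, TPM={tpm}"
    )
```

For the example file, that works out to 23,632 tokens / 92.1 s, or about 256.6 tokens/s.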

Configuration Files

~/.config/inference-tester/config.json

Stores saved endpoints, including their API keys (masked when listed with -l).
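The exact masking format used when listing endpoints is not documented. A common approach, shown here as a hypothetical sketch (mask_api_key is not part of the tool), keeps a short prefix and suffix of the key visible and hides the rest, masking short keys entirely.

```python
# Hypothetical sketch of API key masking for display (format is an assumption).
def mask_api_key(key: str, visible: int = 4) -> str:
    """Show only the first and last `visible` characters of a key."""
    if len(key) <= visible * 2:
        # Too short to reveal anything safely; mask every character.
        return "*" * len(key)
    return f"{key[:visible]}...{key[-visible:]}"
```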

~/.config/inference-tester/.env

Environment variables for default connection settings.

Providers

Tested providers:

  • Fireworks AI (full support)
  • OpenAI (compatible)
  • Anthropic (compatible)
  • Other OpenAI API-compatible endpoints

Troubleshooting

"Error: Missing connection parameters"

Create a .env file or use -e with a saved endpoint.

Rate limit errors

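The tool automatically retries rate-limited requests up to 3 times. If you still hit limits consistently, wait 60 seconds between full test runs so the per-minute (RPM/TPM) quotas can reset.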

Inconsistent parameter test results

The parameter test makes 10+ rapid calls, so rate limits may end it early. Leave a delay between attempts.

License

MIT License - Feel free to modify and distribute.

About

A tool that tests inference endpoints and models for their actual capabilities.
