Skip to content

gowtham0992/redline

Repository files navigation

redline

Catch prompt regressions before they ship.

Automatic eval suites from the prompt logs you already have.

redline turns real prompt-response logs into local regression tests. It selects representative cases, replays your changed prompt, and shows the behavioral diff before a bad prompt reaches users.

Website · Docs · MCP · MCP Registry · Security · License

CI GitHub Pages PyPI MCP Registry License: MIT Stars

Start Here

Install from PyPI:

python -m pip install redline-ai

Run the public proof:

redline demo --public --compact

The demo catches ten synthetic regressions without API keys, private logs, a cloud account, or an LLM judge. It writes JSON, Markdown, and self-contained HTML reports under .redline/demo.

Open the local demo report index:

redline dashboard --reports-dir .redline/demo/reports --open

On headless CI or remote shells, skip --open and use the printed HTML path:

redline dashboard --reports-dir .redline/demo/reports --out .redline/dashboard.html
First-run troubleshooting
  • redline: command not found: run python -m pip install redline-ai, then confirm python -m pip show redline-ai.
  • Dashboard did not open: use --out .redline/dashboard.html and open or upload that file from your environment.
  • Suite not found: run redline suite logs/baseline.jsonl --out redline-suite.json.
  • Validation failed: run redline validate redline-suite.json --strict and fix the first reported error.
  • GitHub Action cannot find a suite: commit redline-suite.json or point the action suite input at your prompt manifest.

Full guide: docs/troubleshooting.md.

redline product demo

Product Proof

These are real local artifacts generated by redline demo --public --compact.

Dashboard HTML report
redline dashboard showing reports, benchmark evidence, history, and ship readiness redline HTML report showing concrete regression reasons and side-by-side baseline and candidate outputs

What Is redline?

redline is an open-source, local-first eval tool for AI teams. It uses logs you already have: prompts, outputs, support tickets, traces, model responses, and production JSONL exports.

Instead of asking you to hand-write evals first, redline generates the first suite from real behavior. You can then run that suite every time a prompt, model, or runner changes.

No cloud account is required. No manual test writing is required. No LLM judge is required for the core regression signal. The package has zero runtime dependencies, which keeps installs fast and the default supply-chain surface small.

How It Works

redline gives you three primitives that cover the prompt-regression loop:

1. Logs

Start with prompt-response data you already have. Import JSONL, convert exports from tools like Langfuse or Helicone, capture OpenAI/Anthropic SDK calls, or add bounded FastAPI/ASGI middleware.

redline import downloaded.jsonl --input-field instruction --output-field response --out logs/baseline.jsonl
redline suite logs/baseline.jsonl --out redline-suite.json
redline cases redline-suite.json

2. Suite

redline groups behavior into deterministic signatures and selects representative cases first. You can add pinned edge cases and explicit requirements when a scenario must never be missed.

redline cases redline-suite.json
redline suite add redline-suite.json --prompt "..." --response "..."

3. Eval

Replay a changed prompt or compare candidate outputs. redline names the behavior that broke: missing JSON keys, URLs, numbers, tables, code blocks, refusals, empty answers, or requirement failures.

redline eval --prompt prompts/v2.txt
redline diff redline-suite.json logs/candidate.jsonl

Product Promise

In under five minutes, on a real prompt log, redline should catch one regression you did not want to ship.

That promise is intentionally narrow. redline is not a hosted eval platform, a generic score, or a replacement for human judgment. It is the local safety loop between "I changed the prompt" and "this is safe enough to merge."

Real Workflow

Build a suite from baseline logs:

redline suite logs/baseline.jsonl --out redline-suite.json

Evaluate a changed prompt file through your configured runner:

redline eval --prompt prompts/v2.txt

Or compare candidate outputs you already generated:

redline diff redline-suite.json logs/candidate.jsonl

When redline finds a blocking change, it exits non-zero for CI and prints the reason:

REGRESSION case_004
- candidate missing JSON keys: owner, required_action
- candidate missing URL: https://example.com/policies/refunds

Confidence: HIGH | fix blocking cases before shipping

What redline Catches

Signal Example regression
JSON validity and keys Candidate stops returning valid JSON or drops owner.
Tables, lists, and code blocks Markdown table becomes prose; code fence disappears.
Numbers, URLs, and entities Refund window, ticket ID, policy URL, or owner is missing.
Empty outputs and refusals Candidate newly refuses a safe task or returns nothing.
Content drift Same-shape response changes substantially.
Explicit requirements Pinned cases require or forbid exact strings.

redline is deterministic and local-first by default. Optional judge commands are available for ambiguous changed cases, but redline does not call a cloud model unless you explicitly configure that command.

Suite generation groups logs by deterministic behavior signatures, not opaque embedding clusters. It picks one representative per group first, then adds high-variance edges and evenly spread prompt-diverse samples from large clusters when the case budget allows.

Trust Boundary

A green redline run means no configured high-signal structural blockers were found. It does not prove factual correctness, tone, hallucination safety, policy compliance, or subtle reasoning quality.

That boundary is visible in CLI output and reports because over-trusting eval tools is dangerous. Each reported case includes a confidence and signal (structural, shallow_semantic, requirement, judge, or human_judgment) so reviewers can see why redline is making the call. Use requirements or an optional judge for semantic risks that structural checks cannot prove.

Product Surface

redline is built around the full prompt-regression loop:

  • redline watch: collect prompt-response observations from logs, Python functions, OpenAI/Anthropic-compatible SDK calls, or ASGI apps, with best-effort common secrets and PII redacted before write by default.
  • RedlineMiddleware: capture bounded JSON FastAPI or ASGI request/response pairs locally, with optional skip diagnostics.
  • redline redact --check: scan logs for common secrets and PII, then write a scrubbed copy when needed. Redaction is best-effort pattern matching, not a privacy boundary; review sensitive logs before sharing.
  • redline cluster: inspect behavior groups before suite generation.
  • redline suite: generate a representative eval suite from baseline logs.
  • redline prompts: scan many prompt files and write or check a versionable prompt-to-suite manifest. Add --check-suites in CI when every prompt should already have a built and valid suite.
  • redline suite add: pin hand-picked edge cases the algorithm should never miss.
  • redline budget / redline benchmark: estimate suite or prompt-manifest runtime without executing replay commands, write budget artifacts, and optionally fail on a CI time budget. Add --measure-local to time redline's deterministic local diff work on your suite baselines without calling a model.
  • redline eval: replay each suite case through your local app or model runner.
  • redline diff: compare candidate JSONL outputs against the suite baseline.
  • redline mark and redline accept: review intentional changes and promote the new baseline.
  • redline require: add deterministic must-include or must-not-include rules.
  • redline audit --verify: inspect the local audit trail and verify the hash chain. Add --expect-last-hash or --expect-entries when you want to prove the local log tail still matches a checkpoint from CI or release evidence. Add --out-checkpoint .redline/audit-checkpoint.json to persist that evidence, then --checkpoint .redline/audit-checkpoint.json to verify against it later.
  • redline sbom: write CycloneDX SBOM release evidence for security review.
  • redline history, redline compare, and redline dashboard: track quality over time and inspect reports locally. The dashboard surfaces feature-level rollups, prompt-level eval rows, benchmark evidence, and a latest-report review queue when reports come from a prompt manifest. It also warns when reports exist without benchmark evidence from the same project.
  • redline summary: inspect suite readiness, or pass redline-prompts.json to roll up multi-prompt suite coverage, owners, requirements, and missing suites.
  • redline-mcp: let AI coding assistants run checks inside Claude, Codex, Cursor, Kiro, or any MCP client.

For repos with many prompt files, the manifest becomes the eval plan:

redline prompts prompts/ --suite-dir suites --out redline-prompts.json
redline prompts prompts/ --suite-dir suites --out redline-prompts.json --check --check-suites
redline summary redline-prompts.json
redline validate redline-prompts.json --strict
redline budget redline-prompts.json
redline eval redline-prompts.json

Manifest summaries show readiness across every mapped suite, manifest validation checks every mapped suite, manifest benchmarks aggregate runtime budget, and manifest evals print prompt-level rollups before case details. Large repos can see which prompt files or feature folders need attention first.

When mapped suites are valid, the check prints ready commands such as:

redline eval suites/support/triage.redline-suite.json --prompt prompts/support/triage.txt

Connect Your App

Any command that reads a prompt from stdin and prints a response to stdout can be a redline runner:

redline init --runner stdio --copy-runner --github-action

Built-in adapters cover provider-neutral stdio, OpenAI, Anthropic, LiteLLM, HTTP APIs, Python chains, JSONL log imports, and OpenAI/Anthropic SDK capture:

redline runners
redline runners --copy all

Runner details live in docs/runners.md. Log import and SDK capture adapters are for building suites from real observations, not for redline eval replay. The JSONL log adapter includes Langfuse, Helicone, LangSmith, and Braintrust presets for exported observability logs.

AI Assistant Native

redline ships a local Model Context Protocol server:

redline-mcp

Use docs/mcp.md to wire redline into an MCP client. The MCP surface exposes safe capture-readiness, privacy, audit, scale, read, case-inspection, eval, and report tools plus workflow prompts like setup_redline_project, check_prompt_change, build_suite_from_logs, and review_candidate_outputs. It can also list or copy runner adapters and optional judge templates during setup. The only mutating MCP tool is guarded: redline_mark requires allow_write: true and a note before it records an intentional case judgment. Baseline promotion stays CLI-only.

CI And GitHub

Create config plus a GitHub Actions workflow:

redline init --runner stdio --copy-runner --github-action

Use redline as a composite GitHub Action from another repo:

- uses: gowtham0992/redline@v0.2.1
  with:
    prompt-path: prompts/v2.txt
    benchmark-max-seconds: "300"

For multi-prompt repos, point suite at redline-prompts.json. The action checks every mapped suite with redline prompts --check --check-suites, runs a manifest-wide benchmark, then runs the manifest eval.

The action writes JSON, full Markdown, concise PR-comment Markdown, HTML, JUnit, Slack-ready JSON, history, dashboard, and audit checkpoint artifacts under .redline/, appends benchmark, concise eval, and trend summaries to the GitHub step summary, and exits with the eval gate status. Set benchmark-max-seconds when a suite should fail CI if its worst-case runtime budget grows too far.

Reports

Every diff and eval run can write:

  • JSON for machines and dashboards
  • full Markdown for detailed summaries, including prompt-manifest rollups
  • concise PR-comment Markdown for merge-review surfaces
  • self-contained HTML for side-by-side inspection, including feature and prompt eval tables
  • JUnit XML for CI test reporting
  • Slack Block Kit JSON for CI bots or webhook integrations you control
  • GitHub annotations for changed or blocking cases

Example:

redline diff redline-suite.json logs/candidate.jsonl \
  --out-json .redline/reports/diff.json \
  --out-md .redline/reports/diff.md \
  --out-comment .redline/reports/diff-comment.md \
  --out-html .redline/reports/diff.html \
  --out-junit .redline/reports/diff.xml \
  --out-slack .redline/reports/diff.slack.json

Optional Judges

Use judges only where structural checks are not enough. redline sends only ambiguous changed cases to the configured command as JSON on stdin:

redline judges
redline judges --copy openai
redline judges --copy support-rubric
redline diff logs/candidate.jsonl --judge "python examples/judge_changed.py"

Repo examples and installable templates:

Calibration guidance lives in docs/judges.md.

Config

redline init writes redline.json with a $schema reference for editor autocomplete. Important keys:

Key Purpose
suite Suite baseline path, default redline-suite.json.
input_field, output_field JSONL field paths for prompts and responses.
max_cases Maximum representative cases selected for a suite.
replay Command used by eval; prompts go to stdin by default. {prompt} is for small legacy argv runners; {prompt_file} passes a temporary rendered-prompt file path.
workers Number of replay cases to run concurrently.
owners Optional pattern-to-owner rules so regressions show the responsible team.
approval Optional local guardrail; require_approver makes accept record an approver.
fail_on Statuses that fail diff or eval; use "none" for report-only setup.
reports JSON, Markdown, PR-comment Markdown, HTML, JUnit, and Slack-ready JSON output paths.
logs Observed prompt-response log path and optional middleware skip diagnostics path.
audit Append-only JSONL audit log path for evals, judgments, requirements, and accepted baselines. New entries include operator/approver context plus a local hash chain that redline audit --verify can check; use expected hash/count checkpoints or --out-checkpoint evidence files to detect tail truncation.
judge Optional command for ambiguous changed cases.

Check setup before relying on a suite:

redline doctor --strict
redline validate redline-suite.json --strict
redline summary redline-suite.json

doctor shows whether the suite has explicit requirements or recorded judgments before you rely on structural checks in CI. summary reports cluster/case coverage, owner coverage, accepted baseline history, approver coverage, and explicit guard coverage for cases with requirements or recorded judgments so teams can review suite readiness before CI. dashboard also shows audit checkpoint evidence when .redline/audit-checkpoint.json is present.

Dogfood Assets

The public fixture is synthetic, shaped after public instruction/chat dataset patterns, and documented in examples/public_dogfood_sources.md.

python -m redline suite examples/public_dogfood_baseline.jsonl --out /tmp/redline-public-suite.json --all-cases
python -m redline diff /tmp/redline-public-suite.json examples/public_dogfood_candidate.jsonl --compact --fail-on none

For AI-assistant session dogfood, use docs/ai-session-dogfood-prompts.jsonl and normalize raw exports with scripts/normalize_ai_session_logs.py. Reproducible dogfood case studies live in docs/case-studies.md. Public dataset candidates for internet dogfood are ranked in docs/internet-dogfood-sources.md.

From a repo checkout, record the public demo:

bash scripts/demo_terminal.sh
bash scripts/demo_gif.sh .redline/launch .redline/launch/redline-demo.gif

Development

python -m pip install -e ".[dev]"
python -m pytest -q
python -m ruff check .
python -m mypy redline tests scripts examples

Before cutting a release or asking someone else to try a branch:

bash scripts/release_check.sh

Project Docs

Website source for GitHub Pages lives in site/ and deploys from the committed static assets on main.