
Releases: sauravvenkat/forkline

v0.5.0 — CI Integration

27 Feb 23:42
fb0c7f4


Forkline v0.5.0 Release Notes — CI Integration

Release: v0.5.0
Date: 2026-02-25
Milestone: v0.5 — CI + Test Harness (Roadmap item 5 of 5)


Summary

This release delivers CI integration: a deterministic, offline, build-failing diff system that lets teams gate merges on behavioral identity. If an agent's output changes, the build fails — with a clear diff, a machine-readable exit code, and a suggested fix.

This completes the v0 roadmap. Forkline now covers the full loop: record → replay → diff → CI gate.


What's New

1. forkline ci Command Suite (forkline/ci/commands.py)

Five new subcommands purpose-built for CI pipelines:

Command                Purpose
---------------------  --------------------------------------------------
forkline ci record     Run a script, produce a normalized JSON artifact
forkline ci replay     Validate an artifact's schema and structure offline
forkline ci diff       Compare two artifacts, exit 1 on divergence
forkline ci check      All-in-one: record actual, diff against expected
forkline ci normalize  Strip timestamps and metadata for stable diffs

ci check is the primary CI command — a single call that records behavior, normalizes the output, and diffs against a committed baseline:

forkline ci check \
  --entrypoint examples/my_flow.py \
  --expected tests/testdata/my_flow.run.json \
  --offline

Exit 0 means identical behavior. Exit 1 means the build should fail.


2. Offline Enforcement (forkline/ci/offline.py)

A hard no-network guarantee for CI runs. When --offline is set (or FORKLINE_OFFLINE=1), Forkline monkeypatches socket.connect, socket.create_connection, and socket.getaddrinfo to raise ForklineOfflineError immediately.

Properties:

  • Fail-closed. Network calls error instantly — no hangs, no timeouts.
  • Deterministic. Same call always produces the same error message.
  • Scoped. offline_context() restores normal access on exit.
  • Cross-library. Blocks requests, httpx, urllib3, and any library built on socket.

from forkline.ci.offline import offline_context

with offline_context():
    # Any network call raises ForklineOfflineError
    requests.get("https://api.example.com")  # raises immediately
# Normal access restored here

3. Artifact Normalization (forkline/ci/normalize.py)

Strips unstable fields from artifacts so that recordings made at different times, on different machines, produce identical diffs when behavior is the same.

Normalized by default:

  • Timestamp fields (ts, started_at, ended_at, created_at) → 2000-01-01T00:00:00+00:00
  • Platform metadata (python_version, platform, cwd) → removed
  • Events sorted by event_id

Preserved:

  • Event types and payloads (the behavioral data)
  • Schema version and entrypoint path

Normalization is applied automatically by ci record and ci diff. It can also be run explicitly:

forkline ci normalize artifact.run.json --out normalized.run.json
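The normalization rules above can be approximated in a few lines. This is an illustrative sketch, not the library code — field names mirror the defaults listed above, but the real implementation lives in forkline/ci/normalize.py:

```python
import copy

# Sketch of the default normalization rules; names mirror the docs above.
SENTINEL_TS = "2000-01-01T00:00:00+00:00"
TIMESTAMP_KEYS = {"ts", "started_at", "ended_at", "created_at"}
METADATA_KEYS = {"python_version", "platform", "cwd"}

def normalize_artifact(artifact: dict) -> dict:
    out = copy.deepcopy(artifact)  # never mutate the caller's dict

    def scrub(node):
        if isinstance(node, dict):
            for key in list(node):
                if key in METADATA_KEYS:
                    del node[key]             # strip platform metadata
                elif key in TIMESTAMP_KEYS:
                    node[key] = SENTINEL_TS   # pin timestamps to the sentinel
                else:
                    scrub(node[key])
        elif isinstance(node, list):
            for item in node:
                scrub(item)

    scrub(out)
    # Stable event order regardless of recording interleaving
    out["events"] = sorted(out.get("events", []), key=lambda e: e["event_id"])
    return out
```

Because the output depends only on the behavioral fields, two recordings of the same behavior normalize to byte-identical JSON.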

4. Exit Code Contract (forkline/ci/exitcodes.py)

A strict, stable exit code contract for CI automation. These values will not change across releases.

Code  Constant                Meaning
----  ----------------------  -----------------------------------
0     EXIT_SUCCESS            Success, no diff
1     EXIT_DIFF_DETECTED      Diff detected — fail the build
2     EXIT_USAGE_ERROR        Bad args, missing file
3     EXIT_REPLAY_FAILED      Script failed, runtime exception
4     EXIT_OFFLINE_VIOLATION  Network attempted in offline mode
5     EXIT_ARTIFACT_ERROR     Cannot parse artifact, schema error
6     EXIT_INTERNAL_ERROR     Unexpected bug

Every exit code is exercised by a dedicated test.
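Because the codes are stable, pipeline scripts can branch on them directly. The mapping below restates the contract; the wrapper and its log messages are illustrative, not part of the forkline API:

```python
import subprocess
import sys

# Restates the stable exit code contract from forkline/ci/exitcodes.py.
EXIT_MESSAGES = {
    0: "success, no diff",
    1: "diff detected -- fail the build",
    2: "bad args or missing file",
    3: "script failed with a runtime exception",
    4: "network attempted in offline mode",
    5: "cannot parse artifact / schema error",
    6: "unexpected internal error",
}

def describe_exit(code: int) -> str:
    """Map a forkline ci exit code to a human-readable log line."""
    return EXIT_MESSAGES.get(code, f"unknown exit code {code}")

if __name__ == "__main__":
    # Illustrative invocation; paths are placeholders.
    result = subprocess.run(
        ["forkline", "ci", "check",
         "--entrypoint", "examples/my_flow.py",
         "--expected", "tests/testdata/my_flow.run.json",
         "--offline"],
    )
    print(describe_exit(result.returncode))
    sys.exit(result.returncode)
```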


5. Python Test Helper (forkline/testing.py)

A one-line API for snapshot-style testing of agentic workflows:

from forkline.testing import assert_no_diff

def test_my_flow():
    assert_no_diff(
        entrypoint="examples/my_flow.py",
        expected_artifact="tests/testdata/my_flow.run.json",
        offline=True,
    )

On failure, raises ArtifactDiffError with:

  • First divergent event index
  • Expected vs actual payloads
  • Structured diff for programmatic inspection
  • Suggested re-record command

6. Diff Output (forkline/ci/commands.py)

Diff output is concise in text mode and machine-readable in JSON mode.

Text mode:

DIFF: First divergence at event[1] (type: output)
  $.answer: "4" -> "5"
  $.note: <missing> -> "wrong!"

Suggested fix: Re-record baseline: forkline ci record --entrypoint <script> --out <path>

JSON mode:

{
  "identical": false,
  "first_divergent_index": 1,
  "event_type": "output",
  "expected": {"type": "output", "payload": {"answer": "4"}},
  "actual": {"type": "output", "payload": {"answer": "5", "note": "wrong!"}},
  "payload_diff": [
    {"op": "add", "path": "$.note", "value": "wrong!"},
    {"op": "replace", "path": "$.answer", "old": "4", "new": "5"}
  ],
  "suggestion": "Re-record baseline: forkline ci record --entrypoint <script> --out <path>"
}
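Since payload_diff uses a JSON-Patch-like op format, scripts can act on the report without string parsing. A sketch that condenses a `--json` report into one line (the function name is illustrative; the report shape matches the JSON above):

```python
import json

def summarize_diff(report_json: str) -> str:
    """Turn a `forkline ci diff --json` report into a one-line summary."""
    report = json.loads(report_json)
    if report["identical"]:
        return "identical"
    # Each payload_diff entry carries an op and a JSONPath-style location
    ops = ", ".join(
        f'{entry["op"]} {entry["path"]}' for entry in report["payload_diff"]
    )
    return (f'diverged at event[{report["first_divergent_index"]}] '
            f'({report["event_type"]}): {ops}')
```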

7. CLI Integration (forkline/cli/__init__.py)

The ci subcommand is wired into the existing forkline CLI with full argparse integration:

$ forkline ci --help
usage: forkline ci [-h] {record,replay,diff,check,normalize} ...

Commands for CI/CD pipelines: record baselines, replay artifacts,
diff for behavioral changes, and gate merges on behavioral identity.

All subcommands support --help and produce meaningful error messages on invalid usage.


8. Documentation (docs/ci.md)

Full CI guide covering:

  • Quick start (3-command workflow)
  • All commands with flags and descriptions
  • Offline mode details
  • Artifact normalization behavior
  • Recommended repo layout (tests/testdata/*.run.json)
  • Re-recording baselines
  • Python test helper usage
  • GitHub Actions example (copy-paste ready)
  • Programmatic API usage

9. Examples

Four new runnable examples:

File                                Demonstrates
----------------------------------  ------------------------------------------------------------------
examples/ci_record_and_diff.py      Record baseline, diff identical and changed behavior, JSON output
examples/ci_check_gate.py           All-in-one build gate with ci_check
examples/ci_offline_enforcement.py  Offline mode blocking all network calls
examples/ci_test_helper.py          assert_no_diff for pytest/unittest

Tests

60 new tests in tests/unit/test_ci.py. All hermetic — no network, no external dependencies.

TestExitCodes (2 tests)

Test Validates
test_exit_code_values Each code has its documented integer value
test_all_distinct All 7 codes are unique

TestOfflineMode (7 tests)

Test Validates
test_offline_context_blocks_socket socket.connect raises ForklineOfflineError
test_offline_blocks_create_connection socket.create_connection blocked
test_offline_blocks_getaddrinfo DNS resolution blocked
test_offline_error_is_deterministic Same call → same error message
test_offline_restores_after_context Normal access restored on exit
test_enable_disable_idempotent Double-enable/disable is safe
test_offline_error_attributes Error has operation attribute, includes FORKLINE_OFFLINE

TestNormalization (9 tests)

Test Validates
test_timestamps_normalized All timestamp fields → sentinel
test_metadata_stripped Platform metadata removed
test_metadata_preserved_when_disabled Opt-out works
test_timestamps_preserved_when_disabled Opt-out works
test_events_ordered_by_event_id Stable sort
test_normalize_ids IDs replaced with sequential values
test_normalize_deterministic Same input → same output
test_normalize_json_roundtrip JSON string → normalize → parse
test_original_not_mutated Input dict unchanged

TestCIRecord (6 tests)

Test Validates
test_record_success Produces valid artifact with schema_version and events
test_record_missing_entrypoint Exit 2
test_record_failed_script Exit 3
test_record_creates_directories Nested output path created
test_record_artifact_is_normalized Timestamps are sentinel values
test_record_deterministic Two recordings → identical normalized output

TestCIReplay (6 tests)

Test Validates
test_replay_valid_artifact Exit 0, JSON output with status/event_count
test_replay_missing_file Exit 2
test_replay_invalid_json Exit 5
test_replay_missing_schema_version Exit 5
test_replay_strict_empty_payload Exit 5 in strict mode
test_replay_not_strict_allows_empty_payload Exit 0 in default mode

TestCIDiff (9 tests)

Test Validates
test_identical_artifacts Exit 0, "No differences" text
test_different_artifacts Exit 1, "DIFF" in output
test_diff_json_format JSON output with first_divergent_index and suggestion
test_diff_json_identical JSON output with identical: true
test_diff_missing_expected / test_diff_missing_actual Exit 2
test_diff_event_count_mismatch Exit 1, "mismatch" in output
test_diff_bad_json Exit 5
test_diff_normalizes_timestamps Artifacts at different times still match

TestCINormalize (4 tests)

Test Validates
test_normalize_in_place Overwrites file, timestamps normalized
test_normalize_to_new_path Writes to separate output
test_normalize_missing_file Exit 2
test_normalize_bad_json Exit 5

TestCICheck (4 ...


v0.4.2 - Tool Invocation Recording + Deterministic Redaction

23 Feb 23:54
236d6a9


Forkline v0.4.2 Release Notes

Tool Invocation Recording + Deterministic Redaction

Agents without tool visibility are blind. Agents with unsafe logs are
unusable in real systems. This release solves both: tool calls are now
first-class events, and sensitive data is redacted deterministically
before anything touches disk.


New: Tool Call Events

Every tool invocation — DB queries, API calls, file operations — is
now recorded as a tool_call event in the run artifact with a
canonical schema:

{
  "tool_name": "http.request",
  "invocation_id": "a1b2c3...",
  "request": { "url": "https://api.example.com" },
  "response": { "status": 200 },
  "error": null,
  "timing": {
    "started_at": "2026-02-23T10:00:00Z",
    "ended_at": "2026-02-23T10:00:00.250Z",
    "duration_ms": 250.0
  },
  "metadata": { "bytes_read": 1024, "cache_hit": false }
}

Three ways to record:

# Context manager — full control over request/response/metadata
with ToolCallRecorder(recorder, run_id, "http.get") as tc:
    tc.set_request({"url": "https://api.example.com"})
    resp = requests.get("https://api.example.com")
    tc.set_response({"status": resp.status_code})
    tc.set_metadata({"bytes_read": len(resp.content)})

# Decorator — wrap existing functions
@record_tool_call(recorder, run_id, "db.query")
def query_db(sql):
    return db.execute(sql).fetchall()

# Convenience method — manual construction
recorder.log_tool_call(
    run_id=run_id,
    tool_name="file.read",
    request={"path": "/tmp/data.txt"},
    response={"content": "hello"},
)

Replay integration: ToolCallRecorder enforces determinism
guardrails — live tool calls are blocked during replay mode. Set
allow_in_replay=True for re-execution scenarios.

New: Regex-Based Redaction

The redaction engine now supports three matching strategies:

Strategy           Matches on                                    Example
-----------------  --------------------------------------------  -----------------------------
Key-based          Dict key names (substring, case-insensitive)  password, api_key
Path-based         Dot-separated paths                           headers.authorization
Regex-based (new)  String values anywhere in the payload         JWTs, Bearer tokens, AWS keys

Default policy now includes regex rules that catch secrets embedded in
string values, not just key names:

eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOi...  →  [REDACTED:jwt]
Bearer sk-12345abcdef                 →  Bearer [REDACTED]
AKIAIOSFODNN7EXAMPLE                 →  [REDACTED:aws_key]
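Applying regex rules to string values (rather than key names) amounts to a recursive walk over the payload. The JWT pattern below is the one from the config example further down; the walk itself is an illustrative sketch, not the library internals:

```python
import re

# Illustrative rule using the JWT pattern from the config example below.
JWT_RULE = (
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "[REDACTED:jwt]",
)

def redact_values(value, rules=(JWT_RULE,)):
    """Apply regex rules to string values anywhere in a nested payload."""
    if isinstance(value, str):
        for pattern, replacement in rules:
            value = pattern.sub(replacement, value)
        return value
    if isinstance(value, dict):
        # Sorted traversal keeps output independent of construction order
        return {k: redact_values(value[k], rules) for k in sorted(value)}
    if isinstance(value, list):
        return [redact_values(v, rules) for v in value]
    return value
```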

New: Redaction Config Files

Configure redaction via YAML or JSON instead of code:

# forkline.redact.yaml
fields:
  redact_keys:
    - password
    - token
    - api_key
  redact_paths:
    - "headers.Authorization"
  redact_regex:
    - name: jwt
      pattern: "eyJ[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+"
      replacement: "[REDACTED:jwt]"
    - name: connection_string
      pattern: "://[^:]+:[^@]+@"
      replacement: "://[REDACTED:credentials]@"

forkline run my_agent.py --redact-config forkline.redact.yaml

JSON configs work with zero extra dependencies. YAML requires pyyaml.

Improved: Determinism Guarantees

  • Dict keys are now traversed in sorted order during redaction,
    eliminating dependence on construction order
  • Hash action uses json.dumps(sort_keys=True) for stable dict hashing
  • Custom replacement strings (e.g. [REDACTED:jwt]) for traceability
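The stable-hash guarantee follows directly from serializing with sorted keys before hashing. A minimal sketch, assuming SHA-256 (the function name is illustrative, not the library API):

```python
import hashlib
import json

def hash_payload(value) -> str:
    """Hash a payload deterministically: sorted keys make dicts with the
    same content but different construction order hash identically."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```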

New API Surface

Symbol                       Module                     Description
---------------------------  -------------------------  ----------------------------------------
ToolCallRecorder             forkline.core.tool_call    Context manager for recording tool calls
record_tool_call             forkline.core.tool_call    Decorator for wrapping functions
ToolCallPayload              forkline.core.tool_call    Canonical tool_call event schema
ToolCallTiming               forkline.core.tool_call    Timing fields dataclass
RegexRedactionRule           forkline.core.redaction    Regex-based value redaction rule
RedactionConfig              forkline.core.redaction    Config object for loading from files
load_redaction_config()      forkline.core.redaction    Load config from YAML/JSON
RunRecorder.log_tool_call()  forkline.storage.recorder  Convenience method
RunRecorder.with_config()    forkline.storage.recorder  Factory with config file
--redact-config              CLI                        Flag on forkline run

Examples

  • examples/tool_call_basic.py — All three recording APIs
  • examples/tool_call_production.py — Full agentic workflow: LLM
    planning, DB query, HTTP webhook with JWT auth, file write, failing
    upstream call. Demonstrates redaction of connection strings, Bearer
    tokens, cookies, API keys, and JWTs with zero secrets in stored
    artifacts.

Test Coverage

58 new tests (316 total), covering:

  • Tool call payload serialization, roundtrip, JSON export/import
  • Context manager timing, error capture, metadata, unique invocation IDs
  • Decorator captures calls, dict returns, exceptions
  • End-to-end "no raw secrets persisted" verification
  • Replay mode guardrails (blocks live calls, allows re-exec)
  • Regex redaction: JWT, Bearer, AWS keys, nested structures
  • Sorted key traversal determinism across construction orders
  • Config loading from JSON, rule order stability
  • Hash determinism for dicts with different key order

Documentation

  • docs/tool_visibility.md — Event schema, recording APIs, replay
    integration, event ordering contract
  • docs/redaction.md — Matching strategies, rule application order,
    config format, default policy reference, determinism guarantees,
    redaction pipeline diagram

Migration Notes

  • No breaking changes. All existing APIs and stored artifacts
    continue to work unchanged.
  • Default redaction policy expanded. The default policy now includes
    passwd as a key pattern and 3 regex rules (jwt, bearer, aws_key).
    Payloads that previously passed through unredacted may now be
    redacted if they contain these patterns.
  • Sorted key traversal. Redacted output dicts now have sorted keys.
    This is semantically identical but may affect tests that assert on
    key ordering of redacted output.
  • Zero new dependencies. YAML config support is optional
    (pip install pyyaml).

v0.4.1 — Versioned Artifact Schema

23 Feb 23:24
45afb11


Forkline v0.4.1 Release Notes — Versioned Artifact Schema

Release: v0.4.1
Date: 2026-02-23
Milestone: Versioned Artifact Schema (v1.0)


Summary

This release delivers a documented, forward-compatible artifact schema for Forkline run artifacts. Every artifact now carries a mandatory schema_version field, older artifacts are migrated transparently via a deterministic migration pipeline, and unknown fields from newer versions are tolerated without crashing.

This is foundational infrastructure. Schema versioning protects determinism, stability, auditability, diff integrity, and long-term replay trust. Forkline can now evolve its artifact format without breaking history.


What's New

1. Canonical Artifact Schema (forkline/artifact/schema.py)

Typed, versioned models for run artifacts, implemented as frozen dataclasses with zero new dependencies.

Models:

  • RunArtifact — Top-level artifact with mandatory schema_version, run_id, entrypoint, started_at, optional ended_at, status, forkline_version, events, and extensible metadata dict.
  • ArtifactEvent — Single event with event_id, run_id, ts, type, and payload.
  • SchemaVersionError — Exception for missing or unsupported schema versions.

Guarantees:

  • schema_version is mandatory. RunArtifact.from_dict() raises SchemaVersionError if missing.
  • Unknown fields are silently ignored in from_dict() — forward compatibility by design.
  • Artifacts are immutable (frozen dataclasses).
  • validate() returns a list of structural errors without raising.
  • to_json() / from_json() provide deterministic JSON roundtrip.

Schema version: "1.0" (SemVer-style, replacing the legacy "recording_v0" format).


2. Deterministic Migration Registry (forkline/artifact/migrate.py)

A versioned migration pipeline that transforms older artifact schemas to the current canonical format.

Primary entry point:

from forkline.artifact import migrate_artifact

migrated = migrate_artifact(raw_json_dict)

Behavior:

  • schema_version == "1.0" → returns a deep copy unchanged.
  • schema_version is older (e.g. "recording_v0") → routes through chained migration functions.
  • schema_version is newer (e.g. "2.0") → warns but returns unchanged (best-effort forward compat).
  • schema_version is missing → raises SchemaVersionError.

Migration registration pattern:

from forkline.artifact import register_migration

def migrate_1_0_to_1_1(raw: dict) -> dict:
    result = dict(raw)
    result.setdefault("new_field", "default_value")
    return result

register_migration("1.0", "1.1", migrate_1_0_to_1_1)

Built-in migration — recording_v0 → 1.0:

  • Environment fields (python_version, platform, cwd) moved to metadata dict.
  • Event timestamps normalized from created_at to ts.
  • schema_version updated to "1.0".

Migration invariants:

  • Deterministic: same input always produces same output.
  • Side-effect free: no I/O, no network, no state mutation.
  • Input is never mutated: deep copy is always made.
  • Chains compose: recording_v0 → 1.0 → 1.1 applied sequentially.
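Chaining can be pictured as a fold over the migration path. This is a standalone sketch of those invariants — the function and the two example migrations are illustrative, not the registry's code:

```python
import copy

def apply_chain(raw: dict, migrations) -> dict:
    """Apply an ordered list of (src, dst, fn) migrations without ever
    mutating the input: the caller's dict is deep-copied up front."""
    current = copy.deepcopy(raw)
    for src, dst, fn in migrations:
        assert current.get("schema_version") == src, f"expected {src}"
        current = fn(current)
        current["schema_version"] = dst
    return current

# Illustrative two-step chain: recording_v0 -> 1.0 -> 1.1
def v0_to_1_0(raw):
    out = dict(raw)
    meta = out.setdefault("metadata", {})
    for field in ("python_version", "platform", "cwd"):
        if field in out:
            meta[field] = out.pop(field)  # env fields move into metadata
    return out

def v1_0_to_1_1(raw):
    out = dict(raw)
    out.setdefault("new_field", "default_value")
    return out

chain = [("recording_v0", "1.0", v0_to_1_0), ("1.0", "1.1", v1_0_to_1_1)]
```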

3. Storage Integration

Both storage backends now support schema-aware artifact loading and canonical JSON export.

RunRecorder (flat event model):

  • load_artifact(run_id) -> Optional[RunArtifact] — loads run as canonical artifact, applies migration if needed.
  • export_artifact_json(run_id) -> Optional[str] — exports run as canonical JSON with schema_version.

SQLiteStore (step-based model):

  • load_artifact(run_id) -> Optional[RunArtifact] — flattens step hierarchy into flat event list, applies migration.
  • export_artifact_json(run_id) -> Optional[str] — exports run as canonical JSON.

Both methods handle legacy databases transparently. Artifacts with schema_version: "recording_v0" are migrated to "1.0" on load.


4. Replay Engine — Version Validation

The replay engine now validates schema_version at the load boundary.

Behavior on ReplayEngine.load_run():

  • Current version ("1.0"): loaded normally.
  • Older version: loaded via migration layer (transparent to caller).
  • Newer version: warning issued, best-effort replay.
  • Missing version: warning issued, default assumptions applied.

Critical invariant: Replay never crashes due to version mismatches. Degradation is always graceful.


5. CLI JSON Output — schema_version Included

All CLI JSON output now includes schema_version and forkline_version:

  • forkline list --json — each run includes schema_version.
  • forkline replay <run_id> --json — artifact includes schema_version and forkline_version.

6. Version Constants Updated (forkline/version.py)

Constant                Old Value       New Value       Purpose
----------------------  --------------  --------------  --------------------------------
SCHEMA_VERSION          "recording_v0"  "1.0"           Stamped on all new artifacts
LEGACY_SCHEMA_VERSION   (new)           "recording_v0"  Identifies pre-v1.0 artifacts
DEFAULT_SCHEMA_VERSION  "recording_v0"  "recording_v0"  Backward compat for NULL columns

7. Documentation

docs/artifact_schema.md — Full artifact schema specification including:

  • Versioning policy (SemVer-style major/minor rules)
  • Schema v1.0 field tables (RunArtifact, Event)
  • Example artifact JSON
  • Backward compatibility matrix
  • Migration guarantees and registration pattern
  • SQLite and JSON storage format details
  • Replay integration behavior
  • Stability guarantees
  • Design principles

README.md — Added "Artifact Stability Guarantee" section:

Forkline guarantees replay compatibility across minor versions. Breaking changes require a major version increment and migration support.


8. Module Exports

New public symbols exported from forkline:

Symbol Module Description
RunArtifact forkline.artifact.schema Canonical run artifact model
ArtifactEvent forkline.artifact.schema Canonical event model
SchemaVersionError forkline.artifact.schema Missing/unsupported version exception
migrate_artifact forkline.artifact.migrate Primary migration entry point
register_migration forkline.artifact.migrate Migration registration for future versions

Tests

34 new tests in tests/unit/test_artifact_schema.py. All hermetic — no network, deterministic.

TestRunArtifactSchema (9 tests)

Test Validates
test_schema_version_required schema_version field is present
test_to_dict_includes_schema_version Serialized dict includes schema_version
test_to_json_roundtrip JSON serialize/deserialize preserves all data
test_from_dict_rejects_missing_schema_version SchemaVersionError raised when missing
test_from_dict_ignores_unknown_fields Unknown fields silently dropped
test_validate_catches_missing_required_fields Validation reports empty required fields
test_validate_passes_for_valid_artifact Valid artifact returns no errors
test_metadata_extensibility Arbitrary keys in metadata dict
test_immutability Frozen dataclass rejects mutation

TestArtifactEvent (2 tests)

Test Validates
test_from_dict_ignores_unknown_fields Unknown event fields silently dropped
test_to_dict_roundtrip Event roundtrip preserves data

TestMigrationRegistry (8 tests)

Test Validates
test_migrate_current_version_is_noop "1.0" → "1.0" returns copy unchanged
test_migrate_current_version_returns_deep_copy Deep copy, not same object
test_migrate_missing_schema_version_raises SchemaVersionError on missing version
test_migrate_recording_v0_to_1_0 Full migration: env fields → metadata, ts normalization
test_migrate_is_deterministic Same input → same output across invocations
test_migrate_does_not_mutate_input Input dict unchanged after migration
test_newer_version_returns_with_warning Warning issued, data preserved
test_migrate_non_dict_raises Non-dict input raises SchemaVersionError

TestVersionComparison (6 tests)

Test Validates
test_compare_equal "1.0" == "1.0"
test_compare_less "1.0" < "2.0"
test_compare_greater "2.0" > "1.0"
test_legacy_less_than_semver "recording_v0" < "1.0"
test_migration_path_exists Path from recording_v0 → 1.0 found
test_migration_path_same_version Same version returns empty path

TestStorageArtifactIntegration (6 tests)

Test Validates
test_recorder_load_artifact RunRecorder.load_artifact() returns valid RunArtifact
test_recorder_export_json JSON export includes schema_version
test_recorder_load_artifact_nonexistent Returns None for missing run
test_sqlitestore_load_artifact SQLiteStore.load_artifact() with step flattening
test_sqlitestore_export_json JSON export from step-based store
test_legacy_db_migrates_on_load_artifact Legacy recording_v0 DB migrates to "1.0"

TestSchemaVersionConsistency (2 tests)

Test Validates
test_versions_match SCHEMA_VERSION == CURRENT_SCHEMA_VERSION
test_schema_version_is_1_0 Current version is "1.0"

Updated Tests (2 tests in test_version_schema.py)

Test Change
test_schema_version_format Updated to validate "major.minor" numeric format instead of "recording_" prefix
test_default_versions_are_reasonable Added assertion that current schema differs from default

Total test count after this release: 258 (22...

Read more

v0.4 - CLI

23 Feb 01:13
cd420bb


Forkline v0.4 Release Notes — CLI

Release: v0.4.0
Date: 2026-02-23
Milestone: v0.4 — CLI (Roadmap item 4 of 5)


Summary

This release delivers the full forkline CLI: four subcommands that let you run, list, replay, and diff agent runs from the terminal. This is the adoption wedge — Forkline is now usable without writing any Python.

The CLI is thin by design: parse args → call library APIs → render output. No business logic lives in the CLI layer.


What's New

1. forkline run — Execute under tracing

Run any Python script under Forkline tracing. Records execution metadata (script path, timestamps, exit code) and prints the assigned run ID.

$ forkline run examples/ollama_qwen3.py
Calling qwen3 ...
Response: A fork bomb is a denial-of-service attack that recursively spawns
an infinite number of processes to exhaust system resources, causing a crash
or severe performance degradation.
run_id: b015f49f45c04002a3c489fe84b45c5c

Behavior:

  • Validates the script file exists (exits 2 if not)
  • Executes the script in a subprocess via sys.executable
  • Sets environment variables for script integration: FORKLINE_TRACING=1, FORKLINE_RUN_ID=<id>, FORKLINE_DB=<path>
  • Records run start/end timestamps and exit code via RunRecorder
  • On non-zero exit: stores status=failed, prints run_id, and propagates the exit code
  • Script arguments are passed after --: forkline run script.py -- --arg1 value

2. forkline list — List stored runs

Show all recorded runs, newest first.

$ forkline list
ID                                    Created               Script                          Status
------------------------------------------------------------------------------------------------------
7b08ac5e533d456daa7a24921c0d1687      2026-02-23 01:04:34   examples/ollama_qwen3.py        ok
b015f49f45c04002a3c489fe84b45c5c      2026-02-23 01:04:20   examples/ollama_qwen3.py        ok

Options:

Flag       Default  Description
---------  -------  ------------------------------
--limit N  all      Maximum number of runs to show
--json     off      Output as JSON array
--db PATH  runs.db  SQLite database path

JSON output:

[
  {
    "created_at": "2026-02-23T01:04:34.989096+00:00",
    "ended_at": "2026-02-23T01:04:45.039067+00:00",
    "entrypoint": "examples/ollama_qwen3.py",
    "run_id": "7b08ac5e533d456daa7a24921c0d1687",
    "status": "ok"
  }
]

Output is deterministic: runs are ordered by started_at DESC, JSON keys are sorted.


3. forkline replay — Replay a recorded run

Load a recorded run by ID and print a summary of its events, duration, and status.

$ forkline replay b015f49f45c04002a3c489fe84b45c5c
Run: b015f49f45c04002a3c489fe84b45c5c
Script: examples/ollama_qwen3.py
Status: ok
Duration: 10.74s
Total events: 2
Events by type:
  input: 1
  output: 1

Options:

Flag       Default  Description
---------  -------  ----------------------------------
--json     off      Output full run and events as JSON
--db PATH  runs.db  SQLite database path

Exits 2 with a stderr message if the run ID is not found.


4. forkline diff — Diff two runs

Compare two recorded runs event-by-event and report the first point of divergence.

$ forkline diff b015f49f... 7b08ac5e...
Step 1 diverged:
  old.type: output
  old.payload: {"model": "qwen3", "response": "A fork bomb is a denial-of-service attack tha...
  new.type: output
  new.payload: {"model": "qwen3", "response": "A fork bomb is a type of denial-of-service at...

Options:

Flag                  Default  Description
--------------------  -------  -------------------------
--format pretty|json  pretty   Output format
--first-divergence    on       Stop at first divergence
--db PATH             runs.db  SQLite database path

Identical runs: prints No differences and exits 0.

JSON output:

{
  "divergence_index": 1,
  "identical": false,
  "new": {
    "payload": {"model": "qwen3", "response": "A fork bomb is a type of..."},
    "type": "output"
  },
  "old": {
    "payload": {"model": "qwen3", "response": "A fork bomb is a denial-of..."},
    "type": "output"
  },
  "run_a": "b015f49f45c04002a3c489fe84b45c5c",
  "run_b": "7b08ac5e533d456daa7a24921c0d1687",
  "total_events_a": 2,
  "total_events_b": 2
}

Diff engine: compares events by index. At each position, checks type and payload for equality. If event counts differ, reports different_event_count at the index where the shorter run ends. Always reports the first divergence.
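The core loop described above can be sketched as an index-by-index comparison (function name and return shape are illustrative, modeled on the JSON output above):

```python
def first_divergence(events_a, events_b):
    """Compare two event lists by index; return None if identical, else a
    dict describing the first divergence."""
    for i, (a, b) in enumerate(zip(events_a, events_b)):
        if a["type"] != b["type"] or a["payload"] != b["payload"]:
            return {"divergence_index": i, "old": a, "new": b}
    if len(events_a) != len(events_b):
        # Shorter run ran out of events: report at the index where it ends
        return {"divergence_index": min(len(events_a), len(events_b)),
                "reason": "different_event_count"}
    return None
```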


5. Exit Code Contract

All commands follow a consistent exit code convention for CI scriptability:

Code  Meaning
----  ---------------------------------------------------------
0     Success (command completed, runs match for diff)
1     Divergence found (diff only)
2     Input error (missing run ID, file not found, invalid args)

6. Environment Variable Bridge

forkline run sets three environment variables before executing the script subprocess:

Variable          Value   Purpose
----------------  ------  -------------------------------
FORKLINE_TRACING  1       Signal that tracing is active
FORKLINE_RUN_ID   <hex>   Run ID assigned by the CLI
FORKLINE_DB       <path>  Database path for event logging

Scripts can read these to log events to the same run:

import os
from forkline.storage.recorder import RunRecorder

db = os.environ.get("FORKLINE_DB", "runs.db")
run_id = os.environ.get("FORKLINE_RUN_ID")
recorder = RunRecorder(db_path=db)
recorder.log_event(run_id, "input", {"prompt": "hello"})

7. RunRecorder.list_runs() — New API

Added list_runs(limit=None) to RunRecorder. Returns runs ordered by started_at DESC with backward-compatible version defaults.

recorder = RunRecorder()
runs = recorder.list_runs(limit=10)  # newest 10 runs

8. Ollama Qwen3 Example (examples/ollama_qwen3.py)

A live example that calls Ollama's Qwen3 model and records the input/output as Forkline events. Demonstrates nondeterminism detection: same prompt, same model, different response.

$ forkline run examples/ollama_qwen3.py    # run 1
$ forkline run examples/ollama_qwen3.py    # run 2
$ forkline diff <run_1> <run_2>            # nondeterminism caught

Uses only urllib.request from the standard library — no new dependencies.


Tests

38 new tests in tests/unit/test_cli.py. All hermetic except TestCLIRun, which spawns real subprocesses against temporary scripts and databases.

TestRenderRunResult (1 test)

Test Validates
test_format Output is run_id: <id>

TestRenderListTable (2 tests)

Test Validates
test_empty "No runs found." for empty list
test_header_and_row Header columns, timestamp formatting, values present

TestRenderListJSON (1 test)

Test Validates
test_json_array Valid JSON array with correct fields

TestRenderReplaySummary (1 test)

Test Validates
test_contains_fields Run ID, status, duration, event counts, events by type

TestRenderReplayJSON (2 tests)

Test Validates
test_valid_json_with_all_fields All fields present and correct in parsed JSON
test_empty_events Zero events, null timestamps handled

TestRenderDiffPretty (2 tests)

Test Validates
test_identical "No differences"
test_diverged "Step N diverged:" with old/new type and payload

TestRenderDiffJSON (2 tests)

Test Validates
test_identical {"identical": true}
test_diverged divergence_index and old/new present

TestDiffEvents (6 tests)

Test Validates
test_identical Same events → identical: true
test_type_mismatch Different event types → divergence at index 0
test_payload_mismatch Same type, different payload → divergence at index 0
test_different_lengths Shorter list → reason: different_event_count
test_both_empty Two empty lists → identical
test_finds_first_divergence Three events, divergence at index 1 (not 2)

TestListRuns (3 tests)

Test Validates
test_empty Empty database → empty list
test_ordered_newest_first Most recent run is first
test_limit limit=2 returns exactly 2 from 3 runs

TestCLIList (3 tests)

Test Validates
test_list_shows_runs Run ID and script name in table output
test_list_json Valid JSON array with correct run ID
test_list_empty "No runs found" for empty database

TestCLIReplay (3 tests)

Test Validates
test_replay_success Run ID, status, event count in output; exit 0
test_replay_missing_run Exit 2 for nonexistent run
test_replay_json Valid JSON with run_id and total_events

TestCLIDiff (6 tests)

Test Validates
test_identical_runs "No differences"; exit 0
test_different_runs "Step 0 diverged"; exit 1
test_diff_json_format JSON with "identical": true
test_diff_missing_run Exit 2 for nonexistent run
test_diff_different_event_counts "Event count differs" message
test_diff_json_diverged JSON with divergence_index and old/new

TestCLIRun (4 tests)

| Test | Validates |
| --- | --- |
| test_run_missing_file | Exit 2 for nonexistent script |
| test_run_success | run_id: in output; run stored with `...

v0.3 — First-Divergence Diffing

21 Feb 17:46
8f159c7


Forkline v0.3 Release Notes — First-Divergence Diffing

Release: v0.3.0
Date: 2026-02-21
Milestone: v0.3 — First-Divergence Diffing (Roadmap item 3 of 5)


Summary

This release delivers first-divergence diffing: given two recorded runs, Forkline now compares them step-by-step and returns the first point of divergence with deterministic classification, structured JSON diff patches, and rule-based explanations.

This is the core feature that turns Forkline from a recording/replay tool into a forensic debugging tool — answering not just that two runs differ, but where, how, and what changed.


What's New

1. Deterministic Canonicalization (forkline/core/canon.py)

A canonicalization layer that produces stable, deterministic byte representations of any value before hashing or diffing.

Functions:

  • canon(value, profile="strict") -> bytes — Canonicalize any value to bytes
  • sha256_hex(data: bytes) -> str — SHA-256 hex digest
  • bytes_preview(data: bytes) -> str — Human-readable sha256:<hash>:<hex_prefix> format

Canonicalization guarantees:

  • Dict key order is irrelevant. Keys are sorted lexicographically before serialization.
  • Unicode is NFC-normalized. "café" (precomposed) and "café" (decomposed e + combining accent) produce identical output.
  • Newlines are normalized to LF. \r\n and \r are collapsed to \n.
  • Floats use 17-significant-digit precision. -0.0 collapses to 0.0. NaN and Inf are serialized as stable strings.
  • Booleans and integers are distinct. True and 1 produce different canonical bytes.
  • Bytes pass through unchanged. Binary data is not re-encoded; hashing uses SHA-256 with a hex prefix preview for display.
  • Compact JSON encoding. No whitespace separators ("," and ":"), ensure_ascii=False.

Zero dependencies. Uses only hashlib, json, math, unicodedata from the standard library.
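As a rough illustration of these guarantees, here is a minimal sketch using only the standard library. This is not Forkline's actual implementation: the 17-significant-digit float rule and the raw-bytes path are elided, and `canon_sketch` is a hypothetical name.

```python
import hashlib
import json
import math
import unicodedata

def canon_sketch(value):
    """Illustrative canonicalization sketch (NOT Forkline's implementation)."""
    def norm(v):
        if isinstance(v, str):
            # NFC-normalize Unicode; collapse \r\n and \r to \n
            return unicodedata.normalize("NFC", v).replace("\r\n", "\n").replace("\r", "\n")
        if isinstance(v, float):
            if math.isnan(v):
                return "NaN"           # stable string for NaN
            if math.isinf(v):
                return "Inf" if v > 0 else "-Inf"
            if v == 0.0:
                return 0.0             # collapses -0.0 to 0.0
            return v                   # (17-digit precision rule elided)
        if isinstance(v, dict):
            return {k: norm(x) for k, x in v.items()}
        if isinstance(v, list):
            return [norm(x) for x in v]
        return v  # bools/ints: JSON already keeps True ("true") distinct from 1
    # Compact JSON: sorted keys, no whitespace separators, ensure_ascii=False
    return json.dumps(norm(value), sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

# Dict key order is irrelevant:
assert canon_sketch({"z": 1, "a": 2}) == canon_sketch({"a": 2, "z": 1})
print(hashlib.sha256(canon_sketch({"z": 1, "a": 2})).hexdigest()[:16])
```

Hashing the canonical bytes (as `sha256_hex` does) then yields stable digests across key orderings, Unicode forms, and newline conventions.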


2. Deterministic JSON Diff Patches (forkline/core/json_diff.py)

A recursive JSON diff algorithm that produces a stable, ordered list of patch operations for any two JSON-like values.

Function:

  • json_diff(old, new, path="$") -> List[Dict]

Patch operation format:

[
  {"op": "remove", "path": "$.a.b", "old": "<removed_value>"},
  {"op": "add",    "path": "$.x",   "value": "<added_value>"},
  {"op": "replace","path": "$.k",   "old": "<old_value>", "new": "<new_value>"}
]

Ordering guarantees (deterministic across invocations):

  • Dicts: removed keys (sorted) → added keys (sorted) → common keys (sorted, recursed).
  • Lists: compared by index; removes at tail, then adds at tail.
  • Type mismatch: replace whole node.
  • Numeric compatibility: int vs float compared as numeric, not as type mismatch.

Paths use JSONPath-style notation: $.outer.inner, $.list[0], $.nested.array[2].field.
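The ordering guarantees above can be sketched as a small recursive function. This is an illustrative re-implementation under the stated rules, not Forkline's `json_diff` (details such as how removed/added values are snapshotted may differ):

```python
from typing import Any, Dict, List

def json_diff_sketch(old: Any, new: Any, path: str = "$") -> List[Dict]:
    """Illustrative recursive JSON diff (NOT Forkline's json_diff)."""
    def kind(v):
        # int vs float compare as numeric; bool stays a distinct kind
        if isinstance(v, bool):
            return bool
        if isinstance(v, (int, float)):
            return "number"
        return type(v)

    if kind(old) != kind(new):
        # Type mismatch: replace the whole node
        return [{"op": "replace", "path": path, "old": old, "new": new}]

    if isinstance(old, dict):
        ops: List[Dict] = []
        for k in sorted(old.keys() - new.keys()):   # removed keys (sorted)
            ops.append({"op": "remove", "path": f"{path}.{k}", "old": old[k]})
        for k in sorted(new.keys() - old.keys()):   # then added keys (sorted)
            ops.append({"op": "add", "path": f"{path}.{k}", "value": new[k]})
        for k in sorted(old.keys() & new.keys()):   # then common keys, recursed
            ops.extend(json_diff_sketch(old[k], new[k], f"{path}.{k}"))
        return ops

    if isinstance(old, list):
        ops = []
        for i in range(min(len(old), len(new))):    # compare by index
            ops.extend(json_diff_sketch(old[i], new[i], f"{path}[{i}]"))
        for i in range(len(new), len(old)):         # removes at tail
            ops.append({"op": "remove", "path": f"{path}[{i}]", "old": old[i]})
        for i in range(len(old), len(new)):         # then adds at tail
            ops.append({"op": "add", "path": f"{path}[{i}]", "value": new[i]})
        return ops

    if old != new:
        return [{"op": "replace", "path": path, "old": old, "new": new}]
    return []
```

Because every branch iterates in a fixed, sorted order, the emitted patch list is byte-for-byte identical across invocations for the same inputs.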


3. First-Divergence Engine (forkline/core/first_divergence.py)

The core diffing algorithm: compare two Run objects step-by-step, classify the first mismatch, and return a structured result.

Algorithm

  1. Lockstep comparison. Walk both runs at the same index. At each step, classify by comparing (in priority order): step name → input hash → error state → output hash → all events hash.

  2. Resync window. On mismatch, search within a configurable window (default W=10) for matching "soft signatures" — (step_name, input_hash) tuples. The search iterates by increasing combined distance from the mismatch point, finding the nearest resync.

  3. Gap classification.

    • Resync with gap_a > 0, gap_b == 0 → missing_steps (steps in run_a absent from run_b)
    • Resync with gap_b > 0, gap_a == 0 → extra_steps (steps in run_b not in run_a)
    • Both gaps > 0 → falls through to classify the mismatch at current position
    • No resync → classify by what differs at current position
  4. Length mismatch. If one run is longer after lockstep exhausts the shorter, classify as missing_steps or extra_steps.
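The lockstep-plus-resync skeleton can be sketched with simplified step signatures. The `(step_name, input_hash)` tuples, the coarse `"mismatch"` status, and the result-dict shape below are illustrative assumptions; the real engine classifies mismatches by the priority rules documented in these notes.

```python
from typing import List, Optional, Tuple

Sig = Tuple[str, str]  # hypothetical soft signature: (step_name, input_hash)

def find_resync(a: List[Sig], b: List[Sig], i: int, j: int,
                window: int = 10) -> Optional[Tuple[int, int]]:
    """Search a window x window region for matching soft signatures,
    preferring the smallest combined distance from the mismatch point."""
    candidates = []
    for da in range(window):
        for db in range(window):
            if i + da < len(a) and j + db < len(b) and a[i + da] == b[j + db]:
                candidates.append((da + db, da, db))
    if not candidates:
        return None
    _, da, db = min(candidates)  # nearest resync, deterministic tie-break
    return i + da, j + db

def first_divergence_sketch(a: List[Sig], b: List[Sig], window: int = 10) -> dict:
    i = 0
    while i < len(a) and i < len(b):
        if a[i] == b[i]:          # lockstep: same index in both runs
            i += 1
            continue
        resync = find_resync(a, b, i, i, window)
        if resync:
            ra, rb = resync
            gap_a, gap_b = ra - i, rb - i
            if gap_a > 0 and gap_b == 0:
                return {"status": "missing_steps", "idx_a": i, "gap": gap_a}
            if gap_b > 0 and gap_a == 0:
                return {"status": "extra_steps", "idx_b": i, "gap": gap_b}
        # Both gaps > 0, or no resync: classify at the current position
        return {"status": "mismatch", "idx_a": i, "idx_b": i}
    if len(a) != len(b):          # length mismatch after lockstep exhausts one run
        status = "missing_steps" if len(a) > len(b) else "extra_steps"
        return {"status": status, "idx": i}
    return {"status": "exact_match", "steps_compared": i}
```

For example, if run_b simply skips one step of run_a, the resync search finds the shared signature one step ahead in run_a and reports missing_steps rather than a cascade of spurious mismatches.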

Divergence Types

| Type | Trigger | Explanation Pattern |
| --- | --- | --- |
| exact_match | Runs identical | "Runs are identical (N steps compared)" |
| op_divergence | Step names differ | "Step 3: operation mismatch ('tool_call' vs 'llm_call')" |
| input_divergence | Same name, different input | "Step 3 'tool_call': input differs" |
| output_divergence | Same name + input, different output | "Step 3 'tool_call': output differs (same input)" |
| error_divergence | Error presence or content differs | "Step 3 'tool_call': error state differs" |
| missing_steps | Steps in run_a not in run_b | "Step 5 from run_a missing in run_b" |
| extra_steps | Steps in run_b not in run_a | "Steps 3..4 in run_b not present in run_a" |

All explanations are deterministic and rule-based — no LLM narration, no randomness.

Classification Priority

When two steps share a name but differ, classification follows strict priority:

  1. Input divergence — checked first because differing inputs explain differing outputs
  2. Error divergence — error presence/absence or content differs
  3. Output divergence — same input but different output (nondeterminism signal)
  4. All-events fallback — catches differences in tool_call, artifact_ref, or other event types

Data Models

StepSummary — Compact step representation included in results:

StepSummary(
    idx=2,
    name="generate_response",
    input_hash="a1b2c3d4...",
    output_hash="e5f6a7b8...",
    event_count=3,
    has_error=False,
)

FirstDivergenceResult — Complete result object:

FirstDivergenceResult(
    status="output_divergence",        # DivergenceType
    idx_a=2,                           # Index in run_a at divergence
    idx_b=2,                           # Index in run_b at divergence
    explanation="Step 2 'generate_response': output differs (same input)",
    old_step=StepSummary(...),         # Step from run_a
    new_step=StepSummary(...),         # Step from run_b
    input_diff=None,                   # JSON patch (when applicable)
    output_diff=[{"op": "replace", "path": "$[0].text", ...}],
    last_equal_idx=1,                  # Last step where both matched
    context_a=[StepSummary(...),...],   # 2 steps before/after in run_a
    context_b=[StepSummary(...),...],   # 2 steps before/after in run_b
)

Both models are frozen dataclasses with .to_dict() for JSON serialization.
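The pattern is the standard frozen-dataclass idiom; a minimal sketch follows. The field set mirrors the StepSummary example above, but `StepSummarySketch` is a hypothetical name and `to_dict` here is just `dataclasses.asdict`, which may not match Forkline's exact serialization.

```python
import dataclasses
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class StepSummarySketch:
    """Illustrative frozen model with .to_dict(), mirroring the shape above."""
    idx: int
    name: str
    input_hash: str
    output_hash: str
    event_count: int
    has_error: bool

    def to_dict(self) -> dict:
        return asdict(self)

s = StepSummarySketch(2, "generate_response", "a1b2", "e5f6", 3, False)
try:
    s.idx = 5                      # frozen: mutation is rejected
except dataclasses.FrozenInstanceError:
    pass
```

Freezing keeps result objects hashable and safe to share; `to_dict()` gives the JSON-ready view used for serialization.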

API

from forkline.core.first_divergence import find_first_divergence, DivergenceType

result = find_first_divergence(
    run_a,
    run_b,
    window=10,          # Resync window size
    context_size=2,     # Steps before/after divergence in context
    show="both",        # "input", "output", or "both"
)

# JSON-serializable output
import json
print(json.dumps(result.to_dict(), indent=2))

4. CLI — forkline diff (forkline/cli.py)

The first CLI subcommand, establishing the forkline command-line interface.

Usage:

forkline diff --first <run_a> <run_b> [OPTIONS]

Options:

| Flag | Default | Description |
| --- | --- | --- |
| --first | true | Show first divergence only |
| --window N | 10 | Resync window size |
| --format json\|text | text | Output format |
| --show input\|output\|both | both | Which diffs to include |
| --canon strict | strict | Canonicalization profile |
| --db PATH | forkline.db | SQLite database path |

Exit codes:

  • 0 — Runs are identical (exact_match)
  • 1 — Divergence detected (any other status)

This makes forkline diff directly usable in CI pipelines and shell scripts.
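For instance, a pipeline step can invoke it directly and let the exit code gate the job. The fragment below is a hypothetical GitHub Actions step; the run IDs and database path are illustrative, not from this release:

```yaml
# Hypothetical CI step: fail the job when two recorded runs diverge.
- name: Gate on behavioral identity
  run: |
    forkline diff --first baseline_run candidate_run \
      --format json --db forkline.db
    # Exit 0: exact_match, job continues. Exit 1: divergence, job fails.
```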

Text output sample:

First divergence: output_divergence
  Step 2 'generate_response': output differs (same input)

  Run A step 2 'generate_response':
    input_hash:  a1b2c3d4e5f6a7b8...
    output_hash: 1234567890abcdef...
    events: 3
    has_error: False

  Run B step 2 'generate_response':
    input_hash:  a1b2c3d4e5f6a7b8...
    output_hash: fedcba0987654321...
    events: 3
    has_error: False

  Output diff:
    replace $[0].text: "Expected response" -> "Different response"

  Last equal: step 1
  Context A: [step 0 'init', step 1 'prepare', step 2 'generate_response']
  Context B: [step 0 'init', step 1 'prepare', step 2 'generate_response']

Entry point: Registered as forkline = "forkline.cli:main" in pyproject.toml ([project.scripts]).


5. Module Exports

New public symbols exported from forkline and forkline.core:

| Symbol | Module | Description |
| --- | --- | --- |
| find_first_divergence | forkline.core.first_divergence | Main engine function |
| FirstDivergenceResult | forkline.core.first_divergence | Result dataclass |
| StepSummary | forkline.core.first_divergence | Compact step summary |
| DivergenceType | forkline.core.first_divergence | Type classification constants |
| canon | forkline.core.canon | Value → canonical bytes |
| sha256_hex | forkline.core.canon | Bytes → SHA-256 hex |
| bytes_preview | forkline.core.canon | Bytes → human-readable hash preview |
| json_diff | forkline.core.json_diff | Deterministic JSON diff patches |

Tests

45 new tests across 3 test classes in tests/unit/test_first_divergence.py. All hermetic — no database, no disk I/O, no network.

TestCanonStability (14 tests)

| Test | Validates |
| --- | --- |
| test_dict_key_order_irrelevant | {"z":1,"a":2} == {"a":2,"z":1} |
| test_nested_dict_stability | Deep nesting with mixed key order |
| test_unicode_normalization | NFC: \u00e9 == e\u0301 |
| test_newline_normalization | `\r...

v0.1.1 - Recording & Artifact Foundations

03 Feb 01:48
defd8fb


Forkline v0.1.1 — Recording & Artifact Foundations

Release focus: establish Forkline’s core recording primitives and artifact model for deterministic agent workflows.

v0.1.1 intentionally does not include replay. This release lays the groundwork by making runs recordable, inspectable, and diffable in a local-first, immutable format.


✨ What’s New

Deterministic Run Recording (Foundational)

  • Introduced a structured run recording model for agentic workflows.
  • Captures ordered execution steps including:
    • LLM inputs and outputs
    • Tool invocations
    • Execution metadata
  • Artifacts are written locally and treated as immutable once persisted.

This establishes Forkline’s core abstraction:
a run is a durable, replayable artifact — not a log stream.


Artifact Schema v0

  • Added a first-pass, explicit artifact schema (recording_v0).
  • Clearly separates:
    • Run-level metadata
    • Step-level inputs and outputs
    • Execution ordering
  • Schema is designed for future replay and diffing, not observability.

This schema defines the baseline contract for Forkline artifacts.


Redaction Support

  • Introduced a redaction layer for recorded artifacts.
  • Enables sensitive fields (API keys, tokens, PII) to be:
    • Redacted at record time, or
    • Scrubbed before persistence
  • Redaction is explicit and policy-driven — never implicit.

This makes Forkline artifacts safe for local inspection and sharing.
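As a sketch of what policy-driven redaction can look like: the field names, the `sk-` pattern, and the `[REDACTED]` placeholder below are illustrative assumptions, not Forkline's actual policy format.

```python
import re
from typing import Any

# Hypothetical policy: field names redacted wholesale, plus value patterns
# scrubbed inside strings. Neither set is from the release notes.
REDACT_FIELDS = {"api_key", "authorization", "token"}
REDACT_PATTERNS = [re.compile(r"sk-[A-Za-z0-9]{8,}")]

def redact(value: Any) -> Any:
    """Recursively scrub sensitive fields and values before persistence."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in REDACT_FIELDS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        out = value
        for pat in REDACT_PATTERNS:
            out = pat.sub("[REDACTED]", out)
        return out
    return value
```

Because the policy is explicit data rather than heuristics baked into the recorder, what gets scrubbed is auditable and reproducible.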


Run & Step Diffing (Structural)

  • Added initial diff utilities for comparing recorded runs or steps.
  • Focuses on structural and semantic differences, not textual logs.
  • Intended for offline analysis and future replay divergence detection.

This is a foundational capability, not a visualization layer.


Core Types & Invariants

  • Formalized core domain types:
    • Runs
    • Steps
    • Artifacts
    • Diff results
  • Introduced the concept of core invariants:
    • Ordering matters
    • Artifacts are the source of truth
    • No mutation after persistence

These invariants guide all future Forkline features.


❌ Explicit Non-Goals (v0.1.1)

To avoid confusion, v0.1.1 does not include:

  • Replay or re-execution
  • First-divergence detection
  • OpenTelemetry integration
  • Observability, tracing, or metrics
  • Production or distributed runtime support

These exclusions are deliberate.


Why This Release Matters

v0.1.1 is about credibility, not completeness.

It answers one question clearly:

Can Forkline reliably capture an agent run as a durable, inspectable artifact?

The answer is now yes.

Replay, divergence detection, and developer-facing workflows are built on top of this foundation.


What’s Next

  • Deterministic replay from recorded artifacts
  • First-divergence detection
  • Golden replay tests
  • Minimal replay demos

(Tracked for the next release.)

What's Changed

Full Changelog: v0.1.0...v0.1.1

v0.1.0 - Deterministic Run Recording

19 Jan 00:33
b629fd2


Forkline v0.1.0

First release of Forkline: local-first, replay-first tracing for agentic AI workflows.

What's in v0.1

  • Deterministic recording of agent runs
  • Self-contained artifacts stored in SQLite
  • Security-first with automatic redaction (SAFE mode)
  • Human-inspectable with sqlite3 or helper scripts
  • Append-only logging with versioned schema

Quick Start

git clone https://github.com/sauravvenkat/forkline.git
cd forkline
source dev.env
python examples/minimal.py
python scripts/inspect_runs.py

What's Changed

New Contributors

Full Changelog: https://github.com/sauravvenkat/forkline/commits/v0.1.0