Releases: sauravvenkat/forkline
v0.5.0 — CI Integration
Forkline v0.5.0 Release Notes — CI Integration
Release: v0.5.0
Date: 2026-02-25
Milestone: v0.5 — CI + Test Harness (Roadmap item 5 of 5)
Summary
This release delivers CI integration: a deterministic, offline, build-failing diff system that lets teams gate merges on behavioral identity. If an agent's output changes, the build fails — with a clear diff, a machine-readable exit code, and a suggested fix.
This completes the v0 roadmap. Forkline now covers the full loop: record → replay → diff → CI gate.
What's New
1. forkline ci Command Suite (forkline/ci/commands.py)
Five new subcommands purpose-built for CI pipelines:
| Command | Purpose |
|---|---|
| `forkline ci record` | Run a script, produce a normalized JSON artifact |
| `forkline ci replay` | Validate an artifact's schema and structure offline |
| `forkline ci diff` | Compare two artifacts, exit 1 on divergence |
| `forkline ci check` | All-in-one: record actual, diff against expected |
| `forkline ci normalize` | Strip timestamps and metadata for stable diffs |
ci check is the primary CI command — a single call that records behavior, normalizes the output, and diffs against a committed baseline:
forkline ci check \
--entrypoint examples/my_flow.py \
--expected tests/testdata/my_flow.run.json \
  --offline

Exit 0 means identical behavior. Exit 1 means the build should fail.
2. Offline Enforcement (forkline/ci/offline.py)
A hard no-network guarantee for CI runs. When --offline is set (or FORKLINE_OFFLINE=1), Forkline monkeypatches socket.connect, socket.create_connection, and socket.getaddrinfo to raise ForklineOfflineError immediately.
Properties:
- Fail-closed. Network calls error instantly — no hangs, no timeouts.
- Deterministic. Same call always produces the same error message.
- Scoped. `offline_context()` restores normal access on exit.
- Cross-library. Blocks `requests`, `httpx`, `urllib3`, and any library built on `socket`.
import requests

from forkline.ci.offline import offline_context

with offline_context():
    # Any network call raises ForklineOfflineError
    requests.get("https://api.example.com")  # raises immediately

# Normal access restored here

3. Artifact Normalization (forkline/ci/normalize.py)
Strips unstable fields from artifacts so that recordings made at different times, on different machines, produce identical diffs when behavior is the same.
Normalized by default:
- Timestamp fields (`ts`, `started_at`, `ended_at`, `created_at`) → `2000-01-01T00:00:00+00:00`
- Platform metadata (`python_version`, `platform`, `cwd`) → removed
- Events sorted by `event_id`
Preserved:
- Event types and payloads (the behavioral data)
- Schema version and entrypoint path
Normalization is applied automatically by ci record and ci diff. It can also be run explicitly:
forkline ci normalize artifact.run.json --out normalized.run.json

4. Exit Code Contract (forkline/ci/exitcodes.py)
A strict, stable exit code contract for CI automation. These values will not change across releases.
| Code | Constant | Meaning |
|---|---|---|
| 0 | `EXIT_SUCCESS` | Success, no diff |
| 1 | `EXIT_DIFF_DETECTED` | Diff detected — fail the build |
| 2 | `EXIT_USAGE_ERROR` | Bad args, missing file |
| 3 | `EXIT_REPLAY_FAILED` | Script failed, runtime exception |
| 4 | `EXIT_OFFLINE_VIOLATION` | Network attempted in offline mode |
| 5 | `EXIT_ARTIFACT_ERROR` | Cannot parse artifact, schema error |
| 6 | `EXIT_INTERNAL_ERROR` | Unexpected bug |
Every exit code is exercised by a dedicated test.
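A CI wrapper script can branch on these values directly. The sketch below is illustrative, assuming the constants are importable from forkline.ci.exitcodes (the module named above); only the constant names and the forkline ci check invocation come from this release.

import subprocess
import sys

from forkline.ci.exitcodes import EXIT_DIFF_DETECTED, EXIT_SUCCESS

# Hypothetical CI wrapper: run the gate, then branch on the documented exit codes.
result = subprocess.run(
    [
        "forkline", "ci", "check",
        "--entrypoint", "examples/my_flow.py",
        "--expected", "tests/testdata/my_flow.run.json",
        "--offline",
    ]
)
if result.returncode == EXIT_SUCCESS:
    print("Behavior matches the committed baseline")
elif result.returncode == EXIT_DIFF_DETECTED:
    print("Behavioral diff detected; failing the build")
    sys.exit(1)
else:
    sys.exit(result.returncode)  # usage, replay, offline, artifact, or internal error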
5. Python Test Helper (forkline/testing.py)
A one-line API for snapshot-style testing of agentic workflows:
from forkline.testing import assert_no_diff

def test_my_flow():
    assert_no_diff(
        entrypoint="examples/my_flow.py",
        expected_artifact="tests/testdata/my_flow.run.json",
        offline=True,
    )

On failure, raises ArtifactDiffError with:
- First divergent event index
- Expected vs actual payloads
- Structured diff for programmatic inspection
- Suggested re-record command
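A test can also catch the error to inspect the failure. A minimal sketch, assuming ArtifactDiffError is importable from forkline.testing alongside assert_no_diff; the assertion on the message text is illustrative only.

import pytest

from forkline.testing import ArtifactDiffError, assert_no_diff  # assumed import location

def test_my_flow_reports_divergence():
    with pytest.raises(ArtifactDiffError) as excinfo:
        assert_no_diff(
            entrypoint="examples/my_flow.py",
            expected_artifact="tests/testdata/my_flow.run.json",
            offline=True,
        )
    # The message carries the first divergent index and the suggested re-record command.
    assert "Re-record" in str(excinfo.value)  # illustrative check only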
6. Diff Output (forkline/ci/commands.py)
Diff output is concise in text mode and machine-readable in JSON mode.
Text mode:
DIFF: First divergence at event[1] (type: output)
$.answer: "4" -> "5"
$.note: <missing> -> "wrong!"
Suggested fix: Re-record baseline: forkline ci record --entrypoint <script> --out <path>
JSON mode:
{
"identical": false,
"first_divergent_index": 1,
"event_type": "output",
"expected": {"type": "output", "payload": {"answer": "4"}},
"actual": {"type": "output", "payload": {"answer": "5", "note": "wrong!"}},
"payload_diff": [
{"op": "add", "path": "$.note", "value": "wrong!"},
{"op": "replace", "path": "$.answer", "old": "4", "new": "5"}
],
"suggestion": "Re-record baseline: forkline ci record --entrypoint <script> --out <path>"
}

7. CLI Integration (forkline/cli/__init__.py)
The ci subcommand is wired into the existing forkline CLI with full argparse integration:
$ forkline ci --help
usage: forkline ci [-h] {record,replay,diff,check,normalize} ...
Commands for CI/CD pipelines: record baselines, replay artifacts,
diff for behavioral changes, and gate merges on behavioral identity.
All subcommands support --help and produce meaningful error messages on invalid usage.
8. Documentation (docs/ci.md)
Full CI guide covering:
- Quick start (3-command workflow)
- All commands with flags and descriptions
- Offline mode details
- Artifact normalization behavior
- Recommended repo layout (`tests/testdata/*.run.json`)
- Re-recording baselines
- Python test helper usage
- GitHub Actions example (copy-paste ready)
- Programmatic API usage
9. Examples
Four new runnable examples:
| File | Demonstrates |
|---|---|
| `examples/ci_record_and_diff.py` | Record baseline, diff identical and changed behavior, JSON output |
| `examples/ci_check_gate.py` | All-in-one build gate with ci_check |
| `examples/ci_offline_enforcement.py` | Offline mode blocking all network calls |
| `examples/ci_test_helper.py` | `assert_no_diff` for pytest/unittest |
Tests
60 new tests in tests/unit/test_ci.py. All hermetic — no network, no external dependencies.
TestExitCodes (2 tests)
| Test | Validates |
|---|---|
| `test_exit_code_values` | Each code has its documented integer value |
| `test_all_distinct` | All 7 codes are unique |

TestOfflineMode (7 tests)
| Test | Validates |
|---|---|
| `test_offline_context_blocks_socket` | socket.connect raises ForklineOfflineError |
| `test_offline_blocks_create_connection` | socket.create_connection blocked |
| `test_offline_blocks_getaddrinfo` | DNS resolution blocked |
| `test_offline_error_is_deterministic` | Same call → same error message |
| `test_offline_restores_after_context` | Normal access restored on exit |
| `test_enable_disable_idempotent` | Double-enable/disable is safe |
| `test_offline_error_attributes` | Error has operation attribute, includes FORKLINE_OFFLINE |

TestNormalization (9 tests)
| Test | Validates |
|---|---|
| `test_timestamps_normalized` | All timestamp fields → sentinel |
| `test_metadata_stripped` | Platform metadata removed |
| `test_metadata_preserved_when_disabled` | Opt-out works |
| `test_timestamps_preserved_when_disabled` | Opt-out works |
| `test_events_ordered_by_event_id` | Stable sort |
| `test_normalize_ids` | IDs replaced with sequential values |
| `test_normalize_deterministic` | Same input → same output |
| `test_normalize_json_roundtrip` | JSON string → normalize → parse |
| `test_original_not_mutated` | Input dict unchanged |
TestCIRecord (6 tests)
| Test | Validates |
|---|---|
| `test_record_success` | Produces valid artifact with schema_version and events |
| `test_record_missing_entrypoint` | Exit 2 |
| `test_record_failed_script` | Exit 3 |
| `test_record_creates_directories` | Nested output path created |
| `test_record_artifact_is_normalized` | Timestamps are sentinel values |
| `test_record_deterministic` | Two recordings → identical normalized output |

TestCIReplay (6 tests)
| Test | Validates |
|---|---|
| `test_replay_valid_artifact` | Exit 0, JSON output with status/event_count |
| `test_replay_missing_file` | Exit 2 |
| `test_replay_invalid_json` | Exit 5 |
| `test_replay_missing_schema_version` | Exit 5 |
| `test_replay_strict_empty_payload` | Exit 5 in strict mode |
| `test_replay_not_strict_allows_empty_payload` | Exit 0 in default mode |
TestCIDiff (9 tests)
| Test | Validates |
|---|---|
| `test_identical_artifacts` | Exit 0, "No differences" text |
| `test_different_artifacts` | Exit 1, "DIFF" in output |
| `test_diff_json_format` | JSON output with first_divergent_index and suggestion |
| `test_diff_json_identical` | JSON output with identical: true |
| `test_diff_missing_expected` / `test_diff_missing_actual` | Exit 2 |
| `test_diff_event_count_mismatch` | Exit 1, "mismatch" in output |
| `test_diff_bad_json` | Exit 5 |
| `test_diff_normalizes_timestamps` | Artifacts at different times still match |

TestCINormalize (4 tests)
| Test | Validates |
|---|---|
| `test_normalize_in_place` | Overwrites file, timestamps normalized |
| `test_normalize_to_new_path` | Writes to separate output |
| `test_normalize_missing_file` | Exit 2 |
| `test_normalize_bad_json` | Exit 5 |
TestCICheck (4 ...
v0.4.2 - Tool Invocation Recording + Deterministic Redaction
Forkline v0.4.2 Release Notes
Tool Invocation Recording + Deterministic Redaction
Agents without tool visibility are blind. Agents with unsafe logs are
unusable in real systems. This release solves both: tool calls are now
first-class events, and sensitive data is redacted deterministically
before anything touches disk.
New: Tool Call Events
Every tool invocation — DB queries, API calls, file operations — is
now recorded as a tool_call event in the run artifact with a
canonical schema:
{
"tool_name": "http.request",
"invocation_id": "a1b2c3...",
"request": { "url": "https://api.example.com" },
"response": { "status": 200 },
"error": null,
"timing": {
"started_at": "2026-02-23T10:00:00Z",
"ended_at": "2026-02-23T10:00:00.250Z",
"duration_ms": 250.0
},
"metadata": { "bytes_read": 1024, "cache_hit": false }
}

Three ways to record:

import requests
from forkline.core.tool_call import ToolCallRecorder, record_tool_call

# Context manager — full control over request/response/metadata
with ToolCallRecorder(recorder, run_id, "http.get") as tc:
    tc.set_request({"url": "https://api.example.com"})
    resp = requests.get("https://api.example.com")
    tc.set_response({"status": resp.status_code})
    tc.set_metadata({"bytes_read": len(resp.content)})

# Decorator — wrap existing functions
@record_tool_call(recorder, run_id, "db.query")
def query_db(sql):
    return db.execute(sql).fetchall()

# Convenience method — manual construction
recorder.log_tool_call(
    run_id=run_id,
    tool_name="file.read",
    request={"path": "/tmp/data.txt"},
    response={"content": "hello"},
)

Replay integration: ToolCallRecorder enforces determinism
guardrails — live tool calls are blocked during replay mode. Set
allow_in_replay=True for re-execution scenarios.
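A minimal sketch of the opt-out, assuming allow_in_replay is accepted as a keyword argument on ToolCallRecorder (this release note names the flag but not where it is passed); recorder and run_id are placeholders.

from forkline.core.tool_call import ToolCallRecorder

# Assumption: allow_in_replay is a ToolCallRecorder keyword argument.
with ToolCallRecorder(recorder, run_id, "db.query", allow_in_replay=True) as tc:
    tc.set_request({"sql": "SELECT 1"})
    tc.set_response({"rows": [[1]]})  # safe to re-execute during replay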
New: Regex-Based Redaction
The redaction engine now supports three matching strategies:
| Strategy | Matches on | Example |
|---|---|---|
| Key-based | Dict key names (substring, case-insensitive) | password, api_key |
| Path-based | Dot-separated paths | headers.authorization |
| Regex-based (new) | String values anywhere in the payload | JWTs, Bearer tokens, AWS keys |
Default policy now includes regex rules that catch secrets embedded in
string values, not just key names:
eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOi... → [REDACTED:jwt]
Bearer sk-12345abcdef → Bearer [REDACTED]
AKIAIOSFODNN7EXAMPLE → [REDACTED:aws_key]
New: Redaction Config Files
Configure redaction via YAML or JSON instead of code:
# forkline.redact.yaml
fields:
  redact_keys:
    - password
    - token
    - api_key
  redact_paths:
    - "headers.Authorization"
  redact_regex:
    - name: jwt
      pattern: "eyJ[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+"
      replacement: "[REDACTED:jwt]"
    - name: connection_string
      pattern: "://[^:]+:[^@]+@"
      replacement: "://[REDACTED:credentials]@"

forkline run my_agent.py --redact-config forkline.redact.yaml

JSON configs work with zero extra dependencies. YAML requires pyyaml.
Improved: Determinism Guarantees
- Dict keys are now traversed in sorted order during redaction, eliminating dependence on construction order
- Hash action uses `json.dumps(sort_keys=True)` for stable dict hashing
- Custom replacement strings (e.g. `[REDACTED:jwt]`) for traceability
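The stable-hash behavior can be illustrated outside Forkline. This is a generic sketch of the technique, not Forkline's internal code:

import hashlib
import json

def stable_hash(payload: dict) -> str:
    # Sorted-key serialization makes the digest independent of construction order.
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"user": "alice", "token": "secret"}
b = {"token": "secret", "user": "alice"}  # same content, different key order
assert stable_hash(a) == stable_hash(b)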
New API Surface
| Symbol | Module | Description |
|---|---|---|
| `ToolCallRecorder` | `forkline.core.tool_call` | Context manager for recording tool calls |
| `record_tool_call` | `forkline.core.tool_call` | Decorator for wrapping functions |
| `ToolCallPayload` | `forkline.core.tool_call` | Canonical tool_call event schema |
| `ToolCallTiming` | `forkline.core.tool_call` | Timing fields dataclass |
| `RegexRedactionRule` | `forkline.core.redaction` | Regex-based value redaction rule |
| `RedactionConfig` | `forkline.core.redaction` | Config object for loading from files |
| `load_redaction_config()` | `forkline.core.redaction` | Load config from YAML/JSON |
| `RunRecorder.log_tool_call()` | `forkline.storage.recorder` | Convenience method |
| `RunRecorder.with_config()` | `forkline.storage.recorder` | Factory with config file |
| `--redact-config` | CLI | Flag on `forkline run` |
Examples
- `examples/tool_call_basic.py` — All three recording APIs
- `examples/tool_call_production.py` — Full agentic workflow: LLM planning, DB query, HTTP webhook with JWT auth, file write, failing upstream call. Demonstrates redaction of connection strings, Bearer tokens, cookies, API keys, and JWTs with zero secrets in stored artifacts.
Test Coverage
58 new tests (316 total), covering:
- Tool call payload serialization, roundtrip, JSON export/import
- Context manager timing, error capture, metadata, unique invocation IDs
- Decorator captures calls, dict returns, exceptions
- End-to-end "no raw secrets persisted" verification
- Replay mode guardrails (blocks live calls, allows re-exec)
- Regex redaction: JWT, Bearer, AWS keys, nested structures
- Sorted key traversal determinism across construction orders
- Config loading from JSON, rule order stability
- Hash determinism for dicts with different key order
Documentation
- `docs/tool_visibility.md` — Event schema, recording APIs, replay integration, event ordering contract
- `docs/redaction.md` — Matching strategies, rule application order, config format, default policy reference, determinism guarantees, redaction pipeline diagram
Migration Notes
- No breaking changes. All existing APIs and stored artifacts continue to work unchanged.
- Default redaction policy expanded. The default policy now includes `passwd` as a key pattern and 3 regex rules (jwt, bearer, aws_key). Payloads that previously passed through unredacted may now be redacted if they contain these patterns.
- Sorted key traversal. Redacted output dicts now have sorted keys. This is semantically identical but may affect tests that assert on key ordering of redacted output.
- Zero new dependencies. YAML config support is optional (`pip install pyyaml`).
v0.4.1 — Versioned Artifact Schema
Forkline v0.4.1 Release Notes — Versioned Artifact Schema
Release: v0.4.1
Date: 2026-02-23
Milestone: Versioned Artifact Schema (v1.0)
Summary
This release delivers a documented, forward-compatible artifact schema for Forkline run artifacts. Every artifact now carries a mandatory schema_version field, older artifacts are migrated transparently via a deterministic migration pipeline, and unknown fields from newer versions are tolerated without crashing.
This is foundational infrastructure. Schema versioning protects determinism, stability, auditability, diff integrity, and long-term replay trust. Forkline can now evolve its artifact format without breaking history.
What's New
1. Canonical Artifact Schema (forkline/artifact/schema.py)
Typed, versioned models for run artifacts, implemented as frozen dataclasses with zero new dependencies.
Models:
- `RunArtifact` — Top-level artifact with mandatory `schema_version`, `run_id`, `entrypoint`, `started_at`, optional `ended_at`, `status`, `forkline_version`, `events`, and an extensible `metadata` dict.
- `ArtifactEvent` — Single event with `event_id`, `run_id`, `ts`, `type`, and `payload`.
- `SchemaVersionError` — Exception for missing or unsupported schema versions.
Guarantees:
- `schema_version` is mandatory. `RunArtifact.from_dict()` raises `SchemaVersionError` if missing.
- Unknown fields are silently ignored in `from_dict()` — forward compatibility by design.
- Artifacts are immutable (frozen dataclasses).
- `validate()` returns a list of structural errors without raising.
- `to_json()` / `from_json()` provide a deterministic JSON roundtrip.
Schema version: "1.0" (SemVer-style, replacing the legacy "recording_v0" format).
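A minimal sketch of the mandatory-version guarantee, using only the names documented above; the artifact dict itself is illustrative.

from forkline.artifact.schema import RunArtifact, SchemaVersionError

raw = {
    "schema_version": "1.0",
    "run_id": "abc123",
    "entrypoint": "examples/my_flow.py",
    "started_at": "2026-02-23T10:00:00+00:00",
    "events": [],
    "unknown_future_field": "ignored",  # tolerated: unknown fields are dropped
}

artifact = RunArtifact.from_dict(raw)
print(artifact.validate())  # [] for a structurally valid artifact

try:
    RunArtifact.from_dict({"run_id": "abc123"})  # no schema_version
except SchemaVersionError:
    print("schema_version is mandatory")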
2. Deterministic Migration Registry (forkline/artifact/migrate.py)
A versioned migration pipeline that transforms older artifact schemas to the current canonical format.
Primary entry point:
from forkline.artifact import migrate_artifact
migrated = migrate_artifact(raw_json_dict)

Behavior:
- `schema_version == "1.0"` → returns a deep copy unchanged.
- `schema_version` is older (e.g. `"recording_v0"`) → routes through chained migration functions.
- `schema_version` is newer (e.g. `"2.0"`) → warns but returns unchanged (best-effort forward compat).
- `schema_version` is missing → raises `SchemaVersionError`.
Migration registration pattern:
from forkline.artifact import register_migration

def migrate_1_0_to_1_1(raw: dict) -> dict:
    result = dict(raw)
    result.setdefault("new_field", "default_value")
    return result

register_migration("1.0", "1.1", migrate_1_0_to_1_1)

Built-in migration — recording_v0 → 1.0:
- Environment fields (`python_version`, `platform`, `cwd`) moved to the `metadata` dict.
- Event timestamps normalized from `created_at` to `ts`.
- `schema_version` updated to `"1.0"`.
Migration invariants:
- Deterministic: same input always produces same output.
- Side-effect free: no I/O, no network, no state mutation.
- Input is never mutated: deep copy is always made.
- Chains compose: `recording_v0` → `1.0` → `1.1` applied sequentially.
3. Storage Integration
Both storage backends now support schema-aware artifact loading and canonical JSON export.
RunRecorder (flat event model):
- `load_artifact(run_id) -> Optional[RunArtifact]` — loads a run as a canonical artifact, applies migration if needed.
- `export_artifact_json(run_id) -> Optional[str]` — exports a run as canonical JSON with `schema_version`.
SQLiteStore (step-based model):
- `load_artifact(run_id) -> Optional[RunArtifact]` — flattens the step hierarchy into a flat event list, applies migration.
- `export_artifact_json(run_id) -> Optional[str]` — exports a run as canonical JSON.
Both methods handle legacy databases transparently. Artifacts with schema_version: "recording_v0" are migrated to "1.0" on load.
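A short usage sketch of the new methods on RunRecorder; the run ID and database path are placeholders.

import json

from forkline.storage.recorder import RunRecorder

recorder = RunRecorder(db_path="runs.db")
run_id = "b015f49f45c04002a3c489fe84b45c5c"  # placeholder run ID

artifact = recorder.load_artifact(run_id)    # None if the run does not exist
if artifact is not None:
    print(artifact.schema_version)           # "1.0", even for legacy recordings

exported = recorder.export_artifact_json(run_id)
if exported is not None:
    print(json.loads(exported)["schema_version"])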
4. Replay Engine — Version Validation
The replay engine now validates schema_version at the load boundary.
Behavior on ReplayEngine.load_run():
- Current version (`"1.0"`): loaded normally.
- Older version: loaded via migration layer (transparent to caller).
- Newer version: warning issued, best-effort replay.
- Missing version: warning issued, default assumptions applied.
Critical invariant: Replay never crashes due to version mismatches. Degradation is always graceful.
5. CLI JSON Output — schema_version Included
All CLI JSON output now includes schema_version and forkline_version:
- `forkline list --json` — each run includes `schema_version`.
- `forkline replay <run_id> --json` — artifact includes `schema_version` and `forkline_version`.
6. Version Constants Updated (forkline/version.py)
| Constant | Old Value | New Value | Purpose |
|---|---|---|---|
| `SCHEMA_VERSION` | `"recording_v0"` | `"1.0"` | Stamped on all new artifacts |
| `LEGACY_SCHEMA_VERSION` | (new) | `"recording_v0"` | Identifies pre-v1.0 artifacts |
| `DEFAULT_SCHEMA_VERSION` | `"recording_v0"` | `"recording_v0"` | Backward compat for NULL columns |
7. Documentation
docs/artifact_schema.md — Full artifact schema specification including:
- Versioning policy (SemVer-style major/minor rules)
- Schema v1.0 field tables (`RunArtifact`, `Event`)
- Example artifact JSON
- Backward compatibility matrix
- Migration guarantees and registration pattern
- SQLite and JSON storage format details
- Replay integration behavior
- Stability guarantees
- Design principles
README.md — Added "Artifact Stability Guarantee" section:
Forkline guarantees replay compatibility across minor versions. Breaking changes require a major version increment and migration support.
8. Module Exports
New public symbols exported from forkline:
| Symbol | Module | Description |
|---|---|---|
| `RunArtifact` | `forkline.artifact.schema` | Canonical run artifact model |
| `ArtifactEvent` | `forkline.artifact.schema` | Canonical event model |
| `SchemaVersionError` | `forkline.artifact.schema` | Missing/unsupported version exception |
| `migrate_artifact` | `forkline.artifact.migrate` | Primary migration entry point |
| `register_migration` | `forkline.artifact.migrate` | Migration registration for future versions |
Tests
34 new tests in tests/unit/test_artifact_schema.py. All hermetic — no network, deterministic.
TestRunArtifactSchema (9 tests)
| Test | Validates |
|---|---|
| `test_schema_version_required` | schema_version field is present |
| `test_to_dict_includes_schema_version` | Serialized dict includes schema_version |
| `test_to_json_roundtrip` | JSON serialize/deserialize preserves all data |
| `test_from_dict_rejects_missing_schema_version` | SchemaVersionError raised when missing |
| `test_from_dict_ignores_unknown_fields` | Unknown fields silently dropped |
| `test_validate_catches_missing_required_fields` | Validation reports empty required fields |
| `test_validate_passes_for_valid_artifact` | Valid artifact returns no errors |
| `test_metadata_extensibility` | Arbitrary keys in metadata dict |
| `test_immutability` | Frozen dataclass rejects mutation |

TestArtifactEvent (2 tests)
| Test | Validates |
|---|---|
| `test_from_dict_ignores_unknown_fields` | Unknown event fields silently dropped |
| `test_to_dict_roundtrip` | Event roundtrip preserves data |

TestMigrationRegistry (7 tests)
| Test | Validates |
|---|---|
| `test_migrate_current_version_is_noop` | "1.0" → "1.0" returns copy unchanged |
| `test_migrate_current_version_returns_deep_copy` | Deep copy, not same object |
| `test_migrate_missing_schema_version_raises` | SchemaVersionError on missing version |
| `test_migrate_recording_v0_to_1_0` | Full migration: env fields → metadata, ts normalization |
| `test_migrate_is_deterministic` | Same input → same output across invocations |
| `test_migrate_does_not_mutate_input` | Input dict unchanged after migration |
| `test_newer_version_returns_with_warning` | Warning issued, data preserved |
| `test_migrate_non_dict_raises` | Non-dict input raises SchemaVersionError |
TestVersionComparison (6 tests)
| Test | Validates |
|---|---|
| `test_compare_equal` | "1.0" == "1.0" |
| `test_compare_less` | "1.0" < "2.0" |
| `test_compare_greater` | "2.0" > "1.0" |
| `test_legacy_less_than_semver` | "recording_v0" < "1.0" |
| `test_migration_path_exists` | Path from recording_v0 → 1.0 found |
| `test_migration_path_same_version` | Same version returns empty path |

TestStorageArtifactIntegration (6 tests)
| Test | Validates |
|---|---|
| `test_recorder_load_artifact` | RunRecorder.load_artifact() returns valid RunArtifact |
| `test_recorder_export_json` | JSON export includes schema_version |
| `test_recorder_load_artifact_nonexistent` | Returns None for missing run |
| `test_sqlitestore_load_artifact` | SQLiteStore.load_artifact() with step flattening |
| `test_sqlitestore_export_json` | JSON export from step-based store |
| `test_legacy_db_migrates_on_load_artifact` | Legacy recording_v0 DB migrates to "1.0" |

TestSchemaVersionConsistency (2 tests)
| Test | Validates |
|---|---|
| `test_versions_match` | SCHEMA_VERSION == CURRENT_SCHEMA_VERSION |
| `test_schema_version_is_1_0` | Current version is "1.0" |

Updated Tests (2 tests in test_version_schema.py)
| Test | Change |
|---|---|
| `test_schema_version_format` | Updated to validate "major.minor" numeric format instead of "recording_" prefix |
| `test_default_versions_are_reasonable` | Added assertion that current schema differs from default |
Total test count after this release: 258 (22...
v0.4 - CLI
Forkline v0.4 Release Notes — CLI
Release: v0.4.0
Date: 2026-02-23
Milestone: v0.4 — CLI (Roadmap item 4 of 5)
Summary
This release delivers the full forkline CLI: four subcommands that let you run, list, replay, and diff agent runs from the terminal. This is the adoption wedge — Forkline is now usable without writing any Python.
The CLI is thin by design: parse args → call library APIs → render output. No business logic lives in the CLI layer.
What's New
1. forkline run — Execute under tracing
Run any Python script under Forkline tracing. Records execution metadata (script path, timestamps, exit code) and prints the assigned run ID.
$ forkline run examples/ollama_qwen3.py
Calling qwen3 ...
Response: A fork bomb is a denial-of-service attack that recursively spawns
an infinite number of processes to exhaust system resources, causing a crash
or severe performance degradation.
run_id: b015f49f45c04002a3c489fe84b45c5c

Behavior:
- Validates the script file exists (exits 2 if not)
- Executes the script in a subprocess via `sys.executable`
- Sets environment variables for script integration: `FORKLINE_TRACING=1`, `FORKLINE_RUN_ID=<id>`, `FORKLINE_DB=<path>`
- Records run start/end timestamps and exit code via `RunRecorder`
- On non-zero exit: stores `status=failed`, prints `run_id`, and propagates the exit code
- Script arguments are passed after `--`: `forkline run script.py -- --arg1 value`
2. forkline list — List stored runs
Show all recorded runs, newest first.
$ forkline list
ID Created Script Status
------------------------------------------------------------------------------------------------------
7b08ac5e533d456daa7a24921c0d1687 2026-02-23 01:04:34 examples/ollama_qwen3.py ok
b015f49f45c04002a3c489fe84b45c5c 2026-02-23 01:04:20 examples/ollama_qwen3.py ok

Options:
| Flag | Default | Description |
|---|---|---|
| `--limit N` | all | Maximum number of runs to show |
| `--json` | off | Output as JSON array |
| `--db PATH` | `runs.db` | SQLite database path |
JSON output:
[
{
"created_at": "2026-02-23T01:04:34.989096+00:00",
"ended_at": "2026-02-23T01:04:45.039067+00:00",
"entrypoint": "examples/ollama_qwen3.py",
"run_id": "7b08ac5e533d456daa7a24921c0d1687",
"status": "ok"
}
]

Output is deterministic: runs are ordered by started_at DESC, JSON keys are sorted.
3. forkline replay — Replay a recorded run
Load a recorded run by ID and print a summary of its events, duration, and status.
$ forkline replay b015f49f45c04002a3c489fe84b45c5c
Run: b015f49f45c04002a3c489fe84b45c5c
Script: examples/ollama_qwen3.py
Status: ok
Duration: 10.74s
Total events: 2
Events by type:
input: 1
output: 1

Options:
| Flag | Default | Description |
|---|---|---|
| `--json` | off | Output full run and events as JSON |
| `--db PATH` | `runs.db` | SQLite database path |
Exits 2 with a stderr message if the run ID is not found.
4. forkline diff — Diff two runs
Compare two recorded runs event-by-event and report the first point of divergence.
$ forkline diff b015f49f... 7b08ac5e...
Step 1 diverged:
old.type: output
old.payload: {"model": "qwen3", "response": "A fork bomb is a denial-of-service attack tha...
new.type: output
new.payload: {"model": "qwen3", "response": "A fork bomb is a type of denial-of-service at...Options:
| Flag | Default | Description |
|---|---|---|
--format pretty|json |
pretty |
Output format |
--first-divergence |
on | Stop at first divergence |
--db PATH |
runs.db |
SQLite database path |
Identical runs: prints No differences and exits 0.
JSON output:
{
"divergence_index": 1,
"identical": false,
"new": {
"payload": {"model": "qwen3", "response": "A fork bomb is a type of..."},
"type": "output"
},
"old": {
"payload": {"model": "qwen3", "response": "A fork bomb is a denial-of..."},
"type": "output"
},
"run_a": "b015f49f45c04002a3c489fe84b45c5c",
"run_b": "7b08ac5e533d456daa7a24921c0d1687",
"total_events_a": 2,
"total_events_b": 2
}

Diff engine: compares events by index. At each position, checks type and payload for equality. If event counts differ, reports different_event_count at the index where the shorter run ends. Always reports the first divergence.
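The comparison rule is simple enough to sketch: walk both event lists by index, stop at the first position where type or payload differs, and report a count mismatch when one run ends early. The function below illustrates that rule; it is not the CLI's actual implementation.

def first_divergence(events_a, events_b):
    """Return (index, reason) for the first divergence, or None if identical."""
    for i, (a, b) in enumerate(zip(events_a, events_b)):
        if a["type"] != b["type"] or a["payload"] != b["payload"]:
            return i, "event_mismatch"
    if len(events_a) != len(events_b):
        # The shorter run ends here; report the mismatch at that index.
        return min(len(events_a), len(events_b)), "different_event_count"
    return None

events_a = [{"type": "output", "payload": {"answer": "4"}}]
events_b = [{"type": "output", "payload": {"answer": "5"}}]
print(first_divergence(events_a, events_b))  # (0, 'event_mismatch')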
5. Exit Code Contract
All commands follow a consistent exit code convention for CI scriptability:
| Code | Meaning |
|---|---|
| 0 | Success (command completed, runs match for diff) |
| 1 | Divergence found (diff only) |
| 2 | Input error (missing run ID, file not found, invalid args) |
6. Environment Variable Bridge
forkline run sets three environment variables before executing the script subprocess:
| Variable | Value | Purpose |
|---|---|---|
| `FORKLINE_TRACING` | `1` | Signal that tracing is active |
| `FORKLINE_RUN_ID` | `<hex>` | Run ID assigned by the CLI |
| `FORKLINE_DB` | `<path>` | Database path for event logging |
Scripts can read these to log events to the same run:
import os
from forkline.storage.recorder import RunRecorder
db = os.environ.get("FORKLINE_DB", "runs.db")
run_id = os.environ.get("FORKLINE_RUN_ID")
recorder = RunRecorder(db_path=db)
recorder.log_event(run_id, "input", {"prompt": "hello"})

7. RunRecorder.list_runs() — New API
Added list_runs(limit=None) to RunRecorder. Returns runs ordered by started_at DESC with backward-compatible version defaults.
recorder = RunRecorder()
runs = recorder.list_runs(limit=10)  # newest 10 runs

8. Ollama Qwen3 Example (examples/ollama_qwen3.py)
A live example that calls Ollama's Qwen3 model and records the input/output as Forkline events. Demonstrates nondeterminism detection: same prompt, same model, different response.
$ forkline run examples/ollama_qwen3.py # run 1
$ forkline run examples/ollama_qwen3.py # run 2
$ forkline diff <run_1> <run_2>              # nondeterminism caught

Uses only urllib.request from the standard library — no new dependencies.
Tests
38 new tests in tests/unit/test_cli.py. All hermetic except TestCLIRun which spawns real subprocesses against temporary scripts and databases.
TestRenderRunResult (1 test)
| Test | Validates |
|---|---|
| `test_format` | Output is run_id: <id> |

TestRenderListTable (2 tests)
| Test | Validates |
|---|---|
| `test_empty` | "No runs found." for empty list |
| `test_header_and_row` | Header columns, timestamp formatting, values present |

TestRenderListJSON (1 test)
| Test | Validates |
|---|---|
| `test_json_array` | Valid JSON array with correct fields |

TestRenderReplaySummary (1 test)
| Test | Validates |
|---|---|
| `test_contains_fields` | Run ID, status, duration, event counts, events by type |

TestRenderReplayJSON (2 tests)
| Test | Validates |
|---|---|
| `test_valid_json_with_all_fields` | All fields present and correct in parsed JSON |
| `test_empty_events` | Zero events, null timestamps handled |

TestRenderDiffPretty (2 tests)
| Test | Validates |
|---|---|
| `test_identical` | "No differences" |
| `test_diverged` | "Step N diverged:" with old/new type and payload |

TestRenderDiffJSON (2 tests)
| Test | Validates |
|---|---|
| `test_identical` | {"identical": true} |
| `test_diverged` | divergence_index and old/new present |

TestDiffEvents (6 tests)
| Test | Validates |
|---|---|
| `test_identical` | Same events → identical: true |
| `test_type_mismatch` | Different event types → divergence at index 0 |
| `test_payload_mismatch` | Same type, different payload → divergence at index 0 |
| `test_different_lengths` | Shorter list → reason: different_event_count |
| `test_both_empty` | Two empty lists → identical |
| `test_finds_first_divergence` | Three events, divergence at index 1 (not 2) |
TestListRuns (3 tests)
| Test | Validates |
|---|---|
| `test_empty` | Empty database → empty list |
| `test_ordered_newest_first` | Most recent run is first |
| `test_limit` | limit=2 returns exactly 2 from 3 runs |

TestCLIList (3 tests)
| Test | Validates |
|---|---|
| `test_list_shows_runs` | Run ID and script name in table output |
| `test_list_json` | Valid JSON array with correct run ID |
| `test_list_empty` | "No runs found" for empty database |

TestCLIReplay (3 tests)
| Test | Validates |
|---|---|
| `test_replay_success` | Run ID, status, event count in output; exit 0 |
| `test_replay_missing_run` | Exit 2 for nonexistent run |
| `test_replay_json` | Valid JSON with run_id and total_events |

TestCLIDiff (6 tests)
| Test | Validates |
|---|---|
| `test_identical_runs` | "No differences"; exit 0 |
| `test_different_runs` | "Step 0 diverged"; exit 1 |
| `test_diff_json_format` | JSON with "identical": true |
| `test_diff_missing_run` | Exit 2 for nonexistent run |
| `test_diff_different_event_counts` | "Event count differs" message |
| `test_diff_json_diverged` | JSON with divergence_index and old/new |

TestCLIRun (4 tests)
| Test | Validates |
|---|---|
| `test_run_missing_file` | Exit 2 for nonexistent script |
| `test_run_success` | run_id: in output; run stored with `... |
v0.3 — First-Divergence Diffing
Forkline v0.3 Release Notes — First-Divergence Diffing
Release: v0.3.0
Date: 2026-02-21
Milestone: v0.3 — First-Divergence Diffing (Roadmap item 3 of 5)
Summary
This release delivers first-divergence diffing: given two recorded runs, Forkline now compares them step-by-step and returns the first point of divergence with deterministic classification, structured JSON diff patches, and rule-based explanations.
This is the core feature that turns Forkline from a recording/replay tool into a forensic debugging tool — answering not just that two runs differ, but where, how, and what changed.
What's New
1. Deterministic Canonicalization (forkline/core/canon.py)
A canonicalization layer that produces stable, deterministic byte representations of any value before hashing or diffing.
Functions:
- `canon(value, profile="strict") -> bytes` — Canonicalize any value to bytes
- `sha256_hex(data: bytes) -> str` — SHA-256 hex digest
- `bytes_preview(data: bytes) -> str` — Human-readable `sha256:<hash>:<hex_prefix>` format
Canonicalization guarantees:
- Dict key order is irrelevant. Keys are sorted lexicographically before serialization.
- Unicode is NFC-normalized. `"café"` (precomposed) and `"café"` (decomposed e + combining accent) produce identical output.
- Newlines are normalized to LF. `\r\n` and `\r` are collapsed to `\n`.
- Floats use 17-significant-digit precision. `-0.0` collapses to `0.0`. `NaN` and `Inf` are serialized as stable strings.
- Booleans and integers are distinct. `True` and `1` produce different canonical bytes.
- Bytes pass through unchanged. Binary data is not re-encoded; hashing uses SHA-256 with a hex prefix preview for display.
- Compact JSON encoding. No whitespace in separators (`","` and `":"`), `ensure_ascii=False`.
Zero dependencies. Uses only hashlib, json, math, unicodedata from the standard library.
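A minimal usage sketch with the functions listed above; the input values are arbitrary and only demonstrate the key-order and hashing guarantees.

from forkline.core.canon import canon, sha256_hex

a = canon({"z": 1, "a": 2})
b = canon({"a": 2, "z": 1})  # same mapping, different construction order

assert a == b          # canonical bytes are identical
print(sha256_hex(a))   # stable digest for either construction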
2. Deterministic JSON Diff Patches (forkline/core/json_diff.py)
A recursive JSON diff algorithm that produces a stable, ordered list of patch operations for any two JSON-like values.
Function:
json_diff(old, new, path="$") -> List[Dict]
Patch operation format:
[
{"op": "remove", "path": "$.a.b", "old": "<removed_value>"},
{"op": "add", "path": "$.x", "value": "<added_value>"},
{"op": "replace","path": "$.k", "old": "<old_value>", "new": "<new_value>"}
]

Ordering guarantees (deterministic across invocations):
- Dicts: removed keys (sorted) → added keys (sorted) → common keys (sorted, recursed).
- Lists: compared by index; removes at tail, then adds at tail.
- Type mismatch: replace whole node.
- Numeric compatibility: `int` vs `float` compared as numeric, not as type mismatch.
Paths use JSONPath-style notation: $.outer.inner, $.list[0], $.nested.array[2].field.
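A short sketch of the patch output for a small change, using the signature above; the expected operations follow the dict ordering rules just described (removed keys, then added keys, then common keys).

from forkline.core.json_diff import json_diff

old = {"answer": "4", "mode": "fast"}
new = {"answer": "5", "note": "wrong!"}

for op in json_diff(old, new):
    print(op)
# Expected shape per the ordering rules above:
#   {"op": "remove",  "path": "$.mode",   "old": "fast"}
#   {"op": "add",     "path": "$.note",   "value": "wrong!"}
#   {"op": "replace", "path": "$.answer", "old": "4", "new": "5"}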
3. First-Divergence Engine (forkline/core/first_divergence.py)
The core diffing algorithm: compare two Run objects step-by-step, classify the first mismatch, and return a structured result.
Algorithm
- Lockstep comparison. Walk both runs at the same index. At each step, classify by comparing (in priority order): step name → input hash → error state → output hash → all events hash.
- Resync window. On mismatch, search within a configurable window (default W=10) for matching "soft signatures" — `(step_name, input_hash)` tuples. The search iterates by increasing combined distance from the mismatch point, finding the nearest resync.
- Gap classification.
  - Resync with `gap_a > 0, gap_b == 0` → `missing_steps` (steps in run_a absent from run_b)
  - Resync with `gap_b > 0, gap_a == 0` → `extra_steps` (steps in run_b not in run_a)
  - Both gaps > 0 → falls through to classify the mismatch at current position
  - No resync → classify by what differs at current position
- Length mismatch. If one run is longer after lockstep exhausts the shorter, classify as `missing_steps` or `extra_steps`.
Divergence Types
| Type | Trigger | Explanation Pattern |
|---|---|---|
| `exact_match` | Runs identical | "Runs are identical (N steps compared)" |
| `op_divergence` | Step names differ | "Step 3: operation mismatch ('tool_call' vs 'llm_call')" |
| `input_divergence` | Same name, different input | "Step 3 'tool_call': input differs" |
| `output_divergence` | Same name + input, different output | "Step 3 'tool_call': output differs (same input)" |
| `error_divergence` | Error presence or content differs | "Step 3 'tool_call': error state differs" |
| `missing_steps` | Steps in run_a not in run_b | "Step 5 from run_a missing in run_b" |
| `extra_steps` | Steps in run_b not in run_a | "Steps 3..4 in run_b not present in run_a" |
All explanations are deterministic and rule-based — no LLM narration, no randomness.
Classification Priority
When two steps share a name but differ, classification follows strict priority:
- Input divergence — checked first because differing inputs explain differing outputs
- Error divergence — error presence/absence or content differs
- Output divergence — same input but different output (nondeterminism signal)
- All-events fallback — catches differences in `tool_call`, `artifact_ref`, or other event types
Data Models
StepSummary — Compact step representation included in results:
StepSummary(
    idx=2,
    name="generate_response",
    input_hash="a1b2c3d4...",
    output_hash="e5f6a7b8...",
    event_count=3,
    has_error=False,
)

FirstDivergenceResult — Complete result object:

FirstDivergenceResult(
    status="output_divergence",        # DivergenceType
    idx_a=2,                           # Index in run_a at divergence
    idx_b=2,                           # Index in run_b at divergence
    explanation="Step 2 'generate_response': output differs (same input)",
    old_step=StepSummary(...),         # Step from run_a
    new_step=StepSummary(...),         # Step from run_b
    input_diff=None,                   # JSON patch (when applicable)
    output_diff=[{"op": "replace", "path": "$[0].text", ...}],
    last_equal_idx=1,                  # Last step where both matched
    context_a=[StepSummary(...), ...], # 2 steps before/after in run_a
    context_b=[StepSummary(...), ...], # 2 steps before/after in run_b
)

Both models are frozen dataclasses with .to_dict() for JSON serialization.
API
from forkline.core.first_divergence import find_first_divergence, DivergenceType
result = find_first_divergence(
    run_a,
    run_b,
    window=10,       # Resync window size
    context_size=2,  # Steps before/after divergence in context
    show="both",     # "input", "output", or "both"
)

# JSON-serializable output
import json
print(json.dumps(result.to_dict(), indent=2))

4. CLI — forkline diff (forkline/cli.py)
The first CLI subcommand, establishing the forkline command-line interface.
Usage:
forkline diff --first <run_a> <run_b> [OPTIONS]

Options:
| Flag | Default | Description |
|---|---|---|
| `--first` | `true` | Show first divergence only |
| `--window N` | `10` | Resync window size |
| `--format json\|text` | `text` | Output format |
| `--show input\|output\|both` | `both` | Which diffs to include |
| `--canon strict` | `strict` | Canonicalization profile |
| `--db PATH` | `forkline.db` | SQLite database path |
Exit codes:
- 0 — Runs are identical (exact_match)
- 1 — Divergence detected (any other status)
This makes forkline diff directly usable in CI pipelines and shell scripts.
Text output sample:
First divergence: output_divergence
Step 2 'generate_response': output differs (same input)
Run A step 2 'generate_response':
input_hash: a1b2c3d4e5f6a7b8...
output_hash: 1234567890abcdef...
events: 3
has_error: False
Run B step 2 'generate_response':
input_hash: a1b2c3d4e5f6a7b8...
output_hash: fedcba0987654321...
events: 3
has_error: False
Output diff:
replace $[0].text: "Expected response" -> "Different response"
Last equal: step 1
Context A: [step 0 'init', step 1 'prepare', step 2 'generate_response']
Context B: [step 0 'init', step 1 'prepare', step 2 'generate_response']
Entry point: Registered as forkline = "forkline.cli:main" in pyproject.toml ([project.scripts]).
5. Module Exports
New public symbols exported from forkline and forkline.core:
| Symbol | Module | Description |
|---|---|---|
| `find_first_divergence` | `forkline.core.first_divergence` | Main engine function |
| `FirstDivergenceResult` | `forkline.core.first_divergence` | Result dataclass |
| `StepSummary` | `forkline.core.first_divergence` | Compact step summary |
| `DivergenceType` | `forkline.core.first_divergence` | Type classification constants |
| `canon` | `forkline.core.canon` | Value → canonical bytes |
| `sha256_hex` | `forkline.core.canon` | Bytes → SHA-256 hex |
| `bytes_preview` | `forkline.core.canon` | Bytes → human-readable hash preview |
| `json_diff` | `forkline.core.json_diff` | Deterministic JSON diff patches |
Tests
45 new tests across 3 test classes in tests/unit/test_first_divergence.py. All hermetic — no database, no disk I/O, no network.
TestCanonStability (14 tests)
| Test | Validates |
|---|---|
| `test_dict_key_order_irrelevant` | {"z":1,"a":2} == {"a":2,"z":1} |
| `test_nested_dict_stability` | Deep nesting with mixed key order |
| `test_unicode_normalization` | NFC: \u00e9 == e\u0301 |
| `test_newline_normalization` | `\r... |
v0.1.1 - Recording & Artifact Foundations
Forkline v0.1.1 — Recording & Artifact Foundations
Release focus: establish Forkline’s core recording primitives and artifact model for deterministic agent workflows.
v0.1.1 intentionally does not include replay. This release lays the groundwork by making runs recordable, inspectable, and diffable in a local-first, immutable format.
✨ What’s New
Deterministic Run Recording (Foundational)
- Introduced a structured run recording model for agentic workflows.
- Captures ordered execution steps including:
- LLM inputs and outputs
- Tool invocations
- Execution metadata
- Artifacts are written locally and treated as immutable once persisted.
This establishes Forkline’s core abstraction:
a run is a durable, replayable artifact — not a log stream.
Artifact Schema v0
- Added a first-pass, explicit artifact schema (`recording_v0`).
- Clearly separates:
- Run-level metadata
- Step-level inputs and outputs
- Execution ordering
- Schema is designed for future replay and diffing, not observability.
This schema defines the baseline contract for Forkline artifacts.
Redaction Support
- Introduced a redaction layer for recorded artifacts.
- Enables sensitive fields (API keys, tokens, PII) to be:
- Redacted at record time, or
- Scrubbed before persistence
- Redaction is explicit and policy-driven — never implicit.
This makes Forkline artifacts safe for local inspection and sharing.
Run & Step Diffing (Structural)
- Added initial diff utilities for comparing recorded runs or steps.
- Focuses on structural and semantic differences, not textual logs.
- Intended for offline analysis and future replay divergence detection.
This is a foundational capability, not a visualization layer.
Core Types & Invariants
- Formalized core domain types:
- Runs
- Steps
- Artifacts
- Diff results
- Introduced the concept of core invariants:
- Ordering matters
- Artifacts are the source of truth
- No mutation after persistence
These invariants guide all future Forkline features.
❌ Explicit Non-Goals (v0.1.1)
To avoid confusion, v0.1.1 does not include:
- Replay or re-execution
- First-divergence detection
- OpenTelemetry integration
- Observability, tracing, or metrics
- Production or distributed runtime support
These exclusions are deliberate.
Why This Release Matters
v0.1.1 is about credibility, not completeness.
It answers one question clearly:
Can Forkline reliably capture an agent run as a durable, inspectable artifact?
The answer is now yes.
Replay, divergence detection, and developer-facing workflows are built on top of this foundation.
What’s Next
- Deterministic replay from recorded artifacts
- First-divergence detection
- Golden replay tests
- Minimal replay demos
(Tracked for the next release.)
What's Changed
- Replay Engine by @sauravvenkat in #16
Full Changelog: v0.1.0...v0.1.1
v0.1.0 - Deterministic Run Recording
Forkline v0.1.0
First release of Forkline: local-first, replay-first tracing for agentic AI workflows.
What's in v0.1
✅ Deterministic recording of agent runs
✅ Self-contained artifacts stored in SQLite
✅ Security-first with automatic redaction (SAFE mode)
✅ Human-inspectable with sqlite3 or helper scripts
✅ Append-only logging with versioned schema
Quick Start
git clone https://github.com/sauravvenkat/forkline.git
cd forkline
source dev.env
python examples/minimal.py
python scripts/inspect_runs.py
What's Changed
- Boilerplate by @sauravvenkat in #1
- Update README.md and adding ROADMAP in docs folder with design/roadmap for v0.* until v1.0 by @sauravvenkat in #10
- Deterministic run recording v0 + repo module restructure (core/, storage/, tracer/) by @sauravvenkat in #11
- Adding REDACTION_POLICY.md doc by @sauravvenkat in #13
- Implement RedactionPolicy v0 by @sauravvenkat in #14
New Contributors
- @sauravvenkat made their first contribution in #1
Full Changelog: https://github.com/sauravvenkat/forkline/commits/v0.1.0