Integrate waves 1-5 + live validation & benchmarks #44

Merged: urmzd merged 6 commits into main from integration/all-waves on May 31, 2026

urmzd commented on May 31, 2026


Combines the five roadmap waves on top of main (foundation #38 already merged), with the inter-wave agent.go conflicts resolved, plus a live validation harness and benchmarks.

What's here

Five waves cherry-picked and reconciled. All compose cleanly, with go build, go vet, golangci-lint, and go test ./... (39 pkgs) all green:

  • Wave 1 — channel-wrapping retry + token-usage metric
  • Wave 2 — conversation-tree persistence via a Store seam + in-memory memstore
  • Wave 3 — path-traversal guard in read_file + HITL-gated mutating KG/RAG tools
  • Wave 4 — time-decay recency scoring (RAG) + wired contradiction invalidation (KG)
  • Wave 5 — per-call timeouts + parallel-tool concurrency cap + ErrMaxIterations signal

The only real conflicts were in getAssistantMessage and AgentConfig (wave 1's token metric vs. wave 5's LLM-timeout); resolved so both run (timeout check first, then token recording).

Live validation (gpt-4o-mini)

examples/validation exercises the features against a real model and writes a report. Last run: 8/8 passing.

Features covered (all 8 passed in the last run):
basic_generation, tool_calling, response_caching, token_metrics,
agent_handoff, durable_memoization, llm_timeout, multimodal_tool_output

Run: OPENAI_API_KEY=... just validate. Sample outputs are committed under examples/validation/results/.

Benchmarks (mock-based, deterministic)

agent/bench_test.go + agent/provider/cache/bench_test.go: agent loop ~4.9µs/op, negligible durable-noop overhead, cache hit ~0.5µs. Regenerate with just bench-report.

Relationship to the wave PRs

Each wave is also a standalone, now-mergeable PR (#41 #39 #40 #43 #42), all rebased onto main and green. Merge those individually or merge this integration branch — either way the conflicts are pre-resolved here.

urmzd added 6 commits May 31, 2026 16:10
… KG/RAG tools

Wave 3 isolation & security hardening.

(a) Path-traversal confinement in tools/research:
- Add resolveWithinRoot helper (safepath.go): resolves the requested path to
  an absolute path under a configured root and rejects ../ traversal, absolute
  paths outside root, and symlinks whose real target escapes root.
- read_file and file_search now confine all reads to the root.
- Table-driven tests cover ../../etc/passwd, absolute escapes, symlink escape.
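
The confinement described in (a) can be sketched roughly as follows. Only the name resolveWithinRoot comes from the commit message; the signature, the errEscapesRoot sentinel, and the within helper are assumptions made for illustration, and the real safepath.go may differ.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// errEscapesRoot is a hypothetical sentinel; the commit does not name its error.
var errEscapesRoot = errors.New("path escapes root")

// within reports whether p is root itself or lies beneath it.
func within(root, p string) bool {
	rel, err := filepath.Rel(root, p)
	if err != nil {
		return false
	}
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

// resolveWithinRoot returns the real path of requested if it stays under root.
// Assumed signature; only the helper's name appears in the commit message.
func resolveWithinRoot(root, requested string) (string, error) {
	absRoot, err := filepath.Abs(root)
	if err != nil {
		return "", err
	}
	realRoot, err := filepath.EvalSymlinks(absRoot)
	if err != nil {
		return "", err
	}
	candidate := filepath.Clean(requested)
	if !filepath.IsAbs(candidate) {
		candidate = filepath.Join(realRoot, candidate)
		// Lexical check: rejects ../ traversal without touching the filesystem.
		if !within(realRoot, candidate) {
			return "", errEscapesRoot
		}
	}
	// Resolve symlinks, then check again so that absolute paths outside root
	// and links inside root that point outside it are both rejected.
	resolved, err := filepath.EvalSymlinks(candidate)
	if err != nil {
		return "", err
	}
	if !within(realRoot, resolved) {
		return "", errEscapesRoot
	}
	return resolved, nil
}

func main() {
	root, err := os.MkdirTemp("", "root")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	_ = os.WriteFile(filepath.Join(root, "notes.txt"), []byte("hi"), 0o600)
	_ = os.Symlink(os.TempDir(), filepath.Join(root, "escape")) // link pointing outside root
	for _, p := range []string{"notes.txt", "../../etc/passwd", "escape"} {
		_, err := resolveWithinRoot(root, p)
		fmt.Printf("%q allowed=%v\n", p, err == nil)
	}
}
```

Checking the resolved path a second time is what covers the symlink case: a purely lexical check would accept a link that lives under the root but targets a directory outside it.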

(b) HITL-gate mutating KG/RAG agent tools:
- rag/tool, knowledge/tool, tools/research NewTools now wrap mutating tools
  (rag_update, rag_delete, kg_ingest, store_knowledge) in a human_approval
  Marker by default so the agent loop pauses for approval.
- Add ReadOnly() functional option to omit mutating tools entirely.
- Tests assert mutating tools carry the marker and are absent in read-only mode.
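
A minimal sketch of how a variadic ReadOnly() option and a default approval marker could fit together. The Tool struct, its fields, and the rag_search tool name are hypothetical stand-ins, not the SDK's actual types; only ReadOnly, NewTools, human_approval, rag_update, and rag_delete are taken from the commit message.

```go
package main

import "fmt"

// Tool is a hypothetical stand-in for the SDK's tool type.
type Tool struct {
	Name     string
	Mutating bool
	Marker   string // "human_approval" when the loop must pause for approval
}

type options struct{ readOnly bool }

// Option is a functional option; adding it variadically keeps old call sites valid.
type Option func(*options)

// ReadOnly omits mutating tools entirely.
func ReadOnly() Option { return func(o *options) { o.readOnly = true } }

// NewTools returns the tool set, gating mutating tools behind approval by default.
func NewTools(opts ...Option) []Tool {
	var cfg options
	for _, opt := range opts {
		opt(&cfg)
	}
	all := []Tool{
		{Name: "rag_search"}, // hypothetical read-only tool
		{Name: "rag_update", Mutating: true},
		{Name: "rag_delete", Mutating: true},
	}
	out := make([]Tool, 0, len(all))
	for _, t := range all {
		if t.Mutating {
			if cfg.readOnly {
				continue // read-only mode drops the tool rather than gating it
			}
			t.Marker = "human_approval"
		}
		out = append(out, t) // relative order of the remaining tools is preserved
	}
	return out
}

func main() {
	fmt.Println(NewTools())
	fmt.Println(NewTools(ReadOnly()))
}
```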

Backward compatible: NewTools signatures gain only variadic options; tool name
ordering preserved.

…ation

RAG: add WithRecency(halfLife, weight) SearchOption that blends an
exponential time-decay factor exp(-ln2*age/halfLife) into each hit's
fused score after RRF. Opt-in (non-positive half-life is a no-op).
SearchHit gains a Timestamp populated from the document UpdatedAt
(fallback CreatedAt) in memstore and pgstore.
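
To make the decay factor concrete, here is a small sketch. The decay function follows the formula in the commit message; the blend function is an assumption (a linear interpolation by weight), since the message does not say exactly how the factor is combined with the fused score.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// decay returns exp(-ln2 * age / halfLife): 1 for a fresh document and 0.5 at
// one half-life. A non-positive half-life disables decay.
func decay(age, halfLife time.Duration) float64 {
	if halfLife <= 0 {
		return 1
	}
	if age < 0 {
		age = 0 // treat future timestamps as fresh
	}
	return math.Exp(-math.Ln2 * float64(age) / float64(halfLife))
}

// blend mixes the decay factor into a fused score. ASSUMPTION: linear
// interpolation controlled by weight; the commit does not state the formula.
func blend(score float64, age, halfLife time.Duration, weight float64) float64 {
	if halfLife <= 0 {
		return score // opt-in: non-positive half-life is a no-op
	}
	return score * ((1 - weight) + weight*decay(age, halfLife))
}

func main() {
	halfLife := 24 * time.Hour
	for _, age := range []time.Duration{0, 24 * time.Hour, 72 * time.Hour} {
		fmt.Printf("age=%v decay=%.3f blended=%.3f\n",
			age, decay(age, halfLife), blend(1.0, age, halfLife, 0.5))
	}
}
```

Under this assumed blend, a weight of 0 leaves the RRF score untouched and a weight of 1 multiplies it by the decay factor directly.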

KG: wire the previously-unused Store.InvalidateRelation into the engine
ingestion path. A new relation for an existing (source, target, type) triple
supersedes, by rule, the active prior relation(s) of the same type, setting
their InvalidAt to the new relation's ValidAt.

Wave 1 correctness floor.

(a) retry.Provider now retries when the stream emits an ErrorDelta BEFORE
any content delta, not just when ChatStream returns a synchronous error.
Streaming adapters surface transient failures (529 overload, mid-handshake
timeouts) as a channel-delivered ErrorDelta; the decorator buffers leading
metadata deltas, classifies the error via the existing transient/ShouldRetry
path, and re-invokes with backoff. Once content has streamed, the error is
surfaced (never retry a partially consumed turn).
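
The buffering behavior in (a) can be illustrated with a channel wrapper along these lines. Delta, the Kind constants, withRetry, scripted, and errOverloaded are all hypothetical names for this sketch and are not the SDK's types; the linear backoff is also a simplification.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type Kind int

const (
	KindMetadata Kind = iota
	KindContent
	KindError
)

// Delta is a hypothetical stand-in for the SDK's stream delta.
type Delta struct {
	Kind Kind
	Text string
	Err  error
}

var errOverloaded = errors.New("529 overloaded") // illustrative transient error

// withRetry re-invokes start while a transient error arrives before any content.
// Leading metadata is held back so a discarded attempt leaks nothing downstream.
func withRetry(start func() <-chan Delta, maxAttempts int, backoff time.Duration, transient func(error) bool) <-chan Delta {
	out := make(chan Delta)
	go func() {
		defer close(out)
		for attempt := 1; ; attempt++ {
			src := start()
			var pending []Delta // metadata buffered until the attempt is known to be viable
			committed, retry := false, false
			for d := range src {
				if !committed && d.Kind == KindError && attempt < maxAttempts && transient(d.Err) {
					retry = true
					break
				}
				if !committed && d.Kind == KindMetadata {
					pending = append(pending, d)
					continue
				}
				if !committed { // first content, or an error that will not be retried
					for _, p := range pending {
						out <- p
					}
					pending, committed = nil, true
				}
				out <- d // after commit everything passes through, errors included
			}
			if !retry {
				for _, p := range pending { // stream ended with metadata only
					out <- p
				}
				return
			}
			go func() { // drain so the abandoned producer can exit
				for range src {
				}
			}()
			time.Sleep(time.Duration(attempt) * backoff)
		}
	}()
	return out
}

// scripted returns a start function that replays one canned stream per call.
func scripted(attempts ...[]Delta) func() <-chan Delta {
	i := 0
	return func() <-chan Delta {
		script := attempts[i%len(attempts)]
		i++
		ch := make(chan Delta, len(script))
		for _, d := range script {
			ch <- d
		}
		close(ch)
		return ch
	}
}

func main() {
	start := scripted(
		[]Delta{{Kind: KindMetadata, Text: "req-1"}, {Kind: KindError, Err: errOverloaded}},
		[]Delta{{Kind: KindMetadata, Text: "req-2"}, {Kind: KindContent, Text: "hello"}},
	)
	always := func(error) bool { return true }
	for d := range withRetry(start, 3, 10*time.Millisecond, always) {
		fmt.Printf("kind=%d text=%q err=%v\n", d.Kind, d.Text, d.Err)
	}
}
```

Holding metadata until the first content delta is what makes the retry invisible to consumers: nothing from a failed attempt has been forwarded by the time it is discarded.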

(b) The agent loop now calls Metrics.RecordTokenUsage once per completed LLM
call with the merged prompt/completion tokens (skipped on cache hit or when
no usage was reported). agent/otel collapses the three duplicate
gen_ai.client.operation.duration histograms into one instrument keyed by
gen_ai.operation.name.

Wire agent tree persistence behind the existing types.Store seam, testable
without Postgres.

- Add AgentConfig.Store + WithStore option (default nil = today's
  in-memory-only behavior, fully backward compatible).
- runLoop persists each new node (and branch tip) to the Store as it is
  added, via Store.Tx so the tip never points at an unsaved node.
  Best-effort: errors are logged, never fatal.
- NewAgent persists the root node + main branch up front when a Store is
  configured, giving LoadTreeFromStore an anchor before the first Invoke.
- Add LoadTreeFromStore helper (Store.LoadTree + tree.FromStore) for the
  read/resume path.
- New package agent/store/memstore: in-memory types.Store implementing the
  full interface (SaveNode/LoadNode/LoadChildren/LoadPath/SaveBranch/
  LoadBranch/ListBranches/SaveCheckpoint/LoadCheckpoint/LoadTree/Tx) with
  atomic buffered transactions.

Tests: memstore unit tests (round-trip, children order, path, branches,
checkpoints, reachable-subtree LoadTree, Tx commit/rollback); agent
multi-turn Invoke -> reconstruct tree from memstore -> assert full message
history round-trips; backward-compat (nil Store) and root-on-construction.

…eration signal

Add GA-hardening limits to the agent loop:

- LLMTimeout/ToolTimeout (+ WithLLMTimeout/WithToolTimeout): derive a child
  context.WithTimeout around the provider call in getAssistantMessage and
  around each tool step in executeOneTool. A slow provider surfaces a transient
  ProviderError; a slow tool surfaces a deadline-exceeded tool error (even if
  the tool ignores ctx and completes late). 0 = no timeout (default).
- MaxParallelTools (+ WithMaxParallelTools): bound the parallel-tool goroutines
  with a buffered-channel semaphore. 0 = unlimited. Durable-runner sequential
  path is unchanged.
- ErrMaxIterations signal: emit types.ErrorDelta{Error: ErrMaxIterations} when
  runLoop breaks on the iteration cap while the last assistant turn still had
  pending tool calls, so consumers can tell truncated from a clean finish. Not
  emitted on a natural text-only/empty finish.
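
As a rough illustration of the first two limits, the sketch below enforces a per-tool deadline even when the tool ignores its context, and caps parallelism with a buffered-channel semaphore. Tool, runTool, runAll, and demo are hypothetical names; the actual executeOneTool code is not shown in this PR text and may be structured differently.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Tool is a hypothetical stand-in for the SDK's tool signature.
type Tool func(ctx context.Context) (string, error)

// runTool stops waiting at the deadline even if the tool ignores ctx.
func runTool(ctx context.Context, timeout time.Duration, tool Tool) (string, error) {
	if timeout <= 0 {
		return tool(ctx) // 0 = no timeout
	}
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	type result struct {
		out string
		err error
	}
	done := make(chan result, 1) // buffered so a late tool can still send and exit
	go func() {
		out, err := tool(ctx)
		done <- result{out, err}
	}()
	select {
	case r := <-done:
		return r.out, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// runAll runs tools concurrently, at most maxParallel at a time (0 = unlimited).
func runAll(ctx context.Context, tools []Tool, maxParallel int, timeout time.Duration) []error {
	var sem chan struct{}
	if maxParallel > 0 {
		sem = make(chan struct{}, maxParallel)
	}
	errs := make([]error, len(tools))
	var wg sync.WaitGroup
	for i, tool := range tools {
		wg.Add(1)
		go func(i int, tool Tool) {
			defer wg.Done()
			if sem != nil {
				sem <- struct{}{}        // acquire a slot
				defer func() { <-sem }() // release it
			}
			_, errs[i] = runTool(ctx, timeout, tool)
		}(i, tool)
	}
	wg.Wait()
	return errs
}

// demo returns the error from a ctx-ignoring slow tool and the peak number of
// tools observed running at once under a cap of 2.
func demo() (timeoutErr error, peak int32) {
	slow := func(context.Context) (string, error) {
		time.Sleep(200 * time.Millisecond) // ignores ctx on purpose
		return "late", nil
	}
	_, timeoutErr = runTool(context.Background(), 20*time.Millisecond, slow)

	var running int32
	var mu sync.Mutex
	counted := func(context.Context) (string, error) {
		n := atomic.AddInt32(&running, 1)
		mu.Lock()
		if n > peak {
			peak = n
		}
		mu.Unlock()
		time.Sleep(5 * time.Millisecond)
		atomic.AddInt32(&running, -1)
		return "ok", nil
	}
	runAll(context.Background(), []Tool{counted, counted, counted, counted, counted, counted}, 2, 0)
	return timeoutErr, peak
}

func main() {
	err, peak := demo()
	fmt.Println("slow tool error:", err)
	fmt.Println("peak concurrency:", peak)
}
```

One consequence of this design is that a tool which ignores ctx keeps running in the background after its deadline; the caller only stops waiting for it.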

Table-driven tests in agent/limits_test.go cover all three plus the disabled/
unlimited defaults. Existing tests unchanged.

Add examples/validation: a runnable harness that exercises the agent SDK's
features against a real model (gpt-4o-mini) — basic generation, tool calling,
response caching (CacheHit), token metrics, agent handoff, durable memoization,
LLM timeout, and multimodal tool output. Skips cleanly without OPENAI_API_KEY.
Committed sample outputs under examples/validation/results/ (report + bench
numbers) so users can see real runs. Adds mock-based benchmarks in
agent/bench_test.go and agent/provider/cache/bench_test.go, plus `just validate`
and `just bench-report` targets.