Integrate waves 1-5 + live validation & benchmarks by urmzd · Pull Request #44 · urmzd/saige

urmzd · 2026-05-31T21:19:17Z

Combines the five roadmap waves on top of main (foundation #38 already merged), with the inter-wave agent.go conflicts resolved, plus a live validation harness and benchmarks.

What's here

Five waves cherry-picked and reconciled (all compose cleanly — go build, go vet, golangci-lint, and go test ./... (39 pkgs) all green):

Wave 1 — channel-wrapping retry + token-usage metric
Wave 2 — conversation-tree persistence via a Store seam + in-memory memstore
Wave 3 — path-traversal guard in read_file + HITL-gated mutating KG/RAG tools
Wave 4 — time-decay recency scoring (RAG) + wired contradiction invalidation (KG)
Wave 5 — per-call timeouts + parallel-tool concurrency cap + ErrMaxIterations signal

The only real conflicts were in getAssistantMessage and AgentConfig (wave 1's token metric vs. wave 5's LLM-timeout); resolved so both run (timeout check first, then token recording).

Live validation (gpt-4o-mini)

examples/validation exercises the features against a real model and writes a report. Last run: 8/8 passing.

Feature	Result
basic_generation, tool_calling, response_caching, token_metrics	✅
agent_handoff, durable_memoization, llm_timeout, multimodal_tool_output	✅

Run: OPENAI_API_KEY=... just validate. Sample outputs are committed under examples/validation/results/.

Benchmarks (mock-based, deterministic)

agent/bench_test.go + agent/provider/cache/bench_test.go — agent loop ~4.9µs/op, durable-noop overhead ~negligible, cache hit ~0.5µs. Regenerate with just bench-report.

Relationship to the wave PRs

Each wave is also a standalone, now-mergeable PR (#41 #39 #40 #43 #42), all rebased onto main and green. Merge those individually or merge this integration branch — either way the conflicts are pre-resolved here.

… KG/RAG tools

Wave 3 isolation & security hardening.

(a) Path-traversal confinement in tools/research:
- Add resolveWithinRoot helper (safepath.go): resolves the requested path to an absolute path under a configured root and rejects ../ traversal, absolute paths outside root, and symlinks whose real target escapes root.
- read_file and file_search now confine all reads to the root.
- Table-driven tests cover ../../etc/passwd, absolute escapes, symlink escape.

(b) HITL-gate mutating KG/RAG agent tools:
- rag/tool, knowledge/tool, tools/research NewTools now wrap mutating tools (rag_update, rag_delete, kg_ingest, store_knowledge) in a human_approval Marker by default so the agent loop pauses for approval.
- Add ReadOnly() functional option to omit mutating tools entirely.
- Tests assert mutating tools carry the marker and are absent in read-only mode.

Backward compatible: NewTools signatures gain only variadic options; tool name ordering preserved.

…ation

RAG: add a WithRecency(halfLife, weight) SearchOption that blends an exponential time-decay factor, exp(-ln2*age/halfLife), into each hit's fused score after RRF. Opt-in (a non-positive half-life is a no-op). SearchHit gains a Timestamp populated from the document's UpdatedAt (falling back to CreatedAt) in memstore and pgstore.

KG: wire the previously unused Store.InvalidateRelation into the engine ingestion path. A new relation for an existing (source, target, type) triple now supersedes, by rule, the active prior relation(s) of the same type, setting their InvalidAt to the new relation's ValidAt.

Wave 1 correctness floor.

(a) retry.Provider now retries when the stream emits an ErrorDelta BEFORE any content delta, not just when ChatStream returns a synchronous error. Streaming adapters surface transient failures (529 overload, mid-handshake timeouts) as a channel-delivered ErrorDelta; the decorator buffers leading metadata deltas, classifies the error via the existing transient/ShouldRetry path, and re-invokes with backoff. Once content has streamed, the error is surfaced (never retry a partially consumed turn).

(b) The agent loop now calls Metrics.RecordTokenUsage once per completed LLM call with the merged prompt/completion tokens (skipped on cache hit or when no usage was reported). agent/otel collapses the three duplicate gen_ai.client.operation.duration histograms into one instrument keyed by gen_ai.operation.name.
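The retry rule in (a) can be illustrated with a simplified sketch. The delta struct, its string kinds, fromSlice, errOverloaded, and the streamWithRetry signature are all hypothetical stand-ins; the real decorator works on the SDK's own delta types and its transient/ShouldRetry classification.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// delta is a hypothetical stand-in for the SDK's stream delta types.
type delta struct {
	kind string // "meta", "content", or "error"
	text string
	err  error
}

// errOverloaded is a hypothetical transient failure (think HTTP 529).
var errOverloaded = errors.New("overloaded")

// fromSlice builds a closed, pre-filled stream for the demo.
func fromSlice(ds ...delta) <-chan delta {
	ch := make(chan delta, len(ds))
	for _, d := range ds {
		ch <- d
	}
	close(ch)
	return ch
}

// streamWithRetry re-invokes call when an error delta arrives before any
// content delta, and passes errors through once content has been forwarded.
func streamWithRetry(call func() <-chan delta, maxAttempts int,
	backoff time.Duration, retryable func(error) bool) <-chan delta {
	out := make(chan delta)
	go func() {
		defer close(out)
		for attempt := 1; ; attempt++ {
			var buffered []delta // leading metadata, held until content arrives
			committed, retry := false, false
			for d := range call() {
				switch {
				case retry:
					// Drain the failed attempt so its producer can finish.
				case committed:
					out <- d
				case d.kind == "meta":
					buffered = append(buffered, d)
				case d.kind == "error" && attempt < maxAttempts && retryable(d.err):
					retry = true // nothing was forwarded yet, so a retry is safe
				default:
					// First content delta, or an error we will not retry.
					for _, b := range buffered {
						out <- b
					}
					buffered, committed = nil, true
					out <- d
				}
			}
			if !retry {
				for _, b := range buffered {
					out <- b // metadata-only stream: flush what was held
				}
				return
			}
			time.Sleep(backoff) // fixed delay here; real backoff would likely grow
		}
	}()
	return out
}

func main() {
	calls := 0
	call := func() <-chan delta {
		calls++
		if calls == 1 {
			return fromSlice(delta{kind: "meta"}, delta{kind: "error", err: errOverloaded})
		}
		return fromSlice(delta{kind: "meta"}, delta{kind: "content", text: "hello"})
	}
	always := func(error) bool { return true }
	for d := range streamWithRetry(call, 3, 10*time.Millisecond, always) {
		fmt.Printf("%s %q\n", d.kind, d.text)
	}
	fmt.Println("attempts:", calls)
}
```

Holding back metadata until the first content delta is what keeps the consumer from seeing anything from a failed attempt, so a retry is invisible downstream.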

Wire agent tree persistence behind the existing types.Store seam, testable without Postgres.

- Add AgentConfig.Store + WithStore option (default nil = today's in-memory-only behavior, fully backward compatible).
- runLoop persists each new node (and branch tip) to the Store as it is added, via Store.Tx so the tip never points at an unsaved node. Best-effort: errors are logged, never fatal.
- NewAgent persists the root node + main branch up front when a Store is configured, giving LoadTreeFromStore an anchor before the first Invoke.
- Add LoadTreeFromStore helper (Store.LoadTree + tree.FromStore) for the read/resume path.
- New package agent/store/memstore: in-memory types.Store implementing the full interface (SaveNode/LoadNode/LoadChildren/LoadPath/SaveBranch/LoadBranch/ListBranches/SaveCheckpoint/LoadCheckpoint/LoadTree/Tx) with atomic buffered transactions.

Tests: memstore unit tests (round-trip, children order, path, branches, checkpoints, reachable-subtree LoadTree, Tx commit/rollback); agent multi-turn Invoke -> reconstruct tree from memstore -> assert full message history round-trips; backward-compat (nil Store) and root-on-construction.
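The "atomic buffered transactions" idea can be sketched as below: writes inside a transaction go to a scratch buffer and reach the store only if the callback succeeds, so a branch tip is never committed without its node. The store, tx, and method shapes here are hypothetical simplifications and do not mirror the types.Store interface.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errBoom is a hypothetical failure used to trigger a rollback in the demo.
var errBoom = errors.New("boom")

// store is a toy in-memory store; the real memstore holds richer types.
type store struct {
	mu    sync.Mutex
	nodes map[string]string // node ID -> payload
	tips  map[string]string // branch name -> tip node ID
}

// tx buffers writes until the enclosing Tx call commits them.
type tx struct {
	nodes map[string]string
	tips  map[string]string
}

func newStore() *store {
	return &store{nodes: map[string]string{}, tips: map[string]string{}}
}

func (t *tx) saveNode(id, payload string) { t.nodes[id] = payload }
func (t *tx) saveBranch(name, tip string) { t.tips[name] = tip }

// Tx runs fn against an empty buffer and applies the buffer under one lock
// only when fn returns nil; on error the buffer is simply dropped.
func (s *store) Tx(fn func(*tx) error) error {
	buf := &tx{nodes: map[string]string{}, tips: map[string]string{}}
	if err := fn(buf); err != nil {
		return err // rollback: nothing was written to the store
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, p := range buf.nodes {
		s.nodes[id] = p
	}
	for name, tip := range buf.tips {
		s.tips[name] = tip
	}
	return nil
}

func main() {
	s := newStore()
	// Node and tip are committed together.
	_ = s.Tx(func(t *tx) error {
		t.saveNode("n1", "hello")
		t.saveBranch("main", "n1")
		return nil
	})
	// A failed transaction leaves the tip where it was.
	err := s.Tx(func(t *tx) error {
		t.saveBranch("main", "n2")
		return errBoom
	})
	fmt.Println(s.tips["main"], len(s.nodes), err)
}
```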

…eration signal

Add GA-hardening limits to the agent loop:

- LLMTimeout/ToolTimeout (+ WithLLMTimeout/WithToolTimeout): derive a child context.WithTimeout around the provider call in getAssistantMessage and around each tool step in executeOneTool. A slow provider surfaces a transient ProviderError; a slow tool surfaces a deadline-exceeded tool error (even if the tool ignores ctx and completes late). 0 = no timeout (default).
- MaxParallelTools (+ WithMaxParallelTools): bound the parallel-tool goroutines with a buffered-channel semaphore. 0 = unlimited. Durable-runner sequential path is unchanged.
- ErrMaxIterations signal: emit types.ErrorDelta{Error: ErrMaxIterations} when runLoop breaks on the iteration cap while the last assistant turn still had pending tool calls, so consumers can tell truncated from a clean finish. Not emitted on a natural text-only/empty finish.

Table-driven tests in agent/limits_test.go cover all three plus the disabled/unlimited defaults. Existing tests unchanged.

Add examples/validation: a runnable harness that exercises the agent SDK's features against a real model (gpt-4o-mini) — basic generation, tool calling, response caching (CacheHit), token metrics, agent handoff, durable memoization, LLM timeout, and multimodal tool output. Skips cleanly without OPENAI_API_KEY. Committed sample outputs under examples/validation/results/ (report + bench numbers) so users can see real runs. Adds mock-based benchmarks in agent/bench_test.go and agent/provider/cache/bench_test.go, plus `just validate` and `just bench-report` targets.

urmzd added 6 commits May 31, 2026 16:10

urmzd merged commit 53a6aff into main May 31, 2026
6 checks passed

urmzd deleted the integration/all-waves branch May 31, 2026 21:59
