feat: OpenRouter agentic harness, SWE-bench hybrid benchmarks, repo knowledge loop #63
Merged
…WE-bench findings

Add an OpenRouter adapter that runs the native agentic loop (runAgenticLoop) with PKCE auth, ZDR (data_collection: deny), prompt caching, and optional reasoning-off for mechanical roles. Route worker/reviewer/planner models per cost-efficiency measurements (lightweight worker + frontier escalate).

Harden the agentic harness with fixes for 7 defects discovered by running real SWE-bench instances (none were visible on synthetic benchmarks):

1. Inject the working directory into the prompt: models guessed absolute paths and every file tool call was rejected.
2. Surface stdout/stderr + exit code on bash failures: grep "no match" (exit 1) looked like a fatal error and caused infinite retries.
3. Raise compaction thresholds (24k→60k tokens, keep 16 recent messages): early compaction erased freshly-read files and caused endless re-reads.
4. Final-answer turn: when maxTurns is exhausted, make one last tool-less call so the model still produces a conclusion.
5. No-edit guard (nudgeMaxOnNoEdit): push back when a model tries to finish an edit-required task with analysis only.
6. Protected files (protectedFiles): reject edit/write on verification harness files; implementers were rewriting the test script when tests failed.
7. Configurable bash timeout (bashTimeoutMs) + explicit TIMEOUT message: the 30s default died silently on docker-based test runs and read as a broken environment.

Also: loop-level read cache (dedup repeated reads), edit_file returns the resulting region (no re-read needed), git-diff-based success promotion in the worker (structured output no longer required).
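Fixes 2 and 7 both come down to what text the model sees after a shell command ends. A minimal sketch of that formatting step follows. BashOutcome and formatBashResult are hypothetical names, the message wording is invented for illustration, and only DEFAULT_BASH_TIMEOUT_MS and the 30s figure come from the commit messages on this page.

```typescript
// Hypothetical shape of a finished shell command; not OpenSwarm's actual type.
interface BashOutcome {
  exitCode: number | null; // null when the process was killed
  stdout: string;
  stderr: string;
  timedOut: boolean;
}

// Assumed value, mirroring the 30s default mentioned in the commit message.
const DEFAULT_BASH_TIMEOUT_MS = 30_000;

// Build the tool result text so a non-zero exit is informative instead of opaque.
function formatBashResult(
  outcome: BashOutcome,
  timeoutMs: number = DEFAULT_BASH_TIMEOUT_MS,
): string {
  if (outcome.timedOut) {
    // Say explicitly that this was a time limit, so it is not read as a broken environment.
    return `TIMEOUT: command exceeded ${timeoutMs} ms and was killed.`;
  }
  const parts: string[] = [`exit code: ${outcome.exitCode}`];
  if (outcome.stdout.trim()) parts.push(`stdout:\n${outcome.stdout.trimEnd()}`);
  if (outcome.stderr.trim()) parts.push(`stderr:\n${outcome.stderr.trimEnd()}`);
  if (outcome.exitCode !== 0 && parts.length === 1) {
    // Nothing was printed: hint that exit 1 is not necessarily fatal.
    parts.push("(no output; for tools like grep, exit 1 can simply mean no match)");
  }
  return parts.join("\n");
}
```

Deriving the TIMEOUT text from the same timeoutMs value that the exec call would receive keeps the two from drifting apart.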
…oss tasks

The write infrastructure (memoryCore LanceDB with a repo field) existed but nothing ever read it back: task outcomes were stored as one generic line and never reached a worker prompt. Close the loop:

- src/memory/repoKnowledge.ts: recordTaskOutcome() stores successes as system_pattern (files changed + approach + iterations) and review rejections as constraint (pitfalls), scoped to the project path. recallRepoKnowledge() retrieves the top task-relevant memories. skipDistillation keeps the intended types (distillation was downgrading them to belief, which type-filtered recall then missed).
- pairPipeline.collectWorkerContext() recalls repo knowledge into WorkerContext.repoMemories; worker prompts render it as a "Repository Knowledge" section with pattern/pitfall tags (en + ko).
- autonomousRunner completed/rejected handlers record outcomes instead of the old one-line strategy memo.

All memory paths are non-blocking: the pipeline runs even if the memory DB is unavailable.
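The outcome-to-memory mapping and the prompt rendering described above can be sketched roughly as follows. Only the memory type names (system_pattern, constraint), the "Repository Knowledge" title, and the pattern/pitfall tags come from the commit message. RepoMemory, TaskOutcome, toRepoMemory, renderRepoKnowledge, and the exact text layout are assumptions for illustration, not the code in src/memory/repoKnowledge.ts.

```typescript
// Illustrative types; the real memoryCore schema may differ.
type MemoryType = "system_pattern" | "constraint";

interface RepoMemory {
  repo: string; // project path the memory is scoped to
  type: MemoryType;
  text: string;
}

type TaskOutcome =
  | { kind: "completed"; filesChanged: string[]; approach: string; iterations: number }
  | { kind: "rejected"; pitfall: string };

// Successes become reusable patterns; review rejections become constraints (pitfalls).
function toRepoMemory(repo: string, outcome: TaskOutcome): RepoMemory {
  if (outcome.kind === "completed") {
    const text =
      `files: ${outcome.filesChanged.join(", ")}; ` +
      `approach: ${outcome.approach}; iterations: ${outcome.iterations}`;
    return { repo, type: "system_pattern", text };
  }
  return { repo, type: "constraint", text: outcome.pitfall };
}

// Render recalled memories as a prompt section; an empty recall adds nothing.
function renderRepoKnowledge(memories: RepoMemory[]): string {
  if (memories.length === 0) return "";
  const lines = memories.map((m) => {
    const tag = m.type === "system_pattern" ? "pattern" : "pitfall";
    return `- [${tag}] ${m.text}`;
  });
  return ["Repository Knowledge", ...lines].join("\n");
}
```

Returning an empty string for an empty recall matches the non-blocking intent: a worker prompt built without memories is simply the prompt it was before.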
…WE-bench harness

- benchmarks/tasks/codingTasks.ts + modelSelect.ts: 12 synthetic tasks (L0-L5) with deterministic grading (regex / test run / tsc); produces a model x level pass-rate table and a cost-efficiency Pareto ranking used to set the default worker/reviewer/planner routing.
- benchmarks/sweBench.ts: L6 = real GitHub issues (SWE-bench Lite). Solve with the OpenSwarm harness (openrouter adapter), grade with the official swebench harness in Docker. Supports hybrid mode: SWE_DIAG_MODEL runs a frontier read-only diagnosis stage, SWE_MODEL implements with the verification loop; SWE_DIAG_FILE reuses a saved diagnosis.
- benchmarks/RUBRIC.md: level definitions, recommended models per level, L6 grading procedure and pitfalls, and measured results.

Headline result: hybrid (gpt-5 diagnosis + lightweight implementer) resolved 3/3 attempted SWE-bench Lite instances (pylint 7080/5859/7993) where every single-model lightweight run had failed, including a re-diagnosis escalate loop on 7993 where the frontier found the bug in its own first fix plan from the failing patch + test output. Evidence under benchmarks/results/.
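A cost-efficiency Pareto ranking of the kind mentioned for modelSelect.ts can be computed along these lines. This is a generic sketch of Pareto-front selection over pass rate and cost, not the repository's implementation; ModelScore, dominates, and paretoFront are hypothetical names, and the choice of axes is an assumption.

```typescript
// Hypothetical per-model benchmark summary.
interface ModelScore {
  model: string;
  passRate: number; // fraction of tasks passed, 0..1; higher is better
  costUsd: number;  // cost of the benchmark run; lower is better
}

// a dominates b when a is at least as good on both axes and strictly better on one.
function dominates(a: ModelScore, b: ModelScore): boolean {
  const noWorse = a.passRate >= b.passRate && a.costUsd <= b.costUsd;
  const strictlyBetter = a.passRate > b.passRate || a.costUsd < b.costUsd;
  return noWorse && strictlyBetter;
}

// Keep only models that no other model dominates, cheapest first.
function paretoFront(scores: ModelScore[]): ModelScore[] {
  return scores
    .filter((s) => !scores.some((other) => dominates(other, s)))
    .sort((x, y) => x.costUsd - y.costUsd);
}
```

Under this reading, a routing default would pick the cheapest front member that clears the pass rate a role needs, escalating to a costlier member only when it does not.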
…ample configs

- repoMetadata.ts (+tests): per-repo openswarm.json for explicit Linear project mapping, consumed by projectMapper (+tests).
- Web dashboard/chat backend/TUI updates and event hub/service plumbing.
- .env.example / config.example.yaml refreshed for the OpenRouter adapter and model routing defaults.
- .gitignore: exclude local experiments (testing/), SWE-bench evaluation logs, and root grading report artifacts.
loadConfig now disables Discord/Linear integration (deletes the block) when credentials are missing instead of rejecting the whole config. Update the two tests that still expected a validation error.
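The behavior change can be pictured as a pre-validation step that drops an integration block when its credential is missing. The sketch below is an assumption about the shape of that logic: IntegrationConfig, dropIncompleteIntegrations, and the credential field names (token, apiKey) are hypothetical and not taken from loadConfig.

```typescript
// Hypothetical config shape; the credential field names are assumptions.
interface IntegrationConfig {
  discord?: { token?: string; channelId?: string };
  linear?: { apiKey?: string; teamId?: string };
}

// Return a copy with integration blocks removed when their credential is absent,
// so later validation sees a config with the integration simply turned off.
function dropIncompleteIntegrations(raw: IntegrationConfig): IntegrationConfig {
  const cfg: IntegrationConfig = { ...raw };
  if (!cfg.discord?.token) delete cfg.discord;
  if (!cfg.linear?.apiKey) delete cfg.linear;
  return cfg;
}
```

A test that previously expected a validation error would, under this behavior, assert that the block is absent from the loaded config.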
a596bd7 to f36c440
- agenticLoop: count only successful edit/write calls toward the no-edit guard. A model whose edits all fail (old_string not found, protected file) was counted as having edited and slipped past the guard.
- gpt/local adapters: forward nudgeMaxOnNoEdit / protectedFiles / bashTimeoutMs to the agentic loop (previously silently dropped, so the guards only worked on the openrouter adapter).
- tools edit_file: locate the result snippet via old_string's position in the original content (guaranteed unique). indexOf(new_string) on the updated text could match an earlier occurrence and show the wrong region.
- tools bash: extract DEFAULT_BASH_TIMEOUT_MS so the exec call and the TIMEOUT message cannot drift.
- repoKnowledge: normalize the repo key via realpath before write and recall. Trailing slashes or symlinked paths to the same repo would split knowledge across keys that never match.
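The edit_file fix is the easiest of these to show in isolation. The sketch below locates the result snippet from the offset of old_string in the original content instead of searching the updated text for new_string. applyEdit, EditResult, and the contextLines parameter are hypothetical, and the uniqueness check is written out here on the assumption that the real tool enforces it before editing.

```typescript
interface EditResult {
  updated: string; // full file content after the edit
  snippet: string; // edited region plus a few lines of context
}

function applyEdit(
  content: string,
  oldString: string,
  newString: string,
  contextLines: number = 2,
): EditResult {
  const at = content.indexOf(oldString);
  if (at === -1) throw new Error("old_string not found");
  if (content.indexOf(oldString, at + 1) !== -1) throw new Error("old_string is not unique");

  const updated = content.slice(0, at) + newString + content.slice(at + oldString.length);

  // Text before the edit is unchanged, so the start line in the original
  // is also the start line in the updated content.
  const startLine = content.slice(0, at).split("\n").length - 1;
  const endLine = startLine + newString.split("\n").length - 1;

  const lines = updated.split("\n");
  const from = Math.max(0, startLine - contextLines);
  const to = Math.min(lines.length, endLine + contextLines + 1);
  return { updated, snippet: lines.slice(from, to).join("\n") };
}
```

When new_string already appears earlier in the file, a search on the updated text would point at that earlier line, while the offset-based lookup still returns the region that was actually changed.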
This was referenced Jun 10, 2026
Summary

Four feature sets accumulated on this branch (commits are split accordingly):

1. LM Studio adapter (621c09e)
Local model serving via LM Studio's OpenAI-compatible API with auto model selection.

2. OpenRouter agentic adapter + harness hardening (a8d3172)
openrouter adapter running the native agentic loop with PKCE auth, ZDR (data_collection: deny), Anthropic prompt caching, and reasoning-off for mechanical roles.

3. Repo-scoped knowledge loop (6c0a0e0)
Workers now learn a repository across tasks: task outcomes are stored as repo-scoped memories (success → system_pattern, review rejection → constraint) and recalled by relevance into the next worker prompt as a "Repository Knowledge" section. Closes the read-side gap of the existing LanceDB memory core.

4. L0–L6 benchmark suite (60ac35f)
Headline result: hybrid mode (frontier read-only diagnosis + lightweight implementer with verification loop) resolved 3/3 attempted SWE-bench Lite instances (pylint 7080/5859/7993) where every single-lightweight-model run had failed, including a re-diagnosis escalate loop where gpt-5 found the bug in its own first fix plan from the failing patch + test output. Full evidence in benchmarks/results/, methodology in benchmarks/RUBRIC.md.

Test plan

tsc --noEmit clean; 688/691 vitest tests pass.
issueStore.test.ts (event ordering on equal timestamps; passes on re-run, untouched since April).

🤖 Generated with Claude Code