feat: OpenRouter agentic harness, SWE-bench hybrid benchmarks, repo knowledge loop by unohee · Pull Request #63 · unohee/OpenSwarm

unohee · 2026-06-10T15:28:41Z

Summary

Four feature sets accumulated on this branch (commits are split accordingly):

1. LM Studio adapter (`621c09e`)

Local model serving via LM Studio's OpenAI-compatible API with auto model selection.

2. OpenRouter agentic adapter + harness hardening (`a8d3172`)

New openrouter adapter running the native agentic loop with PKCE auth, ZDR (data_collection: deny), Anthropic prompt caching, and reasoning-off for mechanical roles.
7 harness defects fixed, all discovered by running real SWE-bench instances (none were visible on synthetic benchmarks): cwd injection, bash exit-code surfacing, compaction thresholds (fixes infinite re-read loops), final-answer turn, no-edit guard, protected verification files, configurable bash timeout with explicit TIMEOUT messages.
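
To illustrate the exit-code and timeout fixes, here is a minimal sketch of how a bash tool result could be rendered for the model. The `BashOutcome` shape and the `formatBashResult` helper are hypothetical names for illustration, not the harness's actual API; only the 30s default and the explicit TIMEOUT wording come from this PR.

```typescript
// Hypothetical shape of a finished shell command (not the harness's real type).
interface BashOutcome {
  stdout: string;
  stderr: string;
  exitCode: number | null; // null when the process was killed
  timedOut: boolean;
}

// 30s default, as described in this PR.
const DEFAULT_BASH_TIMEOUT_MS = 30000;

// Render a result so that a non-zero exit is visible but not mistaken for a
// broken environment, and a timeout is stated explicitly.
function formatBashResult(
  outcome: BashOutcome,
  timeoutMs: number = DEFAULT_BASH_TIMEOUT_MS,
): string {
  const parts: string[] = [];
  if (outcome.timedOut) {
    parts.push(`TIMEOUT: command exceeded ${timeoutMs} ms and was killed`);
  } else {
    parts.push(`exit code: ${outcome.exitCode}`);
  }
  if (outcome.stdout.trim() !== "") parts.push(`stdout:\n${outcome.stdout}`);
  if (outcome.stderr.trim() !== "") parts.push(`stderr:\n${outcome.stderr}`);
  return parts.join("\n");
}
```

With this shape, a `grep` that finds nothing reports `exit code: 1` with no output, which the model can read as "no match" rather than retrying.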

3. Repo-scoped knowledge loop (`6c0a0e0`)

Workers now learn a repository across tasks: task outcomes are stored as repo-scoped memories (success → system_pattern, review rejection → constraint) and recalled by relevance into the next worker prompt as a "Repository Knowledge" section. Closes the read-side gap of the existing LanceDB memory core.
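
The loop described above can be sketched with an in-memory store. This is an assumption-laden stand-in: the real implementation uses the LanceDB memory core, whereas this sketch ranks by simple word overlap, and `InMemoryRepoKnowledge`, `TaskOutcome`, and `tokenize` are hypothetical names.

```typescript
type MemoryType = "system_pattern" | "constraint";

interface RepoMemory {
  repo: string;
  type: MemoryType;
  text: string;
}

// Hypothetical outcome shape: success or review rejection.
type TaskOutcome =
  | { kind: "success"; summary: string }
  | { kind: "rejected"; reason: string };

function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9_]+/).filter(Boolean));
}

class InMemoryRepoKnowledge {
  private memories: RepoMemory[] = [];

  // success -> system_pattern, review rejection -> constraint
  record(repo: string, outcome: TaskOutcome): void {
    const memory: RepoMemory =
      outcome.kind === "success"
        ? { repo, type: "system_pattern", text: outcome.summary }
        : { repo, type: "constraint", text: outcome.reason };
    this.memories.push(memory);
  }

  // Return the memories of this repo that best match the task text.
  // Word overlap stands in for the real relevance search.
  recall(repo: string, task: string, limit = 3): RepoMemory[] {
    const taskWords = tokenize(task);
    return this.memories
      .filter((m) => m.repo === repo)
      .map((m) => {
        let score = 0;
        tokenize(m.text).forEach((w) => {
          if (taskWords.has(w)) score++;
        });
        return { m, score };
      })
      .filter((x) => x.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, limit)
      .map((x) => x.m);
  }
}
```

Filtering on `repo` before scoring is what gives repo isolation: knowledge recorded for one project never appears in another project's prompt.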

4. L0–L6 benchmark suite (`60ac35f`)

Synthetic L0–L5 tasks with deterministic grading + model×level pass-rate table → data-driven model routing (lightweight worker, frontier planner/reviewer).
L6 = real SWE-bench Lite instances, solved by the OpenSwarm harness and graded by the official swebench harness.
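
A minimal sketch of how a model×level pass-rate table could drive routing. The `RunResult` shape, the `pickWorker` rule (cheapest model meeting a threshold), and all names are assumptions for illustration, not the benchmark suite's actual code.

```typescript
interface RunResult {
  model: string;
  level: number; // 0..6 for L0..L6
  passed: boolean;
}

type PassRates = Record<string, Record<number, number>>;

// Aggregate individual graded runs into pass rates per model and level.
function passRateTable(results: RunResult[]): PassRates {
  const tally: Record<string, Record<number, { pass: number; total: number }>> = {};
  for (const r of results) {
    const byLevel = (tally[r.model] ??= {});
    const cell = (byLevel[r.level] ??= { pass: 0, total: 0 });
    cell.total += 1;
    if (r.passed) cell.pass += 1;
  }
  const table: PassRates = {};
  for (const model of Object.keys(tally)) {
    table[model] = {};
    for (const level of Object.keys(tally[model])) {
      const cell = tally[model][Number(level)];
      table[model][Number(level)] = cell.pass / cell.total;
    }
  }
  return table;
}

// Hypothetical routing rule: the first model in a cheapest-first list whose
// pass rate at the given level meets the threshold.
function pickWorker(
  table: PassRates,
  level: number,
  minPassRate: number,
  cheapestFirst: string[],
): string | undefined {
  return cheapestFirst.find((m) => (table[m]?.[level] ?? 0) >= minPassRate);
}
```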

Headline result: hybrid mode (frontier read-only diagnosis + lightweight implementer with verification loop) resolved 3/3 attempted SWE-bench Lite instances (pylint 7080/5859/7993) where every single-lightweight-model run had failed — including a re-diagnosis escalate loop where gpt-5 found the bug in its own first fix plan from the failing patch + test output. Full evidence in benchmarks/results/, methodology in benchmarks/RUBRIC.md.

Test plan

tsc --noEmit clean; 688/691 vitest tests pass.
3 remaining failures are a pre-existing flaky test in issueStore.test.ts (event ordering on equal timestamps — passes on re-run, untouched since April).
New coverage: agenticLoop guards, tools protection/timeout, openrouter adapter, prompt rendering of repo memories, repoMetadata/projectMapper.
End-to-end: repo knowledge round-trip verified against a real LanceDB (record → relevance recall → repo isolation).

🤖 Generated with Claude Code

…WE-bench findings

Add an OpenRouter adapter that runs the native agentic loop (runAgenticLoop) with PKCE auth, ZDR (data_collection: deny), prompt caching, and optional reasoning-off for mechanical roles. Route worker/reviewer/planner models per cost-efficiency measurements (lightweight worker + frontier escalate).

Harden the agentic harness with fixes for 7 defects discovered by running real SWE-bench instances (none were visible on synthetic benchmarks):

1. Inject the working directory into the prompt: models guessed absolute paths and every file tool call was rejected.
2. Surface stdout/stderr + exit code on bash failures: grep "no match" (exit 1) looked like a fatal error and caused infinite retries.
3. Raise compaction thresholds (24k→60k tokens, keep 16 recent messages): early compaction erased freshly-read files and caused endless re-reads.
4. Final-answer turn: when maxTurns is exhausted, make one last tool-less call so the model still produces a conclusion.
5. No-edit guard (nudgeMaxOnNoEdit): push back when a model tries to finish an edit-required task with analysis only.
6. Protected files (protectedFiles): reject edit/write on verification harness files; implementers were rewriting the test script when tests failed.
7. Configurable bash timeout (bashTimeoutMs) + explicit TIMEOUT message: the 30s default died silently on docker-based test runs and read as a broken environment.

Also: loop-level read cache (dedup repeated reads), edit_file returns the resulting region (no re-read needed), git-diff-based success promotion in the worker (structured output no longer required).
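
Items 5 and 6 can be sketched together. The types, the `write_file`/`read_file` tool names, and the reading of `nudgeMaxOnNoEdit` as a cap on the number of nudges are assumptions for illustration; only `edit_file`, `protectedFiles`, and `nudgeMaxOnNoEdit` are names taken from this PR.

```typescript
// Hypothetical tool-call and result shapes.
type ToolCall = { tool: "edit_file" | "write_file" | "read_file"; path: string };
type ToolResult = { ok: boolean; message: string };

// Reject mutations of verification harness files; allow everything else.
function runFileTool(call: ToolCall, protectedFiles: string[]): ToolResult {
  const isMutation = call.tool !== "read_file";
  if (isMutation && protectedFiles.indexOf(call.path) !== -1) {
    return { ok: false, message: `REJECTED: ${call.path} is a protected verification file` };
  }
  return { ok: true, message: `${call.tool} ok: ${call.path}` };
}

// Decide whether to push back when the model tries to finish.
// Only edits that actually succeeded count as "having edited".
function shouldNudge(
  history: Array<{ call: ToolCall; result: ToolResult }>,
  nudgesSent: number,
  nudgeMaxOnNoEdit: number,
): boolean {
  const successfulEdits = history.filter(
    (h) => h.call.tool !== "read_file" && h.result.ok,
  ).length;
  return successfulEdits === 0 && nudgesSent < nudgeMaxOnNoEdit;
}
```

Capping the nudges keeps the guard from looping forever on a model that never edits; once the cap is reached the loop can end and report the missing edit.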

…oss tasks

The write infrastructure (memoryCore LanceDB with a repo field) existed but nothing ever read it back: task outcomes were stored as one generic line and never reached a worker prompt. Close the loop:

- src/memory/repoKnowledge.ts: recordTaskOutcome() stores successes as system_pattern (files changed + approach + iterations) and review rejections as constraint (pitfalls), scoped to the project path. recallRepoKnowledge() retrieves the top task-relevant memories. skipDistillation keeps the intended types (distillation was downgrading them to belief, which type-filtered recall then missed).
- pairPipeline.collectWorkerContext() recalls repo knowledge into WorkerContext.repoMemories; worker prompts render it as a "Repository Knowledge" section with pattern/pitfall tags (en + ko).
- autonomousRunner completed/rejected handlers record outcomes instead of the old one-line strategy memo.

All memory paths are non-blocking: the pipeline runs even if the memory DB is unavailable.
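
As a rough illustration of the prompt-side rendering, the sketch below turns recalled memories into a "Repository Knowledge" section with pattern/pitfall tags. The heading markup, the bullet format, and the `renderRepoKnowledge` name are assumptions; the actual template may differ.

```typescript
interface RecalledMemory {
  type: "system_pattern" | "constraint";
  text: string;
}

// Render recalled memories as a prompt section; return an empty string when
// there is nothing to show so the prompt stays unchanged.
function renderRepoKnowledge(memories: RecalledMemory[]): string {
  if (memories.length === 0) return "";
  const lines = memories.map((m) => {
    const tag = m.type === "system_pattern" ? "pattern" : "pitfall";
    return `- [${tag}] ${m.text}`;
  });
  return ["## Repository Knowledge", ...lines].join("\n");
}
```

Returning an empty string for an empty recall fits the non-blocking design: a missing or unavailable memory DB simply yields a prompt without the section.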

…WE-bench harness

- benchmarks/tasks/codingTasks.ts + modelSelect.ts: 12 synthetic tasks (L0-L5) with deterministic grading (regex / test run / tsc); produces a model x level pass-rate table and a cost-efficiency Pareto ranking used to set the default worker/reviewer/planner routing.
- benchmarks/sweBench.ts: L6 = real GitHub issues (SWE-bench Lite). Solve with the OpenSwarm harness (openrouter adapter), grade with the official swebench harness in Docker. Supports hybrid mode: SWE_DIAG_MODEL runs a frontier read-only diagnosis stage, SWE_MODEL implements with the verification loop; SWE_DIAG_FILE reuses a saved diagnosis.
- benchmarks/RUBRIC.md: level definitions, recommended models per level, L6 grading procedure and pitfalls, and measured results.

Headline result: hybrid (gpt-5 diagnosis + lightweight implementer) resolved 3/3 attempted SWE-bench Lite instances (pylint 7080/5859/7993) where every single-model lightweight run had failed, including a re-diagnosis escalate loop on 7993 where the frontier found the bug in its own first fix plan from the failing patch + test output. Evidence under benchmarks/results/.
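
The cost-efficiency Pareto ranking mentioned above could look roughly like this. It is a generic Pareto-front filter over pass rate and cost, written as an assumption about the approach; `ModelScore` and `paretoFront` are hypothetical names and the real modelSelect.ts may rank differently.

```typescript
interface ModelScore {
  model: string;
  passRate: number; // higher is better
  costUsd: number; // lower is better
}

// Keep the models that no other model dominates, where "dominates" means
// at least as good on both axes and strictly better on one.
function paretoFront(scores: ModelScore[]): ModelScore[] {
  return scores.filter(
    (a) =>
      !scores.some(
        (b) =>
          b !== a &&
          b.passRate >= a.passRate &&
          b.costUsd <= a.costUsd &&
          (b.passRate > a.passRate || b.costUsd < a.costUsd),
      ),
  );
}
```

A model that is both more expensive and less accurate than another drops out; what remains is the set of reasonable trade-offs from which worker, reviewer, and planner defaults can be chosen.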

…ample configs

- repoMetadata.ts (+tests): per-repo openswarm.json for explicit Linear project mapping, consumed by projectMapper (+tests).
- Web dashboard/chat backend/TUI updates and event hub/service plumbing.
- .env.example / config.example.yaml refreshed for the OpenRouter adapter and model routing defaults.
- .gitignore: exclude local experiments (testing/), SWE-bench evaluation logs, and root grading report artifacts.

- agenticLoop: count only successful edit/write calls toward the no-edit guard. A model whose edits all fail (old_string not found, protected file) was counted as having edited and slipped past the guard.
- gpt/local adapters: forward nudgeMaxOnNoEdit / protectedFiles / bashTimeoutMs to the agentic loop (previously silently dropped, so the guards only worked on the openrouter adapter).
- tools edit_file: locate the result snippet via old_string's position in the original content (guaranteed unique). indexOf(new_string) on the updated text could match an earlier occurrence and show the wrong region.
- tools bash: extract DEFAULT_BASH_TIMEOUT_MS so the exec call and the TIMEOUT message cannot drift.
- repoKnowledge: normalize the repo key via realpath before write and recall. Trailing slashes or symlinked paths to the same repo would split knowledge across keys that never match.
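
The edit_file snippet fix is easiest to see in code. The sketch below is a simplified illustration under assumptions (the `applyEdit` name, the character-based context window, and the error messages are hypothetical), showing why the offset of old_string in the original content is the reliable anchor.

```typescript
// Replace a unique old_string and return the updated text plus the region
// around the replacement.
function applyEdit(
  content: string,
  oldString: string,
  newString: string,
  contextChars = 40,
): { updated: string; snippet: string } {
  const start = content.indexOf(oldString);
  if (start === -1) throw new Error("old_string not found");
  if (content.indexOf(oldString, start + 1) !== -1) {
    throw new Error("old_string is not unique");
  }
  const updated =
    content.slice(0, start) + newString + content.slice(start + oldString.length);
  // The replacement begins at the same offset where old_string began, so the
  // snippet is located from `start`, not by searching for new_string.
  const from = Math.max(0, start - contextChars);
  const to = Math.min(updated.length, start + newString.length + contextChars);
  return { updated, snippet: updated.slice(from, to) };
}
```

If new_string already appears earlier in the file, searching the updated text for it finds that earlier copy, whereas the offset of the unique old_string always points at the text that was just replaced.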

unohee added 5 commits June 11, 2026 00:40

test(config): align Zod validation tests with standalone mode

f36c440

loadConfig now disables Discord/Linear integration (deletes the block) when credentials are missing instead of rejecting the whole config. Update the two tests that still expected a validation error.
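
A minimal sketch of the standalone-mode behavior described in this commit, assuming hypothetical field names (`discord.token`, `linear.apiKey`, `adapter`); the real config schema is validated with Zod and may be shaped differently.

```typescript
// Hypothetical raw config shape; field names are illustrative only.
interface RawConfig {
  adapter?: string;
  discord?: { token?: string };
  linear?: { apiKey?: string };
}

// Drop integration blocks that lack credentials instead of failing validation,
// so the rest of the config still loads.
function disableUnconfiguredIntegrations(raw: RawConfig): RawConfig {
  const config: RawConfig = { ...raw };
  if (!config.discord?.token) delete config.discord;
  if (!config.linear?.apiKey) delete config.linear;
  return config;
}
```

Tests that previously expected a validation error for missing credentials would instead assert that the corresponding block is absent from the loaded config.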

unohee force-pushed the feat/lmstudio-adapter branch from a596bd7 to f36c440 June 10, 2026 15:42

unohee added 2 commits June 11, 2026 00:54

docs(readme): document openrouter adapter, repo knowledge loop, L0-L6 benchmarks

4672e12

unohee merged commit e9c3eba into main Jun 10, 2026
9 checks passed

unohee deleted the feat/lmstudio-adapter branch June 10, 2026 16:01

This was referenced Jun 10, 2026

docs(readme): surface OpenRouter, SWE-bench results, and repo knowledge in the hero #64

Merged

chore(release): v0.5.0 #66

Merged

unohee merged 7 commits into main from feat/lmstudio-adapter
