feat: OpenRouter agentic harness, SWE-bench hybrid benchmarks, repo knowledge loop #63

Merged
unohee merged 7 commits into main from feat/lmstudio-adapter
Jun 10, 2026
@unohee unohee commented Jun 10, 2026

Owner

Summary

Four feature sets accumulated on this branch (commits are split accordingly):

1. LM Studio adapter (621c09e)

Local model serving via LM Studio's OpenAI-compatible API with auto model selection.

2. OpenRouter agentic adapter + harness hardening (a8d3172)

  • New openrouter adapter running the native agentic loop with PKCE auth, ZDR (data_collection: deny), Anthropic prompt caching, and reasoning-off for mechanical roles.
  • 7 harness defects fixed, all discovered by running real SWE-bench instances (none were visible on synthetic benchmarks): cwd injection, bash exit-code surfacing, compaction thresholds (fixes infinite re-read loops), final-answer turn, no-edit guard, protected verification files, configurable bash timeout with explicit TIMEOUT messages.

3. Repo-scoped knowledge loop (6c0a0e0)

Workers now learn a repository across tasks: task outcomes are stored as repo-scoped memories (success → system_pattern, review rejection → constraint) and recalled by relevance into the next worker prompt as a "Repository Knowledge" section. Closes the read-side gap of the existing LanceDB memory core.

4. L0–L6 benchmark suite (60ac35f)

  • Synthetic L0–L5 tasks with deterministic grading + model×level pass-rate table → data-driven model routing (lightweight worker, frontier planner/reviewer).
  • L6 = real SWE-bench Lite instances, solved by the OpenSwarm harness and graded by the official swebench harness.

Headline result: hybrid mode (frontier read-only diagnosis + lightweight implementer with verification loop) resolved 3/3 attempted SWE-bench Lite instances (pylint 7080/5859/7993) where every single-lightweight-model run had failed — including a re-diagnosis escalate loop where gpt-5 found the bug in its own first fix plan from the failing patch + test output. Full evidence in benchmarks/results/, methodology in benchmarks/RUBRIC.md.

Test plan

  • tsc --noEmit clean; 688/691 vitest tests pass.
  • The 3 remaining failures come from a pre-existing flaky test in issueStore.test.ts (event ordering on equal timestamps; it passes on re-run and has been untouched since April).
  • New coverage: agenticLoop guards, tools protection/timeout, openrouter adapter, prompt rendering of repo memories, repoMetadata/projectMapper.
  • End-to-end: repo knowledge round-trip verified against a real LanceDB (record → relevance recall → repo isolation).

🤖 Generated with Claude Code

unohee added 5 commits June 11, 2026 00:40
…WE-bench findings

Add an OpenRouter adapter that runs the native agentic loop (runAgenticLoop)
with PKCE auth, ZDR (data_collection: deny), prompt caching, and optional
reasoning-off for mechanical roles. Route worker/reviewer/planner models
per cost-efficiency measurements (lightweight worker + frontier escalate).

Harden the agentic harness with fixes for 7 defects discovered by running
real SWE-bench instances (none were visible on synthetic benchmarks):

1. Inject the working directory into the prompt — models guessed absolute
   paths and every file tool call was rejected.
2. Surface stdout/stderr + exit code on bash failures — grep "no match"
   (exit 1) looked like a fatal error and caused infinite retries.
3. Raise compaction thresholds (24k→60k tokens, keep 16 recent messages) —
   early compaction erased freshly-read files and caused endless re-reads.
4. Final-answer turn — when maxTurns is exhausted, make one last tool-less
   call so the model still produces a conclusion.
5. No-edit guard (nudgeMaxOnNoEdit) — push back when a model tries to
   finish an edit-required task with analysis only.
6. Protected files (protectedFiles) — reject edit/write on verification
   harness files; implementers were rewriting the test script when tests
   failed.
7. Configurable bash timeout (bashTimeoutMs) + explicit TIMEOUT message —
   the 30s default died silently on docker-based test runs and read as a
   broken environment.
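
Items 2 and 7 both concern how a bash tool result is rendered back to the model. A minimal sketch of that rendering follows. The `BashOutcome` shape and the `formatBashResult` helper are hypothetical names for illustration, not the repo's code; only `bashTimeoutMs`, `DEFAULT_BASH_TIMEOUT_MS`, and the 30s default come from the commit messages in this PR.

```typescript
// Hypothetical shape of a finished bash tool call (not the repo's actual type).
interface BashOutcome {
  stdout: string;
  stderr: string;
  exitCode: number | null; // null when the process was killed
  timedOut: boolean;
}

// 30s default as stated in the commit message.
const DEFAULT_BASH_TIMEOUT_MS = 30000;

// Render a bash outcome as text for the model (illustrative helper).
function formatBashResult(
  outcome: BashOutcome,
  bashTimeoutMs: number = DEFAULT_BASH_TIMEOUT_MS,
): string {
  const streams = `stdout:\n${outcome.stdout}\nstderr:\n${outcome.stderr}`;
  if (outcome.timedOut) {
    // State the limit explicitly so a slow test run does not read as a broken environment.
    return `TIMEOUT: command exceeded ${bashTimeoutMs}ms and was killed.\n${streams}`;
  }
  if (outcome.exitCode !== 0) {
    // A non-zero exit is reported as data: grep's "no match" (exit 1) is not a fatal error.
    return `exit code: ${outcome.exitCode}\n${streams}`;
  }
  return outcome.stdout;
}

// Example: a grep with no matches.
console.log(formatBashResult({ stdout: "", stderr: "", exitCode: 1, timedOut: false }));
```

Sharing one constant between the exec call and the message is what keeps the reported limit from drifting away from the enforced one.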

Also: loop-level read cache (dedup repeated reads), edit_file returns the
resulting region (no re-read needed), git-diff-based success promotion in
the worker (structured output no longer required).
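
One way for `edit_file` to return the resulting region without a re-read is sketched below. It follows the approach described in a later commit of this PR (locate the region via `old_string`'s position in the original content). The `applyEdit` name, the `contextLines` parameter, and the return shape are assumptions for illustration.

```typescript
interface EditResult {
  updated: string; // full file content after the edit
  snippet: string; // edited lines plus surrounding context
}

// Replace a unique occurrence of oldString and return the edited region.
function applyEdit(
  original: string,
  oldString: string,
  newString: string,
  contextLines: number = 2,
): EditResult {
  const start = original.indexOf(oldString);
  if (start === -1) throw new Error("old_string not found");
  if (original.indexOf(oldString, start + 1) !== -1) {
    throw new Error("old_string is not unique");
  }

  const updated =
    original.slice(0, start) + newString + original.slice(start + oldString.length);

  // Everything before `start` is unchanged, so a line index computed on the
  // original content is still valid in the updated content.
  const firstLine = original.slice(0, start).split("\n").length - 1;
  const lastLine = firstLine + newString.split("\n").length - 1;

  const lines = updated.split("\n");
  const from = Math.max(0, firstLine - contextLines);
  const to = Math.min(lines.length, lastLine + contextLines + 1);
  return { updated, snippet: lines.slice(from, to).join("\n") };
}

// "= 0" already appears on earlier lines, so searching the updated text for
// the new string would point at line 1 instead of the edited line 3.
const demo = applyEdit("let a = 0;\nlet b = 0;\nlet c = 1;\nlet d = 2;", "= 1", "= 0", 0);
console.log(demo.snippet);
```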
…oss tasks

The write infrastructure (memoryCore LanceDB with a repo field) existed but
nothing ever read it back: task outcomes were stored as one generic line and
never reached a worker prompt. Close the loop:

- src/memory/repoKnowledge.ts: recordTaskOutcome() stores successes as
  system_pattern (files changed + approach + iterations) and review
  rejections as constraint (pitfalls), scoped to the project path.
  recallRepoKnowledge() retrieves the top task-relevant memories.
  skipDistillation keeps the intended types (distillation was downgrading
  them to belief, which type-filtered recall then missed).
- pairPipeline.collectWorkerContext() recalls repo knowledge into
  WorkerContext.repoMemories; worker prompts render it as a
  "Repository Knowledge" section with pattern/pitfall tags (en + ko).
- autonomousRunner completed/rejected handlers record outcomes instead of
  the old one-line strategy memo.

All memory paths are non-blocking — the pipeline runs even if the memory
DB is unavailable.
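
The record/recall split can be sketched with an in-memory store. This is an illustration only: the real code writes to LanceDB, and its relevance ranking is not described here. `RepoMemory`, `outcomeToMemory`, `recall`, `renderRepoKnowledge`, the word-overlap score, and the heading markup are hypothetical stand-ins. Only the type names `system_pattern` and `constraint`, the "Repository Knowledge" title, and the pattern/pitfall tags come from the commit message.

```typescript
type MemoryType = "system_pattern" | "constraint";

interface RepoMemory {
  repo: string; // key the memory is scoped to
  type: MemoryType;
  text: string;
}

type TaskOutcome =
  | { kind: "success"; filesChanged: string[]; approach: string; iterations: number }
  | { kind: "rejected"; pitfall: string };

// Success becomes a system_pattern, a review rejection becomes a constraint.
function outcomeToMemory(repo: string, outcome: TaskOutcome): RepoMemory {
  if (outcome.kind === "success") {
    const files = outcome.filesChanged.join(", ");
    return {
      repo,
      type: "system_pattern",
      text: `Changed ${files} via ${outcome.approach} (${outcome.iterations} iterations)`,
    };
  }
  return { repo, type: "constraint", text: outcome.pitfall };
}

// Naive relevance (an assumption): number of task words found in the memory text.
function recall(store: RepoMemory[], repo: string, task: string, topK: number = 3): RepoMemory[] {
  const words = task.toLowerCase().split(/\W+/).filter((w) => w.length > 2);
  return store
    .filter((m) => m.repo === repo) // repo isolation
    .map((m) => {
      const text = m.text.toLowerCase();
      return { m, score: words.filter((w) => text.indexOf(w) !== -1).length };
    })
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((s) => s.m);
}

// Render recalled memories as a prompt section with pattern/pitfall tags.
function renderRepoKnowledge(memories: RepoMemory[]): string {
  if (memories.length === 0) return "";
  const lines = memories.map(
    (m) => `- [${m.type === "system_pattern" ? "pattern" : "pitfall"}] ${m.text}`,
  );
  return ["## Repository Knowledge"].concat(lines).join("\n");
}
```

Returning an empty string when nothing is recalled keeps the prompt unchanged for a repository with no history, which matches the non-blocking intent described above.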
…WE-bench harness

- benchmarks/tasks/codingTasks.ts + modelSelect.ts: 12 synthetic tasks
  (L0-L5) with deterministic grading (regex / test run / tsc); produces a
  model x level pass-rate table and a cost-efficiency Pareto ranking used
  to set the default worker/reviewer/planner routing.
- benchmarks/sweBench.ts: L6 = real GitHub issues (SWE-bench Lite). Solve
  with the OpenSwarm harness (openrouter adapter), grade with the official
  swebench harness in Docker. Supports hybrid mode: SWE_DIAG_MODEL runs a
  frontier read-only diagnosis stage, SWE_MODEL implements with the
  verification loop; SWE_DIAG_FILE reuses a saved diagnosis.
- benchmarks/RUBRIC.md: level definitions, recommended models per level,
  L6 grading procedure and pitfalls, and measured results.
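
The three environment variables suggest a small mode switch. The sketch below shows one plausible reading of how they combine. The `resolveSweMode` helper, the `SweRunPlan` shape, treating `SWE_MODEL` as required, and letting a saved diagnosis file take precedence over a diagnosis model are all assumptions, not the documented behaviour of benchmarks/sweBench.ts.

```typescript
type Diagnosis = { source: "model"; model: string } | { source: "file"; path: string };

type SweRunPlan =
  | { mode: "single"; implementer: string }
  | { mode: "hybrid"; implementer: string; diagnosis: Diagnosis };

// Decide between single-model and hybrid runs from the environment.
function resolveSweMode(env: Record<string, string | undefined>): SweRunPlan {
  const implementer = env["SWE_MODEL"];
  const diagModel = env["SWE_DIAG_MODEL"];
  const diagFile = env["SWE_DIAG_FILE"];

  if (!implementer) throw new Error("SWE_MODEL is required"); // assumption

  if (diagFile) {
    // Reuse a saved diagnosis instead of paying for a new frontier call (assumed precedence).
    return { mode: "hybrid", implementer, diagnosis: { source: "file", path: diagFile } };
  }
  if (diagModel) {
    // Frontier model diagnoses read-only; the implementer applies and verifies the fix.
    return { mode: "hybrid", implementer, diagnosis: { source: "model", model: diagModel } };
  }
  return { mode: "single", implementer };
}

// Placeholder model id for the implementer; gpt-5 is the diagnosis model named in this PR.
console.log(resolveSweMode({ SWE_MODEL: "example/light-model", SWE_DIAG_MODEL: "gpt-5" }));
```

In a real script the argument would typically be `process.env`.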

Headline result: hybrid (gpt-5 diagnosis + lightweight implementer) resolved
3/3 attempted SWE-bench Lite instances (pylint 7080/5859/7993) where every
single-model lightweight run had failed — including a re-diagnosis escalate
loop on 7993 where the frontier found the bug in its own first fix plan from
the failing patch + test output. Evidence under benchmarks/results/.
…ample configs

- repoMetadata.ts (+tests): per-repo openswarm.json for explicit Linear
  project mapping, consumed by projectMapper (+tests).
- Web dashboard/chat backend/TUI updates and event hub/service plumbing.
- .env.example / config.example.yaml refreshed for the OpenRouter adapter
  and model routing defaults.
- .gitignore: exclude local experiments (testing/), SWE-bench evaluation
  logs, and root grading report artifacts.
loadConfig now disables Discord/Linear integration (deletes the block)
when credentials are missing instead of rejecting the whole config.
Update the two tests that still expected a validation error.
unohee force-pushed the feat/lmstudio-adapter branch from a596bd7 to f36c440 on June 10, 2026 15:42
unohee added 2 commits June 11, 2026 00:54
- agenticLoop: count only successful edit/write calls toward the no-edit
  guard — a model whose edits all fail (old_string not found, protected
  file) was counted as having edited and slipped past the guard.
- gpt/local adapters: forward nudgeMaxOnNoEdit / protectedFiles /
  bashTimeoutMs to the agentic loop (previously silently dropped, so the
  guards only worked on the openrouter adapter).
- tools edit_file: locate the result snippet via old_string's position in
  the original content (guaranteed unique) — indexOf(new_string) on the
  updated text could match an earlier occurrence and show the wrong region.
- tools bash: extract DEFAULT_BASH_TIMEOUT_MS so the exec call and the
  TIMEOUT message cannot drift.
- repoKnowledge: normalize the repo key via realpath before write and
  recall — trailing slashes or symlinked paths to the same repo would
  split knowledge across keys that never match.
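
The first fix in this list (counting only successful edits) can be sketched as follows. The `ToolCallRecord` shape, the helper names, the `write_file` tool name, and what happens once the nudge budget is spent are assumptions. Only `nudgeMaxOnNoEdit`, `edit_file`, and the rule itself come from the commit message.

```typescript
// Hypothetical record of one tool call made during the loop.
interface ToolCallRecord {
  tool: string; // e.g. "edit_file"; "write_file" is an assumed name
  ok: boolean;  // false when the tool rejected the call
}

const EDIT_TOOLS = ["edit_file", "write_file"];

// Only edits that actually landed count. A failed edit (old_string not found,
// protected file) must not let the model slip past the guard.
function countSuccessfulEdits(calls: ToolCallRecord[]): number {
  return calls.filter((c) => c.ok && EDIT_TOOLS.indexOf(c.tool) !== -1).length;
}

type FinishDecision = "accept" | "nudge" | "accept_without_edit";

// Decide what to do when the model tries to finish an edit-required task.
function onFinishAttempt(
  calls: ToolCallRecord[],
  nudgesSent: number,
  nudgeMaxOnNoEdit: number,
): FinishDecision {
  if (countSuccessfulEdits(calls) > 0) return "accept";
  // Push back until the nudge budget is spent, then stop insisting (assumed behaviour).
  return nudgesSent < nudgeMaxOnNoEdit ? "nudge" : "accept_without_edit";
}

const failedOnly: ToolCallRecord[] = [
  { tool: "read_file", ok: true },
  { tool: "edit_file", ok: false }, // rejected edit
];
console.log(onFinishAttempt(failedOnly, 0, 2));
```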
unohee merged commit e9c3eba into main on Jun 10, 2026
9 checks passed
unohee deleted the feat/lmstudio-adapter branch on June 10, 2026 16:01