adapters/claude-code: use CC skill system for inject; split skill telemetry fields (#14)#20
adapters/claude-code: use CC skill system for inject; split skill telemetry fields (#14)#20lec77 wants to merge 13 commits into
Conversation
Splits the conflated skillLoaded field into: - skillProvided: structural — was the skill placed in agent's context? - skillObserved: behavioral — visible engagement evidence (diagnostic only). - skillMode: which mode (inject/discover) the adapter ran in. skillLoaded stays as a deprecated alias for one release for on-disk evidence file back-compat. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nMeta Mirrors the RunResult schema split from the previous commit: RunMeta gains skillProvided / skillObserved / skillMode; buildRunMeta forwards them from RunResult. skillLoaded stays as a deprecated alias. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The evidence-summary WARNING that drives optimizer abstain decisions now distinguishes inject-mode infra failures (skill never reached the model) from discover-mode non-loads (description mismatch / model didn't engage). Legacy evidence files without skillMode get the original wording for back-compat. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d helpers The old detectSkillDiscover conflated two qualitatively distinct signals: the structural one (did CC's system.init event list the skill?) and the behavioral one (did the model invoke the Skill tool?). Splitting them lets downstream code report the right value to skillProvided vs skillObserved. detectSkillDiscover stays as a deprecated back-compat shim that ORs the two new helpers; it goes away with the SKILL_INJECT_SENTINEL machinery in the next commit. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue SJTU-IPADS#14: the previous inject path dumped the skill content into CC's system prompt via --append-system-prompt, bypassing CC's native skill system entirely — the skill never appeared in init event's skills[], the model couldn't invoke it via the Skill tool, and the only way to detect 'load' was a sentinel-echo heuristic that systematically failed on thinking-mode models. New inject path: - Writes the skill to .claude/skills/<name>/SKILL.md (same as discover). - Adds a hard MUST directive via --append-system-prompt instructing the model to invoke the Skill tool with the skill name before any other action. - Uses CC's system.init event as the structural skillProvided signal (model-independent, robust against thinking-mode). - Uses Skill tool_use as the behavioral skillObserved signal. The artifact-building logic is factored into two new exports (buildClaudeCodeInjectArtifacts, buildClaudeCodeDiscoverArtifacts) so the on-disk effects can be unit-tested without spawning CC. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These were heuristics for the old text-append inject path that Task 5 replaced with CC's native skill system. Without that path nothing emits the sentinel, nothing reads it, and the heuristic only ever produced false negatives on thinking-mode models. Also drops the deprecated detectSkillDiscover shim — its two callers were always going to be replaced by the split provided/observed helpers and the inject rewrite removed the last one. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the inject/register step and populates skillObserved from behavioral signals. In inject mode the skill content lands in the system prompt immediately so skillProvided flips to true at that write site. In discover mode the skill is only spliced into context on a matching <load-skill> fire, so the protocol callback raises BOTH skillProvided (structural — content is about to land in context) and skillObserved (behavioral — agent named the skill in the registry); the once-only gate is renamed discoverSkillLoaded → discoverSkillProvided to match. The deprecated skillLoaded alias mirrors skillProvided for one release. Tests gain parallel skillProvided/skillObserved/skillMode assertions next to existing skillLoaded ones. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the inject/register step (prompt concat for inject mode, SKILL.md write + -s flag for discover mode). Drops the 'steps > 0 implies skill was loaded' inference — that heuristic conflated 'agent ran' with 'skill was used' and was the exact bug class motivating the field split. The snippet-match scan now populates skillObserved instead of overloading skillLoaded. The deprecated skillLoaded alias mirrors skillProvided for one release. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the prompt-concat step. jiuwenclaw uses prompt prepend for both inject and discover modes (no well-known CLI skill path), so skillMode is still surfaced for downstream consumers but the actual delivery path is identical. Drops the 'steps > 0 implies skill was used' inference; behavioral evidence (snippet echo in response text) now populates skillObserved instead of overloading skillLoaded. Deprecated skillLoaded mirrors skillProvided for one release. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the BOOTSTRAP.md / SKILL.md write sites. Drops the 'hasAssistantMessage implies skill loaded' inference in inject mode — that heuristic conflated 'agent ran' with 'skill was used'. Behavioral evidence (skill/load_skill tool call, SKILL.md path in assistant text, or skill snippet echoed in a tool result) now lands in skillObserved rather than overloading skillLoaded. The deprecated skillLoaded alias mirrors skillProvided for one release. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the CONTEXT.md (inject) and SKILL.md (discover) write sites. Drops the inject-mode 'step_finish + CONTEXT.md exists' check — both halves were structural and already covered by setting skillProvided at the write step. The skill-tool call (discover) and text-snippet echo (both modes) now populate skillObserved as behavioral evidence. The deprecated skillLoaded alias mirrors skillProvided for one release. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the AGENTS.md (inject) and SKILL.md (discover) write sites. Drops the 'steps > 0 implies skill was used' inference. Snippet echo in assistant text now populates skillObserved as behavioral evidence rather than overloading skillLoaded. The deprecated skillLoaded alias mirrors skillProvided for one release. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Propagates the RunResult field split into bench reports. The deprecated skillLoaded alias stays so existing JSON bench reports keep parsing. Refs SJTU-IPADS#14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@YYangZiXin Hi! This PR attempts to fix #14, but takes a slightly different approach than what you proposed in the issue — would appreciate your help verifying it. You suggested putting The PR reads Would you be willing to run the PR branch against your original failing setup (the GLM / DeepSeek thinking mode environment from #14)? Two things to confirm:
git fetch origin pull/20/head:pr-20
git checkout pr-20
# then run the same jit-optimize command that originally hit infra-blockedThanks! |
Linked issue
Closes #14.
Summary
The claude-code adapter's inject mode bypassed CC's native skill system entirely — it dumped the skill content as generic text through
--append-system-prompt. CC never registered it as a skill (no init-event listing, no Skill tool dispatch), and the only way SkVM could tell whether the skill was "loaded" was a sentinel-echo heuristic that fails systematically on thinking-mode models. The downstream effect was that DeepSeek-thinking / GLM-thinking jit-optimize runs reportedskillLoaded=false, the optimizer abstained at round 1 withinfraBlocked, and the entire skill-iteration loop died before producing a single proposal.This PR rewires inject to use CC's actual skill system (
.claude/skills/<name>/SKILL.md+ a hard MUST directive forcingSkilltool invocation), and splits the conflatedRunResult.skillLoadedfield into two semantically distinct signals across all 7 adapters:skillProvided(structural — did we get the skill into the agent's context?) andskillObserved(behavioral — did we see the agent engage?). The mode-aware abstain gate injit-optimize/workspace.tsnow only treats inject-modeskillProvided=falseas an infra failure; discover-mode non-loads are merely surfaced as a note.Changes
Schema & data plumbing (Tasks 1–2):
core/types.ts:RunResultSchemagainsskillProvided?,skillObserved?,skillMode?: "inject" | "discover".skillLoadedretained for one release as a@deprecatedmirror so on-disk~/.skvm/proposals/evidence files keep parsing.jit-optimize/types.ts+jit-optimize/evidence.ts: same three fields land onRunMeta;buildRunMetaforwards them and mirrorsskillProvided → skillLoaded(one-way — legacyskillLoaded-only evidence is left alone so older runs aren't retroactively given askillProvidedvalue we can't actually substantiate).Optimizer gating (Task 3):
jit-optimize/workspace.ts:536— the WARNING line that drives optimizer abstain decisions is now mode-aware:inject + provided=false→WARNING: skill failed to load (inject mode — likely infra issue)(drives abstain as before)discover + provided=false→WARNING: skill not recognized by harness (discover mode — may be description mismatch)(no abstain — discover legitimately allows the model to skip loading)skillMode→ originalWARNING: skill was not loadedwording preserved.skillObservedis intentionally NOT surfaced in evidence summaries — it would produce systematic false negatives on thinking-mode models.claude-code adapter rewrite (Tasks 4–6):
buildClaudeCodeInjectArtifacts/buildClaudeCodeDiscoverArtifactsfactor the on-disk skill-file work out ofrun()so it's unit-testable without spawning CC.<workDir>/.claude/skills/<name>/SKILL.md(same path as discover) AND sets--append-system-promptto a hard MUST directive:You have access to a project skill named `<name>` (<desc>). For this task you MUST invoke the Skill tool with name="<name>" before any other action…skillProvidedis now read from CC'ssystem.initevent'sskills[]array — a structural, model-independent signal that resolves the thinking-mode false-negative cleanly.skillObservedis read fromSkill/skilltool_use events with matchinginput.name.detectSkillDiscoveris split intodetectSkillProvided_Discover(init-event) anddetectSkillObserved_Discover(Skill tool-use). The oldSKILL_INJECT_SENTINEL/injectedSystemPrompt/detectSkillInjectmachinery is deleted.Peer adapters — field alignment (Tasks 7–12, one commit each):
bare-agent,opencode,pi,openclaw,hermes,jiuwenclawkeep their existing inject mechanisms (system-message concat / CONTEXT.md / AGENTS.md / BOOTSTRAP.md / prompt prepend). Each harness's native auto-load file IS the documented inject path; only claude-code had the wrong-mechanism bug.skillLoadedplumbing toskillProvided+skillObserved, setskillMode, and write the deprecatedskillLoadedmirror. The "agent produced steps → assume skill was used" inference is dropped —skillProvided=trueis now set structurally at the inject step site, so the heuristic is redundant. Snippet-match and skill-tool-call detections becomeskillObserved.bare-agent's<load-skill>protocol is the one place where structural and behavioral fire simultaneously — both fields go true together.Bench report wiring (Task 13):
bench/types.tsConditionResultexposes the two new fields;bench/conditions.tstoConditionResultpropagates them.skillLoadeddeprecated alias retained for older JSON reports.Test plan
bunx tsc --noEmit— cleanbun test— 793 pass / 2 skip / 0 fail (was 777 onorigin/main; net +16 tests, no regressions)buildRunMetaforwarding (including legacy-onlyskillLoadedback-compat), mode-aware workspace WARNING wordings, the two new claude-code detection helpers, claude-code inject artifact builders (writes the skill file + emits MUST directive), thinking-mode-safeskillProvideddetection from init events alone, bare-agent parallel-assertion coverage..claude/skills/+--append-system-promptdirective combination at runtime.Notes for reviewers
A few design choices worth double-checking:
skillModesemantics retained. Theinject/discoverbinary stays — but the meaning is now strictly "loading bias" (forced vs discoverable), independent of delivery mechanism. Each adapter chooses the best mechanism its harness offers. For claude-code, both modes now write to.claude/skills/; the only difference is whether a MUST directive accompanies it. For the other adapters, the existing per-harness inject path (CONTEXT.md / AGENTS.md / etc.) is unchanged because those harnesses don't expose a richer skill subsystem than their auto-load file.Why a system prompt directive and not
/skill <name>in the user prompt. The original issue suggested putting/skill <name>in the prompt to trigger the skill./skillis interactive slash-command syntax — its behavior inclaude -pnon-interactive mode is undocumented and would have to be verified empirically. System-prompt directives are the documented headless path; they also let us keep the inject/discover semantic distinction crisp (directive is the only difference between the two modes' artifacts).Forward-only mirror for the deprecated
skillLoaded.buildRunMetamirrorsskillProvided → skillLoadedbut never the reverse: if an older adapter writes onlyskillLoaded, the newskillProvidedon the output staysundefined. Rationale: the oldskillLoadedfield merged structural and behavioral semantics, so retroactively claimingskillProvided=truefrom a legacy record would be lying. Consumers (workspace gate) useskillProvided ?? skillLoadedfor explicit fallback.Dropped the "agent ran → skill was used" inference in peer adapters. Tests like
if (result.steps.length > 0) skillLoaded = truewere always weak — they don't tell you anything about the skill, just that the agent produced output. WithskillProvidednow set structurally at the inject step (Bun.write(...)succeeded), the inference is redundant. Snippet-echo and skill-tool-call detections survive asskillObservedsignals (diagnostic only).skillObservedis intentionally NOT surfaced in optimizer evidence summaries. It would systematically false-negative on thinking-mode models (no echo to detect) and create noise. Optimizer code that wants the signal can readrunMeta.skillObserveddirectly.PR scope vs. broader inject redesign. A wider redesign — three modes (
forced/discoverable/none) or richer per-harness skill mechanisms — was considered during brainstorming and intentionally deferred. This PR fixes the specific bug in issue [bug] Claude Code adapter inject模式实现Bug报告 #14 and the field-semantics conflation that amplified it. Larger architectural changes can land as a follow-up if useful.Checklist
<scope>: <summary>style~/.skvm/proposals/schema is back-compat via the deprecatedskillLoadedalias)