Skip to content

adapters/claude-code: use CC skill system for inject; split skill telemetry fields (#14)#20

Open
lec77 wants to merge 13 commits into
SJTU-IPADS:mainfrom
lec77:fix/skill-telemetry-claude-code-inject
Open

adapters/claude-code: use CC skill system for inject; split skill telemetry fields (#14)#20
lec77 wants to merge 13 commits into
SJTU-IPADS:mainfrom
lec77:fix/skill-telemetry-claude-code-inject

Conversation

@lec77

@lec77 lec77 commented May 14, 2026

Copy link
Copy Markdown
Collaborator

Linked issue

Closes #14.

Summary

The claude-code adapter's inject mode bypassed CC's native skill system entirely — it dumped the skill content as generic text through --append-system-prompt. CC never registered it as a skill (no init-event listing, no Skill tool dispatch), and the only way SkVM could tell whether the skill was "loaded" was a sentinel-echo heuristic that fails systematically on thinking-mode models. The downstream effect was that DeepSeek-thinking / GLM-thinking jit-optimize runs reported skillLoaded=false, the optimizer abstained at round 1 with infraBlocked, and the entire skill-iteration loop died before producing a single proposal.

This PR rewires inject to use CC's actual skill system (.claude/skills/<name>/SKILL.md + a hard MUST directive forcing Skill tool invocation), and splits the conflated RunResult.skillLoaded field into two semantically distinct signals across all 7 adapters: skillProvided (structural — did we get the skill into the agent's context?) and skillObserved (behavioral — did we see the agent engage?). The mode-aware abstain gate in jit-optimize/workspace.ts now only treats inject-mode skillProvided=false as an infra failure; discover-mode non-loads are merely surfaced as a note.

Changes

Schema & data plumbing (Tasks 1–2):

  • core/types.ts: RunResultSchema gains skillProvided?, skillObserved?, skillMode?: "inject" | "discover". skillLoaded retained for one release as a @deprecated mirror so on-disk ~/.skvm/proposals/ evidence files keep parsing.
  • jit-optimize/types.ts + jit-optimize/evidence.ts: same three fields land on RunMeta; buildRunMeta forwards them and mirrors skillProvided → skillLoaded (one-way — legacy skillLoaded-only evidence is left alone so older runs aren't retroactively given a skillProvided value we can't actually substantiate).

Optimizer gating (Task 3):

  • jit-optimize/workspace.ts:536 — the WARNING line that drives optimizer abstain decisions is now mode-aware:
    • inject + provided=falseWARNING: skill failed to load (inject mode — likely infra issue) (drives abstain as before)
    • discover + provided=falseWARNING: skill not recognized by harness (discover mode — may be description mismatch) (no abstain — discover legitimately allows the model to skip loading)
    • Legacy evidence without skillMode → original WARNING: skill was not loaded wording preserved.
  • skillObserved is intentionally NOT surfaced in evidence summaries — it would produce systematic false negatives on thinking-mode models.

claude-code adapter rewrite (Tasks 4–6):

  • New buildClaudeCodeInjectArtifacts / buildClaudeCodeDiscoverArtifacts factor the on-disk skill-file work out of run() so it's unit-testable without spawning CC.
  • Inject path now writes <workDir>/.claude/skills/<name>/SKILL.md (same path as discover) AND sets --append-system-prompt to a hard MUST directive: You have access to a project skill named `<name>` (<desc>). For this task you MUST invoke the Skill tool with name="<name>" before any other action…
  • skillProvided is now read from CC's system.init event's skills[] array — a structural, model-independent signal that resolves the thinking-mode false-negative cleanly. skillObserved is read from Skill/skill tool_use events with matching input.name.
  • detectSkillDiscover is split into detectSkillProvided_Discover (init-event) and detectSkillObserved_Discover (Skill tool-use). The old SKILL_INJECT_SENTINEL / injectedSystemPrompt / detectSkillInject machinery is deleted.

Peer adapters — field alignment (Tasks 7–12, one commit each):

  • bare-agent, opencode, pi, openclaw, hermes, jiuwenclaw keep their existing inject mechanisms (system-message concat / CONTEXT.md / AGENTS.md / BOOTSTRAP.md / prompt prepend). Each harness's native auto-load file IS the documented inject path; only claude-code had the wrong-mechanism bug.
  • All six rewire their skillLoaded plumbing to skillProvided + skillObserved, set skillMode, and write the deprecated skillLoaded mirror. The "agent produced steps → assume skill was used" inference is dropped — skillProvided=true is now set structurally at the inject step site, so the heuristic is redundant. Snippet-match and skill-tool-call detections become skillObserved.
  • bare-agent's <load-skill> protocol is the one place where structural and behavioral fire simultaneously — both fields go true together.

Bench report wiring (Task 13):

  • bench/types.ts ConditionResult exposes the two new fields; bench/conditions.ts toConditionResult propagates them. skillLoaded deprecated alias retained for older JSON reports.

Test plan

  • bunx tsc --noEmit — clean
  • bun test — 793 pass / 2 skip / 0 fail (was 777 on origin/main; net +16 tests, no regressions)
  • New tests added for: schema field shape, buildRunMeta forwarding (including legacy-only skillLoaded back-compat), mode-aware workspace WARNING wordings, the two new claude-code detection helpers, claude-code inject artifact builders (writes the skill file + emits MUST directive), thinking-mode-safe skillProvided detection from init events alone, bare-agent parallel-assertion coverage.
  • Live smoke test against a real model — not yet run. Reviewer or CI can validate against DeepSeek-thinking / GLM-thinking once landed. The integration tests cover the structural contract; the live test is only needed to confirm CC's real CLI accepts the .claude/skills/ + --append-system-prompt directive combination at runtime.

Notes for reviewers

A few design choices worth double-checking:

  1. skillMode semantics retained. The inject / discover binary stays — but the meaning is now strictly "loading bias" (forced vs discoverable), independent of delivery mechanism. Each adapter chooses the best mechanism its harness offers. For claude-code, both modes now write to .claude/skills/; the only difference is whether a MUST directive accompanies it. For the other adapters, the existing per-harness inject path (CONTEXT.md / AGENTS.md / etc.) is unchanged because those harnesses don't expose a richer skill subsystem than their auto-load file.

  2. Why a system prompt directive and not /skill <name> in the user prompt. The original issue suggested putting /skill <name> in the prompt to trigger the skill. /skill is interactive slash-command syntax — its behavior in claude -p non-interactive mode is undocumented and would have to be verified empirically. System-prompt directives are the documented headless path; they also let us keep the inject/discover semantic distinction crisp (directive is the only difference between the two modes' artifacts).

  3. Forward-only mirror for the deprecated skillLoaded. buildRunMeta mirrors skillProvided → skillLoaded but never the reverse: if an older adapter writes only skillLoaded, the new skillProvided on the output stays undefined. Rationale: the old skillLoaded field merged structural and behavioral semantics, so retroactively claiming skillProvided=true from a legacy record would be lying. Consumers (workspace gate) use skillProvided ?? skillLoaded for explicit fallback.

  4. Dropped the "agent ran → skill was used" inference in peer adapters. Tests like if (result.steps.length > 0) skillLoaded = true were always weak — they don't tell you anything about the skill, just that the agent produced output. With skillProvided now set structurally at the inject step (Bun.write(...) succeeded), the inference is redundant. Snippet-echo and skill-tool-call detections survive as skillObserved signals (diagnostic only).

  5. skillObserved is intentionally NOT surfaced in optimizer evidence summaries. It would systematically false-negative on thinking-mode models (no echo to detect) and create noise. Optimizer code that wants the signal can read runMeta.skillObserved directly.

  6. PR scope vs. broader inject redesign. A wider redesign — three modes (forced / discoverable / none) or richer per-harness skill mechanisms — was considered during brainstorming and intentionally deferred. This PR fixes the specific bug in issue [bug] Claude Code adapter inject模式实现Bug报告 #14 and the field-semantics conflation that amplified it. Larger architectural changes can land as a follow-up if useful.

Checklist

  • Linked an issue ([bug] Claude Code adapter inject模式实现Bug报告 #14)
  • Commit messages follow <scope>: <summary> style
  • Added or updated tests where appropriate
  • No user-visible CLI flags / on-disk paths changed — no README/docs update needed (the ~/.skvm/proposals/ schema is back-compat via the deprecated skillLoaded alias)
  • No unrelated changes mixed in (the 13 commits all serve this one logical change; squash if you prefer fewer atomic commits)

lec77 and others added 13 commits May 14, 2026 01:04
Splits the conflated skillLoaded field into:
  - skillProvided: structural — was the skill placed in agent's context?
  - skillObserved: behavioral — visible engagement evidence (diagnostic only).
  - skillMode: which mode (inject/discover) the adapter ran in.

skillLoaded stays as a deprecated alias for one release for on-disk
evidence file back-compat. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nMeta

Mirrors the RunResult schema split from the previous commit: RunMeta gains
skillProvided / skillObserved / skillMode; buildRunMeta forwards them from
RunResult. skillLoaded stays as a deprecated alias. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The evidence-summary WARNING that drives optimizer abstain decisions now
distinguishes inject-mode infra failures (skill never reached the model)
from discover-mode non-loads (description mismatch / model didn't engage).
Legacy evidence files without skillMode get the original wording for
back-compat. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d helpers

The old detectSkillDiscover conflated two qualitatively distinct signals:
the structural one (did CC's system.init event list the skill?) and the
behavioral one (did the model invoke the Skill tool?). Splitting them lets
downstream code report the right value to skillProvided vs skillObserved.

detectSkillDiscover stays as a deprecated back-compat shim that ORs the
two new helpers; it goes away with the SKILL_INJECT_SENTINEL machinery
in the next commit. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue SJTU-IPADS#14: the previous inject path dumped the skill content into CC's
system prompt via --append-system-prompt, bypassing CC's native skill
system entirely — the skill never appeared in init event's skills[],
the model couldn't invoke it via the Skill tool, and the only way to
detect 'load' was a sentinel-echo heuristic that systematically failed
on thinking-mode models.

New inject path:
  - Writes the skill to .claude/skills/<name>/SKILL.md (same as discover).
  - Adds a hard MUST directive via --append-system-prompt instructing the
    model to invoke the Skill tool with the skill name before any other
    action.
  - Uses CC's system.init event as the structural skillProvided signal
    (model-independent, robust against thinking-mode).
  - Uses Skill tool_use as the behavioral skillObserved signal.

The artifact-building logic is factored into two new exports
(buildClaudeCodeInjectArtifacts, buildClaudeCodeDiscoverArtifacts) so the
on-disk effects can be unit-tested without spawning CC. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These were heuristics for the old text-append inject path that Task 5
replaced with CC's native skill system. Without that path nothing emits
the sentinel, nothing reads it, and the heuristic only ever produced
false negatives on thinking-mode models.

Also drops the deprecated detectSkillDiscover shim — its two callers
were always going to be replaced by the split provided/observed helpers
and the inject rewrite removed the last one. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the inject/register step and
populates skillObserved from behavioral signals. In inject mode the
skill content lands in the system prompt immediately so skillProvided
flips to true at that write site. In discover mode the skill is only
spliced into context on a matching <load-skill> fire, so the protocol
callback raises BOTH skillProvided (structural — content is about to
land in context) and skillObserved (behavioral — agent named the skill
in the registry); the once-only gate is renamed
discoverSkillLoaded → discoverSkillProvided to match.

The deprecated skillLoaded alias mirrors skillProvided for one release.
Tests gain parallel skillProvided/skillObserved/skillMode assertions
next to existing skillLoaded ones. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the inject/register step (prompt
concat for inject mode, SKILL.md write + -s flag for discover mode).
Drops the 'steps > 0 implies skill was loaded' inference — that
heuristic conflated 'agent ran' with 'skill was used' and was the
exact bug class motivating the field split. The snippet-match scan
now populates skillObserved instead of overloading skillLoaded.

The deprecated skillLoaded alias mirrors skillProvided for one
release. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the prompt-concat step. jiuwenclaw
uses prompt prepend for both inject and discover modes (no well-known
CLI skill path), so skillMode is still surfaced for downstream
consumers but the actual delivery path is identical.

Drops the 'steps > 0 implies skill was used' inference; behavioral
evidence (snippet echo in response text) now populates skillObserved
instead of overloading skillLoaded. Deprecated skillLoaded mirrors
skillProvided for one release. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the BOOTSTRAP.md / SKILL.md write
sites. Drops the 'hasAssistantMessage implies skill loaded' inference
in inject mode — that heuristic conflated 'agent ran' with 'skill was
used'. Behavioral evidence (skill/load_skill tool call, SKILL.md path
in assistant text, or skill snippet echoed in a tool result) now lands
in skillObserved rather than overloading skillLoaded.

The deprecated skillLoaded alias mirrors skillProvided for one
release. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the CONTEXT.md (inject) and
SKILL.md (discover) write sites. Drops the inject-mode
'step_finish + CONTEXT.md exists' check — both halves were structural
and already covered by setting skillProvided at the write step. The
skill-tool call (discover) and text-snippet echo (both modes) now
populate skillObserved as behavioral evidence.

The deprecated skillLoaded alias mirrors skillProvided for one
release. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets skillProvided structurally at the AGENTS.md (inject) and
SKILL.md (discover) write sites. Drops the 'steps > 0 implies skill
was used' inference. Snippet echo in assistant text now populates
skillObserved as behavioral evidence rather than overloading
skillLoaded.

The deprecated skillLoaded alias mirrors skillProvided for one
release. Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Propagates the RunResult field split into bench reports. The deprecated
skillLoaded alias stays so existing JSON bench reports keep parsing.
Refs SJTU-IPADS#14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lec77

lec77 commented May 15, 2026

Copy link
Copy Markdown
Collaborator Author

@YYangZiXin Hi! This PR attempts to fix #14, but takes a slightly different approach than what you proposed in the issue — would appreciate your help verifying it.

You suggested putting /skill <name> in the user prompt. The PR instead writes .claude/skills/<name>/SKILL.md and uses --append-system-prompt to add a MUST directive ("You MUST invoke the Skill tool with name=\"<name>\""). Reasoning: /skill is interactive slash-command syntax, and its behavior in claude -p non-interactive mode is undocumented — we didn't want to bet the fix on it.

The PR reads skillProvided from CC's system.init event's skills[] array, assuming CC will auto-discover workDir-scoped .claude/skills/<name>/SKILL.md and list it at init time.

Would you be willing to run the PR branch against your original failing setup (the GLM / DeepSeek thinking mode environment from #14)? Two things to confirm:

  1. skillProvided=true appears in the evidence (CC actually recognized the skill)
  2. jit-optimize proceeds into Round 1 instead of optimizer abstained (infraBlocked)
git fetch origin pull/20/head:pr-20
git checkout pr-20
# then run the same jit-optimize command that originally hit infra-blocked

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] Claude Code adapter inject模式实现Bug报告

1 participant